Relationship scraper guide

Building a relationship scraper (text → enrichment JSON)

[!TIP] Other docs: docs/README.md · Enrichment contract · What is Cortex

A practical guide: turn transcripts or article text into Cortex-ready { "relationships": [ … ] } JSON — only the relationships array (no video or article root) — without writing to the database from the extractor itself.

This page is self-contained: context, jobs and workers, the relationship scraper, and a full list of Mongo collection fields.

What is Cortex?

Cortex is a content intelligence stack: structured documents live in MongoDB (companies, people, videos, articles, tokens, taxonomies, and more). A parallel Neo4j graph stores how those entities connect—relationship types are fixed in code as EDGE_TYPES (see src/graph/schema.ts). Rows that participate in the graph carry a mongoId bridge so graph nodes line up with Mongo documents.

The app exposes REST APIs under /api/v1/…: read collections by slug or id, admin writes, enrichment worker endpoints, and graph helpers. Classification uses taxonomy terms in Mongo (per taxonomy); many graph edges (e.g. CLASSIFIED_AS) point at taxonomy-terms nodes in Neo4j, not ad-hoc string tags on documents.

Rule of thumb. Mongo answers “what is this record?” Neo4j answers “what is it linked to, and how?” A relationship scraper reads text and proposes links in that graph vocabulary—it does not replace either database.

What you will build later

In the Relationship scraper tab, you will define a tool that turns a transcript or article body into a JSON object that contains only relationships (no subject document block). Importers merge that list with a video, article, or other subject id they already have, then create rows and graph edges—separate from the core Cortex servers unless you wire them yourself.

Jobs → workers → documents → relationship scraper

Enrichment jobs are Mongo documents in the enrichment-jobs collection. Each job references a piece of content by contentType (e.g. company, person) and contentId, and moves through pending → processing → complete (or failed). Workers (or HTTP clients) claim pending jobs, load the target document, run enrichment logic, then mark the job finished and persist extracted fields or graph payloads.

Today, dedicated code paths exist for company and person enrichment (src/lib/enrichment/company/worker.ts, src/lib/enrichment/person/worker.ts), with APIs under /api/v1/enrichment/workers/…. Other content types can follow the same pattern: one job row, one worker specialized for that document shape.

A relationship scraper is different in one important way: its input is usually plain text (transcript, article HTML stripped to text), not a single Mongo id. It should output only a relationships list (see the Relationship scraper tab). Downstream, you merge that list with a subject video or article id, enqueue a job for review, or call the graph API after ids exist. Think of it as an upstream or sidecar step—not necessarily the same queue as company/person workers unless you integrate it.

In the pipeline diagram, solid lines show the job queue and typed workers; dashed lines show the text → JSON → importer path, which is often a separate pipeline that feeds Cortex once you map text to a subject video or article.

What you’re building. A CLI, service, or notebook that reads plain text from stdin (or a file) and writes to stdout (or a file) one JSON object with a single top-level key, relationships, and nothing else (no video, article, or other subject payload). It may call Cortex GET to resolve slugs → mongoId. It does not POST documents or graph edges.
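A minimal Node sketch of that shape (assuming Node 18+; extract() here is a placeholder for your detection and mapping pipeline, not repo code):

```typescript
// Minimal stdin → stdout skeleton for the extractor CLI. Read all text, run an
// extract() you supply, print one JSON object whose only top-level key is
// "relationships". No POSTs, no database writes.

async function readStdin(): Promise<string> {
  const chunks: Buffer[] = [];
  for await (const chunk of process.stdin) chunks.push(chunk as Buffer);
  return Buffer.concat(chunks).toString("utf8");
}

// Placeholder extractor — substitute your detection + mapping pipeline.
function extract(text: string): { relationships: unknown[] } {
  void text;
  return { relationships: [] };
}

async function main(): Promise<void> {
  const text = await readStdin();
  process.stdout.write(JSON.stringify(extract(text)) + "\n");
}
```

Invoke main() from your entry point; keeping it a plain function makes the extractor easy to test without piping real stdin.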

Flow: input text to relationship list

Plain text goes through detection and mapping; the only remote calls your extractor should make are GET (optional slug resolution). The artifact is only that list: an object { "relationships": [ … ] } — no subject metadata (title, slug, transcript) in the same output; the pipeline that owns the video or article attaches those separately.

Example: snippet in, relationships out

Input (fictional transcript):

HOST: Today we’re unpacking how Circle issues USDC and what it means for payments.
Later we’ll touch Base as the chain they’re leaning on.

Output (excerpt — only relationships; mongoId only if GET found a row):

{
  "relationships": [
    {
      "type": "MENTIONS",
      "document": { "type": "companies", "slug": "circle", "mongoId": "674a…" },
      "properties": { "snippet": "Circle issues USDC" }
    },
    {
      "type": "MENTIONS",
      "document": { "type": "tokens", "slug": "usdc" },
      "properties": { "snippet": "issues USDC" }
    },
    {
      "type": "MENTIONS",
      "document": { "type": "blockchains", "slug": "base" },
      "properties": { "snippet": "touch Base as the chain" }
    }
  ]
}

1. Read the contracts

  • Edge names — Only use strings from EDGE_TYPES in src/graph/schema.ts. Not every edge applies to every node pair.
  • Collection keys — relationships[].document.type uses API kebab-case keys (companies, people, tokens, taxonomy-terms, …). Graph-backed collections are listed in src/lib/graph/mongo-neo4j-mapping.ts (GRAPH_BACKED_COLLECTIONS).
  • Output shape — Root object with only relationships (array). Each item has type (an EDGE_TYPES string), document with type, slug, optional mongoId, and optional properties (e.g. snippet, confidence). Do not emit video, article, or other subject keys here—the importer binds this list to the subject document.
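As a TypeScript sketch of that contract (the EdgeType union below is a small placeholder subset; the canonical list is EDGE_TYPES in src/graph/schema.ts):

```typescript
// Illustrative types for the scraper's output contract, following this guide.
// EdgeType is a placeholder subset — the real list lives in src/graph/schema.ts.

type EdgeType = "MENTIONS" | "FEATURES" | "CLASSIFIED_AS" | "IN_PLAYLIST";

interface RelationshipTarget {
  type: string;     // kebab-case collection key, e.g. "companies", "taxonomy-terms"
  slug: string;     // canonical slug in Mongo
  mongoId?: string; // set only when a GET lookup found an existing row
}

interface Relationship {
  type: EdgeType;
  document: RelationshipTarget;
  properties?: { snippet?: string; confidence?: number };
}

// The whole artifact: one object, one top-level key.
interface ScraperOutput {
  relationships: Relationship[];
}

const example: ScraperOutput = {
  relationships: [
    {
      type: "MENTIONS",
      document: { type: "tokens", slug: "usdc" },
      properties: { snippet: "issues USDC" },
    },
  ],
};
```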

2. Recommended pipeline

  1. Ingest text — Normalize encoding (UTF-8), strip HTML if needed, optional sentence/paragraph segmentation for evidence spans.
  2. Detect candidates — Companies, people, tokens, chains, taxonomy themes, playlist/show context — using whatever fits your stack (dictionaries, NER, LLM with a strict schema, or hybrid). Output internal candidates: surface form + optional context (sentence index, role hints).
  3. Map to Cortex entities — For each candidate, choose a canonical slug (and type) that matches Mongo. If ambiguous (two people named “Alex”), use surrounding text + GET list/search to disambiguate before locking the slug.
  4. Resolve with GET — GET /api/v1/{collection}?slug=…&limit=1 (or your public read routes). If a row exists, set document.mongoId from data[]._id. If not, omit mongoId and keep slug + hints in properties for downstream EnrichmentJob / create flows.
  5. Taxonomies — For CLASSIFIED_AS or video.classifiedAs / format, load live terms: GET /api/v1/taxonomies and GET /api/v1/taxonomy-terms?taxonomy=…. Never invent term ids from memory.
  6. Choose relationship types — Map intents to EDGE_TYPES: e.g. spoken reference → often MENTIONS; on-screen host/guest → FEATURES; theme → CLASSIFIED_AS → taxonomy-terms; series → IN_PLAYLIST → playlists.
  7. Emit JSON — Output only { "relationships": [ … ] }. Add properties.snippet / confidence when useful; importers may strip them before graph API calls and merge with the subject video or article elsewhere.
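A dictionary-based sketch of steps 2, 6 and 7 (the alias map, slugs, and the blanket use of MENTIONS are illustration-only assumptions; real pipelines would use NER, an LLM with a strict schema, or a hybrid):

```typescript
// Dictionary-based candidate detection and emission. ALIASES and the choice of
// MENTIONS for every hit are hand-rolled assumptions, not repo code.

interface Candidate {
  type: string;    // collection key, e.g. "companies"
  slug: string;
  snippet: string; // short evidence window around the match
}

const ALIASES: Record<string, { type: string; slug: string }> = {
  circle: { type: "companies", slug: "circle" },
  usdc: { type: "tokens", slug: "usdc" },
  base: { type: "blockchains", slug: "base" },
};

function detectCandidates(text: string): Candidate[] {
  const out: Candidate[] = [];
  for (const [surface, target] of Object.entries(ALIASES)) {
    const match = new RegExp(`\\b${surface}\\b`, "i").exec(text);
    if (match) {
      const start = Math.max(0, match.index - 20);
      const end = match.index + surface.length + 20;
      out.push({ ...target, snippet: text.slice(start, end).trim() });
    }
  }
  return out;
}

function toRelationships(candidates: Candidate[]) {
  return {
    relationships: candidates.map((c) => ({
      type: "MENTIONS",                         // step 6: map intent → EDGE_TYPES
      document: { type: c.type, slug: c.slug }, // mongoId added after GET lookup
      properties: { snippet: c.snippet },
    })),
  };
}
```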

3. Cortex APIs the tool uses

| Use | Example |
| --- | --- |
| Resolve entity by slug | GET /api/v1/people?slug=jane-doe&limit=1 |
| Taxonomy + terms | GET /api/v1/taxonomies?slug=video-kind then GET /api/v1/taxonomy-terms?taxonomy={id} |
| Optional: existing graph for dedup | GET /api/v1/graph/relationships?mongoId={subjectMongoId} |

Auth: use a session cookie or x-api-key like other Cortex API clients. No POST from this tool.
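A hedged sketch of the slug → mongoId lookup (the data[]._id envelope follows this guide, but BASE_URL and the env var names are assumptions to verify against your deployment):

```typescript
// Resolve a slug to a mongoId via the read API. Read-only: GET, never POST.
// Assumes Node 18+ (global fetch) and a data[] response envelope as described.

const BASE_URL = process.env.CORTEX_URL ?? "http://localhost:3000";

function resolveUrl(base: string, collection: string, slug: string): string {
  return `${base}/api/v1/${collection}?slug=${encodeURIComponent(slug)}&limit=1`;
}

async function resolveMongoId(
  collection: string,
  slug: string,
): Promise<string | undefined> {
  const res = await fetch(resolveUrl(BASE_URL, collection, slug), {
    headers: { "x-api-key": process.env.CORTEX_API_KEY ?? "" },
  });
  if (!res.ok) return undefined; // treat lookup failures as "row not found"
  const body = (await res.json()) as { data?: Array<{ _id?: string }> };
  return body.data?.[0]?._id;    // undefined → omit mongoId in the output
}
```

When resolveMongoId returns undefined, emit the relationship with slug only and leave row creation to downstream EnrichmentJob / create flows.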

4. Collections & relationships you can use

API collection keys (kebab-case, /api/v1/{collection}) — from src/lib/api/model-map.ts. Graph-backed rows get a Neo4j node with mongoId and can be sourceId / targetId in POST /api/v1/graph/relationships (src/lib/graph/mongo-neo4j-mapping.ts).

All collections

| Collection | Graph-backed | Neo4j label (if any) |
| --- | --- | --- |
| articles | Yes | Episode (sourceType: article) |
| blockchains | Yes | Blockchain |
| companies | Yes | Company |
| data-sources | Yes | DataSource |
| events | Yes | Event |
| investors | Yes | Investor |
| people | Yes | Person |
| playlists | Yes | Playlist |
| products | Yes | Product |
| shows | Yes | Show |
| taxonomy-terms | Yes | TaxonomyTerm |
| tokens | Yes | Asset |
| videos | Yes | Video |
| enrichment-jobs | No | |
| locations | No | |
| stablecoin-profiles | No | |
| taxonomies | No | — (use taxonomy-terms for graph) |

Topic exists only in Neo4j (no Mongo collection) — used with edges like COVERS from Episode.

Mongo references between collections (document fields)

These are FK fields on Mongo documents, not the same as Neo4j relationships[] in enrichment JSON — but they define how collections link in the database.

| From | Field | To |
| --- | --- | --- |
| products | company | companies (required) |
| videos | classifiedAs | taxonomy-terms (video-kind) |
| videos | parentVideo | videos (clips) |
| taxonomy-terms | taxonomy | taxonomies |
| stablecoin-profiles | token | tokens |
| companies / investors / people / events | location | locations (optional) |
| enrichment-jobs | contentId | polymorphic via contentType |

Allowed Neo4j relationship type strings

Must be one of EDGE_TYPES in src/graph/schema.ts — not every pair of labels supports every edge; the API/importer validates.

ABOUT · AFFILIATED_WITH · AGREES_WITH · AUDITED_BY · AUTHORED_BY · BELONGS_TO · CLASSIFIED_AS · CLIP_OF
· COMPETES_WITH · CONTRADICTS · COVERS · DISAGREES_WITH · EVOLVED_FROM · FEATURES · FOR_SHOW · FOUNDED
· HOSTS · INCLUDES · IN_PLAYLIST · INVESTED_IN · INVOLVES · ISSUED_BY · MADE · MENTIONS · NATIVE_TOKEN
· OPERATES · ORGANIZED_BY · PART_OF · PREDICTED · PRODUCT_OF · RECOMMENDED · REFERENCES · REGARDING
· RELATED_TO · SAID · SOURCED_FROM · SPONSORED_BY · SPONSORED_VIDEO · SPONSORS · SUPPORTS · TARGET
· TRIGGERED · WORKS_AT

Relationships per collection (Neo4j)

For each graph-backed collection, typical outgoing relationships (from this node → target type) and incoming (from source → this node). Use relationships[].type with document.type pointing at the target collection. Not every EDGE_TYPES pair is valid for every label pair; the API validates.

| Collection | Neo4j label | Typical outgoing | Typical incoming |
| --- | --- | --- | --- |
| articles | Episode | SOURCED_FROM → DataSource; COVERS → Topic (graph-only); MENTIONS → Company, Person, Asset, …; CLASSIFIED_AS → TaxonomyTerm; FEATURES → Person; AUTHORED_BY → Person; REFERENCES, … (pipeline) | |
| videos | Video | CLASSIFIED_AS → TaxonomyTerm (kind/format); CLIP_OF → Video; IN_PLAYLIST → Playlist; MENTIONS, FEATURES, … | CLIP_OF ← Video (child clips) |
| playlists | Playlist | FOR_SHOW → Show | IN_PLAYLIST ← Video |
| shows | Show | | FOR_SHOW ← Playlist |
| companies | Company | CLASSIFIED_AS → TaxonomyTerm; OPERATES → Blockchain | INVESTED_IN ← Investor; WORKS_AT, FOUNDED, AFFILIATED_WITH ← Person; MENTIONS ← Episode/Video; ISSUED_BY ← Asset (via product bridge); AUDITED_BY ← Asset (bridge) |
| people | Person | WORKS_AT, AFFILIATED_WITH, FOUNDED → Company; extraction edges (e.g. claims) | FEATURES, AUTHORED_BY ← Episode; MENTIONS ← content |
| tokens | Asset | CLASSIFIED_AS → TaxonomyTerm; PRODUCT_OF → Product; ISSUED_BY → Company; NATIVE_TOKEN → Blockchain; AUDITED_BY → Company | MENTIONS ← Episode/Video; … |
| products | Product | | PRODUCT_OF ← Asset |
| blockchains | Blockchain | CLASSIFIED_AS → TaxonomyTerm | NATIVE_TOKEN ← Asset; OPERATES ← Company |
| investors | Investor | INVESTED_IN → Company | |
| events | Event | ORGANIZED_BY, INVOLVES, … (per pipeline) | |
| data-sources | DataSource | | SOURCED_FROM ← Episode, Claim, … |
| taxonomy-terms | TaxonomyTerm | | CLASSIFIED_AS ← Company, Asset, Blockchain, Video, Episode, … |

Not graph-backed as primary vertices: locations, stablecoin-profiles, taxonomies, enrichment-jobs — use them via Mongo FKs or jobs, not as document.type endpoints for new graph nodes. Topic is Neo4j-only (no collection). Bridge-synced edges (OPERATES, INVESTED_IN, PRODUCT_OF, …) are often maintained by bridgeSync + maps — enrichment may still emit matching intents for review.

5. What happens after your JSON

A separate importer (or job worker) knows the subject content’s Mongo id (video, article, etc.), merges your relationships list with that subject, creates any missing target rows, runs bridge sync as needed, then POST /api/v1/graph/relationships with Mongo sourceId / targetId. Your scraper outputs relationships only; binding to the subject is downstream.
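For orientation only, the importer-side binding can be pictured like this (the real request body for POST /api/v1/graph/relationships is defined in src/app/api/v1/graph/relationships/route.ts — the shape below is a guess, not the contract):

```typescript
// Hypothetical importer-side merge: bind the scraper's relationships list to a
// known subject mongoId. Field names sourceId/targetId follow this guide's
// description; verify against the route before sending anything.

interface Rel {
  type: string;
  document: { mongoId?: string };
}

function toGraphWrites(subjectMongoId: string, rels: Rel[]) {
  return rels
    .filter((r) => r.document.mongoId) // targets must already exist in Mongo
    .map((r) => ({
      sourceId: subjectMongoId,
      targetId: r.document.mongoId!,
      type: r.type,
    }));
}
```

Relationships whose targets have no mongoId yet are skipped here; the importer creates those rows (or enqueues jobs) first, then retries the merge.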

6. Build checklist

  • Output is valid JSON (one object per run) with only a relationships array at the top level.
  • Every relationships[].type is a string from EDGE_TYPES.
  • Every relationships[].document has type + slug; mongoId only when GET found the row.
  • Taxonomy fields reference real taxonomy-terms rows.
  • Tool is read-only with respect to Cortex writes.
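The checklist's shape rules can be enforced with a small validator (EDGE_TYPES below is a placeholder subset; in the repo, import the real list from src/graph/schema.ts):

```typescript
// Minimal output validator for the build checklist. Returns a list of problems;
// an empty list means the shape rules pass. EDGE_TYPES here is a placeholder.

const EDGE_TYPES = new Set(["MENTIONS", "FEATURES", "CLASSIFIED_AS", "IN_PLAYLIST"]);

function validateOutput(obj: unknown): string[] {
  if (typeof obj !== "object" || obj === null) return ["output is not an object"];
  const errors: string[] = [];
  const keys = Object.keys(obj as Record<string, unknown>);
  if (keys.length !== 1 || keys[0] !== "relationships") {
    errors.push("top level must contain only a relationships array");
  }
  const rels = (obj as { relationships?: unknown }).relationships;
  if (!Array.isArray(rels)) return [...errors, "relationships is not an array"];
  rels.forEach((r, i) => {
    if (!EDGE_TYPES.has(r?.type)) errors.push(`[${i}] type not in EDGE_TYPES`);
    if (!r?.document?.type || !r?.document?.slug) {
      errors.push(`[${i}] document needs type + slug`);
    }
  });
  return errors;
}
```

Run it on every artifact before handing the JSON to an importer; it catches the most common contract breaks (extra top-level keys, unknown edge names, missing slugs).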

7. Code in this repo to study

  • src/graph/schema.ts — EDGE_TYPES, NODE_LABELS
  • src/app/api/v1/graph/relationships/route.ts — graph write shape (importer side)
  • src/lib/graph/mongo-neo4j-mapping.ts — which collections sync to Neo4j

All collection fields

Use kebab-case collection keys in /api/v1/{collection}. Fields below match structured admin forms in src/lib/admin/collection-form-fields.ts, plus model-only fields you may set via JSON in admin or API. Ref fields (location, company, taxonomy, …) are Mongo ObjectIds (24-char hex in APIs).

Every document includes Mongoose createdAt and updatedAt unless noted. The enrichment-jobs model maps to the physical Mongo collection extractionjobs.

| Collection | Fields |
| --- | --- |
| articles | title, slug, subheadline, format (taxonomy term slug, article-format), externalSourceUrl, externalSourceName, publishedAt, coverImage, publicId, published, featured — plus model content (mixed; JSON editor) |
| blockchains | name, slug, chainId, chainType, vmType, consensusMechanism, description, logo, explorerUrl, launchDate, socialLinks, published |
| companies | name, slug, tagline, description, yearFounded, logo, icon, brandColor, websiteUrl, employeeCount, fundingStage, totalFundingUsd, legalName, entityType, registrationNumber, countryOfIncorporation, location, socialLinks, publicId, verified, published, featured |
| data-sources | name, slug, sourceType, baseUrl, trustLevel, refreshFrequency, isActive, notes |
| enrichment-jobs | contentType, contentId, status, extractedAt, entityCount, errorLog, retryCount — plus model jobPayload (mixed; audit snapshot). Stored in collection extractionjobs. |
| events | title, slug, startDate, endDate, location, venue, summary, description, websiteUrl, registrationUrl, playlistUrl, timezone, publicId, coverImage, published |
| investors | name, slug, investorType, stages, description, logo, aumUsd, portfolioCount, websiteUrl, location, socialLinks, publicId, published |
| locations | city, country, latitude, longitude, timezone |
| people | name, slug, avatar, bio, location, socialLinks, publicId, published, contributor |
| playlists | name, slug, summary, description, coverImage, externalUrl, published |
| products | company, name, slug, description, productType, productStatus, launchDate, websiteUrl, docsUrl, sourceCodeUrl, isMainProduct, isOpenSource, published |
| shows | name, slug, summary, description, coverImage, socialLinks, published |
| stablecoin-profiles | token, pegTargetPrice, launchDate, auditFrequency, yieldSource, mintMinimumUsd, redeemMinimumUsd, redemptionTime, feeMintPct, feeRedeemPct, reserveRatio, riskScore, riskScoreRationale, whitepaperUrl |
| taxonomies | name, slug, description, appliesTo |
| taxonomy-terms | taxonomy, name, slug, description, color, displayOrder, isActive |
| tokens | symbol, name, slug, tokenType, isStablecoin, description, logo, coingeckoId, cmcId, defillamaId, socialLinks, verified, published — plus model metadata (mixed; JSON editor) |
| videos | title, slug, classifiedAs (taxonomy term id, video-kind), episodeNumber, publishedAt, format (term slug, video-format), parentVideo, clipStartTime, clipEndTime, videoUrl, type, coverImage, thumbnail, duration, transcript, published |


Complex or nested shapes not covered by the form (extra keys on any collection) are edited in the admin JSON tab. Use relationships[].document.type with these collection keys when pointing at targets from your scraper output.
