Building a relationship scraper (text → enrichment JSON)
[!TIP] Other docs: docs/README.md · Enrichment contract · What is Cortex
A practical guide: turn transcripts or article text into Cortex-ready { "relationships": [ … ] } JSON — only the relationships array (no video or article root) — without writing to the database from the extractor itself.
This page is self-contained: context, jobs and workers, the relationship scraper, and a full list of Mongo collection fields.
What is Cortex?
Cortex is a content intelligence stack: structured documents live in MongoDB (companies, people, videos, articles, tokens, taxonomies, and more). A parallel Neo4j graph stores how those entities connect—relationship types are fixed in code as EDGE_TYPES (see src/graph/schema.ts). Rows that participate in the graph carry a mongoId bridge so graph nodes line up with Mongo documents.
The app exposes REST APIs under /api/v1/…: read collections by slug or id, admin writes, enrichment worker endpoints, and graph helpers. Classification uses taxonomy terms in Mongo (per taxonomy); many graph edges (e.g. CLASSIFIED_AS) point at taxonomy-terms nodes in Neo4j, not ad-hoc string tags on documents.
Rule of thumb. Mongo answers “what is this record?” Neo4j answers “what is it linked to, and how?” A relationship scraper reads text and proposes links in that graph vocabulary—it does not replace either database.
What you will build later
In the Relationship scraper tab, you will define a tool that turns a transcript or article body into a JSON object that contains only relationships (no subject document block). Importers merge that list with a video, article, or other subject id they already have, then create rows and graph edges—separate from the core Cortex servers unless you wire them yourself.
Jobs → workers → documents → relationship scraper
Enrichment jobs are Mongo documents in the enrichment-jobs collection (the model maps to the physical Mongo collection extractionjobs). Each job references a piece of content by contentType (e.g. company, person) and contentId, and moves through pending → processing → complete or failed. Workers (or HTTP clients) claim pending jobs, load the target document, run enrichment logic, then mark the job finished and persist extracted fields or graph payloads.
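The job lifecycle described above can be pictured as a small state machine. The following is a minimal sketch, assuming an in-memory list in place of the Mongo collection; `Job`, `transition`, and `claim_next` are hypothetical names for illustration, not Cortex functions.

```python
from dataclasses import dataclass
from typing import List, Optional

# Status names come from the text above; the transition table is an illustration.
ALLOWED = {
    "pending": {"processing"},
    "processing": {"complete", "failed"},
    "complete": set(),
    "failed": set(),
}

@dataclass
class Job:
    content_type: str        # e.g. "company" or "person"
    content_id: str          # id of the target document (placeholder values below)
    status: str = "pending"

def transition(job: Job, new_status: str) -> None:
    """Move a job to new_status, rejecting moves the lifecycle does not allow."""
    if new_status not in ALLOWED[job.status]:
        raise ValueError(f"{job.status} -> {new_status} is not allowed")
    job.status = new_status

def claim_next(jobs: List[Job]) -> Optional[Job]:
    """Claim the first pending job by marking it as processing."""
    for job in jobs:
        if job.status == "pending":
            transition(job, "processing")
            return job
    return None

jobs = [Job("company", "id-1"), Job("person", "id-2")]
claimed = claim_next(jobs)
transition(claimed, "complete")
print([j.status for j in jobs])
```

A real worker would claim jobs atomically in the database so that two workers cannot pick up the same row; the list scan here only shows the allowed status moves.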
Today, dedicated code paths exist for company and person enrichment (src/lib/enrichment/company/worker.ts, src/lib/enrichment/person/worker.ts), with APIs under /api/v1/enrichment/workers/…. Other content types can follow the same pattern: one job row, one worker specialized for that document shape.
A relationship scraper is different in one important way: its input is usually plain text (transcript, article HTML stripped to text), not a single Mongo id. It should output only a relationships list (see the Relationship scraper tab). Downstream, you merge that list with a subject video or article id, enqueue a job for review, or call the graph API after ids exist. Think of it as an upstream or sidecar step—not necessarily the same queue as company/person workers unless you integrate it.
Diagram legend: solid lines are the job queue and typed workers. Dashed lines are conceptual: text → JSON → importer is often a separate pipeline that feeds Cortex once you map text to a subject video or article.
What you’re building. A CLI, service, or notebook that reads plain text from stdin (or from a file) and writes one JSON object to stdout (or to a file). That object has a single top-level key, relationships, and nothing else (no video, article, or other subject payload). It may call Cortex GET to resolve slugs → mongoId. It does not POST documents or graph edges.
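As a rough sketch of that contract, the skeleton below reads text from one stream and writes the artifact to another. `extract_relationships` is a hypothetical placeholder for your own detection logic, and the demo uses in-memory streams so it runs without a terminal.

```python
import io
import json
import sys

def extract_relationships(text):
    """Placeholder: a real extractor would detect and map entities here."""
    return []

def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read plain text, write one JSON object whose only key is `relationships`."""
    text = stdin.read()
    artifact = {"relationships": extract_relationships(text)}
    json.dump(artifact, stdout, ensure_ascii=False, indent=2)
    stdout.write("\n")

# Demo with in-memory streams; a real CLI would simply call main().
demo_out = io.StringIO()
main(io.StringIO("HOST: Today we're unpacking how Circle issues USDC."), demo_out)
print(demo_out.getvalue())
```

Keeping the I/O in `main` and the logic in `extract_relationships` makes it easy to reuse the same extractor from a service or a notebook.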
Flow: input text to relationship list
Plain text goes through detection and mapping; the only remote calls your extractor should make are GET (optional slug resolution). The artifact is only that list: an object { "relationships": [ … ] } — no subject metadata (title, slug, transcript) in the same output; the pipeline that owns the video or article attaches those separately.
Example: snippet in, relationships out
Input (fictional transcript):
HOST: Today we’re unpacking how Circle issues USDC and what it means for payments.
Later we’ll touch Base as the chain they’re leaning on.
Output (excerpt — only relationships; mongoId only if GET found a row):
{
  "relationships": [
    {
      "type": "MENTIONS",
      "document": { "type": "companies", "slug": "circle", "mongoId": "674a…" },
      "properties": { "snippet": "Circle issues USDC" }
    },
    {
      "type": "MENTIONS",
      "document": { "type": "tokens", "slug": "usdc" },
      "properties": { "snippet": "issues USDC" }
    },
    {
      "type": "MENTIONS",
      "document": { "type": "blockchains", "slug": "base" },
      "properties": { "snippet": "touch Base as the chain" }
    }
  ]
}
1. Read the contracts
- Edge names: Only use strings from src/graph/schema.ts → EDGE_TYPES. Not every edge applies to every node pair.
- Collection keys: relationships[].document.type uses API kebab-case keys (companies, people, tokens, taxonomy-terms, …). Graph-backed collections are listed in src/lib/graph/mongo-neo4j-mapping.ts (GRAPH_BACKED_COLLECTIONS).
- Output shape: Root object with only relationships (array). Each item has type (an EDGE_TYPES string), document with type, slug, optional mongoId, and optional properties (e.g. snippet, confidence). Do not emit video, article, or other subject keys here; the importer binds this list to the subject document.
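To make the output shape concrete, here is a small sketch of a builder for one relationships item. `make_relationship` is a hypothetical helper name; the keys it emits (type, document.type, document.slug, optional document.mongoId, optional properties) follow the shape described above.

```python
def make_relationship(edge_type, collection, slug, mongo_id=None, **properties):
    """Build one item for the relationships array.

    mongoId is included only when a value was actually resolved, and
    properties is included only when at least one property was given.
    """
    document = {"type": collection, "slug": slug}
    if mongo_id:
        document["mongoId"] = mongo_id
    item = {"type": edge_type, "document": document}
    if properties:
        item["properties"] = properties
    return item

output = {
    "relationships": [
        make_relationship("MENTIONS", "tokens", "usdc", snippet="issues USDC"),
        make_relationship("MENTIONS", "blockchains", "base"),
    ]
}
print(output)
```

Routing every item through one builder keeps optional keys absent rather than null, which is easier for an importer to handle.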
2. Recommended pipeline
- Ingest text: Normalize encoding (UTF-8), strip HTML if needed, optional sentence/paragraph segmentation for evidence spans.
- Detect candidates: Companies, people, tokens, chains, taxonomy themes, playlist/show context, using whatever fits your stack (dictionaries, NER, LLM with a strict schema, or hybrid). Output internal candidates: surface form + optional context (sentence index, role hints).
- Map to Cortex entities: For each candidate, choose a canonical slug (and type) that matches Mongo. If ambiguous (two people named “Alex”), use surrounding text + GET list/search to disambiguate before locking the slug.
- Resolve with GET: GET /api/v1/{collection}?slug=…&limit=1 (or your public read routes). If a row exists, set document.mongoId from data[]._id. If not, omit mongoId and keep slug + hints in properties for downstream EnrichmentJob / create flows.
- Taxonomies: For CLASSIFIED_AS or video.classifiedAs / format, load live terms: GET /api/v1/taxonomies and GET /api/v1/taxonomy-terms?taxonomy=…. Never invent term ids from memory.
- Choose relationship types: Map intents to EDGE_TYPES: e.g. spoken reference → often MENTIONS; on-screen host/guest → FEATURES; theme → CLASSIFIED_AS → taxonomy-terms; series → IN_PLAYLIST → playlists.
- Emit JSON: Output only { "relationships": [ … ] }. Add properties.snippet / confidence when useful; importers may strip them before graph API calls and merge with the subject video or article elsewhere.
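The first steps of the pipeline above (ingest, detect, map, emit) can be sketched with a plain dictionary lookup. This is one illustrative option among those the list mentions, not a recommended detector: `KNOWN_ENTITIES` is a hypothetical hand-made dictionary, every match is emitted as MENTIONS, and no GET resolution is attempted, so mongoId is always omitted.

```python
import re

# Hypothetical dictionary: surface form -> (collection key, slug).
KNOWN_ENTITIES = {
    "Circle": ("companies", "circle"),
    "USDC": ("tokens", "usdc"),
    "Base": ("blockchains", "base"),
}

def split_sentences(text):
    """Naive segmentation: split on whitespace that follows ., ! or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def detect(text):
    """Return {"relationships": [...]} with one MENTIONS item per known entity."""
    seen = set()
    relationships = []
    for sentence in split_sentences(text):
        for surface, (collection, slug) in KNOWN_ENTITIES.items():
            if (collection, slug) in seen:
                continue  # keep only the first evidence sentence per entity
            if re.search(rf"\b{re.escape(surface)}\b", sentence):
                seen.add((collection, slug))
                relationships.append({
                    "type": "MENTIONS",
                    "document": {"type": collection, "slug": slug},
                    "properties": {"snippet": sentence},
                })
    return {"relationships": relationships}

transcript = (
    "HOST: Today we're unpacking how Circle issues USDC and what it means for payments.\n"
    "Later we'll touch Base as the chain they're leaning on."
)
result = detect(transcript)
print([r["document"]["slug"] for r in result["relationships"]])
```

A case-sensitive word match like this will misfire on ordinary words (for example a sentence that merely starts with "Base"), which is why the pipeline calls for disambiguation before locking a slug.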
3. Cortex APIs the tool uses
| Use | Example |
| --- | --- |
| Resolve entity by slug | GET /api/v1/people?slug=jane-doe&limit=1 |
| Taxonomy + terms | GET /api/v1/taxonomies?slug=video-kind then GET /api/v1/taxonomy-terms?taxonomy={id} |
| Optional: existing graph for dedup | GET /api/v1/graph/relationships?mongoId={subjectMongoId} |
Auth: use a session cookie or x-api-key like other Cortex API clients. No POST from this tool.
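A slug lookup along these lines might look like the sketch below, using only the Python standard library. It assumes the response body is JSON with a top-level data array whose rows carry _id (as in data[]._id above); the base URL, the API key value, and the injected `get_json` parameter are illustrative assumptions. The demo swaps in a fake fetcher so it runs without network access.

```python
import json
import urllib.parse
import urllib.request

def http_get_json(url, api_key):
    """Perform a GET with the x-api-key header and decode the JSON body."""
    request = urllib.request.Request(url, headers={"x-api-key": api_key})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)

def build_lookup_url(base_url, collection, slug):
    """Build /api/v1/{collection}?slug=...&limit=1 for a kebab-case collection key."""
    query = urllib.parse.urlencode({"slug": slug, "limit": 1})
    return f"{base_url.rstrip('/')}/api/v1/{collection}?{query}"

def resolve_mongo_id(base_url, collection, slug, api_key, get_json=http_get_json):
    """Return the _id of the first matching row, or None when no row exists."""
    payload = get_json(build_lookup_url(base_url, collection, slug), api_key)
    rows = payload.get("data") or []
    return rows[0].get("_id") if rows else None

# Offline demo: a fake fetcher returning a made-up 24-char hex id.
def fake_get_json(url, api_key):
    return {"data": [{"_id": "0123456789abcdef01234567", "slug": "jane-doe"}]}

print(build_lookup_url("https://cortex.example", "people", "jane-doe"))
print(resolve_mongo_id("https://cortex.example", "people", "jane-doe", "KEY", fake_get_json))
```

Only GET is issued here, which keeps the tool read-only; when the lookup returns None, the item keeps its slug and omits mongoId.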
4. Collections & relationships you can use
API collection keys (kebab-case, /api/v1/{collection}) — from src/lib/api/model-map.ts. Graph-backed rows get a Neo4j node with mongoId and can be sourceId / targetId in POST /api/v1/graph/relationships (src/lib/graph/mongo-neo4j-mapping.ts).
All collections
| Collection | Graph-backed | Neo4j label (if any) |
| --- | --- | --- |
| articles | Yes | Episode (sourceType: article) |
| blockchains | Yes | Blockchain |
| companies | Yes | Company |
| data-sources | Yes | DataSource |
| events | Yes | Event |
| investors | Yes | Investor |
| people | Yes | Person |
| playlists | Yes | Playlist |
| products | Yes | Product |
| shows | Yes | Show |
| taxonomy-terms | Yes | TaxonomyTerm |
| tokens | Yes | Asset |
| videos | Yes | Video |
| enrichment-jobs | No | - |
| locations | No | - |
| stablecoin-profiles | No | - |
| taxonomies | No | - (use taxonomy-terms for graph) |
Topic exists only in Neo4j (no Mongo collection) — used with edges like COVERS from Episode.
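A scraper can use the table above to reject targets that cannot become graph nodes. The sketch below hard-codes a snapshot of that table for illustration; in the repo the source of truth is src/lib/graph/mongo-neo4j-mapping.ts, and `NEO4J_LABELS` and `split_targets` are hypothetical names.

```python
# Snapshot of the graph-backed collections and their labels, copied from the table above.
NEO4J_LABELS = {
    "articles": "Episode",
    "blockchains": "Blockchain",
    "companies": "Company",
    "data-sources": "DataSource",
    "events": "Event",
    "investors": "Investor",
    "people": "Person",
    "playlists": "Playlist",
    "products": "Product",
    "shows": "Show",
    "taxonomy-terms": "TaxonomyTerm",
    "tokens": "Asset",
    "videos": "Video",
}

def split_targets(relationships):
    """Separate items whose document.type is graph-backed from those that are not."""
    usable, rejected = [], []
    for rel in relationships:
        collection = rel.get("document", {}).get("type")
        (usable if collection in NEO4J_LABELS else rejected).append(rel)
    return usable, rejected

usable, rejected = split_targets([
    {"type": "MENTIONS", "document": {"type": "tokens", "slug": "usdc"}},
    {"type": "MENTIONS", "document": {"type": "locations", "slug": "somewhere"}},
])
print(len(usable), len(rejected))
```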
Mongo references between collections (document fields)
These are FK fields on Mongo documents, not the same as Neo4j relationships[] in enrichment JSON — but they define how collections link in the database.
| From | Field | To |
| --- | --- | --- |
| products | company | companies (required) |
| videos | classifiedAs | taxonomy-terms (video-kind) |
| videos | parentVideo | videos (clips) |
| taxonomy-terms | taxonomy | taxonomies |
| stablecoin-profiles | token | tokens |
| companies / investors / people / events | location | locations (optional) |
| enrichment-jobs | contentId | polymorphic via contentType |
Allowed Neo4j relationship type strings
Must be one of EDGE_TYPES in src/graph/schema.ts — not every pair of labels supports every edge; the API/importer validates.
ABOUT · AFFILIATED_WITH · AGREES_WITH · AUDITED_BY · AUTHORED_BY · BELONGS_TO · CLASSIFIED_AS · CLIP_OF
· COMPETES_WITH · CONTRADICTS · COVERS · DISAGREES_WITH · EVOLVED_FROM · FEATURES · FOR_SHOW · FOUNDED
· HOSTS · INCLUDES · IN_PLAYLIST · INVESTED_IN · INVOLVES · ISSUED_BY · MADE · MENTIONS · NATIVE_TOKEN
· OPERATES · ORGANIZED_BY · PART_OF · PREDICTED · PRODUCT_OF · RECOMMENDED · REFERENCES · REGARDING
· RELATED_TO · SAID · SOURCED_FROM · SPONSORED_BY · SPONSORED_VIDEO · SPONSORS · SUPPORTS · TARGET
· TRIGGERED · WORKS_AT
Relationships per collection (Neo4j)
For each graph-backed collection, typical outgoing relationships (from this node → target type) and incoming (from source → this node). Use relationships[].type with document.type pointing at the target collection. Not every EDGE_TYPES pair is valid for every label pair; the API validates.
| Collection | Neo4j label | Typical outgoing | Typical incoming |
| --- | --- | --- | --- |
| articles | Episode | SOURCED_FROM → DataSource; COVERS → Topic (graph-only); MENTIONS → Company, Person, Asset, …; CLASSIFIED_AS → TaxonomyTerm; FEATURES → Person; AUTHORED_BY → Person; REFERENCES, … (pipeline) | - |
| videos | Video | CLASSIFIED_AS → TaxonomyTerm (kind/format); CLIP_OF → Video; IN_PLAYLIST → Playlist; MENTIONS, FEATURES, … | CLIP_OF ← Video (child clips) |
| playlists | Playlist | FOR_SHOW → Show | IN_PLAYLIST ← Video |
| shows | Show | - | FOR_SHOW ← Playlist |
| companies | Company | CLASSIFIED_AS → TaxonomyTerm; OPERATES → Blockchain | INVESTED_IN ← Investor; WORKS_AT, FOUNDED, AFFILIATED_WITH ← Person; MENTIONS ← Episode/Video; ISSUED_BY ← Asset (via product bridge); AUDITED_BY ← Asset (bridge) |
| people | Person | WORKS_AT, AFFILIATED_WITH, FOUNDED → Company; extraction edges (e.g. claims) | FEATURES, AUTHORED_BY ← Episode; MENTIONS ← content |
| tokens | Asset | CLASSIFIED_AS → TaxonomyTerm; PRODUCT_OF → Product; ISSUED_BY → Company; NATIVE_TOKEN → Blockchain; AUDITED_BY → Company | MENTIONS ← Episode/Video; … |
| products | Product | - | PRODUCT_OF ← Asset |
| blockchains | Blockchain | CLASSIFIED_AS → TaxonomyTerm | NATIVE_TOKEN ← Asset; OPERATES ← Company |
| investors | Investor | INVESTED_IN → Company | - |
| events | Event | ORGANIZED_BY, INVOLVES, … (per pipeline) | - |
| data-sources | DataSource | - | SOURCED_FROM ← Episode, Claim, … |
| taxonomy-terms | TaxonomyTerm | - | CLASSIFIED_AS ← Company, Asset, Blockchain, Video, Episode, … |
Not graph-backed as primary vertices: locations, stablecoin-profiles, taxonomies, enrichment-jobs — use them via Mongo FKs or jobs, not as document.type endpoints for new graph nodes. Topic is Neo4j-only (no collection). Bridge-synced edges (OPERATES, INVESTED_IN, PRODUCT_OF, …) are often maintained by bridgeSync + maps — enrichment may still emit matching intents for review.
5. What happens after your JSON
A separate importer (or job worker) knows the subject content’s Mongo id (video, article, etc.), merges your relationships list with that subject, creates any missing target rows, runs bridge sync as needed, then calls POST /api/v1/graph/relationships with Mongo sourceId / targetId. Your scraper outputs relationships only; binding to the subject is downstream.
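To illustrate the hand-off, the sketch below shows how an importer might bind a relationships list to a subject id. Only sourceId and targetId are named in this guide; the `type` key in the payload and the `to_graph_payloads` helper are assumptions, and the real write shape is defined in src/app/api/v1/graph/relationships/route.ts. No request is sent here.

```python
def to_graph_payloads(subject_mongo_id, scraper_output):
    """Pair each resolved relationship with the subject id.

    Items without document.mongoId are returned separately, because their
    target rows must be created before an edge can be written.
    """
    payloads, unresolved = [], []
    for rel in scraper_output["relationships"]:
        target_id = rel["document"].get("mongoId")
        if not target_id:
            unresolved.append(rel)
            continue
        # properties (snippet, confidence) are dropped here; an importer may keep them for review.
        payloads.append({
            "sourceId": subject_mongo_id,
            "targetId": target_id,
            "type": rel["type"],  # assumed key name for the edge type
        })
    return payloads, unresolved

scraper_output = {
    "relationships": [
        {"type": "MENTIONS", "document": {"type": "companies", "slug": "circle", "mongoId": "target-id-1"}},
        {"type": "MENTIONS", "document": {"type": "tokens", "slug": "usdc"}},
    ]
}
payloads, unresolved = to_graph_payloads("subject-id-1", scraper_output)
print(payloads)
print([r["document"]["slug"] for r in unresolved])
```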
6. Build checklist
- Output is valid JSON (one object per run) with only a relationships array at the top level.
- Every relationships[].type ∈ EDGE_TYPES.
- Every relationships[].document has type + slug; mongoId only when GET found the row.
- Taxonomy fields reference real taxonomy-terms rows.
- Tool is read-only with respect to Cortex writes.
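The structural items on this checklist can be checked mechanically before hand-off. The sketch below covers only the first three items; the taxonomy and read-only items need live data and code review. `EDGE_TYPES` here is a snapshot copied from the list in section 4 (src/graph/schema.ts remains the source of truth), and `check_output` is a hypothetical helper name.

```python
import json

# Snapshot of the allowed edge type strings listed in section 4.
EDGE_TYPES = set("""
ABOUT AFFILIATED_WITH AGREES_WITH AUDITED_BY AUTHORED_BY BELONGS_TO CLASSIFIED_AS CLIP_OF
COMPETES_WITH CONTRADICTS COVERS DISAGREES_WITH EVOLVED_FROM FEATURES FOR_SHOW FOUNDED
HOSTS INCLUDES IN_PLAYLIST INVESTED_IN INVOLVES ISSUED_BY MADE MENTIONS NATIVE_TOKEN
OPERATES ORGANIZED_BY PART_OF PREDICTED PRODUCT_OF RECOMMENDED REFERENCES REGARDING
RELATED_TO SAID SOURCED_FROM SPONSORED_BY SPONSORED_VIDEO SPONSORS SUPPORTS TARGET
TRIGGERED WORKS_AT
""".split())

def check_output(raw_json):
    """Return a list of checklist violations; an empty list means the output passes."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict) or list(data) != ["relationships"]:
        return ["top level must be an object with only a 'relationships' key"]
    if not isinstance(data["relationships"], list):
        return ["'relationships' must be an array"]
    errors = []
    for i, rel in enumerate(data["relationships"]):
        if not isinstance(rel, dict):
            errors.append(f"[{i}] item must be an object")
            continue
        edge = rel.get("type")
        if not isinstance(edge, str) or edge not in EDGE_TYPES:
            errors.append(f"[{i}] type {edge!r} is not in EDGE_TYPES")
        doc = rel.get("document")
        if not isinstance(doc, dict) or not doc.get("type") or not doc.get("slug"):
            errors.append(f"[{i}] document needs both type and slug")
    return errors

good = '{"relationships": [{"type": "MENTIONS", "document": {"type": "tokens", "slug": "usdc"}}]}'
bad = '{"video": {}, "relationships": []}'
print(check_output(good))
print(check_output(bad))
```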
7. Code in this repo to study
- src/graph/schema.ts: EDGE_TYPES, NODE_LABELS
- src/app/api/v1/graph/relationships/route.ts: graph write shape (importer side)
- src/lib/graph/mongo-neo4j-mapping.ts: which collections sync to Neo4j
All collection fields
Use kebab-case collection keys in /api/v1/{collection}. Fields below match structured admin forms in src/lib/admin/collection-form-fields.ts, plus model-only fields you may set via JSON in admin or API. Ref fields (location, company, taxonomy, …) are Mongo ObjectIds (24-char hex in APIs).
Every document includes Mongoose createdAt and updatedAt unless noted. The enrichment-jobs model maps to the physical Mongo collection extractionjobs.
| Collection | Fields |
| --- | --- |
| articles | title, slug, subheadline, format (taxonomy term slug, article-format), externalSourceUrl, externalSourceName, publishedAt, coverImage, publicId, published, featured; plus model content (mixed; JSON editor) |
| blockchains | name, slug, chainId, chainType, vmType, consensusMechanism, description, logo, explorerUrl, launchDate, socialLinks, published |
| companies | name, slug, tagline, description, yearFounded, logo, icon, brandColor, websiteUrl, employeeCount, fundingStage, totalFundingUsd, legalName, entityType, registrationNumber, countryOfIncorporation, location, socialLinks, publicId, verified, published, featured |
| data-sources | name, slug, sourceType, baseUrl, trustLevel, refreshFrequency, isActive, notes |
| enrichment-jobs | contentType, contentId, status, extractedAt, entityCount, errorLog, retryCount; plus model jobPayload (mixed; audit snapshot). Stored in collection extractionjobs. |
| events | title, slug, startDate, endDate, location, venue, summary, description, websiteUrl, registrationUrl, playlistUrl, timezone, publicId, coverImage, published |
| investors | name, slug, investorType, stages, description, logo, aumUsd, portfolioCount, websiteUrl, location, socialLinks, publicId, published |
| locations | city, country, latitude, longitude, timezone |
| people | name, slug, avatar, bio, location, socialLinks, publicId, published, contributor |
| playlists | name, slug, summary, description, coverImage, externalUrl, published |
| products | company, name, slug, description, productType, productStatus, launchDate, websiteUrl, docsUrl, sourceCodeUrl, isMainProduct, isOpenSource, published |
| shows | name, slug, summary, description, coverImage, socialLinks, published |
| stablecoin-profiles | token, pegTargetPrice, launchDate, auditFrequency, yieldSource, mintMinimumUsd, redeemMinimumUsd, redemptionTime, feeMintPct, feeRedeemPct, reserveRatio, riskScore, riskScoreRationale, whitepaperUrl |
| taxonomies | name, slug, description, appliesTo |
| taxonomy-terms | taxonomy, name, slug, description, color, displayOrder, isActive |
| tokens | symbol, name, slug, tokenType, isStablecoin, description, logo, coingeckoId, cmcId, defillamaId, socialLinks, verified, published; plus model metadata (mixed; JSON editor) |
| videos | title, slug, classifiedAs (taxonomy term id, video-kind), episodeNumber, publishedAt, format (term slug, video-format), parentVideo, clipStartTime, clipEndTime, videoUrl, type, coverImage, thumbnail, duration, transcript, published |
Complex or nested shapes not covered by the form (extra keys on any collection) are edited in the admin JSON tab. Use relationships[].document.type with these collection keys when pointing at targets from your scraper output.