Enrichment tools

How enrichment workers and CLI tools relate to Cortex.

Enrichment tool output (JSON contract)

[!TIP] Styled HTML: enrichment-tools.html — same contract, full styling. Team overview: what-is-cortex.md · Docs index

How it works

What is Cortex?

Cortex is the content intelligence layer: REST APIs (/api/v1/...), admin UI, and the data model behind editorial content. It separates three jobs across stores (below). External tools (scrapers, importers) integrate by reading via GET and, elsewhere in the stack, writing through the same APIs or internal jobs — not by bypassing the model.

The distinct division (three stores)

None of the three replaces the others. Each answers a different question.

| Store | Question | In one sentence |
| --- | --- | --- |
| Document database | What are things? | Canonical documents: source of truth for entities, fields, refs, and admin CRUD. Implementation: MongoDB (Mongoose). |
| Graph database | How do things connect? | Typed relationships and traversals: who invested in whom, who works where, bridge-synced and curated edges; nodes link to documents via mongoId; edge names align with EDGE_TYPES in src/graph/schema.ts. Implementation: Neo4j. |
| Vector database (Pinecone; research) | What text is similar to what? | Vector similarity: transcript or chunk embeddings and semantic “find like this,” with metadata pointing back to document / graph ids. Not wired in this repo yet; see SCHEMA.md. |

One example each

  • Document database: An editor saves a Company — name, slug, website, location ref — in admin; that document is the record everyone trusts for display and APIs.
  • Graph database: A query asks “which companies did this investor fund?” — that is typed edges (e.g. INVESTED_IN) and traversals, not something the vector database or “similar text” answers.
  • Vector database: A user searches “regulatory pressure on stablecoin reserves” — the app returns the most similar transcript chunks, each tagged with content ids so you jump back to the real video in the document store (and optionally the graph).

Why not collapse into one database?

  • A vector database is built for fast similarity search over many vectors; it is not a full profile store or a general relationship map. You still need the document database for records and the graph database when connection structure and multi-hop questions are first-class.
  • The document database holds what each thing is; modeling the whole relationship surface only there gets painful as edges and traversals grow.
  • The graph database holds how things link; it is the wrong primary home for huge embedding indexes at scale — vectors live beside the graph, not instead of it.

Application layer: Admin + APIs — one place to manage data and expose internal (api/v1) and public (api/public/v1) APIs.

One-liner: Cortex is our unified content and knowledge platform: the document database holds what things are, the graph database holds how they connect, and (when wired) the vector database holds semantic search over chunks — plus one admin and APIs — so we’re not rebuilding the same data layer in every project.

Team overview: what-is-cortex.md. Full data rules: SCHEMA.md (bridge sync, CLASSIFIED_AS, etc.).

Workers and stubs

Queued enrichment (e.g. EnrichmentJob, company/person workers under src/lib/enrichment/) runs inside Cortex: claim job → apply patches → optional graph database updates. That path uses different payloads than this page. Treat it as background context so you know the repo layout; it is not the read-only JSON contract below. Stubs or partial implementations may exist alongside those workers — same rule: not what this document specifies.

Scrapers — focus of this document

Scrapers (and similar read-only emitters: LLM extractors, transcript pipelines) do not write to the document database or the graph database. They GET Cortex (and external APIs), resolve slugs and optional mongoId for relationship targets, and output a single JSON object — usually { video, relationships } for video ingest. A separate importer writes to the document and graph stores. Everything after this section is written for that flow.


What to build: Cortex supports multiple enrichment pipelines — different subject types (video, company, person, and more over time). An enrichment tool (scraper, LLM worker, etc.) only reads inputs: Cortex via GET, plus external sources (e.g. YouTube Data API). Its sole write is stdout / a file / a response body: one JSON object matching this contract. It does not POST, PATCH, or DELETE in Cortex, and does not call POST /api/v1/graph/relationships — that is a downstream importer or orchestrator that applies the JSON after optionally creating missing rows and syncing the graph.

Enrichment types (subjects)

| Subject | API collection | EnrichmentJob.contentType | Output / contract |
| --- | --- | --- | --- |
| Video | videos | video | { video, relationships }: primary shape in this document (document-store Video fields + graph edge hints). Use for scrapers (e.g. YouTube → metadata + transcript) and LLM extraction. |
| Company | companies | company | Context only. Worker payload (not this read-only contract): src/lib/enrichment/company/payload.ts, worker.ts; sample scripts/company-enrichment/sample-company-enrichment-json.ts. |
| Person | people | person | Context only. Worker payload (not this contract): src/lib/enrichment/person/payload.ts, worker.ts. |
| Article (optional) | articles | article | Same pattern as video for content graphs: root key often article + relationships; align fields with src/models/Article.ts and taxonomies whose appliesTo includes article. |

Worker jobs (EnrichmentJob, company/person queues) — context only: Listed so you know how other pipelines exist in the repo. They are not part of the read-only JSON contract in this document: no writes, no enqueue from the scraper. E.g. admin POST /api/v1/companies / people with enqueueEnrichmentJob: true (src/lib/api/enrichment-job-on-create.ts) and src/lib/enrichment/enrichment-job-http-service.ts. The read-only tool described here does not enqueue jobs.

Relationship-heavy vs patch-heavy (context): Video (and similar content) uses relationships as in this doc. Company / person workers use different typed payloads and apply patches elsewhere — background only for orientation.

Scraper / ingest goal (YouTube)

Goal: Given a YouTube URL (any common form: watch?v=, youtu.be/, /shorts/, live URLs), the tool normalizes the video id, fetches everything needed to populate or update a Cortex videos row and this JSON contract — not just the watch link.
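The normalization step can be sketched as a small pure function. This is a minimal illustration, not the pipeline's actual code: the function names `extractYouTubeId` and `canonicalWatchUrl`, the accepted hosts, and the 11-character id pattern are all assumptions based on the URL forms listed above.

```typescript
// Hypothetical helper: pull the video id out of common YouTube URL forms.
// Assumption: ids are 11 characters from [A-Za-z0-9_-], as seen in current YouTube URLs.
const YOUTUBE_ID = /^[A-Za-z0-9_-]{11}$/;

function extractYouTubeId(input: string): string | null {
  let url: URL;
  try {
    url = new URL(input);
  } catch {
    return null; // not a parseable absolute URL
  }
  // Treat www. and m. as the same site.
  const host = url.hostname.replace(/^(www\.|m\.)/, "");
  let candidate: string | null = null;

  if (host === "youtu.be") {
    // https://youtu.be/<id>
    candidate = url.pathname.split("/")[1] ?? null;
  } else if (host === "youtube.com") {
    if (url.pathname === "/watch") {
      // https://www.youtube.com/watch?v=<id>
      candidate = url.searchParams.get("v");
    } else {
      // /shorts/<id> and /live/<id>
      const match = url.pathname.match(/^\/(shorts|live)\/([^/]+)/);
      candidate = match?.[2] ?? null;
    }
  }
  return candidate !== null && YOUTUBE_ID.test(candidate) ? candidate : null;
}

// One stable form to store in videoUrl and to use as a dedupe key.
function canonicalWatchUrl(id: string): string {
  return `https://www.youtube.com/watch?v=${id}`;
}
```

Returning null for anything unrecognized lets the caller skip or log the input instead of emitting a video object with a guessed id.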

Typical payload to collect:

| Source | Maps to / use |
| --- | --- |
| Stable watch URL + video id | videoUrl; dedupe key |
| Title, description | title; optional excerpt in pipeline; slug often derived from title |
| Thumbnails (high-res) | coverImage, thumbnail |
| Duration (seconds) | duration |
| Published time | publishedAt |
| Captions / transcript (when available) | transcript |
| Channel id, title | optional relationships or future fields; not a Cortex Video FK today |

Cortex-specific: classifiedAs and format still come from your taxonomies (video-kind, video-format) — align with real term rows and ids in Cortex; do not treat YouTube’s category string as a substitute unless you explicitly map it.

Implementation is open: How you fetch metadata (any API, scraping, transcripts, LLM extraction, etc.), how you load taxonomies, and whether you inspect existing graph edges is your pipeline’s choice. The contract below is the output JSON shape — not a required sequence of GETs or tools.


Shape (video — graph JSON contract)

This section is the canonical { video, relationships } shape. For company / person worker payloads, see Enrichment types (subjects) and the src/lib/enrichment/* modules.

  • video (or your team’s root key for the subject document) — object with the Video fields your pipeline can fill. classifiedAs is the document database _id of a taxonomy term (a row in taxonomy-terms) belonging to taxonomy video-kind (structural kind: episode / clip / …). format is the slug of a taxonomy term under video-format (presentation: full / clip / …). Optional type is a free string. Also: title, slug, episodeNumber, publishedAt, videoUrl, duration, transcript, coverImage, thumbnail, published, … See videos in Collections and fields and src/models/Video.ts. Omit keys you do not have.
  • relationships — array of edges from this video to other entities. Each item has:
    • type — Graph edge / relationship name (e.g. SOURCED_FROM, MENTIONS, CLASSIFIED_AS, COVERS). Align with EDGE_TYPES in src/graph/schema.ts when in doubt.
    • document — target reference: type (API collection key), slug (canonical slug for that target — see Scraper responsibility below). Optionally mongoId: 24-char hex string from a GET response when the row already exists in Cortex (same value as document _id and graph node mongoId; use for targetId / sourceId in POST /api/v1/graph/relationships). Omit mongoId when the target does not exist yet (importer creates by slug first).
    • properties — optional key/value metadata on the edge (counts, context, URLs, timestamps, etc.).
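As a rough illustration, the shape described above could be typed and sanity-checked as follows. The interface names and the `checkOutput` helper are hypothetical (they are not taken from the repo), and the check covers only the relationship fields named in this section.

```typescript
// Types mirror the fields named above; the names themselves are illustrative.
interface RelationshipTarget {
  type: string;     // API collection key, e.g. "people"
  slug: string;     // canonical slug for the target
  mongoId?: string; // 24-char hex, present only when GET found an existing row
}

interface RelationshipItem {
  type: string;                         // edge name, e.g. "MENTIONS"
  document: RelationshipTarget;
  properties?: Record<string, unknown>; // optional edge metadata
}

interface EnrichmentOutput {
  video: Record<string, unknown>;       // Video fields the pipeline could fill
  relationships: RelationshipItem[];
}

const MONGO_ID = /^[0-9a-fA-F]{24}$/;

// Hypothetical check: returns a list of problems; an empty list means
// every relationship has a type, a target type, a slug, and a well-formed mongoId (if any).
function checkOutput(output: EnrichmentOutput): string[] {
  const problems: string[] = [];
  output.relationships.forEach((rel, i) => {
    if (!rel.type) problems.push(`relationships[${i}].type is empty`);
    if (!rel.document.type) problems.push(`relationships[${i}].document.type is empty`);
    if (!rel.document.slug) problems.push(`relationships[${i}].document.slug is empty`);
    const id = rel.document.mongoId;
    if (id !== undefined && !MONGO_ID.test(id)) {
      problems.push(`relationships[${i}].document.mongoId is not 24-char hex`);
    }
  });
  return problems;
}
```

A tool might run a check like this just before writing the object to stdout, so malformed ids are caught before the importer sees them.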

Scraper responsibility: slugs and mongoId

The scraper / enrichment tool is responsible for figuring out the right relationship (who to link) and the correct canonical slug per collection so Cortex lookups are unambiguous (e.g. disambiguate taxonomy-terms by taxonomy, pick the right playlist vs similarly named rows). It reads only (GET): after GET /api/v1/{collection}?slug=… (or list + filter) returns a document, include mongoId on that relationship’s document copied from the response’s _id. slug remains the stable human key; mongoId is proof the target was resolved in Cortex and lets the importer skip a second slug→id lookup when calling the graph API (which uses document-store ids, not slugs — see src/app/api/v1/graph/relationships/route.ts).

Data to include in each relationships[] item

| Field | Required | What to put |
| --- | --- | --- |
| type | Yes | Relationship name from EDGE_TYPES in src/graph/schema.ts (e.g. SOURCED_FROM, MENTIONS, CLASSIFIED_AS). |
| document | Yes | type, slug (required). mongoId (optional): set when GET returned an existing row; 24-char hex. Tokens use slug on the token document, not symbol. |
| properties | No | Any JSON-safe object (strings, numbers, booleans, nested objects). Stored on the graph relationship when integrators call POST /api/v1/graph/relationships. Omit or use {} if you have no edge metadata. |

Suggested properties by edge type (conventions for video / content enrichment — all optional; extend as needed):

| type | Typical document.type | Useful properties keys |
| --- | --- | --- |
| SOURCED_FROM | data-sources | originalUrl, scrapedAt (ISO 8601), optional title / referrer |
| MENTIONS | companies, people, tokens, blockchains, investors, products, … | context (e.g. sentiment or theme), count (approx. mentions), confidence (0–1), snippet (short quote from source) |
| CLASSIFIED_AS | taxonomy-terms | confidence (0–1), optional reason (short string) |
| COVERS |  | Graph-only Topic nodes are not Mongo documents; omit unless your team has an agreed pipeline |
| FEATURES | people | role (e.g. host, guest, quoted) |
| IN_PLAYLIST | playlists | position (order in playlist), optional addedAt |
| AUTHORED_BY | people | order (integer if multiple authors), role (e.g. byline) |
| REFERENCES | articles, videos, … (same graph-backed collections) | citation, url, accessedAt |

Use ISO 8601 for datetimes. Do not put secrets or PII you are not allowed to store in the graph.
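For instance, a SOURCED_FROM item with an ISO 8601 scrapedAt could be assembled like this. The helper name `sourcedFromEdge` is hypothetical; the property keys follow the conventions in the table above.

```typescript
interface SourcedFromEdge {
  type: "SOURCED_FROM";
  document: { type: "data-sources"; slug: string; mongoId?: string };
  properties: { originalUrl: string; scrapedAt: string };
}

// Hypothetical builder: Date.prototype.toISOString() always yields a UTC
// ISO 8601 string such as "2025-04-15T18:00:00.000Z".
function sourcedFromEdge(
  slug: string,
  originalUrl: string,
  scrapedAt: Date,
  mongoId?: string,
): SourcedFromEdge {
  return {
    type: "SOURCED_FROM",
    // Leave the mongoId key out entirely when the data source was not resolved.
    document: mongoId
      ? { type: "data-sources", slug, mongoId }
      : { type: "data-sources", slug },
    properties: { originalUrl, scrapedAt: scrapedAt.toISOString() },
  };
}
```

Omitting the key, instead of setting it to null or an empty string, keeps the "no id means not resolved" signal unambiguous for the importer.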


Relationships: targets (enrichment JSON vs downstream writes)

A relationship applied in the graph database is only valid if the thing at the other end exists in Cortex (document store, then graph). When the importer runs, it cannot create an edge to a slug that does not exist.

Enrichment tool (this JSON only): Resolve each target to the right slug. When a row exists in Cortex, include document.mongoId (from GET). When it does not exist yet (e.g. a new guest), omit mongoId entirely — that is the signal: no id on the relationship means the downstream system should treat the target as to be created. Optional properties can carry hints (e.g. role, proposedDisplayName). The tool does not POST, enqueue jobs, or create EnrichmentJob rows — output JSON only.

Downstream importer / orchestrator (writes Cortex + graph):

  1. Resolve: If mongoId is present, use it for graph targetId / sourceId (optional check against slug). If mongoId is missing, treat the target as not in Cortex, so there is no graph edge yet; the pipeline usually records a pending EnrichmentJob in enrichment-jobs so a worker can create the people / companies / … document and finish enrichment.
  2. Create if missing: POST /api/v1/{collection} (see Creating a document (POST)), often with enqueueEnrichmentJob where supported, then ensure graph nodes exist (bridge sync / upsert as you do today).
  3. Link: POST /api/v1/graph/relationships (or equivalent) only after both ends exist.
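The first importer step can be pictured as a simple split on the presence of mongoId. This is a sketch under that one rule only; the `partitionByResolution` name and the types are hypothetical, and a real importer would go on to create rows and edges.

```typescript
interface RelationshipItem {
  type: string;
  document: { type: string; slug: string; mongoId?: string };
  properties?: Record<string, unknown>;
}

interface Partition {
  resolved: RelationshipItem[]; // mongoId present: ready to link once both ends exist
  pending: RelationshipItem[];  // mongoId missing: target must be created first
}

// Hypothetical importer helper: separate ready-to-link items from create-first items.
function partitionByResolution(relationships: RelationshipItem[]): Partition {
  const resolved: RelationshipItem[] = [];
  const pending: RelationshipItem[] = [];
  for (const rel of relationships) {
    if (rel.document.mongoId) {
      resolved.push(rel);
    } else {
      pending.push(rel);
    }
  }
  return { resolved, pending };
}
```

Items in pending correspond to the "Create if missing" step; only after that step succeeds would they join the resolved set for linking.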

Collection keys (use as relationships[].document.type):

| Collection key | Graph label (when synced) |
| --- | --- |
| companies | Company |
| people | Person |
| tokens | Asset |
| blockchains | Blockchain |
| investors | Investor |
| products | Product |
| data-sources | DataSource |
| taxonomy-terms | TaxonomyTerm |
| videos | Video |
| playlists | Playlist |
| shows | Show |
| articles | Episode (article) |

Topic and a few other graph labels are not in this list; do not emit relationships to them in this JSON unless your team has a separate, agreed process.

Taxonomies and document types (appliesTo)

Each Taxonomy document has appliesTo: an array of tags naming which document / entity types that scheme is meant to classify. Admins set it when creating or editing a taxonomy (see taxonomies in Collections and fields). Cortex does not enforce appliesTo in API validators today; it is a contract for humans and tools so you pick the right terms (e.g. only use “content theme” terms for articles, “token category” terms for tokens).

Seed examples (scripts/seed-mongo.ts): Token Category → appliesTo: ['token']; Company Sector → ['company', 'blockchain']; Content Theme → ['article', 'event', 'investor'].

Suggested tag → collection alignment (use the same vocabulary your taxonomies use):

| appliesTo tag | Typical collection key |
| --- | --- |
| article | articles |
| company | companies |
| token | tokens |
| blockchain | blockchains |
| event | events |
| investor | investors |
| video | videos |
| person | people |

Tooling: GET /api/v1/taxonomies returns each taxonomy’s appliesTo. Filter taxonomies (and list terms via GET /api/v1/taxonomy-terms with populated taxonomy) so CLASSIFIED_AS edges only reference taxonomy-terms whose Taxonomy document’s appliesTo includes the content type you are enriching (e.g. tag video → use terms from taxonomies that list video in appliesTo).
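The filtering described above might look like the following once both list responses are in memory. The row shapes are assumptions (only _id, slug, appliesTo, and a taxonomy ref that may be a plain id or a populated object are used), and `termsForContentType` is a hypothetical name.

```typescript
interface TaxonomyRow {
  _id: string;
  slug: string;
  appliesTo: string[];
}

interface TermRow {
  _id: string;
  slug: string;
  // Assumption: the ref is either a plain id or a populated object carrying _id.
  taxonomy: string | { _id: string };
}

// Keep only terms whose parent taxonomy lists the given appliesTo tag.
function termsForContentType(
  taxonomies: TaxonomyRow[],
  terms: TermRow[],
  tag: string,
): TermRow[] {
  const allowed = new Set(
    taxonomies.filter((t) => t.appliesTo.includes(tag)).map((t) => t._id),
  );
  return terms.filter((term) => {
    const taxonomyId =
      typeof term.taxonomy === "string" ? term.taxonomy : term.taxonomy._id;
    return allowed.has(taxonomyId);
  });
}
```

Since Cortex does not enforce appliesTo in validators, a client-side filter of this kind is where the convention actually gets applied.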

Admin dropdowns (no hardcoded enums): Video.classifiedAs stores the Mongo _id of a taxonomy-terms row under taxonomy video-kind (structured form: taxonomyTermRef). Video.format and Article.format still use term slugs from video-format / article-format (taxonomyTermSlug). taxonomy-terms documents use taxonomyRef in the admin form for the taxonomy field (Mongo id of the Taxonomy); there is no term-to-term parent — terms are flat within each taxonomy. Bridge sync writes (Video)-[:CLASSIFIED_AS { aspect: 'videoKind' }]->(TaxonomyTerm) from classifiedAs, and (Episode)-[:CLASSIFIED_AS { aspect: 'articleFormat' }]->(TaxonomyTerm) from Article.format. See SCHEMA.md.


Detailed read flow (real example)

This is one concrete way to fill { video, relationships }: load taxonomies first, then resolve entities by slug, then playlist + show context. Replace BASE with your app origin (e.g. http://localhost:3000). Use x-api-key or session auth as required for /api/v1.

1. Taxonomies (video-kind, video-format, content theme)

| Step | Request | Use the response for |
| --- | --- | --- |
| 1a | GET BASE/api/v1/taxonomies?slug=video-kind | Taxonomy _id |
| 1b | GET BASE/api/v1/taxonomy-terms?taxonomy={videoKindId}&limit=100 | Pick the video kind term → video.classifiedAs = that term’s _id |
| 1c | GET BASE/api/v1/taxonomies?slug=video-format | Taxonomy _id |
| 1d | GET BASE/api/v1/taxonomy-terms?taxonomy={videoFormatId}&limit=100 | Pick format → video.format = that term’s slug (e.g. full) |
| 1e | (optional) GET BASE/api/v1/taxonomies then terms for your content theme taxonomy | CLASSIFIED_AS → relationships[].document for taxonomy-terms (slug + mongoId from data[]._id) |

2. People, companies, tokens, blockchains

For each slug you want in the graph, resolve the document and copy data[0]._id (or the single hit) into document.mongoId.

| Request | Typical edge |
| --- | --- |
| GET BASE/api/v1/people?slug=jeremy-allaire&limit=1 | FEATURES |
| GET BASE/api/v1/companies?slug=circle&limit=1 | MENTIONS |
| GET BASE/api/v1/tokens?slug=usdc&limit=1 | MENTIONS |
| GET BASE/api/v1/blockchains?slug=ethereum&limit=1 | MENTIONS |

List routes often match slug / name as substring — use limit=1 when you expect a single canonical row.
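Because slug matching is described as substring-based, a single returned row is not guaranteed to be the exact slug that was asked for. One defensive option, sketched here with hypothetical names and assuming the list response carries rows under data (as the data[0]._id wording above suggests), is to request a few rows and keep only an exact match.

```typescript
interface SlugRow {
  _id: string;
  slug: string;
}

interface ListResponse<T> {
  data: T[];
}

// Return the row whose slug equals the requested slug exactly,
// or null when there is no exact match or more than one.
function pickExactSlug<T extends SlugRow>(
  response: ListResponse<T>,
  slug: string,
): T | null {
  const exact = response.data.filter((row) => row.slug === slug);
  return exact.length === 1 ? (exact[0] ?? null) : null;
}
```

A null result maps naturally onto the contract: emit the relationship with type and slug only, and leave mongoId out.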

3. Playlist and show

| Step | Request | Use for |
| --- | --- | --- |
| 3a | GET BASE/api/v1/playlists?slug=stablecoin-weekly&limit=1 | IN_PLAYLIST → document.mongoId = playlist _id |
| 3b | GET BASE/api/v1/graph/relationships?mongoId={playlistMongoId} | Find FOR_SHOW → target Show mongoId (playlist show is a graph edge; useful for copy, validation, or UI even if this JSON does not emit a Show relationship on the video) |
| 3c | (alternative) GET BASE/api/v1/shows?slug={showSlug}&limit=1 | Resolve Show directly if you already know the slug from CMS / site |

4. Emit one JSON object

Merge into video + relationships as below. Rule of thumb: every mongoId you attach came from a GET response _id for that collection (or from graph targetId when following edges).


Example

The Detailed read flow section walks through GET taxonomies, people / companies / tokens / blockchains, playlist, show (via graph or shows), then this JSON.

Episode-style video with the two fields that resolve as taxonomy terms (Mongo collection taxonomy-terms) in src/models/Video.ts:

  • classifiedAs: Required. Mongo _id of the taxonomy term for video kind, i.e. the term row under taxonomy video-kind whose slug is e.g. episode, clip, or other (see seed data). This field is a ref to a TaxonomyTerm document, not an arbitrary string.
  • format: Slug of the taxonomy term for video format, i.e. the term under taxonomy video-format (e.g. full, clip). Stored as a slug string on Video, but it must match a real taxonomy term in that vocabulary; not a MIME type.
  • type: Optional free string on Video.type (not a taxonomy term unless you add a separate convention).

Plus graph hints: FEATURES (people on camera), MENTIONS (companies, tokens, blockchains, …), IN_PLAYLIST, CLASSIFIED_AS (e.g. content theme).

{
  "video": {
    "_id": "665a1b2c3d4e5f6a7b8c9d0e",
    "title": "Inside USDC Reserves — Q1 2025 with Circle",
    "slug": "inside-usdc-reserves-q1-2025-circle",
    "classifiedAs": "674a1b2c3d4e5f6a7b8c9d0f",
    "episodeNumber": 42,
    "publishedAt": "2025-04-15T18:00:00.000Z",
    "format": "full",
    "type": "interview",
    "videoUrl": "https://cdn.example.com/episodes/inside-usdc-reserves-q1-2025.m3u8",
    "duration": 1842,
    "coverImage": "https://cdn.example.com/covers/usdc-reserves-ep42.jpg",
    "thumbnail": "https://cdn.example.com/thumbs/usdc-reserves-ep42.webp",
    "transcript": null,
    "published": true
  },
  "relationships": [
    {
      "type": "FEATURES",
      "document": {
        "type": "people",
        "slug": "jeremy-allaire",
        "mongoId": "507f1f77bcf86cd799439011"
      },
      "properties": { "role": "host" }
    },
    {
      "type": "MENTIONS",
      "document": {
        "type": "companies",
        "slug": "circle",
        "mongoId": "507f1f77bcf86cd799439012"
      },
      "properties": { "context": "primary", "snippet": "discussion of reserve composition" }
    },
    {
      "type": "MENTIONS",
      "document": {
        "type": "tokens",
        "slug": "usdc",
        "mongoId": "507f1f77bcf86cd799439013"
      },
      "properties": { "context": "analytical", "count": 24 }
    },
    {
      "type": "MENTIONS",
      "document": {
        "type": "blockchains",
        "slug": "ethereum",
        "mongoId": "507f1f77bcf86cd799439014"
      },
      "properties": { "context": "background", "snippet": "EVM settlement" }
    },
    {
      "type": "IN_PLAYLIST",
      "document": {
        "type": "playlists",
        "slug": "stablecoin-weekly",
        "mongoId": "674a1b2c3d4e5f6a7b8c9d1a"
      },
      "properties": { "position": 7 }
    },
    {
      "type": "CLASSIFIED_AS",
      "document": {
        "type": "taxonomy-terms",
        "slug": "deep-dive",
        "mongoId": "507f1f77bcf86cd799439015"
      }
    }
  ]
}

Resolve both as taxonomy terms from live data: use GET /api/v1/taxonomies?slug=video-kind and slug=video-format to get each taxonomy’s _id, then GET /api/v1/taxonomy-terms?taxonomy={id} so you pick the taxonomy-terms row for kind (classifiedAs = that term’s _id) and the taxonomy-terms row for format (format = that term’s slug). For each relationships[].document, after GET resolves the target, copy _id into mongoId as in the example above.

The importer creates missing targets with POST /api/v1/{collection} when needed; taxonomy-terms rows require taxonomy, the Mongo _id of the Taxonomy document. Playlist / show links in the graph may also follow (Playlist)-[:FOR_SHOW]->(Show) outside this JSON; see SCHEMA.md.

Example: guest not in Cortex yet (output only)

There is no people row for this guest. The tool still emits type + slug on document and does not put mongoId on that relationship — no id is the whole signal.

Example output fragment (inside relationships[]):

{
  "type": "FEATURES",
  "document": {
    "type": "people",
    "slug": "jane-doe-guest-analyst"
  },
  "properties": {
    "role": "guest",
    "proposedDisplayName": "Jane Doe"
  }
}

Downstream: Ingestion sees document without mongoId, treats it as create this target, and records a pending EnrichmentJob (e.g. contentType: person, payload with slug + hints) so a worker can POST the people document and later attach graph edges. The read-only tool never creates that job — only the pipeline that consumes the JSON does.

Rule: The enrichment tool never invents taxonomy classifiedAs / format values without aligning to Cortex vocabulary; free-text hints in properties for people/companies you are about to create are OK.


Helpers: API reference (read vs write)

Enrichment tools use GET (and public read routes) below to validate vocabulary and optionally check whether targets exist. POST / PATCH / DELETE and POST …/graph/relationships are for importers and other services, not for the read-only JSON emitter.

Replace BASE with your app origin (e.g. http://localhost:3000). Authenticated /api/v1 routes accept a browser session or x-api-key: <CORTEX_API_KEY> (see README.md). Public routes require x-api-key only.

Creating a document (POST) — importer only

POST /api/v1/{collection} — send a JSON object whose keys match the collection schema (see Collections and fields). Response: 201 Created with the new document (including assigned _id).

| Requirement | Notes |
| --- | --- |
| Headers | Content-Type: application/json; auth as for other v1 routes (x-api-key or session). |
| ObjectId fields | Refs (location, taxonomy, company, token, …) are 24-character hex strings. |
| Dates | Prefer ISO 8601 strings (e.g. "2025-04-15T14:30:00.000Z"). |
| Errors | 400 if validation fails (missing required field, duplicate slug, invalid type); body { "error": "…" }. 404 if {collection} is not in modelMap. |

Optional — enqueue enrichment after create: include "enqueueEnrichmentJob": true in the JSON only for POST /api/v1/companies or POST /api/v1/people. That flag is not saved on the document; it only creates a pending EnrichmentJob for the new row. For all other collections, omit it.

Updates and deletes: PATCH /api/v1/{collection}/{id} — merge fields (graph-backed collections may sync the vertex in the graph database after save). DELETE /api/v1/{collection}/{id} — remove the document.

Public API (/api/public/v1/...) does not expose POST for creates; use /api/v1/... with a key or session.

Example (data source):

curl -sS -X POST "$BASE/api/v1/data-sources" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $CORTEX_API_KEY" \
  -d '{
    "name": "CoinDesk",
    "slug": "coindesk",
    "sourceType": "news",
    "baseUrl": "https://www.coindesk.com",
    "isActive": true
  }'

Implementation: src/app/api/v1/[collection]/route.ts.

List + filter (internal, full data)

GET /api/v1/{collection} — paginated list; pass query params to narrow results.

| Param | Behavior |
| --- | --- |
| name, slug, title | Case-insensitive substring match (regex), good for “find slug containing …” |
| Other fields (e.g. symbol, taxonomy, isActive) | Exact match on that field |
| page, limit (≤ 100), sort | Pagination and sort (default -createdAt) |
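A small URL builder can keep these query rules in one place. The sketch below uses the standard URL API; `buildListUrl` is a hypothetical name, and clamping limit on the client simply mirrors the documented ≤ 100 cap (how the server treats larger values is not specified here).

```typescript
type ListParams = Record<string, string | number | boolean>;

// Build a GET /api/v1/{collection} URL with query params.
// Note: base is treated as an origin; any path prefix on it is dropped
// because the route path starts with "/".
function buildListUrl(
  base: string,
  collection: string,
  params: ListParams = {},
): string {
  const url = new URL(`/api/v1/${encodeURIComponent(collection)}`, base);
  for (const [key, value] of Object.entries(params)) {
    // Keep limit within the documented maximum of 100.
    const normalized = key === "limit" ? Math.min(Number(value), 100) : value;
    url.searchParams.set(key, String(normalized));
  }
  return url.toString();
}
```

For example, buildListUrl("http://localhost:3000", "companies", { slug: "circle", limit: 5 }) produces the same URL as the first curl example that follows.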

Examples:

# Company by slug fragment
curl -sS "$BASE/api/v1/companies?slug=circle&limit=5" -H "x-api-key: $CORTEX_API_KEY"

# Token by slug (tokens collection = Asset in graph)
curl -sS "$BASE/api/v1/tokens?slug=usdc&limit=5" -H "x-api-key: $CORTEX_API_KEY"

# Taxonomy term by slug (narrow further if the same slug can exist in two taxonomies)
curl -sS "$BASE/api/v1/taxonomy-terms?slug=market-analysis&limit=10" -H "x-api-key: $CORTEX_API_KEY"

# Data sources
curl -sS "$BASE/api/v1/data-sources?slug=coindesk&limit=5" -H "x-api-key: $CORTEX_API_KEY"

Implementation: src/app/api/v1/[collection]/route.ts. Collections available are those in src/lib/api/model-map.ts (companies, people, tokens, articles, taxonomies, taxonomy-terms, data-sources, …).

One document by Mongo _id

GET /api/v1/{collection}/{id} — returns one document; population of refs matches populateMap (e.g. taxonomy terms include taxonomy). Use after you have an id from a list response.

One document by slug (published only)

GET /api/public/v1/{collection}/{slug} returns documents with published: true only. Same x-api-key as other public routes. Handy for read-only checks against live content; not suitable if drafts must be linked.

Cross-collection search (published, text / autocomplete)

GET /api/public/v1/search?q=...&collections=companies,people&limit=10

  • mode=autocomplete — prefix match on name (regex ^q), published docs only.
  • Default — $text search where indexes exist; collections is comma-separated and must be keys in modelMap.

Requires q. See src/app/api/public/v1/search/route.ts.

Taxonomies and terms (vocabulary)

Required context for any enrichment that emits taxonomy-backed fields or CLASSIFIED_AS edges. Do not ship a tool that guesses term slugs without calling these.

  • GET /api/v1/taxonomies?limit=100 (paginate if needed) — each taxonomy’s appliesTo (which document types it applies to), slug, name, _id; see Taxonomies and document types.
  • GET /api/v1/taxonomy-terms?limit=100 (paginate; filter with ?taxonomy=<taxonomyMongoId> when you know the vocabulary) — terms with populated taxonomy so you can filter by taxonomy slug or id and build a slug → _id map that matches production.
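The slug → _id map mentioned above can be keyed by taxonomy slug plus term slug, so the same term slug in two taxonomies does not collide. This is a sketch: the populated taxonomy shape (an object with _id and slug) and the `buildTermIndex` name are assumptions.

```typescript
interface PopulatedTerm {
  _id: string;
  slug: string;
  // Assumption: taxonomy arrives populated with at least _id and slug.
  taxonomy: { _id: string; slug: string };
}

// Index terms as "<taxonomy slug>/<term slug>" -> term _id.
function buildTermIndex(terms: PopulatedTerm[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const term of terms) {
    index.set(`${term.taxonomy.slug}/${term.slug}`, term._id);
  }
  return index;
}
```

With such an index, video.classifiedAs could be filled from index.get("video-kind/episode"), and an undefined result signals a term that does not exist in Cortex and should not be invented.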

Graph: what is already linked?

Relevant when you add or update relationships for graph-backed collections (companies, people, videos, articles, tokens, taxonomy-terms, …; see the Graph edges by collection table under Relationships). For each relevant Mongo _id, call:

GET /api/v1/graph/relationships?mongoId={mongoId} — returns existing relationships for that node in the graph (if synced). Useful when you want to align with bridge sync or deduplicate edges; optional for your pipeline. See src/app/api/v1/graph/relationships/route.ts.
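If a pipeline does choose to deduplicate, the comparison could be as simple as the sketch below. The response shape is not specified in this document, so the `ExistingEdge` fields (type, targetId) are an assumption to be adjusted against the real route; `edgesNotYetInGraph` is a hypothetical name.

```typescript
// Assumed shape of one existing edge from the graph route; the real response may differ.
interface ExistingEdge {
  type: string;
  targetId: string;
}

interface CandidateEdge {
  type: string;
  document: { type: string; slug: string; mongoId?: string };
}

// Drop candidates whose (type, target mongoId) pair already exists in the graph.
function edgesNotYetInGraph(
  candidates: CandidateEdge[],
  existing: ExistingEdge[],
): CandidateEdge[] {
  const seen = new Set(existing.map((e) => `${e.type}:${e.targetId}`));
  return candidates.filter((c) => {
    const id = c.document.mongoId;
    // Candidates without mongoId are treated as new targets and always kept.
    return id === undefined || !seen.has(`${c.type}:${id}`);
  });
}
```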

Admin UI

/admin/data/{collection} — browse and search lists in the browser when you prefer not to script.

Scripts and direct DB access

Repo scripts under scripts/ use MONGODB_URI with Mongoose; you can run one-off findOne({ slug }) / findById in a script for bulk or offline resolution. Same data as the API, no HTTP.


Collections and fields (API keys)

Use these kebab-case names in /api/v1/{collection} and admin URLs. Fields below match the structured admin forms (src/lib/admin/collection-form-fields.ts) plus a few model-only fields you may set via JSON in admin or API.

| Collection | Fields |
| --- | --- |
| companies | name, slug, tagline, description, yearFounded, logo, icon, brandColor, websiteUrl, employeeCount, fundingStage, totalFundingUsd, legalName, entityType, registrationNumber, countryOfIncorporation, location, socialLinks, publicId, verified, published, featured |
| people | name, slug, avatar, bio, location, socialLinks, publicId, published, contributor |
| products | company, name, slug, description, productType, productStatus, launchDate, websiteUrl, docsUrl, sourceCodeUrl, isMainProduct, isOpenSource, published |
| tokens | symbol, name, slug, tokenType, isStablecoin, description, logo, coingeckoId, cmcId, defillamaId, socialLinks, verified, published; plus model metadata (mixed object; JSON editor) |
| blockchains | name, slug, chainId, chainType, vmType, consensusMechanism, description, logo, explorerUrl, launchDate, socialLinks, published |
| investors | name, slug, investorType, stages, description, logo, aumUsd, portfolioCount, websiteUrl, location, socialLinks, publicId, published |
| locations | city, country, latitude, longitude, timezone |
| articles | title, slug, subheadline, format, externalSourceUrl, externalSourceName, publishedAt, coverImage, publicId, published, featured; plus model content (mixed; JSON editor) |
| videos | title, slug, classifiedAs, episodeNumber, publishedAt, format, parentVideo, clipStartTime, clipEndTime, videoUrl, type, coverImage, thumbnail, duration, transcript, published |
| shows | name, slug, summary, description, coverImage, socialLinks, published |
| playlists | name, slug, summary, description, coverImage, externalUrl, published |
| events | title, slug, startDate, endDate, location, venue, summary, description, websiteUrl, registrationUrl, playlistUrl, timezone, publicId, coverImage, published |
| stablecoin-profiles | token, pegTargetPrice, launchDate, auditFrequency, yieldSource, mintMinimumUsd, redeemMinimumUsd, redemptionTime, feeMintPct, feeRedeemPct, reserveRatio, riskScore, riskScoreRationale, whitepaperUrl |
| taxonomies | name, slug, description, appliesTo |
| taxonomy-terms | taxonomy, name, slug, description, color, displayOrder, isActive |
| data-sources | name, slug, sourceType, baseUrl, trustLevel, refreshFrequency, isActive, notes |
| enrichment-jobs | contentType, contentId, status, extractedAt, entityCount, errorLog, retryCount; plus model jobPayload (mixed; audit snapshot) |

All of the above collections use Mongoose createdAt / updatedAt timestamps except where the physical collection name differs (e.g. enrichment jobs store in extractionjobs). Ref fields (location, company, …) are Mongo ObjectIds (hex strings in APIs).


Relationships

A) Mongo document references (foreign keys)

These are stored on the document in MongoDB (not in the graph as the primary store). Resolve ids before writing.

| From collection | Field | To collection |
| --- | --- | --- |
| products | company | companies (required) |
| videos | classifiedAs | taxonomy-terms (required; taxonomy video-kind) |
| videos | parentVideo | videos (optional; clip → parent) |
| stablecoin-profiles | token | tokens (required) |
| taxonomy-terms | taxonomy | taxonomies (required) |
| companies | location | locations (optional) |
| investors | location | locations (optional) |
| people | location | locations (optional) |
| events | location | locations (optional) |
| enrichment-jobs | contentId | polymorphic: contentType names the model (e.g. article, video, company) |

Standalone (no outgoing FK in this schema): articles, blockchains, data sources, shows, playlists, taxonomies, tokens (except being referenced), and others not listed above. Show ↔ playlist ↔ episode wiring is graph-only (see below).

B) Graph edges by collection (graph-backed rows)

Only collections in GRAPH_BACKED_COLLECTIONS (src/lib/graph/mongo-neo4j-mapping.ts) have nodes with mongoId suitable for POST /api/v1/graph/relationships. Allowed relationship type names are exactly EDGE_TYPES in src/graph/schema.ts (not every type applies to every pair).

| Collection | Graph label | Typical outgoing (from this node) | Typical incoming (to this node) |
| --- | --- | --- | --- |
| articles | Episode (sourceType: article) | SOURCED_FROM → DataSource; COVERS → Topic; MENTIONS → Company / Person / Asset / …; CLASSIFIED_AS → TaxonomyTerm; FEATURES → Person; AUTHORED_BY → Person; knowledge edges per pipeline (REFERENCES, …) |  |
| videos | Video | CLASSIFIED_AS → TaxonomyTerm (from classifiedAs; aspect: 'videoKind'); CLIP_OF → Video (clips); IN_PLAYLIST → Playlist; MENTIONS, FEATURES, … | CLIP_OF ← Video |
| playlists | Playlist | FOR_SHOW → Show | IN_PLAYLIST ← Video |
| shows | Show |  | FOR_SHOW ← Playlist |
| companies | Company | CLASSIFIED_AS → TaxonomyTerm; OPERATES → Blockchain | INVESTED_IN ← Investor; WORKS_AT / FOUNDED ← Person; MENTIONS ← Episode/Video/…; AUDITED_BY ← Asset (curated bridge); ISSUED_BY ← Asset (via product) |
| people | Person | WORKS_AT / AFFILIATED_WITH → Company; FOUNDED → Company; MENTIONS, extraction edges (e.g. to Claim) | FEATURES ← Episode; AUTHORED_BY ← Episode |
| tokens | Asset | CLASSIFIED_AS → TaxonomyTerm; PRODUCT_OF → Product; ISSUED_BY → Company (bridge); NATIVE_TOKEN → Blockchain (curated) | MENTIONS ← …; AUDITED_BY → Company |
| products | Product |  | PRODUCT_OF ← Asset; company link in Mongo drives issuer edges in bridge sync |
| blockchains | Blockchain | CLASSIFIED_AS → TaxonomyTerm; NATIVE_TOKEN ← Asset | OPERATES ← Company |
| investors | Investor | INVESTED_IN → Company |  |
| events | Event | ORGANIZED_BY, INVOLVES, … (per pipeline) |  |
| data-sources | DataSource |  | SOURCED_FROM ← Episode (article) / Claim / … |
| taxonomy-terms | TaxonomyTerm |  | CLASSIFIED_AS ← Company, Asset, Blockchain, … |

Not graph-backed as primary nodes (no mongoId vertex for these in the same way): locations, stablecoin-profiles, taxonomies (Taxonomy has no bridge node; only taxonomy-terms sync to the graph), enrichment-jobs.

Topic (COVERS, etc.) is graph-only (no Mongo collection); see SCHEMA.md.

Bridge / seed edges (curated maps, not Mongo FKs): src/graph/bridge-maps.ts and src/graph/bridgeSync.ts — e.g. TOKEN_PRODUCT_SLUG, OPERATES_EDGES, PERSON_WORKS_AT_EDGES, INVESTED_IN_EDGES, NATIVE_TOKEN_EDGES, AUDITED_BY_EDGES.


Checklist

  • Output must be valid JSON (parseable; use UTF-8).
  • The enrichment tool does not write to Cortex’s document or graph stores — only emits JSON (stdout, file, or HTTP body).
  • Prefer ISO 8601 strings for datetimes.
  • If there are no relationships, use "relationships": [].
  • Taxonomy-backed fields (classifiedAs, format, etc.) should match Cortex’s controlled vocabulary (how you discover valid ids/slugs is up to you).
  • Relationship targets: correct slug per collection (scraper’s job); include document.mongoId when GET returned an existing row.
  • relationships may omit mongoId when the target does not exist yet — downstream typically EnrichmentJob + create row, then edges (see Relationships: targets).
  • Importer (separate step): before graph edges, every relationships[].document must resolve in Mongo (create via POST if needed).
  • Importer: may query graph/relationships when merging edges to avoid duplicates (optional dedup strategy).
  • Deliverable from the tool: that JSON object and nothing else required from an API write perspective.

For data model details, see SCHEMA.md.