Enrichment tool output (JSON contract)
> [!TIP]
> Styled HTML: enrichment-tools.html — same contract, full styling. Team overview: what-is-cortex.md · Docs index.
How it works
What is Cortex?
Cortex is the content intelligence layer: REST APIs (/api/v1/...), admin UI, and the data model behind editorial content. It separates three jobs across stores (below). External tools (scrapers, importers) integrate by reading via GET and, elsewhere in the stack, writing through the same APIs or internal jobs — not by bypassing the model.
The division of labor (three stores)
None of the three replaces the others. Each answers a different question.
| Store | Question | In one sentence |
|---|---|---|
| Document database | What are things? | Canonical documents — source of truth for entities, fields, refs, and admin CRUD. Implementation: MongoDB (Mongoose). |
| Graph database | How do things connect? | Typed relationships and traversals — who invested in whom, who works where, bridge-synced and curated edges; nodes link to documents via mongoId; edge names align with EDGE_TYPES in src/graph/schema.ts. Implementation: Neo4j. |
| Vector database (Pinecone — research) | What text is similar to what? | Vector similarity — transcript or chunk embeddings and semantic “find like this,” with metadata pointing back to document / graph ids. Not wired in this repo yet — see SCHEMA.md. |
One example each
- Document database: An editor saves a Company — name, slug, website, location ref — in admin; that document is the record everyone trusts for display and APIs.
- Graph database: A query asks “which companies did this investor fund?” — that is typed edges (e.g. INVESTED_IN) and traversals, not something the vector database or “similar text” answers.
- Vector database: A user searches “regulatory pressure on stablecoin reserves” — the app returns the most similar transcript chunks, each tagged with content ids so you jump back to the real video in the document store (and optionally the graph).
Why not collapse into one database?
- A vector database is built for fast similarity search over many vectors; it is not a full profile store or a general relationship map. You still need the document database for records and the graph database when connection structure and multi-hop questions are first-class.
- The document database holds what each thing is; modeling the whole relationship surface only there gets painful as edges and traversals grow.
- The graph database holds how things link; it is the wrong primary home for huge embedding indexes at scale — vectors live beside the graph, not instead of it.
Application layer: Admin + APIs — one place to manage data and expose internal (api/v1) and public (api/public/v1) APIs.
One-liner: Cortex is our unified content and knowledge platform: the document database holds what things are, the graph database holds how they connect, and (when wired) the vector database holds semantic search over chunks — plus one admin and APIs — so we’re not rebuilding the same data layer in every project.
Team overview: what-is-cortex.md. Full data rules: SCHEMA.md (bridge sync, CLASSIFIED_AS, etc.).
Workers and stubs
Queued enrichment (e.g. EnrichmentJob, company/person workers under src/lib/enrichment/) runs inside Cortex: claim job → apply patches → optional graph database updates. That path uses different payloads than this page. Treat it as background context so you know the repo layout; it is not the read-only JSON contract below. Stubs or partial implementations may exist alongside those workers — same rule: not what this document specifies.
Scrapers — focus of this document
Scrapers (and similar read-only emitters: LLM extractors, transcript pipelines) do not write to the document database or the graph database. They GET Cortex (and external APIs), resolve slugs and optional mongoId for relationship targets, and output a single JSON object — usually { video, relationships } for video ingest. A separate importer writes to the document and graph stores. Everything after this section is written for that flow.
What to build: Cortex supports multiple enrichment pipelines — different subject types (video, company, person, and more over time). An enrichment tool (scraper, LLM worker, etc.) only reads inputs: Cortex via GET, plus external sources (e.g. YouTube Data API). Its sole write is stdout / a file / a response body: one JSON object matching this contract. It does not POST, PATCH, or DELETE in Cortex, and does not call POST /api/v1/graph/relationships — that is a downstream importer or orchestrator that applies the JSON after optionally creating missing rows and syncing the graph.
Enrichment types (subjects)
| Subject | API collection | EnrichmentJob.contentType | Output / contract |
|---|---|---|---|
| Video | videos | video | { video, relationships } — primary shape in this document (document-store Video fields + graph edge hints). Use for scrapers (e.g. YouTube → metadata + transcript) and LLM extraction. |
| Company | companies | company | Context only — worker payload (not this read-only contract): src/lib/enrichment/company/payload.ts, worker.ts; sample scripts/company-enrichment/sample-company-enrichment-json.ts. |
| Person | people | person | Context only — worker payload (not this contract): src/lib/enrichment/person/payload.ts, worker.ts. |
| Article (optional) | articles | article | Same pattern as video for content graphs: root key often article + relationships; align fields with src/models/Article.ts and taxonomies whose appliesTo includes article. |
Worker jobs (EnrichmentJob, company/person queues) — context only: Listed so you know these other pipelines exist in the repo. They are not part of the read-only JSON contract in this document: no writes, no enqueue from the scraper. E.g. admin POST /api/v1/companies / people with enqueueEnrichmentJob: true (src/lib/api/enrichment-job-on-create.ts) and src/lib/enrichment/enrichment-job-http-service.ts. The read-only tool described here does not enqueue jobs.
Relationship-heavy vs patch-heavy (context): Video (and similar content) uses relationships as in this doc. Company / person workers use different typed payloads and apply patches elsewhere — background only for orientation.
Scraper / ingest goal (YouTube)
Goal: Given a YouTube URL (any common form: watch?v=, youtu.be/, /shorts/, live URLs), the tool normalizes the video id, fetches everything needed to populate or update a Cortex videos row and this JSON contract — not just the watch link.
Typical payload to collect:
| Source | Maps to / use |
|---|---|
| Stable watch URL + video id | videoUrl; dedupe key |
| Title, description | title; optional excerpt in pipeline; slug often derived from title |
| Thumbnails (high-res) | coverImage, thumbnail |
| Duration (seconds) | duration |
| Published time | publishedAt |
| Captions / transcript (when available) | transcript |
| Channel id, title | optional relationships or future fields; not a Cortex Video FK today |
Cortex-specific: classifiedAs and format still come from your taxonomies (video-kind, video-format) — align with real term rows and ids in Cortex; do not treat YouTube’s category string as a substitute unless you explicitly map it.
Implementation is open: How you fetch metadata (any API, scraping, transcripts, LLM extraction, etc.), how you load taxonomies, and whether you inspect existing graph edges is your pipeline’s choice. The contract below is the output JSON shape — not a required sequence of GETs or tools.
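One piece most pipelines share is normalizing the incoming URL forms listed above to a single video id for deduplication. A minimal sketch follows; extractYouTubeId is a hypothetical helper name, not something in the repo, and the set of handled URL forms is only the common ones named in this section.

```typescript
// Hypothetical helper: normalize the common YouTube URL forms
// (watch?v=, youtu.be/, /shorts/, /live/, /embed/) to the bare video id.
function extractYouTubeId(url: string): string | null {
  let u: URL;
  try {
    u = new URL(url);
  } catch {
    return null; // not a parseable URL
  }
  const host = u.hostname.replace(/^www\./, "");
  // Short form: youtu.be/<id>
  if (host === "youtu.be") return u.pathname.slice(1).split("/")[0] || null;
  if (host === "youtube.com" || host === "m.youtube.com") {
    // Standard form: /watch?v=<id>
    const v = u.searchParams.get("v");
    if (v) return v;
    // Path forms: /shorts/<id>, /live/<id>, /embed/<id>
    const m = u.pathname.match(/^\/(shorts|live|embed)\/([^/?]+)/);
    if (m) return m[2];
  }
  return null;
}
```

The resulting id doubles as the dedupe key from the payload table above; the stable watch URL can then be rebuilt from it for videoUrl.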
Shape (video — graph JSON contract)
This section is the canonical { video, relationships } shape. For company / person worker payloads, see Enrichment types (subjects) and the src/lib/enrichment/* modules.
- video (or your team’s root key for the subject document) — object with the Video fields your pipeline can fill. classifiedAs is the document database _id of a taxonomy term (a row in taxonomy-terms) belonging to taxonomy video-kind (structural kind: episode / clip / …). format is the slug of a taxonomy term under video-format (presentation: full / clip / …). Optional type is a free string. Also: title, slug, episodeNumber, publishedAt, videoUrl, duration, transcript, coverImage, thumbnail, published, … See videos in Collections and fields and src/models/Video.ts. Omit keys you do not have.
- relationships — array of edges from this video to other entities. Each item has:
  - type — graph edge / relationship name (e.g. SOURCED_FROM, MENTIONS, CLASSIFIED_AS, COVERS). Align with EDGE_TYPES in src/graph/schema.ts when in doubt.
  - document — target reference: type (API collection key) and slug (canonical slug for that target — see Scraper responsibility below). Optionally mongoId: a 24-char hex string from a GET response when the row already exists in Cortex (same value as document _id and graph node mongoId; use for targetId / sourceId in POST /api/v1/graph/relationships). Omit mongoId when the target does not exist yet (the importer creates by slug first).
  - properties — optional key/value metadata on the edge (counts, context, URLs, timestamps, etc.).
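The shape above can be sketched as TypeScript types with a minimal pre-emit check. These type names (EnrichmentOutput, Relationship, RelationshipDocument) and the validateOutput helper are illustrative, not repo exports; the checks mirror only what this contract states (type + slug required, mongoId is 24-char hex when present).

```typescript
// Illustrative types for the { video, relationships } contract; names are ours.
interface RelationshipDocument {
  type: string;     // API collection key, e.g. "people"
  slug: string;     // canonical slug in that collection
  mongoId?: string; // 24-char hex, only when resolved via GET
}

interface Relationship {
  type: string; // edge name, e.g. "MENTIONS" (align with EDGE_TYPES)
  document: RelationshipDocument;
  properties?: Record<string, unknown>;
}

interface EnrichmentOutput {
  video: Record<string, unknown>; // Video fields the pipeline could fill
  relationships: Relationship[];
}

// Minimal structural check before emitting the JSON.
function validateOutput(out: EnrichmentOutput): string[] {
  const errors: string[] = [];
  if (!out.video || typeof out.video !== "object") errors.push("video must be an object");
  for (const [i, rel] of out.relationships.entries()) {
    if (!rel.type) errors.push(`relationships[${i}]: missing type`);
    if (!rel.document?.type || !rel.document?.slug)
      errors.push(`relationships[${i}]: document needs type and slug`);
    if (rel.document?.mongoId && !/^[0-9a-f]{24}$/.test(rel.document.mongoId))
      errors.push(`relationships[${i}]: mongoId must be 24-char hex`);
  }
  return errors;
}
```

A tool can run such a check as its last step before writing stdout, so malformed output never reaches the importer.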
Scraper responsibility: slugs and mongoId
The scraper / enrichment tool is responsible for figuring out the right relationship (who to link) and the correct canonical slug per collection so Cortex lookups are unambiguous (e.g. disambiguate taxonomy-terms by taxonomy, pick the right playlist vs similarly named rows). It reads only (GET): after GET /api/v1/{collection}?slug=… (or list + filter) returns a document, include mongoId on that relationship’s document copied from the response’s _id. slug remains the stable human key; mongoId is proof the target was resolved in Cortex and lets the importer skip a second slug→id lookup when calling the graph API (which uses document-store ids, not slugs — see src/app/api/v1/graph/relationships/route.ts).
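A sketch of that resolve step, assuming the list-route response shape used elsewhere in this document ({ data: [ { _id, slug, … } ] }). The helper name resolveMongoId is ours, and the HTTP call is injected so the logic stays read-only and testable; the exact-slug filter guards against the substring matching that list routes perform.

```typescript
// Hypothetical helper: resolve a target's mongoId via GET /api/v1/{collection}?slug=…
// The fetcher is injected (get) so the tool stays read-only and easy to stub.
type Getter = (path: string) => Promise<{ data: Array<{ _id: string; slug: string }> }>;

async function resolveMongoId(
  get: Getter,
  collection: string,
  slug: string
): Promise<string | undefined> {
  // List routes match slug as a substring, so filter for the exact canonical slug.
  const res = await get(`/api/v1/${collection}?slug=${encodeURIComponent(slug)}&limit=10`);
  const hit = res.data.find((d) => d.slug === slug);
  return hit?._id; // undefined = not in Cortex yet → omit mongoId in the output JSON
}
```

When this returns undefined, the relationship is emitted with type + slug only, which is exactly the "create me downstream" signal described later in this document.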
Data to include in each relationships[] item
| Field | Required | What to put |
|---|---|---|
| type | Yes | Relationship name from EDGE_TYPES in src/graph/schema.ts (e.g. SOURCED_FROM, MENTIONS, CLASSIFIED_AS). |
| document | Yes | type, slug (required). mongoId (optional) — set when GET returned an existing row; 24-char hex. Tokens use slug on the token document, not symbol. |
| properties | No | Any JSON-safe object (strings, numbers, booleans, nested objects). Stored on the graph relationship when integrators call POST /api/v1/graph/relationships. Omit or use {} if you have no edge metadata. |
Suggested properties by edge type (conventions for video / content enrichment — all optional; extend as needed):
| type | Typical document.type | Useful properties keys |
|---|---|---|
| SOURCED_FROM | data-sources | originalUrl, scrapedAt (ISO 8601), optional title / referrer |
| MENTIONS | companies, people, tokens, blockchains, investors, products, … | context (e.g. sentiment or theme), count (approx. mentions), confidence (0–1), snippet (short quote from source) |
| CLASSIFIED_AS | taxonomy-terms | confidence (0–1), optional reason (short string) |
| COVERS | Graph-only Topic nodes are not Mongo documents — omit unless your team has an agreed pipeline | — |
| FEATURES | people | role (e.g. host, guest, quoted) |
| IN_PLAYLIST | playlists | position (order in playlist), optional addedAt |
| AUTHORED_BY | people | order (integer if multiple authors), role (e.g. byline) |
| REFERENCES | articles, videos, … (same graph-backed collections) | citation, url, accessedAt |
Use ISO 8601 for datetimes. Do not put secrets or PII you are not allowed to store in the graph.
Relationships: targets (enrichment JSON vs downstream writes)
A relationship applied in the graph database is only valid if the thing at the other end exists in Cortex (document store, then graph): when the importer runs, it cannot create an edge to a slug that does not resolve to a document.
Enrichment tool (this JSON only): Resolve each target to the right slug. When a row exists in Cortex, include document.mongoId (from GET). When it does not exist yet (e.g. a new guest), omit mongoId entirely — that is the signal: no id on the relationship means the downstream system should treat the target as to be created. Optional properties can carry hints (e.g. role, proposedDisplayName). The tool does not POST, enqueue jobs, or create EnrichmentJob rows — output JSON only.
Downstream importer / orchestrator (writes Cortex + graph):
- Resolve — If mongoId is present, use it for graph targetId / sourceId (optional check against slug). If mongoId is missing, treat the target as not in Cortex — no graph edge yet; the pipeline usually records a pending EnrichmentJob in enrichment-jobs so a worker can create the people / companies / … document and finish enrichment.
- Create if missing — POST /api/v1/{collection} (see Creating a document (POST)), often with enqueueEnrichmentJob where supported, then ensure graph nodes exist (bridge sync / upsert as you do today).
- Link — POST /api/v1/graph/relationships (or equivalent) only after both ends exist.
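The importer's first pass (resolve vs. create-first) can be sketched as a simple partition over the emitted relationships. The names here (partitionRelationships, EdgeHint) are illustrative, not repo code; the only rule encoded is the one above — a present mongoId means both ends exist, a missing one means create the document first.

```typescript
// Sketch of the importer's first pass: split edge hints by whether the target
// was already resolved (mongoId present) or still needs a document created.
interface EdgeHint {
  type: string;
  document: { type: string; slug: string; mongoId?: string };
  properties?: Record<string, unknown>;
}

function partitionRelationships(rels: EdgeHint[]): {
  linkable: EdgeHint[]; // both ends exist → straight to POST …/graph/relationships
  toCreate: EdgeHint[]; // no mongoId → create the document (and usually an EnrichmentJob) first
} {
  const linkable: EdgeHint[] = [];
  const toCreate: EdgeHint[] = [];
  for (const rel of rels) {
    (rel.document.mongoId ? linkable : toCreate).push(rel);
  }
  return { linkable, toCreate };
}
```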
Collection keys (use as relationships[].document.type):
| Collection key | Graph label (when synced) |
|---|---|
| companies | Company |
| people | Person |
| tokens | Asset |
| blockchains | Blockchain |
| investors | Investor |
| products | Product |
| data-sources | DataSource |
| taxonomy-terms | TaxonomyTerm |
| videos | Video |
| playlists | Playlist |
| shows | Show |
| articles | Episode (article) |
Topic and a few other graph labels are not in this list; do not emit relationships to them in this JSON unless your team has a separate, agreed process.
Taxonomies and document types (appliesTo)
Each Taxonomy document has appliesTo: an array of tags naming which document / entity types that scheme is meant to classify. Admins set it when creating or editing a taxonomy (see taxonomies in Collections and fields). Cortex does not enforce appliesTo in API validators today — it is a contract for humans and tools so you pick the right terms (e.g. only use “content theme” terms for articles, “token category” terms for tokens).
Seed examples (scripts/seed-mongo.ts): Token Category → appliesTo: ['token']; Company Sector → ['company', 'blockchain']; Content Theme → ['article', 'event', 'investor'].
Suggested tag → collection alignment (use the same vocabulary your taxonomies use):
| appliesTo tag | Typical collection key |
|---|---|
| article | articles |
| company | companies |
| token | tokens |
| blockchain | blockchains |
| event | events |
| investor | investors |
| video | videos |
| person | people |
Tooling: GET /api/v1/taxonomies returns each taxonomy’s appliesTo. Filter taxonomies (and list terms via GET /api/v1/taxonomy-terms with populated taxonomy) so CLASSIFIED_AS edges only reference taxonomy-terms whose Taxonomy document’s appliesTo includes the content type you are enriching (e.g. tag video → use terms from taxonomies that list video in appliesTo).
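That filter is easy to get wrong by matching term slugs alone, so here is a sketch of the appliesTo check as a pure function. It assumes terms come back with taxonomy populated, as the tooling note above describes; the names (Term, termsApplicableTo) are ours.

```typescript
// Sketch: keep only terms whose parent taxonomy declares the content type in
// appliesTo. Assumes GET /api/v1/taxonomy-terms returned terms with taxonomy populated.
interface Term {
  _id: string;
  slug: string;
  taxonomy: { slug: string; appliesTo: string[] };
}

function termsApplicableTo(terms: Term[], contentType: string): Term[] {
  return terms.filter((t) => t.taxonomy.appliesTo.includes(contentType));
}
```

Running CLASSIFIED_AS candidates through such a filter keeps a video pipeline from picking, say, a token-category term that happens to share a slug with a content theme.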
Admin dropdowns (no hardcoded enums): Video.classifiedAs stores the Mongo _id of a taxonomy-terms row under taxonomy video-kind (structured form: taxonomyTermRef). Video.format and Article.format still use term slugs from video-format / article-format (taxonomyTermSlug). taxonomy-terms documents use taxonomyRef in the admin form for the taxonomy field (Mongo id of the Taxonomy); there is no term-to-term parent — terms are flat within each taxonomy. Bridge sync writes (Video)-[:CLASSIFIED_AS { aspect: 'videoKind' }]->(TaxonomyTerm) from classifiedAs, and (Episode)-[:CLASSIFIED_AS { aspect: 'articleFormat' }]->(TaxonomyTerm) from Article.format. See SCHEMA.md.
Detailed read flow (real example)
This is one concrete way to fill { video, relationships }: load taxonomies first, then resolve entities by slug, then playlist + show context. Replace BASE with your app origin (e.g. http://localhost:3000). Use x-api-key or session auth as required for /api/v1.
1. Taxonomies (video-kind, video-format, content theme)
| Step | Request | Use the response for |
|---|---|---|
| 1a | GET BASE/api/v1/taxonomies?slug=video-kind | Taxonomy _id |
| 1b | GET BASE/api/v1/taxonomy-terms?taxonomy={videoKindId}&limit=100 | Pick the video kind term → video.classifiedAs = that term’s _id |
| 1c | GET BASE/api/v1/taxonomies?slug=video-format | Taxonomy _id |
| 1d | GET BASE/api/v1/taxonomy-terms?taxonomy={videoFormatId}&limit=100 | Pick format → video.format = that term’s slug (e.g. full) |
| 1e | (optional) GET BASE/api/v1/taxonomies then terms for your content theme taxonomy | CLASSIFIED_AS → relationships[].document for taxonomy-terms (slug + mongoId from data[]._id) |
2. People, companies, tokens, blockchains
For each slug you want in the graph, resolve the document and copy data[0]._id (or the single hit) into document.mongoId.
| Request | Typical edge |
|---|---|
| GET BASE/api/v1/people?slug=jeremy-allaire&limit=1 | FEATURES |
| GET BASE/api/v1/companies?slug=circle&limit=1 | MENTIONS |
| GET BASE/api/v1/tokens?slug=usdc&limit=1 | MENTIONS |
| GET BASE/api/v1/blockchains?slug=ethereum&limit=1 | MENTIONS |
List routes often match slug / name as substring — use limit=1 when you expect a single canonical row.
3. Playlist and show
| Step | Request | Use for |
|---|---|---|
| 3a | GET BASE/api/v1/playlists?slug=stablecoin-weekly&limit=1 | IN_PLAYLIST → document.mongoId = playlist _id |
| 3b | GET BASE/api/v1/graph/relationships?mongoId={playlistMongoId} | Find FOR_SHOW → target Show mongoId (playlist → show is a graph edge; useful for copy, validation, or UI even if this JSON does not emit a Show relationship on the video) |
| 3c | (alternative) GET BASE/api/v1/shows?slug={showSlug}&limit=1 | Resolve Show directly if you already know the slug from CMS / site |
4. Emit one JSON object
Merge into video + relationships as below. Rule of thumb: every mongoId you attach came from a GET response _id for that collection (or from graph targetId when following edges).
Example
The Detailed read flow section walks through GET taxonomies, people / companies / tokens / blockchains, playlist, show (via graph or shows), then this JSON.
Episode-style video with the two fields that resolve as taxonomy terms (Mongo collection taxonomy-terms) in src/models/Video.ts:
- classifiedAs — Required. Mongo _id of the taxonomy term for video kind — i.e. the term row under taxonomy video-kind whose slug is e.g. episode, clip, or other (see seed data). This field is a ref to a TaxonomyTerm document, not an arbitrary string.
- format — Slug of the taxonomy term for video format — i.e. the term under taxonomy video-format (e.g. full, clip). Stored as a slug string on Video, but it must match a real taxonomy term in that vocabulary; not a MIME type.
- type — Optional free string on Video.type (not a taxonomy term unless you add a separate convention).
Plus graph hints: FEATURES (people on camera), MENTIONS (companies, tokens, blockchains, …), IN_PLAYLIST, CLASSIFIED_AS (e.g. content theme).
{
"video": {
"_id": "665a1b2c3d4e5f6a7b8c9d0e",
"title": "Inside USDC Reserves — Q1 2025 with Circle",
"slug": "inside-usdc-reserves-q1-2025-circle",
"classifiedAs": "674a1b2c3d4e5f6a7b8c9d0f",
"episodeNumber": 42,
"publishedAt": "2025-04-15T18:00:00.000Z",
"format": "full",
"type": "interview",
"videoUrl": "https://cdn.example.com/episodes/inside-usdc-reserves-q1-2025.m3u8",
"duration": 1842,
"coverImage": "https://cdn.example.com/covers/usdc-reserves-ep42.jpg",
"thumbnail": "https://cdn.example.com/thumbs/usdc-reserves-ep42.webp",
"transcript": null,
"published": true
},
"relationships": [
{
"type": "FEATURES",
"document": {
"type": "people",
"slug": "jeremy-allaire",
"mongoId": "507f1f77bcf86cd799439011"
},
"properties": { "role": "host" }
},
{
"type": "MENTIONS",
"document": {
"type": "companies",
"slug": "circle",
"mongoId": "507f1f77bcf86cd799439012"
},
"properties": { "context": "primary", "snippet": "discussion of reserve composition" }
},
{
"type": "MENTIONS",
"document": {
"type": "tokens",
"slug": "usdc",
"mongoId": "507f1f77bcf86cd799439013"
},
"properties": { "context": "analytical", "count": 24 }
},
{
"type": "MENTIONS",
"document": {
"type": "blockchains",
"slug": "ethereum",
"mongoId": "507f1f77bcf86cd799439014"
},
"properties": { "context": "background", "snippet": "EVM settlement" }
},
{
"type": "IN_PLAYLIST",
"document": {
"type": "playlists",
"slug": "stablecoin-weekly",
"mongoId": "674a1b2c3d4e5f6a7b8c9d1a"
},
"properties": { "position": 7 }
},
{
"type": "CLASSIFIED_AS",
"document": {
"type": "taxonomy-terms",
"slug": "deep-dive",
"mongoId": "507f1f77bcf86cd799439015"
}
}
]
}
Resolve both as taxonomy terms from live data: use GET /api/v1/taxonomies?slug=video-kind and slug=video-format to get each taxonomy’s _id, then GET /api/v1/taxonomy-terms?taxonomy={id} so you pick the taxonomy-terms row for kind (classifiedAs = that term’s _id) and the taxonomy-terms row for format (format = that term’s slug). For each relationships[].document, after GET resolves the target, copy _id into mongoId as in the example above.
The importer creates missing targets with POST /api/v1/{collection} when needed; taxonomy-terms rows require taxonomy — the Mongo _id of the Taxonomy document. Playlists / shows links in the graph may also follow Playlist → FOR_SHOW → Show outside this JSON — see SCHEMA.md.
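The classifiedAs / format derivation described above reduces to a small pure function once the term rows for each taxonomy are fetched. A sketch, with illustrative names (TermRow, videoTaxonomyFields) and the two-list input matching the read flow in this document:

```typescript
// Sketch: given the term rows fetched for taxonomies video-kind and video-format,
// derive the two taxonomy-backed Video fields. Names are illustrative.
interface TermRow {
  _id: string;
  slug: string;
}

function videoTaxonomyFields(
  kindTerms: TermRow[],   // terms under taxonomy video-kind
  formatTerms: TermRow[], // terms under taxonomy video-format
  kindSlug: string,       // e.g. "episode"
  formatSlug: string      // e.g. "full"
): { classifiedAs?: string; format?: string } {
  return {
    classifiedAs: kindTerms.find((t) => t.slug === kindSlug)?._id, // Mongo _id of the term
    format: formatTerms.find((t) => t.slug === formatSlug)?.slug,  // slug string, not an id
  };
}
```

An undefined result on either field means the vocabulary lookup failed, which a pipeline should surface rather than invent a value.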
Example: guest not in Cortex yet (output only)
There is no people row for this guest. The tool still emits type + slug on document and does not put mongoId on that relationship — no id is the whole signal.
Example output fragment (inside relationships[]):
{
"type": "FEATURES",
"document": {
"type": "people",
"slug": "jane-doe-guest-analyst"
},
"properties": {
"role": "guest",
"proposedDisplayName": "Jane Doe"
}
}
Downstream: Ingestion sees document without mongoId, treats it as create this target, and records a pending EnrichmentJob (e.g. contentType: person, payload with slug + hints) so a worker can POST the people document and later attach graph edges. The read-only tool never creates that job — only the pipeline that consumes the JSON does.
Rule: The enrichment tool never invents taxonomy classifiedAs / format values without aligning to Cortex vocabulary; free-text hints in properties for people/companies you are about to create are OK.
Helpers: API reference (read vs write)
Enrichment tools use GET (and public read routes) below to validate vocabulary and optionally check whether targets exist. POST / PATCH / DELETE and POST …/graph/relationships are for importers and other services, not for the read-only JSON emitter.
Replace BASE with your app origin (e.g. http://localhost:3000). Authenticated /api/v1 routes accept a browser session or x-api-key: <CORTEX_API_KEY> (see README.md). Public routes require x-api-key only.
Creating a document (POST) — importer only
POST /api/v1/{collection} — send a JSON object whose keys match the collection schema (see Collections and fields). Response: 201 Created with the new document (including assigned _id).
| Requirement | Notes |
|---|---|
| Headers | Content-Type: application/json; auth as for other v1 routes (x-api-key or session). |
| ObjectId fields | Refs (location, taxonomy, company, token, …) are 24-character hex strings. |
| Dates | Prefer ISO 8601 strings (e.g. "2025-04-15T14:30:00.000Z"). |
| Errors | 400 if validation fails (missing required field, duplicate slug, invalid type); body { "error": "…" }. 404 if {collection} is not in modelMap. |
Optional — enqueue enrichment after create: include "enqueueEnrichmentJob": true in the JSON only for POST /api/v1/companies or POST /api/v1/people. That flag is not saved on the document; it only creates a pending EnrichmentJob for the new row. For all other collections, omit it.
Updates and deletes: PATCH /api/v1/{collection}/{id} — merge fields (graph-backed collections may sync the vertex in the graph database after save). DELETE /api/v1/{collection}/{id} — remove the document.
Public API (/api/public/v1/...) does not expose POST for creates; use /api/v1/... with a key or session.
Example (data source):
curl -sS -X POST "$BASE/api/v1/data-sources" \
-H "Content-Type: application/json" \
-H "x-api-key: $CORTEX_API_KEY" \
-d '{
"name": "CoinDesk",
"slug": "coindesk",
"sourceType": "news",
"baseUrl": "https://www.coindesk.com",
"isActive": true
}'
Implementation: src/app/api/v1/[collection]/route.ts.
List + filter (internal, full data)
GET /api/v1/{collection} — paginated list; pass query params to narrow results.
| Param | Behavior |
|---|---|
| name, slug, title | Case-insensitive substring match (regex), good for “find slug containing …” |
| Other fields (e.g. symbol, taxonomy, isActive) | Exact match on that field |
| page, limit (≤ 100), sort | Pagination and sort (default -createdAt) |
Examples:
# Company by slug fragment
curl -sS "$BASE/api/v1/companies?slug=circle&limit=5" -H "x-api-key: $CORTEX_API_KEY"
# Token by slug (tokens collection = Asset in graph)
curl -sS "$BASE/api/v1/tokens?slug=usdc&limit=5" -H "x-api-key: $CORTEX_API_KEY"
# Taxonomy term by slug (narrow further if the same slug can exist in two taxonomies)
curl -sS "$BASE/api/v1/taxonomy-terms?slug=market-analysis&limit=10" -H "x-api-key: $CORTEX_API_KEY"
# Data sources
curl -sS "$BASE/api/v1/data-sources?slug=coindesk&limit=5" -H "x-api-key: $CORTEX_API_KEY"
Implementation: src/app/api/v1/[collection]/route.ts. Collections available are those in src/lib/api/model-map.ts (companies, people, tokens, articles, taxonomies, taxonomy-terms, data-sources, …).
One document by Mongo _id
GET /api/v1/{collection}/{id} — returns one document; population of refs matches populateMap (e.g. taxonomy terms include taxonomy). Use after you have an id from a list response.
One document by slug (published only)
GET /api/public/v1/{collection}/{slug} — published: true only. Same x-api-key as other public routes. Handy for read-only checks against live content; not suitable if drafts must be linked.
Cross-collection search (published, text / autocomplete)
GET /api/public/v1/search?q=...&collections=companies,people&limit=10
- mode=autocomplete — prefix match on name (regex ^q), published docs only.
- Default — $text search where indexes exist; collections is comma-separated and must be keys in modelMap.
Requires q. See src/app/api/public/v1/search/route.ts.
Taxonomies and terms (vocabulary)
Required context for any enrichment that emits taxonomy-backed fields or CLASSIFIED_AS edges. Do not ship a tool that guesses term slugs without calling these.
- GET /api/v1/taxonomies?limit=100 (paginate if needed) — each taxonomy’s appliesTo (which document types it applies to), slug, name, _id; see Taxonomies and document types.
- GET /api/v1/taxonomy-terms?limit=100 (paginate; filter with ?taxonomy=<taxonomyMongoId> when you know the vocabulary) — terms with populated taxonomy so you can filter by taxonomy slug or id and build a slug → _id map that matches production.
Graph: what is already linked?
Required when you add or update relationships for graph-backed collections (companies, people, videos, articles, tokens, taxonomy-terms, … — see the Graph edges by collection table under Relationships). For each relevant Mongo _id, call:
GET /api/v1/graph/relationships?mongoId={mongoId} — returns existing relationships for that node in the graph (if synced). Useful when you want to align with bridge sync or deduplicate edges; optional for your pipeline. See src/app/api/v1/graph/relationships/route.ts.
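An importer using that endpoint for dedup might drop hints whose (edge type, target id) pair already exists on the node. The sketch below is an assumption-heavy illustration: the ExistingEdge shape is invented for the example (adapt it to whatever the route actually returns in your deployment), and dedupeHints is our name.

```typescript
// Sketch of optional dedup: drop emitted hints whose (type, target mongoId) pair
// already exists on the node. ExistingEdge is an assumed shape, not the real
// response type of GET /api/v1/graph/relationships — adapt before use.
interface ExistingEdge {
  type: string;
  targetMongoId: string;
}
interface Hint {
  type: string;
  document: { type: string; slug: string; mongoId?: string };
}

function dedupeHints(hints: Hint[], existing: ExistingEdge[]): Hint[] {
  const seen = new Set(existing.map((e) => `${e.type}:${e.targetMongoId}`));
  // Hints without mongoId cannot match an existing edge, so they always pass through.
  return hints.filter(
    (h) => !h.document.mongoId || !seen.has(`${h.type}:${h.document.mongoId}`)
  );
}
```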
Admin UI
/admin/data/{collection} — browse and search lists in the browser when you prefer not to script.
Scripts and direct DB access
Repo scripts under scripts/ use MONGODB_URI with Mongoose; you can run one-off findOne({ slug }) / findById in a script for bulk or offline resolution. Same data as the API, no HTTP.
Collections and fields (API keys)
Use these kebab-case names in /api/v1/{collection} and admin URLs. Fields below match the structured admin forms (src/lib/admin/collection-form-fields.ts) plus a few model-only fields you may set via JSON in admin or API.
| Collection | Fields |
|---|---|
| companies | name, slug, tagline, description, yearFounded, logo, icon, brandColor, websiteUrl, employeeCount, fundingStage, totalFundingUsd, legalName, entityType, registrationNumber, countryOfIncorporation, location, socialLinks, publicId, verified, published, featured |
| people | name, slug, avatar, bio, location, socialLinks, publicId, published, contributor |
| products | company, name, slug, description, productType, productStatus, launchDate, websiteUrl, docsUrl, sourceCodeUrl, isMainProduct, isOpenSource, published |
| tokens | symbol, name, slug, tokenType, isStablecoin, description, logo, coingeckoId, cmcId, defillamaId, socialLinks, verified, published — plus model metadata (mixed object; JSON editor) |
| blockchains | name, slug, chainId, chainType, vmType, consensusMechanism, description, logo, explorerUrl, launchDate, socialLinks, published |
| investors | name, slug, investorType, stages, description, logo, aumUsd, portfolioCount, websiteUrl, location, socialLinks, publicId, published |
| locations | city, country, latitude, longitude, timezone |
| articles | title, slug, subheadline, format, externalSourceUrl, externalSourceName, publishedAt, coverImage, publicId, published, featured — plus model content (mixed; JSON editor) |
| videos | title, slug, classifiedAs, episodeNumber, publishedAt, format, parentVideo, clipStartTime, clipEndTime, videoUrl, type, coverImage, thumbnail, duration, transcript, published |
| shows | name, slug, summary, description, coverImage, socialLinks, published |
| playlists | name, slug, summary, description, coverImage, externalUrl, published |
| events | title, slug, startDate, endDate, location, venue, summary, description, websiteUrl, registrationUrl, playlistUrl, timezone, publicId, coverImage, published |
| stablecoin-profiles | token, pegTargetPrice, launchDate, auditFrequency, yieldSource, mintMinimumUsd, redeemMinimumUsd, redemptionTime, feeMintPct, feeRedeemPct, reserveRatio, riskScore, riskScoreRationale, whitepaperUrl |
| taxonomies | name, slug, description, appliesTo |
| taxonomy-terms | taxonomy, name, slug, description, color, displayOrder, isActive |
| data-sources | name, slug, sourceType, baseUrl, trustLevel, refreshFrequency, isActive, notes |
| enrichment-jobs | contentType, contentId, status, extractedAt, entityCount, errorLog, retryCount — plus model jobPayload (mixed; audit snapshot) |
All of the above collections use Mongoose createdAt / updatedAt timestamps except where the physical collection name differs (e.g. enrichment jobs store in extractionjobs). Ref fields (location, company, …) are Mongo ObjectIds (hex strings in APIs).
Relationships
A) Mongo document references (foreign keys)
These are stored on the document in MongoDB (not in the graph as the primary store). Resolve ids before writing.
| From collection | Field | To collection |
|---|---|---|
| products | company | companies (required) |
| videos | classifiedAs | taxonomy-terms (required; taxonomy video-kind) |
| videos | parentVideo | videos (optional; clip → parent) |
| stablecoin-profiles | token | tokens (required) |
| taxonomy-terms | taxonomy | taxonomies (required) |
| companies | location | locations (optional) |
| investors | location | locations (optional) |
| people | location | locations (optional) |
| events | location | locations (optional) |
| enrichment-jobs | contentId | polymorphic — contentType names the model (e.g. article, video, company) |
Standalone (no outgoing FK in this schema): articles, blockchains, data sources, shows, playlists, taxonomies, tokens (except being referenced), and others not listed above. Show ↔ playlist ↔ episode wiring is graph-only (see below).
B) Graph edges by collection (graph-backed rows)
Only collections in GRAPH_BACKED_COLLECTIONS (src/lib/graph/mongo-neo4j-mapping.ts) have nodes with mongoId suitable for POST /api/v1/graph/relationships. Allowed relationship type names are exactly EDGE_TYPES in src/graph/schema.ts (not every type applies to every pair).
| Collection | Graph label | Typical outgoing (from this node) | Typical incoming (to this node) |
|---|---|---|---|
| articles | Episode (sourceType: article) | SOURCED_FROM → DataSource; COVERS → Topic; MENTIONS → Company / Person / Asset / …; CLASSIFIED_AS → TaxonomyTerm; FEATURES → Person; AUTHORED_BY → Person; knowledge edges per pipeline (REFERENCES, …) | — |
| videos | Video | CLASSIFIED_AS → TaxonomyTerm (from classifiedAs; aspect: 'videoKind'); CLIP_OF → Video (clips); IN_PLAYLIST → Playlist; MENTIONS, FEATURES, … | CLIP_OF ← Video |
| playlists | Playlist | FOR_SHOW → Show | IN_PLAYLIST ← Video |
| shows | Show | — | FOR_SHOW ← Playlist |
| companies | Company | CLASSIFIED_AS → TaxonomyTerm; OPERATES → Blockchain | INVESTED_IN ← Investor; WORKS_AT / FOUNDED ← Person; MENTIONS ← Episode/Video/…; AUDITED_BY ← Asset (curated bridge); ISSUED_BY ← Asset (via product) |
| people | Person | WORKS_AT / AFFILIATED_WITH → Company; FOUNDED → Company; MENTIONS, extraction edges (e.g. to Claim) | FEATURES ← Episode; AUTHORED_BY ← Episode |
| tokens | Asset | CLASSIFIED_AS → TaxonomyTerm; PRODUCT_OF → Product; ISSUED_BY → Company (bridge); NATIVE_TOKEN → Blockchain (curated) | MENTIONS ← …; AUDITED_BY → Company |
| products | Product | — | PRODUCT_OF ← Asset; company link in Mongo drives issuer edges in bridge sync |
| blockchains | Blockchain | CLASSIFIED_AS → TaxonomyTerm; NATIVE_TOKEN ← Asset | OPERATES ← Company |
| investors | Investor | INVESTED_IN → Company | — |
| events | Event | ORGANIZED_BY, INVOLVES, … (per pipeline) | — |
| data-sources | DataSource | — | SOURCED_FROM ← Episode (article) / Claim / … |
| taxonomy-terms | TaxonomyTerm | — | CLASSIFIED_AS ← Company, Asset, Blockchain, … |
Not graph-backed as primary nodes (no mongoId vertex for these in the same way): locations, stablecoin-profiles, taxonomies (Taxonomy has no bridge node; only taxonomy-terms sync to the graph), enrichment-jobs.
Topic (COVERS, etc.) is graph-only (no Mongo collection); see SCHEMA.md.
Bridge / seed edges (curated maps, not Mongo FKs): src/graph/bridge-maps.ts and src/graph/bridgeSync.ts — e.g. TOKEN_PRODUCT_SLUG, OPERATES_EDGES, PERSON_WORKS_AT_EDGES, INVESTED_IN_EDGES, NATIVE_TOKEN_EDGES, AUDITED_BY_EDGES.
Checklist
- Output must be valid JSON (parseable; use UTF-8).
- The enrichment tool does not write to Cortex’s document or graph stores — only emits JSON (stdout, file, or HTTP body).
- Prefer ISO 8601 strings for datetimes.
- If there are no relationships, use "relationships": [].
- Taxonomy-backed fields (classifiedAs, format, etc.) should match Cortex’s controlled vocabulary (how you discover valid ids / slugs is up to you).
- Relationship targets: correct slug per collection (scraper’s job); include document.mongoId when GET returned an existing row.
- relationships may omit mongoId when the target does not exist yet — downstream typically records an EnrichmentJob and creates the row, then adds edges (see Relationships: targets).
- Importer (separate step): before graph edges, every relationships[].document must resolve in Mongo (create via POST if needed).
- Importer: may query graph/relationships when merging edges to avoid duplicates (optional dedup strategy).
- Deliverable from the tool: that JSON object; nothing else is required from an API write perspective.
For data model details, see SCHEMA.md.