Technical design · KYAS · v2

A pipeline that turns crawled web pages into verified PDBI entries.

Five stateless stages joined by durable queues. A page is crawled, distilled into one canonical record, encoded into a vector for duplicate-checking, judged by three scoped AI agents, then written into PDBI in its native CMS shape. This document traces that path end-to-end and details the two components that carry the most weight: the crawler and the encoder.

Go workers · NATS JetStream · Postgres + pgvector · object storage · Playwright (stealth) · single repo

Architecture

Stateless workers, durable queues, shared stores.

Each stage is its own worker pool. They communicate only through NATS streams that carry record IDs, never payloads — the records live in Postgres and large blobs live in object storage. This is what lets every stage scale, restart, and be tested in isolation.

Pipeline: worker pools + queues (A → E)

crawl-worker → crawl.out → extract-worker → candidate.new → encode + search → candidate.encoded → decision-worker → decision.accept → publisher → PDBI / shadow
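As a minimal sketch of the seam between two stages, assuming Go channels stand in for the NATS streams and an in-memory map stands in for Postgres (both are substitutions for illustration, and `Store`, `Candidate`, and `RunEncodeStage` are hypothetical names), a stage worker that receives and forwards IDs only might look like this:

```go
package main

import (
	"fmt"
	"sync"
)

// Candidate is a stand-in for the record kept in Postgres.
type Candidate struct {
	ID      string
	Entity  string
	Encoded bool
}

// Store is a hypothetical in-memory substitute for the shared Postgres tables.
type Store struct {
	mu   sync.Mutex
	rows map[string]Candidate
}

func NewStore() *Store { return &Store{rows: make(map[string]Candidate)} }

func (s *Store) Put(c Candidate) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.rows[c.ID] = c
}

func (s *Store) Get(id string) (Candidate, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	c, ok := s.rows[id]
	return c, ok
}

// RunEncodeStage consumes candidate IDs, loads each record from the store,
// marks it encoded, and forwards only the ID to the next queue.
func RunEncodeStage(in <-chan string, out chan<- string, s *Store) {
	defer close(out)
	for id := range in {
		c, ok := s.Get(id)
		if !ok {
			continue // unknown ID: nothing to load, so nothing is forwarded
		}
		if !c.Encoded { // a redelivered ID does no extra work
			c.Encoded = true
			s.Put(c)
		}
		out <- id
	}
}

func main() {
	s := NewStore()
	s.Put(Candidate{ID: "c1", Entity: "Songket Palembang"})

	in := make(chan string, 2)
	out := make(chan string, 2)
	in <- "c1"
	in <- "c1" // simulated redelivery of the same message
	close(in)

	RunEncodeStage(in, out, s)
	for id := range out {
		c, _ := s.Get(id)
		fmt.Println(id, c.Entity, c.Encoded)
	}
}
```

Because the message is only an ID, a redelivered message costs one lookup and the worker can be restarted without losing record state.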

State: Postgres, pgvector, object storage

frontier: URL queue + crawl metadata
candidates: distilled records (PG)
vector index: pgvector, partitioned by shape
decisions: verdicts + trace refs
entries: published / shadow + pdbi_id
pdbi mirror: 58,413 entries · dedup baseline
object storage: bodies (transient) · docs/images · traces
audit / labels: samples + human verdicts

Shared services

browser-fetch (Playwright sidecar) · resilient provider pool (search / fetch / AI) · API + dashboard (decision stream, quality review)

Seam rules

Queues carry IDs only. A NATS message is a `candidate_id`; the worker loads the record from Postgres / object storage.
Idempotency keys. crawl = normalized URL · processing = `candidate_id = hash(url, unit_index)` · dedup = `canonical_entity_key` (decided by D, never by the id).
Body cache until verdict. The fetched page is cached in object storage until D rules, then garbage-collected. D verifies against the cache — no mid-pipeline re-fetch.
Deterministic orchestration. The flow is fixed code. Agentic AI runs only inside stage D; the classifier in A, the extraction call in C, and the drafting in E are single scoped model calls. No model ever decides the flow itself.
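The two identities in the idempotency rule can be sketched as follows. This is a sketch under assumptions, not the pipeline's actual code: the normalization rules, the choice of SHA-256, the key separator, the `example.org` URL, and the numeric province id are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/url"
	"strings"
)

// NormalizeURL lowercases scheme and host, drops the fragment and a trailing
// slash. The exact normalization rules are an assumption.
func NormalizeURL(raw string) (string, error) {
	u, err := url.Parse(strings.TrimSpace(raw))
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""
	u.Path = strings.TrimSuffix(u.Path, "/")
	return u.String(), nil
}

// CandidateID derives the processing key hash(url, unit_index).
// It identifies one unit of one page, not the subject the unit describes.
func CandidateID(rawURL string, unitIndex int) (string, error) {
	norm, err := NormalizeURL(rawURL)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s\x00%d", norm, unitIndex)))
	return hex.EncodeToString(sum[:16]), nil
}

// CanonicalEntityKey groups records about the same subject:
// norm(entity) + category + province. It is deliberately not unique.
func CanonicalEntityKey(entity string, categoryID, provinceID int) string {
	norm := strings.Join(strings.Fields(strings.ToLower(entity)), "-")
	return fmt.Sprintf("%s|%d|%d", norm, categoryID, provinceID)
}

func main() {
	id, err := CandidateID("https://Example.org/songket-palembang/#motif", 0)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("candidate_id:", id)
	// Category 4 is Motif Kain; province id 16 is a made-up placeholder.
	fmt.Println("canonical_entity_key:", CanonicalEntityKey("Songket Palembang", 4, 16))
}
```

Two pages about the same songket get different `candidate_id` values but the same `canonical_entity_key`, which is why the dedup decision belongs to D and not to the id.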

From a URL to one accepted entry

The full path of a single datum.

Worked example: the frontier hands out a textile-research URL and the system produces one accepted PDBI entry for Songket Palembang. The record at each step is shown verbatim.

A · crawl-worker

Fetch & cache

The frontier pops a URL. The page is fetched over HTTP (robots OK, polite delay); the body is cached to object storage, and the page text and outbound links are returned.

GET …/songket-palembang → 200, id, lang=id, links[37]
body → bodies/{candidate_id} (transient)

A · relevance gate

Cultural? Yes → keep

A cheap classifier scores the page as Indonesian cultural data. It passes to C and its links are enqueued at high priority. (A "no" would discard the body with zero storage.)

relevant = true (textile / songket) → enqueue 37 links

C · extract-worker

Distill to a Candidate

Entity resolved and canonicalized to Indonesian; category / province / element mapped; category-specific attributes pulled; each fact bound to its source quote; image URLs captured as refs (no download).

entity="Songket Palembang" cat=4 prov=Sumsel
attributes={motifs:[Lepus,Tabur,Tretes…], material, technique}
facts=[3] each → {source_url, quote}

B · encode + search

Vector + neighbors

The Candidate becomes a custom feature vector. Neighbor search runs inside the category×province partition of pgvector and returns structured match evidence — not an opaque score.

block=Motif Kain · Sumsel · has_motif_list
nearest=0.41 "Songket Lepus" — distinct motif set

D · decision-worker

Three scoped agents

Cultural-gate passes. Novelty reads the neighbors → net-new. Cross-validation finds the facts corroborated and uncontradicted → accept. A full trace is persisted.

verdict=net-new cross_val=accept conf=0.86
trace_id=… → visible in decision stream

E · publisher

Draft, image, publish

Composes the CMS payload (history + motif list as structured HTML); fetches the image, license-checks it, generates a data-derived caption, uploads it; resolves element/province ids; creates the entry.

action=create mode=shadow→live
field_description=<html: sejarah + Motif[]> field_file=berkas:…
POST /cms/entry/type/4/create → pdbi_id

E → B · close loop

Re-encode & sample

The accepted entry is re-encoded into the index (it now guards the next decision) and, being eligible for review, may be pulled into the audit sample. The body cache is freed.

reencoded=true audit_sampled=true bodies/{id} GC'd

One accepted datum. Seven steps, three agent calls (all in D), zero unsourced sentences. Any crawled page that fails the relevance gate, the cultural gate, or dedup is dropped before it reaches publish.

Component · the Crawl Engine (A)

A focused crawler, not a blind one.

Open-ended link-following would drown in non-cultural pages. Instead the frontier is a best-first queue ordered by predicted cultural relevance: relevant pages get their links followed first, so the crawl stays in cultural neighbourhoods of the web — in any language.

Mechanics

Frontier. URL queue with depth, parent, relevance-prior, language hint. Best-first, not BFS/DFS.
Two fetchers. Polite HTTP for static pages & PDFs; a stealth browser (Playwright) for JS / protected / social pages.
Relevance gate. A cheap classifier on every page — keep & expand, or discard with zero storage.
Wikipedia = reference harvester. Descend a page's References to reach the primary sources; the Wikipedia text itself is never kept.
Documents. A 2,000-page PDF is segmented into units and fanned out — one source can yield hundreds of candidates.
Re-encounter, not re-poll. Content-hash idempotency skips unchanged pages. Cultural data is static, so there is no freshness timer; new data comes from new pages → the enrich path.
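A best-first frontier of this kind can be sketched with Go's `container/heap`. The field names, the depth tie-break, and the URL-level seen-set are assumptions added for illustration; the source only states that the queue is ordered by predicted relevance.

```go
package main

import (
	"container/heap"
	"fmt"
)

// FrontierItem mirrors the fields listed above; the names are assumptions.
type FrontierItem struct {
	URL       string
	Depth     int
	Parent    string
	Relevance float64 // relevance prior in [0, 1]
	LangHint  string
}

// frontierHeap orders items by descending relevance, then by shallower depth.
type frontierHeap []FrontierItem

func (h frontierHeap) Len() int { return len(h) }
func (h frontierHeap) Less(i, j int) bool {
	if h[i].Relevance != h[j].Relevance {
		return h[i].Relevance > h[j].Relevance
	}
	return h[i].Depth < h[j].Depth
}
func (h frontierHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *frontierHeap) Push(x any)   { *h = append(*h, x.(FrontierItem)) }
func (h *frontierHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// Frontier is a best-first URL queue that ignores URLs it has already seen.
type Frontier struct {
	h    frontierHeap
	seen map[string]bool
}

func NewFrontier() *Frontier { return &Frontier{seen: make(map[string]bool)} }

// Add enqueues an item and reports false if the URL was already queued.
func (f *Frontier) Add(it FrontierItem) bool {
	if f.seen[it.URL] {
		return false
	}
	f.seen[it.URL] = true
	heap.Push(&f.h, it)
	return true
}

// Next pops the most relevant URL, or reports false when the queue is empty.
func (f *Frontier) Next() (FrontierItem, bool) {
	if f.h.Len() == 0 {
		return FrontierItem{}, false
	}
	return heap.Pop(&f.h).(FrontierItem), true
}

func main() {
	f := NewFrontier()
	f.Add(FrontierItem{URL: "https://example.org/news", Relevance: 0.2})
	f.Add(FrontierItem{URL: "https://example.org/songket", Relevance: 0.9, Depth: 2})
	f.Add(FrontierItem{URL: "https://example.org/batik", Relevance: 0.5, Depth: 1})
	for it, ok := f.Next(); ok; it, ok = f.Next() {
		fmt.Printf("%.1f %s\n", it.Relevance, it.URL)
	}
}
```

Depth only breaks ties here, so a deep but highly relevant link is still fetched before a shallow irrelevant one, which is the difference from BFS.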

Seeds & expansion

Seeds: Wikipedia category trees · research repositories · known cultural domains · PDBI coverage gaps
Expansion: DuckDuckGo API · SearXNG · Wikipedia search

Storage. Ephemeral by default: only distilled records, provenance, and the URL frontier persist. Raw bodies are dropped after the decision; only high-value primary documents (PDFs, papers) are retained for re-mining.

Component · the Encoder & Index (B)

A custom vector we control — not an LLM embedding.

The dedup question is "does PDBI already hold this data?" An LLM embedding clusters by vague meaning and hides that answer. Instead each record is encoded into a feature vector built from the PDBI schema, in two blocks.

Shape block · blocking + routing

category · province / region · element · has_recipe · has_motif_list · has_steps · num_images · length_bucket · num_sections

Low-cardinality. Used to narrow the search (the index is partitioned on it) and to route the draft template — never to decide a match.

Content block · the match decision

entity n-grams · region / ethnicity tokens · ingredients[] · motifs[] · occasion · actors

High-cardinality. Content decides dedup, so two soto dishes from one region read as distinct data. Category-conditional slots, zero-padded when N/A.

Matching is two-stage, not one blended score

Stage 1 · block

Shape narrows

Partition by category × province. The candidate is only compared against its own neighbourhood — and this is also the scale strategy (no full-index scan at 1M+).

Stage 2 · score

Content decides

Within the partition, content similarity yields structured evidence: "same category+region, 80% ingredient overlap, distinct name" → handed to D.

Why content dominates. Shape is low-cardinality: at 1M entries every food-from-Java record looks alike, so weighting shape would suppress genuinely new data in the densest categories. Over-weighting shape loses data; under-weighting it only costs CPU. So: content scores, shape blocks. The encoder is deterministic v0; a trained model is earned later from accumulated labels.
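The two-stage match can be illustrated with a small in-memory index. This is a sketch under stated assumptions: the real index lives in pgvector, and the Jaccard overlap over content tokens and character trigrams used here is a stand-in chosen for readability, not the encoder's actual similarity. All type and function names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Record carries the shape fields used for blocking and the content tokens
// (for example motifs[] or ingredients[]) used for scoring.
type Record struct {
	Name       string
	CategoryID int
	Province   string
	Tokens     []string
}

// Evidence is the structured result handed to D instead of one blended score.
type Evidence struct {
	Neighbor       string
	TokenOverlap   float64 // Jaccard overlap of content tokens
	NameSimilarity float64 // Jaccard overlap of character trigrams
}

type blockKey struct {
	CategoryID int
	Province   string
}

// Index partitions records by category x province (stage 1).
type Index struct{ partitions map[blockKey][]Record }

func NewIndex() *Index { return &Index{partitions: make(map[blockKey][]Record)} }

func (ix *Index) Add(r Record) {
	k := blockKey{r.CategoryID, r.Province}
	ix.partitions[k] = append(ix.partitions[k], r)
}

func toSet(items []string) map[string]bool {
	set := make(map[string]bool, len(items))
	for _, it := range items {
		set[strings.ToLower(strings.TrimSpace(it))] = true
	}
	return set
}

func trigrams(s string) map[string]bool {
	r := []rune(strings.ToLower(s))
	set := make(map[string]bool)
	for i := 0; i+3 <= len(r); i++ {
		set[string(r[i:i+3])] = true
	}
	return set
}

func jaccard(a, b map[string]bool) float64 {
	total := len(a) + len(b)
	if total == 0 {
		return 0
	}
	inter := 0
	for k := range a {
		if b[k] {
			inter++
		}
	}
	return float64(inter) / float64(total-inter)
}

// Neighbors scores the candidate only against its own partition (stage 2).
func (ix *Index) Neighbors(c Record) []Evidence {
	var out []Evidence
	cTokens, cName := toSet(c.Tokens), trigrams(c.Name)
	for _, r := range ix.partitions[blockKey{c.CategoryID, c.Province}] {
		out = append(out, Evidence{
			Neighbor:       r.Name,
			TokenOverlap:   jaccard(cTokens, toSet(r.Tokens)),
			NameSimilarity: jaccard(cName, trigrams(r.Name)),
		})
	}
	return out
}

func main() {
	ix := NewIndex()
	ix.Add(Record{Name: "Songket Lepus", CategoryID: 4, Province: "Sumsel", Tokens: []string{"Lepus"}})
	ix.Add(Record{Name: "Batik Solo", CategoryID: 4, Province: "Jateng", Tokens: []string{"Parang"}})
	c := Record{Name: "Songket Palembang", CategoryID: 4, Province: "Sumsel",
		Tokens: []string{"Lepus", "Tabur", "Tretes"}}
	for _, ev := range ix.Neighbors(c) {
		fmt.Printf("%s tokens=%.2f name=%.2f\n", ev.Neighbor, ev.TokenOverlap, ev.NameSimilarity)
	}
}
```

Shape never enters the score: it only selects which partition is scanned, so a dense category cannot make two distinct records look alike.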

Component · the AI Decision-Maker (D)

Three scoped agents inside one deterministic harness.

Stage D is the only place that runs agentic, tool-calling AI — and the orchestration around it is fixed code, never an agent deciding the flow. Each agent is the same harnessed skeleton: it is handed exactly the context and tools it needs, then what it returns is validated before it counts.

// the harness: identical skeleton for all three agents
1 · assemble context   candidate · B neighbors + evidence · sourced facts · prior verdicts
2 · call model         forced-schema output · via the resilient provider pool
3 · scoped tools       neighbor-search · source re-read (cache) · contradiction · credibility
4 · validate           schema + confidence floor
5 · ACCEPT · RETRY · DEFER   insufficient context → buffer, never guess
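The validate-then-decide part of the harness (steps 4 and 5) might be sketched like this. The source does not say when an output is retried versus deferred, so this sketch assumes that invalid or low-confidence output is retried up to a bound and then deferred. `Harness`, `ModelFunc`, and `ErrInsufficientContext` are hypothetical names, and the model call is a plain function standing in for the provider pool.

```go
package main

import (
	"errors"
	"fmt"
)

type Outcome string

const (
	Accept Outcome = "ACCEPT"
	Retry  Outcome = "RETRY"
	Defer  Outcome = "DEFER"
)

// AgentOutput is the forced-schema result; the fields are assumptions.
type AgentOutput struct {
	Verdict    string
	Confidence float64
}

// ErrInsufficientContext signals that the agent could not decide from its input.
var ErrInsufficientContext = errors.New("insufficient context")

// ModelFunc is a hypothetical stand-in for a call through the provider pool.
type ModelFunc func(input string) (AgentOutput, error)

// Harness holds the per-agent validation rules.
type Harness struct {
	Allowed         map[string]bool // verdicts the schema permits
	ConfidenceFloor float64
	MaxAttempts     int
}

func (h Harness) valid(out AgentOutput) bool {
	return h.Allowed[out.Verdict] &&
		out.Confidence >= h.ConfidenceFloor && out.Confidence <= 1
}

// step runs one model call and classifies the result.
func (h Harness) step(input string, call ModelFunc) (AgentOutput, Outcome) {
	out, err := call(input)
	switch {
	case errors.Is(err, ErrInsufficientContext):
		return AgentOutput{}, Defer // buffer, never guess
	case err != nil || !h.valid(out):
		return AgentOutput{}, Retry // call failed, bad schema, or below the floor
	}
	return out, Accept
}

// Run retries up to MaxAttempts and defers when no valid output is produced.
func (h Harness) Run(input string, call ModelFunc) (AgentOutput, Outcome) {
	for attempt := 0; attempt < h.MaxAttempts; attempt++ {
		if out, oc := h.step(input, call); oc != Retry {
			return out, oc
		}
	}
	return AgentOutput{}, Defer // retries exhausted
}

func main() {
	h := Harness{
		Allowed:         map[string]bool{"skip": true, "enrich": true, "net-new": true},
		ConfidenceFloor: 0.7,
		MaxAttempts:     3,
	}
	out, oc := h.Run("assembled context", func(string) (AgentOutput, error) {
		return AgentOutput{Verdict: "net-new", Confidence: 0.86}, nil
	})
	fmt.Println(oc, out.Verdict, out.Confidence)
}
```

The caller only ever sees ACCEPT or DEFER from `Run`; RETRY stays internal to the loop, which keeps the surrounding orchestration deterministic.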

The three agents

Agent	Question	Scoped tools	Output
Cultural-gate	In-scope Indonesian cultural datum?	category / scope validators	pass · reject
Novelty	Already in PDBI, or a new angle?	B neighbor-search, mirror lookup	skip · enrich · net-new
Cross-validation	Trustworthy enough to publish?	source re-read (cache), contradiction & credibility check	accept · reject + confidence

Every decision is traced. The assembled context, tool calls, and model output are persisted per candidate — the dashboard's decision stream shows exactly why each entry was skipped, enriched, or published.
Cheap base, paid tip. The relevance gate (A) runs on millions of pages as a tiny/local classifier; extraction (C) is one cheap call; only D's agents and E's drafting use stronger models.
Provider pool. Every model call fails over on 429, circuit-breaks on 402, and if a capability is exhausted the stage pauses and buffers — it never publishes degraded output.
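The provider-pool behaviour can be sketched as below, assuming providers are tried in a fixed order and that an opened circuit stays open until something outside the sketch resets it. `Provider`, `Pool`, and `ErrExhausted` are hypothetical names, and the status codes are returned by plain functions instead of real HTTP clients.

```go
package main

import (
	"errors"
	"fmt"
)

// Provider is a hypothetical model backend. Call returns an HTTP-style status.
type Provider struct {
	Name string
	Call func(prompt string) (status int, body string)
	Open bool // circuit open: the provider is skipped until it is reset
}

// ErrExhausted tells the stage to pause and buffer instead of publishing.
var ErrExhausted = errors.New("capability exhausted: pause stage and buffer")

// Pool tries providers in order.
type Pool struct{ Providers []*Provider }

// Do returns the first successful response, or ErrExhausted if none succeeds.
func (p *Pool) Do(prompt string) (string, error) {
	for _, pr := range p.Providers {
		if pr.Open {
			continue
		}
		status, body := pr.Call(prompt)
		if status == 200 {
			return body, nil
		}
		if status == 402 {
			pr.Open = true // payment required: stop calling this provider
		}
		// 429 or any other failure: fail over to the next provider.
	}
	return "", ErrExhausted
}

func main() {
	respond := func(status int, body string) func(string) (int, string) {
		return func(string) (int, string) { return status, body }
	}
	pool := &Pool{Providers: []*Provider{
		{Name: "primary", Call: respond(429, "")},
		{Name: "secondary", Call: respond(200, "verdict: accept")},
	}}
	body, err := pool.Do("cross-validate candidate")
	fmt.Println(body, err)
}
```

Returning an error instead of a weaker fallback answer is what lets the stage buffer the candidate and resume later without ever publishing degraded output.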

Reliability = harness, not humans. Full autonomy, no blocking review. Trust comes from scoped tools, forced-schema outputs, confidence floors, and an audit-sampling loop that keeps surfacing single-source / low-confidence entries for human labelling.

The records that flow between stages

Three contracts: Candidate, Decision, Entry.

The Candidate is the spine — C produces it, B/D/E consume it. Two identities matter: candidate_id for processing idempotency, and canonical_entity_key for grouping (deliberately not unique, so "same subject, new data" can coexist).

Candidate {                    // C → B, D, E
  candidate_id                 // hash(normalized_url, unit_index)
  content_hash                 // re-encounter / re-mine
  canonical_entity_key         // norm(entity)+category+province; NOT unique
  entity_name, aliases[]
  category_id, province_id, element_id, origin
  attributes{}                 // category-specific; typed for the 3 pilot cats
  facts[]{ claim, source_url, source_lang, quote }
  images[]{ source_url, caption, license_hint }   // refs only
  provenance{ lead_chain[], crawl_path[], doc_page? }
  source_language, extraction_version, extraction_confidence
}

Decision {                     // D → E ; trace → dashboard
  candidate_id, canonical_entity_key
  cultural_gate      pass | reject
  novelty_verdict    skip | enrich | net-new
  target_pdbi_id?    // set on skip(dup) / enrich
  cross_validation   accept | reject
  confidence, rationale, evidence_refs{ neighbors[], sources[] }
  labels{ is_duplicate?, is_accepted? }   // → calibration
  trace_id           // assembled context + tool calls + output
}

Entry {                        // E → PDBI ; → B re-encode
  candidate_id, decision_id
  mode      shadow | trickle | live
  action    create | update(target_pdbi_id)
  title, field_category, field_element, field_province, field_from
  field_description            // dense factual HTML; written for OSAN indexing
  field_file?                  // berkas ref after upload
  citations[], image{ object_ref, generated_caption, pdbi_berkas_ref }
  pdbi_id?, publish_status, reencoded, audit_sampled, sample_reason
}
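One way to enforce the Decision contract before E consumes it is a validation method on a Go struct. This is a sketch with loud assumptions: it covers only a subset of the fields, it assumes every Decision carries all three verdicts, and it assumes `target_pdbi_id` must be empty on net-new (the source only says it is set on skip and enrich). The id value in `main` is a made-up placeholder.

```go
package main

import "fmt"

// Decision is a partial Go rendering of the D → E contract.
type Decision struct {
	CandidateID     string
	CulturalGate    string // pass | reject
	NoveltyVerdict  string // skip | enrich | net-new
	TargetPDBIID    string // empty when unset
	CrossValidation string // accept | reject
	Confidence      float64
}

func oneOf(v string, allowed ...string) bool {
	for _, a := range allowed {
		if v == a {
			return true
		}
	}
	return false
}

// Validate checks enum values, the confidence range, and the target-id rule.
func (d Decision) Validate() error {
	if d.CandidateID == "" {
		return fmt.Errorf("candidate_id is required")
	}
	if !oneOf(d.CulturalGate, "pass", "reject") {
		return fmt.Errorf("cultural_gate: invalid value %q", d.CulturalGate)
	}
	if !oneOf(d.NoveltyVerdict, "skip", "enrich", "net-new") {
		return fmt.Errorf("novelty_verdict: invalid value %q", d.NoveltyVerdict)
	}
	if !oneOf(d.CrossValidation, "accept", "reject") {
		return fmt.Errorf("cross_validation: invalid value %q", d.CrossValidation)
	}
	if d.Confidence < 0 || d.Confidence > 1 {
		return fmt.Errorf("confidence %v is outside [0, 1]", d.Confidence)
	}
	needsTarget := d.NoveltyVerdict == "skip" || d.NoveltyVerdict == "enrich"
	if needsTarget && d.TargetPDBIID == "" {
		return fmt.Errorf("target_pdbi_id is required on %s", d.NoveltyVerdict)
	}
	if !needsTarget && d.TargetPDBIID != "" {
		return fmt.Errorf("target_pdbi_id must be empty on net-new")
	}
	return nil
}

func main() {
	d := Decision{
		CandidateID:     "c1",
		CulturalGate:    "pass",
		NoveltyVerdict:  "enrich",
		TargetPDBIID:    "pdbi-placeholder", // hypothetical id
		CrossValidation: "accept",
		Confidence:      0.88,
	}
	fmt.Println("valid:", d.Validate() == nil)

	d.TargetPDBIID = ""
	fmt.Println("error:", d.Validate())
}
```

Rejecting a malformed Decision at the D → E boundary keeps the publisher from ever issuing an update without a target.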

Expected output — accepted entries

What lands in PDBI, per category.

The end product: short, structured, fully-sourced data in PDBI's native shape. One entry is shown per pilot category, chosen to show the three different data shapes (recipe · motif list · prose) and three decision outcomes (net-new · enrich · audit-sampled). Every line traces to a source; nothing is padded.

Makanan Minuman · id 3

Soto Bancar Purbalingga

province Jawa Tengah · from Purbalingga

Deskripsi

Soto berkuah bening khas Purbalingga, disajikan dengan ketupat, suwiran daging, tauge, dan taburan kacang tanah goreng.¹

Bahan

Daging sapi & jeroan, kaldu bening
Ketupat, tauge, daun bawang, seledri
Kacang tanah goreng (taburan)
Bumbu: bawang, kunyit, jahe, serai, lengkuas²

Cara membuat

Rebus daging hingga empuk; sisihkan kaldu bening.
Tumis bumbu halus, masukkan ke kaldu.
Sajikan dengan ketupat, tauge, suwiran daging, taburan kacang.²

Image · caption

Soto Bancar Purbalingga: soto kuah bening dengan kacang, Jawa Tengah.

Sumber

1 · Jurnal Kuliner Banyumasan (2021)
2 · Dokumentasi Kuliner Daerah Purbalingga

verdict net-new · conf 0.84 · 2 sumber

Motif Kain · id 4

Songket Palembang

province Sumatera Selatan · from Palembang

Sejarah singkat

Kain tenun benang emas/perak dari Palembang, berkembang sejak masa Kesultanan Palembang sebagai kain kehormatan.¹

Motif

Lepus — penuh benang emas
Tabur / Bintang — motif tersebar
Tretes — tanpa motif tengah
Limar — tenun ikat pakan²

Bahan & teknik

Benang sutra dengan benang emas/perak, ditenun memakai ATBM.³

Image · caption

Songket Palembang: kain tenun motif Lepus, Sumatera Selatan.

Sumber

1 · Ensiklopedia Kain Tradisional Indonesia
2 · Museum Tekstil, koleksi songket
3 · Jurnal Kriya Nusantara

verdict enrich #4-xxxx · +2 motif, +teknik · 0.88

Tarian · id 14

Tari Bedhaya Ketawang

province Jawa Tengah · from Surakarta

Asal-usul

Tarian sakral Keraton Kasunanan Surakarta, ditarikan oleh sembilan penari putri.¹

Struktur & makna

Sembilan penari melambangkan arah mata angin; gerak menggambarkan hubungan penguasa dengan Kanjeng Ratu Kidul.¹

Fungsi

Dipentaskan pada Tingalan Jumenengan — peringatan kenaikan takhta Susuhunan.²

Image · caption

Tari Bedhaya Ketawang: tari sakral Keraton Surakarta, Jawa Tengah.

Sumber

1 · Etnografi Tari Keraton (penelitian)
2 · Dokumentasi Keraton Kasunanan Surakarta

verdict net-new · conf 0.79 · audit-sampled

Illustrative. Representative of the output shape, length, and sourcing discipline, not live published data. The real field_description is HTML; the low-confidence, scarce-source dance is the kind of entry the audit sample preferentially surfaces for human review.