KYAS
Architecture & Data Flow
Technical design · KYAS · v2

A pipeline that turns crawled web pages into verified PDBI entries.

Five stateless stages joined by durable queues. A page is crawled, distilled into one canonical record, encoded into a vector for duplicate-checking, judged by three scoped AI agents, then written into PDBI in its native CMS shape. This document traces that path end-to-end and details the two components that carry the most weight: the crawler and the encoder.

Go workers · NATS JetStream · Postgres + pgvector · object storage · Playwright (stealth) · single repo
Architecture

Stateless workers, durable queues, shared stores.

Each stage is its own worker pool. They communicate only through NATS streams that carry record IDs, never payloads — the records live in Postgres and large blobs live in object storage. This is what lets every stage scale, restart, and be tested in isolation.
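A minimal sketch of the ID-only rule, for illustration. Go channels stand in for the NATS streams and a map stands in for Postgres; `Msg`, `Store`, and `Work` are hypothetical names, not the project's API.

```go
package main

import (
	"errors"
	"fmt"
)

// Msg is what travels on a queue: a record ID and nothing else.
type Msg struct {
	CandidateID string
}

// Candidate stands in for the row that lives in Postgres.
type Candidate struct {
	ID     string
	Entity string
}

// Store is a hypothetical in-memory stand-in for the shared record store.
type Store map[string]Candidate

var ErrNotFound = errors.New("candidate not found")

// Load fetches the record that a message points at.
func (s Store) Load(id string) (Candidate, error) {
	c, ok := s[id]
	if !ok {
		return Candidate{}, ErrNotFound
	}
	return c, nil
}

// Work drains in, loads each record by ID, and forwards the same ID to out.
// It keeps no state between messages, so it can be restarted at any point.
func Work(in <-chan Msg, out chan<- Msg, s Store) (skipped int) {
	for m := range in {
		if _, err := s.Load(m.CandidateID); err != nil {
			skipped++ // unknown ID: nothing to process
			continue
		}
		out <- m
	}
	close(out)
	return skipped
}

func main() {
	store := Store{"c1": {ID: "c1", Entity: "Songket Palembang"}}
	in := make(chan Msg, 2)
	out := make(chan Msg, 2)
	in <- Msg{CandidateID: "c1"}
	in <- Msg{CandidateID: "missing"}
	close(in)

	skipped := Work(in, out, store)
	for m := range out {
		fmt.Println("forwarded", m.CandidateID)
	}
	fmt.Println("skipped", skipped)
}
```

Because a message is only an ID, a stage can be tested by seeding the store and pushing IDs at it, with no other stage running.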

Pipeline — worker pools + queues (A → E)
A · crawl-worker → crawl.out
C · extract-worker → candidate.new
B · encode + search → candidate.encoded
D · decision-worker → decision.accept
E · publisher → PDBI / shadow
State — Postgres, pgvector, object storage
frontier: URL queue + crawl metadata
candidates: distilled records (PG)
vector index: pgvector, partitioned by shape
decisions: verdicts + trace refs
entries: published / shadow + pdbi_id
pdbi mirror: 58,413 entries · dedup baseline
object storage: bodies (transient) · docs/images · traces
audit / labels: samples + human verdicts
Shared services
browser-fetch (Playwright sidecar) · resilient provider pool (search / fetch / AI) · API + dashboard (decision stream, quality review)

Seam rules

From a URL to one accepted entry

The full path of a single datum.

Worked example: the frontier hands out a textile-research URL and the system produces one accepted PDBI entry for Songket Palembang. The record at each step is shown verbatim.

A · crawl-worker

Fetch & cache

Frontier pops a URL. HTTP fetch (robots OK, polite delay); the body is cached to object storage and the page text + outbound links returned.

GET …/songket-palembang → 200, id, lang=id, links[37]
body → bodies/{candidate_id} (transient)
A · relevance gate

Cultural? Yes → keep

A cheap classifier scores the page as Indonesian cultural data. It passes to C and its links are enqueued at high priority. (A "no" would discard the body with zero storage.)

relevant = true (textile / songket) → enqueue 37 links
C · extract-worker

Distill to a Candidate

Entity resolved and canonicalized to Indonesian; category / province / element mapped; category-specific attributes pulled; each fact bound to its source quote; image URLs captured as refs (no download).

entity="Songket Palembang" cat=4 prov=Sumsel
attributes={motifs:[Lepus,Tabur,Tretes…], material, technique}
facts=[3] each → {source_url, quote}
B · encode + search

Vector + neighbors

The Candidate becomes a custom feature vector. Neighbor search runs inside the category×province partition of pgvector and returns structured match evidence — not an opaque score.

block=Motif Kain · Sumsel · has_motif_list
nearest=0.41 "Songket Lepus" — distinct motif set
D · decision-worker

Three scoped agents

Cultural-gate passes. Novelty reads the neighbors → net-new. Cross-validation finds the facts corroborated and uncontradicted → accept. A full trace is persisted.

verdict=net-new cross_val=accept conf=0.86
trace_id=… → visible in decision stream
E · publisher

Draft, image, publish

Composes the CMS payload (history + motif list as structured HTML); fetches the image, license-checks it, generates a data-derived caption, uploads it; resolves element/province ids; creates the entry.

action=create mode=shadow→live
field_description=<html: sejarah + Motif[]> field_file=berkas:…
POST /cms/entry/type/4/create → pdbi_id
E → B · close loop

Re-encode & sample

The accepted entry is re-encoded into the index (it now guards the next decision) and, being eligible for review, may be pulled into the audit sample. The body cache is freed.

reencoded=true audit_sampled=true bodies/{id} GC'd
One accepted datum. Seven steps, three agentic AI calls (all in D), zero unsourced sentences. Any crawled page that fails the relevance gate, the cultural gate, or dedup is dropped before it reaches publish.
Component · the Crawl Engine (A)

A focused crawler, not a blind one.

Open-ended link-following would drown in non-cultural pages. Instead the frontier is a best-first queue ordered by predicted cultural relevance: relevant pages get their links followed first, so the crawl stays in cultural neighbourhoods of the web — in any language.
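One way to sketch a best-first frontier is a max-heap on the relevance score, using the standard `container/heap` package. The `Item` fields and the rule that a child link inherits its parent's score as a prior are assumptions made for this illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

// Item is one frontier entry; the field names are illustrative.
type Item struct {
	URL       string
	Depth     int
	Relevance float64 // predicted cultural relevance; higher is fetched sooner
}

// queue implements heap.Interface as a max-heap on Relevance.
type queue []Item

func (q queue) Len() int           { return len(q) }
func (q queue) Less(i, j int) bool { return q[i].Relevance > q[j].Relevance }
func (q queue) Swap(i, j int)      { q[i], q[j] = q[j], q[i] }
func (q *queue) Push(x any)        { *q = append(*q, x.(Item)) }
func (q *queue) Pop() any {
	old := *q
	it := old[len(old)-1]
	*q = old[:len(old)-1]
	return it
}

// Frontier hands out the most promising URL first.
type Frontier struct{ q queue }

// Add enqueues one URL.
func (f *Frontier) Add(it Item) { heap.Push(&f.q, it) }

// Next pops the highest-relevance URL, or reports false when empty.
func (f *Frontier) Next() (Item, bool) {
	if f.q.Len() == 0 {
		return Item{}, false
	}
	return heap.Pop(&f.q).(Item), true
}

// Expand enqueues a relevant page's links. Each child inherits the parent's
// score as its prior (an assumption of this sketch).
func (f *Frontier) Expand(parent Item, links []string) {
	for _, l := range links {
		f.Add(Item{URL: l, Depth: parent.Depth + 1, Relevance: parent.Relevance})
	}
}

func main() {
	f := &Frontier{}
	f.Add(Item{URL: "https://example.org/news", Relevance: 0.10})
	f.Add(Item{URL: "https://example.org/songket", Relevance: 0.92})
	f.Add(Item{URL: "https://example.org/batik", Relevance: 0.75})

	first, _ := f.Next()
	f.Expand(first, []string{"https://example.org/songket/motif"})
	fmt.Println("fetched", first.URL)
	for it, ok := f.Next(); ok; it, ok = f.Next() {
		fmt.Printf("next %.2f depth=%d %s\n", it.Relevance, it.Depth, it.URL)
	}
}
```

With this ordering, links found on a relevant page are fetched before unrelated URLs that were queued earlier, which is what keeps the crawl inside cultural neighbourhoods.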

Mechanics

  • Frontier. URL queue with depth, parent, relevance-prior, language hint. Best-first, not BFS/DFS.
  • Two fetchers. Polite HTTP for static pages & PDFs; a stealth browser (Playwright) for JS / protected / social pages.
  • Relevance gate. A cheap classifier on every page — keep & expand, or discard with zero storage.
  • Wikipedia = reference harvester. Descend a page's References to reach the primary sources; the Wikipedia text itself is never kept.
  • Documents. A 2,000-page PDF is segmented into units and fanned out — one source can yield hundreds of candidates.
  • Re-encounter, not re-poll. Content-hash idempotency skips unchanged pages. Cultural data is static, so there is no freshness timer; new data comes from new pages → the enrich path.
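The re-encounter rule can be sketched as a content-hash check. This is an illustration under assumptions: SHA-256 as the hash, whitespace normalization before hashing, and an in-memory `Seen` map in place of the crawl metadata table are all choices made here, not details from the design.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Seen maps a URL to the hash of the body last processed for it.
type Seen map[string]string

// ContentHash hashes whitespace-normalized text, so a page that was only
// reflowed does not count as changed.
func ContentHash(body string) string {
	norm := strings.Join(strings.Fields(body), " ")
	sum := sha256.Sum256([]byte(norm))
	return hex.EncodeToString(sum[:])
}

// ShouldProcess reports whether the page is new or changed, and records
// the hash so the next encounter of the same content is skipped.
func (s Seen) ShouldProcess(url, body string) bool {
	h := ContentHash(body)
	if s[url] == h {
		return false
	}
	s[url] = h
	return true
}

func main() {
	seen := Seen{}
	url := "https://example.org/songket-palembang"
	fmt.Println(seen.ShouldProcess(url, "Songket adalah kain tenun."))   // first encounter
	fmt.Println(seen.ShouldProcess(url, "Songket  adalah\nkain tenun.")) // same content, reflowed
	fmt.Println(seen.ShouldProcess(url, "Songket adalah kain tenun emas."))
}
```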

Seeds & expansion

Wikipedia category trees · research repositories · known cultural domains · PDBI coverage gaps · DuckDuckGo API · SearXNG · Wikipedia search
Storage. Ephemeral by default: only distilled records, provenance, and the URL frontier persist. Raw bodies are dropped after the decision; only high-value primary documents (PDFs, papers) are retained for re-mining.
Component · the Encoder & Index (B)

A custom vector we control — not an LLM embedding.

The dedup question is "does PDBI already hold this data?" An LLM embedding clusters by vague meaning and hides that answer. Instead each record is encoded into a feature vector built from the PDBI schema, in two blocks.

Shape block · blocking + routing
category · province / region · element · has_recipe · has_motif_list · has_steps · num_images · length_bucket · num_sections

Low-cardinality. Used to narrow the search (the index is partitioned on it) and to route the draft template — never to decide a match.

Content block · the match decision
entity n-grams · region / ethnicity tokens · ingredients[] · motifs[] · occasion · actors

High-cardinality. Content decides dedup, so two soto dishes from one region read as distinct data. Category-conditional slots, zero-padded when N/A.
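A rough sketch of the two blocks, under simplifying assumptions: the content block is represented as a sorted token set rather than fixed, zero-padded slots, character trigrams stand in for the entity n-grams, and the `Record` fields, token prefixes, and key format are invented for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Record carries only the fields this sketch needs.
type Record struct {
	Entity      string
	CategoryID  int
	Province    string
	Motifs      []string
	Ingredients []string
}

// Encoded holds the two blocks: a low-cardinality key and content tokens.
type Encoded struct {
	ShapeKey string   // used only to pick the partition
	Content  []string // sorted, de-duplicated; used only to score
}

// trigrams returns character 3-grams of the normalized entity name.
func trigrams(s string) []string {
	r := []rune(strings.ToLower(strings.Join(strings.Fields(s), " ")))
	var out []string
	for i := 0; i+3 <= len(r); i++ {
		out = append(out, "ng:"+string(r[i:i+3]))
	}
	return out
}

// Encode is deterministic: the same record always yields the same blocks.
func Encode(rec Record) Encoded {
	set := map[string]struct{}{}
	for _, t := range trigrams(rec.Entity) {
		set[t] = struct{}{}
	}
	for _, m := range rec.Motifs {
		set["motif:"+strings.ToLower(m)] = struct{}{}
	}
	for _, in := range rec.Ingredients {
		set["ingr:"+strings.ToLower(in)] = struct{}{}
	}
	content := make([]string, 0, len(set))
	for t := range set {
		content = append(content, t)
	}
	sort.Strings(content) // map iteration order is random; sort for determinism
	return Encoded{
		ShapeKey: fmt.Sprintf("cat=%d|prov=%s", rec.CategoryID, strings.ToLower(rec.Province)),
		Content:  content,
	}
}

func main() {
	e := Encode(Record{
		Entity:     "Songket Palembang",
		CategoryID: 4,
		Province:   "Sumsel",
		Motifs:     []string{"Lepus", "Tabur", "Tretes"},
	})
	fmt.Println(e.ShapeKey)
	fmt.Println(len(e.Content), "content tokens")
}
```

A record with no ingredients simply contributes no `ingr:` tokens, which plays the role of the zero-padded slot in this simplified form.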

Matching is two-stage, not one blended score

Stage 1 · block

Shape narrows

Partition by category × province. The candidate is only compared against its own neighbourhood — and this is also the scale strategy (no full-index scan at 1M+).

Stage 2 · score

Content decides

Within the partition, content similarity yields structured evidence: "same category+region, 80% ingredient overlap, distinct name" → handed to D.
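The two stages can be sketched as a partition lookup followed by a set-overlap score. Jaccard overlap is used here only as a stand-in, since the design does not name its similarity measure; `Index`, `Entry`, and `Evidence` are hypothetical types.

```go
package main

import "fmt"

// Entry is an indexed record: a partition key plus content tokens.
type Entry struct {
	Name     string
	ShapeKey string
	Tokens   []string
}

// Evidence is the structured result handed on, rather than a bare score.
type Evidence struct {
	Neighbor string
	Overlap  float64 // Jaccard overlap of content tokens
	SameName bool
}

// Index maps a partition key to the entries inside that partition.
type Index map[string][]Entry

// jaccard returns |A ∩ B| / |A ∪ B| over the distinct tokens of a and b.
func jaccard(a, b []string) float64 {
	setA := make(map[string]bool, len(a))
	for _, t := range a {
		setA[t] = true
	}
	setB := make(map[string]bool, len(b))
	inter := 0
	for _, t := range b {
		if setB[t] {
			continue
		}
		setB[t] = true
		if setA[t] {
			inter++
		}
	}
	union := len(setA) + len(setB) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// Nearest runs both stages: the shape key selects one partition (stage 1),
// then content overlap is scored only against entries in it (stage 2).
func (ix Index) Nearest(c Entry) (Evidence, bool) {
	var best Evidence
	found := false
	for _, e := range ix[c.ShapeKey] {
		ov := jaccard(c.Tokens, e.Tokens)
		if !found || ov > best.Overlap {
			best = Evidence{Neighbor: e.Name, Overlap: ov, SameName: e.Name == c.Name}
			found = true
		}
	}
	return best, found
}

func main() {
	ix := Index{
		"cat=4|prov=sumsel": {{Name: "Songket Lepus", ShapeKey: "cat=4|prov=sumsel", Tokens: []string{"lepus", "emas"}}},
	}
	c := Entry{Name: "Songket Palembang", ShapeKey: "cat=4|prov=sumsel",
		Tokens: []string{"lepus", "tabur", "tretes", "emas"}}
	ev, ok := ix.Nearest(c)
	fmt.Printf("%v %+v\n", ok, ev)
}
```

Shape never enters the score: an entry in another partition is not compared at all, however similar its content.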

Why content dominates. Shape is low-cardinality: at 1M entries every food-from-Java record looks alike, so weighting shape would suppress genuinely new data in the densest categories. Over-weighting shape loses data; under-weighting it only costs CPU. So: content scores, shape blocks. The encoder is deterministic v0; a trained model is earned later from accumulated labels.
Component · the AI Decision-Maker (D)

Three scoped agents inside one deterministic harness.

Stage D is the only place that runs agentic, tool-calling AI — and the orchestration around it is fixed code, never an agent deciding the flow. Each agent is the same harnessed skeleton: it is handed exactly the context and tools it needs, then what it returns is validated before it counts.

// the harness: identical skeleton for all three agents
1 · assemble context         candidate · B neighbors + evidence · sourced facts · prior verdicts
2 · call model               forced-schema output · via the resilient provider pool
3 · scoped tools             neighbor-search · source re-read (cache) · contradiction · credibility
4 · validate                 schema + confidence floor
5 · ACCEPT · RETRY · DEFER   insufficient context → buffer, never guess
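The skeleton can be sketched as fixed Go code that wraps a model call. This is an illustration, not the project's harness: `Agent.Call` stands in for the model plus its scoped tools, and the mapping of failures to outcomes (invalid schema retries, below-floor confidence defers, exhausted retries defer) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Outcome is the harness decision for one attempt.
type Outcome string

const (
	Accept Outcome = "ACCEPT"
	Retry  Outcome = "RETRY"
	Defer  Outcome = "DEFER"
)

// Output is the forced-schema result every agent must return.
type Output struct {
	Verdict    string
	Confidence float64
}

// ErrInsufficientContext is what a call returns when it cannot decide.
var ErrInsufficientContext = errors.New("insufficient context")

// Agent bundles what differs between the three agents.
type Agent struct {
	Name    string
	Allowed map[string]bool // verdicts the schema permits
	Floor   float64         // confidence floor
	Call    func(context string) (Output, error)
}

// Step runs one attempt: call, then validate schema and confidence.
func Step(a Agent, context string) (Output, Outcome) {
	out, err := a.Call(context)
	switch {
	case errors.Is(err, ErrInsufficientContext):
		return Output{}, Defer // buffer, never guess
	case err != nil, !a.Allowed[out.Verdict]:
		return Output{}, Retry // transport error or off-schema verdict
	case out.Confidence < a.Floor:
		return out, Defer
	}
	return out, Accept
}

// Run is the fixed orchestration: retry up to maxAttempts, then defer.
func Run(a Agent, context string, maxAttempts int) (Output, Outcome) {
	for i := 0; i < maxAttempts; i++ {
		if out, oc := Step(a, context); oc != Retry {
			return out, oc
		}
	}
	return Output{}, Defer
}

func main() {
	calls := 0
	novelty := Agent{
		Name:    "novelty",
		Allowed: map[string]bool{"skip": true, "enrich": true, "net-new": true},
		Floor:   0.70,
		Call: func(string) (Output, error) {
			calls++
			if calls == 1 {
				return Output{Verdict: "probably new", Confidence: 0.9}, nil // off-schema
			}
			return Output{Verdict: "net-new", Confidence: 0.86}, nil
		},
	}
	out, oc := Run(novelty, "candidate + neighbors", 3)
	fmt.Println(oc, out.Verdict, out.Confidence, "calls:", calls)
}
```

The agent never chooses what happens next; it only returns an `Output`, and the surrounding code decides whether that output counts.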

The three agents

Agent | Question | Scoped tools | Output
Cultural-gate | In-scope Indonesian cultural datum? | category / scope validators | pass · reject
Novelty | Already in PDBI, or a new angle? | B neighbor-search, mirror lookup | skip · enrich · net-new
Cross-validation | Trustworthy enough to publish? | source re-read (cache), contradiction & credibility check | accept · reject + confidence
  • Every decision is traced. The assembled context, tool calls, and model output are persisted per candidate — the dashboard's decision stream shows exactly why each entry was skipped, enriched, or published.
  • Cheap base, paid tip. The relevance gate (A) runs on millions of pages as a tiny/local classifier; extraction (C) is one cheap call; only D's agents and E's drafting use stronger models.
  • Provider pool. Every model call fails over to another provider on HTTP 429 (rate-limited), circuit-breaks on 402 (payment required), and if a capability is exhausted the stage pauses and buffers; it never publishes degraded output.
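The provider-pool behaviour can be sketched as a loop over providers keyed on HTTP status. The `Provider` and `Pool` types are hypothetical, the circuit here never closes again, and treating every other non-200 status as a reason to fail over is an assumption of this sketch.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Provider is one backend in the pool. Call stands in for a real request
// and returns the HTTP status and body.
type Provider struct {
	Name string
	Open bool // circuit open: skip this provider
	Call func(prompt string) (status int, body string)
}

// ErrExhausted tells the stage to pause and buffer instead of degrading.
var ErrExhausted = errors.New("capability exhausted: pause and buffer")

// Pool tries providers in order.
type Pool struct{ Providers []*Provider }

// Do returns the first successful body, or ErrExhausted if none succeed.
func (p *Pool) Do(prompt string) (string, error) {
	for _, pr := range p.Providers {
		if pr.Open {
			continue
		}
		status, body := pr.Call(prompt)
		switch status {
		case http.StatusOK:
			return body, nil
		case http.StatusTooManyRequests: // 429: rate-limited, try the next one
			continue
		case http.StatusPaymentRequired: // 402: out of credit, open the circuit
			pr.Open = true
			continue
		default: // any other failure: fail over as well (assumption)
			continue
		}
	}
	return "", ErrExhausted
}

func main() {
	pool := &Pool{Providers: []*Provider{
		{Name: "primary", Call: func(string) (int, string) { return http.StatusPaymentRequired, "" }},
		{Name: "backup", Call: func(string) (int, string) { return http.StatusOK, `{"verdict":"net-new"}` }},
	}}
	body, err := pool.Do("classify")
	fmt.Println(body, err, "primary circuit open:", pool.Providers[0].Open)
}
```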
Reliability = harness, not humans. Full autonomy, no blocking review. Trust comes from scoped tools, forced-schema outputs, confidence floors, and an audit-sampling loop that keeps surfacing single-source / low-confidence entries for human labelling.
The records that flow between stages

Three contracts: Candidate, Decision, Entry.

The Candidate is the spine — C produces it, B/D/E consume it. Two identities matter: candidate_id for processing idempotency, and canonical_entity_key for grouping (deliberately not unique, so "same subject, new data" can coexist).

Candidate {                          // C → B, D, E
  candidate_id                       // hash(normalized_url, unit_index)
  content_hash                       // re-encounter / re-mine
  canonical_entity_key               // norm(entity)+category+province; NOT unique
  entity_name, aliases[]
  category_id, province_id, element_id, origin
  attributes{}                       // category-specific; typed for the 3 pilot cats
  facts[]{ claim, source_url, source_lang, quote }
  images[]{ source_url, caption, license_hint }   // refs only
  provenance{ lead_chain[], crawl_path[], doc_page? }
  source_language, extraction_version, extraction_confidence
}

Decision {                           // D → E ; trace → dashboard
  candidate_id, canonical_entity_key
  cultural_gate      pass | reject
  novelty_verdict    skip | enrich | net-new
  target_pdbi_id?                    // set on skip(dup) / enrich
  cross_validation   accept | reject
  confidence, rationale, evidence_refs{ neighbors[], sources[] }
  labels{ is_duplicate?, is_accepted? }   // → calibration
  trace_id                           // assembled context + tool calls + output
}

Entry {                              // E → PDBI ; → B re-encode
  candidate_id, decision_id
  mode      shadow | trickle | live
  action    create | update(target_pdbi_id)
  title, field_category, field_element, field_province, field_from
  field_description                  // dense factual HTML; written for OSAN indexing
  field_file?                        // berkas ref after upload
  citations[], image{ object_ref, generated_caption, pdbi_berkas_ref }
  pdbi_id?, publish_status, reencoded, audit_sampled, sample_reason
}
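The two Candidate identities can be sketched as follows. The hash function (SHA-256), the separators, and the normalization rule (lower-case, collapsed whitespace) are assumptions for illustration; URL normalization is taken to happen upstream.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// CandidateID derives the processing identity from hash(normalized_url, unit_index).
// The same unit of the same page always maps to the same ID.
func CandidateID(normalizedURL string, unitIndex int) string {
	// A NUL separator keeps the URL and the index from running together.
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s\x00%d", normalizedURL, unitIndex)))
	return hex.EncodeToString(sum[:])
}

// EntityKey derives the grouping identity from norm(entity)+category+province.
// It is deliberately NOT unique: several candidates may share one key.
func EntityKey(entity string, categoryID int, province string) string {
	norm := strings.ToLower(strings.Join(strings.Fields(entity), " "))
	return fmt.Sprintf("%s|%d|%s", norm, categoryID, strings.ToLower(province))
}

func main() {
	// Two different sources describing the same subject.
	a := CandidateID("https://example.org/songket-palembang", 0)
	b := CandidateID("https://example.net/kain/songket", 0)
	key := EntityKey("Songket  Palembang", 4, "Sumsel")

	fmt.Println("distinct candidate IDs:", a != b)
	fmt.Println("shared entity key:", key)
}
```

Reprocessing the same page unit is a no-op because the ID repeats, while a second source about the same subject gets its own ID and lands under the same key, which is how "same subject, new data" can coexist.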
Expected output — accepted entries

What lands in PDBI, per category.

The end product: short, structured, fully-sourced data in PDBI's native shape. One per pilot category — chosen to show the three different data shapes (recipe · motif list · prose) and three decision outcomes (net-new · enrich · audit-sampled). Every line traces to a source; nothing is padded.

Makanan Minuman · id 3

Soto Bancar Purbalingga

province Jawa Tengah · from Purbalingga
Deskripsi

Soto berkuah bening khas Purbalingga, disajikan dengan ketupat, suwiran daging, tauge, dan taburan kacang tanah goreng. [1]

Bahan
  • Daging sapi & jeroan, kaldu bening
  • Ketupat, tauge, daun bawang, seledri
  • Kacang tanah goreng (taburan)
  • Bumbu: bawang, kunyit, jahe, serai, lengkuas [2]
Cara membuat
  1. Rebus daging hingga empuk; sisihkan kaldu bening.
  2. Tumis bumbu halus, masukkan ke kaldu.
  3. Sajikan dengan ketupat, tauge, suwiran daging, taburan kacang. [2]
Image · caption: Soto Bancar Purbalingga — soto kuah bening dengan kacang, Jawa Tengah.
Sumber
1 · Jurnal Kuliner Banyumasan (2021)
2 · Dokumentasi Kuliner Daerah Purbalingga
verdict net-new · conf 0.84 · 2 sumber
Motif Kain · id 4

Songket Palembang

province Sumatera Selatan · from Palembang
Sejarah singkat

Kain tenun benang emas/perak dari Palembang, berkembang sejak masa Kesultanan Palembang sebagai kain kehormatan. [1]

Motif
  • Lepus — penuh benang emas
  • Tabur / Bintang — motif tersebar
  • Tretes — tanpa motif tengah
  • Limar — tenun ikat pakan [2]
Bahan & teknik

Benang sutra dengan benang emas/perak, ditenun memakai ATBM. [3]

Image · caption: Songket Palembang — kain tenun motif Lepus, Sumatera Selatan.
Sumber
1 · Ensiklopedia Kain Tradisional Indonesia
2 · Museum Tekstil — koleksi songket
3 · Jurnal Kriya Nusantara
verdict enrich #4-xxxx · +2 motif, +teknik · 0.88
Tarian · id 14

Tari Bedhaya Ketawang

province Jawa Tengah · from Surakarta
Asal-usul

Tarian sakral Keraton Kasunanan Surakarta, ditarikan oleh sembilan penari putri. [1]

Struktur & makna

Sembilan penari melambangkan arah mata angin; gerak menggambarkan hubungan penguasa dengan Kanjeng Ratu Kidul. [1]

Fungsi

Dipentaskan pada Tingalan Jumenengan — peringatan kenaikan takhta Susuhunan. [2]

Image · caption: Tari Bedhaya Ketawang — tari sakral Keraton Surakarta, Jawa Tengah.
Sumber
1 · Etnografi Tari Keraton (penelitian)
2 · Dokumentasi Keraton Kasunanan Surakarta
verdict net-new · conf 0.79 · audit-sampled
Illustrative. Representative of the output shape, length, and sourcing discipline; not live published data. The real field_description is HTML; the low-confidence, scarce-source dance is the kind of entry the audit sample preferentially surfaces for human review.