Internal review

KYAS

An automated factory that writes Indonesian cultural articles for PDBI — status, architecture, and the road to scale.
Prepared by: Rizky Azmi Swandy · Basis: measured staging data · Jun 2026
01 What it is

A production line for cultural articles

KYAS automatically turns Indonesian cultural subjects into finished, fully sourced articles for PDBI. A chain of specialised stages assembles each article, and any stage can reject it, so only verified work is kept.

Input
58k subjects
Indonesian cultural entries in PDBI waiting to be written.
Engine
11 staged factories
Find sources → extract evidence → draft → verify → seal.
Output
Vault articles
Immutable, fully-cited, certified records.
Goal
1,000,000
The target volume for the public database.
02 Where we are

The factory is built and running

Component | Status | Notes
Production chain (sources → vault) | ✓ running | The full chain runs end to end in staging.
Evidence & Grounding | ✓ AI, live | Both call AI (gpt-4o-mini); part of the 18 calls/article.
Draft writing | ✓ AI, live | gpt-5.4-mini for prose, gpt-4o-mini for the plan.
Stability | ✓ hardened | Recovery sweepers + queue fixes; ~70 changes in 8 days.
Sourcing | ⚠ Wikipedia only | Free & reliable today; broader coverage needs paid search.
Publish to PDBI | ⏸ gated | Dry-run, awaiting canary approval before anything goes public.
03 Architecture

A real distributed system, with money in the loop

[Architecture diagram] Main flow: Subjects (PDBI · 58k) → Sourcing (search + fetch) → Worker fleet (NATS bus · 11 stages, auto gates · immutable) → Vault (sealed) → PDBI (publish, gated). Supporting services: Search APIs (Brave · Google · Wikipedia), OpenRouter AI (draft · evidence · grounding), Autopilot (governs the fleet), Postgres (state · lineage), SeaweedFS (article blobs). Edge labels: queries, AI calls, governs, state, blobs.

Ten-plus components and external paid, rate-limited services, with real spend at two points — AI generation and paid search (publishing is just an internal write). It is genuinely complex to operate — and because money flows through it, it carries real operational risk.

04 Pipeline detail · source to accepted draft

From a found source to a publish-ready draft

[Pipeline diagram] Discovery (source lake) → [URLs] → Portfolio (plan sources) → [plan] → Fetch (download) → [pages] → Quality (filter) → [sources] → Evidence (extract claims) → [claims] → Draft (write article) → [draft] → Grounding (verify claims) → [grounded] → Certify (category · dupes) → [checked] → Vault (seal) → [sealed] → Accepted (ready to publish)

Each stage transforms the work and can reject it. Gold = AI step, clay = quality gate; the small labels are the artifact passed to the next stage. Only a draft that clears every gate is sealed — and only then is it ready to publish.
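
The gate behaviour described above can be sketched in a few lines. This is a minimal illustration, not the KYAS implementation: `Work`, `run_chain`, and the two toy stages are hypothetical names, and the real stages presumably run as workers on the NATS bus rather than as in-process function calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Work:
    """One article in flight; `artifacts` holds what each stage hands on."""
    subject: str
    artifacts: dict = field(default_factory=dict)
    rejected_by: Optional[str] = None


# A stage returns True to pass the work on, False to reject it.
Stage = Tuple[str, Callable[[Work], bool]]


def run_chain(work: Work, stages: List[Stage]) -> Work:
    for name, stage in stages:
        if not stage(work):
            work.rejected_by = name  # stop at the first failing gate
            return work
    work.artifacts["sealed"] = True  # only reached if every gate passed
    return work


def quality(work: Work) -> bool:
    # Hypothetical quality gate: keep only pages that contain some text.
    pages = [p for p in work.artifacts.get("pages", []) if p.strip()]
    work.artifacts["sources"] = pages
    return len(pages) > 0


def draft(work: Work) -> bool:
    # Placeholder for the AI drafting step; no model is called here.
    work.artifacts["draft"] = " ".join(work.artifacts["sources"])
    return True


STAGES: List[Stage] = [("Quality", quality), ("Draft", draft)]

ok = run_chain(Work("Batik", {"pages": ["Batik is a wax-resist dyeing technique."]}), STAGES)
bad = run_chain(Work("Unknown", {"pages": ["  "]}), STAGES)
print(ok.rejected_by, bad.rejected_by)
```

The point of the design is that sealing happens in exactly one place, after the loop, so a rejected article can never reach the vault by accident.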

05 The flow · one article, end to end

Watch the 18 AI calls add up


Three stages call AI — Evidence (+4), Draft (+10), Grounding (+4) — totalling 18 calls ≈ $0.04 by the time the article is sealed and published.
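
The tally can be reproduced directly from the per-stage counts. The sketch below uses only the figures quoted above; the blended cost per call is a derived average and an assumption of even spread, since the stages do not cost the same per call.

```python
# Per-stage AI call counts and the measured per-article cost, as quoted above.
CALLS_PER_STAGE = {"Evidence": 4, "Draft": 10, "Grounding": 4}
COST_PER_ARTICLE_USD = 0.04

total_calls = sum(CALLS_PER_STAGE.values())

# Blended average only; it hides the fact that drafting dominates spend.
avg_cost_per_call = COST_PER_ARTICLE_USD / total_calls

running = 0
for stage, calls in CALLS_PER_STAGE.items():
    running += calls
    print(f"{stage:<10} +{calls:>2}  running total {running}/{total_calls}")

print(f"blended cost per call: ${avg_cost_per_call:.4f}")
```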

06 The hard part

The bottleneck is sourcing & grounding

Writing is the easy half. The hard half is feeding the line enough good sources: every claim must be backed by a real, quotable source, and grounding rejects the draft if it isn't. At scale, finding and fetching those sources — cheaply — is where the system starves.

Input need
4+ sources / article
Several quality URLs across 2+ domains before drafting can start.
Grounding
No support → reject
If a claim isn't found in a source, the paid-for draft is discarded.
Paid search
Costly at scale
Commercial search APIs charge per query — millions of articles, big bill.
Free search
SearXNG = poor
Stale, low-quality results; CAPTCHA-blocked on datacenter IPs.
Fetch
Slow & serial
Polite per-domain fetching is I/O-bound and hard to parallelize.
Coverage
Thin for niche
Many subjects have few pages that truly support claims.
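
The two gates that bracket the paid drafting step can be sketched as follows. This is an illustration under stated assumptions: the thresholds come from the cards above, but `ready_to_draft`, `unsupported_quotes`, and `accept_draft` are hypothetical names, and the verbatim substring match is a deliberately naive stand-in for the AI-based grounding stage.

```python
from urllib.parse import urlparse

MIN_SOURCES = 4  # "4+ sources / article"
MIN_DOMAINS = 2  # "across 2+ domains"


def ready_to_draft(urls):
    """Input gate: enough distinct URLs spread over enough domains."""
    unique_urls = set(urls)
    domains = {urlparse(u).netloc.lower() for u in unique_urls}
    return len(unique_urls) >= MIN_SOURCES and len(domains) >= MIN_DOMAINS


def _normalise(text):
    # Lower-case and collapse whitespace so line breaks do not block a match.
    return " ".join(text.lower().split())


def unsupported_quotes(quotes, source_texts):
    """Return the quotes that appear in none of the source texts."""
    haystacks = [_normalise(t) for t in source_texts]
    return [q for q in quotes if not any(_normalise(q) in h for h in haystacks)]


def accept_draft(quotes, source_texts):
    """Output gate: a single unsupported quote discards the draft."""
    return not unsupported_quotes(quotes, source_texts)


urls = [
    "https://example.org/a",
    "https://example.org/b",
    "https://example.com/c",
    "https://example.com/d",
]
sources = ["Wayang kulit is a form of\nshadow puppetry.", "It is performed at night."]
print(ready_to_draft(urls))
print(accept_draft(["a form of shadow puppetry"], sources))
print(accept_draft(["invented in 1950"], sources))
```

Running the cheap input gate first matters for cost: a draft that fails the output gate has already consumed its AI calls.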
07 Cost · the reality check

Reaching the target costs on two fronts

$0.04
measured AI cost per accepted article
18
AI calls per article (evidence + draft + grounding)
$40,000
AI credits to reach 1,000,000
AI credits: ~$6,000 / month
draft 70% · evidence 28% · grounding 2%
Infrastructure + sourcing: ~$1–4k / month + paid-search risk
worker fleet · DB · storage
paid search at scale (can rival the AI bill)
The budget, plainly

To reach 1,000,000 you fund both: ~$40k of AI credits (≈$6k/mo) and the infrastructure + search to feed it. Today we run on ~$150/mo of manual top-ups — about 40× under the AI line alone.
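
The AI budget line follows from a few multiplications. The sketch below spreads the total over the 197-day window to year-end given in the provenance note, which is an assumption about how the monthly figure was derived; it yields about $6.2k/month and a gap of about 41×, consistent with the rounded ~$6k and ~40× quoted above.

```python
TARGET_ARTICLES = 1_000_000
COST_PER_ARTICLE_USD = 0.04   # measured AI cost per accepted article
DAYS_TO_YEAR_END = 197        # window stated in the provenance note
CURRENT_MONTHLY_USD = 150     # today's manual top-ups
COST_SHARE = {"draft": 0.70, "evidence": 0.28, "grounding": 0.02}

total_ai_usd = TARGET_ARTICLES * COST_PER_ARTICLE_USD

# Assumption: an average month of 365/12 days.
months = DAYS_TO_YEAR_END / (365 / 12)
monthly_ai_usd = total_ai_usd / months
funding_gap = monthly_ai_usd / CURRENT_MONTHLY_USD

# Where the AI money goes, using the stage split quoted above.
per_stage_usd = {stage: total_ai_usd * share for stage, share in COST_SHARE.items()}

print(f"total AI credits: ${total_ai_usd:,.0f}")
print(f"per month:        ${monthly_ai_usd:,.0f}")
print(f"gap vs today:     {funding_gap:.0f}x")
for stage, usd in per_stage_usd.items():
    print(f"  {stage:<10} ${usd:,.0f}")
```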

08 Cost in detail · search + infrastructure

Two cost engines: search providers and the worker fleet

Paid search — to widen coverage beyond Wikipedia
Provider | ~ Cost / 1k searches | How it helps
Brave Search API | $3–5 | Independent, fresh index; broad coverage.
Google (SerpAPI / CSE) | $5–10 | Best relevance for niche Indonesian subjects.
Tavily | $5–8 | Returns clean content; skips the slow fetch step.
Exa (semantic) | $3–5 | Finds sources that actually support a claim → better grounding.

~10–30 searches per accepted article → at 1M scale, ~$40k–$150k in search alone. SearXNG is free, but stale & blocked.
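
The search bill scales linearly in three numbers, so it is easy to bound. The sketch below is a rough model with a hypothetical helper `search_cost_usd`; it takes the table's cheapest and most expensive list prices as the envelope. The quoted ~$40k–$150k range sits inside that envelope and corresponds to roughly $4–5 per 1k searches.

```python
def search_cost_usd(articles, searches_per_article, usd_per_1k):
    """Total paid-search spend for a given volume and price point."""
    return articles * searches_per_article * usd_per_1k / 1000


ARTICLES = 1_000_000

# Envelope from the table: 10-30 searches per article, $3-10 per 1k searches.
floor_usd = search_cost_usd(ARTICLES, 10, 3)
ceiling_usd = search_cost_usd(ARTICLES, 30, 10)

# Price points that reproduce the range quoted above (an inference, not a quote).
quoted_low = search_cost_usd(ARTICLES, 10, 4)
quoted_high = search_cost_usd(ARTICLES, 30, 5)

print(f"envelope: ${floor_usd:,.0f} to ${ceiling_usd:,.0f}")
print(f"quoted:   ${quoted_low:,.0f} to ${quoted_high:,.0f}")
```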

Infrastructure — mostly per worker
Item | ~ Cost / month
Worker compute · 2–4 vCPU / 4–8 GB | $40–80 / worker
PostgreSQL (+ read replica) | $150–400
SeaweedFS / object storage | $30–80
NATS + bandwidth / egress | $40–80

~4 workers ≈ $500/mo · ~10–12 workers ≈ $1.2k/mo — before search.
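
The per-worker structure of the infrastructure estimate can be written as a small model. This is a sketch that assumes the shared services stay within their listed ranges regardless of fleet size, which may not hold at the top end; `monthly_infra_usd` is a hypothetical helper.

```python
# (low, high) planning estimates in USD per month, from the table above.
WORKER_USD = (40, 80)
SHARED_USD = {
    "PostgreSQL (+ read replica)": (150, 400),
    "SeaweedFS / object storage": (30, 80),
    "NATS + bandwidth / egress": (40, 80),
}


def monthly_infra_usd(workers):
    """Return the (low, high) monthly infrastructure cost for a fleet size."""
    shared_low = sum(low for low, _ in SHARED_USD.values())
    shared_high = sum(high for _, high in SHARED_USD.values())
    return (workers * WORKER_USD[0] + shared_low,
            workers * WORKER_USD[1] + shared_high)


for n in (4, 10, 12):
    low, high = monthly_infra_usd(n)
    print(f"{n:>2} workers: ${low}-{high} / month")
```

The quoted ~$500/mo for 4 workers and ~$1.2k/mo for 10–12 workers both fall inside the ranges this produces.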

The complexity, in one line

Infrastructure is the predictable cost (~$0.5–1.2k/mo). Paid search is the wildcard — at full scale it can match or exceed the entire AI bill, which is exactly why staying on free Wikipedia and widening coverage cheaply matters so much.

09 Throughput · output per day

What each level of investment produces

Today · 1 worker, credit-capped · measured: ~64 / day
~13k by year-end · 1M in ~43 years
Funded · ~4 workers: ~2,250 / day
~440k by year-end · 1M in ~15 months
Full-scale · ~10–12 workers, multi-key: ~5,100 / day
~1.0M by year-end · 1M in ~6.5 months
~80× faster than today to reach 1M by year-end · peak so far 134/day · target 5,076/day
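
Each tier's projection is the daily rate multiplied by the 197 days left in the year, plus the time needed to reach the target at that rate. The sketch below reproduces the figures above; the tier labels and the 365/12-day month are conveniences, not part of the measured data.

```python
DAYS_TO_YEAR_END = 197
TARGET_ARTICLES = 1_000_000
DAYS_PER_MONTH = 365 / 12  # assumption: average month length

TIERS = {
    "Today (1 worker)": 64,
    "Funded (~4 workers)": 2_250,
    "Full-scale (~10-12 workers)": 5_100,
}


def project(rate_per_day):
    """Return (articles by year-end, months needed to reach the target)."""
    by_year_end = rate_per_day * DAYS_TO_YEAR_END
    months_to_target = TARGET_ARTICLES / rate_per_day / DAYS_PER_MONTH
    return by_year_end, months_to_target


for name, rate in TIERS.items():
    by_year_end, months = project(rate)
    print(f"{name:<28} {by_year_end:>9,} by year-end, 1M in {months:,.1f} months")
```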
10 Risks to flag

Two limits that get worse at scale

Risk 01 · variety
AI output saturates
When the AI is asked for fresh angles and prose across hundreds of thousands of subjects, its responses start to converge and repeat. More output is not automatically more distinct output.
Risk 02 · evidence
Claims lack support
For many subjects, few websites truly back the claims; grounding then fails. Raw volume is capped by how much trustworthy source material exists — not by how fast we write.
Implication

Past a point, pushing volume harder yields repetitive or weakly-supported articles. The cheapest real lever is improving source coverage & draft acceptance — not simply spending more on generation.

11 Realistic expectations

What we can commit to — and the path

Status quo
12–25k
Current funding & one worker, run in bursts.
Funded + ~4 workers
~440k
The realistic, fundable target for this year.
Full mandate
1,000,000
Needs funding + ~80× throughput + better source coverage.
  1. Fund AI + sourcing properly
    ~$6k/mo AI on a funded account, plus a search/fetch budget — the prerequisite for continuous running.
  2. Scale workers horizontally
    More parallel workers & keys to lift throughput toward the target rate.
  3. Raise source coverage & acceptance
    Cut the grounding-reject rate and widen sources — lowers cost and lifts every tier at once.
12 Provenance & basis

AI cost, call-count, funnel, and throughput figures are measured from the KYAS staging environment, June 2026; they are not estimates. Unit cost comes from logged and actual OpenRouter spend; the funnel and acceptance rates from a rolling 7-day window; throughput from daily vault output; queue depth from live database counts. Per-article cost ($0.04) uses real provider charges, which run ~1.8× the internally logged figure. The ~18 AI calls per accepted article span evidence extraction, drafting, and grounding. Window to year-end: 197 days from 17 Jun 2026; the 5,076/day figure is 1,000,000 ÷ 197. Infrastructure and paid-search figures, by contrast, are planning estimates. Publishing remains gated pending canary approval; no machine-written article has been released to PDBI.
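
The derived numbers in this note can be checked with the standard library. The sketch below recomputes the window, the required daily rate, and the speed-up over today's measured ~64/day; the implied logged cost is an inference from the ~1.8× ratio, not a figure stated in the source.

```python
from datetime import date

WINDOW_START = date(2026, 6, 17)
YEAR_END = date(2026, 12, 31)
TARGET_ARTICLES = 1_000_000
MEASURED_RATE_PER_DAY = 64   # today's measured throughput
BILLED_COST_USD = 0.04       # per article, from real provider charges
BILLED_OVER_LOGGED = 1.8     # billed cost relative to the logged figure

window_days = (YEAR_END - WINDOW_START).days
required_rate = TARGET_ARTICLES / window_days
speedup = required_rate / MEASURED_RATE_PER_DAY

# Inference only: what the internal logs would show per article.
implied_logged_cost = BILLED_COST_USD / BILLED_OVER_LOGGED

print(f"window:        {window_days} days")
print(f"required rate: {required_rate:,.0f} / day")
print(f"speed-up:      {speedup:.0f}x")
print(f"logged cost:   ~${implied_logged_cost:.3f} / article")
```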