Driving an LLM pipeline to ~$0 without losing accuracy

A multi-source soccer-transfer pipeline was spending about $100/mo on a paid LLM for two high-volume stages (classification and entity extraction), with no eval on the extraction stage at all. I built the eval first, then ran a bake-off. It showed a free open-weight model matching the paid model on classification (86.1% vs 86.8%, inside a ±2% noise band) and beating it on extraction (100/100/100 at n=15 vs the paid model’s 96/94/92). I then shipped a per-use-case, eval-gated, quality-ordered model router that leads every chain with a free model and keeps one paid model as an emergency tail. Classification and extraction now run at roughly $0 in production.

Wiring up cost observability surfaced a second problem: the tracing layer had been dark in production for about eight months because of one unset environment variable. And a follow-on migration taught the most important lesson of all — recounted in the addendum at the end.

The thesis: cost optimization on an LLM pipeline is an eval problem, not a procurement problem. You cannot safely switch a model on a stage you have never measured — so the cheapest path to $0 starts by building the measurement that makes the switch defensible.

The setup

The product is a data pipeline, not an agent system: scheduled multi-source ingestion, then LLM-based extraction and normalization, then a scoring layer, then predictions surfaced in a deterministic UI.

Lens	This work
End user	A soccer fan tracking their club’s transfer activity. They never see the model; they only feel it when a prediction names the wrong player.
Operator (the real “user” of this work)	The person running the pipeline. Job-to-be-done: keep it running at a sustainable cost without silently degrading what the fan reads.
The alternative today	Pay about $100/mo and hope quality holds, because only the cheap front gate (classification) was ever evaluated. The stage that decides what the fan actually reads ran unmeasured.
Trigger	Spend hit about $10 in under three days (a ~$100/mo run-rate) against a sub-$50/mo budget, while a free-model opportunity was open.

The constraint that made this hard: you cannot swap a model on a stage you have not measured. Classification had a 151-case golden set; extraction had none. So the first real unit of work was not “find a cheaper model” — it was “build the eval that lets you switch models without shipping wrong players.”

The arc, in five acts

Act 1 — Measure the cost, expose the waste

Before touching models, instrument the spend. The headline waste was thinking tokens: with reasoning on, the paid model spent about 373 thought tokens per call (~83% of output, billed as output) to buy about 2 percentage points of accuracy (86.8% thinking-on vs 84.8% thinking-off). Total tokens dropped from 520 to 178 with thinking off. That is roughly 3x the token cost for a gain inside the run-to-run noise band. It is invisible without measurement, and it is the first thing an eval-first discipline catches.
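The arithmetic behind that claim can be checked in a few lines. This is a sketch that uses only the figures quoted above. Treating the token totals as per-call averages, and both accuracies as scored on the same 151-case gold set, are assumptions on my part; the variable names are illustrative.

```typescript
// Figures quoted in the text; variable names are illustrative.
const totalTokensThinkingOn = 520;
const totalTokensThinkingOff = 178;
const accuracyThinkingOn = 0.868;
const accuracyThinkingOff = 0.848;
const thinkingEvalSize = 151; // assumption: both runs were scored on the 151-case gold set

// How many times more tokens the thinking-on configuration uses.
const tokenRatio = totalTokensThinkingOn / totalTokensThinkingOff;

// How many additional gold cases the extra tokens bought.
const extraCorrectCases = Math.round(
  (accuracyThinkingOn - accuracyThinkingOff) * thinkingEvalSize,
);

// Extra tokens spent across the gold set per additional correct answer.
const extraTokensPerExtraCorrect =
  ((totalTokensThinkingOn - totalTokensThinkingOff) * thinkingEvalSize) /
  extraCorrectCases;
```

Under those assumptions the ratio comes out at about 2.9x, and the accuracy gain is three cases out of 151, which works out to roughly 17,000 extra tokens per additional correct answer.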

Act 2 — Eval-first bake-off (the load-bearing act)

Build the missing measurement, then run a six-plus-model bake-off across both stages using a generic OpenAI-compatible eval path so free providers could be tested with zero production risk.

Stage	Ground truth	Result
Classification (accept/reject a rumour)	151 hand-labeled gold	Free Gemma 86.1% = best free single model (ties a free Llama); paid model 86.8% (within the ±2% band). A free five-family oracle ceiling hit 92.7%, exceeding the prior 92.1% from six paid models.
Extraction (player to destination club)	52 validated transfers (first-ever extraction eval)	Free Gemma 100/100/100 (player/destination/both) at n=15 — at or above the paid model’s 96/94/92.
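To see how close the two classification scores are, it helps to convert them back into case counts. The sketch below assumes each percentage is a correct-count over the 151 gold cases. The binomial standard error is my own addition as a rough sampling-noise yardstick; it is not the measured run-to-run band quoted in the table.

```typescript
// Convert a reported accuracy back to a whole number of correct cases.
function correctCases(accuracy: number, n: number): number {
  return Math.round(accuracy * n);
}

// Binomial standard error of an accuracy estimate from n independent cases.
function binomialStandardError(accuracy: number, n: number): number {
  return Math.sqrt((accuracy * (1 - accuracy)) / n);
}

const classificationGoldSize = 151;
const gemmaCorrect = correctCases(0.861, classificationGoldSize);
const paidCorrect = correctCases(0.868, classificationGoldSize);
const gapInCases = paidCorrect - gemmaCorrect;
const gemmaStdErr = binomialStandardError(0.861, classificationGoldSize);
```

On those assumptions the gap is a single gold case (130 vs 131), and one standard error at this sample size is about 2.8 points, so the difference sits well inside sampling noise.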

The unlock: free models did not just approach the paid baseline — on the stage that matters most to the fan (extraction), the free model beat it. That converts “should we switch?” from an opinion into a gated, defensible decision.

Two honesty caveats I logged rather than buried: extraction n=15 is small (10 of 25 cases timed out in the sandbox — an environment artifact, not a model error), and the extraction eval has mild circularity. Both are directional floors, flagged for confirmation at higher n from production traces.

Act 3 — The eval-gated, per-use-case router

Generalize a hand-rolled three-provider nested-if fallback into a data-driven, label-keyed, per-use-case, quality-ordered chain:

const CHAINS = {
  classification: ["gemma", "groq-llama", "mistral-small", "groq-gpt-oss", "gemini-flash"],
  extraction: ["gemma", "groq-llama", "gemini-flash"],
  default: ["gemini", "openai", "claude"], // unevaluated stages stay on legacy
};

Each design choice is defensible:

Ordering is eval-gated, not guessed — chain order mirrors the bake-off; treat it as config, re-run the bake-off on a cadence.
Free leads, one paid tail — a free-tier outage can never drop the pipeline; the paid model is insurance, not the default. Cost is $0 on the nominal path.
Per-use-case, because best-model is task-dependent — Gemma wins extraction and non-English classification; the Llama wins English. One global chain would erase that.
Resilience owned by the app, not the library — every model is constructed with no library-level retries so the framework cannot silently retry the same dead provider; the router owns failover and benches a model only on quota errors. Cooldowns are label-keyed (free 5 min, paid 60 min); pacing is per-provider from measured throughput.
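The failover rules in the list above can be sketched as a small router. This is a minimal illustration, not the production code: `FallbackRouter`, `ChainMember`, and `isQuotaError` are hypothetical names, and treating an error with HTTP status 429 as a quota error is an assumption about how providers signal exhaustion. Per-provider pacing is left out to keep the sketch short.

```typescript
type CostTier = "free" | "paid";
type Clock = () => number;

interface ChainMember {
  label: string;
  tier: CostTier;
}

interface RoutedResult<T> {
  value: T;
  servedBy: string;
  fellBack: boolean;
}

// Cooldowns from the text: free models sit out 5 minutes, paid 60 minutes.
const COOLDOWN_MS: Record<CostTier, number> = {
  free: 5 * 60 * 1000,
  paid: 60 * 60 * 1000,
};

// Hypothetical quota detector: treats an error carrying status 429 as quota exhaustion.
function isQuotaError(err: unknown): boolean {
  return (
    typeof err === "object" &&
    err !== null &&
    (err as { status?: number }).status === 429
  );
}

class FallbackRouter {
  private benchedUntil: Record<string, number> = {};
  private now: Clock;

  // The clock is injectable so cooldowns can be tested without waiting.
  constructor(now: Clock = () => Date.now()) {
    this.now = now;
  }

  isBenched(label: string): boolean {
    const until = this.benchedUntil[label];
    return until !== undefined && until > this.now();
  }

  // Try chain members in order; `call` must make a single attempt (no retries).
  async execute<T>(
    chain: ChainMember[],
    call: (label: string) => Promise<T>,
  ): Promise<RoutedResult<T>> {
    let lastError: unknown = new Error("every chain member is benched");
    let position = 0;
    for (const member of chain) {
      const isLead = position === 0;
      position += 1;
      if (this.isBenched(member.label)) continue;
      try {
        const value = await call(member.label);
        return { value, servedBy: member.label, fellBack: !isLead };
      } catch (err) {
        lastError = err;
        // Only quota errors bench a model; other failures just fail over.
        if (isQuotaError(err)) {
          this.benchedUntil[member.label] = this.now() + COOLDOWN_MS[member.tier];
        }
      }
    }
    throw lastError;
  }
}
```

Because the cooldown map is keyed by label, a benched lead is skipped on later calls without being contacted at all, which is what keeps a dead provider from burning the deadline.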

Shipped behind a review pack: 0 critical / 1 high found-and-fixed (a content-sanitizer bug for one model’s array-of-parts output that would have broken JSON parsing), router smoke 12/12, all files under the line cap. Verified end-to-end in production.

Act 4 — Cost observability, and the dark-launch discovery

The next change logged which model served each call (model id, cost-tier, paid flag, fell-back flag) so a dashboard can group by model and prove the free models are serving — and catch any unexpected paid-tail fallback.
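As a rough illustration of what that per-call record enables, the sketch below groups records by model and counts paid and fallback calls. The `ModelGeneration` shape and the `summarizeModelMix` helper are hypothetical; they mirror the four fields named above rather than the actual tracing schema.

```typescript
// Hypothetical per-call record mirroring the fields listed above.
interface ModelGeneration {
  modelId: string;
  costTier: "free" | "paid";
  paid: boolean;
  fellBack: boolean;
}

interface ModelMixSummary {
  callsByModel: Record<string, number>;
  paidCalls: number;
  fallbackCalls: number;
  paidShare: number; // fraction of calls served by a paid model
}

function summarizeModelMix(records: ModelGeneration[]): ModelMixSummary {
  const callsByModel: Record<string, number> = {};
  let paidCalls = 0;
  let fallbackCalls = 0;
  for (const record of records) {
    callsByModel[record.modelId] = (callsByModel[record.modelId] ?? 0) + 1;
    if (record.paid) paidCalls += 1;
    if (record.fellBack) fallbackCalls += 1;
  }
  const paidShare = records.length === 0 ? 0 : paidCalls / records.length;
  return { callsByModel, paidCalls, fallbackCalls, paidShare };
}
```

A dashboard alert on `paidShare` rising above its usual near-zero level is one way to catch an unexpected paid-tail fallback.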

Wiring this up surfaced the second problem: the tracing layer had been dark in production for about eight months — since launch. The keys were never set in the host environment, so the trace call had been silently no-op’ing the entire time. The pipeline always worked; we’d just been blind to it.

Banked lesson: “the pipeline runs” is not the same as “we have observability.” A missing-config no-op is the most dangerous failure mode there is, because nothing crashes. The fix is a boot/CI assert on telemetry connectivity so a dark launch fails loudly on day one.
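A minimal version of that assert might look like the following. It only checks that the required keys are present and non-blank; a fuller check would also send a test trace and confirm the round trip. The function names are hypothetical, and the caller supplies whatever key names the tracing SDK actually needs.

```typescript
// Environment shape kept generic so the check can run in any runtime.
type Env = Record<string, string | undefined>;

// Return the required keys that are unset or blank.
function missingTelemetryKeys(env: Env, requiredKeys: string[]): string[] {
  return requiredKeys.filter((key) => {
    const value = env[key];
    return value === undefined || value.trim() === "";
  });
}

// Throw at boot (or in CI) instead of letting tracing silently no-op.
function assertTelemetryConfigured(env: Env, requiredKeys: string[]): void {
  const missing = missingTelemetryKeys(env, requiredKeys);
  if (missing.length > 0) {
    throw new Error("Telemetry misconfigured; unset keys: " + missing.join(", "));
  }
}
```

Called once at startup with the process environment and the SDK's key names, this turns a silent no-op into a failed deploy.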

Act 5 — The last paid stage, and a measure-first correction

Two stages still called a paid model outside the router. My first instinct was to build a 35-context eval, sweep the token cap, and drop in a free model. An expert panel and a cross-model jury overrode that: measure first, don’t design the eval blind.

The measure-first pass on 904 stored explanations changed the metric:

0/904 contained a fallback marker, so the deterministic template never fires — the LLM is load-bearing; do not drop it.
0/904 truncated; the paid model never hit the token cap. Truncation was a different model’s artifact — so the planned token-cap sweep would have been wasted effort.
Therefore the real risk is faithfulness/hallucination, not truncation. The eval’s gate metric flipped accordingly.
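A measure-first pass of this kind is only a few lines of counting. The sketch below is an assumption about how it could be done, not a record of how it was: the `StoredExplanation` shape is hypothetical, the fallback marker is passed in by the caller, and truncation is detected through an OpenAI-style `finishReason` of "length", which the stored rows may or may not carry.

```typescript
// Hypothetical shape of a stored explanation row.
interface StoredExplanation {
  text: string;
  finishReason: string; // assumed to follow the OpenAI-style "stop" / "length" values
}

interface ExplanationAudit {
  total: number;
  withFallbackMarker: number;
  truncated: number;
}

function auditExplanations(
  rows: StoredExplanation[],
  fallbackMarker: string,
): ExplanationAudit {
  let withFallbackMarker = 0;
  let truncated = 0;
  for (const row of rows) {
    // A marker in the text means the deterministic template answered instead of the LLM.
    if (row.text.indexOf(fallbackMarker) !== -1) withFallbackMarker += 1;
    // "length" means generation stopped because it hit the token cap.
    if (row.finishReason === "length") truncated += 1;
  }
  return { total: rows.length, withFallbackMarker, truncated };
}
```

If both counters come back zero, as they did on the 904 stored explanations, neither the template path nor the token cap is where the risk lives.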

The transferable point: an eval’s metric must be chosen from production evidence, not a priori. A beautifully-built eval that gates on the wrong axis is worse than none — it gives you confidence in the wrong thing.

Architectural decisions — the “defend it in an interview” table

Decision	Chosen	Rejected	Why
Cost strategy	Eval-first bake-off, then switch	Just swap to a cheaper model	Extraction was unmeasured; a blind swap risks silently-wrong output. Measurement is the prerequisite.
Model selection	Per-use-case quality-ordered chains	One global model	Best model is task- and language-dependent. A global chain erases measured differences.
Topology	Free-leads, one paid tail	Paid-first; or free-only	A free-tier outage can never drop the pipeline; paid is insurance.
Routing impl	Data-driven label-keyed registry	Keep the nested-if cascade	A cascade doesn’t scale past three providers; a registry is auditable, reorderable config.
Resilience	No library retries + app-owned failover	Library-layer retries	The library retries the same failing provider while the deadline burns. The router must own failover.
Observability	Record model+cost per call; assert connectivity	Leave telemetry implicit	A silent no-op (the eight-month dark trace) is invisible without an explicit assert.

Measured results

Metric	Value
Monthly LLM cost (before)	about $100/mo
Classification + extraction cost (after)	about $0
Classification — free model	86.1% (ties best free model; vs paid 86.8%, within ±2%)
Classification — free five-family oracle	92.7% (exceeds prior 92.1% from six paid models)
Extraction — free model	100 / 100 / 100 at n=15 (at or above paid 96/94/92)
Thinking-token waste (paid model)	about 83% of output tokens for about +2 points of accuracy
Telemetry dark period	about 8 months (one unset env var)

What I’d do differently at scale

Assert telemetry on boot, from day one. A five-line connectivity check would have failed loudly at launch. Observability that can silently no-op is worse than none — it manufactures false confidence.
Automate the bake-off as a scheduled job. Chain order is “eval-gated config,” but nothing re-runs the eval. Free model ids churn; accuracy drifts. At scale this is a nightly job that opens a PR when the optimal order changes.
Health-check model ids on boot, pinned to families. A rotated free id fails silently. Resolve family to current-best-live-id with a smoke call at startup.
Grow the eval beyond a directional floor. 151 classification labels and n=15 extraction are enough to decide a switch but not to defend an SLA.
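The scheduled bake-off idea can be sketched as a pure function that turns fresh scores into a proposed chain. This is one possible policy, not the one in production: `proposeChainOrder` is a hypothetical helper, it keeps paid models at the tail, and it only proposes a change when the new lead beats the current lead by more than the noise band.

```typescript
interface BakeOffScore {
  label: string;
  tier: "free" | "paid";
  accuracy: number; // fraction correct on the gold set
}

interface ChainProposal {
  chain: string[];
  changed: boolean;
}

function proposeChainOrder(
  currentChain: string[],
  scores: BakeOffScore[],
  noiseBand: number,
): ChainProposal {
  const byAccuracyDesc = (a: BakeOffScore, b: BakeOffScore) => b.accuracy - a.accuracy;
  const free = scores.filter((s) => s.tier === "free").sort(byAccuracyDesc);
  const paid = scores.filter((s) => s.tier === "paid").sort(byAccuracyDesc);
  // Free models lead in measured order; paid models stay at the tail.
  const proposed = free.concat(paid).map((s) => s.label);

  // Models missing from the latest scores count as 0 (for example, a rotated id).
  const accuracyOf = (label: string | undefined): number => {
    for (const s of scores) if (s.label === label) return s.accuracy;
    return 0;
  };
  // Ignore lead changes that sit inside the noise band.
  const leadGain = accuracyOf(proposed[0]) - accuracyOf(currentChain[0]);
  if (leadGain <= noiseBand) return { chain: currentChain, changed: false };
  return { chain: proposed, changed: true };
}
```

A nightly job would open a PR only when `changed` is true, so a reshuffle inside the noise band does not churn the config.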

System diagram

                  TRANSFER PIPELINE  (data pipeline, not agentic)

  RSS / multi-source --> [ CLASSIFY ] --> [ EXTRACT ] --> [ SCORE ] --> [ AGGREGATE ] --> UI
   ingestion (cron)      accept/reject    player->club                    prediction       |
                              |                |                                     [ HOVER EXPLAIN ]
                              v                v                                     (last paid stage,
                    executeWithFallback(  executeWithFallback(                        outside the router)
                      CHAINS.classification ) CHAINS.extraction )
                              |                |
              +---------------+----+    +------+-------------+
              | gemma     FREE     |    | gemma     FREE     |  <- LEAD (eval-gated)
              | groq-llama FREE    |    | groq-llama FREE    |
              | mistral   FREE     |    | gemini    PAID     |  <- emergency tail
              | groq-oss  FREE     |    +--------------------+
              | gemini    PAID     |  <- emergency tail
              +--------------------+
                              |
        model-pricing --> recordModelGeneration --> TRACING
        (free=$0)          (model mix + cost + fell-back flag)
                            (was DARK ~8 months -> now asserted)

Five interview follow-ups

Q1. Gemma was 86.1% and the paid model 86.8% — inside your own ±2% band. How do you justify switching to a nominally-worse model, and how would you know if you were wrong?

The classification gap is statistically indistinguishable from noise, so there the case is “equivalent quality at a fraction of the cost.” But I wouldn’t hang the decision on classification alone: on extraction, the stage the fan actually reads, the free model was not a tie. It scored 100/100/100 against the paid model’s 92.3% both-correct (the 92 in 96/94/92). I de-risk being wrong four ways: a paid emergency tail in every chain; a downstream DB-validation layer that overrides the club with database truth regardless of model; live model-mix + cost monitoring; and a one-line revert, since chain order is config. The honest caveat I’d volunteer: extraction n=15 is a directional floor, so “verify at higher n in prod” is an explicit follow-up.

Q2. Your free oracle hit 92.7% but you ship a single model at 86.1%. Why leave points on the table?

Because an oracle ceiling is an upper bound, not an achievable score — it assumes you could magically pick the right model per item, which is the original problem restated. The realizable approximation is a majority vote, but that’s 3-5x the calls and latency; not justified for a commodity, high-volume stage. So I ship a single best model with fallback for the nominal path and reserve voting for low-volume, high-cost-of-error work. The oracle’s real job in my design isn’t a target; it’s a feasibility signal that the free families are diverse — the prerequisite for any future vote to beat a single member.
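The difference between an oracle ceiling and a realizable vote is easy to see in code. This is a toy sketch with hypothetical function names; `predictions[m][i]` is model m's accept/reject answer for item i.

```typescript
// Oracle: an item counts as correct if ANY model got it right.
function oracleAccuracy(predictions: boolean[][], gold: boolean[]): number {
  let hits = 0;
  gold.forEach((truth, i) => {
    if (predictions.some((model) => model[i] === truth)) hits += 1;
  });
  return hits / gold.length;
}

// Majority vote: an item counts as correct if most models agree with the gold label.
function majorityVoteAccuracy(predictions: boolean[][], gold: boolean[]): number {
  let hits = 0;
  gold.forEach((truth, i) => {
    const accepts = predictions.filter((model) => model[i] === true).length;
    // Ties (possible with an even number of models) fall to reject.
    const vote = accepts * 2 > predictions.length;
    if (vote === truth) hits += 1;
  });
  return hits / gold.length;
}
```

The oracle needs only one model to be right per item, while the vote needs most of them to be, so the oracle score is an upper bound that a vote reaches only when the models' errors rarely overlap.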

Q3. You found telemetry dark for ~8 months. How does that happen, and how do you make it structurally impossible to recur?

It happens because of the most insidious failure mode in observability: a missing config that no-ops instead of crashing. With no keys, the SDK doesn’t throw — the trace call just silently returns. The direct cost was modest; the real cost was decision-blindness — I couldn’t confirm the router was serving free models. The structural fix is a boot/CI assertion on telemetry connectivity wired into the deploy smoke, so a dark launch fails loudly on day one. The generalizable principle: any dependency that degrades to a silent no-op must have an explicit liveness check, because “it didn’t crash” is not evidence it’s working.

Q4. Your first instinct was to build a 35-context eval; an expert panel and a cross-model jury overrode you to “measure first.” What did measuring reveal?

Two assumptions collapsed: 0/904 explanations ever truncated on the paid model (truncation was a different model’s artifact, so the planned token-cap sweep would have been wasted effort), and 0/904 ever fell back to the template (so “just drop the LLM” was off the table). With both gone, the real risk resolved to faithfulness, and the gate metric flipped to claim-grounding. The transferable lesson: an eval’s metric must be chosen from production evidence, not assumed.

Q5. Free models rate-limit aggressively and their ids churn. Haven’t you traded a cost problem for a reliability one?

Only if free were the only tier, and it isn’t. Every chain ends in a paid tail that is not subject to free-tier rate limits; failover moves to the next member instead of retry-storming a dead provider; label-keyed cooldowns and per-provider pacing keep us under the caps; only quota errors bench a model; and a downstream DB-validation layer backstops extraction quality regardless of which model answered. The one genuinely open gap I’d name unprompted is id churn, and the fix is health-check-on-boot pinned to families. So the reliability surface is real but bounded and mostly mitigated, and the one unmitigated edge is named with its fix.

Addendum — the $0 router quietly killed ingestion

The router above optimized cost and accuracy, the two dimensions the eval measured. About seven days after a follow-on migration put the free model on the scrape ingestion hot path, rumour ingestion collapsed: across all feeds it fell from 55-135 rumours/day to 1-3/day. The “$0 pipeline” had quietly become a 0-rumour pipeline, and the polished cost case study never saw it coming, because the failure was on two axes the eval did not measure:

Throughput under a hard time budget. The free model was 16-45s/call vs the paid model’s 2-3s. The scrape cron has a 300s ceiling; at ~78s/rumour it timed out and inserted almost nothing. One feed measured at 362s. A per-output cost+quality eval is blind to wall-clock-under-a-cron-budget.
Reliability under load. In production the free model intermittently fetch-failed under burst, adding failover tax an offline eval scoring a paced sample never exercises.
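The time-budget failure is simple arithmetic once it is written down. The sketch below uses only figures quoted in this addendum; the helper names are hypothetical.

```typescript
// How many items fit in one cron run at a given per-item latency.
function maxItemsPerRun(budgetSec: number, secPerItem: number): number {
  return Math.floor(budgetSec / secPerItem);
}

// Whether a measured feed run finishes inside the cron ceiling.
function fitsBudget(measuredSec: number, budgetSec: number): boolean {
  return measuredSec <= budgetSec;
}

const cronCeilingSec = 300;
const itemsAtFreeModelPace = maxItemsPerRun(cronCeilingSec, 78); // ~78s per rumour
const slowFeedFits = fitsBudget(362, cronCeilingSec); // feed timed on the slow free model
const fixedFeedFits = fitsBudget(6.8, cronCeilingSec); // same feed after the fix
```

At about 78s per rumour, a 300s run can finish at most three rumours before the ceiling, and the 362s feed cannot finish at all. A check like this belongs in the swap gate next to cost and accuracy.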

The fix swapped the hot chains to a fast free Llama lead with a paid tail, and the same feed re-timed from 362s to 6.8s. The apparent “accuracy gap” that had kept the Llama out of the original bake-off turned out to be prompt/label drift, not a capability gap: diffing the misses against the strong model showed both models were correctly rejecting completed deals that the gold wanted accepted. Fixing the prompt lifted the Llama from 85.4% to 89.4%, and lifted every other model too.

The meta-lesson: an eval only protects the axes it measures. A model-swap eval must include throughput-under-the-real-time-budget and reliability-under-load, not just cost and quality — a $0, high-faithfulness model that takes 45s/call and fails under burst is worse than the thing it replaced. And when a fast model “underperforms,” diff its misses against the strong model with both models’ reasoning before blaming the model — the gap is often a prompt or label drift you can fix once for everyone.