
from 70% to 95%: the unglamorous path to accuracy
how error analysis, ground truth fixes, and deterministic guards turned a brittle extraction system into a reliable one.
prompt engineering gets you surprisingly far.
you write a solid system prompt, add a handful of examples, iterate on obvious edge cases, and suddenly your ai system is doing something useful. for us, that meant extracting electronics data from component datasheets in a production extraction system: pin assignments, voltage ranges, physical dimensions, package information, all feeding downstream footprint generation and manufacturing workflows.
the first version worked. roughly 70% of the time, the outputs were correct.
we shipped it.
and then we stalled.
every new prompt tweak fixed one failure mode and quietly introduced another. accuracy bounced around. regressions appeared days after changes that initially looked like improvements. the system felt brittle in a way that wasn't obviously tied to any single mistake.
this is the point where most teams keep iterating on prompts. try a different model. add more examples. rewrite instructions more carefully. we did some of that too. it didn't move the needle.
the real issue wasn't the prompt. it was that we didn't yet understand our failures.
this is a production extraction system#
it's worth being explicit about the setting, because it shapes every tradeoff.
this system isn't answering ad-hoc questions or generating summaries. it's extracting structured data that flows into coordinate calculations, footprint generation, and manufacturing decisions. mistakes are not academic. a single wrong value can mean a dead board, a respin, or weeks of lost time.
that constraint ruled out hand-wavy notions of correctness. we needed outputs that were numerically precise, consistent across parts and vendors, explainable when questioned, and conservative when ambiguity existed.
that context is why the rest of this article looks the way it does.
error analysis is where accuracy actually comes from#
there's a line Hamel Husain has repeated for years that eventually clicked for us: teams that succeed with llms spend most of their time on error analysis and evaluation, not on prompt engineering.
we resisted this at first because it felt unglamorous. looking at mismatches in a spreadsheet is slower and less satisfying than writing new code. but once we forced ourselves to examine every disagreement between extracted values and our reference data, patterns emerged quickly.
we stopped tracking a single accuracy number and started logging every extraction result. every failure was investigated and categorized. over nearly four months, we reviewed hundreds of extractions manually.
the three failure categories#
almost every mismatch fell into one of three buckets:
| part number | spec | expected | extracted | root cause | status |
|---|---|---|---|---|---|
| TPS7A0225 | V_out | 2.25V | 2.5V | GT wrong - nomenclature | Fixed GT |
| IR38063 | Vin(min) | 5.3V | 1.5V | PVin vs Vin confusion | Added prompt rule |
| NCP6336 | MOSFET | P+N | N×2 | Reasoning contradicts output | Added guard |
| TLV62595 | MOSFET | N×2 | P+N | RDS similarity ignored | Refined guard |
this classification turned out to be critical.
sometimes the ground truth was simply wrong. historical spreadsheets encoded assumptions no one remembered making. part nomenclature was misread. units were inconsistent. blindly trusting reference data turned out to be a mistake.
sometimes the requirement itself was ambiguous. did we want typical values or maximum values? default configurations or full capabilities? nominal dimensions or tolerance-safe ones?
and sometimes the system genuinely extracted the wrong value. only this third category represents a true extraction failure. the first two are human problems, and ignoring them makes accuracy improvements impossible.
only after separating these cases did the system's behavior start to make sense.
evaluation doesn't just measure correctness, it defines it#
one of the biggest surprises was how often evaluation forced us to clarify what "correct" actually meant.
early on, we assumed extracting dimensions would be straightforward. evaluation quickly proved otherwise. some datasheets list typical values, others list maximum values. both are faithful readings of the document.
we had to choose.
we eventually decided to prefer maximum values because they're safer for manufacturing. that decision forced us to update a large amount of ground truth that had quietly mixed conventions.
the same thing happened with switching frequency. some datasheets list the default operating frequency, others list the maximum supported frequency when externally driven. the model wasn't wrong. the requirement was incomplete.
evaluation didn't just expose mistakes. it exposed decisions we hadn't realized we were deferring.
once those decisions were made explicit, accuracy improved without touching the model.
how fixes actually work in practice#
once failures were visible, fixes fell into two clear buckets: prompt rules and guards.
prompt rules handle systematic misunderstandings#
when the model consistently misunderstood a concept, we encoded that explicitly.
for example: parts with i2c, spi, or pmbus interfaces should be treated as adjustable, not fixed-output. voltage limits should be extracted across the full operating temperature range, not just at 25°c. on dual-rail devices, vin is not interchangeable with pvin.
over time, the prompt became less elegant and more historical. it accumulated rules not because we were being clever, but because each rule corresponded to a failure we had actually seen.
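a minimal sketch of how failure-driven rules might accumulate in a system prompt. the rule wording and helper names here are illustrative, not our exact prompt; the point is that each rule carries a comment tying it back to an observed failure:

```python
# each rule records the failure that motivated it, so the prompt stays
# traceable to the eval log (rule text and comments are illustrative)
EXTRACTION_RULES = [
    # motivated by: adjustable parts misclassified as fixed-output
    "parts with an I2C, SPI, or PMBus interface are adjustable-output, not fixed-output.",
    # motivated by: voltage limits taken only at 25C
    "extract voltage limits across the full operating temperature range, not just at 25C.",
    # motivated by: PVin vs Vin confusion on dual-rail devices
    "on dual-rail devices, Vin is not interchangeable with PVin.",
]

def build_system_prompt(base_prompt: str) -> str:
    """append accumulated failure-driven rules to the base system prompt."""
    rules = "\n".join(f"- {rule}" for rule in EXTRACTION_RULES)
    return f"{base_prompt}\n\nrules learned from past failures:\n{rules}"
```

keeping the rules in a list, rather than editing prose by hand, makes it easy to see that every line of the prompt earns its place.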
guards handle what prompts can't reliably prevent#
some failure modes persisted even with clear instructions.
the most common was reasoning-output mismatch: the model would correctly identify a value in its reasoning, then output something different.
no amount of rewording reliably fixed this. so we stopped trying and added deterministic checks instead.
guards check things the model shouldn't be trusted to get right every time: does the output agree with the model's own reasoning? does the cited source actually contain this value? are the units and ranges physically plausible?
each guard was added only after the same failure appeared multiple times.
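the guards themselves can be simple deterministic functions. a sketch of the three checks above, with illustrative names and a deliberately loose plausibility range; real guards would be tuned per spec:

```python
import re

def guard_reasoning_agreement(reasoning: str, value: str) -> bool:
    """reject outputs whose value never appears in the model's own reasoning."""
    return value.lower() in reasoning.lower()

def guard_source_contains(source_text: str, value: str) -> bool:
    """reject values that cannot be found in the cited datasheet excerpt."""
    return value.lower() in source_text.lower()

def guard_plausible_voltage(value: str, lo: float = 0.0, hi: float = 100.0) -> bool:
    """reject voltages that don't parse or fall outside a plausible range."""
    m = re.match(r"^\s*(\d+(?:\.\d+)?)\s*(m?v)\s*$", value.lower())
    if not m:
        return False
    volts = float(m.group(1)) * (0.001 if m.group(2) == "mv" else 1.0)
    return lo <= volts <= hi
```

none of these checks is clever. that's the point: a guard only needs to catch the specific, recurring failure it was written for.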
semantic equivalence is mandatory for meaningful metrics#
not every mismatch is a real error.
"21 v" and "21 volts" are the same value. "1500 mw" and "1.5 w" are the same value. "qfn-22" and "qfn-22l" refer to the same package. 4.0 mm and 4.1 mm are often within tolerance.
without domain-aware equivalence rules, evaluation becomes adversarial. correct behavior is penalized, accuracy numbers stop reflecting reality, and learning slows.
we added semantic equivalence checks that encode these domain rules explicitly and return both a verdict and a reason. this eliminated a large class of false negatives and made accuracy numbers trustworthy again.
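a simplified sketch of what such a checker can look like. this version only normalizes a few units and falls back to string comparison; package aliases like "qfn-22" vs "qfn-22l" would need their own rules, and the tolerance value is an assumption:

```python
import re

# scale factors to normalize a value to its base unit (illustrative subset)
_SCALES = {"mv": 1e-3, "v": 1.0, "mw": 1e-3, "w": 1.0, "mm": 1.0}

def _parse(value: str):
    """parse '1500 mw' -> (1.5, 'w'); return None if not a number+unit."""
    m = re.match(r"^\s*(\d+(?:\.\d+)?)\s*([a-z]+)\s*$", value.strip().lower())
    if not m:
        return None
    num, unit = float(m.group(1)), m.group(2)
    unit = {"volts": "v", "watts": "w"}.get(unit, unit)  # spell out long forms
    return (num * _SCALES[unit], unit[-1]) if unit in _SCALES else None

def equivalent(expected: str, extracted: str, rel_tol: float = 0.03):
    """domain-aware comparison returning both a verdict and a reason."""
    a, b = _parse(expected), _parse(extracted)
    if a and b and a[1] == b[1]:
        if abs(a[0] - b[0]) <= rel_tol * max(abs(a[0]), abs(b[0]), 1e-9):
            return True, f"{expected!r} and {extracted!r} agree within tolerance"
        return False, f"{a[0]} vs {b[0]} differs by more than {rel_tol:.0%}"
    if expected.strip().lower() == extracted.strip().lower():
        return True, "exact match after normalization"
    return False, "no equivalence rule matched"
```

returning the reason alongside the verdict matters: when a mismatch is flagged, the log immediately says why, which keeps the eval loop fast.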
accuracy didn't improve because the system got smarter. it improved because the metric stopped lying.
the eval log is the real artifact#
our evaluation log now contains hundreds of rows. each row records what we expected, what we extracted, why they differed, and what we changed as a result.
| category | what it meant | typical action |
|---|---|---|
| ground truth wrong | the reference data was incorrect or encoded a hidden assumption | fix or update ground truth |
| convention conflict | multiple valid readings existed, but we hadn't decided which one we wanted | define and document a convention |
| system error | the model or pipeline genuinely extracted the wrong value | add a prompt rule or guard |
the prompt is downstream of this file. the guards are downstream of this file. when someone asks why the system behaves a certain way, the answer is always traceable to a specific historical failure.
if that knowledge isn't written down, the system is fragile by definition.
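one way to make that traceability concrete is to give each log row an explicit schema. the field names here are illustrative, but they mirror the columns described above:

```python
from dataclasses import dataclass
from enum import Enum

class Category(Enum):
    GROUND_TRUTH_WRONG = "ground truth wrong"
    CONVENTION_CONFLICT = "convention conflict"
    SYSTEM_ERROR = "system error"

@dataclass
class EvalRow:
    part_number: str
    spec: str
    expected: str
    extracted: str
    category: Category
    root_cause: str
    action_taken: str  # e.g. "fix ground truth", "add prompt rule", "add guard"
```

with a schema like this, questions such as "how many failures were actually ground truth bugs?" become one-line queries instead of archaeology.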
what actually moved accuracy from 70% to 95%#
accuracy improved when feedback loops got tighter, not when prompts got longer.
many apparent model failures were actually specification failures. once we clarified conventions, the model stopped oscillating between equally valid answers.
ground truth turned out to be a frequent source of bugs. assuming it was correct slowed progress.
prompts handled patterns. guards handled inevitabilities. reasoning-output checks were especially high-leverage.
semantic equivalence mattered more than exact matching. without it, metrics were misleading.
progress was additive. there was no breakthrough moment, just dozens of small fixes accumulating over time.
the unglamorous truth#
getting to 95% accuracy required manually inspecting hundreds of failures and fixing them one category at a time. nearly four months of consistent work. there were no shortcuts, no magic prompts, and no model swaps that changed everything overnight.
the spreadsheet was the work.
once we accepted that, accuracy stopped feeling mysterious. it became operational.
for more on building evaluation systems for AI products, see Hamel Husain's Field Guide to Rapidly Improving AI Products and Your AI Product Needs Evals.
