When LLM Extraction Stops Paying for Itself

Firecrawl charges 1 credit to scrape a page and 5 credits to extract structured fields from the same page (Firecrawl pricing, 2026). That's a 5x markup for the same HTML, sent through a model.

The pitch is real: describe what you want, get JSON back, no selectors to maintain. For unstable layouts and one-off targets, it earns the markup. But for the production pipeline that pulls 500K product pages a day off the same five retailers, it doesn't.

We've watched teams ship LLM-default extraction, hit invoice month, and start looking for a way out. The fix usually isn't to abandon LLMs. It's to put them in the right place in the pipeline.

The math gets ugly fast

Take Firecrawl at the cheap end. Scrape plus AI extract is 6 credits per page without crawl, 7 credits with crawl (ScrapeGraphAI breakdown, 2026). 100K pages a day on their growth tier runs roughly $21K per month before retries, before you've paid for a single proxy.

Run your own LLM pipeline and the math shifts but doesn't get small. GPT-4o is $2.50 per million input tokens and $10 per million output (PricePerToken, 2026). A product page after markdown conversion runs 4K-8K input tokens. Call it 6K input, 200 output for a JSON blob. That's about $0.017 per page ($0.015 of input plus $0.002 of output). At 100K pages a day, that's roughly $1,700 daily, about $51K monthly, for a job CSS selectors do for free after one setup.

That's the cheap model. Move to Claude Sonnet 4.6 ($3 input, $15 output; PE Collective, 2026) and the bill climbs by roughly a quarter at the same token counts. Move to a reasoning model and tack on a 3-10x penalty depending on how much it thinks before answering.
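For re-running this arithmetic with your own numbers, here is a minimal sketch. It assumes the per-million-token prices quoted above and the 6K-input, 200-output page size; `ModelPrice`, `cost_per_page`, and `daily_cost` are illustrative names, not part of any vendor SDK, and a 30-day month is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPrice:
    """Dollar price per one million tokens (figures quoted in the text)."""
    input_per_million: float
    output_per_million: float


def cost_per_page(price: ModelPrice, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one extraction call."""
    total = (input_tokens * price.input_per_million
             + output_tokens * price.output_per_million)
    return total / 1_000_000


def daily_cost(price: ModelPrice, pages_per_day: int,
               input_tokens: int = 6_000, output_tokens: int = 200) -> float:
    """Dollar cost of one day of per-page LLM extraction, ignoring retries."""
    return pages_per_day * cost_per_page(price, input_tokens, output_tokens)


PRICES = {
    "GPT-4o": ModelPrice(2.50, 10.00),
    "Claude Sonnet": ModelPrice(3.00, 15.00),
}

for name, price in PRICES.items():
    per_day = daily_cost(price, pages_per_day=100_000)
    # Assumes a 30-day month.
    print(f"{name}: ${per_day:,.0f}/day, ${per_day * 30:,.0f}/month")
```

Input tokens dominate the result, so the page size after markdown conversion matters far more than the size of the JSON you ask for.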

None of that counts the failures. A 3-5% hallucination rate sounds harmless until you do the arithmetic. On 100K pages a day, that's 3,000-5,000 wrong records flowing into your warehouse, looking exactly like the right ones because the model returned them confidently. As DataHen put it: "It is not that AI gets it wrong sometimes. It is that it gets it wrong confidently." (DataHen, 2026).

What the experienced teams actually do

Read the docs from vendors who actually run scrapers in production and the pattern is consistent: hybrid. Use the LLM to figure out the page once, then run cheap deterministic code for everything that follows.

Zyte spells it out in their own documentation: "Instead of using an LLM per page, use your LLM to generate CSS selectors for the desired fields given the raw HTML of a first page, and use those selectors to parse all other pages." (Zyte LLM guide, 2026). Apify recommends the same flow in their 2026 guide: try CSS selectors first, fall back to LLM when they fail (Apify 2026 guide). A DEV Community write-up of a production rollout captured the architecture exactly: cached selector path costs nothing, LLM only fires when validation fails (DEV.to, 2026).

So the production split looks like this:

LLM bootstraps the selector (one call per target, fractions of a cent)
The selector runs against every page (free)
A validator (usually a regex or a presence check) catches drift
Drift triggers a re-bootstrap, weeks or months later

Cost per page collapses from somewhere between half a cent and two cents, depending on the vendor or model, to well under $0.0001. Quality goes up because deterministic parsing doesn't hallucinate. And you spend tokens on the work LLMs are actually good at: reading novel structure, not parroting structure you've already mapped.
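The bootstrap, run, validate, re-bootstrap loop described above can be sketched roughly like this. It is an illustration under stated assumptions, not any vendor's implementation: `HybridExtractor` and its three callables are hypothetical names, the LLM call is a stub, and a regex stands in for a CSS selector so the sketch runs on the standard library alone.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional

Record = Dict[str, Optional[str]]
Selectors = Dict[str, str]  # field name -> selector (a regex in this sketch)


@dataclass
class HybridExtractor:
    bootstrap: Callable[[str], Selectors]         # hypothetical LLM call, one per target
    apply: Callable[[str, Selectors], Record]     # deterministic parse, runs on every page
    validate: Callable[[Record], bool]            # cheap drift check
    cache: Dict[str, Selectors] = field(default_factory=dict)
    llm_calls: int = 0

    def _rebootstrap(self, target: str, html: str) -> None:
        self.cache[target] = self.bootstrap(html)
        self.llm_calls += 1

    def extract(self, target: str, html: str) -> Record:
        fresh = target not in self.cache
        if fresh:
            self._rebootstrap(target, html)
        record = self.apply(html, self.cache[target])
        # Drift: cached selectors no longer fit, so regenerate once and retry.
        if not self.validate(record) and not fresh:
            self._rebootstrap(target, html)
            record = self.apply(html, self.cache[target])
        return record


def fake_bootstrap(html: str) -> Selectors:
    """Stand-in for the LLM: returns a pattern that fits the current layout."""
    if 'class="price"' in html:
        return {"price": r'<span class="price">([^<]+)</span>'}
    return {"price": r'<div data-price="([^"]+)"'}


def regex_apply(html: str, selectors: Selectors) -> Record:
    record: Record = {}
    for name, pattern in selectors.items():
        match = re.search(pattern, html)
        record[name] = match.group(1) if match else None
    return record


def price_is_valid(record: Record) -> bool:
    price = record.get("price")
    return price is not None and re.fullmatch(r"\$\d+(\.\d{2})?", price) is not None


extractor = HybridExtractor(fake_bootstrap, regex_apply, price_is_valid)
pages = [
    '<span class="price">$19.99</span>',  # first page: triggers the bootstrap
    '<span class="price">$5.00</span>',   # same layout: cache hit, no LLM call
    '<div data-price="$21.50"></div>',    # layout changed: validator catches drift
]
for html in pages:
    print(extractor.extract("shop-a", html), "llm_calls =", extractor.llm_calls)
```

Passing the parser and validator in as callables keeps the fetch, the parse, and the model call separable, so a real CSS-selector library or a real LLM client can be swapped in without touching the control flow.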

Where LLMs earn the bill anyway

This isn't an anti-LLM piece. Plenty of extraction jobs are exactly where the model is the right tool and the credit math works:

Unstable layouts that change weekly. Selectors that break every Tuesday cost more in engineering time than LLM extraction costs in tokens. Run the model.
Long-tail targets you'll never visit twice. No payoff for writing a selector. Run the model.
Unstructured content where the output is itself a summary. Job descriptions to skills, articles to claims, reviews to sentiment. Selectors can't help. Run the model.
Pages with optional fields scattered across layout variants. A single template with twenty conditional renders is exactly where LLMs beat regex chains.

Look at your pipeline. Sort targets by volume. The top 20% by request count almost always have stable structure (that's why they're the top 20%: you integrated them deliberately). They're selector candidates. The long tail is where the model belongs.
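One way to make that sort concrete is sketched below. It assumes "top 20%" means the top fifth of targets when ranked by request count; `split_by_volume` and the sample target names and counts are hypothetical.

```python
from typing import Dict, List, Tuple


def split_by_volume(request_counts: Dict[str, int],
                    top_percent: int = 20) -> Tuple[List[str], List[str]]:
    """Rank targets by request count and split off the top slice.

    Returns (selector_candidates, llm_candidates).
    """
    if not 0 < top_percent <= 100:
        raise ValueError("top_percent must be in (0, 100]")
    ranked = sorted(request_counts, key=lambda t: request_counts[t], reverse=True)
    # Integer ceiling, so a non-empty list always yields at least one candidate.
    cutoff = (len(ranked) * top_percent + 99) // 100
    return ranked[:cutoff], ranked[cutoff:]


# Made-up daily request counts per target.
counts = {
    "retailer-a": 120_000,
    "retailer-b": 95_000,
    "blog-x": 40,
    "forum-y": 12,
    "shop-z": 7,
}
selector_candidates, llm_candidates = split_by_volume(counts)
print("selectors:", selector_candidates)
print("model:", llm_candidates)
```

The cutoff is a starting point for review rather than a rule: a high-volume target with an unstable layout still belongs on the model side.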

What this means for your stack

The vendor pitch in 2026 wants you to default to LLM extraction. Credit pricing makes that look reasonable on small projects. It stops being reasonable when you scale, the same way proxy pool size stopped predicting real success once the underlying signal broke.

Three takeaways for teams building real pipelines:

Separate the fetch from the parse. If your scraping vendor only returns LLM-extracted JSON, you can't fall back to selectors when the bill arrives. Pick infrastructure that hands you HTML and lets you pick the extraction path.
Cache aggressively at the selector level. Generated selectors are reusable across thousands of pages. The expensive call is the generation, not the use.
Measure cost per record, not per page. A pipeline that costs $0.001/page but ships 5% bad records can end up costing more than one that costs $0.005/page and ships clean data, once you price in what each bad record costs downstream. Storage, downstream queries, and the eventual cleanup all carry weight.
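A minimal sketch of the cost-per-record comparison, assuming one record per page and a fixed downstream cost for every bad record. The $0.10 cleanup figure and the function name are illustrative assumptions, not measured values.

```python
def cost_per_good_record(cost_per_page: float, bad_rate: float,
                         cleanup_cost: float) -> float:
    """Total spend per usable record, assuming one record per page."""
    if not 0.0 <= bad_rate < 1.0:
        raise ValueError("bad_rate must be in [0, 1)")
    # Extraction spend plus the expected downstream cost of bad records.
    spend_per_page = cost_per_page + bad_rate * cleanup_cost
    # Only the good records count toward the denominator.
    return spend_per_page / (1.0 - bad_rate)


CLEANUP_COST = 0.10  # assumed dollars to find and fix one bad record

cheap_but_noisy = cost_per_good_record(0.001, 0.05, CLEANUP_COST)
pricier_but_clean = cost_per_good_record(0.005, 0.0, CLEANUP_COST)
print(f"cheap but noisy:   ${cheap_but_noisy:.4f} per good record")
print(f"pricier but clean: ${pricier_but_clean:.4f} per good record")
```

With a cleanup cost of zero the cheap pipeline still wins on this measure, so the comparison hinges on what a bad record costs you downstream. Under these assumptions the break-even sits at about $0.075 per bad record.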

Pick the boring half

The LLM-extraction default is the right shape for a demo and the wrong shape for production. The teams getting it right are the ones treating LLMs as a tool for understanding a page, not a tool for reading a page. Boring deterministic code still wins the volume game in 2026; the model wins the novelty game. Both belong in the stack.

At FourA, Single and Browser hand back the raw response (HTML, rendered DOM, headers, body) and stop there. Whether you parse with selectors, send it to a model, or do both, that's your call. We don't tack on a credit multiplier for extraction we didn't do.