Building a B2B Company Enrichment Pipeline

The Challenge

You're building a B2B SaaS product. Your customers upload a list of company names. They expect a clean record back: revenue band, headcount, tech stack, funding round, key contacts, recent news. They expect it within minutes, not days. And they expect it to be right.

The data exists. It lives on Crunchbase, on company About pages, on LinkedIn company pages, on Google Maps, on Glassdoor, on regional business registries, on TechCrunch archives. The problem is getting to it reliably.

Every source breaks differently. Crunchbase serves a heavy client-side app that re-renders if it suspects a bot. LinkedIn rate-limits aggressively and changes its DOM faster than you can patch selectors (one popular community post benchmarks a vanilla Python scraper at about 50 profiles before the anti-bot wall drops). Company websites range from static HTML to single-page apps that need a full browser to even show their content. Regional directories rotate layouts every quarter and gate their content behind country-specific blocks. According to a 2026 industry report from GroupBWT, 10–15% of crawlers in some verticals need weekly fixes just to keep up with anti-bot updates and DOM drift.

So your enrichment pipeline starts as a clean five-source design. Six months later, it's a tangle of half-broken scrapers, retry queues, and a Slack channel called #scraper-alerts that nobody opens anymore (we've written about the hidden cost of maintaining your own scrapers before). Data-quality complaints pile up in your support queue. Your team starts joking that the company name should've been "Five Scrapers and a Prayer."

The Approach

Forget the scrapers for a minute. The hard part of enrichment isn't extraction. It's routing: deciding which source needs which tool, which proxy, which retry policy, and what counts as a "good" response.

A platform like FourA gives you three tools that map directly to the three classes of source you're going to hit.

Static HTML directories and registries. Most regional business registries and a lot of older B2B directories are server-rendered. They want a fast, low-overhead HTTP request from a clean IP. That's Single: one URL in, one response out. Add unblocker: true and it gets through handshake-level blocks that stop a vanilla HTTP client cold. Single routes through Proxy Finder automatically and returns the proxy id at the top level of the response (r.proxy, where r is the parsed response). When you need session continuity, your follow-up calls can pass that id back as proxy: "<id>" to stick to the same exit.
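A minimal sketch of that session-continuity pattern, assuming the JSON field names used in this post (url, unblocker, proxy) and a response that exposes the proxy id as a top-level "proxy" key. The helper names are hypothetical, and nothing here sends a request; it only builds the payloads.

```python
from typing import Any, Dict, Optional


def build_single_payload(url: str, proxy_id: Optional[str] = None,
                         unblocker: bool = True) -> Dict[str, Any]:
    """Build the JSON body for a Single call (field names assumed from this post)."""
    payload: Dict[str, Any] = {"url": url, "unblocker": unblocker}
    if proxy_id is not None:
        # Pin the follow-up request to the exit used by an earlier call.
        payload["proxy"] = proxy_id
    return payload


def followup_payload(previous_response: Dict[str, Any], next_url: str) -> Dict[str, Any]:
    """Reuse the proxy id from a previous response, if one was returned."""
    # Assumption: the proxy id sits at the top level of the response as "proxy".
    return build_single_payload(next_url, proxy_id=previous_response.get("proxy"))


# Illustrative response shape only; a real response would carry more fields.
first_response = {"proxy": "px_123", "status": 200}
second = followup_payload(
    first_response, "https://registry.example.com/company/123/filings"
)
```

If the first response carries no proxy id, the follow-up payload simply omits the field, which (on the behaviour described above) leaves the choice of exit to Proxy Finder.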

JavaScript-heavy SPAs. Crunchbase, LinkedIn-style apps, and even mid-sized company sites won't return the data you want from a plain HTTP response. They render on the client. That's Browser: a full browser executes the page, runs the JS, and hands you back the rendered HTML, cookies, and screenshots. Like Single, it routes through Proxy Finder under the hood, so there's no separate pick step on your side.

Mixed sources with validation. Every request to FourA's API accepts a validate block. You can require specific status codes, header matches, or substring matches in the body. If the response is a soft-fail (a 200 page with a CAPTCHA, or an empty data shell, or a "we're sorry" interstitial), the validator rejects it. Your pipeline can then route the same URL through Browser instead. That single feature kills the most expensive class of bug in enrichment: the silent failure that writes garbage to your database.
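The fallback described above can be sketched roughly as follows. This is an illustration under stated assumptions, not FourA's implementation: the two fetchers are injected callables standing in for real API calls, the response shape ({"status", "data"}) is hypothetical, and passes_validation is a local stand-in for the validate block whose matching rules (case-insensitive substrings here) are assumed.

```python
from typing import Callable, Dict, Iterable, Tuple

Response = Dict[str, object]  # hypothetical shape: {"status": int, "data": str}
Fetcher = Callable[[str], Response]


def passes_validation(resp: Response, accept_status: Iterable[int],
                      fail_substrings: Iterable[str]) -> bool:
    """Local stand-in for the validate block: status allow-list plus body deny-list."""
    if resp.get("status") not in set(accept_status):
        return False
    body = str(resp.get("data", "")).lower()
    # Assumption: case-insensitive substring match; the real validator may differ.
    return not any(marker.lower() in body for marker in fail_substrings)


def fetch_with_fallback(url: str, single: Fetcher, browser: Fetcher,
                        accept_status=(200,),
                        fail_substrings=("captcha", "blocked", "access denied"),
                        ) -> Tuple[str, Response]:
    """Try the cheap path first; escalate to Browser on a hard or soft failure."""
    resp = single(url)
    if passes_validation(resp, accept_status, fail_substrings):
        return "single", resp
    resp = browser(url)
    if passes_validation(resp, accept_status, fail_substrings):
        return "browser", resp
    # Refuse to return a response that would write garbage downstream.
    raise ValueError(f"no valid response for {url}")


# Stub fetchers standing in for real API calls.
def fake_single(url: str) -> Response:
    return {"status": 200, "data": "<html>Please solve this CAPTCHA</html>"}


def fake_browser(url: str) -> Response:
    return {"status": 200, "data": "<html><h1>Acme Ltd</h1></html>"}


tool, result = fetch_with_fallback(
    "https://registry.example.com/company/123", fake_single, fake_browser
)
```

Raising when both paths fail is a deliberate choice in this sketch: the caller has to handle the miss explicitly instead of storing a soft-fail page as if it were data.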

Here's the shape of a single-source call:

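# Content-Type is added on the assumption that the API expects JSON;
# curl -d alone sends application/x-www-form-urlencoded.
curl -X POST https://api.foura.ai/api/single \
  -H "Authorization: Bearer pk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://registry.example.com/company/123",
    "unblocker": true,
    "followRedirects": 5,
    "validate": {
      "status": { "accept": [200] },
      "data":   { "fail":   ["captcha", "blocked", "access denied"] }
    }
  }'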

And the Browser equivalent for a JavaScript-heavy company site:

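# Content-Type is added on the assumption that the API expects JSON;
# curl -d alone sends application/x-www-form-urlencoded.
curl -X POST https://api.foura.ai/api/browser \
  -H "Authorization: Bearer pk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.example-saas.com/about",
    "unblocker": true
  }'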
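For pipelines written in Python, the same two calls might be prepared like this. It is a sketch using only the standard library: the endpoint paths and body fields are copied from the curl calls above, while the JSON content type, the FOURA_API_KEY environment variable name, and the build_request helper are assumptions made for illustration. The requests are built but not sent.

```python
import json
import os
import urllib.request
from typing import Any, Dict

API_BASE = "https://api.foura.ai/api"  # base URL as it appears in the curl calls


def build_request(product: str, body: Dict[str, Any],
                  api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) a POST to /single or /browser."""
    return urllib.request.Request(
        url=f"{API_BASE}/{product}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            # Assumption: the API expects a JSON content type.
            "Content-Type": "application/json",
        },
        method="POST",
    )


# FOURA_API_KEY is a hypothetical environment variable name.
api_key = os.environ.get("FOURA_API_KEY", "pk_live_placeholder")

single_req = build_request("single", {
    "url": "https://registry.example.com/company/123",
    "unblocker": True,
    "followRedirects": 5,
    "validate": {
        "status": {"accept": [200]},
        "data": {"fail": ["captcha", "blocked", "access denied"]},
    },
}, api_key)

browser_req = build_request("browser", {
    "url": "https://www.example-saas.com/about",
    "unblocker": True,
}, api_key)

# To send: urllib.request.urlopen(single_req, timeout=30), then parse the body
# (assuming the API answers with JSON).
```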

The routing logic sits in your own pipeline. The reliability sits in ours. You decide which of your sources gets which tool. We make sure the tool actually gets through.
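One way to keep that routing decision explicit is a small per-host table in your own code. The sketch below is illustrative only: the Route fields, the hostnames, and the retry budget are hypothetical choices, not part of any FourA API.

```python
from dataclasses import dataclass
from typing import Dict
from urllib.parse import urlsplit


@dataclass(frozen=True)
class Route:
    product: str           # "single" or "browser"
    unblocker: bool = True
    max_attempts: int = 2  # hypothetical client-side retry budget


# Hypothetical routing table; the hostnames and choices are illustrative.
ROUTES: Dict[str, Route] = {
    "registry.example.com": Route(product="single"),
    "www.example-saas.com": Route(product="browser", max_attempts=3),
}

# Unknown hosts start on the cheap path.
DEFAULT_ROUTE = Route(product="single")


def route_for(url: str) -> Route:
    """Pick a route by hostname, falling back to the default."""
    host = (urlsplit(url).hostname or "").lower()
    return ROUTES.get(host, DEFAULT_ROUTE)
```

Keeping the table as data means that adding a source is a one-line change, and a host that starts soft-failing on Single can be moved to Browser without touching the fetch code.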

Results

We've watched a handful of teams cut over from in-house scrapers to a FourA-routed pipeline during the public beta. The pattern is consistent (illustrative numbers based on what we've seen across the beta cohort):

Enrichment latency drops from 3–6 seconds per company to under 1.5 seconds median on cached-residential routes
Silent-failure rate (200-with-empty-data responses) drops from around 8% to under 1% once the validate block catches soft-fails before they reach the database
Engineering time on scraper maintenance drops from 1–2 full-time engineers to a Slack channel that mostly stays quiet
First-pass success rate on protected directories climbs into the high 90s when unblocker: true is paired with a clean proxy id

One more number worth flagging: we've seen first-pass correctness (right data, right company) lag behind first-pass success by about four points. The lesson isn't that scraping is hard. It's that you still need to validate the record against the company you actually asked for (we wrote about that pattern in why your web scraper keeps breaking).
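A rough sketch of that record-level check, assuming the only signal available is the company name. The suffix list, the 0.85 threshold, and the function names are all hypothetical; a production check would likely also compare domains or registry ids, and would need better handling of non-ASCII names than this normalization provides.

```python
import re
from difflib import SequenceMatcher

# Common legal suffixes to ignore when comparing names (illustrative, not exhaustive).
_SUFFIXES = {"inc", "llc", "ltd", "gmbh", "corp", "co", "plc", "limited", "corporation"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]+", " ", name.lower()).split()
    while tokens and tokens[-1] in _SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def same_company(requested: str, returned: str, threshold: float = 0.85) -> bool:
    """Accept the record only if the names are close after normalization."""
    a, b = normalize_name(requested), normalize_name(returned)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

A record that fails this check is better queued for review than written to the customer's list, since a confident wrong match is harder to spot later than a missing one.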

The numbers that matter aren't the proxy pool size or the request count. They're the rate at which your enrichment endpoint returns the right data on the first try, and the slope of your scraper-maintenance graph over the next six months.

Key Takeaway

Enrichment pipelines fail in slow motion. The first scraper you write looks fine on a Tuesday. By the third source, you're patching selectors at 11pm. By the tenth, you're carrying a maintenance debt that scales with your customer base. By the twentieth, you've quietly stopped onboarding new sources because nobody on the team wants to own the next one.

The bottleneck was never the source. It was the routing: picking the right tool, the right proxy, the right validation rule for every URL, every time. Build that layer once, hand the reliability to something that already handles it, and your team gets to spend Tuesday on the product instead of triaging selector breakage.