TECH 2026-03-28 15 min

How We Sync 380M+ Records from 40+ Data Sources That Keep Crashing

Multi-IP import, automated scheduler, freshness monitoring, international expansion — data pipeline engineering for open data across 6 jurisdictions. From the first 404 to stable nightly updates of 110+ tables.

How We Sync 380M+ Records from 40+ Data Sources That Keep Crashing

When building a legal AI platform on open data, the biggest challenge isn't AI or search. It's reliably fetching data from dozens of sources — Ukrainian government registries, international databases, sanctions lists — each with its own limitations, formats, and stability issues.

This article is an engineering deep-dive into how we built a fully automated sync pipeline for 380+ million records from 40+ sources. From multi-IP import architecture to cron scheduler, freshness monitoring, and international expansion across 6 jurisdictions.

Updated: May 2026 — live numbers from production servers.

The Problem: Government APIs Are Not Stripe

When working with data.gov.ua, NAIS, UIPV, or spending.gov.ua APIs, you face reality:

Undocumented rate limits — one service blocks after 100 req/min, another after 10
Format changes — a JSON field suddenly becomes null instead of a string, or the response comes as an HTML error page instead of JSON
Timeouts — a 200MB ZIP archive of the debtors registry might download for 20 minutes, or not at all
No idempotency — no ETag, Last-Modified, or diff endpoints. Every sync is a full rewrite
Disappearing URLs — data.gov.ua resources move without notice, returning 404

We can't afford manual imports. Lawyers rely on data freshness: the wanted persons registry must update daily, not monthly.

Architecture: Three Layers of Reliability

Our pipeline consists of three independent components:

┌─────────────────────────────────────────┐
│  opendata-sync (Docker container)       │
│  ├─ node-cron scheduler                 │
│  ├─ 26 sources on schedule              │
│  └─ Triggers → backend / openreyestr    │
└───────────┬─────────────────┬───────────┘
            │                 │
            ▼                 ▼
┌───────────────────┐ ┌──────────────────┐
│  ImportTaskService │ │  OpenReyestr     │
│  (mcp_backend)     │ │  sync-registry   │
│  ├─ 10 source IPs  │ │  ├─ ZIP download │
│  ├─ round-robin    │ │  ├─ XML parsing  │
│  ├─ retry logic    │ │  └─ UPSERT       │
│  └─ progress track │ │                  │
└────────┬──────────┘ └────────┬─────────┘
         │                     │
         ▼                     ▼
┌─────────────────────────────────────────┐
│  PostgreSQL: 110+ data tables (1.26 TB) │
│  Monitoring: db-status.py + freshness   │
└─────────────────────────────────────────┘

Layer 1: Scheduler — opendata-sync

The first layer is a lightweight Node.js microservice that doesn't download data itself. It's only responsible for scheduling and triggering.

Source Configuration

Each source is declared declaratively:

{
  name: 'mvs_wanted_persons',
  title: 'MVS — Wanted Persons',
  cron: '0 3 * * *',           // 03:00 daily
  target: 'backend',           // where to send the trigger
  sourceName: 'mvs_wanted_persons',
  enabled: true
}

Sync Schedule

Deduplication Protection

Before each trigger, the scheduler checks if an import is already running for that source. If status is running, no new task is created.

Layer 2: ImportTaskService — Multi-IP Import

This is the heart of the pipeline. When the scheduler sends a trigger, ImportTaskService handles all the downloading.

Three Import Modes

Government sources use different formats, so we support three strategies:

Multi-IP: 10 Addresses × 5 Threads = 50 Concurrent Downloads

For sources with per-IP rate limits, we use a pool of 10 network interfaces (AWS ENI). Pages are distributed round-robin:

Page 1  → IP 172.31.x.1
Page 2  → IP 172.31.x.2
...
Page 10 → IP 172.31.x.10
Page 11 → IP 172.31.x.1  (back to first)

With 5 threads per IP, we get 50 concurrent connections. For UIPV with a 1100ms/request rate limit, this gives ~45 pages/second instead of 0.9.

Retry with Exponential Backoff

Each request has up to 5 attempts with increasing delays:

Attempt 1: immediately
Attempt 2: after 2 seconds
Attempt 3: after 4 seconds
Attempt 4: after 8 seconds
Attempt 5: after 16 seconds

For 429 (Too Many Requests) errors — separate logic: we respect Retry-After from the server response.

Progress Tracking Without Database Load

Progress is stored in memory and flushed to PostgreSQL every 100 pages:

// In-memory — updated every page (microseconds)
taskProgress.set(taskId, {
  pagesDone: 4521,
  recordsImported: 45210,
  currentPage: 4522,
  lastError: null
});

// To DB — flush every 100 pages
// UPDATE import_tasks SET pages_done=2, records_imported=3 WHERE id=$1

This provides real-time progress via API without overwhelming the database with thousands of UPDATE queries.

MCP Tools for Control

The entire process is managed through 4 MCP tools:

This means the AI assistant can launch an import, monitor progress, and notify the lawyer when data is updated.

Layer 3: Freshness Monitoring

Data without monitoring is a ticking bomb. We built a system that shows how fresh the data is in each table.

Expected Frequency Matrix

Freshness Indicators

🟢 within norm (freq × 1.5)           — all good
🟡 slightly overdue (freq × 1.5–2.5)  — worth checking
🟠 overdue (freq × 2.5–4)             — something went wrong
🔴 critical (> freq × 4)              — needs intervention
⛔ import completed with error
🔄 import currently running

Dashboard: db-status.py

The script connects to the production database via SSH and shows the full picture:

═══════════════════════════════════════════════════════════════
  📦 SecondLayer (main) — 110+ tables, 1.26 TB total
═══════════════════════════════════════════════════════════════
  #   Table                              Rows   Size   Norm  Age
  ──────────────────────────────────────────────────────────────
  1   opendata_vehicle_registrations   19.6M  5.9 GB    7d   3d ago   🟢
  2   spending_acts                     9.45M  8.3 GB    7d   5d ago   🟢
  3   opendata_invalid_passports        2.89M  1.0 GB    1d   2m ago   🟢
  4   opendata_court_case_status        1.25M  846 MB    1d   12m ago  🟢
  5   opensanctions_entities            1.25M  522 MB   30d   8d ago   🟢
  6   opendata_trademarks                382K  4.3 GB    7d   3d ago   🟢
  7   opendata_patents                   345K  5.0 GB    7d   3d ago   🟢
  8   opendata_missing_persons           117K  119 MB    1d   12m ago  🟢
  9   opendata_wanted_persons             71K   49 MB    1d   2m ago   🟢
  10  opendata_corruption                 58K  106 MB    1d   3h ago   🟢
  ...

Real Problems and How We Solved Them

Problem 1: Docker Can't Bind to ENI IP

json_array sources (MVS, NAZK) are a single HTTP request, not pagination. When we passed ENI IP for bind, the Docker container got EADDRNOTAVAIL — it can't see the host network.

Solution: multi-IP is only needed for paginated sources. For json_array — regular fetch without bind.

Problem 2: URLs Disappear Without Warning

data.gov.ua periodically updates resource IDs for MVS and NAZK. Old URLs return 404.

Solution: URLs are stored in the import_source_catalog table, not hardcoded. Updating a URL is a single UPDATE query, no code rebuild needed.

Problem 3: NULL Bytes in PDF/XML

Some registries contain \x00 characters that PostgreSQL rejects:

ERROR: invalid byte sequence for encoding "UTF8": 0x00

Solution: strip null bytes during parsing, before INSERT.

Problem 4: Response Is Not JSON

When servers are overloaded, some APIs return an HTML error page or empty string instead of JSON.

Solution: parsing wrapped in try/catch with Content-Type checking. If response isn't JSON — retry from next IP.

Problem 5: Memory Leak on Large Imports

Importing 9.45M spending_acts records kept all records in memory.

Solution: streaming parsing — processing in chunks of 1000 records, UPSERT, release memory.

Numbers

International Expansion: From 15 Ukrainian Sources to 40+ Global

Since March 2026, the pipeline expanded far beyond Ukrainian registries. Here's what was added:

ICIJ Offshore Leaks — 4.9M Records

Full Panama Papers, Paradise Papers, Pandora Papers database. 814K entities, 771K officers, 2.9M relationships, 402K addresses. CSV import in ~2 minutes, data updates with each new leak.

Spain — 20 Tables, 780K Records

The most complex international import. 14 sources: Tribunal Constitucional (27K decisions), BOE (48K announcements + 12K laws), BORME (276K companies), EUR-Lex (8.6K acts), CENDOJ (2.3K criminal decisions). CENDOJ turned out to be geo-blocked for non-EU IPs — required Playwright + auto IP rotation (81 EIP rotations, 3 parallel EC2 workers).

Netherlands — 1.1M Court Decisions

Rechtspraak Open Data API — 1,106,921 decisions. One of the cleanest APIs across all sources: XML with clear schema, working pagination, documented rate limits.

Switzerland — 661K Court Decisions

Entscheidsuche.ch — federal and cantonal courts. Zefix (1.7M companies) and SHAB (2.18M HR records) still blocked due to 403/timeout.

Ireland — 812K Companies

Companies Registration Office (CRO) — complete registry of Irish companies.

Luxembourg — 3.3M Records

GLEIF LEI — Global Legal Entity Identifier. 3,282,067 international legal entity records.

OpenSanctions — 1.25M Records

Aggregated sanctions list: 1,020K persons, 108K companies, 71K legal entities. 330 unique datasets from around the world.

What's Next

✅ Done from Previous Plan

More sources — from 15 to 26 automated + 20 international importers
Incremental sync — implemented for EDRSR (sync-edrsr-incremental.sh)
Data quality checks — basic row count drop verification after imports

🔜 Next Steps

EDRSR fulltext gap 2022-2026 — 32.9M documents missing full text, active backfill via /Review/ endpoint (~4M already recovered)
Qdrant hybrid search — EDRSR vectors (103M+ points) timing out at 60s, needs HNSW tuning or wait for indexing completion
Spain Tier 2 — 12 more importers: Plataforma Contratación (~5-8M tenders), Congreso votes (~25M), CENDOJ non-penal, Catastro INSPIRE
Switzerland — 12 importers targeting ~9.2M records: kantonsblatt.ch, fedlex, parlament.ch, Zefix, opendata.swiss
data.gov.ua OSINT — discovered 150+ new datasets across P0-P2 categories, gradual integration
Alerting — Telegram bot for failed import notifications

Conclusion

Building a pipeline for open data isn't about fetch → insert. It's about reliability engineering: retry, rate limits, multi-IP, freshness monitoring, graceful degradation. And when the pipeline goes international — it's also about Playwright for geo-blocked sites, EIP rotation to escape ban lists, and parsing XML schemas from 6 different jurisdictions.

Each of the 40+ sources is its own story with unique problems. But when the pipeline runs stable, a lawyer asks a question in chat and gets fresh data from MVS, NAZK, UIPV, NAIS, spending.gov.ua, ICIJ, Rechtspraak, and CENDOJ — without ever thinking about how much engineering work stands behind each response.

Registration: legal.org.ua