How We Sync 380M+ Records from 40+ Data Sources That Keep Crashing
Multi-IP import, automated scheduler, freshness monitoring, international expansion — data pipeline engineering for open data across 6 jurisdictions. From the first 404 to stable nightly updates of 110+ tables.
How We Sync 380M+ Records from 40+ Data Sources That Keep Crashing
When building a legal AI platform on open data, the biggest challenge isn't AI or search. It's reliably fetching data from dozens of sources — Ukrainian government registries, international databases, sanctions lists — each with its own limitations, formats, and stability issues.
This article is an engineering deep-dive into how we built a fully automated sync pipeline for 380+ million records from 40+ sources. From multi-IP import architecture to cron scheduler, freshness monitoring, and international expansion across 6 jurisdictions.
Updated: May 2026 — live numbers from production servers.
The Problem: Government APIs Are Not Stripe
When working with data.gov.ua, NAIS, UIPV, or spending.gov.ua APIs, you face reality:
- Undocumented rate limits — one service blocks after 100 req/min, another after 10
- Format changes — a JSON field suddenly becomes null instead of a string, or the response comes as an HTML error page instead of JSON
- Timeouts — a 200MB ZIP archive of the debtors registry might download for 20 minutes, or not at all
- No idempotency — no
ETag,Last-Modified, or diff endpoints. Every sync is a full rewrite - Disappearing URLs — data.gov.ua resources move without notice, returning 404
We can't afford manual imports. Lawyers rely on data freshness: the wanted persons registry must update daily, not monthly.
Architecture: Three Layers of Reliability
Our pipeline consists of three independent components:
┌─────────────────────────────────────────┐
│ opendata-sync (Docker container) │
│ ├─ node-cron scheduler │
│ ├─ 26 sources on schedule │
│ └─ Triggers → backend / openreyestr │
└───────────┬─────────────────┬───────────┘
│ │
▼ ▼
┌───────────────────┐ ┌──────────────────┐
│ ImportTaskService │ │ OpenReyestr │
│ (mcp_backend) │ │ sync-registry │
│ ├─ 10 source IPs │ │ ├─ ZIP download │
│ ├─ round-robin │ │ ├─ XML parsing │
│ ├─ retry logic │ │ └─ UPSERT │
│ └─ progress track │ │ │
└────────┬──────────┘ └────────┬─────────┘
│ │
▼ ▼
┌─────────────────────────────────────────┐
│ PostgreSQL: 110+ data tables (1.26 TB) │
│ Monitoring: db-status.py + freshness │
└─────────────────────────────────────────┘
Layer 1: Scheduler — opendata-sync
The first layer is a lightweight Node.js microservice that doesn't download data itself. It's only responsible for scheduling and triggering.
Source Configuration
Each source is declared declaratively:
{
name: 'mvs_wanted_persons',
title: 'MVS — Wanted Persons',
cron: '0 3 * * *', // 03:00 daily
target: 'backend', // where to send the trigger
sourceName: 'mvs_wanted_persons',
enabled: true
}
Sync Schedule
| Time | Sources | Target Service | |——|———|—————-| | 03:00 daily | MVS wanted, MVS missing, MVS vehicles, MVS invalid passports, NAZK corruption, NAZK offenders | backend | | 03:30 daily | Case statuses, court schedules, advocates, lustration, state aid, large taxpayers, wage debtors | backend | | 04:00–05:00 daily | Arbitration managers, bankruptcy, enforcement, debtors | openreyestr | | Sunday 02:00 | UIPV patents, trademarks, models, designs | backend | | Monday 02:00–05:00 | Notaries, court experts, special forms, streets, ATU | openreyestr |
Deduplication Protection
Before each trigger, the scheduler checks if an import is already running for that source. If status is running, no new task is created.
Layer 2: ImportTaskService — Multi-IP Import
This is the heart of the pipeline. When the scheduler sends a trigger, ImportTaskService handles all the downloading.
Three Import Modes
Government sources use different formats, so we support three strategies:
| Mode | Sources | How It Works |
|——|———|————-|
| api_paginated | UIPV (patents, trademarks) | Page-by-page API traversal, 1100ms between requests |
| json_array | MVS, NAZK | Single HTTP request → JSON array |
| file_download | NAIS registries | ZIP → XML → parsing → UPSERT |
Multi-IP: 10 Addresses × 5 Threads = 50 Concurrent Downloads
For sources with per-IP rate limits, we use a pool of 10 network interfaces (AWS ENI). Pages are distributed round-robin:
Page 1 → IP 172.31.x.1
Page 2 → IP 172.31.x.2
...
Page 10 → IP 172.31.x.10
Page 11 → IP 172.31.x.1 (back to first)
With 5 threads per IP, we get 50 concurrent connections. For UIPV with a 1100ms/request rate limit, this gives ~45 pages/second instead of 0.9.
Retry with Exponential Backoff
Each request has up to 5 attempts with increasing delays:
Attempt 1: immediately
Attempt 2: after 2 seconds
Attempt 3: after 4 seconds
Attempt 4: after 8 seconds
Attempt 5: after 16 seconds
For 429 (Too Many Requests) errors — separate logic: we respect Retry-After from the server response.
Progress Tracking Without Database Load
Progress is stored in memory and flushed to PostgreSQL every 100 pages:
// In-memory — updated every page (microseconds)
taskProgress.set(taskId, {
pagesDone: 4521,
recordsImported: 45210,
currentPage: 4522,
lastError: null
});
// To DB — flush every 100 pages
// UPDATE import_tasks SET pages_done=2, records_imported=3 WHERE id=$1
This provides real-time progress via API without overwhelming the database with thousands of UPDATE queries.
MCP Tools for Control
The entire process is managed through 4 MCP tools:
| Tool | Purpose |
|——|———|
| list_import_sources | Catalog of all sources: URL, type, table, rate limit |
| start_import | Launch background task: source_name → task_id |
| get_import_status | Progress: %, ETA, speed, errors |
| cancel_import | Stop via AbortController, preserving progress |
This means the AI assistant can launch an import, monitor progress, and notify the lawyer when data is updated.
Layer 3: Freshness Monitoring
Data without monitoring is a ticking bomb. We built a system that shows how fresh the data is in each table.
Expected Frequency Matrix
| Frequency | Tables | Examples | |———–|——–|———-| | Daily (1d) | 24 | MVS wanted, invalid passports, NAZK corruption, debtors, enforcement, case statuses, advocates | | Weekly (7d) | 48 | Patents, trademarks, OpenSanctions, deputies, judges, bills | | Monthly (30d) | 8 | Session schedules, large taxpayers, court experts, special forms |
Freshness Indicators
🟢 within norm (freq × 1.5) — all good
🟡 slightly overdue (freq × 1.5–2.5) — worth checking
🟠 overdue (freq × 2.5–4) — something went wrong
🔴 critical (> freq × 4) — needs intervention
⛔ import completed with error
🔄 import currently running
Dashboard: db-status.py
The script connects to the production database via SSH and shows the full picture:
═══════════════════════════════════════════════════════════════
📦 SecondLayer (main) — 110+ tables, 1.26 TB total
═══════════════════════════════════════════════════════════════
# Table Rows Size Norm Age
──────────────────────────────────────────────────────────────
1 opendata_vehicle_registrations 19.6M 5.9 GB 7d 3d ago 🟢
2 spending_acts 9.45M 8.3 GB 7d 5d ago 🟢
3 opendata_invalid_passports 2.89M 1.0 GB 1d 2m ago 🟢
4 opendata_court_case_status 1.25M 846 MB 1d 12m ago 🟢
5 opensanctions_entities 1.25M 522 MB 30d 8d ago 🟢
6 opendata_trademarks 382K 4.3 GB 7d 3d ago 🟢
7 opendata_patents 345K 5.0 GB 7d 3d ago 🟢
8 opendata_missing_persons 117K 119 MB 1d 12m ago 🟢
9 opendata_wanted_persons 71K 49 MB 1d 2m ago 🟢
10 opendata_corruption 58K 106 MB 1d 3h ago 🟢
...
Real Problems and How We Solved Them
Problem 1: Docker Can't Bind to ENI IP
json_array sources (MVS, NAZK) are a single HTTP request, not pagination. When we passed ENI IP for bind, the Docker container got EADDRNOTAVAIL — it can't see the host network.
Solution: multi-IP is only needed for paginated sources. For json_array — regular fetch without bind.
Problem 2: URLs Disappear Without Warning
data.gov.ua periodically updates resource IDs for MVS and NAZK. Old URLs return 404.
Solution: URLs are stored in the import_source_catalog table, not hardcoded. Updating a URL is a single UPDATE query, no code rebuild needed.
Problem 3: NULL Bytes in PDF/XML
Some registries contain \x00 characters that PostgreSQL rejects:
ERROR: invalid byte sequence for encoding "UTF8": 0x00
Solution: strip null bytes during parsing, before INSERT.
Problem 4: Response Is Not JSON
When servers are overloaded, some APIs return an HTML error page or empty string instead of JSON.
Solution: parsing wrapped in try/catch with Content-Type checking. If response isn't JSON — retry from next IP.
Problem 5: Memory Leak on Large Imports
Importing 9.45M spending_acts records kept all records in memory.
Solution: streaming parsing — processing in chunks of 1000 records, UPSERT, release memory.
Numbers
| Metric | Value | |——–|——-| | Total data volume | 380M+ records, 1.26 TB (2 databases) | | Number of sources | 26 in import_source_catalog + 20 international importers | | Number of tables | 110+ data tables (31 opendata + 20 spain + 43 openreyestr + 50+ EDRSR partitions) | | MCP search tools | 30+ (opendata + spending + registries + international) | | Daily sync | 12 sources (03:00–05:00 UTC) | | Weekly sync | 14 sources (weekends) | | Concurrent connections | up to 50 (10 IPs × 5 threads) | | Full UIPV import time | ~45 min (345K records) | | MVS wanted import time | ~30 sec (71K records, single request) | | Largest table | enforcement_proceedings: 29.4M records, 19 GB | | International jurisdictions | 6 (Spain, Ireland, Netherlands, Switzerland, Luxembourg, EU) |
International Expansion: From 15 Ukrainian Sources to 40+ Global
Since March 2026, the pipeline expanded far beyond Ukrainian registries. Here's what was added:
ICIJ Offshore Leaks — 4.9M Records
Full Panama Papers, Paradise Papers, Pandora Papers database. 814K entities, 771K officers, 2.9M relationships, 402K addresses. CSV import in ~2 minutes, data updates with each new leak.
Spain — 20 Tables, 780K Records
The most complex international import. 14 sources: Tribunal Constitucional (27K decisions), BOE (48K announcements + 12K laws), BORME (276K companies), EUR-Lex (8.6K acts), CENDOJ (2.3K criminal decisions). CENDOJ turned out to be geo-blocked for non-EU IPs — required Playwright + auto IP rotation (81 EIP rotations, 3 parallel EC2 workers).
Netherlands — 1.1M Court Decisions
Rechtspraak Open Data API — 1,106,921 decisions. One of the cleanest APIs across all sources: XML with clear schema, working pagination, documented rate limits.
Switzerland — 661K Court Decisions
Entscheidsuche.ch — federal and cantonal courts. Zefix (1.7M companies) and SHAB (2.18M HR records) still blocked due to 403/timeout.
Ireland — 812K Companies
Companies Registration Office (CRO) — complete registry of Irish companies.
Luxembourg — 3.3M Records
GLEIF LEI — Global Legal Entity Identifier. 3,282,067 international legal entity records.
OpenSanctions — 1.25M Records
Aggregated sanctions list: 1,020K persons, 108K companies, 71K legal entities. 330 unique datasets from around the world.
What's Next
✅ Done from Previous Plan
- More sources — from 15 to 26 automated + 20 international importers
- Incremental sync — implemented for EDRSR (
sync-edrsr-incremental.sh) - Data quality checks — basic row count drop verification after imports
🔜 Next Steps
- EDRSR fulltext gap 2022-2026 — 32.9M documents missing full text, active backfill via /Review/ endpoint (~4M already recovered)
- Qdrant hybrid search — EDRSR vectors (103M+ points) timing out at 60s, needs HNSW tuning or wait for indexing completion
- Spain Tier 2 — 12 more importers: Plataforma Contratación (~5-8M tenders), Congreso votes (~25M), CENDOJ non-penal, Catastro INSPIRE
- Switzerland — 12 importers targeting ~9.2M records: kantonsblatt.ch, fedlex, parlament.ch, Zefix, opendata.swiss
- data.gov.ua OSINT — discovered 150+ new datasets across P0-P2 categories, gradual integration
- Alerting — Telegram bot for failed import notifications
Conclusion
Building a pipeline for open data isn't about fetch → insert. It's about reliability engineering: retry, rate limits, multi-IP, freshness monitoring, graceful degradation. And when the pipeline goes international — it's also about Playwright for geo-blocked sites, EIP rotation to escape ban lists, and parsing XML schemas from 6 different jurisdictions.
Each of the 40+ sources is its own story with unique problems. But when the pipeline runs stable, a lawyer asks a question in chat and gets fresh data from MVS, NAZK, UIPV, NAIS, spending.gov.ua, ICIJ, Rechtspraak, and CENDOJ — without ever thinking about how much engineering work stands behind each response.
Registration: legal.org.ua