LEX — AI Legal Platform for Law Firms

AI-powered legal analysis platform for law firms and corporate counsel.

Features

Resources

Blog Articles

Technology

Built on AWS (EC2, Bedrock Claude AI, ALB, WAF, S3, ACM, KMS). PostgreSQL, Redis, Qdrant vector database. TypeScript, React, Node.js.

Start free — 50 credits on registration. Sign up

TECH 15 min

How We Sync 380M+ Records from 40+ Data Sources That Keep Crashing

Multi-IP import, automated scheduler, freshness monitoring, international expansion — data pipeline engineering for open data across 6 jurisdictions. From the first 404 to stable nightly updates of 110+ tables.

How We Sync 380M+ Records from 40+ Data Sources That Keep Crashing

When building a legal AI platform on open data, the biggest challenge isn't AI or search. It's reliably fetching data from dozens of sources — Ukrainian government registries, international databases, sanctions lists — each with its own limitations, formats, and stability issues.

This article is an engineering deep-dive into how we built a fully automated sync pipeline for 380+ million records from 40+ sources. From multi-IP import architecture to cron scheduler, freshness monitoring, and international expansion across 6 jurisdictions.

Updated: May 2026 — live numbers from production servers.


The Problem: Government APIs Are Not Stripe

When working with data.gov.ua, NAIS, UIPV, or spending.gov.ua APIs, you face reality:

We can't afford manual imports. Lawyers rely on data freshness: the wanted persons registry must update daily, not monthly.


Architecture: Three Layers of Reliability

Our pipeline consists of three independent components:

┌─────────────────────────────────────────┐
│  opendata-sync (Docker container)       │
│  ├─ node-cron scheduler                 │
│  ├─ 26 sources on schedule              │
│  └─ Triggers → backend / openreyestr    │
└───────────┬─────────────────┬───────────┘
            │                 │
            ▼                 ▼
┌───────────────────┐ ┌──────────────────┐
│  ImportTaskService │ │  OpenReyestr     │
│  (mcp_backend)     │ │  sync-registry   │
│  ├─ 10 source IPs  │ │  ├─ ZIP download │
│  ├─ round-robin    │ │  ├─ XML parsing  │
│  ├─ retry logic    │ │  └─ UPSERT       │
│  └─ progress track │ │                  │
└────────┬──────────┘ └────────┬─────────┘
         │                     │
         ▼                     ▼
┌─────────────────────────────────────────┐
│  PostgreSQL: 110+ data tables (1.26 TB) │
│  Monitoring: db-status.py + freshness   │
└─────────────────────────────────────────┘

Layer 1: Scheduler — opendata-sync

The first layer is a lightweight Node.js microservice that doesn't download data itself. It's only responsible for scheduling and triggering.

Source Configuration

Each source is declared declaratively:

{
  name: 'mvs_wanted_persons',
  title: 'MVS — Wanted Persons',
  cron: '0 3 * * *',           // 03:00 daily
  target: 'backend',           // where to send the trigger
  sourceName: 'mvs_wanted_persons',
  enabled: true
}

Sync Schedule

| Time | Sources | Target Service | |——|———|—————-| | 03:00 daily | MVS wanted, MVS missing, MVS vehicles, MVS invalid passports, NAZK corruption, NAZK offenders | backend | | 03:30 daily | Case statuses, court schedules, advocates, lustration, state aid, large taxpayers, wage debtors | backend | | 04:00–05:00 daily | Arbitration managers, bankruptcy, enforcement, debtors | openreyestr | | Sunday 02:00 | UIPV patents, trademarks, models, designs | backend | | Monday 02:00–05:00 | Notaries, court experts, special forms, streets, ATU | openreyestr |

Deduplication Protection

Before each trigger, the scheduler checks if an import is already running for that source. If status is running, no new task is created.


Layer 2: ImportTaskService — Multi-IP Import

This is the heart of the pipeline. When the scheduler sends a trigger, ImportTaskService handles all the downloading.

Three Import Modes

Government sources use different formats, so we support three strategies:

| Mode | Sources | How It Works | |——|———|————-| | api_paginated | UIPV (patents, trademarks) | Page-by-page API traversal, 1100ms between requests | | json_array | MVS, NAZK | Single HTTP request → JSON array | | file_download | NAIS registries | ZIP → XML → parsing → UPSERT |

Multi-IP: 10 Addresses × 5 Threads = 50 Concurrent Downloads

For sources with per-IP rate limits, we use a pool of 10 network interfaces (AWS ENI). Pages are distributed round-robin:

Page 1  → IP 172.31.x.1
Page 2  → IP 172.31.x.2
...
Page 10 → IP 172.31.x.10
Page 11 → IP 172.31.x.1  (back to first)

With 5 threads per IP, we get 50 concurrent connections. For UIPV with a 1100ms/request rate limit, this gives ~45 pages/second instead of 0.9.

Retry with Exponential Backoff

Each request has up to 5 attempts with increasing delays:

Attempt 1: immediately
Attempt 2: after 2 seconds
Attempt 3: after 4 seconds
Attempt 4: after 8 seconds
Attempt 5: after 16 seconds

For 429 (Too Many Requests) errors — separate logic: we respect Retry-After from the server response.

Progress Tracking Without Database Load

Progress is stored in memory and flushed to PostgreSQL every 100 pages:

// In-memory — updated every page (microseconds)
taskProgress.set(taskId, {
  pagesDone: 4521,
  recordsImported: 45210,
  currentPage: 4522,
  lastError: null
});

// To DB — flush every 100 pages
// UPDATE import_tasks SET pages_done=2, records_imported=3 WHERE id=$1

This provides real-time progress via API without overwhelming the database with thousands of UPDATE queries.

MCP Tools for Control

The entire process is managed through 4 MCP tools:

| Tool | Purpose | |——|———| | list_import_sources | Catalog of all sources: URL, type, table, rate limit | | start_import | Launch background task: source_name → task_id | | get_import_status | Progress: %, ETA, speed, errors | | cancel_import | Stop via AbortController, preserving progress |

This means the AI assistant can launch an import, monitor progress, and notify the lawyer when data is updated.


Layer 3: Freshness Monitoring

Data without monitoring is a ticking bomb. We built a system that shows how fresh the data is in each table.

Expected Frequency Matrix

| Frequency | Tables | Examples | |———–|——–|———-| | Daily (1d) | 24 | MVS wanted, invalid passports, NAZK corruption, debtors, enforcement, case statuses, advocates | | Weekly (7d) | 48 | Patents, trademarks, OpenSanctions, deputies, judges, bills | | Monthly (30d) | 8 | Session schedules, large taxpayers, court experts, special forms |

Freshness Indicators

🟢 within norm (freq × 1.5)           — all good
🟡 slightly overdue (freq × 1.5–2.5)  — worth checking
🟠 overdue (freq × 2.5–4)             — something went wrong
🔴 critical (> freq × 4)              — needs intervention
⛔ import completed with error
🔄 import currently running

Dashboard: db-status.py

The script connects to the production database via SSH and shows the full picture:

═══════════════════════════════════════════════════════════════
  📦 SecondLayer (main) — 110+ tables, 1.26 TB total
═══════════════════════════════════════════════════════════════
  #   Table                              Rows   Size   Norm  Age
  ──────────────────────────────────────────────────────────────
  1   opendata_vehicle_registrations   19.6M  5.9 GB    7d   3d ago   🟢
  2   spending_acts                     9.45M  8.3 GB    7d   5d ago   🟢
  3   opendata_invalid_passports        2.89M  1.0 GB    1d   2m ago   🟢
  4   opendata_court_case_status        1.25M  846 MB    1d   12m ago  🟢
  5   opensanctions_entities            1.25M  522 MB   30d   8d ago   🟢
  6   opendata_trademarks                382K  4.3 GB    7d   3d ago   🟢
  7   opendata_patents                   345K  5.0 GB    7d   3d ago   🟢
  8   opendata_missing_persons           117K  119 MB    1d   12m ago  🟢
  9   opendata_wanted_persons             71K   49 MB    1d   2m ago   🟢
  10  opendata_corruption                 58K  106 MB    1d   3h ago   🟢
  ...

Real Problems and How We Solved Them

Problem 1: Docker Can't Bind to ENI IP

json_array sources (MVS, NAZK) are a single HTTP request, not pagination. When we passed ENI IP for bind, the Docker container got EADDRNOTAVAIL — it can't see the host network.

Solution: multi-IP is only needed for paginated sources. For json_array — regular fetch without bind.

Problem 2: URLs Disappear Without Warning

data.gov.ua periodically updates resource IDs for MVS and NAZK. Old URLs return 404.

Solution: URLs are stored in the import_source_catalog table, not hardcoded. Updating a URL is a single UPDATE query, no code rebuild needed.

Problem 3: NULL Bytes in PDF/XML

Some registries contain \x00 characters that PostgreSQL rejects:

ERROR: invalid byte sequence for encoding "UTF8": 0x00

Solution: strip null bytes during parsing, before INSERT.

Problem 4: Response Is Not JSON

When servers are overloaded, some APIs return an HTML error page or empty string instead of JSON.

Solution: parsing wrapped in try/catch with Content-Type checking. If response isn't JSON — retry from next IP.

Problem 5: Memory Leak on Large Imports

Importing 9.45M spending_acts records kept all records in memory.

Solution: streaming parsing — processing in chunks of 1000 records, UPSERT, release memory.


Numbers

| Metric | Value | |——–|——-| | Total data volume | 380M+ records, 1.26 TB (2 databases) | | Number of sources | 26 in import_source_catalog + 20 international importers | | Number of tables | 110+ data tables (31 opendata + 20 spain + 43 openreyestr + 50+ EDRSR partitions) | | MCP search tools | 30+ (opendata + spending + registries + international) | | Daily sync | 12 sources (03:00–05:00 UTC) | | Weekly sync | 14 sources (weekends) | | Concurrent connections | up to 50 (10 IPs × 5 threads) | | Full UIPV import time | ~45 min (345K records) | | MVS wanted import time | ~30 sec (71K records, single request) | | Largest table | enforcement_proceedings: 29.4M records, 19 GB | | International jurisdictions | 6 (Spain, Ireland, Netherlands, Switzerland, Luxembourg, EU) |


International Expansion: From 15 Ukrainian Sources to 40+ Global

Since March 2026, the pipeline expanded far beyond Ukrainian registries. Here's what was added:

ICIJ Offshore Leaks — 4.9M Records

Full Panama Papers, Paradise Papers, Pandora Papers database. 814K entities, 771K officers, 2.9M relationships, 402K addresses. CSV import in ~2 minutes, data updates with each new leak.

Spain — 20 Tables, 780K Records

The most complex international import. 14 sources: Tribunal Constitucional (27K decisions), BOE (48K announcements + 12K laws), BORME (276K companies), EUR-Lex (8.6K acts), CENDOJ (2.3K criminal decisions). CENDOJ turned out to be geo-blocked for non-EU IPs — required Playwright + auto IP rotation (81 EIP rotations, 3 parallel EC2 workers).

Netherlands — 1.1M Court Decisions

Rechtspraak Open Data API — 1,106,921 decisions. One of the cleanest APIs across all sources: XML with clear schema, working pagination, documented rate limits.

Switzerland — 661K Court Decisions

Entscheidsuche.ch — federal and cantonal courts. Zefix (1.7M companies) and SHAB (2.18M HR records) still blocked due to 403/timeout.

Ireland — 812K Companies

Companies Registration Office (CRO) — complete registry of Irish companies.

Luxembourg — 3.3M Records

GLEIF LEI — Global Legal Entity Identifier. 3,282,067 international legal entity records.

OpenSanctions — 1.25M Records

Aggregated sanctions list: 1,020K persons, 108K companies, 71K legal entities. 330 unique datasets from around the world.


What's Next

✅ Done from Previous Plan

🔜 Next Steps

  1. EDRSR fulltext gap 2022-2026 — 32.9M documents missing full text, active backfill via /Review/ endpoint (~4M already recovered)
  2. Qdrant hybrid search — EDRSR vectors (103M+ points) timing out at 60s, needs HNSW tuning or wait for indexing completion
  3. Spain Tier 2 — 12 more importers: Plataforma Contratación (~5-8M tenders), Congreso votes (~25M), CENDOJ non-penal, Catastro INSPIRE
  4. Switzerland — 12 importers targeting ~9.2M records: kantonsblatt.ch, fedlex, parlament.ch, Zefix, opendata.swiss
  5. data.gov.ua OSINT — discovered 150+ new datasets across P0-P2 categories, gradual integration
  6. Alerting — Telegram bot for failed import notifications

Conclusion

Building a pipeline for open data isn't about fetch → insert. It's about reliability engineering: retry, rate limits, multi-IP, freshness monitoring, graceful degradation. And when the pipeline goes international — it's also about Playwright for geo-blocked sites, EIP rotation to escape ban lists, and parsing XML schemas from 6 different jurisdictions.

Each of the 40+ sources is its own story with unique problems. But when the pipeline runs stable, a lawyer asks a question in chat and gets fresh data from MVS, NAZK, UIPV, NAIS, spending.gov.ua, ICIJ, Rechtspraak, and CENDOJ — without ever thinking about how much engineering work stands behind each response.


Registration: legal.org.ua