How We Vectorize 33.7M Ukrainian Court Decisions via Voyage AI
EDRSR is the open-access Unified State Register of all Ukrainian court decisions. 44M+ vectors in Qdrant, 14.3M civil cases already processed out of 33.7M. Here's the pipeline: chunking, concurrency, checkpoint/resume, a dedicated EC2 for Qdrant, and the cost math.
EDRSR, the Unified State Register of Court Decisions, is effectively all of Ukraine's judicial practice in open access. Today Qdrant holds 44M+ vectors: criminal (19M), civil (14.3M), commercial (5.1M), misdemeanors (5.6M). Vectorization of civil cases (CPC, justice_kind=1), the largest cohort at 33.7M documents, runs on a dedicated EC2 instance (r6a.xlarge, 32 GB RAM, 2 TB gp3). Here's what's under the hood: models, pipeline, cost, pitfalls, and current status.
Why Vectorize Courts
When a lawyer searches "is there case law on recovering bank prepayment fees", they don't want to open 40 decisions and read them through. They want the system to surface the top 5 most relevant ones, pull out key paragraphs, and show how courts reasoned. Full-text search (FTS) over keywords doesn't give that: it returns every document containing the word "fee", and there are thousands.
For this semantic task you need vector representations of text. The model turns a paragraph from a decision into a point in a 1024-dimensional space; semantically similar paragraphs sit near each other. A kNN search in Qdrant returns the top K nearest, and an LLM composes the answer from exactly those relevant fragments.
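To make the "nearest points" idea concrete, here is a toy brute-force version of the search. It is only a sketch: Qdrant answers the same question approximately through its HNSW index rather than by scanning every point, and cosine similarity is an assumption here, since the distance metric is not stated above. The names `cosine` and `top_k` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, points, k):
    """Return the ids of the k points most similar to `query`.

    `points` maps point id -> vector. This scans every point, which is
    fine for a demo and hopeless at tens of millions of vectors.
    """
    scored = [(cosine(query, vec), pid) for pid, vec in points.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [pid for _, pid in scored[:k]]
```

With 2-dimensional stand-ins for the real 1024-dimensional vectors, `top_k([1, 0], {"a": [1, 0], "b": [0, 1], "c": [0.9, 0.1]}, 2)` ranks `a` and `c` ahead of `b`.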
The only problem: the register is big. Very big.
Scale
Our prod database holds full texts of decisions starting from 2006. Breakdown by procedural type:
- Civil (CPC) — 33.7M documents. The largest category. Consumer, housing, labor, family.
- Criminal (CrPC) — 12M+
- Administrative (CAS) — 14M+
- Commercial (CC) — 6M+
- Misdemeanors (CUaP) — 6M+
The Qdrant collection edrsr_decisions on a dedicated EC2 currently holds 44M+ vectors (122 segments, on_disk=true):
| Proceeding type | justice_kind | Vectors |
|---|---|---|
| Criminal (CrPC) | 2 | 19,036,347 |
| Civil (CPC) | 1 | 14,328,427 |
| Misdemeanors (CUaP) | 5 | 5,579,432 |
| Commercial (CC) | 3 | 5,098,662 |
| Total | | 44,042,868 |
Civil cases processed: 14.3M out of 33.7M — that's 42%. After CPC completes there will be roughly 63M+ vectors in a single collection.
For scale: a typical RAG project holds 100K to 1M vectors. Ours is roughly two orders of magnitude bigger.
Stack
Embedding model. voyage-3.5 from Voyage AI. 1024-dimensional output, 6 cents per million tokens. We tested Voyage 3 Large and OpenAI text-embedding-3-large, but the quality gain on legal text didn't justify the cost difference (Voyage 3 Large is 3x more expensive). We already had an index on 3.5 for prior jurisdictions, so we stay on it for compatibility.
Vector DB. Qdrant v1.17, self-hosted in Docker on a dedicated EC2 (r6a.xlarge — 4 CPU, 32 GB RAM, 2 TB gp3). Collection edrsr_decisions with HNSW index, on_disk=true for both vectors and payload. Payload carries doc_id, court_code, judge, justice_kind, adjudication_date, plus chunk_index/total_chunks and chunk text. Dedicated instance because 44M+ points with HNSW were killing RAM on prod and blocking the chat service (OOM kills during segment optimization).
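As a rough illustration of that collection setup, the request body for Qdrant's create-collection endpoint (`PUT /collections/edrsr_decisions`) could be built like this. The 1024 dimensions and the two on-disk flags come from the description above; the `Cosine` distance and the helper name `collection_config` are assumptions, and HNSW parameters are left at Qdrant's defaults because none are given.

```python
import json


def collection_config(dim: int = 1024, distance: str = "Cosine") -> dict:
    """Build a create-collection body with vectors and payload on disk.

    `distance` is an assumed value; the article does not say which
    metric the collection uses.
    """
    return {
        "vectors": {
            "size": dim,           # voyage-3.5 output dimensionality
            "distance": distance,  # assumption, see docstring
            "on_disk": True,       # keep raw vectors on disk, not in RAM
        },
        "on_disk_payload": True,   # keep payload on disk as well
    }


body = json.dumps(collection_config())
```

Keeping both vectors and payload on disk trades query latency for a much smaller resident set, which matters on a 32 GB instance.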
Source-of-truth. PostgreSQL 15, partitioned tables: RANGE by adjudication_date, LIST by adj_year. Full texts live in edrsr_fulltext, metadata in edrsr_documents. A JOIN across all partitions is 30M+ rows, so the pipeline walks year by year.
Runtime. Python 3.11, asyncio, aiohttp. No frameworks — direct HTTP to Voyage and Qdrant. 440 lines of code, one file.
Chunking
Court decisions are long. The average CPC ruling is 8–12K characters; the longest reach 200K. Voyage accepts up to 32K tokens per input, but quality falls off on long contexts, and one long vector is poor for retrieval: the LLM can't tell which paragraph is relevant.
So we chunk: up to 2048 characters per chunk, 50-word overlap between neighbors. We split on paragraph boundaries to keep semantic coherence. On average one decision yields 2.7 chunks.
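A minimal sketch of such a chunker, under stated assumptions: paragraphs are separated by newlines, the overlap is taken as the last 50 whitespace-separated words of the previous chunk, and anything still longer than the limit is cut at a hard character boundary. The function name and these edge-case choices are illustrative, not the production code.

```python
def chunk_text(text, max_chars=2048, overlap_words=50):
    """Split text into chunks of at most max_chars, on paragraph boundaries.

    Each new chunk starts with the last `overlap_words` words of the
    previous one. Oversized pieces are hard-split by characters.
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            # Close the current chunk and seed the next one with its tail.
            chunks.append(current)
            tail = " ".join(current.split()[-overlap_words:])
            current = f"{tail}\n{para}"
        else:
            current = para
        # A single paragraph (plus tail) may still exceed the limit.
        while len(current) > max_chars:
            chunks.append(current[:max_chars])
            current = current[max_chars:]
    if current:
        chunks.append(current)
    return chunks
```

The hard split can cut mid-word; a production version would more likely fall back to sentence boundaries before resorting to that.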
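Each chunk in Qdrant gets a composite ID (doc_id × 1000 + chunk_index). IDs cannot collide as long as a decision has fewer than 1000 chunks, and even a 200K-character ruling stays far below that. A single payload filter query pulls all chunks of a specific decision.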
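The ID scheme is easy to sketch. The helper names are hypothetical, and the explicit range check is an addition for illustration: it makes the "fewer than 1000 chunks per document" assumption fail loudly instead of silently colliding with the next document's IDs.

```python
CHUNKS_PER_DOC = 1000  # multiplier from the scheme: doc_id * 1000 + chunk_index


def point_id(doc_id: int, chunk_index: int) -> int:
    """Encode (doc_id, chunk_index) into a single integer point ID."""
    if not 0 <= chunk_index < CHUNKS_PER_DOC:
        raise ValueError(f"chunk_index {chunk_index} out of range")
    return doc_id * CHUNKS_PER_DOC + chunk_index


def split_point_id(pid: int) -> tuple[int, int]:
    """Decode a point ID back into (doc_id, chunk_index)."""
    return divmod(pid, CHUNKS_PER_DOC)
```

For example, chunk 7 of document 12345 gets ID 12345007, and `split_point_id(12345007)` gives back `(12345, 7)`.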
Concurrency and Throttling
Voyage has a rate limit — 2000 RPM per key for voyage-3.5. We have two keys and round-robin between them, giving a theoretical 4000 RPM ceiling. In practice we hold concurrency 50 and get a steady 63 documents per second. That's ~170 requests per minute per key — comfortably under the rate limit.
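One way to combine key rotation with a concurrency cap in asyncio is sketched below. `embed_all` and the `embed(item, key)` callable are hypothetical stand-ins for the real HTTP call, and the sketch handles one in-memory batch of work rather than the whole register.

```python
import asyncio
import itertools


async def embed_all(items, keys, embed, concurrency=50):
    """Run `embed(item, key)` for every item with bounded concurrency.

    Keys are assigned round-robin up front, so item i always uses
    keys[i % len(keys)]. The semaphore caps in-flight calls.
    """
    sem = asyncio.Semaphore(concurrency)

    async def worker(item, key):
        async with sem:  # at most `concurrency` calls run at once
            return await embed(item, key)

    tasks = [
        worker(item, key)
        for item, key in zip(items, itertools.cycle(keys))
    ]
    # gather preserves input order in its result list
    return await asyncio.gather(*tasks)
```

Assigning keys before the semaphore is acquired keeps the split between keys even, regardless of which requests happen to finish first.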
We tried concurrency 70 — first two million were fine, then the process stalled on the GIL (13% CPU, no progress, no errors — just stuck on a thread lock). Dropped to 50 — ran smooth, no deadlocks, no 429s.
Every 100 documents the pipeline sends a batch to Voyage (batch_size=500 chunks/request), receives embeddings, composes Qdrant points, and does one upsert. On a Voyage error (429, network): exponential backoff with jitter, max 5 retries. On a Qdrant error: retry the same batch.
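The retry policy can be sketched as a small wrapper. Only "exponential backoff with jitter, max 5 retries" comes from the description above; the base delay, the cap, the "full jitter" variant (sleeping a random time between zero and the current ceiling), and the `RetryableError` class are all assumptions for illustration.

```python
import asyncio
import random


class RetryableError(Exception):
    """Stand-in for a 429 response or a network failure."""


async def with_backoff(call, max_retries=5, base_delay=1.0,
                       max_delay=60.0, sleep=asyncio.sleep):
    """Await `call()`, retrying on RetryableError with jittered backoff.

    Makes one initial attempt plus up to `max_retries` retries, then
    re-raises the last error.
    """
    for attempt in range(max_retries + 1):
        try:
            return await call()
        except RetryableError:
            if attempt == max_retries:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: spread retries out so workers don't retry in lockstep.
            await sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so the delay can be swapped out in tests; in normal use it stays `asyncio.sleep`.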
Checkpoint and Resume
At 33.7M documents any failure — network, OOM, container crash — means hours of lost work. So:
- Every 1000 processed documents the pipeline writes a checkpoint JSON: {last_doc_id, processed_docs, total_chunks, total_tokens, timestamp}
- On startup: reads checkpoint, resumes with WHERE doc_id > last_doc_id
- All metrics (docs, chunks, tokens, cost) accumulate across checkpoints
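A minimal sketch of the save and load side, using the field names listed above. The write-to-temp-then-rename step is an assumption added here so that a crash mid-write cannot leave a half-written checkpoint; the function names and the zeroed default state are likewise illustrative.

```python
import json
import os
import time
from pathlib import Path

EMPTY_STATE = {"last_doc_id": 0, "processed_docs": 0,
               "total_chunks": 0, "total_tokens": 0}


def save_checkpoint(path, state):
    """Write the checkpoint atomically: temp file first, then rename."""
    tmp = Path(str(path) + ".tmp")
    tmp.write_text(json.dumps({**state, "timestamp": time.time()}))
    os.replace(tmp, path)  # atomic on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a zeroed state on first run."""
    path = Path(path)
    if not path.exists():
        return dict(EMPTY_STATE)
    return json.loads(path.read_text())
```

On restart, the loaded `last_doc_id` feeds the resume query (`WHERE doc_id > last_doc_id`) and the counters keep accumulating from their saved values.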
This has saved us twice. First time — when postgres-prod ran out of memory (more on that below). Second time — when Qdrant restarted and lost its API key from env. Both times we just restarted from the same checkpoint with no duplicated work.
Prod Incident: Postgres OOM
At 2.86M documents postgres-prod fell into recovery mode. Root cause: config mismatch — shared_buffers=16GB, container memory limit 12G. PG tried to allocate more than it had; OOM killer killed the process.
Fix in PR #1453: mem_limit: 24G, shm_size: 16g. After restarting the container with the new limits PG came up in 4 seconds and stopped falling over. The episode highlighted an infra pattern: postgresql.conf parameters (shared_buffers, work_mem, maintenance_work_mem) must align with container limits. Otherwise the system runs fine until the first load spike, then falls into recovery.
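A check like the following could catch this class of mismatch before deploy. It is a sketch only: the 25% headroom is an arbitrary illustrative threshold rather than a PostgreSQL recommendation, and the parser handles just the simple size spellings that appear above (`16GB`, `12G`, `16g`).

```python
import re


def parse_size(value: str) -> int:
    """Parse sizes like '16GB', '24G' or '16g' into bytes (binary units)."""
    match = re.fullmatch(r"(\d+)\s*([kmgt])b?", value.strip().lower())
    if not match:
        raise ValueError(f"unrecognised size: {value!r}")
    exponent = "kmgt".index(match.group(2)) + 1
    return int(match.group(1)) * 1024 ** exponent


def fits(shared_buffers: str, mem_limit: str, headroom: float = 0.25) -> bool:
    """True if shared_buffers leaves `headroom` of the container limit free.

    The headroom stands in for everything else PostgreSQL allocates
    (work_mem, maintenance_work_mem, per-connection memory).
    """
    return parse_size(shared_buffers) <= parse_size(mem_limit) * (1 - headroom)
```

With the numbers from the incident, `fits("16GB", "12G")` is false and `fits("16GB", "24G")` is true.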
We also bumped swap on the local dev machine from 8GB to 24GB — heavy Voyage API traffic generates a lot of temporary objects in the Python process memory, especially while Qdrant is rebuilding its index in the background.
Cost
One civil document averages 2.7 chunks × 850 tokens = 2300 tokens. At voyage-3.5 pricing of 6 cents per million tokens, one document costs 0.014 cents — roughly 138 microdollars.
As of today, 14.3M documents out of 33.7M are processed, which is 42% of the cohort. We've spent approximately 1,980 dollars on the Voyage API and about 63 hours of pipeline runtime. The remaining 19.4M documents will cost roughly 2,680 dollars and 85 hours (3.5 days of continuous processing). Total cost of the full CPC cohort vectorization: around 4,660 dollars.
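The arithmetic behind these figures is simple enough to reproduce. The sketch below uses the rounded 2,300 tokens per document, the 6 cents per million tokens price, and the 63 documents per second throughput quoted in this article; the function name is illustrative.

```python
TOKENS_PER_DOC = 2300           # 2.7 chunks x 850 tokens, rounded
USD_PER_MILLION_TOKENS = 0.06   # voyage-3.5 price used in this article
DOCS_PER_SECOND = 63            # observed steady-state throughput


def estimate(docs: float) -> tuple[float, float]:
    """Return (API cost in USD, runtime in hours) for `docs` documents."""
    tokens = docs * TOKENS_PER_DOC
    usd = tokens / 1_000_000 * USD_PER_MILLION_TOKENS
    hours = docs / DOCS_PER_SECOND / 3600
    return usd, hours


done_usd, done_hours = estimate(14.3e6)       # processed so far
left_usd, left_hours = estimate(19.4e6)       # still to go
```

The results land within about one percent of the rounded figures quoted above; the small gap comes from rounding in the per-document token count and in the totals.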
Plus the EC2 r6a.xlarge for Qdrant: ~$0.20/hr (on-demand), roughly $145/month. Cheaper than OOM incidents on prod.
For scale: the same budget on OpenAI text-embedding-3-large would get us only a quarter of the volume. Voyage wins specifically at this scale.
What It Gives Users
Semantic search already works across 44M+ vectors today. Once the civil cohort is fully indexed, the collection will hold 63M+ chunks. A lawyer types a natural-language query — "case law on voiding a sale contract due to seller incapacity" — and the system returns the most relevant decisions from the right jurisdiction, with key paragraph extracts and EDRSR links.
That's a different class of product compared to FTS. FTS finds documents where a phrase appears. Semantic search finds documents where your situation is being discussed — even when the court used entirely different words.
TL;DR
- 33.7M civil EDRSR cases → Voyage voyage-3.5 → Qdrant (14.3M / 33.7M = 42% done)
- 44M+ vectors in Qdrant on a dedicated EC2 (r6a.xlarge, 32 GB RAM)
- 63 docs/sec, concurrency 50, two API keys round-robin
- ~4,660 dollars total cost for full CPC vectorization + ~$145/mo EC2
- Checkpoint/resume JSON, survived two incidents already
- After completion — 63M+ vectors in one collection, unified semantic search over all Ukrainian judicial practice
Runs in tmux on a dedicated EC2, checkpoint fires every 1000 docs. Snapshot sync to prod Qdrant every 6 hours via cron. Boring reliable engineering, not heroics.