2 TB of Ukrainian Law + DeepSeek V3 860B on GCP: What We'd Get
In production we have ~1.5 TB of full-text court decisions and their vector embeddings, plus another ~550 GB of other legal data: registries, legislation, business entities, a Spanish case law corpus, EU-Lex. If we take this corpus and train an MoE model the size of DeepSeek V3, scaled to 860B parameters, on GCP — what comes out? We break down the dataset, architecture, compute cost, and the properties such a model would have on Ukrainian law.
What's in the Dataset
The entire corpus is what's already running in SecondLayer's production. No extra scrapes, no Common Crawl, no noise.
EDRSR — the dataset core, ~1.5 TB. The Unified State Register of Court Decisions of Ukraine. 96.2 million full-text decisions (1,079 GB in PostgreSQL TOAST), 471 GB of vectors in Qdrant (voyage-3.5, 1024-dim), 28 GB of metadata (court, judge, date, case category, proceeding type, statute code). Breakdown by jurisdiction: civil 33.7M, administrative 14M+, criminal 12M+, commercial 6M+, misdemeanors 6M+. Largest annual cohort — 2024 (115 GB of TOAST text).
OpenReyestr — 43 GB. Ukrainian public registries: 16.7M legal entities (EDR), ownership structures (beneficiaries, shareholders), debtors (State Enforcement Service), NAIS registries. This is the foundation for SneakyPiper — our due-diligence platform — but here it serves as raw corpus for the model.
Legislation — ~40 GB. The Constitution, major codes (Civil, Criminal, Criminal Procedure, Civil Procedure, Commercial Procedure, Administrative Procedure, Labor, Tax, Customs), laws, and secondary legislation. All structurally annotated: articles, parts, clauses, revision dates with effective-date tracking. This isn't flat text: we know that Article 124 of the Constitution took effect on a specific date, carries particular references, and is cited in a precise number of decisions.
Supreme Court review practices + lu_court_decisions — ~25 GB. SC plenary decisions, practice overviews, Grand Chamber rulings. This is the most valuable slice — the legal positions that lower courts follow.
Spanish open data — ~50 GB. BOE (official gazette), AEAT (tax rulings), Tribunal Constitucional (Constitutional Court of Spain), BORME (companies register, section C), CENDOJ (criminal law), Fiscalia, Consejo de Estado, EU-Lex ES. A multilingual bonus: the model gets European legal context in its second working language.
SecondLayer opendata shards — ~30 GB. NIPO (patents/trademarks), DPA data, spending.gov.ua, parliamentary open data (Rada: deputies, bills, votes, legislation texts from zakon.rada.gov.ua), CourtSchedule, CourtExperts.
Total: roughly 2 TB of raw data, of which 471 GB are embedding vectors rather than text. After deduplication, boilerplate filtering (standard decision headers, "enters into force upon" clauses, signatures), OCR fixes, and normalization, we expect ~800–1,000 GB of clean corpus.
In tokens (SentencePiece BPE trained on Ukrainian): approximately 280–330 billion tokens. For comparison, the original DeepSeek V3 was trained on 14.8T tokens, mostly English. Our corpus is 50x smaller, but it's focused, domain-specific, structured, and nearly unique: Common Crawl contains orders of magnitude less Ukrainian legal text.
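The token estimate can be sanity-checked with a few lines of arithmetic. This is a minimal sketch: the sizes and token counts are the figures quoted above, and the bytes-per-token values it derives are implied by those figures, not measured on the corpus.

```python
# Back-of-the-envelope check of the corpus size in tokens.
# All inputs are the figures quoted in the text; nothing here is measured.

GB = 10**9
DEEPSEEK_V3_TOKENS = 14.8e12  # pre-training tokens reported for DeepSeek V3


def implied_bytes_per_token(corpus_bytes: float, tokens: float) -> float:
    """Average bytes of clean text per token implied by a size/token pair."""
    return corpus_bytes / tokens


# 800 GB -> 280B tokens and 1,000 GB -> 330B tokens
low = implied_bytes_per_token(800 * GB, 280e9)
high = implied_bytes_per_token(1000 * GB, 330e9)

# How many times smaller the corpus is than DeepSeek V3's
ratio_low = DEEPSEEK_V3_TOKENS / 330e9
ratio_high = DEEPSEEK_V3_TOKENS / 280e9

print(f"implied bytes/token: {low:.2f} to {high:.2f}")
print(f"V3 corpus is {ratio_low:.0f}x to {ratio_high:.0f}x larger")
```

The estimate implies about 3 bytes per token. Cyrillic letters take 2 bytes each in UTF-8, so this ratio is worth measuring on a real sample with the trained tokenizer before fixing the compute budget: a higher bytes-per-token value would mean fewer tokens.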
Why DeepSeek V3 and What 860B Means
DeepSeek V3 is a Mixture-of-Experts (MoE) architecture from DeepSeek: 671B total parameters, 37B active per token. Inference is cheaper than for a dense model of the same total size because only a fraction of the experts activates on each forward pass. For our use case, with tens of millions of inference calls per month in production, that's critical.
860B is a hypothetical scale: we take the V3 topology and expand it by roughly 1.28x. Specifically: keep 61 layers, increase routed experts from 256 to ~330, and retain top-8 routing + 1 shared expert, the sigmoid router gate, and auxiliary-loss-free load balancing (as in V3). Total parameters: ~860B. One caveat on the active count: with top-8 routing and unchanged expert width, adding experts leaves active parameters per token near V3's 37B; reaching ~47B active would also require widening the experts or the attention layers. Either way, still inference-friendly.
Why this particular expansion? First, for a narrow-domain corpus more experts allow more specialized routing: one expert for "filing a claim under CPC," another for "tax rulings," a third for "Supreme Court reasoning in cassation orders." Second, 860B leaves headroom for multilingual coverage (Ukrainian + Spanish + Russian + English) without domain degradation. Third, an MoE of this size is far cheaper to train on TPU v5p than a dense model of the same parameter count, because compute per token follows active rather than total parameters.
We'd keep the architectural features of the original V3. Multi-Head Latent Attention (MLA) instead of GQA reduces the KV-cache by roughly 9x, enabling long context (256K tokens) without petabytes of RAM. A Multi-Token Prediction (MTP) head serves as an auxiliary loss during training; it densifies the training signal and unlocks speculative decoding at inference.
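To see where ~860B comes from, here is a rough parameter count. It is a sketch under stated assumptions: the hidden size (7168), expert FFN width (2048), and 58 MoE layers out of 61 are taken from DeepSeek V3's public configuration, not from this article, and everything outside the routed experts is treated as a fixed remainder.

```python
# Rough parameter count for widening DeepSeek V3 from 256 to 330 routed experts.
# Dimensions are assumptions taken from V3's public config, not from this article.

HIDDEN = 7168        # model hidden size
EXPERT_FFN = 2048    # intermediate size of one routed expert
MOE_LAYERS = 58      # 61 layers total, the first 3 are dense
TOP_K = 8            # routed experts activated per token
V3_TOTAL = 671e9     # total parameters of the original V3


def expert_params() -> int:
    """One SwiGLU expert: gate, up and down projections."""
    return 3 * HIDDEN * EXPERT_FFN


def routed_params(n_experts: int) -> int:
    """Parameters held in routed experts across all MoE layers."""
    return n_experts * MOE_LAYERS * expert_params()


def scaled_total(n_experts_new: int, n_experts_old: int = 256) -> float:
    """Total parameters after changing only the number of routed experts."""
    non_routed = V3_TOTAL - routed_params(n_experts_old)
    return non_routed + routed_params(n_experts_new)


def active_routed_params() -> int:
    """Routed-expert parameters touched per token; independent of expert count."""
    return TOP_K * MOE_LAYERS * expert_params()


print(f"total with 330 experts: {scaled_total(330) / 1e9:.0f}B")
print(f"routed params active per token: {active_routed_params() / 1e9:.1f}B")
```

Under these assumptions the total lands at about 860B. The per-token routed share does not depend on the expert count, so adding experts alone grows capacity without raising per-token compute; a higher active count would have to come from wider layers.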
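The KV-cache saving depends heavily on the baseline it is compared against. The sketch below estimates cache size at 256K tokens. The MLA figure (a 512-dim compressed latent plus a 64-dim RoPE key per token per layer) follows V3's public configuration; the GQA baseline with 8 KV heads of dimension 128 is a hypothetical comparison point, not something stated in this article.

```python
# KV-cache size estimate per sequence, in GB, for MLA vs a GQA baseline.
# MLA dims follow V3's public config; the GQA baseline is a hypothetical choice.

N_LAYERS = 61
CONTEXT = 256 * 1024      # 256K tokens
MLA_ELEMS = 512 + 64      # compressed KV latent + decoupled RoPE key


def gqa_elems(n_kv_heads: int, head_dim: int = 128) -> int:
    """Cached elements per token per layer for GQA: keys plus values."""
    return 2 * n_kv_heads * head_dim


def kv_cache_gb(elems_per_token_per_layer: int,
                context_tokens: int = CONTEXT,
                n_layers: int = N_LAYERS,
                bytes_per_elem: int = 2) -> float:
    """Cache size for one sequence, assuming 2-byte (bf16) elements."""
    total = elems_per_token_per_layer * context_tokens * n_layers * bytes_per_elem
    return total / 1e9


mla_gb = kv_cache_gb(MLA_ELEMS)
gqa_gb = kv_cache_gb(gqa_elems(n_kv_heads=8))

print(f"MLA: {mla_gb:.1f} GB, GQA(8 heads): {gqa_gb:.1f} GB, "
      f"ratio {gqa_gb / mla_gb:.2f}x")
```

With this particular baseline the ratio is smaller than the figure quoted above; a baseline with more KV heads gives a larger one. Whichever baseline is used, the MLA cache at 256K tokens stays in the tens of GB per sequence.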
Training on GCP: Config and Cost
GCP has TPU v5p pods, in our view the best platform for MoE training: they beat H100 clusters on per-chip memory (95 GB of HBM vs 80 GB) and on inter-chip interconnect (ICI) bandwidth. For an 860B MoE with 280B tokens, here's the estimate.
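Minimum production config: v5p-2048 (2,048 chips, 512 hosts). Note that GCP slice names count TensorCores, two per chip, so a 2,048-chip slice is listed as v5p-4096; the estimates here assume 2,048 chips. On this pod, one epoch over 280B tokens completes in roughly 3–4 days. Full pre-training at 3 epochs: 9–12 days of compute time. Hyperparameter search on smaller models (70B/200B variants): another 5–7 days on v5p-512.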
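The epoch time can be cross-checked with the usual 6 x active-parameters x tokens estimate of training FLOPs. This is a sketch with assumptions: 459 TFLOP/s is the published bf16 peak per v5p chip, while the model FLOPs utilization (MFU) values are guesses and not measurements from any run.

```python
# Training-time estimate from the 6 * N_active * tokens FLOPs rule of thumb.
# The MFU values are assumptions; real utilization must be measured.

V5P_PEAK_FLOPS = 459e12   # bf16 peak per TPU v5p chip
SECONDS_PER_DAY = 86_400


def training_days(active_params: float,
                  tokens: float,
                  n_chips: int,
                  mfu: float,
                  peak_flops_per_chip: float = V5P_PEAK_FLOPS) -> float:
    """Days for one pass over `tokens` at the given utilization."""
    total_flops = 6 * active_params * tokens
    sustained_flops = n_chips * peak_flops_per_chip * mfu
    return total_flops / sustained_flops / SECONDS_PER_DAY


# One epoch: 47B active parameters, 280B tokens, 2,048 chips
optimistic = training_days(47e9, 280e9, 2048, mfu=0.35)
conservative = training_days(47e9, 280e9, 2048, mfu=0.25)

print(f"one epoch: {optimistic:.1f} to {conservative:.1f} days")
```

Under these assumptions, 3–4 days per epoch corresponds to an MFU of roughly 25–32%, which is why the utilization achieved on the smaller search runs matters for the final budget.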
v5p pricing is approximately $4.20 per chip-hour on-demand, $2.50 on a 3-year commitment. At 12 days on v5p-2048 (about 590K chip-hours), the pre-training run alone comes to roughly $1.5M at the committed rate and $2.5M on-demand. Add another $200–500K for experiments + supervised fine-tuning + DPO/RLHF on a separate judicial instruction dataset. Checkpoints in GCS are large: 860B parameters in bf16 come to about 1.7 TB for the weights alone, before optimizer state, so a week of checkpoints easily runs to tens of TB.
Alternative: A3 Mega (H100) on GCP. 768 H100s (96 a3-megagpu-8g instances at 8 GPUs each) are roughly equivalent to v5p-1024 in throughput, but less efficient for MoE: NVLink only connects the GPUs inside one node, whereas ICI spans the whole TPU slice. Price is comparable but slightly worse. So: v5p.
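The compute bill is chip-hours times the hourly rate. A minimal sketch, using only the prices and durations quoted above:

```python
# Compute cost = days * 24 h * chips * price per chip-hour.
# Prices and durations are the approximate figures quoted in the text.

ON_DEMAND = 4.20   # USD per chip-hour
COMMITTED = 2.50   # USD per chip-hour, 3-year commitment


def chip_hours(days: float, n_chips: int) -> float:
    return days * 24 * n_chips


def run_cost(days: float, n_chips: int, price_per_chip_hour: float) -> float:
    return chip_hours(days, n_chips) * price_per_chip_hour


pretrain_hours = chip_hours(12, 2048)
pretrain_on_demand = run_cost(12, 2048, ON_DEMAND)
pretrain_committed = run_cost(12, 2048, COMMITTED)
search_on_demand = run_cost(7, 512, ON_DEMAND)   # upper end of the 5-7 day search

print(f"pre-training: {pretrain_hours:,.0f} chip-hours")
print(f"  on-demand: ${pretrain_on_demand / 1e6:.2f}M")
print(f"  committed: ${pretrain_committed / 1e6:.2f}M")
print(f"hyperparameter search (on-demand): ${search_on_demand / 1e6:.2f}M")
```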
Data: the source corpus lives in GCS as multi-stream TFRecord chunks (256 MB each); tokenization happens on-the-fly in the data loader via the JAX/Flax/Paxml stack. This is standard for TPU training, unlike PyTorch/FSDP on H100. Pipeline: TPU chip -> HBM -> TensorCore, no round-trip to host DRAM on the hot path.
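How documents get grouped into 256 MB chunks can be sketched as a greedy packer. This is an illustration only: `plan_shards` is a hypothetical helper, it works on document sizes rather than real TFRecord files, and the production pipeline may shard differently.

```python
# Greedy grouping of documents into fixed-size shards.
# `plan_shards` is a hypothetical helper for illustration, not pipeline code.
from typing import Iterable, Iterator, List

SHARD_BYTES = 256 * 1024 * 1024  # 256 MB target per chunk


def plan_shards(doc_sizes: Iterable[int],
                shard_bytes: int = SHARD_BYTES) -> Iterator[List[int]]:
    """Yield lists of document indices; each list fills one shard.

    Documents keep their order. A document larger than the shard size
    still gets a shard of its own instead of being dropped.
    """
    current: List[int] = []
    used = 0
    for idx, size in enumerate(doc_sizes):
        # Close the current shard if this document would overflow it.
        if current and used + size > shard_bytes:
            yield current
            current, used = [], 0
        current.append(idx)
        used += size
    if current:
        yield current


# Tiny example with a 250-byte shard size
shards = list(plan_shards([100, 100, 100, 300, 50], shard_bytes=250))
print(shards)
```

Keeping documents whole within a shard means each chunk can be streamed and tokenized independently by a separate loader process.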
Expected Model Properties
What do we get by running this corpus through this much compute?
First: native Ukrainian legal reasoning. As of today, no frontier model truly knows Ukrainian law: not GPT-4o, not Claude Opus 4.7, not Gemini 2.5. They hallucinate Civil Code articles, confuse pre- and post-2022 code revisions, and can't distinguish administrative from civil proceedings. Our model would ingest 280B tokens of legal text, likely hundreds of times more Ukrainian legal text than any frontier model's pre-training dataset contains.
Second: fine-grained citation. Because the corpus is structured (each chunk carries its doc_id, category, date, article reference), the model learns not just "there's an article in the code somewhere..." but rather "pursuant to Article 611 of the Civil Code of Ukraine (revision of 17.06.2020), in cases concerning recovery of penalties..." This isn't retrieval-augmented generation; it's a property the model acquires in its weights from the pre-training signal itself.
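One way to expose that structure to the model is to serialize the metadata into each training sample. The sketch below is an assumption about how this could look: the field names mirror the ones listed above (doc_id, category, date, article reference), but the `Chunk` class, the tag layout, and the sample values are hypothetical.

```python
# Serializing chunk metadata into the training text.
# The tag layout and the Chunk class are hypothetical, for illustration only.
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    category: str
    date: str          # decision date, ISO format
    article_ref: str   # statute article the chunk refers to
    text: str


def format_sample(chunk: Chunk) -> str:
    """Prefix the chunk text with a one-line metadata header."""
    header = (
        f'<doc id="{chunk.doc_id}" category="{chunk.category}" '
        f'date="{chunk.date}" article="{chunk.article_ref}">'
    )
    return f"{header}\n{chunk.text}\n</doc>"


sample = format_sample(Chunk(
    doc_id="example-0001",                 # made-up identifier
    category="civil",
    date="2021-03-15",
    article_ref="Civil Code of Ukraine, Art. 611",
    text="Pursuant to Article 611 of the Civil Code of Ukraine...",
))
print(sample)
```

Because the header sits in the same token stream as the decision text, the model sees the article reference and date alongside every passage that relies on them.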
Third: reasoning over precedents. With 96M decisions carrying full metadata (cassation/appellate/first instance, judicial district, reporting judge, date), the model learns how lower courts apply Supreme Court legal positions, how practice evolves over time, and where splits exist between chambers. This is no longer just "information synthesis" — it's legal reasoning trained on real decisions.
Fourth: graph logic for beneficiaries and connections. 16.7M entities in OpenReyestr + SneakyPiper relationship graphs provide raw material for the model to internally build a knowledge graph of the Ukrainian business world. With proper formatting of training samples (triples like "company–beneficiary–ownership %" as text), the model learns to generate hypotheses such as "if person X is the ultimate beneficiary of 3 companies sharing the same attorney, it's worth checking connections with the offshore registry."
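Turning ownership triples into text can be sketched as follows. The sentence template, the `Ownership` type, and the example names are all hypothetical; the real sample format would be tuned during data preparation.

```python
# Rendering "company - beneficiary - ownership %" triples as training text.
# The template and the example data are hypothetical.
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Set


class Ownership(NamedTuple):
    company: str
    beneficiary: str
    share_pct: float


def triple_to_text(t: Ownership) -> str:
    """One ownership triple as a plain sentence."""
    return (f"{t.beneficiary} is a beneficial owner of {t.company} "
            f"with a {t.share_pct:g}% stake.")


def companies_by_beneficiary(triples: Iterable[Ownership]) -> Dict[str, Set[str]]:
    """Group companies under each beneficiary, the basis for link hypotheses."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for t in triples:
        grouped[t.beneficiary].add(t.company)
    return dict(grouped)


triples = [
    Ownership("Alfa LLC", "Person X", 25.0),
    Ownership("Beta LLC", "Person X", 100.0),
    Ownership("Gamma LLC", "Person Y", 50.5),
]
lines = [triple_to_text(t) for t in triples]
groups = companies_by_beneficiary(triples)
print("\n".join(lines))
```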
Fifth: multilingual bridge function. The Spanish corpus (~50 GB) + EU-Lex ES + Ukrainian legislative texts creates a mapping between EU and Ukrainian criminal-law concepts — useful for extradition matters, MLAT requests, and cases with a foreign element. This isn't professional translation; it's a shared reasoning space.
Sixth: radically lower hallucination on domain queries. We expect that on a test set measuring "correct answer with article/precedent citation" we'd achieve 85–92% accuracy — compared to 40–55% for general-purpose frontier models. This is an experimental estimate, but on small variants (7B/70B fine-tuned on a corpus subset) we already see these numbers.
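The metric "correct answer with article/precedent citation" can be scored roughly as below. This is a sketch under assumptions: the record fields (`answer_correct`, `cited`, `gold_citations`) are hypothetical names, and a real evaluation would need to normalize citation strings before comparing them.

```python
# Scoring "correct answer with a correct citation" over a test set.
# Field names are hypothetical; citation strings are compared verbatim.
from typing import Dict, List


def citation_accuracy(results: List[Dict]) -> float:
    """Share of items where the answer is right AND the citation is in the gold set."""
    if not results:
        return 0.0
    hits = sum(
        1 for r in results
        if r["answer_correct"] and r["cited"] in r["gold_citations"]
    )
    return hits / len(results)


results = [
    {"answer_correct": True, "cited": "CC Art. 611", "gold_citations": {"CC Art. 611"}},
    {"answer_correct": True, "cited": "CC Art. 612", "gold_citations": {"CC Art. 611"}},
    {"answer_correct": False, "cited": "CC Art. 611", "gold_citations": {"CC Art. 611"}},
    {"answer_correct": True, "cited": "CC Art. 625", "gold_citations": {"CC Art. 625", "CC Art. 611"}},
]
score = citation_accuracy(results)
print(f"citation accuracy: {score:.0%}")
```

Requiring both conditions is what makes the metric strict: a right answer with a wrong article, or a right article attached to a wrong answer, counts as a miss.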
What the model would NOT do better than frontier models: general reasoning outside jurisprudence, math, code, creative writing in non-legal genres, niche English-language context. For those, production retains multi-model orchestration: lightweight queries go to a quick model, complex legal queries to our own, general queries to Claude/GPT.
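The routing rule can be sketched as a small dispatcher. In production this decision belongs to the intent classifier; the keyword list, the word-count threshold, and the route names here are hypothetical stand-ins used only to show the three-way split.

```python
# Three-way query routing: quick model, own legal model, general frontier model.
# The keyword heuristic is a hypothetical stand-in for a real intent classifier.
from enum import Enum


class Route(Enum):
    QUICK = "quick-model"
    LEGAL = "own-legal-model"
    GENERAL = "general-frontier-model"


LEGAL_HINTS = {"article", "court", "code", "decision", "claim"}  # hypothetical


def route_query(query: str, max_quick_words: int = 6) -> Route:
    """Short queries go to the quick model, legal ones to our own, the rest out."""
    words = [w.strip(".,?!:;\"'()").lower() for w in query.split()]
    if len(words) <= max_quick_words:
        return Route.QUICK
    if LEGAL_HINTS.intersection(words):
        return Route.LEGAL
    return Route.GENERAL


print(route_query("hi there"))
print(route_query("Explain how the court applied article 611 to penalty recovery"))
print(route_query("Write a poem about the sea for a sailor returning home"))
```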
What This Means for SecondLayer in Production
Right now we run multi-agent orchestration: intent classifier, retrieval planner, embedding via Voyage, Qdrant search, context building, query to GPT-4o/Claude, post-processing. This is expensive ($0.01–0.05 per query), slow (3–8 seconds per response), and dependent on OpenAI/Anthropic not cutting off Ukraine tomorrow.
With our own model:
- Inference at half the cost of OpenAI at comparable domain quality, because we don't pay for tokens that went into general pre-training
- 1–2 second latency instead of 3–8, because the query no longer travels trans-Atlantic through a retrieval pipeline
- Self-hosted on EU servers, GDPR-compliant, with no dependency on an external provider
- Ability to fine-tune for new task types (tax, labor, attorney ethics) without paying for retraining frontier models
The key insight: what we currently have on disk isn't just "data." It's the world's largest domain corpus for training a Ukrainian legal AI model. No foreign player has this corpus, and none will have it for years. No open dataset (Pile, RedPajama, Dolma, FineWeb) comes close to containing this much judicial practice from any jurisdiction.
The question isn't whether it's worth doing. The question is when and with whom. $3–5M for pre-training is seed-to-Series-A territory — this is done with a single strategic investor who sees the Ukr-legal-AI market as a distinct category. We already have the pipeline, the corpus, and the team that keeps prod running on 96M decisions without downtime.
Next — compute.
Author: Volodymyr Ovcharov. legal.org.ua