RAG Highlights, Training Orients: What to Do About Heterogeneity in Court Practice
A comment under the previous article, on EDRSR vectorization, made a sharp observation: "the problem has shifted from simple access to practice to managing its heterogeneity." That's a precise framing. We break down why authority weights in RAG are only half the answer, what training your own model on this corpus actually adds, and why production needs both layers.
The Problem: The Corpus Honestly Reflects Chaos
96 million court decisions in open access isn't just a large database. It's a mirror of the actual state of legal practice. And that mirror reveals:
- Splits between Supreme Court chambers. The Civil Cassation Court holds position A, the Commercial Cassation Court holds B, for years. The Plenum resolves it after 2-3 years, but until then lower courts apply different standards.
- Temporal drift. A position before vs. after the 2022 code revision, before vs. after a 2023 Grand Chamber ruling. A semantically identical phrase in a 2018 decision and a 2024 decision means different things.
- Poorly reasoned decisions that are formally binding. A one-paragraph justification that no one appealed is official, but from a quality-of-reasoning standpoint it's almost noise.
- Lower court inertia. Even after a consolidating Plenum position, some courts keep applying the old practice for years.
- Contradictions within the same time period. Two decisions from the same chamber a month apart that directly contradict each other.
Flat retrieval — whether FTS, kNN on embeddings, or a hybrid — doesn't distinguish any of this. It returns the top-K by similarity, and the lawyer sorts out what carries weight and what's noise on their own.
First Layer: RAG with Authority Weights
Our current answer is to attach an offline-computed authority weight to the payload of each chunk in Qdrant. The weight is derived from several signals.
Court level. Grand Chamber of the Supreme Court > SC chamber > appellate > first instance. The basic hierarchy.
Reasoning density. This isn't text length. It's the proportion of paragraphs containing statutory references, precedent tracing, legislative citations, or application of a legal test. Computed via regex + an ML classifier trained on expert-annotated samples of "strong reasoning" vs. "boilerplate."
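The regex half of this signal can be sketched as follows. This is a minimal illustration, not the production pipeline: the marker patterns, the English wording they match, and the function name are all assumptions, and the ML classifier is left out entirely.

```python
import re

# Hypothetical marker patterns (assumptions for illustration only).
# A real pipeline would target Ukrainian legal phrasing and add the ML classifier.
MARKERS = [
    re.compile(r"\bArticle\s+\d+", re.IGNORECASE),                    # statutory reference
    re.compile(r"\bcase\s+No\.?\s*\S+", re.IGNORECASE),               # citation of another decision
    re.compile(r"\b(Grand Chamber|Supreme Court)\b", re.IGNORECASE),  # precedent tracing
]

def reasoning_density(text: str) -> float:
    """Share of non-empty paragraphs that contain at least one marker."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    hits = sum(1 for p in paragraphs if any(m.search(p) for m in MARKERS))
    return hits / len(paragraphs)

sample = (
    "The court considered the claim.\n\n"
    "Under Article 611 of the Civil Code, breach entails legal consequences.\n\n"
    "The Grand Chamber reached the same conclusion in case No. 123/456/20.\n\n"
    "The claim is granted."
)
density = reasoning_density(sample)  # 2 of 4 paragraphs carry a marker
```

Because the score is a proportion of paragraphs, a long decision made of boilerplate scores no higher than a short one.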
Citation index. How many other decisions cite this one. We build a citation graph across the corpus; node weight is PageRank seeded from authoritative sources (Grand Chamber).
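One way to read "PageRank seeded from authoritative sources" is personalized PageRank, where the teleport mass goes only to seed nodes. The sketch below assumes that reading; the function name, the damping factor, and the toy graph are illustrative, not the actual implementation.

```python
def seeded_pagerank(cites, seeds, damping=0.85, iterations=50):
    """Personalized PageRank by power iteration.

    cites maps a decision id to the ids it cites; seeds (non-empty)
    receive all of the teleport mass.
    """
    nodes = set(cites) | {t for targets in cites.values() for t in targets} | set(seeds)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = cites.get(n, [])
            if targets:
                # Split this node's rank evenly among the decisions it cites.
                share = damping * rank[n] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node (cites nothing): return its mass to the seeds.
                for s in seeds:
                    nxt[s] += damping * rank[n] / len(seeds)
        rank = nxt
    return rank

# Toy graph with hypothetical ids.
cites = {
    "gc_1": ["sc_1"],
    "appeal_1": ["sc_1", "gc_1"],
    "first_1": ["appeal_1"],
    "first_2": [],
}
ranks = seeded_pagerank(cites, seeds={"gc_1"})
```

In this pure form, a decision that is not reachable from any seed through citations ends up with zero rank. Blending in a small uniform teleport is one possible way to soften that.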
Reversal status. If a decision was overturned on cassation, its weight drops. If its position was explicitly rejected by a later Supreme Court ruling, it drops even further.
Alignment with the Supreme Court. How closely the legal position in a chunk matches the prevailing Supreme Court position on that topic as of the date of the decision.
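The article does not give the formula that merges these signals, so the following is only one plausible shape: a linear blend of the four graded signals, multiplied by a reversal penalty. Every coefficient, table value, and name here is an assumption.

```python
from dataclasses import dataclass

# Hypothetical lookup tables; the real values are not stated in the article.
COURT_LEVEL = {"grand_chamber": 1.0, "sc_chamber": 0.8, "appellate": 0.5, "first_instance": 0.3}
REVERSAL_FACTOR = {"none": 1.0, "overturned": 0.5, "rejected_by_sc": 0.2}

@dataclass
class Signals:
    court: str
    reasoning_density: float  # 0..1
    citation_rank: float      # 0..1, normalized citation-graph score
    reversal: str             # key of REVERSAL_FACTOR
    sc_alignment: float       # 0..1

def authority_weight(s: Signals) -> float:
    """Linear blend of graded signals, scaled down by reversal status."""
    base = (
        0.35 * COURT_LEVEL[s.court]
        + 0.25 * s.reasoning_density
        + 0.20 * s.citation_rank
        + 0.20 * s.sc_alignment
    )
    return round(base * REVERSAL_FACTOR[s.reversal], 3)

strong = authority_weight(Signals("grand_chamber", 0.8, 0.9, "none", 1.0))
weak = authority_weight(Signals("appellate", 0.2, 0.05, "none", 0.5))
reversed_strong = authority_weight(Signals("grand_chamber", 0.8, 0.9, "rejected_by_sc", 1.0))
```

Treating reversal as a multiplier rather than one more additive term means an overturned decision cannot be rescued by high scores on the other signals.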
These weights go into the payload, and retrieval shifts from "here are the 10 most similar" to "here are the 10 most similar with authority weights and doctrinal cluster membership." The lawyer sees: in my topic there are positions A and B. Position A has a weight of 0.82 (Grand Chamber, dense reasoning, 340 citations), position B has 0.41 (lone appellate court, three citations, boilerplate). The lawyer decides how to build their argument.
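The step from "top-K by similarity" to "positions with weights" can be sketched in plain Python. The hit dictionaries stand in for whatever the vector store returns; the field names `similarity`, `weight`, and `cluster` are hypothetical, and no Qdrant API is used here.

```python
from collections import defaultdict

def group_by_position(hits, top_k=10):
    """Take the top_k hits by similarity and summarize each doctrinal cluster.

    Returns {cluster: mean authority weight}, strongest cluster first.
    """
    top = sorted(hits, key=lambda h: h["similarity"], reverse=True)[:top_k]
    clusters = defaultdict(list)
    for h in top:
        clusters[h["cluster"]].append(h["weight"])
    summary = {c: round(sum(ws) / len(ws), 2) for c, ws in clusters.items()}
    return dict(sorted(summary.items(), key=lambda kv: kv[1], reverse=True))

hits = [
    {"id": 1, "similarity": 0.91, "weight": 0.82, "cluster": "A"},
    {"id": 2, "similarity": 0.89, "weight": 0.41, "cluster": "B"},
    {"id": 3, "similarity": 0.88, "weight": 0.80, "cluster": "A"},
]
positions = group_by_position(hits)
```

Similarity still decides what is retrieved; the weights only change how the result is presented to the lawyer.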
This is a step forward. But it's still a tool — the lawyer needs to know how to read the weights.
The Limits of This Approach
The problem with external weights is that they're scalar and context-blind.
In a narrow topic where the Grand Chamber hasn't weighed in, a fresh, well-reasoned first-instance decision may be the best available resource — but its formula-derived weight will be low.
Two positions can have similar weights, but one is "preservation of the past" while the other is "a trend gaining momentum." The weight doesn't show this.
A contradiction between two decisions both scoring 0.7 isn't explicitly flagged — the lawyer has to spot it themselves in the payload.
Weights are a good filter, but they don't navigate heterogeneity. They merely rank it.
Second Layer: Training a Domain Model
In the previous article we discussed what training an MoE model the size of DeepSeek V3 on a 2 TB corpus looks like. Here we look at what that training actually adds compared to RAG + weights.
Weighted sampling during pre-training. We don't feed the model the entire corpus sequentially. Instead, we sample decisions with high authority weights 3-5x more frequently. The model sees strong argumentation as the statistically dominant pattern and absorbs it as its default style rather than as an external filter: it writes with strong reasoning by default, not because we asked it to.
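A minimal sketch of the 3-5x oversampling, assuming a threshold above which a decision counts as "high authority" and a linear ramp from 3x to 5x. The threshold of 0.7, the ramp, and the function names are assumptions; only the 3-5x range comes from the text.

```python
import random

def sampling_multiplier(weight: float, low: float = 0.7) -> float:
    """Hypothetical mapping: weights at or above `low` get 3x to 5x, the rest 1x."""
    if weight < low:
        return 1.0
    return 3.0 + 2.0 * (weight - low) / (1.0 - low)

def sample_batch(decisions, batch_size, rng):
    """Draw decisions with replacement, in proportion to their multiplier."""
    multipliers = [sampling_multiplier(d["weight"]) for d in decisions]
    return rng.choices(decisions, weights=multipliers, k=batch_size)

corpus = [
    {"id": "strong", "weight": 0.9},
    {"id": "weak", "weight": 0.3},
]
batch = sample_batch(corpus, batch_size=10_000, rng=random.Random(0))
strong_count = sum(1 for d in batch if d["id"] == "strong")
weak_count = len(batch) - strong_count
```

Weak decisions are still sampled at the base rate, so the model sees them, just not as the dominant pattern.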
DPO on pairs from senior lawyers. After pre-training comes supervised fine-tuning, then Direct Preference Optimization on pairs (answer A, answer B) to the same question, with "better answer" labels from experienced practicing lawyers. This literally bakes editorial judgment into the model weights. RAG can't do this — it returns top-K and hands the choice to an LLM that has no domain-specific quality criteria.
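For reference, the standard per-pair DPO objective can be written in a few lines. The inputs are summed log-probabilities of the preferred and rejected answers under the policy being trained and under a frozen reference model. The value beta=0.1 and the argument names are illustrative assumptions, not the training configuration used here.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * margin)).

    margin = how much more the policy favors the chosen answer over the
    rejected one, relative to the reference model.
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    x = beta * margin
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow in exp.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))

neutral = dpo_loss(-50.0, -50.0, -50.0, -50.0)  # policy equals reference
agrees = dpo_loss(-40.0, -60.0, -50.0, -50.0)   # policy favors the lawyer's pick
disagrees = dpo_loss(-60.0, -40.0, -50.0, -50.0)
```

The loss falls as the policy shifts probability toward the answer the lawyer preferred, which is how the preference labels end up in the weights.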
Conflict as output, not collateral noise. A model trained on a corpus with explicit annotation of "position A vs. position B on topic X" can answer in a single forward pass: "There's a split on this topic. The Civil Cassation Court holds A (examples: decisions 1, 2, 3). The Commercial Cassation Court holds B (examples: decisions 4, 5). The 2023 Grand Chamber Plenum leaned toward B. Lower courts still apply A out of inertia, especially in regions X, Y. For your fact pattern, rely on B, because Z." This is reasoning over doctrine, not searching for similar chunks.
Temporal competence. With retrieval, date is a filter that has to be specified explicitly: "search before 2022." A model trained on 280B tokens of Ukrainian law, where the date is part of each decision's context, learns: "before the 2020 revision of Article 611 of the Civil Code the position was Y; after it, Z." This is powerful for questions like "how is Article N currently applied", where the whole point is that "currently" has its own history.
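The difference between "date as a filter" and "date as context" can be made concrete with two small functions. The header format and field names are assumptions for illustration; the article does not specify how dates are embedded in the training text.

```python
from datetime import date

def date_filter(decisions, before: date):
    """Retrieval side: the caller has to state the cutoff explicitly."""
    return [d for d in decisions if d["date"] < before]

def with_date_context(decision_date: date, court: str, text: str) -> str:
    """Training side: a metadata header makes the date part of the text itself."""
    return f"[date: {decision_date.isoformat()}] [court: {court}]\n{text}"

decisions = [
    {"date": date(2018, 5, 14), "court": "Civil Cassation Court", "text": "Position Y ..."},
    {"date": date(2024, 3, 2), "court": "Civil Cassation Court", "text": "Position Z ..."},
]
older = date_filter(decisions, before=date(2022, 1, 1))
records = [with_date_context(d["date"], d["court"], d["text"]) for d in decisions]
```

With the header in place, every training example pairs a legal position with the moment it was stated, so the temporal signal does not depend on the user knowing which cutoff to ask for.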
Cross-doctrinal coherence. The model sees connections between doctrines in a single forward pass: "the position on your question conflicts with the Supreme Court's position on the adjacent issue X — note that in your fact pattern this could play a role." This isn't "find similar" — it's finding logical dissonances in practice.
Important Caveat: Training Without Filtering = Confident Hallucinations
You can't just train a model on the entire corpus and expect legal reasoning to emerge magically. If we don't filter noise and poorly reasoned decisions at the input stage, the model absorbs them as "normal" argumentation — and starts confidently reproducing weak legal reasoning. This is worse than honest RAG, which at least leaves the choice to the lawyer.
That's why the pipeline must be surgical.
Authority-weighted sampling during pre-training makes strong material appear more frequently. The SFT dataset comes only from senior lawyers, not rank-and-file annotators. The eval set includes "multi-valid" cases where the correct answer is "here are positions with weights, here's the trend, rely on B given the context." The model learns to flag contradictions, not silently pick a side.
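A "multi-valid" eval case could be represented and scored roughly like this. The data structure, the scoring rule, and the string matching are all assumptions; matching substrings is deliberately crude, and a real eval would more likely rely on lawyer review or a grading model.

```python
from dataclasses import dataclass

@dataclass
class MultiValidCase:
    question: str
    positions: list  # labels of every position the answer must surface
    preferred: str   # position to rely on given the context

def score_answer(case: MultiValidCase, answer: str) -> float:
    """Half credit for naming every position, half for the right recommendation.

    The recommendation only counts if the split itself was fully surfaced,
    so an answer that silently picks a side cannot score well.
    """
    text = answer.lower()
    mentioned = [p for p in case.positions if p.lower() in text]
    coverage = len(mentioned) / len(case.positions)
    recommends = f"rely on {case.preferred}".lower() in text
    return 0.5 * coverage + 0.5 * (recommends and coverage == 1.0)

case = MultiValidCase(
    question="Which standard applies to topic X?",
    positions=["position A", "position B"],
    preferred="position B",
)
flagged = score_answer(
    case,
    "There is a split: position A and position B. Rely on position B given the trend.",
)
silent = score_answer(case, "Position B applies.")
```

Here the answer that flags the split scores higher than the one that happens to land on the preferred side without mentioning the alternative.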
This is important to say out loud, because conversations about domain models usually sound like "we'll train it and everything will be fine." It won't. You'll get a different set of problems if you don't build epistemic caution into the training procedure itself.
Delivery in Production: Both Layers
In a production system these aren't mutually exclusive.
RAG with weights stays for questions where full source transparency is needed: the lawyer wants to see every specific decision with numbers and metadata. This is the case when they're preparing a court filing.
The domain model handles initial navigation, reasoning over doctrine, and explaining "what matters in this topic and what to rely on." This is the case when a lawyer enters unfamiliar territory or needs a quick synthesis.
An orchestration layer decides which one to activate depending on the query type. A simple precedent search goes to RAG. A question like "how has practice formed in this area and where is it heading" goes to the model. A combined query means switching between them within a single session.
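The routing decision can be illustrated with a toy keyword router. This is a sketch under strong assumptions: the hint patterns and the three labels are invented for the example, and a production orchestrator would more plausibly use a trained query classifier.

```python
import re

# Hypothetical hint patterns (assumptions for illustration only).
SYNTHESIS_HINTS = re.compile(
    r"\b(how has|trend|heading|overview|explain|what matters)\b", re.IGNORECASE
)
LOOKUP_HINTS = re.compile(
    r"\b(find|decisions? on|case no|cite|precedents? for)\b", re.IGNORECASE
)

def route(query: str) -> str:
    """Return 'rag', 'model', or 'both' for a user query."""
    wants_synthesis = bool(SYNTHESIS_HINTS.search(query))
    wants_lookup = bool(LOOKUP_HINTS.search(query))
    if wants_synthesis and wants_lookup:
        return "both"
    if wants_synthesis:
        return "model"
    # Default to the transparent layer when nothing signals a synthesis request.
    return "rag"

lookup = route("Find decisions on lease termination")
synthesis = route("How has practice formed in this area and where is it heading")
combined = route("Explain the trend and find precedents for my filing")
```

Falling back to RAG when the query is ambiguous keeps the lawyer on the layer where every source is visible.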
Why the Commenter Was Right
The problem has indeed shifted. Five years ago the market asked: "let me search EDRSR faster and more accurately." That was about access.
Now that access is solved, the corpus is exhaustively indexed, and vector search works, the question is different: "how do I not just find what's relevant, but understand what I can actually rely on and why." This is no longer a retrieval problem. It's an epistemic problem.
RAG with authority weights is the first instrument of response. It gives the lawyer a transparent picture with rankings.
Domain model training is the second instrument. It transforms the model from a search engine into a co-lawyer that navigates the doctrinal landscape on its own and explains its choices.
The end goal isn't replacing the lawyer with a model. The goal is giving the lawyer a tool that understands the heterogeneity of practice and highlights what can be safely relied upon, and where you need to go to primary sources and verify manually.
From access to reliance. That's the right framing for the next iteration.
Author: Volodymyr Ovcharov. legal.org.ua