Semantic Search Across 5,000+ Legislation Articles: Embeddings, Chunking, and Qdrant
Keywords find what you already know. Semantic search finds what you need. We split 12 Ukrainian codes into 5,191 articles, vectorized each one using VoyageAI embeddings, and now the query "liability for poor-quality repairs" finds articles that contain none of those words.
The Problem with Keywords
A lawyer searches for "liability for poor-quality apartment repairs." Classic search looks for these words. But Article 858 of the Civil Code talks about "defects in work" and "client's claims against the contractor." Zero keyword match — but that is exactly the right article.
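Semantic search matches on meaning rather than on shared words: the query and each article are compared as embedding vectors, so related wording can score high even with no terms in common.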
How We Built It
Step 1: Legislation Sectioning
12 Ukrainian codes are not 12 documents. They are 5,191 articles, each a self-contained unit of knowledge. Our SemanticSectionizer breaks codes into logical sections:
- Article — the primary unit (90% of cases)
- Part of article — when an article is too long (>2,000 tokens)
- Chapter/Section — for search context
Each section is stored with metadata: code name, article number, title, hierarchical path (Book → Section → Chapter → Article).
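The sectioning rule can be sketched roughly as follows. This is a minimal illustration, not the actual SemanticSectionizer: the `Section` fields mirror the metadata listed above, while `count_tokens` (a whitespace word count) and `sectionize_article` are hypothetical names and simplifications introduced here.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_TOKENS = 2000  # threshold from the text; the token counter below is a rough stand-in


@dataclass
class Section:
    code: str                    # code name, e.g. "Civil Code"
    article: str                 # article number kept as a string, e.g. "858" or "237-1"
    title: str
    path: List[str]              # hierarchical path: Book, Section, Chapter, Article
    text: str
    part: Optional[int] = None   # set only when a long article is split into parts


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    # A real tokenizer would give different counts.
    return len(text.split())


def sectionize_article(code, article, title, path, paragraphs, max_tokens=MAX_TOKENS):
    """Return one Section for the article, or several parts if it exceeds max_tokens."""
    full_text = "\n".join(paragraphs)
    if count_tokens(full_text) <= max_tokens:
        return [Section(code, article, title, path, full_text)]

    chunks, current, used = [], [], 0
    for para in paragraphs:
        n = count_tokens(para)
        # Close the current part before it would overflow the budget.
        # A single paragraph larger than the budget still becomes its own part.
        if current and used + n > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append(current)

    return [
        Section(code, article, title, path, "\n".join(chunk), part=i)
        for i, chunk in enumerate(chunks, start=1)
    ]
```

Splitting on paragraph boundaries keeps each part readable on its own; every part carries the same article number and path, so search results can still point back to the whole article.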
Step 2: Vectorization
Each section passes through VoyageAI voyage-3.5:
- Input: article text + title + contextual path
- Output: 1024-dimensional vector
- Storage: Qdrant with metadata for filtering
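A rough sketch of this step, under stated assumptions: `build_embedding_input`, `to_points`, the `" > "` path separator, and the sequential integer ids are hypothetical choices made for illustration. `voyage_embedder` uses the `voyageai` Python SDK and needs an API key; it is imported lazily so the rest of the sketch runs without it. The returned dictionaries only mirror the id/vector/payload shape of a Qdrant point; the actual upsert call and request batching are left out.

```python
from typing import Callable, Dict, List, Sequence

Embedder = Callable[[List[str]], List[List[float]]]


def build_embedding_input(title: str, path: Sequence[str], text: str) -> str:
    # Contextual path first, then the title, then the article body.
    return f"{' > '.join(path)}\n{title}\n{text}"


def voyage_embedder(texts: List[str]) -> List[List[float]]:
    # Lazy import: only needed when real embeddings are requested.
    import voyageai

    client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment
    result = client.embed(texts, model="voyage-3.5", input_type="document")
    return result.embeddings


def to_points(sections: List[Dict], embed: Embedder) -> List[Dict]:
    """Pair each section's vector with the metadata used later for filtering."""
    inputs = [build_embedding_input(s["title"], s["path"], s["text"]) for s in sections]
    vectors = embed(inputs)
    return [
        {
            "id": i,  # hypothetical id scheme; any stable unique id would do
            "vector": vec,
            "payload": {
                "code": s["code"],
                "article": s["article"],
                "title": s["title"],
                "path": list(s["path"]),
            },
        }
        for i, (s, vec) in enumerate(zip(sections, vectors))
    ]
```

Passing the embedder in as a callable makes it easy to swap `voyage_embedder` for a stub in tests, or for a different model later, without touching the payload logic.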
Step 3: Search
User query → embedding → cosine similarity in Qdrant → top-N results with a similarity score above 0.75.
Metadata filtering lets a lawyer narrow the search to a specific code, chapter, or type of provision.
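To make the scoring concrete, here is an in-memory stand-in for what the vector database does in this step. It is a sketch only: in production Qdrant performs the filtering and similarity search itself, and the `search` function, its parameters, and the point layout here are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query_vec: List[float],
    points: List[Dict],
    top_n: int = 5,
    threshold: float = 0.75,
    code: Optional[str] = None,
) -> List[Tuple[float, Dict]]:
    """Filter by metadata, score by cosine similarity, keep hits above the threshold."""
    hits = []
    for p in points:
        # Metadata filter: skip sections from other codes when a code is requested.
        if code is not None and p["payload"]["code"] != code:
            continue
        score = cosine(query_vec, p["vector"])
        if score > threshold:
            hits.append((score, p["payload"]))
    # Best matches first, then cut to the requested number of results.
    hits.sort(key=lambda h: h[0], reverse=True)
    return hits[:top_n]
```

Applying the metadata filter before scoring means the threshold and the top-N cut operate only on sections the lawyer actually asked for.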
Real Examples
| Query | Keyword search finds | Semantic search finds |
|-------|----------------------|-----------------------|
| "liability for poor-quality repairs" | Nothing | Art. 858 CC (defects in contractor's work) |
| "when you can stop paying alimony" | Nothing | Art. 188, 190, 196 FC (exemption from payment) |
| "protection against wrongful dismissal" | Articles with the word "dismissal" | + Art. 235 LC (reinstatement), Art. 237-1 (compensation) |
Cache and Freshness
- Texts are downloaded from the official Verkhovna Rada API
- Cache TTL: 30 days
- When an article changes — automatic re-indexing
- 5,191 articles × 1,024 dimensions × 4 bytes per float32 value ≈ 21 MB of raw vectors in Qdrant
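The freshness rules above could look something like this. The text does not say how a changed article is detected, so comparing a SHA-256 hash of the text is an assumption, as are all function names. The size helper assumes 4-byte float32 components and ignores index and payload overhead.

```python
import hashlib
from datetime import datetime, timedelta, timezone

CACHE_TTL = timedelta(days=30)  # TTL from the text


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def is_stale(fetched_at: datetime, now: datetime, ttl: timedelta = CACHE_TTL) -> bool:
    # A cached text is refetched once it is at least `ttl` old.
    return now - fetched_at >= ttl


def needs_reindex(cached_hash: str, fresh_text: str) -> bool:
    # Re-embed only when the article text actually changed (assumed detection method).
    return content_hash(fresh_text) != cached_hash


def raw_vector_bytes(n_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    # float32 components assumed; index and payload overhead are not counted.
    return n_vectors * dims * bytes_per_value


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(days=31), now))
    print(raw_vector_bytes(5191, 1024) / 1e6, "MB")
```

Hashing the text means a refetch after the TTL expires costs an embedding call only for articles whose wording changed.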
Semantic search does not replace exact search — it complements it. Together they provide the complete picture.