Follow-up to our tokenizer fertility study. Five experiments across SIB-200, EU Acts (24 languages), and ULP datasets. Tokenizer fertility is domain-invariant (1.63x on news vs 1.60x on legal). Few-shot degradation is task-dependent, not language-intrinsic. Ukrainian costs 20-40% more to tokenize than cognate Slavic languages.
ACADEMIC 15 min read (experiments in progress)
#Few-Shot Learning #Tokenizer #Ukrainian NLP #Cross-Lingual #SIB-200 #Slavic Languages
Half a billion citation edges extracted from 100.7 million Ukrainian court decisions reveal that judicial citation structure encodes legal domain boundaries without supervision and predicts future legislative importance with near-perfect accuracy (AUC = 0.9984).
Edit-traces from production agentic workflows produce alignment signal that is denser, more outcome-predictive, and distributionally unlike conventional RLHF preference data. 80.7% of edits are substantive rewrites; binary rejection correlates with 78% positive outcomes — the strongest oversight signal.
Ontology-based filtering of human oversight signal predicts downstream outcome quality: sessions classified as full oversight by a formal domain constitution exhibit a 3-6x higher rejection rate, concentrating the most informative alignment actions.
ACADEMIC 40 min read (full paper)
#Cybernetics & Systems Analysis #Ontology #OWL 2 DL #Alignment #Formal Methods
Sixty percent of context tokens in current LLM agentic sessions are wasted — redundant re-explanation of decisions already made in prior sessions. The key insight: the memory layer produces alignment data (retrieval-correction signal), not just consumes it.
Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. Qwen 3 consumes 60% more tokens than Llama-family; NVIDIA Nemotron Super 3 (120B) outperforms Mistral Large 3 at 1/3 the cost.
ACADEMIC 35 min read (full paper)
#arXiv preprint #Tokenizer #Ukrainian NLP #Foundation Models #Legal AI
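The fertility comparison in the tokenizer study above can be illustrated with a minimal sketch. It assumes fertility is measured as tokens per whitespace-delimited word, which is the common definition; the study's exact definition is not stated in this summary. The two tokenizers below are hypothetical toy stand-ins, not the tokenizers of any model named above.

```python
from typing import Callable, Sequence


def fertility(texts: Sequence[str], tokenize: Callable[[str], Sequence[str]]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


# Toy stand-ins (assumptions for illustration only).
def word_tokenizer(text: str) -> list:
    # One token per word: the lowest possible fertility, 1.0.
    return text.split()


def char_pair_tokenizer(text: str) -> list:
    # Splits every word into pieces of at most 2 characters.
    return [w[i:i + 2] for w in text.split() for i in range(0, len(w), 2)]


sample = ["суд ухвалив рішення"]

baseline = fertility(sample, word_tokenizer)
fragmenting = fertility(sample, char_pair_tokenizer)

# Relative cost of one tokenizer against another on the same text.
relative_cost = fragmenting / baseline
print(f"baseline={baseline:.2f} fragmenting={fragmenting:.2f} ratio={relative_cost:.2f}x")
```

Because API pricing is per token, the ratio between two tokenizers on the same corpus translates directly into a relative cost figure.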
A comment under the previous article nailed it: "the problem has shifted from access to practice to managing its heterogeneity." Precise framing. We break down why authority weights in RAG are only half the answer, what training your own model actually adds, and why production needs both layers.
We have ~1.5 TB of EDRSR with vectors + ~550 GB of registries, legislation, Spanish sources, and EU-Lex running in prod. If we push all of this through an MoE model the size of DeepSeek V3, scaled to 860B on TPU v5p — what comes out? We break down the dataset, architecture, compute cost, and model properties.
EDRSR is the open-access Unified State Register of all Ukrainian court decisions. 44M+ vectors in Qdrant, 14.3M civil cases already processed out of 33.7M. Here's the pipeline: chunking, concurrency, checkpoint/resume, a dedicated EC2 for Qdrant, and the cost math.
Our OSINT product SneakyPiper.com runs due diligence for US businesses. Under the hood: 16.7M OpenSanctions entities, 31K AI-classified dark-web forum subjects, a live feed of ransomware victims and GitHub credential leaks. Here's what lives in production — by the numbers.
Google Cloud asks 5 questions before allocating GPUs. We break them down into 9 ML competencies, from LoRA on 70B and continued pre-training of DeepSeek-V3 685B to RLHF with constitutional alignment and capacity planning for a $200K+ training run. Concrete examples from our real stack.
TECH 12 min
#Machine Learning #LLM #Hiring #RLHF #Fine-tuning #Vertex AI
Concrete task buckets waiting for contributors: OpenData adapters, ML experiments, frontend, performance, tests. Our only "interview" is your first pull request. AI-assisted code is welcome — we write with Claude Code every day.
LEX AI is opening its platform as open source. We welcome strong engineers — AI/ML, backend, data, frontend — to contribute or join the team. What's already open, who we're looking for, and how to get involved.
Your laptop is not a 32-CPU machine. npm install competes with Docker for disk. TypeScript OOMs on a large monorepo, and Playwright cannot exploit parallelism. We break down how to move GitHub Actions runners to AWS, from c7g Spot instances to actions-runner-controller on EKS, and get a 3-5× build speedup without the pain of running everything locally.
Harvey spent $100M+ and 10B tokens fine-tuning a case law model with OpenAI. We connected Opus to 100M+ court decisions from EDRSR via RAG. Both paths work — but for different realities.
TECH 22 min
#LLM #Fine-tuning #RAG #Claude Opus #Harvey AI #OpenAI #Google #DeepSeek #EDRSR #Legal AI
800+ sessions, 10,000+ messages, 1,200+ commits, 328,000 lines of code, 40,000+ bash commands, and zero hired developers. Real usage statistics from 50 days of continuous work with Claude Code building a legal tech platform.
How to ensure that a model with access to 50M+ records doesn't become a tool for pressuring the innocent? Asimov's Three Laws adapted for legal AI, threat scenarios, and architectural solutions for RLHF training on GCP.
5 categories cover 90% of the EDRSR corpus. How the long tail undermines RLHF, why the model becomes a "civilist" (a civil-law specialist), and what strategies we are implementing on GCP for $240K over 6 months.
How Articles 3, 28, 32, 62 of the Constitution become reward functions in RLHF training. Presumption of innocence as a hardcoded rule, constitutional collisions, and a benchmark of 500+ scenarios.
Three separate models — judge, prosecutor, advocate — with information isolation reproduce adversarial proceedings. Instance specialization, result trees, and adversarial training on GCP.
30 articles, 9 sections, open license. LEX AI initiates an industry standard for LegalTech models, from presumption of innocence to wartime protections, with direct implementation in the reward model.
3 services, 1 PostgreSQL, shared Redis, one docker-compose — and the illusion of independence. How to spot a distributed monolith in your own architecture, when it's actually useful, and when it's time for real separation.
Multi-IP import, automated scheduler, freshness monitoring, international expansion — data pipeline engineering for open data across 6 jurisdictions. From the first 404 to stable nightly updates of 110+ tables.
An in-depth analysis of 5 Grand Chamber of the Supreme Court cases and TCC fine rulings based on full decision texts and separate opinions of justices. Found factual errors, overlooked separate opinions by Justices Mazur, Pohribnyi, and Yemets, a key proportionality finding, and inaccuracies regarding party composition.
5 parallel white-hat agents audited the platform for GDPR and OWASP Top 10 compliance. Found 23 vulnerabilities — from SQL injection to Google Ads firing before consent. Fixed 10 critical issues in one session. Full security architecture breakdown: Cloudflare, TLS 1.3, CSP, rate limiting, WebAuthn, E2EE.
EDRSR, sanctions, patents, attorneys, judges, legislation, parliament, registries — every open data source currently running in production. What we have, how to use it, and what's coming next.
We compiled 66 queries, each activating a specific platform tool — from court decision search to trademark verification. Plus 20 complex queries using 2–3 tools simultaneously. All designed for minimal LLM usage — maximum precision, minimum cost.
56 tools instead of 12 browser tabs. Semantic search across 45M decisions. Full-text analysis in seconds. Due diligence in one query. Not a replacement for a lawyer — an exoskeleton for their mind.
Importing Spanish legal data from BOE and CENDOJ. Geo-detection of locale. Automatic localization in 4 languages. New MCP tools for Spanish legislation. From Kyiv to Madrid — one codebase.
ECDSA for signatures, SHA256 for hashing. Redis key mismatch between start and verify. QR code and deep link. Business data updates on every login. 4 fixes in 24 hours. A real integration story, unfiltered.
11 state registries from data.gov.ua imported into the platform: enforcement proceedings, debtors, notaries, bankruptcy, legal acts and more — all accessible to lawyers through AI chat.
We launched platform.legal.org.ua — a portal for developers who want to integrate legal AI into their products. API keys, usage analytics, documentation for 56 tools, examples for Python and TypeScript. MCP SSE, REST, batch — three transports to choose from. From signup to first request — 5 minutes.
126,934 decisions under Art. 407 of the Criminal Code. 26,926 cases on draft evasion. 1,721 cassation rulings. Full-text search across 110M+ documents. Legislative texts in 2 seconds. Appeal chains. All on one platform.
60 million full texts. 283 GB across 4 shards. Custom RTF parser with depth-tracking for Windows-1251 Cyrillic. Two-phase ETL with idempotent upsert via temp tables. Application-level sharding by doc_id with independent backup domains. PostgreSQL shared memory exhaustion and three layers of defense. All on open government data.
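The two-phase load and the routing by doc_id described above can be sketched as follows. This is an illustration under stated assumptions, not the production pipeline: SQLite in-memory databases stand in for the PostgreSQL shards, modulo routing is assumed (the summary says only that sharding is by doc_id across 4 shards), and the `documents` and `staging` table names are hypothetical.

```python
import sqlite3

NUM_SHARDS = 4  # from the summary; the routing rule below is an assumption


def shard_for(doc_id: int) -> int:
    """Route a document to a shard (assumed: simple modulo on doc_id)."""
    return doc_id % NUM_SHARDS


def make_shards() -> list:
    """Create one in-memory database per shard (stand-in for PostgreSQL)."""
    shards = []
    for _ in range(NUM_SHARDS):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, full_text TEXT)"
        )
        shards.append(conn)
    return shards


def load_batch(conn: sqlite3.Connection, rows: list) -> None:
    """Phase 1: stage rows in a temp table. Phase 2: upsert into the target."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging "
        "(doc_id INTEGER PRIMARY KEY, full_text TEXT)"
    )
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT OR REPLACE INTO staging VALUES (?, ?)", rows)
    # Re-running the same batch leaves the target unchanged, so a failed
    # load can simply be retried. "WHERE true" avoids a parsing ambiguity
    # SQLite has when combining INSERT ... SELECT with ON CONFLICT.
    cur.execute(
        """
        INSERT INTO documents (doc_id, full_text)
        SELECT doc_id, full_text FROM staging WHERE true
        ON CONFLICT(doc_id) DO UPDATE SET full_text = excluded.full_text
        """
    )
    conn.commit()


def load(shards: list, rows: list) -> None:
    """Group rows by shard, then load each group into its own shard."""
    by_shard = {}
    for doc_id, text in rows:
        by_shard.setdefault(shard_for(doc_id), []).append((doc_id, text))
    for shard, batch in by_shard.items():
        load_batch(shards[shard], batch)
```

Keeping the routing in application code means each shard can be backed up and restored on its own, at the price of having no cross-shard transactions.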
One SDK instead of two libraries. IAM instead of API keys. Data in the EU instead of the US. A single bill instead of two invoices. Here is how we moved the entire fallback layer to AWS Bedrock — and why it changed more than we expected.
LEX AI now checks counterparties in the Unified Debtors Registry and verifies banks through the NBU registry — automatically, in a single request. 18 registries instead of 16.
The frontend parsed evidence from response text using regex — mobile Safari froze for a second. We moved evidence extraction to the backend, added an SSE evidence event, and now the client simply renders ready-made objects. Time to first evidence: from 2.1s to 0.8s.
Cloud Run with autoscaling to zero. Cloud SQL with automatic backups. Qdrant on a dedicated VM. All infrastructure at $280-430/mo with the ability to scale from 10 to 10,000 users without architecture changes.
Attorney verification via the Unified Attorney Registry (ERAU) in 2 seconds. 3-step onboarding. Consultation request with documents from the vault. Real-time chat between client and attorney. Escrow payment via Monobank. 10% platform commission. Full cycle — from "I need a lawyer" to a paid consultation.
One token. One command. 56 legal AI tools right in Claude Desktop. Court practice search, legislation analysis, counterparty verification — without opening a browser. Create a token in your profile, paste a command in the terminal, and LEX AI becomes an extension of your desktop.
We integrated OpenAI and Anthropic with round-robin routing. On the architecture diagram it looked perfect. In production it nearly killed our product. The same prompt produced different results depending on the provider. Debugging a 5-step agentic cycle? That is not engineering — it is archaeology. We ripped it all out. Hardcoded a single provider. Best line of code all year.
One endpoint. Three services. 58 MCP tools. Triple transport: stdio for Claude Desktop, HTTP REST for web apps, SSE for streaming. Every tool call goes through an 11-step pipeline with cost tracking at each stage. The number of tools will grow. The architecture does not care.
Keywords find what you already know. Semantic search finds what you need. We split 12 Ukrainian codes into 5,191 articles, vectorized each one using VoyageAI embeddings, and now the query "liability for poor-quality repairs" finds articles that contain none of those words.
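The retrieval step behind this kind of semantic search can be sketched as a cosine-similarity ranking over article vectors. The vectors and article identifiers below are hand-made toy values for illustration; in the setup described above they would come from an embedding model, and the search would typically run inside a vector database rather than in a Python loop.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list, index: dict, k: int = 3) -> list:
    """Return the ids of the k articles most similar to the query vector."""
    scored = [(cosine(query_vec, vec), article_id) for article_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [article_id for _, article_id in scored[:k]]


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_index = {
    "article_A": [0.9, 0.1, 0.0],
    "article_B": [0.0, 1.0, 0.0],
    "article_C": [0.1, 0.0, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_index, k=2))
```

The ranking depends only on the direction of the vectors, not on shared words, which is why a query can match an article that contains none of its terms.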
AI confidently cites non-existent articles and fabricates case numbers. In the legal domain, this is not just an error — it is malpractice. We built two layers of protection: HallucinationGuard verifies every claim, CitationValidator validates every citation. Zero tolerance for fabrication.
We started as a REST API with 10 endpoints. Now we have 70 MCP tools across 3 services with triple transport. MCP gave us what REST could not: a standard way for AI to discover and use tools on its own. AI becomes the client, not you.
A passport on your smartphone is now the key to legal AI. We integrated Diia.Signature for authentication: deep link on mobile, QR code on desktop, ECDSA for signatures with SHA256 for hashing, and lawyers verify their identity with the same app they use to show documents at checkpoints. No passwords. No registration. One tap and you are in.
A lawyer stores contracts in Nextcloud, correspondence in Google Drive, and searches court practice in EDRSR. Three different systems, three different windows, zero connection between them. MCP Connect unifies everything in one interface: AI analyzes your contract from Nextcloud, finds relevant practice from EDRSR, and verifies the counterparty in registries — in a single request.
AI will not replace lawyers. But the lawyer across the street who uses AI? That is your real competition. Their practice analysis covers 300 cases instead of 30. Their due diligence checks 16 registries in 2 seconds. They are not billing fewer hours — they are billing the same hours for a dramatically better outcome.
You search for "compensation for apartment flooding" and miss the case where the court writes about "tortious liability for property damage resulting from engineering infrastructure failure." Keywords find words. Semantic search finds meaning.
A human reviews 30-40 decisions per session. AI processes 200-300 per minute. But it is not about speed — it is about completeness. When you see the full picture rather than a fragment, strategic decisions become qualitatively different.
Counterparty verification: 4 registry websites, 30 minutes of manual work, and you can still miss enforcement proceedings. Or: one request, 2 seconds, 18 registries, full picture — EDRPOU, founders, beneficiaries, debtors, enforcement proceedings, bankruptcy, NBU banks.
Lawyers cannot use ChatGPT for client matters — data ends up on OpenAI servers. We built a platform where every matter is isolated, every action is in an audit trail, legal holds block deletion, and GDPR is not a checkbox — it is architecture.