Few-Shot Degradation in Morphologically Rich Languages: Cross-Domain and Cross-Lingual Evidence from Ukrainian
Follow-up to our tokenizer fertility study. Five experiments across SIB-200, EU Acts (24 languages), and ULP datasets. Tokenizer fertility is domain-invariant (1.63x on news vs 1.61x on legal). Few-shot degradation is task-dependent, not language-intrinsic. Ukrainian costs 20-40% more to tokenize than cognate Slavic languages.
Volodymyr Ovcharov -- LEX AI LLC, Kyiv, Ukraine
All five experiments are complete.
Abstract
Our previous study demonstrated that few-shot prompting degrades foundation model performance by up to 26 percentage points on Ukrainian legal text. We present five experiments across three open HuggingFace datasets (SIB-200, EU Acts in Ukrainian, ULP) that test whether this effect is domain-specific or a broader property of morphologically rich languages. Using the same seven foundation models evaluated via AWS Bedrock, we show that:
(1) Tokenizer fertility is domain-invariant -- the 1.6x spread between the most and least efficient tokenizers on legal text persists on news text (SIB-200, 1.63x). On parallel EU legislation in 24 languages, fertility for a single tokenizer varies 4.6x from English to Greek.
(2) Few-shot degradation does not uniformly reproduce on news topic classification -- the effect is model-specific and task-dependent, not a blanket property of Ukrainian.
(3) Ukrainian is 20-40% more expensive to tokenize than cognate Slavic languages (Polish, Czech) on identical EU legal text, combining morphological complexity with pre-training underrepresentation.
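(4) A parallel-corpus control on Polish, Russian, and Czech shows that the direction of the few-shot effect follows the model rather than the language, which argues against both the morphological and the undertrained-language hypotheses.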
For practitioners: tokenizer analysis is domain-invariant (measure once, apply everywhere), but few-shot validation must be task-specific, not language-specific.
Motivation
Our first paper found three surprising results on 273 Ukrainian court decisions:
- Tokenizer fertility varies 1.6x across models
- NVIDIA Nemotron Super 3 (120B) outperforms Mistral Large 3 (675B) at 1/3 the cost
- Few-shot prompting degrades performance by up to 26 percentage points
A natural question: is this specific to legal text, or a general property of Ukrainian? This follow-up answers that question with five experiments across completely different domains and languages.
Models
Same seven models as Paper 1, all via AWS Bedrock:
| Model | Provider | Architecture |
|---|---|---|
| Llama 4 Maverick | Meta | 400B, 17B active, MoE |
| Llama 3.3 70B | Meta | 70B dense |
| Mistral Large 3 | Mistral | 675B, 41B active, MoE |
| Nemotron Super 3 | NVIDIA | 120B, 12B active, hybrid |
| Nova Pro | Amazon | Undisclosed |
| Qwen3 235B | Qwen | 235B, 22B active, MoE |
| Qwen3 32B | Qwen | 32B dense |
Experiment 1: Cross-Domain Fertility (COMPLETE)
Dataset: SIB-200 Ukrainian subset -- 1,004 news topic sentences, concatenated into ~6,000-character blocks to match Paper 1's measurement protocol.
Question: Is the 1.6x fertility spread a property of legal text, or of the tokenizer itself?
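The block-based measurement can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes fertility means tokens per whitespace-separated word (the exact definition is in Paper 1, not in this post), and `toy_tokenize` is a hypothetical stand-in for a real model tokenizer.

```python
from typing import Callable, Iterable, List


def make_blocks(sentences: Iterable[str], target_chars: int = 6000) -> List[str]:
    """Concatenate sentences into blocks of roughly `target_chars` characters."""
    blocks, current, size = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        size += len(sentence) + 1  # +1 for the joining space
        if size >= target_chars:
            blocks.append(" ".join(current))
            current, size = [], 0
    if current:  # keep the final, shorter block
        blocks.append(" ".join(current))
    return blocks


def fertility(blocks: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Tokens per whitespace-separated word, pooled over all blocks (assumed definition)."""
    n_tokens = n_words = 0
    for block in blocks:
        n_tokens += len(tokenize(block))
        n_words += len(block.split())
    return n_tokens / n_words


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: cuts every word into chunks of at most 3 characters."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


sentences = ["Суд задовольнив позов.", "Команда виграла матч."]
blocks = make_blocks(sentences, target_chars=30)
print(round(fertility(blocks, toy_tokenize), 2))
```

To measure a real model, `toy_tokenize` would be replaced by that model's own tokenizer; the block size only affects how the text is batched, not the pooled ratio.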
Results
| Model | Legal (Paper 1) | News (SIB-200) | Delta |
|---|---|---|---|
| Llama 4 Maverick | 2.43 | 2.20 | -0.23 |
| Llama 3.3 70B | 2.65 | 2.33 | -0.32 |
| Mistral Large 3 | 3.06 | 2.50 | -0.56 |
| Nemotron Super 3 | 3.08 | 2.52 | -0.56 |
| Nova Pro | 2.85 | 3.27 | +0.42 |
| Qwen3 235B | 3.89 | 3.58 | -0.32 |
| Qwen3 32B | 3.90 | 3.58 | -0.32 |
| Max/Min ratio | 1.61x | 1.63x | |
Key finding: The fertility spread is preserved (1.63x on news vs 1.61x on legal). Rankings are stable at both ends: both Qwen models remain the least efficient and both Llama models the most efficient. The one notable shift is Nova Pro, which moves from third to fifth. Fertility is largely a tokenizer property, not a domain artifact: a single measurement on representative Ukrainian text predicts the approximate cost ranking in other domains.
Fertility is lower on news than on legal text for 6/7 models (Nova Pro is the exception, at +0.42). This likely reflects the higher proportion of domain-specific terminology in court decisions (article numbers, procedural formulas, institutional names).
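The spread and ranking comparison can be recomputed directly from the table. The sketch below uses only the rounded two-decimal values shown above, so the recomputed spreads may differ from the reported ones in the last digit; the helper names are illustrative.

```python
# Fertility per model as (legal, news), copied from the table above.
FERTILITY = {
    "Llama 4 Maverick": (2.43, 2.20),
    "Llama 3.3 70B": (2.65, 2.33),
    "Mistral Large 3": (3.06, 2.50),
    "Nemotron Super 3": (3.08, 2.52),
    "Nova Pro": (2.85, 3.27),
    "Qwen3 235B": (3.89, 3.58),
    "Qwen3 32B": (3.90, 3.58),
}
LEGAL, NEWS = 0, 1


def spread(domain: int) -> float:
    """Highest fertility divided by lowest fertility in one domain."""
    values = [pair[domain] for pair in FERTILITY.values()]
    return max(values) / min(values)


def ranking(domain: int) -> list:
    """Model names from most to least efficient (ties keep table order)."""
    return sorted(FERTILITY, key=lambda name: FERTILITY[name][domain])


legal_rank, news_rank = ranking(LEGAL), ranking(NEWS)

# Models whose 1-based rank differs between the two domains.
moved = {
    name: (legal_rank.index(name) + 1, news_rank.index(name) + 1)
    for name in FERTILITY
    if legal_rank.index(name) != news_rank.index(name)
}

print(f"legal spread {spread(LEGAL):.2f}x, news spread {spread(NEWS):.2f}x")
print("rank changes (legal -> news):", moved)
```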
Experiment 2: Cross-Domain Few-Shot Validation (COMPLETE)
Dataset: SIB-200 Ukrainian, test split (204 examples), 7-class topic classification (science/technology, travel, politics, sports, health, entertainment, geography).
Question: Does the few-shot degradation from Paper 1 reproduce on a completely different task and domain?
Results (all 7 models complete)
| Model | Zero-shot | Few-shot | Delta | Legal Delta (Paper 1) |
|---|---|---|---|---|
| Llama 3.3 70B | 86.8% | 90.2% | +3.4pp | +0.4pp |
| Llama 4 Maverick | 85.8% | 77.9% | -7.8pp | -6.2pp |
| Mistral Large 3 | 84.8% | 80.4% | -4.4pp | -3.3pp |
| Nova Pro | 84.3% | 88.7% | +4.4pp | -2.6pp |
| Qwen3 235B | 83.8% | 88.7% | +4.9pp | -26.0pp |
| Qwen3 32B | 80.9% | 84.8% | +3.9pp | -6.6pp |
| Nemotron Super 3 | 80.4% | 88.7% | +8.3pp | -12.8pp |
Few-shot helps 5/7 models and degrades 2/7. This is nearly the reverse of legal text, where few-shot degraded 6/7 models (all except Llama 3.3 70B).
Key finding: The effect is model-specific and task-dependent, not a blanket property of Ukrainian. The most striking reversals:
- Nemotron Super 3: -12.8pp on legal, +8.3pp on news
- Qwen3 235B: -26.0pp on legal, +4.9pp on news
- Llama 4 Maverick: degrades on both tasks (-6.2pp legal, -7.8pp news), as does Mistral Large 3 (-3.3pp legal, -4.4pp news)
The two models that degrade on SIB-200 (Maverick -7.8pp, Mistral -4.4pp) also degraded on legal text. Four of the five models that improved on SIB-200 (Nemotron, both Qwen3 models, Nova Pro) had degraded on legal text; Llama 3.3 70B improved on both. This suggests that the legal task's formulaic structure and severe class imbalance (84% "granted") interact with few-shot anchoring in a way that general topic classification does not.
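The cross-task comparison amounts to checking the sign of each model's few-shot delta on the two tasks. A minimal sketch, using the deltas from the table above (the grouping labels are illustrative, not terminology from Paper 1):

```python
# Few-shot deltas in percentage points as (news, legal), from the table above.
DELTAS_PP = {
    "Llama 3.3 70B": (3.4, 0.4),
    "Llama 4 Maverick": (-7.8, -6.2),
    "Mistral Large 3": (-4.4, -3.3),
    "Nova Pro": (4.4, -2.6),
    "Qwen3 235B": (4.9, -26.0),
    "Qwen3 32B": (3.9, -6.6),
    "Nemotron Super 3": (8.3, -12.8),
}


def classify(news: float, legal: float) -> str:
    """Group a model by the direction of its few-shot effect on each task."""
    if news > 0 and legal > 0:
        return "helps on both"
    if news < 0 and legal < 0:
        return "hurts on both"
    # Signs differ; a delta of exactly zero would also land here.
    return "direction differs"


groups = {}
for model, (news, legal) in DELTAS_PP.items():
    groups.setdefault(classify(news, legal), []).append(model)

for label, models in groups.items():
    print(f"{label}: {', '.join(models)}")
```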
Experiment 3: Cross-Lingual Fertility (COMPLETE)
Dataset: EU Acts in Ukrainian -- 3M translation units of EU legislation in 24 EU languages paired with Ukrainian. Same legal content, different languages -- a closely controlled fertility comparison.
Question: How does Ukrainian's tokenizer penalty compare to other European languages?
Results (Qwen3 32B -- least efficient tokenizer)
| Family | Language | Fertility |
|---|---|---|
| Baseline | English | 1.24 |
| Celtic | Irish | 1.26 |
| Romance | Spanish | 1.58 |
| | French | 1.62 |
| | Portuguese | 1.74 |
| | Italian | 1.86 |
| | Romanian | 2.35 |
| Germanic | Dutch | 2.04 |
| | German | 2.18 |
| | Danish | 2.36 |
| | Swedish | 2.36 |
| Slavic | Polish | 2.64 |
| | Slovenian | 2.65 |
| | Croatian | 2.68 |
| | Bulgarian | 2.77 |
| | Slovak | 3.07 |
| | Czech | 3.13 |
| Semitic | Maltese | 2.68 |
| Baltic | Lithuanian | 3.27 |
| | Latvian | 3.44 |
| Uralic | Estonian | 3.35 |
| | Hungarian | 3.48 |
| | Finnish | 3.57 |
| Target | Ukrainian | 3.75 |
| Hellenic | Greek | 5.66 |
Cross-model comparison (Ukrainian fertility)
| Model | Ukrainian | English | Ratio |
|---|---|---|---|
| Llama 4 Maverick | 2.07 | 1.24 | 1.67x |
| Llama 3.3 70B | 2.22 | 1.21 | 1.83x |
| Mistral Large 3 | 2.56 | 1.24 | 2.06x |
| Nemotron Super 3 | 2.59 | 1.26 | 2.06x |
| Nova Pro | 3.32 | 1.23 | 2.70x |
| Qwen3 235B | 3.75 | 1.24 | 3.02x |
| Qwen3 32B | 3.75 | 1.24 | 3.02x |
Key findings:
Clear language family hierarchy: English and most Romance languages are the most efficient (1.2-1.9; Romanian is the Romance exception at 2.35), followed by the other Germanic languages at 2.0-2.4, Slavic at 2.6-3.1, and Baltic/Uralic at 3.3-3.6.
Ukrainian is the most expensive Slavic language (3.75), despite similar morphological complexity to Polish (2.64) or Czech (3.13). The 20-40% gap suggests Ukrainian's penalty is not purely morphological but also reflects underrepresentation in pre-training data.
Greek is by far the least efficient (5.66), which is consistent with a script penalty for non-Latin characters that are rare in training data.
Processing the same EU regulation in Ukrainian costs up to 3x as much as in English (1.67x for Llama 4 Maverick, 3.02x for the Qwen3 models), purely as a function of the tokenizer.
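The cost gaps quoted above follow from simple ratios of the table values. The sketch below recomputes them for the Qwen3 32B tokenizer and adds a rough budgeting helper; it assumes fertility is tokens per word, and `usd_per_1k_tokens` is a hypothetical price parameter, not a figure from this study.

```python
# Qwen3 32B fertility on parallel EU legal text, from the table above.
FERTILITY = {"English": 1.24, "Polish": 2.64, "Czech": 3.13, "Ukrainian": 3.75}


def premium_pct(target: str, reference: str) -> float:
    """Extra tokens needed for `target` relative to `reference`, in percent."""
    return 100.0 * (FERTILITY[target] / FERTILITY[reference] - 1.0)


def estimated_cost(words: int, language: str, usd_per_1k_tokens: float) -> float:
    """Rough input cost of a document, assuming fertility = tokens per word."""
    tokens = words * FERTILITY[language]
    return tokens / 1000.0 * usd_per_1k_tokens


for reference in ("Polish", "Czech", "English"):
    print(f"Ukrainian vs {reference}: +{premium_pct('Ukrainian', reference):.0f}%")

# Hypothetical price, for illustration only.
print(estimated_cost(words=10_000, language="Ukrainian", usd_per_1k_tokens=0.001))
```

The same helper applies to any other tokenizer by swapping in that model's fertility values.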
Experiment 4: Slavic Control (COMPLETE)
Dataset: SIB-200 parallel corpus -- same examples in Polish, Russian, Czech, Ukrainian (same index_id across all 4 languages).
Question: Does few-shot degradation affect all Slavic languages, or only Ukrainian?
Results: Few-Shot Delta (pp) Across 4 Slavic Languages
| Model | UK | PL | RU | CZ | Avg |
|---|---|---|---|---|---|
| Nemotron Super 3 | +8.3 | +15.7 | +8.8 | +9.3 | +10.5 |
| Qwen3 32B | +3.9 | +5.9 | +7.4 | +7.8 | +6.2 |
| Nova Pro | +4.4 | +5.1 | +3.0 | +1.5 | +3.5 |
| Qwen3 235B | +4.9 | +3.4 | +1.5 | +2.9 | +3.2 |
| Llama 3.3 70B | +3.4 | +3.0 | -0.5 | +5.0 | +2.7 |
| Mistral Large 3 | -4.4 | -1.0 | +2.9 | +1.0 | -0.4 |
| Llama 4 Maverick | -7.8 | -6.3 | -15.7 | -3.8 | -8.4 |
The decisive finding
The direction of the effect is consistent across all 4 Slavic languages, although magnitudes vary:
- 4 models improve on all 4 languages (Nemotron, Qwen3 32B/235B, Nova Pro)
- 1 model degrades on all 4 languages (Maverick, avg -8.4pp)
- 2 models show mixed effects (Llama 3.3, Mistral)
This argues against both hypotheses from Paper 1:
- NOT "Ukrainian-specific" -- Maverick also degrades on Polish, Russian, and Czech
- NOT "morphological complexity" -- the pattern follows the model, not the language
On this task, the few-shot effect tracks the model rather than the input language. One possible explanation, which these experiments do not test directly, is that Maverick anchors on surface patterns in the demonstrations regardless of language, while Nemotron uses them mainly for task specification.
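The grouping of models by direction can be reproduced from the table with a sign check per language. A minimal sketch (the labels are illustrative):

```python
LANGS = ("UK", "PL", "RU", "CZ")

# Few-shot deltas in percentage points per language, from the table above.
DELTAS_PP = {
    "Nemotron Super 3": (8.3, 15.7, 8.8, 9.3),
    "Qwen3 32B": (3.9, 5.9, 7.4, 7.8),
    "Nova Pro": (4.4, 5.1, 3.0, 1.5),
    "Qwen3 235B": (4.9, 3.4, 1.5, 2.9),
    "Llama 3.3 70B": (3.4, 3.0, -0.5, 5.0),
    "Mistral Large 3": (-4.4, -1.0, 2.9, 1.0),
    "Llama 4 Maverick": (-7.8, -6.3, -15.7, -3.8),
}


def pattern(deltas) -> str:
    """Classify a model by whether its delta has the same sign in every language."""
    if all(d > 0 for d in deltas):
        return "improves on all"
    if all(d < 0 for d in deltas):
        return "degrades on all"
    return "mixed"


summary = {
    model: (pattern(deltas), sum(deltas) / len(deltas))
    for model, deltas in DELTAS_PP.items()
}

for model, (label, mean_delta) in summary.items():
    print(f"{model:18s} {label:16s} avg {mean_delta:+.1f}pp")
```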
Experiment 5: Linguistic Competence (COMPLETE)
Dataset: ULP -- 347 expert-curated multiple-choice questions on Ukrainian grammar and orthography.
Question: Does tokenizer fertility correlate with grammatical competence?
Results
| Model | Fertility | ZS | FS | Delta |
|---|---|---|---|---|
| Llama 4 Maverick | 2.43 | 57.3% | 53.2% | -4.2pp |
| Llama 3.3 70B | 2.65 | 33.4% | 36.9% | +3.5pp |
| Nova Pro | 2.85 | 42.9% | 41.6% | -1.4pp |
| Mistral Large 3 | 3.06 | 51.0% | 49.4% | -1.6pp |
| Nemotron Super 3 | 3.08 | 27.7% | 36.3% | +8.7pp |
| Qwen3 235B | 3.89 | 42.1% | 44.2% | +2.1pp |
| Qwen3 32B | 3.90 | 34.6% | 31.7% | -2.9pp |
Correlation: Spearman rho = -0.43 (p = 0.34) between fertility and zero-shot accuracy -- a negative trend (lower fertility goes with better grammar) that is not significant with only 7 data points.
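The rank correlation can be recomputed from the table without external libraries. The sketch below implements Spearman's rho as the Pearson correlation of average ranks, pairing fertility with zero-shot accuracy; on the table's values it gives rho of about -0.43. It does not compute the p-value.

```python
import math

# (fertility, zero-shot accuracy %) per model, in table order.
DATA = [
    (2.43, 57.3),  # Llama 4 Maverick
    (2.65, 33.4),  # Llama 3.3 70B
    (2.85, 42.9),  # Nova Pro
    (3.06, 51.0),  # Mistral Large 3
    (3.08, 27.7),  # Nemotron Super 3
    (3.89, 42.1),  # Qwen3 235B
    (3.90, 34.6),  # Qwen3 32B
]


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


fertility, accuracy = zip(*DATA)
rho = spearman_rho(fertility, accuracy)
print(f"Spearman rho = {rho:.2f}")
```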
Key findings:
- Maverick (lowest fertility) = best grammar (57.3%) -- 6pp above second-best
- Nemotron is the outlier: great at legal classification (96% in Paper 1) but worst at grammar (27.7%)
- Few-shot degrades 4/7 models on grammar -- MCQ format is vulnerable to anchoring
- Fertility is at best a weak predictor of grammatical competence -- architecture and training data appear to matter at least as much
Discussion
Fertility is a tokenizer property
Experiment 1 shows that tokenizer fertility rankings are largely stable across domains. The 1.63x spread on news text is nearly identical to the 1.61x on legal text, and the ranking is preserved except for Nova Pro, which drops from third to fifth. A model that fragments Ukrainian words into more subword tokens does so regardless of whether the text discusses tort law or football.
Experiment 3 extends this across 24 EU languages: on identical legal content, fertility varies by 4.6x between English (1.24) and Greek (5.66). Within the Slavic family, Ukrainian (3.75) is 20-40% more expensive than Polish (2.64) or Czech (3.13), suggesting that Ukrainian's penalty combines morphological complexity with underrepresentation in pre-training data.
Few-shot degradation is task-dependent
Contrary to our initial hypothesis, Experiment 2 shows that few-shot prompting improves performance for some models on SIB-200 topic classification -- the same models and language that showed severe degradation on legal text. The most striking reversal is Qwen3 235B: -26pp on legal case outcome classification, +4.9pp on news topic classification.
This suggests the degradation documented in Paper 1 is an interaction between task structure and in-context learning, not a blanket property of Ukrainian morphology. Tasks with formulaic, imbalanced output spaces (84% "granted" in legal outcomes) are vulnerable to few-shot anchoring; tasks with diverse, balanced categories appear less vulnerable, although Maverick and Mistral still degrade on SIB-200.
Practical recommendations
- Tokenizer analysis is domain-invariant. Measure fertility once on any representative text; the ranking held across both domains we tested, with one model shifting position.
- Few-shot validation is task-specific. Don't assume few-shot helps (English default) or hurts (Ukrainian legal finding). Validate per task.
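- Budget for the tokenizer tax. Ukrainian costs up to 3x English on the same content, depending on the tokenizer (1.67x to 3.02x across the seven models). With the Qwen3 tokenizer, the gap to cognate Slavic languages (Polish, Czech) is 20-40%.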
- Parameter counts remain poor proxies for non-English performance.
Cost
| Experiment | Est. API Calls | Est. Cost |
|---|---|---|
| Exp 1: SIB-200 fertility | ~1,200 | ~$2 |
| Exp 2: SIB-200 eval | ~2,856 | ~$8 |
| Exp 3: EU Acts fertility | ~16,800 | ~$17 |
| Exp 4: Slavic control | ~8,568 | ~$20 |
| Exp 5: ULP eval | ~4,858 | ~$12 |
| Total | ~34,282 | ~$59 |
All experiments conducted on AWS infrastructure funded by an AWS Activate grant.
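The call counts for the three evaluation experiments are consistent with a simple product of examples, models, and prompting conditions. The sketch below checks that arithmetic under two assumptions that the table does not state explicitly: one API call per example, model, and condition, and Experiment 4 adding three languages (Polish, Russian, Czech) on top of the Ukrainian runs from Experiment 2. The counts for Experiments 1 and 3 are taken from the table as given.

```python
MODELS = 7
CONDITIONS = 2  # zero-shot and few-shot


def eval_calls(examples: int, languages: int = 1) -> int:
    """Assumed call count: one call per example, model, condition, and language."""
    return examples * MODELS * CONDITIONS * languages


calls = {
    "Exp 1: SIB-200 fertility": 1_200,   # taken from the table
    "Exp 2: SIB-200 eval": eval_calls(204),
    "Exp 3: EU Acts fertility": 16_800,  # taken from the table
    "Exp 4: Slavic control": eval_calls(204, languages=3),
    "Exp 5: ULP eval": eval_calls(347),
}
total = sum(calls.values())

for name, count in calls.items():
    print(f"{name:26s} {count:>7,}")
print(f"{'Total':26s} {total:>7,}")
```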
Datasets
All datasets used in this study are publicly available on HuggingFace:
- SIB-200 -- topic classification in 205 languages
- EU Acts in Ukrainian -- EU legislation in 24 language pairs
- ULP -- Ukrainian grammar/orthography benchmark
- Ukrainian Court Decisions -- our Paper 1 dataset (14,452 decisions)
This is a follow-up to Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text. Experiment code available at github.com/overthelex/rlhf-signals.