ACADEMIC 2026-05-14

Few-Shot Degradation in Morphologically Rich Languages: Cross-Domain and Cross-Lingual Evidence from Ukrainian

Follow-up to our tokenizer fertility study. Five experiments across SIB-200, EU Acts (24 languages), and ULP datasets. Tokenizer fertility is domain-invariant (1.63x on news vs 1.61x on legal). Few-shot degradation is task-dependent, not language-intrinsic. Ukrainian costs 20-40% more to tokenize than cognate Slavic languages.

Volodymyr Ovcharov -- LEX AI LLC, Kyiv, Ukraine

All five experiments are complete.

Abstract

Our previous study demonstrated that few-shot prompting degrades foundation model performance by up to 26 percentage points on Ukrainian legal text. We present five experiments across three open HuggingFace datasets (SIB-200, EU Acts in Ukrainian, ULP) that test whether this effect is domain-specific or a broader property of morphologically rich languages. Using the same seven foundation models evaluated via AWS Bedrock, we show that:

(1) Tokenizer fertility is domain-invariant -- the 1.6x spread between the most and least efficient tokenizers on legal text persists on news text (SIB-200, 1.63x). On EU legislation in 24 languages, fertility for a single tokenizer (Qwen3 32B) ranges 4.6x from English to Greek.

(2) Few-shot degradation does not uniformly reproduce on news topic classification -- the effect is model-specific and task-dependent, not a blanket property of Ukrainian.

(3) Ukrainian is 20-40% more expensive to tokenize than cognate Slavic languages (Polish, Czech) on identical EU legal text, combining morphological complexity with pre-training underrepresentation.

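(4) A parallel-corpus control on Polish, Russian, and Czech shows the same per-model few-shot pattern as on Ukrainian, which argues against both the morphological-complexity and the undertrained-language explanations for the degradation.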

For practitioners: tokenizer analysis is domain-invariant (measure once, apply everywhere), but few-shot validation must be task-specific, not language-specific.

Motivation

Our first paper found three surprising results on 273 Ukrainian court decisions:

Tokenizer fertility varies 1.6x across models
NVIDIA Nemotron Super 3 (120B) outperforms Mistral Large 3 (675B) at 1/3 the cost
Few-shot prompting degrades performance by up to 26 percentage points

A natural question: is this specific to legal text, or a general property of Ukrainian? This follow-up answers that question with five experiments across completely different domains and languages.

Models

Same seven models as Paper 1, all via AWS Bedrock:

Model	Provider	Architecture
Llama 4 Maverick	Meta	400B, 17B active, MoE
Llama 3.3 70B	Meta	70B dense
Mistral Large 3	Mistral	675B, 41B active, MoE
Nemotron Super 3	NVIDIA	120B, 12B active, hybrid
Nova Pro	Amazon	Undisclosed
Qwen3 235B	Qwen	235B, 22B active, MoE
Qwen3 32B	Qwen	32B dense

Experiment 1: Cross-Domain Fertility (COMPLETE)

Dataset: SIB-200 Ukrainian subset -- 1,004 news topic sentences, concatenated into ~6,000-character blocks to match Paper 1's measurement protocol.

Question: Is the 1.6x fertility spread a property of legal text, or of the tokenizer itself?
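The measurement can be sketched as follows, assuming fertility means tokens per whitespace-delimited word (the usual definition; this section does not restate Paper 1's exact formula) and that per-block values are averaged without weighting. `toy_tokenize` is a hypothetical stand-in that cuts each word into 3-character pieces; in a real run it would be replaced by each model's own tokenizer.

```python
from typing import Callable, Iterable, List


def make_blocks(sentences: Iterable[str], target_chars: int = 6000) -> List[str]:
    """Concatenate sentences into blocks of roughly target_chars characters."""
    blocks, current, size = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        size += len(sentence) + 1  # +1 for the joining space
        if size >= target_chars:
            blocks.append(" ".join(current))
            current, size = [], 0
    if current:  # keep the final, shorter block
        blocks.append(" ".join(current))
    return blocks


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: split each word into 3-character pieces."""
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


def fertility(text: str, tokenize: Callable[[str], List[str]]) -> float:
    """Tokens per whitespace-delimited word (assumed definition)."""
    words = text.split()
    return len(tokenize(text)) / len(words) if words else 0.0


def mean_fertility(sentences: Iterable[str],
                   tokenize: Callable[[str], List[str]],
                   target_chars: int = 6000) -> float:
    """Unweighted mean of per-block fertility (an assumption about the protocol)."""
    blocks = make_blocks(sentences, target_chars)
    if not blocks:
        return 0.0
    return sum(fertility(block, tokenize) for block in blocks) / len(blocks)


# Illustrative input only: one short sentence repeated, not SIB-200 data.
sample = ["Суд задовольнив позов повністю."] * 400
print(len(make_blocks(sample)), "blocks")
print("mean fertility:", mean_fertility(sample, toy_tokenize))
```

Swapping `toy_tokenize` for a real tokenizer's encode function is the only change needed to compare models on the same blocks.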

Results

Model	Legal (Paper 1)	News (SIB-200)	Delta
Llama 4 Maverick	2.43	2.20	-0.23
Llama 3.3 70B	2.65	2.33	-0.32
Mistral Large 3	3.06	2.50	-0.56
Nemotron Super 3	3.08	2.52	-0.56
Nova Pro	2.85	3.27	+0.42
Qwen3 235B	3.89	3.58	-0.32
Qwen3 32B	3.90	3.58	-0.32
Max/Min ratio	1.61x	1.63x

Key finding: The fertility spread is preserved (1.63x on news vs 1.61x on legal). Model rankings are stable at both ends: both Qwen models remain the least efficient, both Llama models remain the most efficient. The one notable shift is Nova Pro, which moves from third to fifth most efficient. Fertility is primarily a tokenizer property, not a domain artifact. A single measurement on representative Ukrainian text predicted the cost ranking well across the two domains tested.

Fertility is lower on news than on legal text for 6/7 models (Nova Pro is the exception), likely reflecting the higher proportion of domain-specific terminology in court decisions (article numbers, procedural formulas, institutional names).
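The spread and ranking comparison can be recomputed from the table with a few lines of Python. This is a sketch on the published, rounded values, so recomputed ratios can differ from the reported ones in the last digit; the `rank` helper and its tie handling are our own illustration.

```python
# Fertility per model, copied from the table above: (legal, news).
FERTILITY = {
    "Llama 4 Maverick": (2.43, 2.20),
    "Llama 3.3 70B": (2.65, 2.33),
    "Mistral Large 3": (3.06, 2.50),
    "Nemotron Super 3": (3.08, 2.52),
    "Nova Pro": (2.85, 3.27),
    "Qwen3 235B": (3.89, 3.58),
    "Qwen3 32B": (3.90, 3.58),
}
LEGAL, NEWS = 0, 1


def spread(domain: int) -> float:
    """Ratio of the least to the most efficient tokenizer in one domain."""
    values = [pair[domain] for pair in FERTILITY.values()]
    return max(values) / min(values)


def rank(model: str, domain: int) -> int:
    """1 = most efficient; tied models share a rank."""
    value = FERTILITY[model][domain]
    return 1 + sum(1 for pair in FERTILITY.values() if pair[domain] < value)


lower_on_news = [m for m, (legal, news) in FERTILITY.items() if news < legal]

print(f"legal spread: {spread(LEGAL):.2f}x, news spread: {spread(NEWS):.2f}x")
print(f"lower on news: {len(lower_on_news)}/{len(FERTILITY)}")
for model in FERTILITY:
    print(f"{model}: rank {rank(model, LEGAL)} -> {rank(model, NEWS)}")
```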

Experiment 2: Cross-Domain Few-Shot Validation (COMPLETE)

Dataset: SIB-200 Ukrainian, test split (204 examples), 7-class topic classification (science/technology, travel, politics, sports, health, entertainment, geography).

Question: Does the few-shot degradation from Paper 1 reproduce on a completely different task and domain?

Results (all 7 models complete)

Model	Zero-shot	Few-shot	Delta	Legal Delta (Paper 1)
Llama 3.3 70B	86.8%	90.2%	+3.4pp	+0.4pp
Llama 4 Maverick	85.8%	77.9%	-7.8pp	-6.2pp
Mistral Large 3	84.8%	80.4%	-4.4pp	-3.3pp
Nova Pro	84.3%	88.7%	+4.4pp	-2.6pp
Qwen3 235B	83.8%	88.7%	+4.9pp	-26.0pp
Qwen3 32B	80.9%	84.8%	+3.9pp	-6.6pp
Nemotron Super 3	80.4%	88.7%	+8.3pp	-12.8pp

Few-shot helps 5/7 models, degrades 2/7. This is the opposite of legal text, where few-shot degraded 6/7 models (every model except Llama 3.3 70B, per the Paper 1 deltas in the table).

Key finding: The effect is model-specific and task-dependent, not a blanket property of Ukrainian. The most striking reversals:

Nemotron Super 3: -12.8pp on legal, +8.3pp on news
Qwen3 235B: -26.0pp on legal, +4.9pp on news
Llama 4 Maverick: consistently degrades (-6.2pp legal, -7.8pp news) -- the largest drop on news and, with Mistral Large 3, one of only two models that degrade on both tasks

The two models that degrade on SIB-200 (Maverick -7.8pp, Mistral -4.4pp) also degraded on legal text. Four of the five models that improved on SIB-200 (Nemotron, both Qwen3 models, Nova Pro) had degraded on legal; Llama 3.3 70B improved on both. This suggests the legal task's formulaic structure and severe class imbalance (84% "granted") interact with few-shot anchoring in a way that general topic classification does not.
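The sign comparison between the two tasks can be tabulated directly from the results. The sketch below recomputes the news delta from the rounded accuracies, so it can differ from the reported delta by 0.1pp (for example, Maverick); the grouping names are ours.

```python
# (zero-shot %, few-shot %, legal delta in pp from Paper 1), from the table above.
RESULTS = {
    "Llama 3.3 70B": (86.8, 90.2, 0.4),
    "Llama 4 Maverick": (85.8, 77.9, -6.2),
    "Mistral Large 3": (84.8, 80.4, -3.3),
    "Nova Pro": (84.3, 88.7, -2.6),
    "Qwen3 235B": (83.8, 88.7, -26.0),
    "Qwen3 32B": (80.9, 84.8, -6.6),
    "Nemotron Super 3": (80.4, 88.7, -12.8),
}

news_delta = {m: few - zero for m, (zero, few, _) in RESULTS.items()}
legal_delta = {m: legal for m, (_, _, legal) in RESULTS.items()}

improved_on_news = [m for m, d in news_delta.items() if d > 0]
degraded_on_news = [m for m, d in news_delta.items() if d < 0]
degraded_on_legal = [m for m, d in legal_delta.items() if d < 0]

# Models whose few-shot effect has the same or the opposite sign on both tasks.
degraded_on_both = [m for m in degraded_on_news if legal_delta[m] < 0]
reversed_sign = [m for m in improved_on_news if legal_delta[m] < 0]

print("improved on news:", len(improved_on_news))
print("degraded on news:", degraded_on_news)
print("degraded on legal:", len(degraded_on_legal))
print("degraded on both:", degraded_on_both)
print("legal loss, news gain:", reversed_sign)
```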

Experiment 3: Cross-Lingual Fertility (COMPLETE)

Dataset: EU Acts in Ukrainian -- 3M translation units of EU legislation in 24 EU languages paired with Ukrainian. Same legal content, different languages -- a tightly controlled fertility comparison.

Question: How does Ukrainian's tokenizer penalty compare to other European languages?

Results (Qwen3 32B -- least efficient tokenizer)

Family	Language	Fertility
Baseline	English	1.24
Celtic	Irish	1.26
Romance	Spanish	1.58
	French	1.62
	Portuguese	1.74
	Italian	1.86
	Romanian	2.35
Germanic	Dutch	2.04
	German	2.18
	Danish	2.36
	Swedish	2.36
Slavic	Polish	2.64
	Slovenian	2.65
	Croatian	2.68
	Bulgarian	2.77
	Slovak	3.07
	Czech	3.13
Semitic	Maltese	2.68
Baltic	Lithuanian	3.27
	Latvian	3.44
Uralic	Estonian	3.35
	Hungarian	3.48
	Finnish	3.57
Target	Ukrainian	3.75
Hellenic	Greek	5.66

Cross-model comparison (Ukrainian fertility)

Model	Ukrainian	English	Ratio
Llama 4 Maverick	2.07	1.24	1.67x
Llama 3.3 70B	2.22	1.21	1.83x
Mistral Large 3	2.56	1.24	2.06x
Nemotron Super 3	2.59	1.26	2.06x
Nova Pro	3.32	1.23	2.70x
Qwen3 235B	3.75	1.24	3.02x
Qwen3 32B	3.75	1.24	3.02x

Key findings:

Clear language family hierarchy: Latin-script analytic languages (English, Romance) are most efficient (1.2-1.9, except Romanian at 2.35). Germanic at 2.0-2.4. Slavic EU languages at 2.6-3.1. Uralic/Baltic at 3.3-3.6.
Ukrainian is the most expensive Slavic language (3.75), despite similar morphological complexity to Polish (2.64) or Czech (3.13). The 20-40% gap suggests Ukrainian's penalty is not purely morphological but also reflects underrepresentation in pre-training data.
Greek is by far the least efficient (5.66) -- most likely a script penalty, since Greek characters appear infrequently in training data.
Processing the same EU regulation in Ukrainian costs about 3x as much as in English with the Qwen3 tokenizers (1.7x with Llama 4 Maverick), a gap that comes from the tokenizer rather than the content.
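The percentages in these findings follow from simple ratios of the Qwen3 32B fertility values. The sketch below assumes per-token pricing and roughly comparable word counts across translations of the same text, so that cost scales with fertility; neither assumption is tested here.

```python
# Qwen3 32B fertility on EU Acts, copied from the table above.
FERTILITY = {
    "English": 1.24,
    "Polish": 2.64,
    "Czech": 3.13,
    "Ukrainian": 3.75,
    "Greek": 5.66,
}


def cost_ratio(target: str, reference: str) -> float:
    """How many times more tokens the target language needs than the reference."""
    return FERTILITY[target] / FERTILITY[reference]


def extra_cost_pct(target: str, reference: str) -> float:
    """Extra token cost of the target language, as a percentage of the reference."""
    return (cost_ratio(target, reference) - 1.0) * 100.0


print(f"Ukrainian vs Polish:  +{extra_cost_pct('Ukrainian', 'Polish'):.0f}%")
print(f"Ukrainian vs Czech:   +{extra_cost_pct('Ukrainian', 'Czech'):.0f}%")
print(f"Ukrainian vs English: {cost_ratio('Ukrainian', 'English'):.2f}x")
print(f"Greek vs English:     {cost_ratio('Greek', 'English'):.2f}x")
```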

Experiment 4: Slavic Control (COMPLETE)

Dataset: SIB-200 parallel corpus -- same examples in Polish, Russian, Czech, Ukrainian (same index_id across all 4 languages).

Question: Does few-shot degradation affect all Slavic languages, or only Ukrainian?

Results: Few-Shot Delta (pp) Across 4 Slavic Languages

Model	Ukrainian	Polish	Russian	Czech	Avg
Nemotron Super 3	+8.3	+15.7	+8.8	+9.3	+10.5
Qwen3 32B	+3.9	+5.9	+7.4	+7.8	+6.2
Nova Pro	+4.4	+5.1	+3.0	+1.5	+3.5
Qwen3 235B	+4.9	+3.4	+1.5	+2.9	+3.2
Llama 3.3 70B	+3.4	+3.0	-0.5	+5.0	+2.7
Mistral Large 3	-4.4	-1.0	+2.9	+1.0	-0.4
Llama 4 Maverick	-7.8	-6.3	-15.7	-3.8	-8.4

The decisive finding

The pattern is consistent across all 4 Slavic languages:

4 models improve on all 4 languages (Nemotron, Qwen3 32B/235B, Nova Pro)
1 model degrades on all 4 languages (Maverick, avg -8.4pp)
2 models show mixed effects (Llama 3.3, Mistral)

This argues against both hypotheses from Paper 1:

NOT "Ukrainian-specific" -- Maverick also degrades on Polish, Russian, and Czech (-3.8pp to -15.7pp)
NOT "morphological complexity" -- the sign of the effect follows the model, not the language

On this task, the few-shot effect is a property of the model, not of the input language. One possible explanation, which these experiments do not test directly, is that Maverick anchors on surface patterns in the demonstrations regardless of language, while Nemotron uses them for task specification.
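The grouping of models by sign consistency can be reproduced from the delta table. A minimal sketch, with the three group labels chosen here for illustration:

```python
from statistics import mean

# Few-shot delta in pp per language (Ukrainian, Polish, Russian, Czech), from the table above.
DELTAS = {
    "Nemotron Super 3": [8.3, 15.7, 8.8, 9.3],
    "Qwen3 32B": [3.9, 5.9, 7.4, 7.8],
    "Nova Pro": [4.4, 5.1, 3.0, 1.5],
    "Qwen3 235B": [4.9, 3.4, 1.5, 2.9],
    "Llama 3.3 70B": [3.4, 3.0, -0.5, 5.0],
    "Mistral Large 3": [-4.4, -1.0, 2.9, 1.0],
    "Llama 4 Maverick": [-7.8, -6.3, -15.7, -3.8],
}


def classify(deltas: list) -> str:
    """Label a model by whether few-shot helps or hurts in every language."""
    if all(d > 0 for d in deltas):
        return "improves"
    if all(d < 0 for d in deltas):
        return "degrades"
    return "mixed"


groups: dict = {}
for model, deltas in DELTAS.items():
    groups.setdefault(classify(deltas), []).append(model)

for label, models in groups.items():
    print(f"{label}: {models}")
for model, deltas in DELTAS.items():
    print(f"{model}: avg {mean(deltas):+.1f}pp")
```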

Experiment 5: Linguistic Competence (COMPLETE)

Dataset: ULP -- 347 expert-curated multiple-choice questions on Ukrainian grammar and orthography.

Question: Does tokenizer fertility correlate with grammatical competence?

Results

Model	Fertility (legal, Paper 1)	Zero-shot	Few-shot	Delta
Llama 4 Maverick	2.43	57.3%	53.2%	-4.2pp
Llama 3.3 70B	2.65	33.4%	36.9%	+3.5pp
Nova Pro	2.85	42.9%	41.6%	-1.4pp
Mistral Large 3	3.06	51.0%	49.4%	-1.6pp
Nemotron Super 3	3.08	27.7%	36.3%	+8.7pp
Qwen3 235B	3.89	42.1%	44.2%	+2.1pp
Qwen3 32B	3.90	34.6%	31.7%	-2.9pp

Correlation: Spearman rho = -0.43 (p = 0.34) between fertility and zero-shot accuracy -- a negative trend (lower fertility, better grammar) that is not significant with only 7 data points.

Key findings:

Maverick (lowest fertility) = best grammar (57.3%) -- 6pp above second-best
Nemotron is the outlier: great at legal classification (96% in Paper 1) but worst at grammar (27.7%)
Few-shot degrades 4/7 models on grammar -- the MCQ format may be vulnerable to anchoring
Fertility is at best a weak predictor of grammatical competence -- architecture and training data appear to matter at least as much
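The rank correlation can be checked by hand from the table. The sketch below uses the closed-form Spearman formula (valid because there are no ties) on fertility versus zero-shot accuracy, and adds an exact permutation p-value over all 5,040 orderings. The permutation test is our own addition and may differ slightly from the reported p, which we assume comes from the usual t-approximation.

```python
from itertools import permutations

# Fertility (legal, Paper 1) and zero-shot grammar accuracy, in table order.
FERTILITY = [2.43, 2.65, 2.85, 3.06, 3.08, 3.89, 3.90]
ZERO_SHOT = [57.3, 33.4, 42.9, 51.0, 27.7, 42.1, 34.6]


def ranks(values: list) -> list:
    """Rank values from 1 (smallest). Assumes no ties, which holds here."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_from_ranks(rx: list, ry: list) -> float:
    """Spearman rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)) for untied ranks."""
    n = len(rx)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


rx, ry = ranks(FERTILITY), ranks(ZERO_SHOT)
rho = spearman_from_ranks(rx, ry)

# Exact two-sided p-value: share of all rank orderings at least as extreme.
all_perms = list(permutations(ry))
extreme = sum(
    1 for perm in all_perms
    if abs(spearman_from_ranks(rx, list(perm))) >= abs(rho) - 1e-12
)
p_exact = extreme / len(all_perms)

print(f"rho = {rho:.2f}")
print(f"exact permutation p = {p_exact:.2f}")
```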

Discussion

Fertility is a tokenizer property

Experiment 1 demonstrates that the tokenizer fertility spread is stable across domains. The 1.63x spread on news text is virtually identical to the 1.61x on legal text, and model rankings are preserved at both ends (only Nova Pro changes position substantially). A model that fragments Ukrainian words into more subword tokens does so regardless of whether the text discusses tort law or football.

Experiment 3 extends this across 24 EU languages: on identical legal content, fertility varies by 4.6x between English (1.24) and Greek (5.66). Within the Slavic family, Ukrainian (3.75) is 20-40% more expensive than Polish (2.64) or Czech (3.13), suggesting that Ukrainian's penalty combines morphological complexity with underrepresentation in pre-training data.

Few-shot degradation is task-dependent

Contrary to our initial hypothesis, Experiment 2 shows that few-shot prompting improves performance for five of seven models on SIB-200 topic classification -- including models that showed severe degradation on legal text in the same language. The most striking reversal is Qwen3 235B: -26pp on legal case outcome classification, +4.9pp on news topic classification.

This suggests the degradation documented in Paper 1 is an interaction between task structure and in-context learning, not a blanket property of Ukrainian morphology. Tasks with formulaic, imbalanced output spaces (84% "granted" in legal outcomes) appear vulnerable to few-shot anchoring; tasks with diverse, balanced categories appear much less so, although Maverick and Mistral still degrade on SIB-200.
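A small calculation shows why an 84% majority class leaves room for anchoring. The sketch assumes demonstrations are drawn independently at random from the label distribution and uses hypothetical shot counts; the source does not state how Paper 1 selected its demonstrations or how many it used.

```python
N_DECISIONS = 273       # evaluation set size reported for Paper 1
MAJORITY_SHARE = 0.84   # share of "granted" outcomes reported above


def majority_baseline_accuracy(share: float) -> float:
    """Accuracy of a classifier that always predicts the majority label."""
    return share


def p_all_demos_majority(share: float, k: int) -> float:
    """Chance that k independently sampled demonstrations all carry the majority label."""
    return share ** k


approx_granted = round(N_DECISIONS * MAJORITY_SHARE)
print(f"approx. 'granted' decisions: {approx_granted} of {N_DECISIONS}")
print(f"always-'granted' accuracy: {majority_baseline_accuracy(MAJORITY_SHARE):.0%}")
for k in (3, 5, 8):  # hypothetical shot counts, not taken from the paper
    print(f"k={k}: P(all demos 'granted') = {p_all_demos_majority(MAJORITY_SHARE, k):.2f}")
```

Under these assumptions, a prompt with a handful of random demonstrations shows only the majority outcome a large fraction of the time, and a model that simply copies it still scores 84%.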

Practical recommendations

Tokenizer analysis is domain-invariant. Measure fertility once on representative text; the spread and the extremes of the ranking held across both domains tested.
Few-shot validation is task-specific. Don't assume few-shot helps (English default) or hurts (Ukrainian legal finding). Validate per task and per model.
Budget for the tokenizer tax. Ukrainian costs 1.7-3.0x English on the same content, depending on the tokenizer. The gap is 20-40% vs. cognate Slavic languages.
Parameter counts remain poor proxies for non-English performance.
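For budgeting, the tokenizer tax can be turned into a token estimate per model. In the sketch below the corpus size and the flat price are hypothetical placeholders (real per-token prices differ by model and are not given in this article); only the fertility values come from the Experiment 3 cross-model table.

```python
# Ukrainian and English fertility per model, from the cross-model table (EU Acts).
FERTILITY = {
    "Llama 4 Maverick": {"uk": 2.07, "en": 1.24},
    "Llama 3.3 70B": {"uk": 2.22, "en": 1.21},
    "Mistral Large 3": {"uk": 2.56, "en": 1.24},
    "Nemotron Super 3": {"uk": 2.59, "en": 1.26},
    "Nova Pro": {"uk": 3.32, "en": 1.23},
    "Qwen3 235B": {"uk": 3.75, "en": 1.24},
    "Qwen3 32B": {"uk": 3.75, "en": 1.24},
}

WORDS = 1_000_000          # hypothetical corpus size, in words
USD_PER_1K_TOKENS = 0.001  # hypothetical flat price; real prices differ by model


def estimate_tokens(words: int, fertility: float) -> int:
    """Expected token count for a corpus of the given size."""
    return round(words * fertility)


def estimate_cost(words: int, fertility: float, usd_per_1k: float) -> float:
    """Input-token cost in USD under a flat per-1K-token price."""
    return estimate_tokens(words, fertility) / 1000 * usd_per_1k


for model, f in FERTILITY.items():
    uk_tokens = estimate_tokens(WORDS, f["uk"])
    en_tokens = estimate_tokens(WORDS, f["en"])
    uk_cost = estimate_cost(WORDS, f["uk"], USD_PER_1K_TOKENS)
    print(f"{model}: {uk_tokens:,} UK tokens (${uk_cost:.2f}), "
          f"{uk_tokens / en_tokens:.2f}x English")
```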

Cost

Experiment	Est. API Calls	Est. Cost
Exp 1: SIB-200 fertility	~1,200	~$2
Exp 2: SIB-200 eval	~2,856	~$8
Exp 3: EU Acts fertility	~16,800	~$17
Exp 4: Slavic control	~8,568	~$20
Exp 5: ULP eval	~4,858	~$12
Total	~34,282	~$59

All experiments conducted on AWS infrastructure funded by an AWS Activate grant.
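The evaluation call counts in the table are consistent with one call per example, per model, per prompting mode. This decomposition is inferred from the numbers rather than stated in the text, and the factor of three for Experiment 4 assumes the Ukrainian runs were reused from Experiment 2.

```python
N_MODELS = 7
N_MODES = 2  # zero-shot and few-shot

SIB200_TEST = 204   # SIB-200 test examples per language
ULP_QUESTIONS = 347

# Reported estimates from the cost table: (API calls, USD).
REPORTED = {
    "Exp 1": (1_200, 2),
    "Exp 2": (2_856, 8),
    "Exp 3": (16_800, 17),
    "Exp 4": (8_568, 20),
    "Exp 5": (4_858, 12),
}

# Inferred decomposition for the three evaluation experiments.
exp2_calls = SIB200_TEST * N_MODELS * N_MODES
exp4_calls = SIB200_TEST * N_MODELS * N_MODES * 3  # assumed: Polish, Russian, Czech only
exp5_calls = ULP_QUESTIONS * N_MODELS * N_MODES

total_calls = sum(calls for calls, _ in REPORTED.values())
total_cost = sum(cost for _, cost in REPORTED.values())

print("Exp 2:", exp2_calls, "Exp 4:", exp4_calls, "Exp 5:", exp5_calls)
print(f"total: {total_calls:,} calls, ${total_cost}")
print(f"average cost per call: ${total_cost / total_calls:.4f}")
```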

Datasets

All datasets used in this study are publicly available on HuggingFace:

SIB-200 -- topic classification in 205 languages
EU Acts in Ukrainian -- EU legislation in 24 language pairs
ULP -- Ukrainian grammar/orthography benchmark
Ukrainian Court Decisions -- our Paper 1 dataset (14,452 decisions)

This is a follow-up to Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text. Experiment code available at github.com/overthelex/rlhf-signals.