Research-Grade MT Evaluation

The MT Eval Harness is a language-agnostic evaluation framework for machine translation, purpose-built for low-resource languages where commercial metrics fall short. It produces standardized JSON reports with chrF++, BLEU, exact match, and semantic validation scores.

Designed as a companion to i18n-rosetta: translation methods developed and validated inside the harness can be exported as rosetta-compatible plugins, creating a direct pipeline from research to production i18n.

RESEARCH TOOLING — LOW-RESOURCE MT

You can't improve what you can't measure.

MT Eval Harness

REF: OVERVIEW

PROJECT DESCRIPTION

Zero language-specific dependencies. Bring your own corpus, bring your own translation provider — the harness evaluates anything that produces text.

REF: FEATURES

TECHNICAL IMPLEMENTATION

How It Works

From Corpus to Confidence Score

Feed the harness a parallel corpus (source sentences paired with reference translations) and a translation provider. It runs each entry through the provider, then scores the output against the reference using several complementary metrics: chrF++ for character n-gram overlap (augmented with word n-grams), BLEU for word n-gram precision, and an optional finite-state transducer (FST) validity check for morphologically complex languages.
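The loop described above can be sketched in a few lines of Python. This is an illustration only, not the harness's actual API: the names `evaluate`, `char_fscore`, and the result keys are assumptions, and the scoring function is a simplified character n-gram F-score in the spirit of chrF, not the reference chrF++ implementation.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_fscore(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score (hypothetical helper).

    Computes an F-beta score per n-gram order and averages the orders
    for which both strings have at least one n-gram.
    """
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precision = overlap / sum(hyp.values())
        recall = overlap / sum(ref.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        b2 = beta ** 2
        scores.append((1 + b2) * precision * recall / (b2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def evaluate(corpus: Iterable[Tuple[str, str]],
             provider: Callable[[str], str]) -> dict:
    """Run every (source, reference) pair through the provider and score it."""
    entries = []
    for source, reference in corpus:
        hypothesis = provider(source)
        entries.append({
            "source": source,
            "reference": reference,
            "hypothesis": hypothesis,
            "exact_match": hypothesis.strip() == reference.strip(),
            "char_f": char_fscore(hypothesis, reference),
        })
    n = len(entries)
    return {
        "entries": entries,
        "exact_match_rate": sum(e["exact_match"] for e in entries) / n if n else 0.0,
        "mean_char_f": sum(e["char_f"] for e in entries) / n if n else 0.0,
    }
```

Because the provider is just a callable from source text to translated text, anything that produces text can be plugged in: a lookup table, a local model, or a wrapper around a remote API. For published numbers, an established implementation of chrF++ and BLEU would be the safer choice than a hand-rolled scorer like this one.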

Results are written to timestamped JSON reports and can be visualized via the built-in interactive dashboard, which supports multi-run comparison, per-entry linguistic analysis, and automated quality tracking.
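As a rough sketch of what a timestamped report and a two-run comparison might look like, assuming a results dictionary with a numeric summary metric: the function names, file-name pattern, and the `mean_char_f` key below are hypothetical and do not describe the harness's real report schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_report(results: dict, out_dir: str = "reports",
                 run_name: str = "eval") -> Path:
    """Write results to <out_dir>/<run_name>-<UTC timestamp>.json."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(out_dir) / f"{run_name}-{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"run_name": run_name, "generated_at": stamp, **results}
    # ensure_ascii=False keeps non-Latin scripts readable in the file.
    path.write_text(json.dumps(payload, ensure_ascii=False, indent=2),
                    encoding="utf-8")
    return path


def compare_runs(baseline: Path, candidate: Path,
                 metric: str = "mean_char_f") -> float:
    """Return candidate minus baseline for one summary metric."""
    a = json.loads(Path(baseline).read_text(encoding="utf-8"))
    b = json.loads(Path(candidate).read_text(encoding="utf-8"))
    return b[metric] - a[metric]
```

A positive value from `compare_runs` means the candidate run scored higher on that metric. Distinct `run_name` values keep two runs written in the same second from overwriting each other.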

The plugin export pipeline packages validated translation methods — including prompts, preprocessing gates, and coaching data — as i18n-rosetta compatible bundles with full provenance metadata.
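The i18n-rosetta bundle format is not specified here, so the following is only a guess at what provenance metadata could contain: a manifest that records a content hash for each packaged file and points back to the evaluation report that validated the method. Every name in it (`build_manifest`, the manifest keys, the example file names) is hypothetical.

```python
import hashlib
import json
from typing import Dict


def build_manifest(method_name: str, files: Dict[str, bytes],
                   report_path: str) -> dict:
    """Build a provenance manifest for a bundle (hypothetical layout)."""
    return {
        "method": method_name,
        # Which evaluation report the method was validated against.
        "validated_by": report_path,
        # SHA-256 per file, so a consumer can detect modified contents.
        "files": {
            name: hashlib.sha256(data).hexdigest()
            for name, data in sorted(files.items())
        },
    }


manifest = build_manifest(
    "example-method",
    {"prompt.txt": b"Translate the following text.",
     "coaching.json": b"[]"},
    "reports/example-report.json",
)
manifest_json = json.dumps(manifest, indent=2)
```

Hashing each file lets anyone holding the bundle confirm that the prompts and coaching data are the same ones that produced the reported scores.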

Built for researchers working at the frontier of low-resource machine translation.