NEW: Updated with findings from Lokalise’s comprehensive blind-comparison study and the latest WMT24 research demonstrating LLM superiority over traditional machine translation.
🔖 Bookmark this article
Our product team is continuously performing research and sharing insights on Large Language Models (LLMs) for translation, so we’ll update this article regularly. Alternatively, sign up for our newsletter to stay up-to-date on the latest translation and localization trends.
Large Language Models (LLMs) now consistently outperform traditional machine translation tools, which raises the question: which is the best LLM for translation?
Testing from the product team at Lokalise reveals differences in translation accuracy between two of the most popular LLMs, as well as traditional machine translation tools.
When we talk about translation quality, it’s important to note that it’s inherently subjective. Two linguists might evaluate the same translation quite differently based on their personal preferences and interpretations.
So, we designed this experiment to overcome subjectivity, enabling us to draw conclusions we’re confident about.

The results: LLMs vs traditional machine translation
The research team at Lokalise conducted an extensive blind-comparison study covering three language pairs (English to German, Polish, and Russian) across five translation models to test their hypothesis that LLMs have surpassed ‘more traditional’ MT systems for translation:
- LLMs: Claude Sonnet 3.5 and GPT-4o
- Traditional machine translation: Google Translate, DeepL, and Microsoft Translator
Our hypothesis holds: LLMs demonstrated superior performance across all tested language pairs, with translation quality marked as ‘good’ between 55.7% and 80% of the time, even without any contextual information.
There was a clear winner: Claude Sonnet 3.5 ranked #1 across all languages tested, achieving ‘good’ translations approximately 78% of the time across Polish, German, and Russian.
This aligns with the WMT24 conference findings, which identified Claude 3.5-Sonnet as the top-performing system, winning in 9 out of 11 language pairs tested.
WMT24 conference validates LLM superiority
The 2024 Conference on Machine Translation (WMT24), the main event for machine translation research, provides independent validation of our findings.
Its comprehensive evaluation of 242 translation systems, including 8 large language models and online MT systems for all language pairs, confirmed that LLMs are outperforming traditional machine translation.
💡Key WMT24 findings
• Claude 3.5-Sonnet emerged as the overall best-performing system, winning in 9 language pairs. It even outperformed GPT-4 (wins in 5 language pairs), which is a much more expensive model.
• The majority of participating teams now use LLMs as part of their translation systems
• LLMs consistently outperformed traditional neural machine translation approaches
Curious to learn how AI translation works? Read the linked jargon-free guide.
How Lokalise tested LLMs for translation
To ensure scientific rigor, we used multiple evaluation methodologies, including the Bradley-Terry model, one of the most respected statistical methods for ranking items based on pairwise comparisons. It’s a probabilistic framework that models the probability that translation A will be preferred over translation B using strength parameters.
This approach allowed us to establish a clear hierarchy of translation quality based on maximum likelihood estimation.
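To make the idea concrete: under the Bradley-Terry model, the probability that translation A is preferred over translation B is s_A / (s_A + s_B), where s_A and s_B are the strength parameters. The sketch below fits those strengths with a standard iterative maximum likelihood update. It is a minimal illustration only: the function names and the toy comparison data are assumptions made for this example, not the code or data used in our study.

```python
from collections import defaultdict


def fit_bradley_terry(comparisons, iterations=500, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Returns a dict mapping each engine to a strength; strengths sum to 1.
    """
    wins = defaultdict(int)         # total wins per engine
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    engines = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        engines.update((winner, loser))
    if not engines:
        return {}

    strengths = {engine: 1.0 / len(engines) for engine in engines}
    for _ in range(iterations):
        updated = {}
        for i in engines:
            # Sum over every opponent j that engine i was compared with.
            denom = sum(
                pair_counts[frozenset((i, j))] / (strengths[i] + strengths[j])
                for j in engines
                if j != i and frozenset((i, j)) in pair_counts
            )
            updated[i] = wins[i] / denom if denom > 0 else strengths[i]
        total = sum(updated.values())
        updated = {engine: s / total for engine, s in updated.items()}
        change = max(abs(updated[e] - strengths[e]) for e in engines)
        strengths = updated
        if change < tol:
            break
    return strengths


def preference_probability(strengths, a, b):
    """Model probability that engine a is preferred over engine b."""
    return strengths[a] / (strengths[a] + strengths[b])


# Toy data, invented for illustration: each tuple is (preferred, rejected).
toy_comparisons = (
    [("engine_a", "engine_b")] * 8 + [("engine_b", "engine_a")] * 2
    + [("engine_a", "engine_c")] * 9 + [("engine_c", "engine_a")] * 1
    + [("engine_b", "engine_c")] * 6 + [("engine_c", "engine_b")] * 4
)
toy_strengths = fit_bradley_terry(toy_comparisons)
ranking = sorted(toy_strengths, key=toy_strengths.get, reverse=True)
print(ranking)
```

Sorting engines by fitted strength gives the ranking, and `preference_probability` turns any two strengths back into a head-to-head preference estimate.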
The language pairs we tested
We focused on high-resource languages to test not only the hypothesis that LLMs have surpassed ‘more traditional’ MT systems for translation, but also the hypothesis that LLM translation is already ‘good enough’ for high-resource languages.
We used Large Language Models to translate from English into three languages:
- English to German
- English to Polish
- English to Russian
Then we asked human annotators to evaluate translation quality through pairwise comparisons of translations from different engines. That means, for every translation, native speakers compared the variants from different engines and highlighted the best one.
600+ pairwise comparisons were carried out by multiple human annotators for each language pair.
Interestingly, while Russian is technically a high-resource language, our ‘goodness’ scores were significantly lower than for German and Polish, suggesting that resource availability doesn’t always correlate directly with translation performance.
So the second hypothesis, that LLM translation is already ‘good enough’ for high-resource languages, holds only partially.
Curious to learn more? Discover the difference between NLP and LLMs.
📝 Sidenote: Pairwise comparison is often easier for human evaluators, because it only asks them to make relative judgments (“A is better than B”).
Key finding: Human vs machine agreement
One of the most significant discoveries from our research involves evaluator agreement.
We found that the gap between human inter-annotator agreement and AI ranking system agreement is minimal, whether measured by Cohen’s Kappa (a statistical measure that evaluates inter-rater reliability while accounting for chance agreement) or Average Jaccard Similarity (which measures overlap between evaluation sets).
In some cases, human annotators actually showed stronger agreement with AI ranking models than with fellow human annotators.
This suggests that AI systems can now evaluate translation quality with near-human reliability, a milestone that has significant implications for automated translation quality assessment.
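For reference, both agreement measures are simple to compute. The sketch below shows Cohen’s Kappa for two raters who label the same items, and an average Jaccard similarity over pairs of sets. It is an illustration under assumptions: the function names and sample labels are invented, and how the evaluation sets were constructed in our study is not covered here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match given each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in set(freq_a) | set(freq_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def jaccard(set_a, set_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def average_jaccard(set_pairs):
    """Mean Jaccard similarity over a list of (set_a, set_b) pairs."""
    return sum(jaccard(a, b) for a, b in set_pairs) / len(set_pairs)


# Invented example: which engine each rater preferred for four segments.
human = ["A", "A", "B", "B"]
ai_ranker = ["A", "B", "B", "B"]
print(cohens_kappa(human, ai_ranker))
```

A Kappa of 1 means perfect agreement and 0 means agreement no better than chance, which is why it is a stricter measure than the raw percentage of matching labels.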
Why LLMs outperform traditional translation tools
LLMs like Claude Sonnet 3.5 and GPT-4o bring several advantages to translation tasks that traditional machine translation tools don’t:
- Contextual understanding: Unlike traditional translation systems, LLMs grasp the broader context of text, enabling more natural-sounding outputs
- Cultural nuance: These models can better preserve idioms, cultural references, and tone across languages
- Adaptability: LLMs demonstrate greater flexibility when handling specialized terminology or uncommon language patterns
- Consistency: Our testing showed remarkable consistency in quality, with the best LLM achieving good translations 78% of the time across multiple language pairs
As large language models (LLMs) evolve and new models are released, the gap between LLM and traditional machine translation quality is likely to keep widening.
The WMT24 research confirms this trend, showing that traditional neural machine translation (NMT) is increasingly being integrated with or replaced by LLM-based approaches, typically via fine-tuning or as part of data cleaning/post-editing pipelines.
Today, LLM code translation is increasingly common, which speaks volumes about how far the technology has come.
The time has come for machine translation to move over and Large Language Model translation to fill its bionic boots!
📚 Further reading: Can LLM translate text accurately?
Current LLM and AI translation rankings
Based on the latest research findings and WMT24 results, here’s how the leading translation systems currently rank:
| Rank | Engine | System | Language pairs won | Notes |
| --- | --- | --- | --- | --- |
| 1 | Claude-3.5-Sonnet | LLM (Anthropic) | 9 | Top overall performer |
| 2 | Gemini-1.5-Pro | LLM (Google) | 0 | Upper-middle, no wins, refused Icelandic, below Claude |
| 3 | GPT-4 (ChatGPT) | LLM (OpenAI) | 0 | No wins, below Claude. Still performed competitively, often in the upper-middle cluster, but not at the very top |
| 4 | DeepL | Commercial AI system | 0 | Included as a baseline, did not win any language pairs, competitive in some directions. Good quality but falling behind LLM-based solutions |
| 5 | Google Translate | Commercial AI system | 0 | Included as a baseline, did not win any language pairs, consistently outperformed by LLMs |
| 6 | MT Translator | Participant AI system | 0 | Participated, but did not win any language pairs. Not matching top LLMs |
How to choose the right LLM for translation
🗒️ Note: Model versions are constantly evolving. While our study tested Claude Sonnet 3.5, there are already newer versions available that haven’t been tested.
While Claude Sonnet 3.5 emerged as the clear winner in our testing, the best solution may depend on factors such as:
- Language pairs required
- Content type (technical documentation, marketing copy, legal text)
- Integration capabilities (e.g. with a translation and localization platform)
- Budget constraints
- Volume and consistency requirements
With Large Language Models constantly evolving and new versions being released, it’s hard to keep track of which translation model best fits your needs.
That’s where Lokalise comes in. Our proprietary AI orchestration tool blends multiple AI engines and automatically picks the best one for your language pair and content type, resulting in an 80% first-pass acceptance rate. In other words, these are translations that are ready to publish without human review.
At Lokalise, we’re LLM agnostic, ensuring our customers always get the best performing LLM per language pair, without being locked into one model as AI technology continues to improve.
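As a rough illustration of what picking an engine per language pair and content type can look like, here is a generic routing sketch. It is not a description of how Lokalise’s orchestration works internally: the function name, engine names, and scores are all hypothetical placeholders invented for this example.

```python
def pick_engine(scores, language_pair, content_type, default_engine):
    """Return the highest-scoring engine for a language pair and content type.

    `scores` maps (language_pair, content_type) to {engine_name: quality_score}.
    A content_type of None holds scores for the language pair as a whole.
    """
    candidates = scores.get((language_pair, content_type))
    if not candidates:
        # No data for this content type: fall back to the language pair overall.
        candidates = scores.get((language_pair, None))
    if not candidates:
        # No data for this language pair at all: use the configured default.
        return default_engine
    return max(candidates, key=candidates.get)


# Placeholder scores, invented for illustration (not measured results).
EXAMPLE_SCORES = {
    ("en-de", "marketing"): {"engine_a": 0.81, "engine_b": 0.77},
    ("en-de", None): {"engine_a": 0.74, "engine_b": 0.79},
}

print(pick_engine(EXAMPLE_SCORES, "en-de", "marketing", "engine_c"))
```

The fallback chain keeps the router usable for language pairs or content types that have not been evaluated yet, and swapping in a newer model only requires updating the score table.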
We also integrate with 60+ modern tools, so you can plug AI orchestration into your workflow in a matter of minutes.
When it comes to budget, customers have registered savings of up to 80% using AI translation with Lokalise instead of going the traditional linguistic route.

Stay tuned for more LLM translation research and insights
Our latest research, combined with findings from WMT24, clearly demonstrates that LLM-based solutions outperform traditional machine translation.
Remember though that LLMs need context for translation, in the same way that humans need context. That said, LLM-powered translations beat traditional machine translations even without context.
Without context, translations from the best-performing LLM were rated “good” approximately 78% of the time in our testing.
For even higher accuracy, consider a localization platform that already integrates with LLMs like Claude and GPT-4, allowing you to leverage the best model for each specific translation task. This approach ensures you’re always using the most capable system available while managing the entire translation process in one place.
As AI technology continues advancing, we expect to see even greater improvements in translation quality, with LLMs becoming the standard for professional translation workflows.
We’ll continue to update our LLM translation insights, so stay tuned.