What is the best LLM for translation? A comparison of top AI translation models

Rachel Wolff, updated on March 12, 2025 · 6 min read

The results: LLMs vs traditional machine translation

The research team at Lokalise ran an extensive blind-comparison study of five translation models across three language pairs (English to German, English to Polish, and English to Russian) to test the hypothesis that LLMs have surpassed ‘more traditional’ MT systems for translation:

LLMs: Claude Sonnet 3.5 and GPT-4o
Traditional machine translation: Google Translate, DeepL, and Microsoft Translator

Our hypothesis holds: LLMs demonstrated superior performance across all tested language pairs, with translation quality marked as ‘good’ between 55.7% and 80% of the time, even without any contextual information.

There was a clear winner: Claude Sonnet 3.5 ranked #1 across all languages tested, achieving ‘good’ translations approximately 78% of the time across Polish, German, and Russian.

This aligns with the WMT24 conference findings, which identified Claude 3.5-Sonnet as the top-performing system, winning in 9 out of 11 language pairs tested.

WMT24 conference validates LLM superiority

The 2024 Conference on Machine Translation (WMT24), the main event for machine translation research, provides independent validation of our findings.

Its comprehensive evaluation of 242 translation systems, including 8 large language models and online MT systems for all language pairs, confirmed that LLMs are outperforming traditional machine translation.

💡Key WMT24 findings

• Claude 3.5-Sonnet emerged as the overall best-performing system, winning in 9 language pairs. It even outperforms GPT-4 (wins in 5 language pairs), which is a much more expensive model
• The majority of participating teams now use LLMs as part of their translation systems
• LLMs consistently outperformed traditional neural machine translation approaches

How Lokalise tested LLMs for translation

To ensure scientific rigor, we used multiple evaluation methodologies, including the Bradley-Terry model, one of the most respected statistical methods for ranking items based on pairwise comparisons. It’s a probabilistic framework that models the probability that translation A will be preferred over translation B using strength parameters.

This approach allowed us to establish a clear hierarchy of translation quality based on maximum likelihood estimation.
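To make the Bradley-Terry idea concrete, here is a minimal Python sketch. It is not the code used in the study. It assumes each human judgment is stored as a (winner, loser) pair, uses hypothetical engine names and made-up counts, and fits the strength parameters with the standard iterative update for the maximum likelihood estimate. Under the model, the probability that translation A is preferred over translation B is strength_A / (strength_A + strength_B).

```python
from collections import defaultdict


def bradley_terry(comparisons, iterations=200):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic iterative update that converges to the maximum
    likelihood estimate; strengths are normalized to sum to 1.
    """
    wins = defaultdict(int)         # total wins per engine
    pair_counts = defaultdict(int)  # number of comparisons per unordered pair
    engines = set()
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1
        engines.update((winner, loser))

    strength = {engine: 1.0 for engine in engines}
    for _ in range(iterations):
        updated = {}
        for i in engines:
            denominator = sum(
                pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
                for j in engines
                if j != i
            )
            updated[i] = wins[i] / denominator if denominator else strength[i]
        total = sum(updated.values())
        strength = {engine: value / total for engine, value in updated.items()}
    return strength


def preference_probability(strength, a, b):
    """Model probability that engine a is preferred over engine b."""
    return strength[a] / (strength[a] + strength[b])


# Made-up judgments for three hypothetical engines (not study data):
# each tuple is (preferred engine, other engine).
toy_comparisons = (
    [("engine_a", "engine_b")] * 7 + [("engine_b", "engine_a")] * 3
    + [("engine_a", "engine_c")] * 8 + [("engine_c", "engine_a")] * 2
    + [("engine_b", "engine_c")] * 6 + [("engine_c", "engine_b")] * 4
)

strengths = bradley_terry(toy_comparisons)
for engine in sorted(strengths, key=strengths.get, reverse=True):
    print(engine, round(strengths[engine], 3))
```

Sorting engines by their fitted strength gives the ranking, and preference_probability turns any two strengths back into a head-to-head preference estimate. Note that this simple update assumes every engine wins at least one comparison.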

The language pairs we tested

We focused on high-resource languages to test not only the hypothesis that LLMs have surpassed ‘more traditional’ MT systems for translation, but also a second hypothesis: that LLM translation is already ‘good enough’ for high-resource languages.

We used Large Language Models to translate from English into three languages:

English to German
English to Polish
English to Russian

Then we asked human annotators to evaluate translation quality through pairwise comparisons of translations from different engines. That means, for every translation, native speakers compared the variants from different engines and highlighted the best one.

600+ pairwise comparisons were carried out by multiple human annotators for each language pair.

Interestingly, while Russian is technically a high-resource language, our ‘goodness’ scores were significantly lower than for German and Polish, suggesting that resource availability doesn’t always correlate directly with translation performance.

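So the second hypothesis, that LLM translation is already ‘good enough’ for high-resource languages, holds only partially: it is supported for German and Polish, but less clearly for Russian.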

📝 Sidenote

Pairwise comparison is often easier for human evaluators, because it only asks them to make relative judgments (“A is better than B”).

Key finding: Human vs machine agreement

One of the most significant discoveries from our research involves evaluator agreement.

We found that the gap between human inter-annotator agreement and AI ranking system agreement is minimal, whether measured by Cohen’s Kappa (a statistical measure that evaluates inter-rater reliability while accounting for chance agreement) or Average Jaccard Similarity (which measures overlap between evaluation sets).

In some cases, human annotators actually showed stronger agreement with AI ranking models than with fellow human annotators.

This suggests that AI systems can now evaluate translation quality with near-human reliability, a milestone that has significant implications for automated translation quality assessment.
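For readers who want to see what these two agreement measures compute, here is a minimal Python sketch. It is an illustration under assumptions, not the study's implementation: it assumes Cohen's Kappa is computed over two raters' picks of the best engine per segment, and that Average Jaccard Similarity is the mean overlap between the sets of engines each rater marked as acceptable per segment. The labels and sets in the example are made up.

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty item list")
    n = len(labels_a)
    # Share of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def jaccard(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    a, b = set(set_a), set(set_b)
    if not a and not b:
        return 1.0  # two empty selections count as full agreement here
    return len(a & b) / len(a | b)


def average_jaccard(sets_a, sets_b):
    """Mean Jaccard similarity over aligned per-segment selections."""
    pairs = list(zip(sets_a, sets_b))
    if not pairs:
        raise ValueError("Need at least one segment")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Made-up example: best-engine picks from a human and an AI ranker.
human_picks = ["engine_a", "engine_a", "engine_b", "engine_b"]
ai_picks = ["engine_a", "engine_a", "engine_b", "engine_a"]
print(cohens_kappa(human_picks, ai_picks))

# Made-up example: sets of engines each rater accepted per segment.
human_sets = [{"engine_a", "engine_b"}, {"engine_a"}]
ai_sets = [{"engine_b", "engine_c"}, {"engine_a"}]
print(average_jaccard(human_sets, ai_sets))
```

Kappa corrects raw agreement for what two raters would agree on by chance, which is why it can be much lower than the plain percentage of matching picks. Comparing human-to-human scores with human-to-AI scores on the same segments gives the kind of agreement gap described above.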

Why LLMs outperform traditional translation tools

LLMs like Claude Sonnet 3.5 and GPT-4o bring several advantages to translation tasks that traditional machine translation tools don’t:

Contextual understanding: Unlike traditional translation systems, LLMs grasp the broader context of text, enabling more natural-sounding outputs
Cultural nuance: These models can better preserve idioms, cultural references, and tone across languages
Adaptability: LLMs demonstrate greater flexibility when handling specialized terminology or uncommon language patterns
Consistency: Our testing showed remarkable consistency in quality, with the best LLM achieving good translations 78% of the time across multiple language pairs

As large language models (LLMs) evolve and new models are released, the gap between LLM and traditional machine translation quality is only going to get bigger.

The WMT24 research confirms this trend, showing that traditional neural machine translation (NMT) is increasingly being integrated with or replaced by LLM-based approaches, typically via fine-tuning or as part of data cleaning and post-editing pipelines.

Today, LLM code translation is increasingly common, which speaks volumes about how far the technology has come.

The time has come for machine translation to move over and Large Language Model translation to fill its bionic boots!

Current LLM and AI translation rankings

Based on the latest research findings and WMT24 results, here’s how the leading translation systems currently rank:

Rank	Engine	System	Language pairs won	Notes
1	Claude-3.5-Sonnet	LLM (Anthropic)	9	Top overall performer
2	Gemini-1.5-Pro	LLM (Google)	0	Upper-middle, no wins, refused Icelandic, below Claude.
3	GPT-4 (ChatGPT)	LLM (OpenAI)	0	No wins, below Claude. Still performed competitively, often in the upper-middle cluster, but not at the very top.
4	DeepL	Commercial AI system	0	Included as a baseline, did not win any language pairs, competitive in some directions. Good quality but falling behind LLM-based solutions
5	Google Translate	Commercial AI system	0	Included as a baseline, did not win any language pairs, consistently outperformed by LLMs
6	MT Translator	Participant AI system	0	Participated, but did not win any language pairs. Not matching top LLMs

How to choose the right LLM for translation

🗒️ Note

Model versions are constantly evolving. While our study tested Claude Sonnet 3.5, there are already newer versions available that haven’t been tested.

While Claude Sonnet 3.5 emerged as the clear winner in our testing, the best solution may depend on factors such as:

Language pairs required
Content type (technical documentation, marketing copy, legal text)
Integration capabilities (e.g. with a translation and localization platform)
Budget constraints
Volume and consistency requirements

With new versions of Large Language Models being released all the time, it’s hard to keep track of which translation model best fits your needs.

That’s where Lokalise comes in. Our proprietary AI orchestration tool blends multiple AI engines and automatically picks the best one for your language pair and content type, resulting in an 80% first-pass acceptance rate, meaning translations that are ready to publish without human review.

At Lokalise, we’re LLM agnostic, ensuring our customers always get the best performing LLM per language pair, without being locked into one model as AI technology continues to improve.

We also integrate with 60+ modern tools, so you can plug AI orchestration into your workflow in a matter of minutes.

When it comes to budget, customers have registered savings of up to 80% using AI translation with Lokalise instead of going the traditional linguistic route.

Stay tuned for more LLM translation research and insights

Our latest research, combined with findings from WMT24, clearly demonstrates that LLM-based solutions outperform traditional machine translation.

Remember though that LLMs need context for translation, in the same way that humans need context. That said, LLM-powered translations beat traditional machine translations even without context.

Without context, translations from the best-performing LLM in our testing, Claude Sonnet 3.5, were rated “good” approximately 78% of the time.

For even higher accuracy, consider a localization platform that already integrates with LLMs like Claude and GPT-4, allowing you to leverage the best model for each specific translation task. This approach ensures you’re always using the most capable system available while managing the entire translation process in one place.

As AI technology continues advancing, we expect to see even greater improvements in translation quality, with LLMs becoming the standard for professional translation workflows.

We’ll continue to update our LLM translation insights, so stay tuned.

Insights·Translation

Author

Rachel Wolff

Lead copywriter

Meet Rachel, our Content Manager and Lead Copywriter, who pivoted from advertising to SaaS and has never looked back.

Born and raised in the UK, Rachel has lived in London, Paris, Buenos Aires, and now Brussels. Through city-hopping, traveling, and her studies in French and Journalism, she’s picked up French and Spanish, and is now inventing her own language with help from her three-year-old daughter: Franglospanish!

Outside work, Rachel enjoys making (and eating) fresh pasta, drawing, and spending as much time as possible outside, cycling, hiking, or running.
