I remember back in November 2022, when ChatGPT was released and everyone was going crazy about how smart and humanlike it was. I sat down and started talking to it in Czech. After reading so many raving reviews and hyperbolic predictions, I was shocked by how quickly I ran into the model’s limits in Czech. Now don’t get me wrong: I know my native tongue is not a top-tier language, given it only has about 10 million native speakers. I don’t expect AI models to be as proficient in Czech as in English or other major languages. But it still made me think: if I am getting a much worse experience from large language models than English speakers are, what is the experience like for people who speak languages with even less digital support?

It set me on a path toward working on multilingual LLM capabilities and the challenge (and fun) of measuring them. We were already helping some major labs train AI models in various languages, and some of our clients used to give even longer-tail languages a lot of attention; some still do. But I feel like the “race to AGI” has completely overshadowed the importance of truly inclusive multilingual systems.

Let’s start with OpenAI and its latest model, GPT-5.4: the press release, a document of almost 4,000 words, mentions the words “language”, “linguistic” or “multilingual” exactly zero times. OpenAI used to give some attention to the multilingual performance of its models: it translated the Massive Multitask Language Understanding (MMLU) benchmark into 14 languages and reported the scores in its press releases and models’ system cards. For the newest models, the scores appear neither in the press release nor, even worse, in the system card. The only acknowledgment that the model can handle non-English text appears in a footnote. Neither the note nor the linked help article makes clear whether “support” refers to UI localization or actual model capability, and, if the latter, whether that applies to text, voice, or both.

OpenAI is not alone in this. Anthropic likewise makes no mention of multilingual capabilities in its press release for Claude Opus 4.6, unless you count “multilingual coding”, that is, knowledge of different programming languages. Anthropic does produce a 213-page system card, and it actually comments on the multilingual MMLU score! Sadly, the commentary spans two sentences and does not even provide per-language statistics.

When it comes to documenting which specific languages are supported, Anthropic is even worse than OpenAI. Anthropic has a Multilingual support page, which seems like a good start, until you realize it is just a place to post results for the multilingual MMLU benchmark. At the time of writing, that page does not contain results for Claude Opus 4.5, Opus 4.6, or Sonnet 4.6. Anthropic also does not provide a full list of supported languages, or even a count. Instead, it just says “Note that Claude is capable in many languages beyond those benchmarked below.” Will it work well in Czech? Or Basque? Who knows! The most we get from Anthropic is that “Claude processes input and generates output in most world languages that use standard Unicode characters.”

There are AI labs that do better than the two behemoths. Mistral, for example, touts its Mistral Large 3 as “The next generation of open multimodal and multilingual AI” and claims the model supports more than 40 languages. The model page and the technical documentation provide no additional information (in fact, they do not even repeat the figure); the closest thing to a list is a partial one on the model’s Hugging Face page. Not in Mistral’s own docs, mind you, but on Hugging Face.

Not everyone does this badly, though. Google publishes a full list of supported languages, both generally and by model family. Similarly, Alibaba publishes a full table of supported languages in its model announcements. Both also report multilingual MMLU scores in their model cards; aggregated rather than per-language, but at least present.

Why does this matter? It is easy to argue that English is the lingua franca, that many knowledge workers already use it, and that this is especially true in higher-income markets. But that view is short-sighted. If frontier AI labs expect their products to become truly universal, multilingual support cannot remain an afterthought. Sooner or later, it becomes both a product-quality issue and a growth issue. Once the English-speaking market is saturated, the next wave of users will come from other languages and cultural contexts. And if labs are serious about serving them, they need to start treating multilingual capability as something to measure, document and improve, not something users are left to discover on their own.

To be fair, multilingual performance has genuinely improved. Models handle Czech far better today than they did in 2022, and the gap between English and non-English scores on standardized benchmarks has narrowed. I have some reservations about those benchmarks that I’ll address in a separate post, but the trend is clear. The question is how much of that improvement comes from concentrated effort and how much is just a side effect of more linguistically diverse pre-training data or better model architectures. And I’m not claiming labs do no multilingual work internally; I’m saying that the fact they don’t consider it worth reporting tells you where it sits among their priorities. What isn’t publicly reported isn’t being held to a standard. More importantly, multilingualism in LLMs is far from a solved problem, despite the lack of attention it receives.

I have spent much of the last several years working on this problem directly. My team and I built a multilingual benchmark that deliberately moves away from automated public benchmarks such as multilingual MMLU. Those benchmarks are useful, but they mostly test multiple-choice comprehension rather than real conversational ability. Our study focuses instead on open-ended language generation and manipulation, evaluated at scale by language professionals. I do not usually write about my day job on this blog, but this is one of the rare cases where it feels directly relevant. It is about as close to a passion project as work gets for me. We call it the “Multilingual LLM synthetic data generation study,” and the full 80-page report is available here.
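To make the contrast concrete, here is a minimal sketch of why multiple-choice benchmarks like multilingual MMLU are so easy to automate: the model only has to emit a letter, which can be exact-matched against a gold answer. Open-ended generation has no such shortcut, which is why evaluating it at scale takes human raters. The prompt format, the `grade` helper, and the toy Czech item are all illustrative, not taken from any real benchmark harness.

```python
def format_prompt(question: str, choices: list[str]) -> str:
    """Render a hypothetical MMLU-style multiple-choice prompt."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letters[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def grade(model_output: str, gold: str) -> bool:
    """Exact-match scoring: take the first A-D letter the model emits."""
    for ch in model_output.strip().upper():
        if ch in "ABCD":
            return ch == gold
    return False

# A toy Czech item (translated: "What is the capital of the Czech Republic?")
prompt = format_prompt(
    "Jaké je hlavní město České republiky?",
    ["Brno", "Praha", "Ostrava", "Plzeň"],
)
print(grade("B. Praha", "B"))  # → True; scoring reduces to string comparison
```

There is no equivalent one-liner for judging whether a freely generated Czech paragraph is fluent, accurate, and natural, and that gap is exactly what automated benchmarks leave unmeasured.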

Of course, releasing a report once a year is not enough. We are already working on something that should shed light on models’ multilingual performance on a more continuous basis. Perhaps the most important thing, though, is to get people talking about this. While I do not believe LLMs are the road to AGI, and I have been vocal on this very blog about their negative aspects, I also believe there is a great deal of potential in them, even in their current form. There are underserved communities across the world that could benefit from easier access to language models that can communicate with them well.

Czech has gotten much better since 2022. But if models still trip up in a language with 10 million speakers and a strong online presence, imagine what it’s like in Kinyarwanda, Fijian, or Kyrgyz.