I had a related, but orthogonal question about multilingual LLMs.
When I ask smaller models a question in English, the model does well. When I ask the same model a question in Turkish, the answer is mediocre. When I ask the model to translate my question into English, get the answer, and translate the answer back to Turkish, the model again does well.
For example, I tried the above with Llama 3.3 70B, and asked it to plan me a 3-day trip to Istanbul. When I asked Llama to do the translations between English <> Turkish, the answer was notably better.
Anyone else observed a similar behavior?
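A minimal sketch of that translate-answer-translate-back roundtrip, assuming an OpenAI-compatible endpoint (the model name and prompt wording are placeholders, not what I actually used):

    # Sketch only: assumes an OpenAI-compatible server (e.g. llama.cpp or vLLM);
    # the model name and prompts are illustrative placeholders.
    from openai import OpenAI

    client = OpenAI()

    def ask(prompt, model="llama-3.3-70b-instruct"):
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content

    def ask_via_english(question_tr):
        # 1) Turkish question -> English
        question_en = ask("Translate the following Turkish text to English:\n\n" + question_tr)
        # 2) answer in English, where the model is strongest
        answer_en = ask(question_en)
        # 3) English answer -> Turkish
        return ask("Translate the following English text to Turkish:\n\n" + answer_en)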
Someone apparently did observe ChatGPT (I think it was ChatGPT) switch to Chinese for some parts of its reasoning/calculations and then back to English for the final answer. That's somehow even weirder than the LLM giving different answers depending on the input.
Reminds me of this funny video: https://www.youtube.com/watch?v=NY3yWXWjYjA ("You know something has gone wrong when he switches to Chinese")
I've seen this happen as well with o3-mini, but I'm honestly not sure what triggered it. I use it all the time but have only had it switch to Chinese during reasoning maybe twice.
I've seen Grok sprinkle random Chinese characters into responses I asked for in ancient Greek and Latin.
I get strange languages sprinkled through my Gemini responses, including some very obscure ones. It just randomly changes language for one or two words.
Is it possible the "vector" is more accurate in another language? Like esprit d'escalier or schadenfreude, or any number of other things that are a single word in one language but paragraphs in others?
Possibly. I have seen Claude switching to Russian for a word or two when it is about revolution!
Isn't it just the model getting increasingly incoherent as the non-English fraction of the data increases?
Last I checked, none of the open-weight LLMs has any language other than English as its sole dominant language in the dataset.
I saw Claude 3.7 write a comment in my code in Russian followed by, likely from a previous modification, the English text “Russian coding” for no reason.
> the LLM giving different answers depending on the input.
LLMs are actually designed to have some randomness in their responses.
To make the answer reproducible, set the temperature to 0 (eliminating randomness) and provide a static seed (ensuring consistent results) in the LLM's configuration.
The influence of the (pseudo-)random number generator is called "temperature" in most models.
Setting it to 0 in theory eliminates all randomness: instead of choosing one word from a list of predicted next words, only the MOST PROBABLE word would ever be chosen.
However, in practice, setting the temperature to 0 in most GUIs does not actually set the temperature to 0 but to a "very small" value ("epsilon"), the reason being to avoid a division-by-zero exception/crash in a mathematical formula. So don't be surprised if you cannot get rid of random behavior entirely.
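A toy sketch of what the sampler is doing (my own simplification, not any particular engine's code):

    import math, random

    def sample_next_token(logits, temperature, rng=random):
        # Dividing by temperature == 0 would blow up the softmax below, so many
        # frontends silently clamp "0" to a tiny epsilon instead of special-casing
        # it as a pure argmax -- which is why a sliver of randomness can survive.
        t = max(temperature, 1e-6)
        scaled = [x / t for x in logits]
        m = max(scaled)                      # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scaled]
        total = sum(exps)
        probs = [e / total for e in exps]
        return rng.choices(range(len(logits)), weights=probs, k=1)[0]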
> the reason being to avoid a division-by-zero exception/crash in a mathematical formula
Why don't they just special-case it?
In most inference engines I've seen, it's not necessary to set the temperature to 0: the randomness in sampling is drawn from the seed, so a static seed will produce the same output at any temperature.
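For instance, continuing the toy sampler above: seeding the RNG makes the draws repeatable even at temperature 1.0 (again, just an illustration, not any specific engine's API):

    rng_a = random.Random(42)
    rng_b = random.Random(42)
    logits = [2.0, 1.5, 0.3, -1.0]

    seq_a = [sample_next_token(logits, 1.0, rng=rng_a) for _ in range(10)]
    seq_b = [sample_next_token(logits, 1.0, rng=rng_b) for _ in range(10)]
    assert seq_a == seq_b   # identical seed, identical tokens, despite nonzero temperature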
I had it doing the reasoning in Turkish and English despite the question being in German.
I've seen that with DeepSeek.
I suspect this also happens in programming languages. Subjectively I get the feeling that LLMs prefer to write in Python or JS.
Would be interesting to see whether they actually score better on LeetCode questions when using Python.
See my other comment. The answer is transfer learning: leveraging massive amounts of data in one language like Python, plus a few bridges to another language like Ruby, to obtain a “native” result in the other language.
But in this case the LLM is not exposed to explicit translation pairs between the two languages; rather, by seeing enough examples in similar contexts, it transfers some of what it learned in Python to Ruby (for better or worse results).
Based on my very very limited understanding of how LLMs work, surely they don't "prefer" anything, and just use what they have been trained on?
Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference"
Maybe the LLM preferring English is because of a similar phenomenon - it has been trained on mostly western, English speaking internet?
There are likely some languages that are genuinely easier or more difficult for LLMs.
For example consider Pascal or C89 requiring all variables to be declared at the start of the function body. That makes it much harder to generate code in a linear fashion. In Python you can just make up a variable the moment you decide you need it. In Pascal or C89 you would have to go back and change previous code, which LLMs can't easily do.
Similar things likely apply to strict typing. Typing makes it easier to reason about existing code, but it makes it harder to write new code if you don't have the ability to go back and change your mind on a type choice.
Both could be solved if we selected tokens in a beam search, searching for the path with the highest combined token probability instead of greedily selecting one token at a time. But that's much more expensive and I'm not sure anyone still does that with large-scale LLMs.
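For what it's worth, a toy version of that beam search, assuming the model exposes a next_token_logprobs(prefix) function (purely illustrative):

    import heapq

    def beam_search(next_token_logprobs, start, beam_width=4, max_len=32, eos=None):
        beams = [(0.0, list(start))]                     # (cumulative log-prob, token sequence)
        for _ in range(max_len):
            candidates = []
            for score, seq in beams:
                if eos is not None and seq and seq[-1] == eos:
                    candidates.append((score, seq))      # finished beam, carry it forward
                    continue
                for tok, logp in next_token_logprobs(seq):
                    candidates.append((score + logp, seq + [tok]))
            # Keep the best few partial sequences by combined probability,
            # instead of greedily committing to a single token at each step.
            beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
        return max(beams, key=lambda c: c[0])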
You could ask the LLM to first work out the solution in pseudocode, then translate to Pascal (or whatever). That way the variables are known after the initial pseudocode pass.
Human programmers also did this more frequently in those days than probably is the case now.
> Presumably there is a lot more public info about, and code in, JavaScript and Python, hence this "preference"
This likely plays a major - probably dominant - role.
It's interesting to think of other factors too though. The relatively concise syntax of those languages might make them easier for LLMs to work with. If resources are in any way token limited then reading and writing Spring Boot apps is going to be burdensome.
Those languages also have a lot of single file applications, which might make them easier for LLMs to learn. So much of iOS development for example is split across many files and I wonder if that affects the quality of the training data.
Also worth considering: there's a wider range of "acceptable" output programs when dealing with such forgiving scripting languages. If asked to output C then there are loads of finicky bits it could mess up, pointer accesses, writing past the end of an array, using uninitialized memory, using a value it already freed, missing a free, etc. All things that the language runtime handles in Python or JS. There's a higher cognitive load it needs to take on.
Indeed. I've thought from the beginning that LLMs should focus specifically on ONE language for this exact reason (i.e. mediocre/bad duplication of data in multiple languages). All other languages than English essentially "syphon" off capacity/layers/weights that could otherwise have held more genuine data/knowledge. Other languages should not come into the picture afaics - dedicated translation LLMs/existing-solutions can handle this aspect just fine and there's just no salient reason to fold partial-multi-language-capacity in through fuzzy/unorganised training.
Given that LLMs, like most neural networks, work by passing their input through layers, wouldn't this be expected? There's no going back to an earlier layer, and if the first layers are in some sense needed for "translating" [0] to English, any other functionality in those layers cannot be used.
[0] I am simplifying here, but it would make sense for an LLM to learn this, even though the intermediate representation is not exactly English, given that much of the internet is in English and the empirical fact that LLMs are good at translating.
For most low-resource languages, support in LLMs is trained through translation pairs between English and the other languages, because translation data is easier to come by than, say, conversations about coding, history, or physics, basically the kind of data that is usually used for instruct training.
This kind of training data typically looks like ChatGPT-style conversations where the prompts are all templated like “Translate the following text from X to Y: [text]” and the LLM’s expected answer is the translated text.
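Roughly what one such training example looks like in chat/instruct format (the sentence pair is made up; field names follow the common "messages" convention):

    # Made-up example of a templated translation pair in instruct format.
    translation_example = {
        "messages": [
            {"role": "user",
             "content": "Translate the following text from English to German: "
                        "The library opens at nine in the morning."},
            {"role": "assistant",
             "content": "Die Bibliothek öffnet um neun Uhr morgens."},
        ]
    }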
LLMs can generalize through transfer learning (to a certain extent) from these translation pairs to some understanding (strong) and even answering (weak) in the target language. It also means that the LLM’s actual sweet spot is translation itself, since that is what it was trained on, not just a generalization.
I'd mentally put this in the same box as "chain of thought", where models perform better when explicitly describing the reasoning steps. The only difference in your case is that the model is undertrained on non-English data, so its next-token prediction for non-English prompts is less robust, and thus explicitly converting to English and then back makes it better.
This is probably the case for the "deep reasoning" models as well. If you for example try DeepSeek R1, it will likely reason in either English or Chinese (where it presumably is well trained) even if the prompt is in other languages.
I sometimes dream that they would internally reason in Ithkuil and gain amazing precision.
ChatGPT is very informal and talks like a millennial when I ask questions in French. I hate it.
In ChatGPT settings, you can set your preferences, e.g. choose between tu/vous, and ask it to be more formal.
This should fix your issue, right?
Is there a phenomenon where middle-aged people are very informal or slang-y in France? Usually the kids are the ones creating new lingo in English.
Out of curiosity, does vous/tu change its behaviour?
I don't speak French, but it's interesting that in your language it doesn't quite come across like an insufferable American tourist who isn't in on the group chat. In my primary language, LLMs all fall somewhere on that spectrum.
sorry u hate a whole generation
That's not a "generation", that is a "portrait" (a characterization).
The most french response of all... "Euh, en fait.."
Some studies are trying to ensure that the model reasons through abstractions instead of linguistic representations. (Of course, the phenomenon of reasoning quality differing substantially depending on input language signals a fault: reasoning should be beyond "spoken" language.)
In the past few hours a related, seemingly important article appeared; see https://www.quantamagazine.org/to-make-language-models-work-...
This important paper from Anthropic includes evidence that part (but only part) of reasoning is cross-lingual:
https://www.anthropic.com/research/tracing-thoughts-language...
Fascinating phenomenon. It's like a new Sapir–Whorf hypothesis. Do language models act differently in different languages due to those languages or the training materials?
This is one of those subtle clues that the LLM does not actually 'know' anything. It is providing you the best consensus answer to your prompt using the data upon which the weights rest; if that data was input primarily in English, then you are going to get better results asking in English. It is still Searle's Chinese Room, except you first need to go to the 'Language X -> English' room, then deliver its output to the general query room, and then deliver that result to the 'English -> Language X' room.
Anthropic’s research did find that Claude seems to have an inner, language-agnostic "language", though. And the larger an LLM gets, the more it can grasp the innate meaning of words across language barriers and expand its internal, language-nonspecific representation.
So part of the improved performance as models grow in parameter count is probably due not only to the expanded raw material they are trained on, but also to a greater ability to ultimately "realize" and connect the apparent meanings of words, so that a German speaker might benefit more and more from training material in Korean.
> These results show that features at the beginning and end of models are highly language-specific (consistent with the {de, re}-tokenization hypothesis [31]), while features in the middle are more language-agnostic. Moreover, we observe that compared to the smaller model, Claude 3.5 Haiku exhibits a higher degree of generalization, and displays an especially notable generalization improvement for language pairs that do not share an alphabet (English-Chinese, French-Chinese).
Source: https://transformer-circuits.pub/2025/attribution-graphs/bio...
However, they do see that Claude 3.5 Haiku seems to have an English "default" with more direct connections. It’s possible that an LLM needs to take a more roundabout path through its generalizations to communicate in other languages, and that this causes a drop-off in performance that gets worse the smaller the model is.
Modern Standard Chinese is almost syntactically "identical" to English, for some reason. And French fed directly into the medieval English that became modern English.
My point is, those language pairs aren't random examples. Chinese isn't something completely foreign and new when it comes to how it differs from English.
Exactly. I found it surprising how soon people assumed that prompting "Imagine you're the smartest and most creative person in the world, ..." would somehow result in the most creative output.
It's clear from the start that language modelling is not yet there. It can't reason about low-level structure (letters, syllables, rhyme, rhythm), and it can't map all languages to a single clear representation. The representation is a mushy, distributed mess out of which you get good or bad results.
It's brilliant how relevant the responses are, and how often they're correct, but the underlying process is driven by very weird internal representations.
It would be great if we could get to a point where we use a language encoder and decoder with a language-agnostic knowledge model in between. But since it's generally more efficient to train the whole model end to end, such modularity would probably come at a performance price, and I don't see any private (or "non-profit") companies taking that approach anytime soon.
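Something like this, structurally (a very rough PyTorch-style sketch; every dimension and module choice here is invented, and no real model I know of is built this way):

    import torch.nn as nn

    class ModularLM(nn.Module):
        """Hypothetical split: language-specific encoder/decoder around a shared core."""
        def __init__(self, vocab_size=32000, d_model=512, nhead=8):
            super().__init__()
            layer = lambda: nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.embed = nn.Embedding(vocab_size, d_model)
            self.encoder = nn.TransformerEncoder(layer(), num_layers=2)  # surface language -> shared space
            self.core = nn.TransformerEncoder(layer(), num_layers=6)     # language-agnostic "knowledge"
            self.decoder = nn.TransformerEncoder(layer(), num_layers=2)  # shared space -> surface language
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, tokens):
            h = self.embed(tokens)
            return self.lm_head(self.decoder(self.core(self.encoder(h))))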
My supervising professor for the PhD program I left did a paper on the Chinese Room and argued that, to a large degree, understanding a task is the ability to compress it by many orders of magnitude. In that sense the LLMs are succeeding because, despite their supposedly massive parameter sets, they are absolutely tiny compared to the Chinese Room version.
Searle's "Chinese Room" was as wrong then as it is now
Similar to or better than the performance of most so-called humans, so I guess we're all collections of Chinese Room switchboxes.
Both, but primarily due to the lack of training materials. 10 or so million native speakers of my language will never be able to generate the same amount of training material as over a billion English speakers do.
There is a steep drop in quality in any non-English language, but in general fewer native speakers = worse results. The outputs tend to have a certain "voice" which is extremely easy to spot, and the accuracy of results goes out the window (way worse than in English).
Right, but it’s interesting that this means its reasoning abilities potentially drop off when it’s talking Thai, or its knowledge of WW2 history in the Eastern Theatre might drop off when speaking French, where the same model has no trouble with the same questions in English. My French and Thai are both rudimentary, but I’m working from the same set of facts and reasoning ability in both languages. Will it give different answers on what the greatest empire that ever existed was if you ask it in Mandarin vs Italian vs Mongolian?
They absolutely do. They know more in English than in Spanish, I've seen that on all models, since the beginning.
They have more data in English than Spanish. LLMs don't know or reason or follow instructions. They merely render text continuations that are coherent with the expectations you set when prompting. The fact that they are not able to sustain the illusion in languages with less available training data than English should make that clear.
I have observed this and this is what I would expect to have happened thinking from first principles.
I wonder how this compares to RWKV-V5 7B
https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers
Maybe someone should edit the title to mention this is from 2024: [Submitted on 30 Sep 2024 (v1), last revised 15 Oct 2024 (this version, v2)]
I also quite liked the EuroLLM project: https://huggingface.co/blog/eurollm-team/eurollm-9b
It was pretty good with Latvian (better than other models this size, as well as the variants of Llama or Qwen that I could run), and probably with other EU languages as well, I assume.
I've just tried it in one of the supported languages, and it seems to respond far better than any model under 24B that I've tried before. With its licensing, it sounds much more exciting to me than the OP.
More diversity in the LLM space is always good. In my experience though, speaking as a native speaker of one of the less-used European languages, Mistral's models already use it pretty well.
I live in a country with 3 national languages and I happen to use all of them + English + another one where most of our clients are based. Mistral is the only model atm which doesn’t make a mess of it all. It’s not perfect, but it doesn’t force me to “pretranslate” things.
As a native of another small European language, no state of the art model comes anywhere close to not being laughably bad, so more work in this space is definitely welcomed as far as I'm concerned.
Really? In my experience, Le Chat eventually devolves into Spanglish when trying to speak Spanish, so I would have expected worse from Mistral for minority languages.
On this topic, don’t miss the quite useful benchmark:
https://euroeval.com
ah, yes... Europe, the continent with 10 countries
one of them with 50k population
Could you elaborate on what you wish to convey with this comment?
I guess it's a sarcastic statement about EuroEval covering a fraction of the European languages, yet containing Faroese.
It was called ScandEval until recently.
Yes, that was my point, thank you. I did not know it was focused on the Scandinavian countries until recently; I guess it makes sense now why it has the languages it does.
There is also a Greek LLM from 2024.
Meltemi: A large foundation Language Model for the Greek language
https://huggingface.co/ilsp/Meltemi-7B-v1.5
Meltemi is OK, but it's "old" and not that good by today's standards. If you need a good Greek local LLM, try https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct. Yes, I know it's based on Llama and not a foundation model, but it is still a LOT better than Meltemi.
I mean, Mistral AI is a Paris-based company, and theirs was considered on par with or better than other open-weight models such as llama3.1 and qwen2.5, and mistral-24b is currently beating the oh-so-great gemma3-27b depending on the task.
Also, Stable Diffusion was originally (and still is I believe) developed in Munich.
It's true though that raising capital and finding investors works wayyy better in the US (kind of needless to say on HN), and so did attracting top talent - at least in the past. Don't get me started on energy prices ;) but I don't believe those contribute significantly in the end anyway.
You don't think American companies raising hundreds of millions to ten billion for training models contributed to their model performance or market positions?
I think a pile of money and talent is largely the cause of where they're at.
Can someone explain this? They just reduce the English text during pretraining to balance it out? Shouldn't that harm every other benchmark though?
>European versions of ARC
But this is an image-like benchmark. Has anyone looked at the article about EU-ARC? What is the difference, and why can't it be measured on a regular one?
I glanced through it and didn't find it right away, but judging by their tokenizer, they are training from scratch. In general, I don't like this approach for the task at hand. For large languages, there are already good models that they don't want to compare with. And for low-resource languages, it is very important to take more languages from the same language group, which are not necessarily part of the EU.
You might be confusing ARC-AGI and EU-ARC which is a language benchmark [1]
[1] https://arxiv.org/pdf/2410.08928
Why would they want more languages from outside of the EU when they've clearly stated they only target the 24 official languages of the European Union?
For example: the Slovene language. You simply don't have enough data for it. But if you add all the data that is available for related languages, you will get higher quality. LLMs fail to exploit this property for low-resource languages.
I'm not sure I'm convinced. I speak a small European language, and the general experience is that LLMs are often wrong precisely because they think they can just borrow from a related language. The result is even worse and often makes no sense whatsoever. In other words, as far as translations go, confidently incorrect is not useful.
They train on 14 billion tokens in Slovene. Are you sure that's not enough?
Unfortunately, yes.
We need more tokens, more variety of topics in texts and more complexity.
We need one-shot learning.
(That amount is equivalent to 50,000 books, which few nationals will have read.)
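Back-of-the-envelope for that figure (the tokens-per-book number is my own assumption):

    slovene_tokens = 14_000_000_000            # from the paper, per the comment above
    tokens_per_book = 280_000                  # assumption: a long book, roughly 200k words
    print(slovene_tokens // tokens_per_book)   # -> 50000 books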
Upset that my mind went, "TEKKEN 7 LLM." Imagine Heihachi Mishima vibe-coding for you.
TIL there are european versions of ARC, HellaSwag, MMLU, and TruthfulQA.
A paper on languages that begins with a grammatical error in the first sentence does not inspire confidence:
> LLMs represents a disruptive technology
Hey, at least it's not generated by chatgpt :D
Funny how LLMs now write cleaner than humans in most cases.
I imagine there was a similar tipping point in the Industrial Revolution when machines started making "better" manufactured items than artisans.
Interestingly, we then collectively decided that, in many cases, imperfect artisanal things were better than perfect industrially produced things. So maybe people will start intentionally putting mistakes into their texts to prove they're not machines.
I'm already reluctant to use the em-dash correctly because so many people think only LLMs know how to use it.
It's not that I think only LLMs know em-dashes, but they abuse them so much that I get annoyed every time I see one.
LLMs seem to use them much more than normal people.
If I were a teacher marking homework, em-dashes would be at least an amber flag for LLM use.
Given that it’s about non-English languages it is forgivable
They compared with Llama 3.1 and found that to be better on average for their tasks like European MMLU. And Llama 3.1 is the worst in the batch with Qwen 2.5 and Gemma 3 being significantly better.