So, let me see what I think I understand here:
1. AI models are good at Cæsar-cypher substitution, because it occurs often enough in training data for certain values of the cypher offset. Outside those values, AI doesn't handle the transformations well.
2. Somehow AI models perform this cypher also within high ranges of Unicode, because the characters are encoded as three tokens each, of which the last one encodes the same difference as between alphabetic letters, and so the first two tokens get discarded as irrelevant, meaning that by sheer chance the alphabet maps perfectly via Cæsar-cypher (with a two-token offset) to a specific range of Unicode characters reserved for Byzantine music notation.
3. This is easy to understand for one AI model, because it's explicable by chance that the offset between the alphabet and Byzantine music notation should coincide perfectly with two less-significant tokens. It's harder to understand why this works in more than one AI model, though.
I found in my very limited testing with the free version of ChatGPT that it has a real bias for the Caesar method in general, suggesting it or bringing it up unnecessarily in almost every conversation.
Here is a sample. This 'encoded' form is clearly not a cipher (it is in fact random), but it 'tried' anyway:
Prompt: Does the following sentence: "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm?" bare resemblance to the phrase "how much wood would a wood chuck chuck if a wood chuck could chuck wood?"?
ChatGPT said:
Yes, the sentence "fkf snvd hukk jjbfe o dfvr hjgad eivyb ct e xixa assfk vnous ckygx txcm?" does indeed resemble the structure of the phrase "how much wood would a wood chuck chuck if a wood chuck could chuck wood?".
It seems like it could be an encrypted version of the latter phrase. Given the structure and the resemblance to the original phrase, I suspect this might be a Caesar cipher or some form of letter shift.
Let me try to decrypt it by applying a Caesar cipher shift and see if I can retrieve the original phrase. I'll start with a few different shift values. Hang on!
It's not that surprising that models encode Byzantine Music Notation characters using separate tokens for each UTF-8 byte, since they're unlikely to occur often enough for byte-pair encoding to allocate dedicated multi-byte tokens to them.
What isn't clear to me is where ASCII text with 64 added to every byte occurs in the wild.
Translating uppercase to lowercase adds 32.
Maybe it's not "thinking" in terms of "add 64", but rather "translate uppercase to lowercase, twice".
Possibly because of xor with 0xc0 which, for lowercase ascii, has the same effect as adding 64.
Probably seeing xor-encoded input is common in the training data.
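That checks out at the byte level. A quick sketch of my own (not from the thread): XOR with 0xC0 sets bit 7 and clears bit 6, so for any byte in the 0x40-0x7F range, which covers both ASCII letter cases, the net effect is +64.

  # For bytes 0x40-0x7F (all ASCII letters), XOR with 0xC0 sets bit 7 (+128)
  # and clears bit 6 (-64): a net +64.
  assert all((b ^ 0xC0) == b + 64 for b in range(0x40, 0x80))
  print(hex(ord("a") ^ 0xC0))  # 0xa1, i.e. ord('a') + 64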
Have you checked the UTF-8 encodings of all the typical "Unicode fonts" e.g. circled letters, superscripts and so on?
Good idea. I checked all alphanumeric characters whose UTF-8 encoding ends with \xa1.
A seems like the culprit. Mystery solved!
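For anyone who wants to repeat that check, here is a minimal sketch of the scan described above (my own code, not the original commenter's):

  # List alphanumeric codepoints whose UTF-8 encoding ends in the byte 0xA1.
  import sys, unicodedata

  for cp in range(sys.maxunicode + 1):
      ch = chr(cp)
      if ch.isalnum() and ch.encode("utf-8").endswith(b"\xa1"):
          print(f"U+{cp:04X} {unicodedata.name(ch, '?')}")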
> What isn't clear to me is where ASCII text with 64 added to every byte occurs in the wild.
Maybe it has some general ASCII bit-twiddling capability for base64 and uuencode that helps here too.
NaN boxing?
LLMs ought to be good at generating simple substitutions too, but they're not: https://claude.ai/share/9495246c-b77a-496e-8643-dbc5ec58dbcb
It tries but it fails to both generate and apply a simple substitution. It starts substituting based on applying the substitution twice to some letters.
I spent a very frustrating time trying to solve it, especially after asking it to confirm it had created a 1:1 mapping.
Since it has no reasoning, it just lies about having done so.
Why should they be? The token is at the word level more or less, so the model can't see the letters to substitute.
When we get character-level models this should be more viable.
Does it make sense to talk about the index of a token? The model never sees such an index does it? I thought the tokens are converted to learned embeddings at the first layer.
This is a great point, I think I might have been wrong actually. It doesn't really make sense that one row of the embedding matrix is treated differently than another...
> Outside those values, AI doesn't handle the transformations well.
It doesn't handle it well if you prevent it from thinking about it and ask it for a zero-shot, single pass answer.
I'm reminded of the "Unicode Tags" craze from a few months ago. [1]
It was discovered that some LLMs effortlessly understand characters from the "Tag" range in Unicode and treat them like ASCII, even though those characters are used virtually nowhere in normal text and you in fact need specialized tools to just make them visible. (There is a formal 1-1 mapping between tags and ASCII characters, which would also qualify as a Caesar cipher, but you'd have to read the Unicode spec to find out)
Most concerns were about the fact that this would allow people to smuggle hidden messages to or from the LLMs. But an interesting question was also how the models had even learned the mapping in the first place if tags never occurred in the training data anywhere.
As I understood it, the solution was pretty simple though: They hadn't. There was no specialized circuit for tags in the models. Tag characters just had the property that if you write them as bytes, they will look like "<some prefix bytes> <byte pattern of the corresponding ASCII character>".
So already the tokenizer would parse the characters as ASCII, interleaved with "unknown" tokens for the prefixes. All the model had to do was to ignore the "unknown" tokens and it could process the rest like ASCII. No Caesar cipher decoding needed!
Are we sure something similar isn't happening here?
[1] https://arstechnica.com/security/2024/10/ai-chatbots-can-rea...
This is exactly what's happening here. But note that UTF-8 is self-synchronizing, so no encoding of one character contains the encoding of another as a subsequence. Instead, both tag characters and the Byzantine music notation in the article look like "<some prefix bytes> <byte pattern of the corresponding ASCII character + 96>"
They share this property with the Fullwidth Latin block, which does occur in the wild interspersed with Japanese or Chinese text.
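A small illustration of that byte-level view (my own sketch; the specific codepoints are just convenient examples, not the ones from the article): each block contributes a fixed prefix, and the final byte tracks the letter.

  # Compare UTF-8 byte patterns: a block-identifying prefix plus a final byte
  # that moves in lockstep with the underlying Latin letter.
  examples = {
      "ASCII a (U+0061)":           "a",
      "Tag a (U+E0061)":            "\U000E0061",
      "Fullwidth a (U+FF41)":       "\uFF41",
      "Byzantine symbol (U+1D041)": "\U0001D041",
  }
  for label, ch in examples.items():
      print(f"{label:28} {ch.encode('utf-8').hex(' ')}")
  # -> 61 / f3 a0 81 a1 / ef bd 81 / f0 9d 81 81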
> They share this property with the Fullwidth Latin block, which does occur in the wild interspersed with Japanese or Chinese text.
How common is that? In my experience it's much more normal for Chinese text to intersperse ordinary ascii characters.
https://www.zdic.net/hans/%E8%84%B8
I'm not sure what purpose fullwidth characters are supposed to serve, but whatever it is, it doesn't seem like they're succeeding.
Fullwidth Latin characters exist so that you can arrange your text into a grid without the occasional word in Latin script messing up your alignment.
Most people don't really care about this, or, if they do, simply use a font that renders regular Latin at full width (or half width to be more space-efficient) but very occasionally the Fullwidth Latin codepoints get some use. It's more common in Japanese (though still rare) than Chinese in my experience, but e.g. the Project Gutenberg ebook of 阿Q正傳 https://gutenberg.org/cache/epub/25332/pg25332-images.html uses fullwidth Qs.
Ah, that makes sense. Thank you!
This sounds odd, why would you feed the LLM text as bytes instead of characters?
For reference, this was the thread where someone explained that to me (from 5 months ago) : https://news.ycombinator.com/item?id=41849759
Oh, that's interesting! It sounds like it's not literally being fed UTF-8 bytes, but instead more like this: for rarely seen characters, it's two tokens, namely a codeblock token first (a "Tag" token in this case), followed by a token like "1st character in this codeblock", "2nd character in this codeblock", and so on. And since many rare codeblocks are Latin-like (tags, circled letters, mathematical Fraktur variables, etc.), the LLM picks up that "some block token" + "1st character in the codeblock" is kinda like "A"? Is that how it works?
Had to read it again as well but yeah, that's how I'd understand it too. So the "offset in block" tokens are still not the same tokens as for the "real" ASCII letters, but they are the same tokens for all "weird ascii-like Unicode blocks". So the model can aggregate the training data from all those blocks and automatically "generalize" to similar characters in other blocks (by learning to ignore the "block identifier" tokens) even ones that have very little or no training examples themselves.
Edit: So this means if you want to sanitize text before passing it to an LLM, you don't only have to consider standard Unicode BMP characters but also everything that mirrors those characters in a different block. And because models can do Caesar ciphers with small offsets, possibly even blocks where the characters don't line up completely but are shifted by a small number.
Maybe it would be better to run the sanitizer on the tokens or even the embedding vectors instead of the "raw" text.
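As a starting point for such a sanitizer, Unicode's own compatibility normalization already folds many of the look-alike blocks back to plain ASCII. A minimal sketch (only a partial defence: NFKC does not touch Tag characters or blocks that merely share a byte pattern):

  import unicodedata

  def fold_lookalikes(text: str) -> str:
      # NFKC maps fullwidth forms, circled letters, mathematical alphanumerics,
      # etc. onto their plain equivalents.
      return unicodedata.normalize("NFKC", text)

  print(fold_lookalikes("Ｈｅｌｌｏ 𝕄ｒ. Ｂｏｎｄ"))  # -> "Hello Mr. Bond"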
Because if you start with characters, much of the token vocabulary would be dedicated to rare Chinese characters right off the bat. If you start from UTF-8 bytes, you can dedicate more token space to common sequences of multiple characters (i.e. words people actually use) and achieve much better compression ratios.
I don't understand. Why would much of the vocabulary be dedicated to rare Chinese characters? Wouldn't those need to show up in the training data first? And if they did, wouldn't they also show up as weird byte sequences? And aren't UTF-8 byte sequences kinda risky for everything other than ASCII, since only ASCII bytes and header bytes are unambiguous, whereas continuation bytes (10xxxxxx) are very ambiguous individually? I mean, sure, the LLM would notice that their meaning changes depending on the preceding continuation and header bytes, but it is still not clear to me why UTF-8 bytes are better for LLMs than characters (or even grapheme clusters). UTF-8 bytes seem like a very arbitrary choice to me. Why not do UTF-9 instead and get the most important Latin letters as single nine-bit bytes?
Yes, rare Chinese characters do show up in the training data (the rarest of them at least appear in lists of characters) and yes, they get tokenized as weird byte sequences, making the model work harder to process them, but it's better for that to happen to rare characters than to common words. It's a tradeoff.
And of course UTF-8 is unlikely to be the single best encoding (e.g. Anthropic has a tokenizer that turns all caps text into a special caps lock symbol plus the regular-case equivalent) but much of it is papered over by byte-pair encoding. E.g. the most important Latin letters appear often enough that they get dedicated tokens anyways.
Another thing that LLMs are (surprisingly?) good at is randomly mixed language conversations, here's something I tried a few days ago:
Prompt: Kan du comprender questo text? Anata wa quelque inkling habe about lo que se parla dans la presente oraçao? Repondez med sam way of goboriu.
ChatGPT:
Ja, ich kann verstehen este mix de idiomas! Tu es usando une combinación de várias línguas pour expresar una pregunta. Je vais responder en the same modo:
Si tu veux kommunicieren en cette manière, c'est possible, aber il peut être ein bisschen confusing pour andere people! 你觉得这个方式有趣吗? O préfères-tu un idioma más claro?
Deepseek:
Ja, jeg kan forstå en lille bit av ciò que tu está dizendo. Anata no message wa molto interessante, pero un poco confusing aussi. Si tu peux écrire in una lingua sola, sería plus facile para mi at comprender. Danke, merci, arigatou!
Claude:
Oui, je kan entender algo of this texte mezclado! C'est un interessante modo de communicare using différentes languages together. Je trouve ça quite divertido y puedo responder en la misma manera mixta. Ist das what you were looking for? Watashi wa happy to continue questa conversation in whatever linguistic style anata wa prefer!
Perplexity:
Ja, ich kann diesen Text verstehen! Vous avez créé una frase muy interesante con palabras from different languages. Let me rispondere using a similar mishmash of lingue:
Questo è un esempio fascinante di code-switching e language mixing. Watashi wa impressed by la creatività y el modo en que vous avez combiné various idiomas in one sentence. Es ist ein poco challenging zu decifrar, aber anche molto divertente!
Ich würde sagen que el mensaje principal è about compréhension linguistique et la capacité de entender verschiedene Sprachen when they're mixed together. C'est une manière intéressante de tester language skills et cognitive flexibility.
Grazie for questa stimolante question! J'espère que mi respuesta was equally entertaining und in lo mismo estilo multilingüe.
Word vectors for the same concept in different languages are very close, that's not really surprising. The whole `v_king - v_male + v_female = v_queen` idea leads to an intuition that `v_grazie - v_italien + v_deutsch = v_danke`.
Code switching probably appears a lot in both literature and audio from some contexts too.
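For what it's worth, you can poke at that analogy intuition directly if you have some aligned multilingual word vectors lying around. A rough sketch with gensim (the vector file name is a placeholder, not a real published resource):

  from gensim.models import KeyedVectors

  # Placeholder path: any aligned multilingual embedding in word2vec text format.
  kv = KeyedVectors.load_word2vec_format("multilingual.vec")
  # v_grazie - v_italien + v_deutsch ~= v_danke, if the intuition holds.
  print(kv.most_similar(positive=["grazie", "deutsch"], negative=["italien"], topn=3))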
Why would it matter that a concept is expressed in a different human language? How would you structure a multilingual LLM so that it couldn't syncretise languages in this way? You'd seemingly need to properly segregate the languages.
I suppose this mechanism, only quite a bit up the conceptual ladder, is necessary for LLMs to be able to translate between languages, which they apparently are trained to do, explicitly or not.
Yes I understand the encodings will be close and that helps, I guess that's why they produce good translations, but I'm intrigued by the LLM having so much control of the switching without even explicit prompting, just a one-shot example. I also guess I'm easily impressed.
I only took Duolingo French for a few months a few years ago, I've heard my girlfriend practice her Italian, and I've spent some time around German people. I had Russian lessons, and I have pretty good English and Lithuanian skills. I'm only fluent in the last two languages. I still understood most of your prompt. So I don't think this is a good test.
Reading that prompt again, I think watching some anime with subs helped too.
Yup, LLMs are a polyglot’s dream interface, considering multi-language support is a feature that pretty much all companies screw up, each in their own way.
And then there’s Apple, which will not let me use their AI features because Siri needs to be in the same language as iOS, Siri is set to English, and iOS is set to “English (Spain)” (????).
I tried putting a few of GP's multilingual paragraphs into google translate on detect mode, and it got everything into English perfectly! Interestingly, it declares a single language as having been detected, which varies perhaps based on majority input language.
I think the research Anthropic released recently showed that language is handled independently of the "concepts" being conveyed, so first you get the concepts, then you get the translation into a language.
Mixed scripts as well. In March 2024 I asked Gemini Advanced (whatever the version was at the time) to transliterate an image which had the following Persian text on it:
> یوسفی بود ولی هیچ خریدار نداشت
Its output was:
> Yoosefi بود ولی هیچ khaरीदār nadāsht
That's three different scripts with two different Romanisation schemes just for the Latin/Roman script (writing "Yoosefi" as "Yūsefī" or "Yūsufī" would have been more consistent with "nadāsht").
I get strong Belter Creole vibes from this one
Oh, this is a mental virus stronger than Ghiblifying all the things. Alas, ahora mina wa is perdú. Él kite iru.
this fits the supposition -- since LLMs can be fed patterns of nonsense and learn to reply in those patterns, LLMs are not intelligent.
YNews corollary : since posters cannot resist making new swathes of Look At This LLM Output, the open nature of tech boards is doomed in some ways (?)
You're proposing that advanced pattern recognition is a sign of NOT being intelligent?
Was the above comment nonsense, or did it have a pattern? If a real person happened to know ten languages and played along in this game with you, would you also see that as evidence that they are not intelligent?
yes, because in the example given -- LLMs can be fed patterns of nonsense -- the byte patterns purposefully lack meaning. Therefore the replies also lack real meaning, but they appear according to rules. That is not being "intelligent."
The prompt
> Kan du comprender questo text? Anata wa quelque inkling habe about lo que se parla dans la presente oraçao? Repondez med sam way of goboriu.
can be translated to
> Can you understand this text? You have some inkling of what is said in this current message? Answer me in the same manner of speaking.
I can recognize Spanish, French, English, Japanese, Russian, Italian, Portuguese, and a couple of words are from languages I don't speak (German? Dutch?) but easily inferrable through their similarity to English.
Not nonsense, just code. If meaning was passed from GP to so many of us, and you didn't catch the meaning, it doesn't make the message nonsense.
But in this case neither the input nor the output are actually nonsense!
People who speak multiple languages can easily understand both the GP's query and every single LLM reply they quoted.
I'm afraid you have failed the jschoe test [0] : you've been outsmarted by an LLM, and incorrectly concluded that it's because the LLM did something dumb.
[0]: https://news.ycombinator.com/context?id=43468092
Those texts aren't nonsense. The prompt has a meaning, the LLMs are able to understand it, and are able to reply with coherent and understandable responses crafted in the same way the prompt was written. For me it's a very clear example of something that is very far from any training data coming out of the models. Intelligent? No, but for me it points to the idea that "language is solved".
As a vegan, maybe I'm a little biased, but I often think about what the implications of a universal translator would be, if it did in fact give us the ability to understand animals. What would that imply if you could drive by a slaughterhouse and be able to understand animals saying goodbye to their loved ones... assuming this is happening... Would all slaughtering stop? Or would people be okay with that? Interesting times ahead if there is any possibility for ML to translate animal language.
I'm also a vegan, but it doesn't seem likely to me that other species have languages similar to ours. I think people have already used ML to interpret cat and dog communications, and they got general emotions more than something like syntax.
It's complicated by the fact that other species' throats and mouths physically can't form many human language phonemes*, but even the use or recognition of human language by other great apes (and parrots) is very controversial, and they probably have cognition and sociality most similar to ours. But it's not clear that they can do much of what human language does.
Obligatory Gary Larson:
https://static1.cbrimages.com/wordpress/wp-content/uploads/2...
We don't slaughter animals because we think they don't mind dying, we slaughter them because we've outsourced the mass killings to people who don't mind doing it, and a steak looks enough unlike a cow that we don't think that it used to be alive.
Basically, if we had to slaughter our own cows, I doubt we'd be eating as much meat.
People adapt very easily. If you were trapped on a mountain, you'd likely butcher a cow with the rest of your soccer team. Don't judge everything through the lens of plenty. If you're American, it might be an exercise that becomes useful soon.
If I were trapped on a mountain, I'd likely butcher my soccer team. That's kind of the entire point, that I don't need to be slaughtering cows.
I can tell you've never lived in the Midwest, or maybe just not outside of a city. People have dedicated chest freezers for wild game that they keep full all year. Opening of hunting and fishing seasons are huge deals.
I've never lived in the Midwest, because I'm not American, but I grew up in a small village where we had to decapitate our own chickens. I never got over the discomfort at taking another life.
* People ate plenty of meat when they had to slaughter the animals themselves.
* Hunting is quite popular.
* Every adult that eats meat is quite aware of what goes on to bring it to his table.
So I would disagree. We slaughter animals because that is what they are for, it is why they are farmed, and we want the resulting products. I like my leather shoes and jacket and belt. I like a steak. I like a chicken curry. It doesn't concern me at all that cows and chicken and lambs die to make that happen. They are knocked out first, so it is quite humane.
> Every adult that eats meat is quite aware of what goes on to bring it to his table
> They are knocked out first, so it is quite humane
Those two statements contradict each other: most of the chickens aren’t knocked out, or the attempt fails. It’s however easier to finish your dish if you don’t bother evaluating agroindustrial marketing material (and the cute kids’ farm you saw as a toddler).
The same happens with "cows eat grass", "this fish was sustainably caught because the label said so", "that chicken had a happy life because it’s an organic one".
Have you ever killed an animal with your own hands?
We haven’t had an evolutionarily relevant reason to stop. If sentient alien life looks like a chicken we’d stop eating chicken. If pigs get any smarter we’ll have to stop eating them. We’ve already mostly stopped eating cats and dogs in most western countries. For me, personally, I view it as a 3rd or 4th tier problem. We’re not solving world hunger for another 2 centuries so I put it out of my mind. If I’m going to solve a “food problem” it seems cruel and irresponsible to solve the food’s problem.
"We’re not solving world hunger for another 2 centuries"
Why two centuries? Deaths from famines have already dropped precipitously in the last three generations or so. Today, if there is a problem with food, it is usually a logistical problem, not a problem with food availability/cost in general, and half of the world has a problem of eating too much.
Anyway, two centuries is a long time. Two centuries ago, electricity wasn't a thing yet.
I don't think solving hunger is a problem of quantity. It's a political and systemic inequality problem. I don't see those being adequately managed for at least 200 years if ever.
But then you should call the thing to be solved "problem of good governance" instead, and that is something that indeed may take centuries. Bad governance will manifest itself in a multitude of problems that have no intrinsic organic relationships amongst them, and I am not sure if it makes sense to split them into sub-categories.
In the past, hunger was quite often a quantity problem. If a period of bad weather hit Medieval Europe, there wouldn't be any practical way to import food for the entire continent from, say, India.
In this sense, hunger is being solved.
If we (on average) can see little children getting bombed on live TV and feel no need to call our senator and ask him what the fuck he thinks he's doing, then I don't think a slaughterhouse will be much of a problem either.
Unfortunately, you're probably right.
>very far from any training data
It's not that far from training data surely. If you're only training on next-word basis then you'll "often" see individual words from other languages mixed in.
It's like some sort of uber-pidgin.
In a high-dimensional enough space nothing is ever far from anything.
nb it doesn't even train on words, just subwords
language will be solved when LLMs are translating whale songs to human language imo.
I was also surprised to find out (roughly a year ago) that Claude is good at Old English (which, despite its misleading name, looks nothing like English and is more of a Germanic language) whereas ChatGPT would output pure hallucinations.
Claude can speak medieval and ancient languages but mixes up different time periods pretty often, unless you hard prompt the desired grammar. For Old English in particular, it tends to give something vaguely Shakespearean instead. It often uses period-incorrect alphabet or specific symbols as well (for Slavic languages in particular).
I've also tried Old Norse, Ancient Greek, and Old East Slavic, and the result is pretty much the same. For OES in particular, it often outputs period-incorrect grammar, writes in Old Church Slavonic (different language), or even modern Russian or Serbian. Looks like the dataset was a bit chaotic, with religious books mixed with old manuscripts and even modern books for children. Mentioning a specific work from the desired period makes it write better, and wrangling it by specifying the rules makes it get this almost right.
Interesting. I was using ChatGPT to try to come up with a possible reconstruction of the Ketef Hinnom scrolls (I don't know Ancient Hebrew at all), with some mixed results. I had to prompt it with things like "What do you think that 'YHWH' bit could mean?", and then it sort of caught on. Maybe I'll see if Claude can do better.
Your description of Old English is a bit odd. It's certainly very different from modern English, but it's its direct ancestor and both languages are Germanic.
It is a direct ancestor, but I find that what most people picture when they hear Old English (and have no prior knowledge of it) is something closer to Middle English, which is somewhat readable by modern English speakers, rather than something like `Oft Scyld Scefing sceaþena þreatum, monegum mægþum, meodosetla ofteah, egsode eorlas.` [0]
[0]: https://www.poetryfoundation.org/poems/43521/beowulf-old-eng...
Claude is much better than ChatGPT at low-resource languages; at least it was a year ago. I haven't tested the new models from OpenAI, but I believe that Claude still has an edge.
For example, when ChatGPT was outputting nonsense in Georgian, Claude was speaking it fluently; when ChatGPT learned Georgian, Claude was able to speak Mingrelian.
If I have to do the "click on the motorcycle/traffic lights" captcha more than once I will instead click the back button.
Oh, are you getting a captcha when accessing the site this links to? If so, I didn't know this.
It usually depends on location, for example Cloudflare has a setting somewhere for "always show captchas for non-western traffic" and a lot of people set it.
Wow, I guess my hosting provider uses Cloudflare and that setting then.
> At least in most public tokenizers like o200k, addition in certain Unicode ranges commutes with addition in token space
This seems flawed. I mean, the author's statement here is literally true, but it's eliding a very important detail: LLMs do _not_ see token indexes. They have no idea what order the token embeddings are in. In fact, you can shuffle the embeddings and the LLM wouldn't care at all. And I highly suspect that if you shuffled the entire tokenizer, so that the above property no longer holds, and trained Claude from scratch on that tokenizer, it would still be able to perform this task.
> so all but one of these symbols is mapped to three tokens each, where the first two are the same and can be easily ignored by an attention head, and the third token increments exactly with the Unicode.
This is the crux, I believe.
In the general case, the common Unicode ranges (for Korean, Japanese, Chinese, etc) get tokenized just like English (for modern tokenizers at least).
It's only in the obscure unicode ranges where you hit a special case of the tokenizer. This is the "backup plan" of the tokenizer. If it encounters text that doesn't directly map to a token in its dictionary, then it falls back to encoding the text as UTF-8 bytes. Those UTF-8 bytes have a dedicated set of 256 tokens in its dictionary. So in those extreme cases, rather than getting bits of text like "Hell, o, Mr, ., B, ond" the LLM gets the raw UTF-8 bytes.
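You can see that fallback with the openly published tokenizers. A quick sketch using the tiktoken package (the exact token counts depend on the vocabulary, so treat the output as illustrative):

  import tiktoken

  enc = tiktoken.get_encoding("o200k_base")
  for s in ["Hello, Mr. Bond", "\U0001D041\U0001D042"]:  # plain text vs. Byzantine musical symbols
      toks = enc.encode(s)
      print(repr(s), len(toks), [enc.decode_single_token_bytes(t) for t in toks])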
Now, again, the LLM can't directly see those bytes, their index in the tokenizer's dictionary, their integer values, etc, etc. It only sees their embedding vectors, which are unordered. So it has no _implicit_ knowledge about those bytes being ordered. Therefore the assertion that addition commutes between Unicode and token indices is irrelevant.
My theory would be that the pretraining data contains lists of Unicode characters. Specifically, lists of unicode characters in order. Naturally, for the obscure ranges of unicode, this results in the LLM seeing counting in UTF-8 bytes. It doesn't initially know what the "value" of each byte is, but naturally it would learn that so that it can correctly predict the next byte.
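Concretely, consecutive codepoints in an obscure block really do look like counting in bytes (my own sketch):

  # Four consecutive Byzantine musical symbols: only the last byte changes.
  for cp in range(0x1D040, 0x1D044):
      print(f"U+{cp:05X} -> {chr(cp).encode('utf-8').hex(' ')}")
  # f0 9d 81 80, f0 9d 81 81, f0 9d 81 82, f0 9d 81 83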
The same occurs for English letters. It doesn't start with any knowledge about what order they are in. It only learns the ordered alphabet through seeing examples.
(The inverse applies, of course, since the output is also unordered.)
Maybe this is a nitpick? But it seems important to me, because it's the difference between a rather simple mechanism:
output[i] = input[i] + 1
and a more complex mechanism:
c = to_utf8_byte_index(input[i])
c = c + 1
output[i] = from_utf8_byte_index(c)
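To make that byte-level version concrete (my own sketch, not the comment's): bumping the final UTF-8 byte is the same operation whether the character is ASCII or from an obscure block, as long as the byte stays within its valid range.

  def shift_last_byte(ch: str, offset: int = 1) -> str:
      # Increment only the final UTF-8 byte; valid while it doesn't overflow.
      b = bytearray(ch.encode("utf-8"))
      b[-1] += offset
      return b.decode("utf-8")

  print(shift_last_byte("a"))           # 'b'
  print(shift_last_byte("\U0001D041"))  # U+1D042, the next Byzantine musical symbol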
Also it's important because I'd suspect the LLM will see a _lot_ of UTF-8 counting. There's about a million unicode "characters", the vast majority of which won't have direct token mappings. So in rough estimation for a single complete listing of Unicode, it'd see a list of purely counting in bytes that is 1 million lines long. That's 3900 complete cycles of the least significant byte. Just from one listing.
In contrast, it's not going to encounter a lot of listings of, say, the Korean unicode range in unicode order (about 11k points). Each time it does, it gets to see exactly 1 complete cycle.
So a single listing of Unicode gives it 3900 examples of how to cycle one byte VS a single listing of an "alphabet" giving it only 1 example.
You're completely right, my argument is fundamentally wrong because it relies on the commutativity, but the embedding matrix obviously does not treat some columns differently than others. Back to the drawing board I suppose. Thanks!