A hacker’s perspective: Understanding Emoji, Character Encoding and why Chipotle only lets you have 3 Welsh flags. 🏴󠁧󠁢󠁷󠁬󠁳󠁿

Hello Internet! It’s me once again, your resident Script Kitty, here to bring you another post that I have had on my back burner for a while now but have chronically postponed due to my ADHD and executive functioning difficulties.

Today’s topic is something that I have been researching for a while now. As someone in the cybersecurity world, I enjoy learning how various systems and technologies work and then thinking about how different scenarios and interactions would affect or break those systems. Which brings us to our main topic today: Emoji! What are they? How do they work? And why is this one 🏴󠁧󠁢󠁷󠁬󠁳󠁿 so special?

To explain emoji we have to look way back, at how computers display text in the first place. You see, your computer is not storing the actual words or even the letters that make up the words. Instead, because all information inside a computer is ultimately composed of binary data (strings of 1s and 0s), computers use something called character encoding. You have probably even heard of one of the most impactful character encoding schemes: ASCII, the American Standard Code for Information Interchange, or as the IANA (Internet Assigned Numbers Authority) prefers it to be called, US-ASCII (Source). You have probably heard of ASCII from the term ASCII art, the practice of arranging different ASCII characters to form images. Before emoji, this was one of the few ways to convey symbolic information through text. But how do ASCII and, by extension, emoji work?

ASCII and other character encoding schemes work by translating the binary information stored in your computer into different characters. For example, the capital A is 65 and the space is 32 (and yes, even things we would not normally think of as characters, such as the space, need to be included in digital text). But the astute among you may have noticed a discrepancy: I said that characters were stored as 1s and 0s, but 65 and 32 don’t contain either. Inside the computer these numbers are stored in binary, also known as base 2. 65 is 1000001 and 32 is 0100000. ASCII only needs 7 binary digits (bits) per character, which is why these examples are 7 digits long. In memory each character is stored in a group of 8 bits called a byte, with a leading 0 (and a group of 4 bits is called a nibble!). Even digits are encoded as characters: in ASCII the character 1 is 0110001, 2 is 0110010, 3 is 0110011 and so on.

You may be thinking something like “Well that’s nice KillerKat, but how does this relate to Emoji and Chipotle?” To answer that we have to look at the limitations of ASCII. It’s all well and good if you want to say something like 1001000 1000101 1001100 1001100 1001111 (HELLO), but what if you want to say something like “¿Dónde está el gato de Internet?” or “ネット猫大好き”? In that case you would run into a problem. ASCII doesn’t support Spanish accent marks (not even ñ) or Japanese kana and kanji. However, as evidenced by the fact that you are reading this, our modern systems can.
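If you want to check those numbers yourself, here is a minimal Python sketch. The helper name `to_ascii_bits` is my own invention for illustration; it only relies on Python’s built-in `str.encode` and `format`.

```python
def to_ascii_bits(text: str) -> list[str]:
    """Return the 7-bit binary string for each ASCII character in text."""
    # encode("ascii") yields one integer (0-127) per character
    return [format(code, "07b") for code in text.encode("ascii")]


print(ord("A"), ord(" "))        # 65 32
print(to_ascii_bits("HELLO"))
# ['1001000', '1000101', '1001100', '1001100', '1001111']

# Anything outside the 128 ASCII characters cannot be encoded at all
try:
    "\u00bfD\u00f3nde est\u00e1 el gato?".encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII cannot encode:", repr(err.object[err.start]))
```

The `try` block shows the limitation in action: the very first character of the Spanish sentence (¿) has no ASCII code, so the encoder gives up.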

This is where Unicode comes in. It bridges the gap between different languages so that, in principle, any computer can display any supported language. To quote Wikipedia (yes, I know, an academic sin, but this article is an overview, not a research paper): “Unicode and its parallel standard, the ISO/IEC 10646 Universal Character Set, together constitute a modern, unified character encoding. Rather than mapping characters directly to octets (bytes), they separately define what characters are available, corresponding natural numbers (code points), how those numbers are encoded as a series of fixed-size natural numbers (code units), and finally how those units are encoded as a stream of octets. The purpose of this decomposition is to establish a universal set of characters that can be encoded in a variety of ways.” -Wikipedia. This may sound complicated at first glance, but the important part for our purposes is that a character is no longer a single byte (a set of 8 binary digits). Instead, each character is a numbered entry (a code point) in one shared set of characters, and that number can be written out as bytes in several different ways. Again, this is an oversimplification, but we just need to know 2 things: 1. Not every computer/system will support all Unicode characters. 2. A Unicode character can take up multiple bytes, and what a system counts as one “character” may not match what you see on screen.
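To make the code point / code unit / byte distinction a bit more concrete, here is a small sketch. The function name `describe` is hypothetical, and I am only looking at two common encodings, UTF-8 and UTF-16; other encodings would give different counts.

```python
def describe(ch: str) -> tuple[str, int, int]:
    """Return (code point label, UTF-8 byte count, UTF-16 code unit count)."""
    code_point = f"U+{ord(ch):04X}"
    utf8_bytes = len(ch.encode("utf-8"))
    # Each UTF-16 code unit is 2 bytes; "-le" avoids the byte order mark
    utf16_units = len(ch.encode("utf-16-le")) // 2
    return code_point, utf8_bytes, utf16_units


for ch in ["A", "ñ", "猫", "😀"]:
    print(ch, describe(ch))
```

The same single character can be 1, 2, 3 or 4 bytes in UTF-8, and an emoji like 😀 is already 2 code units in UTF-16, so “how long is this text?” depends on which unit a system counts.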

Now at this point you may have already guessed that emoji are part of Unicode. Unlike earlier emoticons found in IM applications or cellphones, which only worked on the same platform, emoji are part of the Unicode standard. That means you can send emoji to different platforms, but you may notice that many emoji look different between platforms. This is because the Unicode standard simply describes what the emoji is, and it’s up to each platform to create the emoji images themselves. This also means that not every platform supports every emoji: if you use an old Android device you may notice it doesn’t support newer emoji. This is one of the key behaviors observed in my experiment. A second thing to note is that not all Unicode characters are visible. Some, such as the space, do not show up as characters themselves but instead influence the spacing and design of other characters. However, on a platform that does not support these invisible characters they would appear just like any other unknown character, usually a ?, a box, an emoji of an alien, or something to that effect.

When Unicode added support for variations in emoji, such as different skin colors or genders, they did not create entirely new emoji. Instead they added modifier and joiner characters (some of them invisible) that follow the original emoji and specify these attributes. A platform that does not support these changes can still show you the original emoji, allowing for backwards compatibility and limited support on lightweight systems. Indeed, that may be why this emoji 🏴󠁧󠁢󠁷󠁬󠁳󠁿 appears as a black flag instead of the flag of Wales on your device. Instead of adding entirely new emoji for the flags of England, Scotland and Wales, Unicode extends the black flag with a string of invisible “tag” characters that spell out a region code (gbwls for Wales), followed by a terminator. Country flags are built in a similar spirit, from a pair of regional indicator characters, which meant that in older versions of Twitter they would count for 2 characters. However, after complaints following the introduction of the skin color emoji, Twitter changed its counting so that every emoji counts the same regardless of how many characters it is built from. Twitter was counting these emoji as multiple characters because each one is made up of the emoji in question plus several additional characters, and to a computer that looks exactly the same as a string of multiple characters.
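Here is a rough sketch of how those flag sequences are put together, so you can see the hidden characters for yourself. The helper names `subdivision_flag` and `country_flag` are made up for this post; the code points themselves (the black flag U+1F3F4, the tag characters starting at U+E0000, the cancel tag U+E007F and the regional indicators starting at U+1F1E6) come from the Unicode standard.

```python
import unicodedata


def subdivision_flag(region_code: str) -> str:
    """Build a tag-sequence flag, e.g. 'gbwls' for Wales."""
    base = "\U0001F3F4"  # WAVING BLACK FLAG
    # Each tag character is the ASCII letter's code shifted up by 0xE0000
    tags = "".join(chr(0xE0000 + ord(c)) for c in region_code.lower())
    cancel = "\U000E007F"  # CANCEL TAG marks the end of the sequence
    return base + tags + cancel


def country_flag(iso_code: str) -> str:
    """Build a country flag from two regional indicator symbols, e.g. 'GB'."""
    return "".join(chr(0x1F1E6 + ord(c) - ord("A")) for c in iso_code.upper())


wales = subdivision_flag("gbwls")
for ch in wales:
    print(f"U+{ord(ch):05X}", unicodedata.name(ch, "<unnamed>"))

print(len(wales))               # 7 code points for one visible flag
print(len(country_flag("GB")))  # 2 code points
```

One flag on screen, seven characters under the hood. Whether it actually renders as the Welsh flag depends entirely on the fonts of whatever device you run this on.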

Now if you are a hacker like myself (white hat of course), then you may have already had the same thought I did. If these emoji are made up of multiple characters of information but act like a single character, can you use them to cause buffer overflows? Yes, I can confirm that indeed you can. At one point I added an emoji to my name on a Chipotle online pickup order, and I noticed that it printed out 2 ?s on the label. This made me ask 2 questions: firstly, can I cause a buffer overflow, and secondly, which emoji is made up of the most characters? Well, it turns out that the answers are yes, and our friend the Welsh flag emoji!

Putting these two pieces of information together, I created a new online order and found that any more than 3 Welsh flags will overflow and return an error code. The limit seems to be around 39 or so “characters”. Below you can see an example of what prints out if you put 3 Welsh flag emoji into the order name field; the label maker seems to run out of space before it prints all of the characters. This presents the opportunity for a future test where I enter a string of valid characters and see if it gets cut off as well.
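To get a feel for how quickly a handful of flags eats through a length limit, here is a minimal sketch that measures the same text in three different units. I do not know which unit (if any of these) the ordering system or the label printer actually counts, so treat the numbers as illustration only; the `lengths` helper is hypothetical.

```python
# Welsh flag: black flag + tag characters spelling "gbwls" + cancel tag
WELSH_FLAG = (
    "\U0001F3F4"
    + "".join(chr(0xE0000 + ord(c)) for c in "gbwls")
    + "\U000E007F"
)


def lengths(text: str) -> dict[str, int]:
    """Measure text in code points, UTF-16 code units and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


for count in range(1, 5):
    print(count, lengths(WELSH_FLAG * count))
# 1 {'code_points': 7, 'utf16_units': 14, 'utf8_bytes': 28}
# 2 {'code_points': 14, 'utf16_units': 28, 'utf8_bytes': 56}
# 3 {'code_points': 21, 'utf16_units': 42, 'utf8_bytes': 84}
# 4 {'code_points': 28, 'utf16_units': 56, 'utf8_bytes': 112}
```

Three flags look like 3 characters to a human but are anywhere from 21 to 84 units to a machine, depending on how it counts.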

A Chipotle order label showing many ? characters because 3 Welsh flags were entered into the order name field. Copyright KillerKat 2022

The next obvious step was to research whether someone else had done any similar attacks, and a quick Google search reveals that yes, similar emoji buffer overflows have been performed. As with most of my good ideas, great minds think alike, and there is quite a staggering number of minds out there ready to have the same ideas as you. Since the concept has been proven, I plan to test a few different fields in various places (all above board of course).

I hope to be posting more here soon; I’ve been doing a lot of exciting things lately. The next project I hope to cover is a soldering kit for a Bluetooth-speaker / radio combo. And if you have any stories related to buffer overflows or Chipotle, please leave them down in the comments below.

With that, this is KillerKat once again signing off. Stay safe out there and remember to always check your input fields!
