Do you ❤ UTF-8? It's easier than you think

March 04, 2020
misc

Understanding how UTF-8 works is one of those things that most programmers are a little fuzzy on. I know I often have to look up the specifics when dealing with a problem.

One of the most common UTF-8 related issues I've seen has to do with MySQL's utf8 encoding. Also known as: how do I insert emoji into MySQL?

The TLDR answer to that question is that you have to use the utf8mb4 (up to 4 bytes) encoding, because MySQL's utf8 encoding can't hold an emoji; it only stores up to 3 bytes per character. But the longer answer is sort of interesting, and not as hard to understand as you might think.

So UTF-8 can take 3 or 4 bytes to store?

Encoding a character with UTF-8 may take 1, 2, 3, or 4 bytes (early versions of the spec went up to 6 bytes, but this was later changed to 4).

What’s cool about UTF-8 is that if you are only using basic ASCII characters (eg, character codes 0-127), then it only uses 1 byte. Here’s a handy table that shows how many bytes it takes to encode a given character code in UTF-8:

Character Code (decimal)    Bytes Used
0-127                       1 byte
128-2047                    2 bytes
2048-65535                  3 bytes
65536-1114111               4 bytes
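You can check the table for yourself in Python by encoding one character from each code point range and counting the resulting bytes (the sample characters here are just illustrative picks from each range):

```python
# One sample character from each code point range in the table,
# mapped to the number of UTF-8 bytes we expect it to take.
samples = {
    "A": 1,   # U+0041, in 0-127
    "é": 2,   # U+00E9, in 128-2047
    "€": 3,   # U+20AC, in 2048-65535
    "😍": 4,  # U+1F60D, in 65536-1114111
}

for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
    assert len(encoded) == expected
```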

So hopefully that helps you 😍 UTF-8. BTW that emoji is 1F60D in hex, or 128525 in decimal, which means it takes 4 bytes to store in UTF-8.
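That claim about the 😍 emoji is easy to verify in Python, which accepts the code point in hex and lets you inspect the encoded bytes directly:

```python
# U+1F60D is 128525 in decimal, which lands in the 4-byte
# range of the table (65536-1114111).
emoji = "\U0001F60D"
assert ord(emoji) == 0x1F60D == 128525

encoded = emoji.encode("utf-8")
print(len(encoded), encoded.hex())  # 4 f09f988d
```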

So why can't MySQL utf8 store an emoji?

The utf8 encoding in MySQL can only hold UTF-8 characters up to 3 bytes long, while UTF-8 itself allows up to 4. I don’t know why they chose to limit utf8 to 3 bytes, but I will speculate that they probably added support while UTF-8 was not yet officially standardized, and assumed that 3 bytes would be plenty big enough.

So to get real UTF-8 in MySQL, you need to use the utf8mb4 encoding, which can store all 4 bytes of a UTF-8 encoded character, including emoji.
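As a quick sanity check before inserting text into a column that might still be using the 3-byte utf8 charset, you can test whether any character would need 4 bytes. This helper is just a sketch (the function name is my own, not a MySQL API):

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text takes more than 3 bytes
    in UTF-8, i.e. would not fit in MySQL's 3-byte utf8 charset."""
    return any(len(ch.encode("utf-8")) > 3 for ch in text)

print(needs_utf8mb4("plain ascii"))  # False
print(needs_utf8mb4("emoji 😍"))     # True
```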

UTF-8 != Unicode

I've probably mistakenly used the terms UTF-8 and Unicode interchangeably in the past. It's a common mistake, so let's clarify the difference.

Unicode is a character set standard: it specifies what character a given character code maps to. In Unicode parlance the character codes are called code points (this is probably just to confuse you). There are many ways to encode a Unicode string into binary, and this is where the different encodings come into play: UTF-8, UTF-16, etc.

UTF-8 is a way of encoding those characters into bytes. If they had decided to use 4 bytes for every character, it would have wasted a lot of space, since the most commonly used characters (at least in English) can be represented using only 1 byte. Although UTF-8 is defined by Unicode and was designed for Unicode, you could invent another character mapping standard and use UTF-8 to store it.
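The code point vs. encoding distinction is easy to see in Python: the code point of a character is fixed by Unicode, but the bytes it becomes depend entirely on which encoding you pick:

```python
# Unicode assigns the code point; the encoding decides the bytes.
s = "é"
print(ord(s))                 # 233 -- the Unicode code point, U+00E9
print(s.encode("utf-8"))      # b'\xc3\xa9' -- 2 bytes in UTF-8
print(s.encode("utf-16-be"))  # b'\x00\xe9' -- 2 bytes in UTF-16
print(s.encode("latin-1"))    # b'\xe9'     -- 1 byte in Latin-1
```

Same character, same code point, three different byte sequences, which is exactly why "Unicode" and "UTF-8" are not interchangeable terms.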

