Do you ¿ UTF-8? It's easier than you think
Understanding how UTF-8 works is one of those things that most programmers are a little fuzzy on. I know I often have to look up the specifics when dealing with a problem.
One of the most common UTF-8 related issues I've seen has to do with MySQL's `utf8` encoding, also known as "how do I insert emoji into MySQL?"
The TLDR answer to that question is that you have to use the `utf8mb4` (up to 4 bytes) encoding, because MySQL's `utf8` encoding won't hold an emoji: it only stores up to 3 bytes per character. But the longer answer is sort of interesting, and not as hard to understand as you might think.
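You can sanity-check that in Python (just as an illustration): the emoji really does need a 4th byte, which is the one MySQL's `utf8` has no room for.

```python
# The 😍 emoji encodes to 4 bytes in UTF-8; MySQL's utf8 columns
# only accept characters that fit in 3, so the insert fails.
>>> "😍".encode("utf-8")
b'\xf0\x9f\x98\x8d'
>>> len("😍".encode("utf-8"))
4
```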
So UTF-8 can take 3 or 4 bytes to store?
Encoding a character with UTF-8 may take 1, 2, 3, or 4 bytes (early versions of the spec went up to 6 bytes, but this was later reduced to 4).
What's cool about UTF-8 is that if you are only using basic ASCII characters (e.g., character codes 0-127), it only uses 1 byte per character. Here's a handy table that shows how many bytes it takes to encode a given character code in UTF-8:
| Character Code (decimal) | Bytes Used |
|--------------------------|------------|
| 0 - 127                  | 1          |
| 128 - 2,047              | 2          |
| 2,048 - 65,535           | 3          |
| 65,536 - 1,114,111       | 4          |
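You can verify those boundaries yourself with Python's built-in encoder (a quick illustrative check, nothing more):

```python
# Each boundary code point encodes to the byte count shown in the table.
for code_point in (127, 128, 2047, 2048, 65535, 65536, 1114111):
    encoded = chr(code_point).encode("utf-8")
    print(f"U+{code_point:06X} -> {len(encoded)} byte(s)")
```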
So hopefully that helps you 😍 UTF-8. BTW that emoji is `1F60D` in hex, or `128525` in decimal, which means it takes 4 bytes to store in UTF-8.
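Here's that arithmetic in Python, if you want to check it yourself:

```python
# Hex 1F60D and decimal 128525 are the same code point, and it
# falls in the 4-byte row of the table above (65,536 and up).
>>> hex(ord("😍"))
'0x1f60d'
>>> ord("😍")
128525
>>> chr(0x1F60D)
'😍'
```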
So why can't MySQL utf8 store an emoji?
The `utf8` encoding in MySQL can only hold UTF-8 characters that are up to 3 bytes long, while UTF-8 itself supports up to 4 bytes. I don't know why they chose to limit `utf8` to 3 bytes, but I'll speculate that they probably added support while UTF-8 was still not officially standardized, and assumed that 3 bytes would be plenty big enough.
So to get real UTF-8 in MySQL you need to use the `utf8mb4` encoding, which can store UTF-8 characters up to 4 bytes long, including emoji.
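For example, here's a sketch using the PyMySQL driver (the credentials and the `messages` table are placeholders I made up for illustration). The key detail is asking for `utf8mb4` on the connection; the table and column need to be declared `utf8mb4` as well.

```python
import pymysql

# Placeholder connection details; the important bit is charset="utf8mb4".
conn = pymysql.connect(host="localhost", user="me", password="secret",
                       database="demo", charset="utf8mb4")
with conn.cursor() as cur:
    # Assumes a hypothetical `messages` table whose `body` column is utf8mb4.
    cur.execute("INSERT INTO messages (body) VALUES (%s)", ("I 😍 UTF-8",))
conn.commit()
```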
UTF-8 != Unicode
I've probably mistakenly used the terms UTF-8 and Unicode interchangeably in the past; it's a common mistake, so let's clarify the difference.
Unicode is a character set standard: it specifies what character a given character code maps to. In Unicode parlance the character codes are called code points (this is probably just to confuse you). There are many ways to encode a Unicode string into binary, and this is where the different encodings come into play: UTF-8, UTF-16, etc.
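A quick way to see the difference: the same code points produce different bytes under different encodings (illustrative Python again):

```python
# One Unicode string, two encodings, two different byte sequences.
>>> "héllo".encode("utf-8")
b'h\xc3\xa9llo'
>>> "héllo".encode("utf-16-le")
b'h\x00\xe9\x00l\x00l\x00o\x00'
```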
UTF-8 is a way of encoding the characters into bytes. If they had decided to use 4 bytes for every character, it would have wasted a lot of space, since the most commonly used characters (at least in English) can be represented using only 1 byte. Although UTF-8 is defined by Unicode and was designed for Unicode, you could invent another character mapping standard and use UTF-8 to store it.
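In fact, since UTF-8 is really just a bit-packing scheme, you can sketch the whole byte layout in a few lines. Here's a toy Python version (it skips the surrogate and range checks a real encoder would perform):

```python
def utf8_encode(code_point: int) -> bytes:
    """Pack one code point into UTF-8 bytes. Toy sketch, no validation."""
    if code_point < 0x80:       # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# Sanity check against Python's built-in encoder.
assert utf8_encode(ord("😍")) == "😍".encode("utf-8")
```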