First, let's be clear as mud: there are apostrophes, and apostrophes. In reality the characters we often refer to as apostrophes could be any of several different characters – a plain apostrophe, a backquote, or a typographic "right single quote". (Those might look similar, different, or not appear at all depending on the fonts and character sets available on your computer.)

The fundamental concept is that all characters are actually stored as numbers. The "ASCII" character set or encoding uses a single byte – values from 0 to 255 – to represent up to 256 different characters. (Technically, ASCII only uses 7 bits of that byte, or values from 0-127. The most common true 8-bit encoding used on the internet today is "ISO-8859-1".)

The problem, of course, is that there are way more than 256 possible characters. While we might spend most of our time with common characters like A-Z, a-z, 0-9 and a handful of punctuation, in reality there are thousands of other possible characters – particularly if you think globally.

At the other end of the spectrum is the "Unicode" encoding, which uses two (or more) bytes, giving many more possible different characters. "A" is still 65, but if we look at it in hexadecimal, the single-byte ASCII "A" is 41, while the two-byte Unicode "A" is 0041.

At this point, it should be clear that switching from ASCII to Unicode would immediately double the size of every email, every document, and everything else that stores text. Possible, and in some cases even the right solution, but when you consider that the majority of communications, particularly in the Western world, focus on the basic Roman alphabet and a few numbers and punctuation marks, it starts to seem wasteful.

Enter "UTF-8", short for "8-bit Unicode Transformation Format". In UTF-8 the entire Unicode character set is broken down by an algorithm into byte sequences that are either 1, 2, 3 or 4 bytes long. The reason is simple: the vast majority of characters in common usage in Western languages fall into the 1-byte range. Messages remain small, but should one of those "other" characters be needed, it can be incorporated by using its longer representation.

All that is a lot of back story to the problem: when you see funny characters, it's likely because data encoded using UTF-8 is being interpreted as ISO-8859-1.
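The "characters are stored as numbers" idea is easy to see for yourself. This is just an illustrative sketch in Python (my choice of tool, not anything the article prescribes), using the built-in ord and chr:

```python
# Every character is stored as a number (its "code point").
# ASCII "A" is 65, which is 41 in hexadecimal; Unicode keeps
# the very same value, conventionally written U+0041.
print(ord("A"))       # 65
print(hex(ord("A")))  # 0x41
print("\u0041")       # prints: A  (same character, written in Unicode notation)
print(chr(65))        # prints: A  (number back to character)
```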
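The 1-, 2-, 3- or 4-byte behavior of UTF-8 described above can be observed directly by encoding a few sample characters (the particular characters here are my own illustration, not from the article):

```python
# UTF-8 maps each code point to a sequence of 1 to 4 bytes.
# Plain ASCII stays a single byte; rarer characters get longer forms.
for ch in ["A", "é", "€", "🙂"]:
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())
# A   1 byte  : 41
# é   2 bytes : c3a9
# €   3 bytes : e282ac
# 🙂  4 bytes : f09f9982
```

Note that "A" encodes to the single byte 41 – exactly its ASCII value – which is why UTF-8 keeps ordinary Western text small.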
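The failure described above – UTF-8 bytes being interpreted as a single-byte encoding – can be reproduced deliberately. One caveat on this sketch: software that claims ISO-8859-1 very often actually uses Windows-1252 (cp1252), its near-superset, and that is what produces the classic three-character garbage; a strictly conforming ISO-8859-1 decoder would map two of the three bytes to invisible control characters instead.

```python
# The typographic apostrophe "’" (U+2019) is three bytes in UTF-8.
text = "it’s"
utf8_bytes = text.encode("utf-8")
print(utf8_bytes.hex())  # 6974e2809973  ("’" is the e2 80 99 in the middle)

# Reading those UTF-8 bytes as Windows-1252 (what much software
# really means by "ISO-8859-1") turns one apostrophe into three
# funny characters:
print(utf8_bytes.decode("cp1252"))  # itâ€™s
```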