Strings are essential for developing applications. Yet they are not covered (enough) if you study computer science. In this article, I will explain what a string is, what Unicode, UTF-8, ASCII and encodings are and what their relationship is.
Just when I was about to finish this, I found Joel Spolsky's article. It is a better read, but a lot longer than what I wrote here.
Datatype basics
Datatypes are not a hardware feature. The CPU knows a couple (well, a lot) of different commands. Together they are called the instruction set of the CPU. One of the best-known ones is the x86 instruction set. If you search for "multiply" on this page, you get 50 results: MULPD and MULSD for the multiplication of doubles, IMUL for signed integer multiplication, FIMUL for multiplying an x87 floating-point register by an integer operand, ...
Those commands work on registers. Registers are small storage slots inside the CPU. The general-purpose ones hold 32 or 64 bits, depending on which architecture your CPU uses. Each instruction interprets the values of its registers in its own way, but the values themselves don't have types.
In compiled languages like C and C++, the compiler takes care of the types and creates a sequence of instructions which interpret the registers' content as desired.
In Python, the interpreter takes care of it.
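The idea that a value has no type of its own can be sketched in Python with the standard struct module. This is only an illustration (Python does not expose registers, and the variable names are made up): the same four bytes are read once as an integer and once as a floating point number.

```python
import struct

# The same four bytes, with no type attached to them.
raw = bytes([0x00, 0x00, 0x80, 0x3F])

# Interpret them as a little-endian unsigned 32-bit integer ...
as_int = struct.unpack("<I", raw)[0]
# ... and as a little-endian IEEE 754 single-precision float.
as_float = struct.unpack("<f", raw)[0]

print(as_int)    # 1065353216
print(as_float)  # 1.0
```

Which of the two results is "right" depends only on the instruction (here: the format string) used to read the bytes.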
Simple Strings: Character Sequences
In the simplest case, one could store a string as an array of characters. Hence one has a contiguous block of memory in which each byte is interpreted as a character.
As we know, the memory doesn't have data types. Hence each byte just contains a natural number in \([0, 2^8 - 1]\). It is only how we interpret those numbers that makes it a character.
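A minimal sketch of this in Python, assuming the ASCII interpretation that is introduced in the next section: the same five bytes can be read as plain numbers or as characters.

```python
# Five bytes in memory; each one is just a number between 0 and 255.
data = bytes([72, 101, 108, 108, 111])

numbers = list(data)          # read them as plain numbers
text = data.decode("ascii")   # read the same bytes as ASCII characters

print(numbers)  # [72, 101, 108, 108, 111]
print(text)     # Hello
```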
ASCII
ASCII is a standard. It maps 7 bit numbers (yes, 7, not 8 - ASCII is old) to characters. Hence it defines how all of those 128 possible numbers should be interpreted:
Number | Character | Description | Number | Character | Description |
---|---|---|---|---|---|
0 | | NULL (\0) | 1 | | START OF HEADING |
2 | | START OF TEXT | 3 | | END OF TEXT |
4 | | END OF TRANSMISSION | 5 | | ENQUIRY |
6 | | ACKNOWLEDGE | 7 | | BELL |
8 | | BACKSPACE | 9 | | CHARACTER TABULATION (\t) |
10 | | LINE FEED (LF) (\n) | 11 | | LINE TABULATION |
12 | | FORM FEED (FF) | 13 | | CARRIAGE RETURN (CR) (\r) |
14 | | SHIFT OUT | 15 | | SHIFT IN |
16 | | DATA LINK ESCAPE | 17 | | DEVICE CONTROL ONE |
18 | | DEVICE CONTROL TWO | 19 | | DEVICE CONTROL THREE |
20 | | DEVICE CONTROL FOUR | 21 | | NEGATIVE ACKNOWLEDGE |
22 | | SYNCHRONOUS IDLE | 23 | | END OF TRANSMISSION BLOCK |
24 | | CANCEL | 25 | | END OF MEDIUM |
26 | | SUBSTITUTE | 27 | | ESCAPE |
28 | | INFORMATION SEPARATOR FOUR | 29 | | INFORMATION SEPARATOR THREE |
30 | | INFORMATION SEPARATOR TWO | 31 | | INFORMATION SEPARATOR ONE |
32 | | SPACE | 33 | ! | EXCLAMATION MARK |
34 | " | QUOTATION MARK | 35 | # | NUMBER SIGN |
36 | $ | DOLLAR SIGN | 37 | % | PERCENT SIGN |
38 | & | AMPERSAND | 39 | ' | APOSTROPHE |
40 | ( | LEFT PARENTHESIS | 41 | ) | RIGHT PARENTHESIS |
42 | * | ASTERISK | 43 | + | PLUS SIGN |
44 | , | COMMA | 45 | - | HYPHEN-MINUS |
46 | . | FULL STOP | 47 | / | SOLIDUS |
48 | 0 | | 49 | 1 | |
50 | 2 | | 51 | 3 | |
52 | 4 | | 53 | 5 | |
54 | 6 | | 55 | 7 | |
56 | 8 | | 57 | 9 | |
58 | : | | 59 | ; | |
60 | < | | 61 | = | |
62 | > | | 63 | ? | |
64 | @ | | 65 | A | |
66 | B | | 67 | C | |
68 | D | | 69 | E | |
70 | F | | 71 | G | |
72 | H | | 73 | I | |
74 | J | | 75 | K | |
76 | L | | 77 | M | |
78 | N | | 79 | O | |
80 | P | | 81 | Q | |
82 | R | | 83 | S | |
84 | T | | 85 | U | |
86 | V | | 87 | W | |
88 | X | | 89 | Y | |
90 | Z | | 91 | [ | |
92 | \ | | 93 | ] | |
94 | ^ | | 95 | _ | |
96 | ` | | 97 | a | |
98 | b | | 99 | c | |
100 | d | | 101 | e | |
102 | f | | 103 | g | |
104 | h | | 105 | i | |
106 | j | | 107 | k | |
108 | l | | 109 | m | |
110 | n | | 111 | o | |
112 | p | | 113 | q | |
114 | r | | 115 | s | |
116 | t | | 117 | u | |
118 | v | | 119 | w | |
120 | x | | 121 | y | |
122 | z | | 123 | { | |
124 | \| | | 125 | } | |
126 | ~ | | 127 | | DELETE |
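The table can be explored from Python with the built-in functions ord() and chr(). The following is just a small sketch; the variable names are chosen for illustration.

```python
# ord() gives the number of a character, chr() the character of a number.
print(ord("A"))   # 65
print(chr(97))    # a
print(ord("\t"))  # 9

# Upper and lower case letters are exactly 32 apart in the table.
offset = ord("a") - ord("A")
print(offset)     # 32

# Characters outside the 128 ASCII numbers cannot be encoded as ASCII.
try:
    "é".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)   # False
```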
Latin-1
Latin-1 (aka ISO 8859-1) is an extension of ASCII. It makes use of the eighth bit and defines the numbers 160 to 255. Yes, that leaves an undefined gap from 128 to 159. This extension helped a lot of languages (e.g. French and German), but it was not enough, and pretty soon it was a mess: other character encodings like ISO 8859-2, ISO 8859-3, ..., ISO 8859-16 were created, each assigning different characters to the same numbers. Exchanging documents between those encodings is error-prone. You've probably experienced it when the replacement character � suddenly appears in an application. Or something like ï¿½.
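How such garbage comes about can be reproduced in a few lines of Python. This is a sketch with arbitrarily chosen example bytes: the same byte is decoded with two ISO 8859 parts, and UTF-8 text is deliberately read with the wrong encoding.

```python
# One and the same byte means different characters in different ISO 8859 parts.
byte = b"\xa3"
in_latin1 = byte.decode("iso-8859-1")   # pound sign
in_latin2 = byte.decode("iso-8859-2")   # capital L with stroke
print(in_latin1, in_latin2)  # £ Ł

# Text written as UTF-8 but read as Latin-1 turns into the familiar garbage.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)  # Ã©

# The replacement character itself, written as UTF-8 and misread as Latin-1.
garbled_replacement = "\ufffd".encode("utf-8").decode("latin-1")
print(garbled_replacement)  # ï¿½
```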
Unicode
Unicode is a standard (link). For each character it defines a number (the Unicode code point), a short textual description of the character and a representative glyph as an example of what it could look like.
For example, the code point U+0041 has the description LATIN CAPITAL LETTER A, as shown nicely on fileformat.info. This notation uses a hexadecimal base. Hence the number is \((41)_{16} = 16 \cdot 4 + 1 = 65\) - just like ASCII! In fact, Unicode is a superset of ASCII. And of Latin-1.
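The same facts can be checked with Python's standard unicodedata module. A small sketch (the two all(...) checks are my own way of expressing the superset claim):

```python
import unicodedata

code_point = ord("A")
print(code_point)             # 65
print(f"U+{code_point:04X}")  # U+0041
print(unicodedata.name("A"))  # LATIN CAPITAL LETTER A
print(chr(0x41))              # A

# The first 256 code points coincide with Latin-1 ...
same_as_latin1 = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))
# ... and the first 128 with ASCII.
same_as_ascii = all(chr(i).encode("ascii") == bytes([i]) for i in range(128))
print(same_as_latin1, same_as_ascii)  # True True
```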
Unicode 10 contains 136 690 characters from 139 writing systems. It has Emoji, mathematical symbols and musical symbols.
You can search mathematical symbols with write-math.com and Emoticons with unicode.party.
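An interesting concept in Unicode is that several code points can combine into one visible symbol (combining characters and joiner sequences). For example, a female firefighter is represented by the woman code point (U+1F469), a zero width joiner (U+200D) and the fire engine code point (U+1F692).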
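The firefighter example can be inspected in Python. Note that in practice the two code points are glued together with a ZERO WIDTH JOINER (U+200D); the sketch below assumes that sequence and a Python version whose Unicode database knows these code points.

```python
import unicodedata

# Woman + zero width joiner + fire engine.
firefighter = "\U0001F469\u200D\U0001F692"

print(len(firefighter))  # 3 code points, although one symbol is displayed
names = [unicodedata.name(ch) for ch in firefighter]
print(names)  # ['WOMAN', 'ZERO WIDTH JOINER', 'FIRE ENGINE']
print(len(firefighter.encode("utf-8")))  # 11 bytes in UTF-8
```

Whether this is actually drawn as a single firefighter depends on the font and the application that displays it.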
UTF-8
UTF-8 is a character encoding capable of encoding all possible Unicode code points. The name is short for Unicode Transformation Format - 8 bit. It uses between 1 and 4 bytes to represent a Unicode code point.
So while Unicode defines an identifier for the concept of a character, UTF-8 defines how those identifiers are stored in memory. Encodings that use units of more than one byte have issues with byte order (big-endian vs. little-endian; see Byte Order Mark (BOM)). UTF-8 avoids them because it is defined as a sequence of single bytes. The original UTF-8 design allowed up to 6 bytes per code point; since RFC 3629 it is restricted to 4 bytes, which is enough for every Unicode code point.
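A short Python sketch makes both points visible: the variable length of UTF-8 and the byte order problem of a two-byte encoding. The sample characters are arbitrary picks from different code point ranges.

```python
# Number of bytes UTF-8 needs for code points from different ranges.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, an emoji
utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)  # [1, 2, 3, 4]

# UTF-16 depends on byte order, so the same character has two byte sequences.
big_endian = "A".encode("utf-16-be")
little_endian = "A".encode("utf-16-le")
print(big_endian, little_endian)  # b'\x00A' b'A\x00'

# UTF-8 has only one byte sequence per character, so no byte order is needed.
print("A".encode("utf-8"))  # b'A'
```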
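Alternatives:
- UCS-2: Use two bytes for a Unicode character. Always. Hence it can only represent code points up to U+FFFF. It comes in two byte orders: big-endian UCS-2 and little-endian UCS-2.
- UCS-4: Store each code point in 4 bytes - hence it blows up English text to have 4x the size.
- UTF-7
- UTF-16: Code points up to U+FFFF are stored as themselves in two bytes; higher code points take a pair of two-byte units (surrogates). Used by .NET and Java. It is more space-efficient than UTF-8 when using a lot of Chinese (source).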
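The space trade-offs can be measured in Python. This is a rough sketch: Python's "utf-32-le" codec stands in for UCS-4, the helper name sizes is made up, and the sample strings are arbitrary.

```python
english = "Hello"
chinese = "\u4e2d\u6587"   # two Chinese characters
emoji = "\U0001F600"       # a code point above U+FFFF

def sizes(text):
    """Return the number of bytes in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return tuple(len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

print(sizes(english))  # (5, 10, 20)
print(sizes(chinese))  # (6, 4, 8)
print(sizes(emoji))    # (4, 4, 4)

# In UTF-16 the emoji is stored as a pair of two-byte units.
print(emoji.encode("utf-16-be").hex())  # d83dde00
```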
Fonts
Fonts are a completely different story. A character can be shown in many different fonts, but it stays the same character. A font just deals with the appearance, whereas the Unicode code point defines what it is: the concept. The encoding defines how it is stored in memory.
Interestingly, there can't be a single font which covers all Unicode symbols, because OpenType is limited to 65,535 glyphs per font.
The GNU Unicode Font covers over 34,000 characters (source)
Font family
Usually, you only want to define a font family. If you make the text bold, the font changes but the font family is still the same. By spreading the glyphs over several fonts of one family, it seems to be possible to cover more symbols than a single font can hold. For example, Noto seems to cover a lot.
How to use it
C
L"Literal UCS-2 string"
C++
wchar_t ("wide char")
Python
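Python 2 stores unicode characters internally as either UCS-2 or UCS-4. Which one is used is a compile-time option. (Since Python 3.3, the interpreter chooses the internal representation per string.)
In Python 2, the unicode(your_string) function creates a unicode object from the given encoded string. It assumes ASCII unless you pass the encoding as a second argument.
In Python 2, don't forget to put
# -*- coding: utf-8 -*-
at the beginning of all of your scripts. Python 3 assumes UTF-8 source files by default.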
Python 2 | Python 3 | What it does |
---|---|---|
unicode Object | str Object | handles text. Can be encoded (utf-8, latin-1) |
str Object | bytes | Plain sequence of bytes. Similar to strings in C. |
By using from __future__ import unicode_literals you get the default behaviour of Python 3 within Python 2: All "string literals" are unicode strings. Otherwise, they are byte strings.
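The distinction between the two types can be sketched in Python 3 as follows (the variable names are only for illustration):

```python
text = "\u00e0"               # str: a sequence of code points (the letter à)
data = text.encode("utf-8")   # bytes: a plain sequence of bytes

print(type(text).__name__, type(data).__name__)  # str bytes
print(data)                                      # b'\xc3\xa0'
print(data.decode("utf-8") == text)              # True

# Python 3 refuses to mix the two types silently.
try:
    text + data
    mixed = True
except TypeError:
    mixed = False
print(mixed)  # False
```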
A couple of really helpful examples from Bakuriu (Python 2, in a terminal which uses UTF-8):
>>> len(u'à') # a single code point
1
>>> len('à') # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1')) # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8') # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
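For comparison, here is my attempt at a Python 3 version of these examples. Since Python 3 separates str and bytes, the last step decodes the Latin-1 byte as UTF-8 explicitly instead of relying on the terminal.

```python
char = "\u00e0"  # the letter à as a single code point

print(len(char))                    # 1
print(len(char.encode("utf-8")))    # 2
print(len(char.encode("latin-1")))  # 1

# Reading the Latin-1 byte as UTF-8 fails; with errors="replace"
# the decoder inserts the replacement character U+FFFD instead.
latin1_bytes = char.encode("latin-1")
decoded = latin1_bytes.decode("utf-8", errors="replace")
print(decoded == "\ufffd")  # True
```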
Pitfalls
Some pitfalls are listed in this SO answer:
- Counting: Combining code points can cause confusion, because the number of code points in a string can differ from the number of characters you see.
- Similar look, but different: U+006E (n) followed by U+0303 (a combining tilde) forms ñ, but the single code point U+00F1 also forms ñ.
- Equality, the second: There are LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA. They look the same, but they are not the same code point.
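These pitfalls can be reproduced with Python's unicodedata module. The sketch below uses normalization (NFC, NFKC) as one possible way to deal with the second pitfall; it does not help with the third.

```python
import unicodedata

decomposed = "n\u0303"   # n + combining tilde
precomposed = "\u00f1"   # single code point

print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)          # False

# Normalising both to the same form (here NFC) makes them comparable.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
print(nfc_equal)                          # True

# Look-alikes stay different code points, even after normalisation.
lookalikes = ["A", "\u0410", "\u0391"]
print([unicodedata.name(ch) for ch in lookalikes])
distinct = len({unicodedata.normalize("NFKC", ch) for ch in lookalikes})
print(distinct)                           # 3
```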
What about Collation?
In databases, you have to say which collation you want to use. A collation defines how strings are compared and sorted. For English it is simple enough to sort by ASCII code, but how do you sort
André <- one char
André <- two chars
Andrế
Andrę́
...
Andreas
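To get a feeling for the question, the following Python sketch compares sorting by code point with a naive, hypothetical sort key (naive_key is my own helper, not what any database does). Real collations follow language-specific rules that are far more involved.

```python
import unicodedata

def naive_key(name):
    """Hypothetical sort key: compare base letters first, accents second."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, unicodedata.normalize("NFC", name))

# "André" once with a precomposed é and once as e + combining acute accent.
names = ["Andreas", "Andr\u00e9", "Andre\u0301"]

by_code_point = sorted(names)
by_naive_key = sorted(names, key=naive_key)

# Plain code point order puts "Andreas" first and separates the two "André".
print(by_code_point == ["Andreas", "Andre\u0301", "Andr\u00e9"])  # True
# The naive key treats both spellings of "André" alike and sorts them first.
print(by_naive_key == ["Andr\u00e9", "Andre\u0301", "Andreas"])   # True
```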