Strings

Strings are essential for developing applications. Yet they are not covered (enough) if you study computer science. In this article, I will explain what a string is, what Unicode, UTF-8, ASCII and encodings are and what their relationship is.

Just when I was about to finish this, I found Joel Spolsky's article. It is a better read, but a lot longer than what I wrote here.

Datatype basics ¶

Datatypes are not a hardware feature. The CPU knows a couple (well, a lot) of different instructions. Together, those are called the instruction set of the CPU.

One of the best-known ones is the x86 instruction set. If you search for "multiply" on this page, you get 50 results: MULPD and MULSD for the multiplication of doubles, IMUL for signed integer multiplication, ...

Those instructions work on registers. Registers are small storage slots inside the CPU; the general-purpose ones hold 32 bit or 64 bit, depending on which architecture your CPU uses. Each instruction interprets the values of the registers in its own way, but the values themselves don't have types.

In compiled languages like C and C++, the compiler takes care of the types and creates a sequence of instructions which interpret the register contents as desired.

In Python, the interpreter takes care of it.
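To make the "values have no types" point concrete, here is a minimal sketch using Python's standard `struct` module. It reinterprets the same four bytes once as a 32-bit float and once as an unsigned 32-bit integer. The variable names are only illustrative.

```python
import struct

# Four bytes holding the IEEE 754 single-precision value 1.0 (little-endian).
raw = struct.pack("<f", 1.0)

# The bytes stay the same; only the interpretation changes.
as_float = struct.unpack("<f", raw)[0]
as_uint = struct.unpack("<I", raw)[0]

print(raw.hex())   # 0000803f
print(as_float)    # 1.0
print(as_uint)     # 1065353216
```

The bit pattern itself never changes. The format string (`"<f"` or `"<I"`) plays the role that the choice of instruction plays on the CPU.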

Simple Strings: Character Sequences ¶

In the simplest case, one could store a string as an array of characters. One then has a contiguous part of the memory, where each byte is interpreted as a character.

As we know, the memory doesn't have data types. Hence each byte just contains a natural number in $[0, 2^8 - 1]$. It is only how we interpret those numbers that makes it a character.
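A small Python 3 sketch of that idea: the same object is first shown as plain numbers and then interpreted as characters. The byte values are chosen for illustration.

```python
# Five bytes, each just a number in [0, 255].
data = bytes([72, 101, 108, 108, 111])

numbers = list(data)          # the raw view: plain integers
text = data.decode("ascii")   # the character view: one byte = one character

print(numbers)  # [72, 101, 108, 108, 111]
print(text)     # Hello
```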

ASCII ¶

ASCII is a standard. It maps 7 bit numbers (yes, 7, not 8 - ASCII is old) to characters. Hence it defines how all of those 128 possible numbers should be interpreted:

Number	Character	Description	Number	Character	Description
0		NULL (`\0`)	1		START OF HEADING
2		START OF TEXT	3		END OF TEXT
4		END OF TRANSMISSION	5		ENQUIRY
6		ACKNOWLEDGE	7		BELL
8		BACKSPACE	9		CHARACTER TABULATION (`\t`)
10		LINE FEED (LF) (`\n`)	11		LINE TABULATION
12		FORM FEED (FF)	13		CARRIAGE RETURN (CR) (`\r`)
14		SHIFT OUT	15		SHIFT IN
16		DATA LINK ESCAPE	17		DEVICE CONTROL ONE
18		DEVICE CONTROL TWO	19		DEVICE CONTROL THREE
20		DEVICE CONTROL FOUR	21		NEGATIVE ACKNOWLEDGE
22		SYNCHRONOUS IDLE	23		END OF TRANSMISSION BLOCK
24		CANCEL	25		END OF MEDIUM
26		SUBSTITUTE	27		ESCAPE
28		INFORMATION SEPARATOR FOUR	29		INFORMATION SEPARATOR THREE
30		INFORMATION SEPARATOR TWO	31		INFORMATION SEPARATOR ONE
32		SPACE	33	!	EXCLAMATION MARK
34	"	QUOTATION MARK	35	#	NUMBER SIGN
36	$	DOLLAR SIGN	37	%	PERCENT SIGN
38	&	AMPERSAND	39	'	APOSTROPHE
40	(	LEFT PARENTHESIS	41	)	RIGHT PARENTHESIS
42	*	ASTERISK	43	+	PLUS SIGN
44	,	COMMA	45	-	HYPHEN-MINUS
46	.	FULL STOP	47	/	SOLIDUS
48	0		49	1
50	2		51	3
52	4		53	5
54	6		55	7
56	8		57	9
58	:		59	;
60	<		61	=
62	>		63	?
64	@		65	A
66	B		67	C
68	D		69	E
70	F		71	G
72	H		73	I
74	J		75	K
76	L		77	M
78	N		79	O
80	P		81	Q
82	R		83	S
84	T		85	U
86	V		87	W
88	X		89	Y
90	Z		91	[
92	\		93	]
94	^		95	_
96	`		97	a
98	b		99	c
100	d		101	e
102	f		103	g
104	h		105	i
106	j		107	k
108	l		109	m
110	n		111	o
112	p		113	q
114	r		115	s
116	t		117	u
118	v		119	w
120	x		121	y
122	z		123	{
124	|		125	}
126	~		127		DELETE
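The table can be queried from Python 3 with the built-ins `ord` and `chr`. The sketch below also shows what happens when a character falls outside the 128 ASCII values. The helper `is_ascii` is a hypothetical name introduced only for this example.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character fits into the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)

print(ord("A"), chr(97))      # 65 a
print(is_ascii("Andreas"))    # True
print(is_ascii("Andr\u00e9")) # False, because of the e with acute accent

# Encoding a non-ASCII character as ASCII fails loudly.
try:
    "Andr\u00e9".encode("ascii")
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
print(encode_failed)          # True
```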

Latin-1 ¶

Latin-1 (aka ISO 8859-1) is an extension of ASCII. It makes use of the eighth bit and defines the numbers 160 to 255. Yes, that leaves an undefined gap from 128 to 159. This extension helped a lot of languages (e.g. French and German), but it was not enough, and pretty soon it was a mess: other character encodings like ISO 8859-2, ISO 8859-3, ..., ISO 8859-16 were created, and exchanging documents between those encodings became error-prone. You've probably experienced it when suddenly the replacement character � appears in an application. Or something like ï¿½.
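The mess is easy to reproduce. The following Python 3 sketch decodes one and the same byte with different codecs. The codec names (`latin-1`, `iso8859_2`, `utf-8`) are the ones shipped with Python's standard library; the variable names are illustrative.

```python
# One byte, written by a program that assumed Latin-1.
raw = "\u00f5".encode("latin-1")           # o with tilde -> byte 0xF5

as_latin1 = raw.decode("latin-1")          # read back with the right codec
as_latin2 = raw.decode("iso8859_2")        # read with ISO 8859-2 instead
as_utf8 = raw.decode("utf-8", errors="replace")  # 0xF5 is not valid UTF-8

print(as_latin1)  # õ
print(as_latin2)  # ő
print(as_utf8)    # � (U+FFFD REPLACEMENT CHARACTER)

# The replacement character, stored as UTF-8 and misread as Latin-1 again:
double_garbled = "\ufffd".encode("utf-8").decode("latin-1")
print(double_garbled)  # ï¿½
```

The bytes never change in this example; only the assumption about the encoding does.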

Unicode ¶

Unicode is a standard (link). For each character it defines a number - the Unicode code point -, a name (a short textual description) and an example of what the glyph could look like.

Unicode code points are just identifiers. In contrast, in ASCII the number is both the identifier and the way the character is represented in memory.

For example, the code point U+0041 has the description LATIN CAPITAL LETTER A as shown nicely on fileformat.info. This notation uses a hexadecimal base. Hence the number is $(41)_{16} = 16 \cdot 4 + 1 = 65$ - just like ASCII! In fact, Unicode is a superset of ASCII. And of Latin-1.
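The same lookup can be done in Python 3 with `ord` and the standard `unicodedata` module. This is only a sketch; the check at the end relies on Python's `latin-1` codec mapping every byte to the code point with the same number.

```python
import unicodedata

code_point = ord("A")
print(f"U+{code_point:04X}", code_point)  # U+0041 65
print(unicodedata.name("A"))              # LATIN CAPITAL LETTER A

# The first 256 code points carry the same numbers as the Latin-1 bytes.
same_as_latin1 = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))
print(same_as_latin1)                     # True
```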

Unicode 10 contains 136 690 characters and 139 writing systems. It has Emoji, mathematical symbols and musical symbols.

You can search for mathematical symbols with write-math.com and for emoji with unicode.party.

An interesting concept in Unicode is that several code points can combine into one visible symbol. For example, a female firefighter is represented by the woman code point (U+1F469) and the fire engine code point (U+1F692), glued together by a zero width joiner (U+200D).
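A short Python 3 sketch makes the sequence visible. It spells out the firefighter with escape sequences, so it does not depend on the editor or font, and uses the standard `unicodedata` module to name each part.

```python
import unicodedata

# WOMAN + ZERO WIDTH JOINER + FIRE ENGINE
firefighter = "\U0001F469\u200D\U0001F692"

parts = [unicodedata.name(ch) for ch in firefighter]

print(len(firefighter))  # 3 code points, although one symbol may be displayed
print(parts)             # ['WOMAN', 'ZERO WIDTH JOINER', 'FIRE ENGINE']
```

Whether the three code points are drawn as a single symbol depends on the font and the rendering software, not on the string itself.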

UTF-8 ¶

UTF-8 is a character encoding capable of encoding all possible Unicode code points. The name is short for Unicode Transformation Format - 8 bit. It uses between 1 and 4 bytes to represent a Unicode code point.

So while Unicode defines an identifier for the concept of each character, UTF-8 defines how those identifiers are stored in memory. Encodings that use multi-byte units have issues with byte order (big-endian, little-endian; see Byte Order Mark (BOM)); UTF-8 avoids them because it is defined as a sequence of single bytes. The original design allowed up to 6 bytes per code point, but since RFC 3629 UTF-8 is limited to 4 bytes.
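How the 1 to 4 bytes come about can be sketched with a hand-written encoder. The function `utf8_encode` below is a hypothetical helper written for this article, not a library API; in real code you would simply call `str.encode("utf-8")`, which the loop uses as a cross-check.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point < 0x80:       # up to 7 bits  -> 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:      # up to 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:    # up to 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # up to 21 bits -> 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# A, e with acute, euro sign, grinning face: 1, 2, 3 and 4 bytes.
for char in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    encoded = utf8_encode(ord(char))
    assert encoded == char.encode("utf-8")  # agrees with the built-in codec
    print(f"U+{ord(char):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

Note that every ASCII character keeps its single ASCII byte, which is why plain English text is identical in ASCII and UTF-8.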

Alternatives:

UCS-2: Use two bytes for a Unicode character. Always. It can therefore only represent code points up to U+FFFF.
- big-endian UCS-2
- little-endian UCS-2
UCS-4: Store each code point in 4 bytes - hence it blows up English text to 4x the size.
UTF-7
UTF-16: Use one 16-bit unit for code points up to U+FFFF and two units (a surrogate pair) for everything above. Used by .NET and Java. Is more space-efficient than UTF-8 when using a lot of Chinese (source)
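The size trade-offs can be measured directly in Python 3. The sketch below uses the standard codecs `utf-8`, `utf-16-le`/`utf-16-be` and `utf-32-le`; Python ships no separate UCS-2 or UCS-4 codec, so UTF-32 stands in for UCS-4 here, which is an assumption of this example. The helper name `sizes` is illustrative.

```python
def sizes(text: str) -> dict:
    """Number of bytes the text needs in each encoding."""
    return {name: len(text.encode(name))
            for name in ("utf-8", "utf-16-le", "utf-32-le")}

english = "hello"
chinese = "\u4e2d\u6587"    # two Chinese characters
emoji = "\U0001F600"        # one code point above U+FFFF

print(sizes(english))  # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(sizes(chinese))  # {'utf-8': 6, 'utf-16-le': 4, 'utf-32-le': 8}
print(sizes(emoji))    # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}

# Byte order matters as soon as a unit is wider than one byte.
print("A".encode("utf-16-be"))  # b'\x00A'
print("A".encode("utf-16-le"))  # b'A\x00'
```

The emoji needs 4 bytes in UTF-16 because it is stored as a surrogate pair, i.e. two 16-bit units.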

Fonts ¶

Fonts are a completely different story. A character can be rendered in different fonts, but it stays the same character. A font just deals with the appearance, whereas the Unicode code point defines what it is: the concept. The encoding defines how it is stored in memory.

Interestingly, there can't be a single font which covers all Unicode symbols, because OpenType is limited to 65536 glyphs.

The GNU Unicode Font covers over 34,000 characters (source)

Font family ¶

Usually, you only want to define a font family. If you make the text bold, the font changes but the font family is still the same. By using font families it seems to be possible to cover more than 65536 symbols. For example, Noto seems to have a lot.

How to use it ¶

C ¶

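L"Wide string literal"

The L prefix creates a wide string literal made of wchar_t elements. The width and encoding of wchar_t are implementation-defined (commonly 16 bit on Windows and 32 bit on Linux), so such a literal is not necessarily UCS-2.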

C++ ¶

wchar_t ("wide char")

Python ¶

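CPython 2 (and CPython 3 before version 3.3) stores Unicode characters internally as either UCS-2 or UCS-4. Which one is used is a compile-time option. Since Python 3.3 the interpreter picks a compact representation per string.

In Python 2, the unicode(your_string) function creates a unicode object from the given encoded string.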

Don't forget to put

# -*- coding: utf-8 -*-

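at the beginning of all of your Python 2 scripts (it has to be on the first or second line). Python 3 assumes UTF-8 source files by default.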

Python 2	Python 3	What it does
unicode Object	str Object	handles text. Can be encoded (utf-8, latin-1)
str Object	bytes	Plain sequence of bytes. Similar to strings in C.

By using from __future__ import unicode_literals you get the default behaviour of Python 3 within Python 2: All "string literals" are unicode strings. Otherwise, they are byte strings.

A couple of really helpful examples from Bakuriu (Python 2, in a terminal that uses UTF-8):

>>> len(u'à')  # a single code point
1
>>> len('à')   # by default utf-8 -> takes two bytes
2
>>> len(u'à'.encode('utf-8'))
2
>>> len(u'à'.encode('latin1'))  # in latin1 it takes one byte
1
>>> print u'à'.encode('utf-8')  # terminal encoding is utf-8
à
>>> print u'à'.encode('latin1') # it cannot understand the latin1 byte
�
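For comparison, here is a rough Python 3 counterpart of the same experiment. In Python 3 every string literal is already text, so the `u` prefix is not needed and the byte view only appears after an explicit `encode`. The variable names are illustrative.

```python
text = "\u00e0"                     # a with grave accent, one code point

utf8 = text.encode("utf-8")         # b'\xc3\xa0'
latin1 = text.encode("latin-1")     # b'\xe0'

print(len(text))    # 1
print(len(utf8))    # 2
print(len(latin1))  # 1

# Reading the Latin-1 byte as UTF-8 fails instead of printing garbage.
try:
    latin1.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```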

Pitfalls ¶

Some pitfalls are listed in this SO answer:

Counting: Combining code points can cause confusion between what you expect and what you get, because the number of code points is not the number of visible characters.
Similar look, but different: U+006E (n) followed by U+0303 (a combining tilde) forms ñ, but the single code point U+00F1 also forms ñ.
Equality, the second: There is LATIN CAPITAL LETTER A, CYRILLIC CAPITAL LETTER A and GREEK CAPITAL LETTER ALPHA. The same look, but not the same code point.
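These pitfalls can be reproduced in a few lines of Python 3 with the standard `unicodedata` module. The sketch shows that normalization (here NFC and NFD) makes the two spellings of ñ comparable, while the look-alike letters stay different code points.

```python
import unicodedata

decomposed = "n\u0303"    # n + COMBINING TILDE, two code points
precomposed = "\u00f1"    # LATIN SMALL LETTER N WITH TILDE, one code point

print(len(decomposed), len(precomposed))  # 2 1
print(decomposed == precomposed)          # False

# Normalizing both sides to the same form makes them equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)               # True True

# Latin, Cyrillic and Greek capital letters that look alike.
lookalikes = ["A", "\u0410", "\u0391"]
lookalike_names = [unicodedata.name(ch) for ch in lookalikes]
print(lookalike_names)
print(len(set(lookalikes)))               # 3 distinct code points
```

Normalization does not merge the look-alikes: they are different characters that merely share a glyph shape in many fonts.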

What about Collation? ¶

In databases, you have to say which collation you want to use. It is about sorting. For English it is simple enough to sort by ASCII code, but how do you sort

André <- one char
André <- two chars
Andrế
Andrę́
...
Andreas
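To see why plain code point order is not a usable collation, here is a Python 3 sketch. The name list is made up for illustration (it adds "Andrzej" to names like the ones above), and the example only shows the problem: language-aware sorting needs real collation rules, for instance via `locale.strxfrm`, whose result depends on the locales installed on the system.

```python
import unicodedata

names = [
    "Andrzej",
    "Andr\u00e9",    # e with acute as one code point
    "Andre\u0301",   # e followed by a combining acute accent
    "Andreas",
]

# Python compares strings code point by code point.
by_code_point = sorted(names)
print(by_code_point)
# The two spellings of the same name end up separated by "Andrzej".

# Normalizing first at least puts identical names next to each other ...
normalized = sorted(unicodedata.normalize("NFC", name) for name in names)
print(normalized)
# ... but both still sort after "Andrzej", since U+00E9 is larger than "z".
```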