Python and Encodings

Working with encodings different from ASCII or UTF-8 has always been work which I don't like. It doesn't feel very constructive to just make Python read a file / print some output.

In the following, I will describe some strategies which might help you.

Test script

Copy the following text to a text file test.txt:

Die süße, kleine, lärmende Überfliegerin lebt in der Haute-Côte-Nord.
Dort hat es momemtan 32°C.

On Debian based systems you will get the information which type of encoding it has like this:

$ file test.txt
test.txt: UTF-8 Unicode text

This is probably the best result. But to make sure that we know how to deal with other encodings, you can change the encoding like this:

$ iconv -f UTF-8 -t ISO-8859-1 test.txt > test-iso-8859-1.txt
$ file test-iso-8859-1.txt
test-iso-8859-1.txt: ISO-8859 text

Source code encoding

A first important step is to define the source code encoding. This is done with a comment. The first lines of Python code should probably always look like this:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

PEP-0263 explains it.

Printing encoding problems

The following error occurs when you try to print non-UTF-8 stuff with Python via Sublime Text:

[Decode error - output not utf-8]

The same code, executed via ZSH, gives:

Die s��e, kleine, l�rmende �berfliegerin lebt in der Haute-C�te-Nord.
Dort hat es momemtan 32�C.

You can fix that by adjusting the code the following way:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

# Make it work with Python 2 and Python 3
import sys

PY3 = sys.version > "3"

if not PY3:
    from future.builtins import open

# Specify the encoding while opening it
with open("test-iso-8859-1.txt", encoding="ISO-8859-1") as f:
    content = f.read()
content = content.encode("UTF-8", "replace")
print(content)

The two important points are

Specifying the encoding while opening the file
Encode the content with UTF-8

These three little steps helped me to deal with non-UTF-8 encodings.

Python and Encodings

Test script

Source code encoding

Printing encoding problems

See also

Published

Category

Tags

Contact