JSON encoding/decoding with Python

Contents

  • JSON encoding/decoding with Python
    • The libraries
    • Maturity and Operational Safety
    • The Questions
    • The Benchmark
      • Deserialization Speed
      • Serialization Speed
    • A professional workflow with JSON
    • See also

JSON is a cornerstone for the exchange of data on the Internet. REST APIs all around the world use it as their standardized message format. Being based on a subset of JavaScript syntax gave it a huge boost in adoption right from the start. The fact that its syntax is clear and easy to read also helped.

JSON has libraries in every language I know for serialization and deserialization. In Python, there are actually multiple libraries. In this article, I will compare them for you.

The libraries

CPython itself has a json module. It was originally developed by Bob Ippolito as simplejson and was merged into the standard library in Python 2.6 (source). CPython is licensed under the Python Software Foundation License.

simplejson still exists as its own library and you can install it via pip. It is a pure Python library with an optional C extension. simplejson is dual-licensed under the MIT License and the Academic Free License (AFL).

ujson is a binding to the C library Ultra JSON. Ultra JSON was developed by ESN (an Electronic Arts Inc. studio) and is licensed under the 3-clause BSD License. Ultra JSON has 3k stars on GitHub, 305 forks, and 50 contributors. The last commit is only 12 days old and the last issue was opened 5 days ago. I’ve heard that it is in “maintenance mode” (source), meaning that there is no new development.

pysimdjson is a binding to the C++ library simdjson. The simdjson project received funding from Canada. simdjson has 12.2k stars on GitHub, 611 forks, and 63 contributors. The last commit was 11 hours ago and the last issue was opened 2 hours ago.

python-rapidjson is a binding to the C++ library RapidJSON. RapidJSON was developed by Tencent. RapidJSON has 9.8k stars on GitHub, 2.7k forks, 150 contributors, the last commit was about 2 months ago and the last issue was opened 17 days ago.

orjson is a Python package that relies on Rust to do the heavy lifting.

Maturity and Operational Safety

All mentioned libraries worked for the benchmark examples without issues. Switching the JSON module is not a big deal, but I still want to know that the module is actively supported.

CPython, simplejson, ujson, and orjson consider themselves production-ready.

python-rapidjson marks itself as alpha, but one maintainer says that is a mistake and will be fixed soon (source).

|                         | CPython json                       | simplejson                        | ujson             | orjson            | pysimdjson | python-rapidjson |
|-------------------------|------------------------------------|-----------------------------------|-------------------|-------------------|------------|------------------|
| License                 | Python Software Foundation License | MIT / Academic Free License (AFL) | BSD License       | MIT / Apache      | MIT        | MIT              |
| Maturity                |                                    |                                   |                   |                   |            |                  |
| Version                 | 3.8.6                              | 3.17.2                            | 3.2.0             | 3.4.0             | 3.0.0      | 0.9.1            |
| Development Status      |                                    | Production/Stable                 | Production/Stable | Production/Stable | Alpha      | Alpha            |
| GH First release        | 1993-01-10                         | 2006-01-01                        | 2012-06-18        | 2018-11-23        | 2019-02-23 | 2017-03-02       |
| CI-Pipeline             | GH, Travis, Azure                  | GH, Travis, Appveyor              | GH, Travis        | Azure             | GH, Travis | Appveyor         |
| Operational Safety      |                                    |                                   |                   |                   |            |                  |
| GH Organization         | ✓                                  | ✓                                 | ✓                 | ✗                 | ✗          | ✓                |
| GH Contributors         | 1319                               | 30                                | 50                | 9                 | 7          | 15               |
| Last release            | 2020-09-23                         | 2020-07-16                        | 2020-09-08        | 2020-09-25        | 2020-08-21 | 2019-11-13       |
| Last Commit             | 2020-09-25                         | 2020-07-16                        | 2020-09-19        | 2020-09-25        | 2020-08-31 | 2020-05-08       |
| PyPI Maintainers        |                                    | 3                                 | 4                 | 1                 | 1          | 2                |
| Users                   |                                    |                                   |                   |                   |            |                  |
| GH Stars                | 33,700                             | 1310                              | 2966              | 1348              | 374        | 397              |
| GH Forks                | 16,200                             | 290                               | 306               | 48                | 25         | 31               |
| GH Used By              | -                                  | 47,164                            | 14,760            | 613               | 11         | 661              |
| StackOverflow Questions |                                    | 279                               | 6                 | 3                 | -          | 319              |
| Benchmarks              |                                    |                                   |                   |                   |            |                  |
| GeoJSON Read            | 48ms                               | 45ms                              | 22ms              | 19ms              | 14ms       | 83ms             |
| GeoJSON Write           | 291ms                              | 352ms                             | 34ms              | 15ms              | 289ms      | 108ms            |
| Twitter Read            | 6ms                                | 6ms                               | 6ms               | 5ms               | 6ms        | 9ms              |
| Twitter Write           | 25ms                               | 33ms                              | 5ms               | 3ms               | 24ms       | 6ms              |
| 2MB Float List Read     | 36ms                               | 37ms                              | 16ms              | 9ms               | 7ms        | 66ms             |
| 2MB Float List Write    | 161ms                              | 186ms                             | 25ms              | 12ms              | 164ms      | 104ms            |

The Questions

One indicator of how easy it might be to resolve problems is to ask the maintainers a question and see how they respond:

  • SimpleJSON: I got a response the next day. The response was clear, easy to follow, and friendly. Bob Ippolito answered me: the person who originally developed it and who is also mentioned in the Python docs for the json module!
  • uJSON: I got a clear, friendly, easy-to-follow answer within 30 minutes. @hugovank
  • ORJSON: No answer after 8 days.
  • PySIMDJSON: No answer after 8 days.
  • Python-RapidJSON: I got a clear, friendly, easy-to-follow answer within 30 minutes. A simple PR wasn’t merged after two days.

One answer I got from all of the projects is that they are essentially not in contact with each other.

The Benchmark

In order to benchmark the different libraries properly, I thought of the following scenarios:

  • APIs: Web services that exchange information. It might contain Unicode and have a nested structure. A JSON file from a Twitter API sounds good to test this.
  • API JSON Error: I was curious about how the performance would change if there was an error in the JSON API format. So I removed a brace in the middle.
  • GeoJSON: I first saw the GeoJSON format with Overpass Turbo, an OpenStreetMap exporter. You get very big JSON files that consist mostly of coordinates but are also fairly nested.
  • Machine Learning: Just a massive list of floats. Those might be weights of a neural network layer.
  • JSON Line: Structured logs are heavily used in the industry. If you analyze those logs, you might need to go through gigabytes of data. Each line is a simple dictionary with a timestamp, a message, the logger, the log status, and maybe some more fields.
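A minimal sketch of how one of these scenarios could be timed, here the list of floats, using only the standard library. This is not the benchmark code behind the numbers in this article; the helper names make_float_list and time_loads are hypothetical. To compare another library, you would pass its loads function instead of json.loads.

```python
import json
import random
import timeit
from typing import Callable


def make_float_list(n: int, seed: int = 0) -> str:
    """Build a JSON string with n pseudo-random floats (stand-in for layer weights)."""
    rng = random.Random(seed)
    return json.dumps([rng.random() for _ in range(n)])


def time_loads(loads: Callable[[str], object], payload: str, repeat: int = 5) -> float:
    """Return the best wall-clock time in seconds for a single loads(payload) call."""
    return min(timeit.repeat(lambda: loads(payload), number=1, repeat=repeat))


payload = make_float_list(10_000)
best = time_loads(json.loads, payload)
print(f"json.loads: {best * 1000:.2f} ms for {len(payload)} characters")
```

Taking the minimum over several repetitions reduces noise from other processes; it says nothing about memory usage.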

Deserialization Speed

The speed of my hard drive gives a lower boundary for the speed to read. I’ve included it as a baseline in the following 3 charts.

[Chart: Read a complex, but small JSON]
[Chart: Read a GeoJSON]
[Chart: Read a massive float array]
[Chart: Read a structured log file]
[Chart: Read a faulty twitter.json]

The conclusion from this:

  • rapidjson is slow, but for small JSON documents like twitter.json you will not notice a difference. The structured logs show the same effect.
  • simdjson, orjson, and ujson are all crazy fast.
  • Reading a JSON file that contains a structural error is equally fast for most libraries. A notable exception is rapidjson. I guess that it aborts reading the file once it finds the error.
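To illustrate the faulty-file scenario, here is a small sketch with the standard library json module, which reports the character position at which parsing failed. The sample document and the helper try_parse are made up for this example. Other libraries raise their own exception types, so check their documentation before catching errors the same way.

```python
import json

valid = '{"user": {"name": "example", "ids": [1, 2, 3]}, "text": "hello"}'
# Drop the closing brace of the nested "user" object to simulate the faulty file.
faulty = valid.replace("]}", "]", 1)


def try_parse(text: str):
    """Return (data, None) on success or (None, error position) on a parse error."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc.pos


data, error_pos = try_parse(valid)
bad_data, bad_pos = try_parse(faulty)
print(error_pos, bad_pos)
```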

Serialization Speed

In this case, I created the JSON-String beforehand and measured the time it takes to write it to disk as a baseline.

[Chart: Write a twitter.json]
[Chart: Write a GeoJSON]
[Chart: Write a massive float array]
[Chart: Write a structured log file]

What I conclude from this:

  • orjson is just insanely fast. It is super close to maxing out my hard drive. And ujson is pretty close to that.
  • rapidjson is pretty quick, but not on the same level as orjson or ujson.
  • simdjson is slow.
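The write baseline can be sketched as follows: serialize once up front, time only the disk write, and then time serialization plus the same write. This is an illustration with the standard library, not the original benchmark script; timed and write_text are hypothetical helpers, and without an explicit flush to disk the operating system may cache the write.

```python
import json
import os
import tempfile
import time

data = {"values": [i / 7 for i in range(50_000)]}


def timed(func) -> float:
    """Run func once and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def write_text(path: str, text: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.json")
    pre_serialized = json.dumps(data)
    # Baseline: only the disk write, the string already exists.
    baseline = timed(lambda: write_text(path, pre_serialized))
    # Measured case: serialization plus the same disk write.
    total = timed(lambda: write_text(path, json.dumps(data)))
    with open(path, encoding="utf-8") as f:
        written = f.read()

print(f"write only: {baseline * 1000:.2f} ms, dumps + write: {total * 1000:.2f} ms")
```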

A professional workflow with JSON

As a closing note, I want to point out some issues I see sometimes and have written myself:

  • Calling variables foo_json: JSON is a string format. If it’s not a string, it’s not JSON. If you deserialize a JSON string with bar = json.loads(foo), then bar is not JSON. You can serialize bar to a JSON string that is equivalent to foo, but bar itself is a Python object, very likely a dictionary. You can then call it foo_dict.
  • Attribute checks all over the place: If you receive a JSON, it’s super easy to convert it to a Python object (e.g. a dict) and use it. This is fine for proof-of-concept code or very small JSON strings. It will bite you in the ass if you don’t convert it to something like a dataclass.

pydantic is a super helpful validation library. You can take the JSON-string, parse it to a Python base representation with dictionaries / lists / strings / numbers / booleans with your favorite JSON library and then parse it again with Pydantic. The advantage you get from this is that you know what you’re dealing with later. No longer just Dict[str, Any] as a type annotation. No longer unhelpful editor autocompletion. No longer checking if attributes exist all over your code.
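The idea of validating once at the boundary can be sketched with a plain dataclass from the standard library. Note that this swaps in a hand-written check for what Pydantic would derive from the type annotations; LogRecord and parse_log_record are hypothetical names chosen for this example.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    timestamp: datetime
    level: str
    message: str


def parse_log_record(record_json: str) -> LogRecord:
    """Parse a JSON string and validate it once, at the boundary."""
    record_dict = json.loads(record_json)  # a Python object now, no longer JSON
    if not isinstance(record_dict, dict):
        raise ValueError("expected a JSON object")
    missing = {"timestamp", "level", "message"} - set(record_dict)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return LogRecord(
        timestamp=datetime.fromisoformat(record_dict["timestamp"]),
        level=str(record_dict["level"]),
        message=str(record_dict["message"]),
    )


record_json = '{"timestamp": "2020-10-05T12:00:00", "level": "INFO", "message": "started"}'
record = parse_log_record(record_json)
print(record.level, record.timestamp.year)
```

After this step the rest of the code works with record.level and record.timestamp instead of checking dictionary keys.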

To use a JSON package other than the default json, I recommend the pattern

import ujson as json
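A variation of this pattern, sketched here as one possible approach, falls back to the standard library when the faster package is not installed. It only works for the calls both modules share, such as plain dumps and loads. Not every library is a drop-in replacement: orjson.dumps, for example, returns bytes instead of str.

```python
try:
    import ujson as json  # optional, faster third-party package
except ImportError:
    import json  # standard library fallback

payload = json.dumps({"name": "example", "values": [1, 2, 3]})
restored = json.loads(payload)
print(restored["values"])
```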

For Flask, you can use another encoder/decoder like this:

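from flask import Flask
from simplejson import JSONEncoder, JSONDecoder

app = Flask(__name__)
# Note: these attributes were removed in Flask 2.3; newer versions configure JSON via app.json.
app.json_encoder = JSONEncoder
app.json_decoder = JSONDecoder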

See also

  • Daniel Lemire: Parsing JSON Really Quickly: Lessons Learned at InfoQ
  • Ng Wai Foong: Introduction to orjson
  • Nicolas Seriot: Parsing JSON is a Minefield

Published

Oct 5, 2020
by Martin Thoma
