JSON is a cornerstone of data exchange on the Internet. REST APIs all around the world use it as their standardized message format. Being a subset of JavaScript gave it a huge boost in adoption right from the start. The fact that its syntax is clear and easy to read also helped.
JSON has libraries in every language I know for serialization and deserialization. In Python, there are actually multiple libraries. In this article, I will compare them for you.
The libraries
CPython itself has a json module. It was originally developed by Bob Ippolito as simplejson and was merged into Python 2.6 (source). CPython is licensed under the Python Software Foundation License.
simplejson still exists as its own library and you can install it via pip. It is a pure Python library with an optional C extension. simplejson is dual-licensed under the MIT License and the Academic Free License (AFL).
ujson is a binding to the C library Ultra JSON. Ultra JSON was developed by ESN (an Electronic Arts Inc. studio) and is licensed under the 3-clause BSD License. Ultra JSON has 3k stars on GitHub, 305 forks, and 50 contributors; the last commit is only 12 days old and the last issue was opened 5 days ago. I’ve heard that it is in “maintenance mode” (source), meaning that no new features are being developed.
pysimdjson is a binding to the C++ library simdjson. simdjson received funding from Canada. It has 12.2k stars on GitHub, 611 forks, and 63 contributors; the last commit was 11 hours ago, and the last issue was opened 2 hours ago.
python-rapidjson is a binding to the C++ library RapidJSON. RapidJSON was developed by Tencent. It has 9.8k stars on GitHub, 2.7k forks, and 150 contributors; the last commit was about 2 months ago and the last issue was opened 17 days ago.
orjson is a Python package that relies on Rust to do the heavy lifting.
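To check which of these libraries are installed and whether they agree on a simple document, a small sketch like the following can help. It assumes the usual import names (pysimdjson is imported as simdjson, python-rapidjson as rapidjson); the `load_available` helper and the sample document are made up for illustration. Only `loads` is compared here, because the return types of `dumps` are not uniform (orjson, for example, returns bytes).

```python
import importlib

# Import names; two of them differ from the PyPI package names.
CANDIDATES = ["json", "simplejson", "ujson", "orjson", "simdjson", "rapidjson"]


def load_available(names=CANDIDATES):
    """Import every candidate module that is installed and skip the rest."""
    modules = {}
    for name in names:
        try:
            modules[name] = importlib.import_module(name)
        except ImportError:
            pass  # library not installed in this environment
    return modules


document = '{"name": "Ada", "tags": ["a", "b"], "score": 1.5}'
parsers = load_available()
results = {name: module.loads(document) for name, module in parsers.items()}

for name, parsed in results.items():
    print(f"{name}: {parsed}")
```

The standard library's json is always present, so the sketch runs even if none of the third-party packages are installed.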
Maturity and Operational Safety
All mentioned libraries worked for the benchmark examples without issues. Switching the JSON module is not a super big deal, but still, I want to know that the module is supported.
CPython, simplejson, ujson, and orjson consider themselves production-ready.
python-rapidjson marks itself as alpha, but one maintainer says that is a mistake and will be fixed soon (source).
Criterion | CPython json | simplejson | ujson | orjson | pysimdjson | python-rapidjson |
---|---|---|---|---|---|---|
License | Python Software Foundation License | MIT / Academic Free License (AFL) | BSD License | MIT / Apache | MIT | MIT |
Maturity | ||||||
Version | 3.8.6 | 3.17.2 | 3.2.0 | 3.4.0 | 3.0.0 | 0.9.1 |
Development Status | | Production/Stable | Production/Stable | Production/Stable | Alpha | Alpha |
GH First release | 1993-01-10 | 2006-01-01 | 2012-06-18 | 2018-11-23 | 2019-02-23 | 2017-03-02 |
CI-Pipeline | GH, Travis, Azure | GH, Travis, Appveyor | GH, Travis | Azure | GH, Travis | Appveyor |
Operational Safety | ||||||
GH Organization | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ |
GH Contributors | 1319 | 30 | 50 | 9 | 7 | 15 |
Last release | 2020-09-23 | 2020-07-16 | 2020-09-08 | 2020-09-25 | 2020-08-21 | 2019-11-13 |
Last Commit | 2020-09-25 | 2020-07-16 | 2020-09-19 | 2020-09-25 | 2020-08-31 | 2020-05-08 |
PyPI Maintainers | 3 | 4 | 1 | 1 | 2 | |
Users | ||||||
GH Stars | 33,700 | 1310 | 2966 | 1348 | 374 | 397 |
GH Forks | 16,200 | 290 | 306 | 48 | 25 | 31 |
GH Used By | - | 47,164 | 14,760 | 613 | 11 | 661 |
StackOverflow Questions | 279 | 6 | 3 | - | 319 | |
Benchmarks | ||||||
GeoJSON Read | 48ms | 45ms | 22ms | 19ms | 14ms | 83ms |
GeoJSON Write | 291ms | 352ms | 34ms | 15ms | 289ms | 108ms |
Twitter Read | 6ms | 6ms | 6ms | 5ms | 6ms | 9ms |
Twitter Write | 25ms | 33ms | 5ms | 3ms | 24ms | 6ms |
2MB Float List Read | 36ms | 37ms | 16ms | 9ms | 7ms | 66ms |
2MB Float List Write | 161ms | 186ms | 25ms | 12ms | 164ms | 104ms |
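To make the benchmark rows easier to compare, the timings can be turned into speedup factors relative to the CPython json module. The numbers below are copied from the table; the `speedups` helper is just an illustrative name.

```python
# Timings in milliseconds, copied from the benchmark rows of the table.
LIBRARIES = ["CPython json", "simplejson", "ujson", "orjson",
             "pysimdjson", "python-rapidjson"]
TIMINGS_MS = {
    "GeoJSON Read": [48, 45, 22, 19, 14, 83],
    "GeoJSON Write": [291, 352, 34, 15, 289, 108],
    "Twitter Read": [6, 6, 6, 5, 6, 9],
    "Twitter Write": [25, 33, 5, 3, 24, 6],
    "2MB Float List Read": [36, 37, 16, 9, 7, 66],
    "2MB Float List Write": [161, 186, 25, 12, 164, 104],
}


def speedups(row):
    """Return each library's speedup factor relative to the first column."""
    baseline = row[0]
    return [round(baseline / value, 1) for value in row]


for benchmark, row in TIMINGS_MS.items():
    fastest = LIBRARIES[row.index(min(row))]
    factors = dict(zip(LIBRARIES, speedups(row)))
    print(f"{benchmark}: fastest is {fastest}, factors vs. CPython: {factors}")
```

For GeoJSON, this puts orjson at roughly 19 times the write speed of the standard library (291 ms vs. 15 ms) and pysimdjson at roughly 3.4 times its read speed (48 ms vs. 14 ms).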
The Questions
One indicator of how easy it might be to resolve problems is to ask the maintainers a question and see how they respond:
- simplejson: I got a response the next day. It was clear, easy to follow, and friendly. The answer came from Bob Ippolito himself, the guy who originally developed it and who is also mentioned in the Python docs for the json module!
- ujson: I got a clear, friendly, easy-to-follow answer within 30 minutes (@hugovank).
- orjson: No answer after 8 days.
- pysimdjson: No answer after 8 days.
- python-rapidjson: I got a clear, friendly, easy-to-follow answer within 30 minutes. A simple PR wasn’t merged after two days, though.
One answer I got from all of the projects is that they are essentially not in contact with each other.
The Benchmark
In order to benchmark the different libraries properly, I thought of the following scenarios:
- APIs: Web services that exchange information. It might contain Unicode and have a nested structure. A JSON file from a Twitter API sounds good to test this.
- API JSON Error: I was curious about how the performance would change if there was an error in the JSON API format. So I removed a brace in the middle.
- GeoJSON: I first saw the GeoJSON format with Overpass Turbo, an OpenStreetMap exporter. You get crazy big JSON files that consist mostly of coordinates but are also pretty nested.
- Machine Learning: Just a massive list of floats. Those might be weights of a neural network layer.
- JSON Lines: Structured logs are heavily used in the industry. If you analyze those logs, you might need to go through gigabytes of data. Each line is a simple dictionary with a datetime, a message, the logger, the log status, and maybe some more fields.
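The exact harness is not shown here, so the following is only a rough sketch of how two of these scenarios could be timed with the standard library. The payload sizes, field names, and the `best_of` helper are assumptions for illustration; to compare another library, swap it in for `json`.

```python
import json
import random
import timeit
from datetime import datetime, timedelta, timezone


def best_of(func, repeat=5):
    """Return the fastest of `repeat` single runs, in seconds."""
    return min(timeit.repeat(func, number=1, repeat=repeat))


# Machine learning scenario: one big list of floats (size is an assumption).
random.seed(0)
weights = [random.random() for _ in range(100_000)]
float_payload = json.dumps(weights)

# Structured-log scenario: one small JSON object per line.
start = datetime(2020, 9, 25, tzinfo=timezone.utc)
records = [
    {
        "datetime": (start + timedelta(seconds=i)).isoformat(),
        "message": f"request {i} handled",
        "logger": "app.api",
        "level": "INFO",
    }
    for i in range(10_000)
]
log_payload = "\n".join(json.dumps(record) for record in records)


def parse_log_lines():
    """Parse the log payload line by line, as a log analyzer would."""
    return [json.loads(line) for line in log_payload.splitlines()]


timings = {
    "float list read": best_of(lambda: json.loads(float_payload)),
    "float list write": best_of(lambda: json.dumps(weights)),
    "log lines read": best_of(parse_log_lines),
}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Taking the minimum over several runs reduces noise from other processes; absolute numbers will differ from machine to machine.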
Deserialization Speed
The speed of my hard drive sets a lower bound on the time it takes to read a file. I’ve included it as a baseline in the following 3 charts.
The conclusion from this:
- rapidjson is slow, but for small JSON documents like twitter.json, you will not notice a difference. One can see this with the structured logs.
- simdjson, orjson, and ujson are all crazy fast.
- Reading a JSON file that contains a structural error is equally fast for most libraries. A notable exception is rapidjson. I guess that it aborts reading the file once it finds the error.
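The structural-error scenario is easy to reproduce on a small scale. This sketch uses a made-up nested document in place of the Twitter file, removes one closing brace near the middle, and shows how the standard library reports the failure. Other libraries raise their own exception types, so the `except` clause would need adjusting for them.

```python
import json

# Hypothetical nested document standing in for the Twitter API file.
valid = json.dumps(
    {"statuses": [{"id": i, "user": {"name": f"user{i}"}} for i in range(1000)]}
)

# Remove the first closing brace found after the midpoint of the string.
cut = valid.index("}", len(valid) // 2)
broken = valid[:cut] + valid[cut + 1:]

try:
    json.loads(broken)
    error = None
except json.JSONDecodeError as exc:
    error = exc
    print(f"parse failed at character {exc.pos}: {exc.msg}")
```

The `pos` attribute shows how far the parser got before giving up, which is what determines how much work an aborted parse has already done.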
Serialization Speed
In this case, I created the JSON string beforehand and measured the time it takes to write it to disk as a baseline.
What I conclude from this:
- orjson is just insanely fast. It is super close to maxing out my hard drive. And ujson is pretty close to that.
- rapidjson is pretty quick, but not on the same level as orjson or ujson.
- simdjson is slow.
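The idea behind the disk baseline can be sketched as follows: time writing a pre-built JSON string, then time serializing plus writing. The data, the `timed` and `write_text` helpers, and the file name are assumptions for illustration, and operating system caching means this only approximates real disk speed.

```python
import json
import os
import tempfile
import time

# Hypothetical coordinate-heavy data, loosely modeled on the GeoJSON case.
data = {"coordinates": [[i * 0.001, i * 0.002] for i in range(50_000)]}
prebuilt = json.dumps(data)


def timed(func):
    """Run func once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    func()
    return time.perf_counter() - start


def write_text(path, text):
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "out.json")
    # Baseline: only the write, with the string created beforehand.
    baseline_s = timed(lambda: write_text(path, prebuilt))
    # Full cost: serialize, then write.
    total_s = timed(lambda: write_text(path, json.dumps(data)))
    with open(path, encoding="utf-8") as handle:
        round_trip = json.load(handle)

print(f"write only:        {baseline_s * 1000:.1f} ms")
print(f"serialize + write: {total_s * 1000:.1f} ms")
```

The gap between the two numbers is the share of time spent in the serializer itself, which is the part a faster library can shrink.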
A professional workflow with JSON
As a closing note, I want to point out some issues I see sometimes and have written myself:
- Calling variables foo_json: JSON is a string format. If it’s not a string, it’s not JSON. If you deserialize a JSON string with bar = json.loads(foo), then bar is not JSON. You can serialize bar to a JSON string that is equivalent to the JSON string foo, but bar itself is a Python object, very likely a dictionary. You can then call it foo_dict.
- Attribute checks all over the place: If you receive a JSON, it’s super easy to convert it to a Python object (e.g. a dict) and use it. This is fine for proof-of-concept code or very small JSON strings. It will bite you in the ass if you don’t convert it to something like a dataclass.
pydantic is a super helpful validation library. You can take the JSON-string, parse it to a Python base representation with dictionaries / lists / strings / numbers / booleans with your favorite JSON library and then parse it again with Pydantic. The advantage you get from this is that you know what you’re dealing with later. No longer just Dict[str, Any] as a type annotation. No longer unhelpful editor autocompletion. No longer checking if attributes exist all over your code.
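Both points can be illustrated with a minimal sketch. It uses a standard-library dataclass as a stand-in for pydantic, so the validation is written by hand here, whereas pydantic derives it from the type annotations. The `User` type, its fields, and `parse_user` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    id: int
    name: str


def parse_user(user_json: str) -> User:
    """Turn a JSON string into a typed object, failing early on bad input."""
    user_dict = json.loads(user_json)  # a Python dict now, no longer JSON
    if not isinstance(user_dict, dict):
        raise ValueError("expected a JSON object")
    user_id, name = user_dict.get("id"), user_dict.get("name")
    # bool is a subclass of int, so it has to be excluded explicitly.
    if not isinstance(user_id, int) or isinstance(user_id, bool):
        raise ValueError("'id' must be an integer")
    if not isinstance(name, str):
        raise ValueError("'name' must be a string")
    return User(id=user_id, name=name)


user = parse_user('{"id": 7, "name": "Ada"}')
print(user.name)
```

After this single validation step at the boundary, the rest of the code can rely on `user.id` and `user.name` existing and having the expected types.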
To use a JSON package other than the default json, I recommend this import pattern:

```python
import ujson as json
```
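If the faster library is only an optional dependency, the same pattern can fall back to the standard library. This is a sketch that assumes the code only uses plain `loads` and `dumps` calls; keyword arguments beyond those may differ between libraries.

```python
try:
    import ujson as json  # faster third-party library, if installed
except ImportError:
    import json  # standard library fallback

data = json.loads('{"ok": true, "values": [1, 2, 3]}')
text = json.dumps(data)
print(text)
```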
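For Flask versions before 2.3, you can use another encoder/decoder like this (Flask 2.3 removed these two attributes in favor of the app.json provider):

```python
from flask import Flask
from simplejson import JSONDecoder, JSONEncoder

app = Flask(__name__)
app.json_encoder = JSONEncoder
app.json_decoder = JSONDecoder
```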