Data Serialization

Serialization means transforming objects you have in memory into a structure that can be stored in a file. Serializing data is important for several reasons:

Checkpoints: Especially in machine learning, preprocessing often takes a long time. The longest setup I have had so far took something like 20 hours. I don't want my computer to run while I sleep, and I want it to be usable when I sit in front of it. Hence I want to be able to abort and resume the preprocessing, which means I have to store the current state on disk.
Memory limitations: My laptop has 8 GB of RAM. For some datasets, this is not enough. Hence I want to be able to preprocess one part of my data, store the results in a file, remove them from memory and continue. This way, I can handle arbitrarily large datasets. Formats with random access are nice in such a case; otherwise you can always create multiple files.
Sharing data
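
The checkpoint idea can be sketched in a few lines of Python. This is only a minimal illustration under assumptions of mine: `preprocess` is a hypothetical stand-in for the expensive step, the state is small enough to fit into a single JSON file, and the file name `checkpoint.json` is made up.

```python
import json
import os
import tempfile


def preprocess(item):
    """Hypothetical stand-in for an expensive preprocessing step."""
    return item * item


def run_with_checkpoints(items, checkpoint_path):
    """Process items and store the state after each one, so the run can be resumed."""
    state = {"next_index": 0, "results": []}
    if os.path.exists(checkpoint_path):
        # A previous run was aborted (or finished): continue where it stopped.
        with open(checkpoint_path) as f:
            state = json.load(f)

    for i in range(state["next_index"], len(items)):
        state["results"].append(preprocess(items[i]))
        state["next_index"] = i + 1
        # Write to a temporary file first and rename it afterwards, so an
        # abort in the middle of writing does not corrupt the checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["results"]


checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
results = run_with_checkpoints([1, 2, 3, 4], checkpoint)
```

Calling `run_with_checkpoints` again with the same path skips everything that is already done. For the memory-limitation case, the same pattern works with one file per chunk instead of one growing state file.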

Interesting properties of a format are:

Library support: How easy is it to read / write data? Is it only for one programming language / environment or is the format open and widespread?
Read speed
Write speed
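
Read and write speed are easy to measure yourself. The following is a rough sketch, not a proper benchmark: the sample data is made up, everything happens in memory (no disk access), and I added `pickle` only as an example of a format that is tied to one language (Python).

```python
import json
import pickle
import time


def time_roundtrip(dumps, loads, data):
    """Return (write_seconds, read_seconds, restored) for one serialize / deserialize cycle."""
    start = time.perf_counter()
    payload = dumps(data)
    write_seconds = time.perf_counter() - start

    start = time.perf_counter()
    restored = loads(payload)
    read_seconds = time.perf_counter() - start
    return write_seconds, read_seconds, restored


# Made-up sample data: 10,000 small records.
data = [{"id": i, "value": i / 2} for i in range(10_000)]

timings = {}
for name, dumps, loads in [
    ("json", json.dumps, json.loads),
    ("pickle", pickle.dumps, pickle.loads),
]:
    write_s, read_s, restored = time_roundtrip(dumps, loads, data)
    assert restored == data  # the round trip must not lose information
    timings[name] = (write_s, read_s)
    print(f"{name}: write {write_s:.4f} s, read {read_s:.4f} s")
```

The numbers depend heavily on the machine and on the shape of the data, so it is worth running this with something that looks like your real dataset.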

Overview

Format	Pro	Con	Binary	Comment
CSV	Simple to use, not much overhead, can be imported into Excel	No data types	No
JSON	Simple to use, data types, supported by practically all programming languages	Some overhead	No	Use this if possible
YAML	Simple to use, data types, supported by Python	Some overhead, not so simple to write correctly	No	Don't use this for storing data. Nice for configuration, though.
XML	Data formats	A pain in the *** to use	No	Might be worth a shot if you're using a strictly typed programming language
MessagePack	Data types, relatively small		Yes
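
What "no data types" means for CSV, compared to JSON, can be seen with a small round trip through Python's standard library. The record is made up for illustration:

```python
import csv
import io
import json

# A made-up record with an int, a float and a bool.
record = {"id": 7, "score": 0.5, "active": True}

# CSV round trip: write one row into an in-memory buffer and read it back.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# JSON round trip.
from_json = json.loads(json.dumps(record))

print(from_csv)   # every value comes back as a string
print(from_json)  # int, float and bool are preserved
```

With CSV, you have to convert the strings back to the right types yourself after loading. JSON keeps numbers, booleans, strings, lists and dictionaries apart, which covers many cases.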
