Transforming objects you have in memory into a structure which can be stored in a file is called serialization. Serializing data is important for several reasons:
- Checkpoints: Especially in machine learning, preprocessing often takes a long time. The longest setup I have had so far took about 20 hours. I don't want my computer to run while I sleep, and I want it to be usable when I sit in front of it. Hence I want to be able to abort and resume the preprocessing, which means I have to store the current state on disk.
- Memory limitations: My laptop has 8 GB of RAM. For some datasets, this is not enough. Hence I want to be able to preprocess some part of my data, store the results to a file, remove them from memory and continue. This way, I can handle arbitrarily large datasets. Formats with random access are nice in such a case; otherwise you can always create multiple files.
- Sharing data
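
The checkpoint and memory points above can be combined into one pattern: process the data in chunks, write each chunk to its own file, and skip chunks that already exist on disk. The following is only a minimal sketch of that idea. `preprocess`, `process_in_chunks`, the chunk size and the file naming scheme are hypothetical names chosen for illustration, and JSON is used only because it ships with Python.

```python
import json
from pathlib import Path


def preprocess(record):
    # Hypothetical stand-in for an expensive preprocessing step.
    return {"id": record, "square": record * record}


def process_in_chunks(records, out_dir, chunk_size=3):
    """Preprocess records chunk by chunk, writing one JSON file per chunk.

    Chunks whose output file already exists are skipped, so an aborted
    run can be resumed by simply calling the function again.
    Returns the names of the files written during this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for start in range(0, len(records), chunk_size):
        target = out_dir / f"chunk_{start:06d}.json"
        if target.exists():
            continue  # checkpoint found: this chunk was finished earlier
        chunk = [preprocess(r) for r in records[start:start + chunk_size]]
        # Write to a temporary name first and rename afterwards, so that
        # an interrupted write never looks like a finished chunk.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(chunk))
        tmp.replace(target)
        written.append(target.name)
    return written
```

Only one chunk is held in memory at a time, and calling `process_in_chunks` a second time with the same arguments does no work, because every chunk file is already there.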
Interesting properties of a serialization format are:
- Library support: How easy is it to read / write data? Is it only for one programming language / environment or is the format open and widespread?
- Read speed
- Write speed
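
Read and write speed are easy to measure for your own data. The sketch below times a single JSON round trip through a temporary file. `time_json_roundtrip` is a hypothetical helper name, and a single run like this is noisy (disk caching in particular flatters the read time), so treat the numbers as a rough indication rather than a benchmark.

```python
import json
import os
import tempfile
import time


def time_json_roundtrip(data):
    """Return (write_seconds, read_seconds, loaded_data) for one JSON round trip."""
    fd, path = tempfile.mkstemp(suffix=".json")
    os.close(fd)  # we reopen the file by name below
    try:
        start = time.perf_counter()
        with open(path, "w", encoding="utf-8") as f:
            json.dump(data, f)
        write_seconds = time.perf_counter() - start

        start = time.perf_counter()
        with open(path, encoding="utf-8") as f:
            loaded = json.load(f)
        read_seconds = time.perf_counter() - start
    finally:
        os.remove(path)  # clean up the temporary file
    return write_seconds, read_seconds, loaded
```

To compare formats, the same structure can be repeated with another library's dump and load calls in place of `json.dump` and `json.load`, ideally averaged over several runs.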
Overview
Format | Pro | Con | Binary | Comment |
---|---|---|---|---|
CSV | Simple to use, not much overhead, can be imported into Excel | No data types | No | |
JSON | Simple to use, data types, supported by virtually all programming languages | Some overhead | No | Use this if possible |
YAML | Simple to use, data types, supported by Python | Some overhead, not so simple to write correctly | No | Don't use this for storing data. Nice for configuration, though. |
XML | Data formats | A pain in the *** to use | No | Might be worth a shot if you're using a strictly typed programming language |
MessagePack | Data types, relatively small | | Yes | |