Transforming objects you have in memory into a structure which can be stored in a file is called serialization. Serializing data is important for several reasons:
- Checkpoints: Especially in machine learning, preprocessing often takes a long time. The longest setup I had so far took something like 20 hours. I don't want my computer to run while I sleep, and I want it to be usable when I sit in front of it. Hence I want to be able to abort and resume the preprocessing, which means I have to store the current state on disk.
- Memory limitations: My laptop has 8 GB of RAM. For some datasets, this is not enough. Hence I want to be able to preprocess part of my data, store the results in a file, remove them from memory, and continue. This way, I can handle arbitrarily large datasets. Formats with random access are nice in such a case; otherwise you can always create multiple files.
- Sharing data
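
To make the checkpoint and memory points concrete, here is a minimal sketch of resumable, chunked preprocessing. It assumes the results are JSON-serializable; `preprocess`, `run`, `load_all`, the chunk size, and the file naming scheme are all made up for illustration. Each chunk is written to its own file, and chunks that already exist on disk are skipped, so an aborted run can simply be started again.

```python
import json
import tempfile
from pathlib import Path


def preprocess(chunk):
    # Hypothetical stand-in for the expensive step: square every number.
    return [x * x for x in chunk]


def run(data, out_dir, chunk_size=3):
    """Process data chunk by chunk, writing one JSON file per chunk.

    Chunks whose file already exists are skipped, so an aborted run can be
    resumed. Returns the number of chunks computed in this call.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    computed = 0
    for start in range(0, len(data), chunk_size):
        target = out_dir / f"chunk_{start:06d}.json"
        if target.exists():
            continue  # already finished in an earlier run
        result = preprocess(data[start:start + chunk_size])
        # Write under a temporary name first, then rename, so an abort
        # does not leave a half-written checkpoint behind.
        tmp = target.with_suffix(".tmp")
        tmp.write_text(json.dumps(result))
        tmp.replace(target)
        computed += 1
    return computed


def load_all(out_dir):
    """Read all chunk files back in order and concatenate the results."""
    results = []
    for path in sorted(Path(out_dir).glob("chunk_*.json")):
        results.extend(json.loads(path.read_text()))
    return results


checkpoint_dir = tempfile.mkdtemp()
first_run = run(list(range(10)), checkpoint_dir)   # computes every chunk
second_run = run(list(range(10)), checkpoint_dir)  # finds all chunks on disk
squares = load_all(checkpoint_dir)
print(first_run, second_run, squares)
```

Only one chunk is held in memory at a time, which is the "create multiple files" fallback mentioned above for formats without random access.
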
Interesting properties of a format are:
- Library support: How easy is it to read / write data? Is it only for one programming language / environment or is the format open and widespread?
- Read speed
- Write speed
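
Read and write speed depend heavily on the data and the machine, so it is worth measuring them on something close to your own data. The following is a rough sketch using only the standard library; the sample rows and the helper names (`timed`, `write_json`, and so on) are assumptions for illustration, and a single timing run like this is only a first impression, not a proper benchmark.

```python
import csv
import json
import os
import tempfile
import time


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def write_json(path, rows):
    with open(path, "w") as f:
        json.dump(rows, f)


def read_json(path):
    with open(path) as f:
        return json.load(f)


def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)


def read_csv(path):
    with open(path, newline="") as f:
        return list(csv.reader(f))


# Made-up sample data: an int, a float and a string per row.
rows = [[i, i * 0.5, f"item-{i}"] for i in range(10_000)]

formats = [("json", write_json, read_json), ("csv", write_csv, read_csv)]
timings = {}
with tempfile.TemporaryDirectory() as tmp:
    for name, writer, reader in formats:
        path = os.path.join(tmp, f"data.{name}")
        _, write_s = timed(writer, path, rows)
        loaded, read_s = timed(reader, path)
        timings[name] = {
            "write_s": write_s,
            "read_s": read_s,
            "bytes": os.path.getsize(path),
            "rows": len(loaded),
        }

for name, t in timings.items():
    print(f"{name}: write {t['write_s']:.4f}s, read {t['read_s']:.4f}s, "
          f"{t['bytes']} bytes")
```
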

| Format | Advantages | Disadvantages | Binary | Comment |
|---|---|---|---|---|
| CSV | Simple to use, not much overhead, can be imported into Excel | No data types | No | |
| JSON | Simple to use, data types, supported by all programming languages | Some overhead | No | Use this if possible |
| YAML | Simple to use, data types, supported by Python | Some overhead, not so simple to write it correctly | No | Don't use this for storing data. Nice for configuration, though. |
| XML | Data formats | A pain in the *** to use | No | Might be worth a shot if you're using a strictly typed programming language |
| MessagePack | Data types, relatively small | | Yes | |
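
The "No data types" entry for CSV is easy to see in a round trip. This small sketch uses only Python's standard `json` and `csv` modules; the sample record is made up for illustration.

```python
import csv
import io
import json

# Made-up record with a string, an int, a float and a bool.
record = {"name": "sample", "count": 3, "ratio": 0.5, "valid": True}

# JSON round trip: the types survive.
json_back = json.loads(json.dumps(record))

# CSV round trip: every value comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
buffer.seek(0)
csv_back = next(csv.DictReader(buffer))

print(json_back)  # {'name': 'sample', 'count': 3, 'ratio': 0.5, 'valid': True}
print(csv_back)   # {'name': 'sample', 'count': '3', 'ratio': '0.5', 'valid': 'True'}
```

With CSV you have to convert the values back yourself after reading, which means the schema has to be stored or agreed on somewhere else.
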