
Data Serialization

Serialization is the process of transforming objects held in memory into a structure that can be stored in a file. Serializing data is important for several reasons:

  • Checkpoints: Especially in machine learning, preprocessing often takes a long time. The longest setup I have had so far took about 20 hours. I don't want my computer to run while I sleep, and I want it to be usable when I sit in front of it. Hence I want to be able to abort and resume the preprocessing, which means I have to store the current state on disk.
  • Memory limitations: My laptop has 8 GB of RAM, which is not enough for some datasets. Hence I want to be able to preprocess part of my data, store the results in a file, remove them from memory and continue. This way, I can handle arbitrarily large datasets. Formats with random access are nice in such a case; otherwise you can always create multiple files.
  • Sharing data
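The first two reasons can be combined in one pattern: process the data in chunks, write each chunk to its own file, and keep a small checkpoint file that records how far the run got. The following is a minimal sketch of that idea using only the Python standard library. The names `preprocess`, `run`, `checkpoint.json` and `chunk_size` are hypothetical and chosen for illustration, and the sketch assumes the raw records are already available as a list.

```python
import json
from pathlib import Path


def preprocess(record):
    # Hypothetical stand-in for an expensive preprocessing step.
    return {"value": record, "squared": record * record}


def run(records, out_dir, chunk_size=1000):
    """Preprocess `records` chunk by chunk; can be aborted and resumed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    checkpoint = out_dir / "checkpoint.json"

    # Resume: read the index of the next chunk to process, if a checkpoint exists.
    next_chunk = 0
    if checkpoint.exists():
        next_chunk = json.loads(checkpoint.read_text())["next_chunk"]

    n_chunks = (len(records) + chunk_size - 1) // chunk_size
    for i in range(next_chunk, n_chunks):
        chunk = records[i * chunk_size:(i + 1) * chunk_size]
        result = [preprocess(r) for r in chunk]
        # One file per chunk, so only one chunk of results is held in memory.
        (out_dir / f"chunk_{i:05d}.json").write_text(json.dumps(result))
        # Record progress only after the chunk file has been written.
        checkpoint.write_text(json.dumps({"next_chunk": i + 1}))
    return n_chunks
```

Calling `run(records, "out")` a second time after an abort skips the chunks that were already written. Writing the checkpoint after the chunk file means that an interruption costs at most one chunk of repeated work.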

Interesting properties of a serialization format are:

  • Library support: How easy is it to read and write data? Is the format tied to one programming language or environment, or is it open and widespread?
  • Read speed
  • Write speed
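Read and write speed depend heavily on the data and the library, so it is usually worth measuring them on your own data. The sketch below times one write and one read for JSON and CSV with the Python standard library. The helper names (`time_call`, `benchmark`, and so on) and the sample rows are assumptions made for illustration, and a single run like this gives only a rough impression, not a careful benchmark.

```python
import csv
import json
import tempfile
import time
from pathlib import Path


def time_call(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def write_json(path, rows):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f)


def read_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def write_csv(path, rows):
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)


def read_csv(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


def benchmark(rows):
    """Time one write and one read per format; values are in seconds."""
    formats = [("json", write_json, read_json), ("csv", write_csv, read_csv)]
    timings = {}
    with tempfile.TemporaryDirectory() as tmp:
        for name, write, read in formats:
            path = Path(tmp) / f"data.{name}"
            _, t_write = time_call(write, path, rows)
            _, t_read = time_call(read, path)
            timings[name] = {"write": t_write, "read": t_read}
    return timings


if __name__ == "__main__":
    # Made-up sample data: an integer, a float and a string per row.
    rows = [[i, i * 0.5, f"item-{i}"] for i in range(100_000)]
    for name, t in benchmark(rows).items():
        print(f"{name}: write {t['write']:.3f} s, read {t['read']:.3f} s")
```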

Overview

| Format | Pro | Con | Binary | Comment |
| --- | --- | --- | --- | --- |
| CSV | Simple to use, not much overhead, can be imported into Excel | No data types | No | |
| JSON | Simple to use, data types, supported by all programming languages | Some overhead | No | Use this if possible |
| YAML | Simple to use, data types, supported by Python | Some overhead, not so simple to write it correctly | No | Don't use this for storing data. Nice for configuration, though. |
| XML | Data formats | A pain in the *** to use | No | Might be worth a shot if you're using a strictly typed programming language |
| MessagePack | Data types, relatively small | | Yes | |
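The "No data types" entry for CSV is easy to see in a round trip. The following sketch writes the same record as JSON and as CSV with the Python standard library and reads it back. The record itself is made up for illustration.

```python
import csv
import io
import json

# Made-up example record with an int, a float, a bool and a string.
row = {"id": 7, "score": 0.5, "active": True, "name": "example"}

# JSON round trip: numbers, booleans and strings keep their types.
from_json = json.loads(json.dumps(row))

# CSV round trip: every field comes back as a string.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(from_json)  # {'id': 7, 'score': 0.5, 'active': True, 'name': 'example'}
print(dict(from_csv))  # {'id': '7', 'score': '0.5', 'active': 'True', 'name': 'example'}
```

With CSV, the reader has to know which columns are numbers or booleans and convert them itself.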

See also

  • Speed comparison

Published

Aug 7, 2017
by Martin Thoma

Category

Code

Tags

  • Data formats
  • Machine Learning
