Ways to store Data

This is an article I had as a draft for quite a while. As part of my yearly cleanup, I've published it without finishing it. It might be incomplete or have other problems.

Data is one core element of machine learning. Hence it is worth thinking about ways to store it. This post is inspired by some news of really big datasets being published (source).

Hardware

This post is not about hardware. Well, not mainly. The only thing I would like to mention is a set of rough scales:

Size	Hardware	Backup
< 250 GB	SSD	Easy
250 GB - 10 TB	HDD	Ok
10 TB - 32 TB	HDD + RAID	Difficult
more than 32 TB	Tapes? SANs?	Extremely difficult

A short overview of some RAID levels:

RAID	Stripes	Mirror	Parity	Comment
0	Blocks	No	No	Data is striped across the disks without redundancy. If one disk fails, you lose the data of the whole array
1	No	Blocks	No	With two disks, you can only use half the storage
5	Blocks	No	Yes	1/n of the storage is used for parity, where n is the number of disks
10	Blocks	Blocks	No	RAID 1 and RAID 0 combined
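
To make the storage overhead in the table concrete, here is a minimal sketch that computes the usable capacity of an array. The function name `usable_capacity` and the minimum disk counts are my own assumptions for illustration. It assumes identical disks and ignores file system overhead, hot spares and controller specifics.

```python
def usable_capacity(level: int, disks: int, disk_size_tb: float) -> float:
    """Rough usable capacity in TB for an array of identical disks.

    Hypothetical helper for illustration only; real arrays lose some
    additional space to metadata and file system overhead.
    """
    if level == 0:
        if disks < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        # Striping only: all space is usable, nothing is redundant.
        return disks * disk_size_tb
    if level == 1:
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        # Every disk holds the same copy, so only one disk's worth is usable.
        return disk_size_tb
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of space (1/n of the total) goes to parity.
        return (disks - 1) * disk_size_tb
    if level == 10:
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Disks are mirrored in pairs, then the pairs are striped.
        return (disks // 2) * disk_size_tb
    raise ValueError(f"RAID level {level} is not covered by this sketch")


for level in (0, 1, 5, 10):
    print(f"RAID {level} with 4 x 4 TB disks:", usable_capacity(level, 4, 4.0), "TB usable")
```

Note that RAID 1 with more than two disks keeps more copies, but the usable space stays at the size of a single disk.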

File formats

Having talked about the limits at the upper end of data volume, I would like to go down several levels. Let's talk about file formats.

Structured Data

Structured data has a schema. It is organized and thus usually easier to search than unstructured data. Relational Databases structure data, but the contents of columns can contain unstructured data (e.g. a free text field).
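
As a small illustration of that last point, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table `tickets` and its columns are made up for this example. The `priority` column is structured and can be queried exactly, while the `note` column is free text that can only be searched by pattern matching.

```python
import sqlite3

# In-memory database, nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The schema fixes the columns and their types (hypothetical example table).
conn.execute(
    "CREATE TABLE tickets ("
    "id INTEGER PRIMARY KEY, "
    "priority INTEGER NOT NULL, "
    "note TEXT)"
)
conn.executemany(
    "INSERT INTO tickets (priority, note) VALUES (?, ?)",
    [
        (1, "Disk in bay 3 is failing, replace it soon"),
        (3, "User asks whether backups include the mail archive"),
        (1, "Backup job timed out last night"),
    ],
)

# Structured column: an exact, well-defined query.
urgent = conn.execute(
    "SELECT id FROM tickets WHERE priority = 1 ORDER BY id"
).fetchall()

# Free-text column: the database can only do a substring match here.
about_backups = conn.execute(
    "SELECT id FROM tickets WHERE note LIKE '%backup%' ORDER BY id"
).fetchall()

print("urgent tickets:", urgent)
print("tickets mentioning backups:", about_backups)
```

The second query finds rows by accident of wording. A note that says "restore failed" would be missed even though it is about backups, which is the typical weakness of unstructured content.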

Unstructured data

There are too many file formats for unstructured data to name them all. Here are a few examples:

Text files: e-mails
Images: JPG, PNG, GIF, BMP, ...
Documents: PDF, PS, ...
Video, Audio, ...

Databases

Databases are a nice way to store data. Types of databases are:

SQL-based: MySQL / MariaDB, PostgreSQL, ...
Document-oriented databases: CouchDB, MongoDB, Elasticsearch
Graph databases: Neo4j, ...
Key-Value databases: Redis, ...

Data Warehouse

Classical use cases of data warehouses are operational and financial reporting.

Data Lake

The idea of a data lake is that it is a large container. Several sources add data to the lake. The data might be structured or unstructured, for example machine-generated data or log files.

Data lakes have 5 core principles according to Evan Shelley:

Ingest: Ability to collect all data you care about
Store: Getting data in one place (e.g. with a distributed file system like Hadoop's HDFS)
Analyze: Find relations you care about
Surface: Display results found in data
Act: Help the customer to make more money

Hadoop is a key tool for data lakes.

Frameworks

Apache Hadoop: MapReduce framework for distributed computing
Apache Spark: Framework for cluster computing
Apache Cassandra: Distributed wide-column store (a NoSQL database management system)
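
To give a feeling for the MapReduce idea mentioned above, here is a toy word count in plain Python. It is only a single-process sketch of the map, shuffle and reduce steps. It does not use Hadoop's API, and the function names are chosen for this example. In a real cluster, the map and reduce calls would run in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle(mapped)
    # Each key could be reduced on a different machine.
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data big lake", "data lake"])
print(counts)
```

The point of the pattern is that `map_phase` only looks at one document and `reduce_phase` only looks at one key, so both steps can be distributed without the workers talking to each other.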
