Data is a core element of machine learning, so it is worth thinking about ways to store it. This post is inspired by news of some really big datasets being published (source).
This post is not about hardware. Well, not mainly. The only thing I would like to mention is some rough scales:
| Amount of data | Medium | Difficulty |
| --- | --- | --- |
| 250 GB - 10 TB | HDD | Ok |
| 10 TB - 32 TB | HDD + RAID | Difficult |
| more than 32 TB | Tapes? SANs? | Extremely difficult |
A short overview of some RAID levels:
| Level | Striping | Mirroring | Parity | Comment |
| --- | --- | --- | --- | --- |
| 0 | Blocks | No | No | Just chaining the disks. You can easily lose data |
| 1 | No | Blocks | No | You can only use half the storage |
| 5 | Blocks | No | Yes | (1/n)th of the storage is used for parity, where n is the number of disks |
| 10 | Blocks | Blocks | No | RAID 1 and 0 combined |
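To get a feel for why RAID 5 only spends (1/n)th of the storage on redundancy, here is a toy sketch of XOR parity in plain Python (not a real RAID driver; the byte strings standing in for disk blocks are made up). One parity block protects n data blocks, and any single lost block can be rebuilt by XOR-ing the survivors with the parity:

```python
# Toy sketch of RAID-5-style XOR parity (illustrative, not a real RAID driver).
# One parity block covers n data blocks; a single failed block can be
# reconstructed by XOR-ing the remaining blocks with the parity block.

def xor_blocks(blocks):
    """XOR a list of equal-length byte strings together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            result[i] ^= byte
    return bytes(result)

# Three "disks", each holding one block (hypothetical data)
disk_a = b"hello"
disk_b = b"world"
disk_c = b"raid5"

parity = xor_blocks([disk_a, disk_b, disk_c])

# Disk B fails; rebuild its block from the survivors plus the parity block
rebuilt_b = xor_blocks([disk_a, disk_c, parity])
assert rebuilt_b == disk_b
```

The same idea generalizes to more disks: the parity overhead stays one block per stripe, which is where the (1/n)th figure in the table comes from.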
You might be interested in
- Unboxing a PETABYTE of Storage - HOLY $H!T Ep. 16.
- RAID levels
- Why is ext4 only recommended up to 16 TB?
- What limits the number of drives in RAID?
I've heard you can store much more on tape drives, but it might be necessary to physically go there and put a tape in or take it out.
Having talked about the limitations at the upper end of the scale, I would like to go down several levels. Let's talk about file formats.
Structured data has a schema. It is organized and thus usually easier to search than unstructured data. Relational databases structure data, but individual columns can still contain unstructured data (e.g. a free text field).
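As a small illustration, here is a relational table that mixes a strict schema with an unstructured free-text column (using Python's built-in sqlite3 module; the table and column names are made up for this example):

```python
import sqlite3

# In-memory database; the schema (column names and types) is what makes
# the data "structured"
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tickets (
        id      INTEGER PRIMARY KEY,
        created TEXT NOT NULL,   -- structured: ISO date string
        status  TEXT NOT NULL,   -- structured: small set of known values
        notes   TEXT             -- unstructured: free text
    )
""")
conn.execute(
    "INSERT INTO tickets (created, status, notes) VALUES (?, ?, ?)",
    ("2019-01-01", "open", "Customer reports login fails on mobile."),
)

# The structured columns are easy to query exactly...
row = conn.execute("SELECT id FROM tickets WHERE status = 'open'").fetchone()

# ...while the free-text column needs substring (or full-text) search
hits = conn.execute("SELECT id FROM tickets WHERE notes LIKE '%login%'").fetchall()
```

Searching `status` is an exact match against the schema; searching `notes` already requires pattern matching, which is the first step toward the unstructured world.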
There are too many file formats for unstructured data to name them all. Here are a few examples:
- Text files: e-mails
- Images: JPG, PNG, GIF, BMP, ...
- Documents: PDF, PS, ...
- Video, Audio, ...
Databases are a nice way to store data. Types of databases are:
- SQL-based: MySQL / MariaDB, PostgreSQL, ...
- Document-oriented database: CouchDB, MongoDB, Elasticsearch
- Graph databases: Neo4j, ...
- Key-Value databases: Redis, ...
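The document-oriented and key-value models are easy to sketch without any real database. The following toy (just a Python dict, not CouchDB/MongoDB/Redis) shows the core idea: records are looked up by key, and unlike rows in a relational table, different documents may have different fields:

```python
# Toy sketch of a document / key-value store (a plain dict, not a real DB).
# Documents are schemaless: each one can have its own set of fields.
store = {}

def put(doc_id, document):
    """Store a document under a key."""
    store[doc_id] = document

def get(doc_id):
    """Fetch a document by key (None if missing)."""
    return store.get(doc_id)

put("user:1", {"name": "Ada", "email": "ada@example.com"})
put("user:2", {"name": "Bob", "tags": ["ml", "storage"]})  # different fields are fine

assert get("user:2")["tags"] == ["ml", "storage"]
```

Real document databases add persistence, indexing, and queries over document contents on top of this basic key-to-document mapping.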
Classical use cases of data warehouses are operational and financial reporting.
The idea of a data lake is that it is a large container. Several sources add data to the lake. The data might be structured or unstructured, for example machine-generated data or log files.
Data lakes have 5 core principles according to Evan Shelley:
- Ingest: Ability to collect all data you care about
- Store: Getting data in one place (e.g. with file system like Hadoop)
- Analyze: Find relations you care about
- Surface: Display results found in data
- Act: Help the customer to make more money
Hadoop and related Apache projects are key tools for data lakes:
- Apache Hadoop: Map-Reduce framework for distributed computing
- Apache Spark: Framework for cluster computing
- Apache Cassandra: distributed, wide column store, NoSQL database management system
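To give a feel for the Map-Reduce model that Hadoop implements, here is the classic word count as a pure-Python toy (this is only the programming model, not the Hadoop API): map emits key-value pairs, shuffle groups them by key, and reduce aggregates each group.

```python
from collections import defaultdict
from itertools import chain

# Toy Map-Reduce word count (pure Python, not the Hadoop API).

def map_phase(line):
    """Emit a (word, 1) pair for every word in a line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Aggregate each group: sum the counts per word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big storage", "big lake"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
assert counts["big"] == 3
```

The point of the model is that the map and reduce steps are independent per line and per key, so a framework like Hadoop can spread them across many machines.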