
Ways to store Data

Contents

  • Hardware
  • File formats
    • Structured Data
    • Unstructured data
  • Databases
  • Data Warehouse
  • Data Lake
  • Frameworks
I had this article as a draft for quite a while. As part of my yearly cleanup, I've published it without finishing it. It might be incomplete or have other problems.

Data is one core element of machine learning. Hence it is worth thinking about ways to store it. This post is inspired by some news of really big datasets being published (source).

Hardware

This post is not about hardware. Well, not mainly. The only thing I would like to mention is some rough scales:

Size            | Hardware     | Backup
< 250 GB        | SSD          | Easy
250 GB - 10 TB  | HDD          | OK
10 TB - 32 TB   | HDD + RAID   | Difficult
more than 32 TB | Tapes? SANs? | Extremely difficult

A short overview of some RAID levels:

RAID | Stripes | Mirror | Parity | Comment
0    | Blocks  | No     | No     | Just striping across the disks, no redundancy. You can easily lose data
1    | No      | Blocks | No     | With two disks, you can only use half the storage
5    | Blocks  | No     | Yes    | (1/n)th of the storage is used for parity, where n is the number of disks
10   | Blocks  | Blocks | No     | RAID 1 and 0 combined
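
The storage overhead in the table can be turned into a small calculation. The following is a minimal sketch, assuming identical disks and the simplest layouts (every disk in a RAID 1 holds a full copy; RAID 10 uses mirrored pairs). The function name `usable_capacity` is made up for this example and does not belong to any RAID tool:

```python
def usable_capacity(level: int, disks: int, disk_size_tb: float) -> float:
    """Return the usable capacity in TB for `disks` identical disks.

    Simplified model: it ignores file system overhead and hot spares.
    """
    if level == 0:
        if disks < 2:
            raise ValueError("RAID 0 needs at least 2 disks")
        # Pure striping: all space is usable, nothing is redundant.
        return disks * disk_size_tb
    if level == 1:
        if disks < 2:
            raise ValueError("RAID 1 needs at least 2 disks")
        # Every disk holds a full copy, so only one disk's worth is usable.
        return disk_size_tb
    if level == 5:
        if disks < 3:
            raise ValueError("RAID 5 needs at least 3 disks")
        # One disk's worth of space (1/n of the total) goes to parity.
        return (disks - 1) * disk_size_tb
    if level == 10:
        if disks < 4 or disks % 2 != 0:
            raise ValueError("RAID 10 needs an even number of disks, at least 4")
        # Mirrored pairs that are striped: half of the space is usable.
        return disks / 2 * disk_size_tb
    raise ValueError(f"RAID level {level} is not covered by this sketch")


for raid_level in (0, 1, 5, 10):
    print(raid_level, usable_capacity(raid_level, disks=4, disk_size_tb=4.0))
```

With four 4 TB disks, RAID 5 leaves 12 TB usable while RAID 10 leaves 8 TB, which is the price for mirroring instead of parity.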

You might be interested in

  • Unboxing a PETABYTE of Storage - HOLY $H!T Ep. 16.
  • RAID levels
  • Why is ext4 only recommended up to 16 TB?
  • What limits the number of drives in RAID?

I've heard you can store much more on tape. The drawback is that somebody might have to physically go there to insert and remove tapes.

File formats

Having talked about limitations on the upper scale of the amount of data, I would like to go down several levels. Let's talk about file formats.

Structured Data

Structured data has a schema. It is organized and thus usually easier to search than unstructured data. Relational databases structure data, but individual columns can still hold unstructured data (e.g. a free-text field).
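
To make the difference concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table `emails` and its contents are invented for illustration. The columns `sender` and `received_at` are structured, while `body` is a free-text field inside an otherwise structured table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE emails (
        id INTEGER PRIMARY KEY,
        sender TEXT NOT NULL,
        received_at TEXT NOT NULL,  -- ISO 8601 timestamp
        body TEXT                   -- free text: unstructured content
    )
    """
)
conn.executemany(
    "INSERT INTO emails (sender, received_at, body) VALUES (?, ?, ?)",
    [
        ("alice@example.com", "2018-12-01T09:00:00", "Hi, the dataset is ready."),
        ("bob@example.com", "2018-12-02T14:30:00", "Can we talk about backups?"),
        ("alice@example.com", "2018-12-03T08:15:00", "Backup finished without errors."),
    ],
)

# The schema makes this query exact and easy to express.
by_sender = conn.execute(
    "SELECT COUNT(*) FROM emails WHERE sender = ?", ("alice@example.com",)
).fetchone()[0]

# Searching the free-text column falls back to fuzzy pattern matching.
mentions = conn.execute(
    "SELECT id FROM emails WHERE body LIKE ? ORDER BY id", ("%backup%",)
).fetchall()

print(by_sender, mentions)
```

The first query relies on the schema, the second one has to guess what the text means, which is the typical trade-off between the two kinds of data.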

Unstructured data

There are too many file formats for unstructured data to name them all. Here are a few examples:

  • Text files: e-mails
  • Images: JPG, PNG, GIF, BMP, ...
  • Documents: PDF, PS, ...
  • Video, Audio, ...

Databases

Databases are a nice way to store data. Common types of databases are:

  • SQL-based: MySQL / MariaDB, PostgreSQL, ...
  • Document-oriented databases: CouchDB, MongoDB, Elasticsearch
  • Graph databases: Neo4j, ...
  • Key-Value databases: Redis, ...
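
The four types differ mainly in how they model a record. The sketch below stores the same made-up fact ("Ada lives in London") in each style. Only the relational part uses a real database (`sqlite3`); the document, graph and key-value parts are plain Python stand-ins and not the client APIs of MongoDB, Neo4j or Redis:

```python
import json
import sqlite3

# Relational: fixed columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Ada', 'London')")
relational_row = conn.execute("SELECT name, city FROM users WHERE id = 1").fetchone()

# Document-oriented: one nested, self-describing document per record.
document = json.dumps(
    {"_id": 1, "name": "Ada", "address": {"city": "London"}, "tags": ["ml"]}
)
document_city = json.loads(document)["address"]["city"]

# Graph: nodes plus typed edges between them.
nodes = {1: {"label": "User", "name": "Ada"}, 2: {"label": "City", "name": "London"}}
edges = [(1, "LIVES_IN", 2)]
graph_city = [
    nodes[dst]["name"] for src, rel, dst in edges if src == 1 and rel == "LIVES_IN"
]

# Key-value: opaque values behind flat keys; the key naming scheme is up to you.
kv_store = {"user:1:name": "Ada", "user:1:city": "London"}
kv_city = kv_store["user:1:city"]

print(relational_row, document_city, graph_city, kv_city)
```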

Data Warehouse

Classical use cases of data warehouses are operational and financial reporting.

See also:

  • Wikipedia
    • Data Warehouse
    • FACT table
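
A fact table holds the measurable events (e.g. sales), while dimension tables describe them. Here is a minimal sketch of such a star-like schema with `sqlite3`, followed by a typical reporting query. The table names, column names and numbers are made up for this example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension table: descriptive attributes.
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        name TEXT,
        category TEXT
    );
    -- Fact table: one row per sale, pointing at the dimension.
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        sale_date TEXT,
        revenue REAL
    );
    INSERT INTO dim_product VALUES
        (1, 'Laptop', 'Hardware'),
        (2, 'Mouse', 'Hardware'),
        (3, 'Editor license', 'Software');
    INSERT INTO fact_sales (product_id, sale_date, revenue) VALUES
        (1, '2018-11-03', 1200.0),
        (2, '2018-11-03', 25.0),
        (3, '2018-11-15', 80.0),
        (1, '2018-12-01', 1100.0);
    """
)

# Financial report: revenue per product category.
report = conn.execute(
    """
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
    """
).fetchall()

print(report)
```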

Data Lake

The idea of a data lake is that it is a large container. Several sources add data to the lake. The data might be structured or unstructured, for example machine-generated data or log files.

According to Evan Shelley, data lakes have five core principles:

  • Ingest: Ability to collect all data you care about
  • Store: Getting data in one place (e.g. with a distributed file system such as Hadoop's HDFS)
  • Analyze: Find relations you care about
  • Surface: Display results found in data
  • Act: Help the customer to make more money

Hadoop is a key tool for data lakes.
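
The ingest, store and analyze steps can be imitated on a single machine. The following is only a toy sketch: a temporary local directory stands in for the lake, whereas a real one would sit on something like HDFS. The helper `ingest`, the `raw/<source>/` layout and all file contents are assumptions made for this example:

```python
import csv
import json
import tempfile
from pathlib import Path


def ingest(lake: Path, source: str, filename: str, content: str) -> Path:
    """Store raw data unchanged under raw/<source>/ inside the lake."""
    target = lake / "raw" / source / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target


with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)

    # Ingest: several sources add data in whatever format they produce.
    ingest(lake, "webshop", "orders.csv", "order_id,amount\n1,19.99\n2,5.00\n")
    ingest(lake, "sensors", "readings.json", json.dumps([{"sensor": "a", "temp": 21.5}]))
    ingest(lake, "webserver", "access.log", "GET /index.html 200\nGET /missing 404\n")

    # Store: everything ends up in one place.
    stored = sorted(
        p.relative_to(lake).as_posix() for p in lake.rglob("*") if p.is_file()
    )

    # Analyze: the structure is only imposed when the data is read.
    with (lake / "raw" / "webshop" / "orders.csv").open(newline="", encoding="utf-8") as f:
        total = sum(float(row["amount"]) for row in csv.DictReader(f))

print(stored, round(total, 2))
```

The point of the sketch is that nothing is cleaned or converted on the way in. Interpretation happens later, per analysis.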

Frameworks

  • Apache Hadoop: Framework for distributed storage (HDFS) and distributed computing (MapReduce)
  • Apache Spark: Framework for cluster computing
  • Apache Cassandra: distributed, wide column store, NoSQL database management system
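
To give an idea of the MapReduce programming model behind Hadoop, here is the classic word count in plain Python. It runs in a single process and does not use the Hadoop API. The function names `map_phase`, `shuffle` and `reduce_phase` are just labels I picked for the three steps:

```python
from collections import defaultdict
from itertools import chain


def map_phase(document: str) -> list[tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs) -> dict[str, list[int]]:
    """Group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: dict[str, list[int]]) -> dict[str, int]:
    """Combine the values of each key into a single result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["big data big storage", "data lake"]

# In a real cluster, each document (or block) would be mapped on a different node.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle(mapped))

print(word_counts)
```

Because the map step looks at each document independently and the reduce step looks at each key independently, both steps can be spread over many machines.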

Published

Dec 30, 2018
by Martin Thoma

