
Data Science - An Overview

Contents

  • Data Science vs Business Analytics
  • Data Scientist vs Data Engineer
  • Data Scientist vs Data Analyst
  • Data Science vs ML vs AI
  • See also

Data Science recently became popular. Currently, there are 154 open job positions for Data Scientists in Munich on Indeed.com. To put that into context: there are 186 open positions for Android developers, 527 for DevOps, 753 for frontend and 812 for backend. So it's still fairly small, but in the same ballpark.

I wanted a data-based answer to what a data scientist actually is, so I created a list of the skill sets employers ask for. It turns out that this would lead to a lengthy and hard-to-digest blog post.

I don't really like the term "data science" as it is too vague for me, but here is how I would define it: A data scientist is a person who applies data science. Data science is an academic field which deals with the extraction of knowledge and insights from data.

Popularity of Data Science and related terms in books. One can see a linear increase for "machine learning" since about 1975, while the term "data mining" exploded from 1992 to 2003. Other related terms like "big data", "deep learning", "information extraction" and "data science" are much less popular in books.

I would also say it is a term used much more often in industry than in academia. The requirements differ between job postings, but there are some general themes:

Word Cloud of the skill set in 10 different Data Scientist job postings. If you're interested in how to create word clouds, look here.

Some of the requirements are typical senior software developer skills, such as knowledge of Scrum and Waterfall and a good command of spoken and written English and German. Others are more specialized, such as skills around machine learning (sklearn, scipy, nltk, Theano / TensorFlow / Keras / MXNet) or Big Data (AWS, Hadoop, Spark).
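
A word cloud like the one above is essentially a picture of term frequencies. As a rough sketch of that counting step, assuming the skills of each posting have already been extracted into a list (the postings below are made up for illustration and are not the data behind the image):

```python
from collections import Counter

# Hypothetical skill lists, one per job posting (illustrative only).
postings = [
    ["python", "sklearn", "spark", "english"],
    ["Python", "tensorflow", "aws", "scrum"],
    ["r", "sklearn", "hadoop", "python"],
]


def count_skills(postings):
    """Count in how many postings each skill appears (case-insensitive)."""
    counts = Counter()
    for skills in postings:
        # A set ensures a skill mentioned twice in one posting counts once.
        counts.update({skill.strip().lower() for skill in skills})
    return counts


skill_counts = count_skills(postings)
for skill, n in skill_counts.most_common(3):
    print(f"{skill}: {n}")
```

The resulting counts could then be handed to whatever word cloud tool you prefer, or simply shown as a sorted table.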

I've also asked some friends and colleagues which kinds of tasks they have seen so far. I gave them a list of six possible responses and asked them to add anything that didn't match an entry in the list. I didn't get any answer outside of it. Here are the answers:

Data Science project types. EDA is short for "Exploratory Data Analysis". The bar chart was created with rapidtables.com

Let's first explain the different project types:

  1. Forecasts: Given a time series of the past, predict the future
  2. Classification (and regression): For example, detect if an e-mail is spam or not
  3. EDA: Exploratory Data Analysis. Here is the data - now find something interesting. This is a very unspecific task.
  4. Visualizations: Data Science can also be a bit about storytelling. You found something which can be explained with exact terminology and words, but it has to be made clear to stakeholders what you found in a simple, intuitive, fast way.
  5. A/B tests (and hypothesis testing)
  6. Clustering: Which types of customers do we have? (Customer segmentation)
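
To make one of these project types more concrete, here is a minimal sketch of an A/B test evaluated with a two-sided two-proportion z-test. The function name and the conversion numbers are made up for illustration. The normal approximation is only reasonable for fairly large samples, and real experiments need more care (sample size planning, multiple testing, and so on).

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, p_value) for H0: both variants have the same conversion rate.

    conv_* are the numbers of conversions, n_* the numbers of visitors.
    Assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Under H0 both groups share one rate, so pool the data to estimate it.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    standard_error = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: variant B converts 156 of 2400, variant A 120 of 2400.
z, p = two_proportion_z_test(conv_a=120, n_a=2400, conv_b=156, n_b=2400)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the difference is unlikely to be pure chance, but the significance threshold should be chosen before looking at the data.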

Now, back to the bar chart: You can see that bar charts are much more visible and stick better in people's minds, although the other tasks are more common. And you can see that people tend to jump to conclusions when they see a pattern in small numbers 😉

From personal experience, I would say that forecasts, clustering and regression are relatively common tasks. Of course, one often has to start with exploratory data analysis.

I try to avoid clustering and pure EDA tasks as they are ill-defined. You can't say when you are done, which makes it hard to get satisfying results.

Data Science vs Business Analytics

Data science and business analytics are closely related. They certainly have big overlaps. Here are some differences:

      | Business Intelligence | Data Science
Tasks | Reports               | Predictive models
Tools | QlikView, SAP         | Pandas, sklearn, Jupyter notebooks, TensorFlow, Keras, XGBoost, scipy, numpy

Data Scientist vs Data Engineer

Both data scientists and data engineers deal with data. While the engineer has more ETL tasks (extract, transform, load), the scientist has more model creation and analysis tasks.

                   | Data Engineer                                                                  | Data Scientist
Typical Background | Computer Science + Software Engineering                                        | Computer Science + Mathematics
Tasks              | Collect and Transform Data (ETL), Data Warehousing                             | Generate Insights from Data; Machine Learning
Typical Frameworks | Hadoop, Spark, Cassandra, Apache Drill, CouchDB, Talend, MongoDB, Neo4j, MariaDB | Pandas, NumPy, SciPy, scikit-learn
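
To illustrate what an ETL task looks like at its smallest, here is a sketch using only the Python standard library. The CSV content, the table name `orders` and the cleaning rules are assumptions made up for this example. Real pipelines would typically use the frameworks listed above.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """customer,amount,currency
alice, 10.50 ,EUR
bob,,EUR
carol,7,eur
"""


def extract(text):
    """Extract: parse the CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without an amount, normalize types and casing."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip incomplete records
        cleaned.append(
            (row["customer"].strip(), float(amount), row["currency"].strip().upper())
        )
    return cleaned


def load(rows, connection):
    """Load: write the cleaned rows into a SQLite table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders (customer TEXT, amount REAL, currency TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
cleaned_rows = transform(extract(RAW_CSV))
load(cleaned_rows, connection)
total = connection.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

The data scientist would then start from the `orders` table rather than from the raw export.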

Data Scientist vs Data Analyst

Data Scientists and Data Analysts are pretty similar to each other when compared to Data Engineers. I would say that Data Scientists should also know about machine learning algorithms and frameworks, while I would not expect that from a Data Analyst.

Data Science vs ML vs AI

David Robinson put it nicely (source):

So in this post, I'm proposing an oversimplified definition of the difference between the three fields:

  • Data science produces insights.
  • Machine learning produces predictions.
  • Artificial intelligence produces actions.

I usually say that ML is a strict subset of AI:

AI vs ML vs Deep Learning

David Robinson's statement does not contradict mine. I would say you need predictions about the future to take smart actions in a changing world.
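
As a toy illustration of how insights, predictions and actions build on each other, consider the following sketch. The sales numbers, the function names and the reorder rule are all made up, and a straight-line fit stands in for "machine learning" only in the loosest sense.

```python
from statistics import mean

# Hypothetical daily sales of one product over two weeks.
sales = [12, 14, 13, 15, 16, 18, 17, 19, 21, 20, 22, 24, 23, 25]


def describe(values):
    """Insight: summarize what happened in the first and second half."""
    half = len(values) // 2
    return {
        "mean_first_half": mean(values[:half]),
        "mean_second_half": mean(values[half:]),
    }


def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar


def predict(values, steps_ahead=1):
    """Prediction: extrapolate the fitted line beyond the last observation."""
    slope, intercept = fit_line(values)
    return slope * (len(values) - 1 + steps_ahead) + intercept


def decide(predicted_demand, stock):
    """Action: choose what to do based on the prediction."""
    return "reorder" if predicted_demand > stock else "wait"


print(describe(sales))
print(decide(predict(sales), stock=20))
```

The action in `decide` is only as good as the prediction it receives, which is the point of the sentence above.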

A more detailed view of Data Science, Machine Learning, and AI

See also

Now that it is clear what kinds of tasks are common in data science, I will continue with blog posts on how to make those projects successful.


Published

Jun 2, 2018
by Martin Thoma

Category

Machine Learning

Tags

  • Machine Learning
