Data Science - An Overview

Data science has recently become popular. There are currently 154 open job positions for data scientists in Munich on Indeed.com. To put that into context: there are 186 open positions for Android developers, 527 for DevOps, 753 for frontend and 812 for backend. So it's still fairly small, but in the same ballpark.

I wanted a data-based answer to what a data scientist actually is, so I created a list of the skill sets employers ask for. It turns out, however, that this would lead to a lengthy and hard-to-digest blog post.

I don't really like the term "data science" as it is too vague for me, but here is how I would define it: A data scientist is a person who applies data science. Data science is an academic field which deals with the extraction of knowledge and insights from data.

Popularity of data science and related terms in books. One can see a linear increase for "machine learning" since about 1975, while the term "data mining" exploded from 1992 to 2003. Other related terms like "big data", "deep learning", "information extraction" and "data science" are much less popular in books.

I would also say it is a term used much more often in industry than in academia. The requirements differ between job postings, but there are some general themes:

Word cloud of the skill sets in 10 different Data Scientist job postings. If you're interested in how to create word clouds, look here.
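A word cloud is essentially a visualization of term frequencies. As a minimal sketch of the counting step, assuming a hand-picked skill vocabulary: the postings and the skill list below are made up for illustration and are not the 10 postings used for the figure.

```python
from collections import Counter
import re

# Hypothetical, hand-written snippets standing in for real job postings.
postings = [
    "Experience with Python, pandas and scikit-learn; good knowledge of SQL",
    "Python, Spark and Hadoop; fluent English and German",
    "Machine learning with TensorFlow or Keras; Python and SQL",
]

# Hypothetical skill vocabulary; a real analysis would need a curated list.
skills = ["python", "sql", "spark", "hadoop", "pandas", "scikit-learn",
          "tensorflow", "keras"]


def count_skills(texts, vocabulary):
    """Count in how many texts each vocabulary term appears (case-insensitive)."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        for term in vocabulary:
            # Lookarounds avoid matching a term inside a longer word.
            if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered):
                counts[term] += 1
    return counts


skill_counts = count_skills(postings, skills)
for term, n in skill_counts.most_common():
    print(f"{term}: {n}")
```

The resulting counts could then be handed to any word cloud tool, which typically scales each term by its frequency.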

Some of the requirements are typical senior software developer skills, such as knowledge of Scrum and Waterfall and good spoken and written English and German. Others are more specialized, such as skills around machine learning (sklearn, scipy, nltk, Theano / Tensorflow / Keras / MXNet) or big data (AWS, Hadoop, Spark).

I've also asked some friends and colleagues which kinds of tasks they have seen so far. I gave them a list of six possible responses and asked them to add more if something didn't match an entry in the list. I didn't get any answer outside of it. Here are the answers:

Data Science project types. EDA is short for "Exploratory Data Analysis". The bar chart was created with rapidtables.com

Let's first explain the different project types:

Forecasts: Given a time series of the past, predict the future
Classification (and regression): For example, detect if an e-mail is spam or not
EDA: Exploratory Data Analysis. Here is the data - now find something interesting. This is a very unspecific task.
Visualizations: Data science can also be a bit about storytelling. You found something which can be explained with exact terminology, but it has to be made clear to stakeholders in a simple, intuitive, fast way.
A/B tests (and hypothesis testing)
Clustering: Which types of customers do we have? (Customer segmentation)
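To make the A/B test entry in the list above a bit more concrete, here is a minimal sketch of a two-sided two-proportion z-test using only the Python standard library. The function name and the conversion counts are made up for illustration; the pooled normal approximation is one common choice among several and is only reasonable for fairly large samples.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates.

    Returns (z, p_value). Uses the pooled-proportion normal approximation
    and assumes the pooled rate is strictly between 0 and 1.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up counts: the same rates (10% vs 12%) at two different sample sizes.
z_small, p_small = two_proportion_z_test(10, 100, 12, 100)
z_large, p_large = two_proportion_z_test(1000, 10000, 1200, 10000)
print(f"small sample: z={z_small:.2f}, p={p_small:.3f}")
print(f"large sample: z={z_large:.2f}, p={p_large:.6f}")
```

With these made-up counts, the same observed difference is far from significant at 100 users per variant but clearly significant at 10,000 per variant.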

Now, back to the bar chart: You can see that bar charts are much more visible / stick better in people's minds, although the other tasks are more common. And you can see that people tend to jump to conclusions too quickly when they see a pattern in small numbers 😉

From personal experience, I would say that forecasts, clustering and regression are relatively common tasks. Of course, one often has to start with exploratory data analysis.

I try to avoid clustering and pure EDA tasks as they are ill-defined. You can't say when you are done, which makes it hard to get satisfying results.

Data Science vs Business Analytics

Data science and business analytics are closely related and overlap considerably. Here are some differences:

	Business Intelligence	Data Science
Tasks	Reports	Predictive models
Tools	QlikView, SAP	Pandas, sklearn, Jupyter notebooks, Tensorflow, Keras, XGBoost, scipy, numpy

Data Scientist vs Data Engineer

Both data scientists and data engineers deal with data. While the engineer has more ETL tasks (extract, transform, load), the scientist has more model creation and analysis tasks.

	Data Engineer	Data Scientist
Typical Background	Computer Science + Software Engineering	Computer Science + Mathematics
Tasks	Collect and Transform Data (ETL), Data Warehousing	Generate Insights from Data; Machine Learning
Typical Frameworks	Hadoop, Spark, Cassandra, Apache Drill, CouchDB, Talend, MongoDB, Neo4j, MariaDB	Pandas, NumPy, SciPy, scikit-learn
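As a rough illustration of what "extract, transform, load" means, here is a toy sketch using only the Python standard library. The CSV content, the column names and the orders table are invented for this example; real pipelines would typically use the frameworks listed in the table and deal with far messier sources.

```python
import csv
import io
import sqlite3

# Extract: a made-up CSV export; in practice this would be a file or an API.
raw_csv = """customer,amount,currency
alice,10.50,EUR
bob,,EUR
carol,7.25,EUR
"""


def extract(text):
    """Parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Drop rows without an amount and convert amounts to integer cents."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue
        cents = round(float(row["amount"]) * 100)
        cleaned.append((row["customer"], cents, row["currency"]))
    return cleaned


def load(rows, connection):
    """Write the cleaned rows into a (hypothetical) orders table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(customer TEXT, amount_cents INTEGER, currency TEXT)"
    )
    connection.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(raw_csv)), connection)
total_cents = connection.execute(
    "SELECT SUM(amount_cents) FROM orders"
).fetchone()[0]
print(total_cents)
```

The data scientist's work would typically start where this sketch ends: querying the cleaned table to build models or run analyses.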

Data Scientist vs Data Analyst

Compared to data engineers, data scientists and data analysts are pretty similar. I would say that a data scientist should also know about machine learning algorithms and frameworks, while I would not expect that from a data analyst.

Data Science vs ML vs AI

David Robinson put it really nicely (source):

So in this post, I'm proposing an oversimplified definition of the difference between the three fields:

Data science produces insights.

Machine learning produces predictions.

Artificial intelligence produces actions.

Usually, I say that ML is a strict subset of AI.

David Robinson's statement does not contradict mine. I would say you need predictions about the future to take smart actions in a changing world.
