
State of the Art in ML

Contents

  • State of the Art in ML
    • Computer Vision
      • Image Classification
      • Detection (Images)
      • Detection (Videos)
      • Person Re-Identification
      • Semantic Segmentation
      • Instance Segmentation
      • Action Recognition
      • Super Resolution
      • Lip Reading
      • Other Datasets
    • ASR
      • Sentence-Level
      • Phoneme-Level
    • Language
    • Translation
    • Matrix completion
    • Reinforcement Learning
    • Control
    • See also

It is difficult to keep track of the current state of the art (SotA). Also, it might not be immediately clear which datasets are relevant. The following list should help. If you think some datasets / problems / SotA results are missing, let me know in the comments or via e-mail. I will update the list.

Papers and blog posts which summarize a topic or give a good introduction are always welcome.

In the following, a + will indicate "higher is better" and a - will indicate "lower is better".

Computer Vision

Image Classification

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| ImageNet 2012 | 2015 | 3.08 % | Top-5 error - | [SIVA16] |
| MNIST | 2013 | 0.21 % | error - | [WZZ+13] |
| CIFAR-10 | 2017 | 2.72 % | error - | [G17] |
| CIFAR-100 | 2016 | 15.85 % | error - | [G17] |
| STL-10 | 2017 | 78.66 % | accuracy + | [Tho17-2] |
| SVHN | 2016 | 1.54 % | error - | [ZK16] |
| Caltech-101 | 2014 | 91.4 % | accuracy + | [HZRS14] |
| Caltech-256 | 2014 | 74.2 % | accuracy + | [ZF14] |
| HASYv2 | 2017 | 85.92 % | accuracy + | [Tho17-2] |
| Graz-02 | 2010 | 78.98 % | accuracy + | [BMDP10] |
| YFCC100m | | | | |
| CUB-200-2011 Birds | 2015 | 84.1 | accuracy + | [LRM15] |
| DISFA | 2017 | 48.5 | accuracy + | [LAZY17] |
| BP4D | | | | |
| EMNIST | 2017 | 50.93 | accuracy + | [CATS17] |
| Megaface | 2015 | 74.6 % | accuracy + | Google - FaceNet v8 |
| CelebA | ? | ? | ? | ? |
| GTSRB | 2017 | 99.51 % | accuracy + | [Tho17-2] |

The state of the art in this category is CNN models that use skip connections, either in the form of residual connections or in the form of dense connections.
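For illustration, here is a minimal sketch of a residual block written with the Keras functional API; the filter count and input shape are arbitrary and only for illustration:

```python
from keras.layers import Activation, Add, BatchNormalization, Conv2D, Input
from keras.models import Model


def residual_block(x, filters=64):
    """Two 3x3 convolutions whose output is added to the input."""
    shortcut = x
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    y = Add()([shortcut, y])  # the residual / skip connection
    return Activation('relu')(y)


# Toy usage: 32x32 feature maps with 64 channels.
inputs = Input(shape=(32, 32, 64))
model = Model(inputs, residual_block(inputs))
model.summary()
```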

The evaluation metrics are straightforward:

  • Accuracy: Count how many elements of the test dataset you got right, divided by the total number of elements in the test dataset. The accuracy is in \([0, 1]\). Higher is better.
  • Error = 1 - accuracy. The error is in \([0, 1]\). Lower is better.
  • Top-k accuracy: Sometimes there are either extremely similar classes, or the application allows multiple guesses. Hence the top-1 guess of the network does not have to be right; it is enough if the correct label is within the top \(k\) guesses. The top-\(k\) accuracy is in \([0, 1]\). Higher is better. A small sketch of both metrics follows below.
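As a sketch, both metrics can be computed from a model's class scores like this (numpy only; the variable names are made up for illustration):

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels; in [0, 1], higher is better."""
    return np.mean(y_true == y_pred)


def top_k_accuracy(y_true, scores, k=5):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best guesses
    return np.mean([label in guesses for label, guesses in zip(y_true, top_k)])


# Toy example: 3 samples, 4 classes.
y_true = np.array([0, 2, 1])
scores = np.array([[0.7, 0.1, 0.1, 0.1],   # correct (class 0)
                   [0.4, 0.3, 0.2, 0.1],   # wrong (class 2 is only ranked 3rd)
                   [0.2, 0.5, 0.2, 0.1]])  # correct (class 1)
print(accuracy(y_true, scores.argmax(axis=1)))  # 0.666...
print(top_k_accuracy(y_true, scores, k=2))      # 0.666...
```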

Detection (Images)

Face recognition is a special case of detection.

Common metrics are:

  • mAP (Mean Average Precision): A detection is successful if the \(\frac{\text{intersection}}{\text{union}}\) (IU, IoU) ratio of the predicted and the true bounding box is at least 0.5. Then the average precision \(\frac{TP}{TP + FP}\) is calculated for each class, and the mean over all classes is taken (see Explanation, What does the notation mAP@[.5:.95] mean?). A small IoU sketch follows this list.
  • MR (miss rate)
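As a sketch, the IoU of two axis-aligned boxes can be computed like this (I assume the (x1, y1, x2, y2) corner format purely for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union


# A prediction counts as a true positive if iou(pred, truth) >= 0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143 -> no match
```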
| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| PASCAL VOC 2012 | 2015 | 75.9 | mAP@0.5 + | [RHGS15] |
| PASCAL VOC 2011 | 2014 | 62.7 | mean IU + | [LSD14] |
| PASCAL VOC 2010 | 2011 | 30.2 | mean accuracy + | [Kol11] |
| PASCAL VOC 2007 | 2015 | 71.6 | mAP@0.5 + | [LAES+15] |
| MS COCO | 2015 | 46.5 | mAP@0.5 + | [LAES+15] |
| CityPersons | 2017 | 33.10 | MR - | [ZBS17] |

Detection (Videos)

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| YouTube-BoundingBoxes | | | | |

Person Re-Identification

Person re-identification (re-ID) is the task of recognizing a person in a video stream who has already been seen earlier. Person following and multi-target multi-camera tracking (MTMCT) seem to be very similar, if not identical, tasks.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| Market-1501 | 2017 | 62.1 | mAP + | [SZDW17] |
| CUHK03 | 2017 | 84.8 | mAP + | [SZDW17] |
| DukeMTMC | 2017 | 56.8 | mAP + | [SZDW17] |

Semantic Segmentation

A summary of classical methods for semantic segmentation, as well as more information about several datasets and evaluation metrics, can be found in A Survey of Semantic Segmentation.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| MSRC-21 | 2011 | 84.7 | mean accuracy + | [Kol11] |
| KITTI Road | | 96.69 | Max F1 + | |
| NYUDv2 | 2014 | 34.0 | mean IU + | [LSD14] |
| SIFT Flow | 2014 | 39.5 | mean IU + | [LSD14] |
| DIARETDB1 | | | | |
| Warwick-QU | | | | |
| Ciona17 | 2017 | 51.36 % | mean IoU + | [GTRM17] |

Instance Segmentation

See [DHS15]

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| CityScapes | | | | |

Action Recognition

Action recognition is a classification problem over a short video clip.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| YouTube-8M | | | | |
| Sports-1M | 2015 | 68.7 % | Clip Hit@1 accuracy + | [NHV+15] |
| UCF-101 | 2015 | 70.8 % | Clip Hit@1 accuracy + | [NHV+15] |
| KTH | 2015 | 95.6 % | EER + | [RMRMD15] |
| UCF Sport | 2015 | 97.8 % | EER + | [RMRMD15] |
| UCF-11 Human Action | 2015 | 89.5 % | EER + | [RMRMD15] |

Super Resolution

See github.com/huangzehao

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|

I'm not sure how super resolution is benchmarked. One way to do it would be to take high-resolution images, scale them down, feed the downscaled versions to the network and measure the mean squared error for each pixel:

$$\frac{1}{|I|} \sum_{t \in I} {(t - \hat{t})}^2,$$

where \(I\) is the set of pixels, \(t\) is the true value of a pixel and \(\hat{t}\) is the predicted value. However, this might be sensitive to the way the images were downsampled.
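Here is a minimal sketch of that evaluation idea (numpy only; `upscale` stands for the super-resolution model under test and is hypothetical):

```python
import numpy as np


def mse(original, reconstruction):
    """Mean squared error per pixel; lower is better."""
    original = original.astype(np.float64)
    reconstruction = reconstruction.astype(np.float64)
    return np.mean((original - reconstruction) ** 2)


def downsample(image, factor=2):
    """Naive downsampling by dropping pixels; real benchmarks would
    rather use e.g. bicubic interpolation."""
    return image[::factor, ::factor]


# Hypothetical usage, where `upscale` is the model under test:
# high_res = load_image('test.png')  # (h, w) grayscale array
# prediction = upscale(downsample(high_res))
# print(mse(high_res, prediction))
```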

Lip Reading

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| GRID | 2016 | 95.2 % | accuracy + | [ASWF16] |

Other Datasets

For the following datasets, I was not able to find where to download them:

  • Mapping global urban areas using MODIS 500-m data: New methods and datasets based on urban ecoregions
  • TorontoCity: Seeing the World with a Million Eyes

ASR

Automatic Speech Recognition (ASR) is the task of transcribing spoken language into text.

Sentence-Level

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| WSJ (eval92) | 2015 | 3.47 % | WER - | [CL15] |
| Switchboard Hub5'00 | 2016 | 6.3 % | WER - | [XDSS+16] |

See Word Error Rate (WER) for an explanation of the metric.
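As a minimal sketch, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words:

```python
def wer(reference, hypothesis):
    """Word error rate; in [0, inf), lower is better."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein distance via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)


print(wer('the cat sat on the mat', 'the cat sat on mat'))  # 1 deletion / 6 ≈ 0.167
```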

Relevant papers might be:

  • Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Phoneme-Level

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| TIMIT | 2013 | 17.7 % | error rate - | [GMH13] |

Language

Natural Language Processing (NLP) deals with how to represent language. It is related to, and often a part of, ASR systems.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| WikiText-103 | 2016 | 48.7 | Perplexity - | [GJU16] |
| Penn Treebank (PTB) | 2016 | 62.4 | Perplexity - | [ZL16] (summary) |
| Stanford Sentiment Treebank | | | | |

Language modeling benchmarks use perplexity to measure how well a model predicts the test data; lower is better.
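As a reminder, for a test sequence \(w_1, \dots, w_N\) the perplexity of a model \(p\) is defined as

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1})\right),$$

i.e. the exponentiated average negative log-likelihood per token.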

Translation

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| MT03 | 2003 | 35.76 | BLEU + | [OGKS+03] |

The BLEU score measures the n-gram overlap between a system's translation and one or more reference translations. Higher is better.
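As a quick sketch, a sentence-level BLEU score can be computed with NLTK (the sentences are made up; I only use 1- and 2-gram precisions because the toy example is so short):

```python
from nltk.translate.bleu_score import sentence_bleu

# One or more reference translations and a candidate, all tokenized.
references = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Weights over the n-gram precisions; here uniform over 1- and 2-grams.
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))  # ≈ 0.71
```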

Another score is the Translation Edit Rate (TER) introduced by Snover et al., 2006.

Matrix completion

Collaborative filtering is an application of matrix completion. More datasets are on entaroadun/gist:1653794.
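To make this concrete, here is a toy sketch of matrix completion via iterative low-rank SVD approximation (soft/hard-impute style); the rating matrix is made-up data and 0 marks a missing entry:

```python
import numpy as np

# Toy user-item rating matrix; 0 marks a missing entry (hypothetical data).
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
mask = R > 0

# Fill missing entries with the mean, take a rank-k SVD,
# and repeat until the fill-in stabilizes.
X = np.where(mask, R, R[mask].mean())
k = 2
for _ in range(100):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
    X = np.where(mask, R, low_rank)  # keep observed entries fixed

print(np.round(X, 2))  # completed matrix
```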

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| MovieLens | | | | |
| Jester | | | | |

Reinforcement Learning

The OpenAI Gym offers many environments for testing RL algorithms.
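For illustration, here is a random agent on one of these environments, using the classic Gym API (note that newer Gym/Gymnasium versions changed the reset and step signatures):

```python
import gym

env = gym.make('CartPole-v0')
for episode in range(3):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # random policy, just to show the API
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print('Episode {}: total reward {}'.format(episode, total_reward))
env.close()
```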

| Challenge | Year | Score | Type | Paper |
|---|---|---|---|---|
| Chess | | 3395 | Elo + | Stockfish (stockfishchess.org) |
| Go | 2015 | 3,168 | Elo + | AlphaGo |
| Dota | 2018 | - | A few matches | OpenAI Five |

Control

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| Cart Pole | | | | |

See also

  • Are we there yet ?
  • Some state-of-the-arts in natural language processing and their discussion
  • aclweb.org: State of the art - NLP tasks
  • wer_are_we: SotA in ASR
  • github.com/michalwols/ml-sota

More datasets

  • List of datasets for machine learning research
  • traffic-signs-dataset
  • Stanford Dogs
  • Awesome Public Datasets
  • archive.ics.uci.edu/ml/datasets.html
  • Tiny ImageNet Visual Recognition Challenge

Published

Feb 6, 2017
by Martin Thoma
