
State of the Art in ML

Contents

  • State of the Art in ML
    • Computer Vision
      • Image Classification
      • Detection (Images)
      • Detection (Videos)
      • Person Re-Identification
      • Semantic Segmentation
      • Instance Segmentation
      • Action Recognition
      • Super Resolution
      • Lip Reading
      • Other Datasets
    • ASR
      • Sentence-Level
      • Phoneme-Level
    • Language
    • Translation
    • Matrix completion
    • Reinforcement Learning
    • Control
    • See also

It is difficult to keep track of the current state of the art (SotA). Also, it might not be immediately clear which datasets are relevant. The following list should help. If you think some datasets / problems / SotA results are missing, let me know in the comments or via e-mail. I will update the list.

Papers and blog posts which summarize a topic or give a good introduction are always welcome.

In the following, a + will indicate "higher is better" and a - will indicate "lower is better".

Computer Vision

Image Classification

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| ImageNet 2012 | 2015 | 3.08 % | Top-5 error - | [SIVA16] |
| MNIST | 2013 | 0.21 % | error - | [WZZ+13] |
| CIFAR-10 | 2017 | 2.72 % | error - | [G17] |
| CIFAR-100 | 2016 | 15.85 % | error - | [G17] |
| STL-10 | 2017 | 78.66 % | accuracy + | [Tho17-2] |
| SVHN | 2016 | 1.54 % | error - | [ZK16] |
| Caltech-101 | 2014 | 91.4 % | accuracy + | [HZRS14] |
| Caltech-256 | 2014 | 74.2 % | accuracy + | [ZF14] |
| HASYv2 | 2017 | 85.92 % | accuracy + | [Tho17-2] |
| Graz-02 | 2010 | 78.98 % | accuracy + | [BMDP10] |
| YFCC100m | | | | |
| CUB-200-2011 Birds | 2015 | 84.1 | accuracy + | [LRM15] |
| DISFA | 2017 | 48.5 | accuracy + | [LAZY17] |
| BP4D | | | | |
| EMNIST | 2017 | 50.93 | accuracy + | [CATS17] |
| Megaface | 2015 | 74.6 % | accuracy + | Google - FaceNet v8 |
| CelebA | ? | ? | ? | ? |
| GTSRB | 2017 | 99.51 % | accuracy + | [Tho17-2] |

The state of the art in this category is CNN models that use skip connections, either in the form of residual connections or in the form of dense connections.
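For illustration, here is a minimal sketch of a residual block written with the Keras functional API; the filter count and input shape are arbitrary and only for illustration:

```python
from keras.layers import Activation, Add, BatchNormalization, Conv2D, Input
from keras.models import Model


def residual_block(x, filters=64):
    """Two 3x3 convolutions whose output is added to the input."""
    shortcut = x
    y = Conv2D(filters, (3, 3), padding='same')(x)
    y = BatchNormalization()(y)
    y = Activation('relu')(y)
    y = Conv2D(filters, (3, 3), padding='same')(y)
    y = BatchNormalization()(y)
    y = Add()([shortcut, y])  # the residual / skip connection
    return Activation('relu')(y)


# Toy usage: 32x32 feature maps with 64 channels.
inputs = Input(shape=(32, 32, 64))
model = Model(inputs, residual_block(inputs))
model.summary()
```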

The evaluation metrics are straightforward:

  • Accuracy: Count how many elements of the test dataset you got right, divided by the total number of elements in the test dataset. The accuracy is in \([0, 1]\). Higher is better.
  • Error = 1 - accuracy. The error is in \([0, 1]\). Lower is better.
  • Top-k accuracy: Sometimes there are either extremely similar classes, or the application allows multiple guesses. Hence the top-1 guess of the network does not have to be right; it is enough if the correct label is within the top \(k\) guesses. The top-\(k\) accuracy is in \([0, 1]\). Higher is better. A small sketch of both metrics follows below.
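As a sketch, both metrics can be computed from a model's class scores like this (numpy only; the variable names are made up for illustration):

```python
import numpy as np


def accuracy(y_true, y_pred):
    """Fraction of correctly predicted labels; in [0, 1], higher is better."""
    return np.mean(y_true == y_pred)


def top_k_accuracy(y_true, scores, k=5):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]  # indices of the k best guesses
    return np.mean([label in guesses for label, guesses in zip(y_true, top_k)])


# Toy example: 3 samples, 4 classes.
y_true = np.array([0, 2, 1])
scores = np.array([[0.7, 0.1, 0.1, 0.1],   # correct (class 0)
                   [0.4, 0.3, 0.2, 0.1],   # wrong (class 2 is only ranked 3rd)
                   [0.2, 0.5, 0.2, 0.1]])  # correct (class 1)
print(accuracy(y_true, scores.argmax(axis=1)))  # 0.666...
print(top_k_accuracy(y_true, scores, k=2))      # 0.666...
```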

Detection (Images)

Face recognition is a special case of detection.

Common metrics are:

  • mAP (Mean Average Precision): A detection is successful if the \(\frac{\text{intersection}}{\text{union}}\) (IU, IoU) ratio of the predicted and the true bounding box is at least 0.5. Then the average precision \(\frac{TP}{TP + FP}\) is calculated for each class, and the mean over all classes is taken (see Explanation, What does the notation mAP@[.5:.95] mean?). A small IoU sketch follows this list.
  • MR (miss rate)
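As a sketch, the IoU of two axis-aligned boxes can be computed like this (I assume the (x1, y1, x2, y2) corner format purely for illustration):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union


# A prediction counts as a true positive if iou(pred, truth) >= 0.5.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143 -> no match
```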
| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| PASCAL VOC 2012 | 2015 | 75.9 | mAP@0.5 + | [RHGS15] |
| PASCAL VOC 2011 | 2014 | 62.7 | mean IU + | [LSD14] |
| PASCAL VOC 2010 | 2011 | 30.2 | mean accuracy + | [Kol11] |
| PASCAL VOC 2007 | 2015 | 71.6 | mAP@0.5 + | [LAES+15] |
| MS COCO | 2015 | 46.5 | mAP@0.5 + | [LAES+15] |
| CityPersons | 2017 | 33.10 | MR - | [ZBS17] |

Detection (Videos)

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| YouTube-BoundingBoxes | | | | |

Person Re-Identification

Person re-identification (re-ID) is the task of recognizing a person in a video stream who has already been seen earlier. Person following and multi-target multi-camera tracking (MTMCT) seem to be very similar, if not identical, tasks.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| Market-1501 | 2017 | 62.1 | mAP + | [SZDW17] |
| CUHK03 | 2017 | 84.8 | mAP + | [SZDW17] |
| DukeMTMC | 2017 | 56.8 | mAP + | [SZDW17] |

Semantic Segmentation

A summary of classical methods for semantic segmentation, as well as more information about several datasets and evaluation metrics, can be found in A Survey of Semantic Segmentation.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| MSRC-21 | 2011 | 84.7 | mean accuracy + | [Kol11] |
| KITTI Road | | 96.69 | Max F1 + | |
| NYUDv2 | 2014 | 34.0 | mean IU + | [LSD14] |
| SIFT Flow | 2014 | 39.5 | mean IU + | [LSD14] |
| DIARETDB1 | | | | |
| Warwick-QU | | | | |
| Ciona17 | 2017 | 51.36 % | mean IoU + | [GTRM17] |

Instance Segmentation

See [DHS15]

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| CityScapes | | | | |

Action Recognition

Action recognition is a classification problem over a short video clip.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| YouTube-8M | | | | |
| Sports-1M | 2015 | 68.7 % | Clip Hit@1 accuracy + | [NHV+15] |
| UCF-101 | 2015 | 70.8 % | Clip Hit@1 accuracy + | [NHV+15] |
| KTH | 2015 | 95.6 % | EER + | [RMRMD15] |
| UCF Sport | 2015 | 97.8 % | EER + | [RMRMD15] |
| UCF-11 Human Action | 2015 | 89.5 % | EER + | [RMRMD15] |

Super Resolution

See github.com/huangzehao

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|

I'm not sure how super resolution is benchmarked. One way to do it would be to take high-resolution images, scale them down, feed the downscaled versions to the network and measure the mean squared error for each pixel:

$$\frac{1}{|I|} \sum_{t \in I} {(t - \hat{t})}^2,$$

where \(I\) is the set of pixels, \(t\) is the true value of a pixel and \(\hat{t}\) is the predicted value. However, this might be sensitive to the way the images were downsampled.
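Here is a minimal sketch of that evaluation idea (numpy only; `upscale` stands for the super-resolution model under test and is hypothetical):

```python
import numpy as np


def mse(original, reconstruction):
    """Mean squared error per pixel; lower is better."""
    original = original.astype(np.float64)
    reconstruction = reconstruction.astype(np.float64)
    return np.mean((original - reconstruction) ** 2)


def downsample(image, factor=2):
    """Naive downsampling by dropping pixels; real benchmarks would
    rather use e.g. bicubic interpolation."""
    return image[::factor, ::factor]


# Hypothetical usage, where `upscale` is the model under test:
# high_res = load_image('test.png')  # (h, w) grayscale array
# prediction = upscale(downsample(high_res))
# print(mse(high_res, prediction))
```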

Lip Reading

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| GRID | 2016 | 95.2 % | accuracy + | [ASWF16] |

Other Datasets

For the following datasets, I was not able to find where to download them:

  • Mapping global urban areas using MODIS 500-m data: New methods and datasets based on urban ecoregions
  • TorontoCity: Seeing the World with a Million Eyes

ASR

Automatic Speech Recognition (ASR) is the task of transcribing spoken language into text.

Sentence-Level

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| WSJ (eval92) | 2015 | 3.47 % | WER - | [CL15] |
| Switchboard Hub5'00 | 2016 | 6.3 % | WER - | [XDSS+16] |

See Word Error Rate (WER) for an explanation of the metric.
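As a minimal sketch, WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words:

```python
def wer(reference, hypothesis):
    """Word error rate; in [0, inf), lower is better."""
    r, h = reference.split(), hypothesis.split()
    # Standard Levenshtein distance via dynamic programming.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)


print(wer('the cat sat on the mat', 'the cat sat on mat'))  # 1 deletion / 6 ≈ 0.167
```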

Relevant papers might be:

  • Deep Speech 2: End-to-End Speech Recognition in English and Mandarin

Phoneme-Level

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| TIMIT | 2013 | 17.7 % | error rate - | [GMH13] |

Language

Natural Language Processing (NLP) deals with how to represent language. It is related to, and often a part of, ASR systems.

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| WikiText-103 | 2016 | 48.7 | Perplexity - | [GJU16] |
| Penn Treebank (PTB) | 2016 | 62.4 | Perplexity - | [ZL16] (summary) |
| Stanford Sentiment Treebank | | | | |

Language modeling benchmarks use perplexity to measure how well a model predicts the test data; lower is better.
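As a reminder, for a test sequence \(w_1, \dots, w_N\) the perplexity of a model \(p\) is defined as

$$\text{PPL} = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \dots, w_{i-1})\right),$$

i.e. the exponentiated average negative log-likelihood per token.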

Translation

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| MT03 | 2003 | 35.76 | BLEU + | [OGKS+03] |

The BLEU score measures the n-gram overlap between a system's translation and one or more reference translations. Higher is better.
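As a quick sketch, a sentence-level BLEU score can be computed with NLTK (the sentences are made up; I only use 1- and 2-gram precisions because the toy example is so short):

```python
from nltk.translate.bleu_score import sentence_bleu

# One or more reference translations and a candidate, all tokenized.
references = [['the', 'cat', 'is', 'on', 'the', 'mat']]
candidate = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# Weights over the n-gram precisions; here uniform over 1- and 2-grams.
print(sentence_bleu(references, candidate, weights=(0.5, 0.5)))  # ≈ 0.71
```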

Another score is the Translation Edit Rate (TER) introduced by Snover et al., 2006.

Matrix completion

Collaborative filtering is an application of matrix completion. More datasets are on entaroadun/gist:1653794.
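To make this concrete, here is a toy sketch of matrix completion via iterative low-rank SVD approximation (soft/hard-impute style); the rating matrix is made-up data and 0 marks a missing entry:

```python
import numpy as np

# Toy user-item rating matrix; 0 marks a missing entry (hypothetical data).
R = np.array([[5., 3., 0., 1.],
              [4., 0., 0., 1.],
              [1., 1., 0., 5.],
              [0., 1., 5., 4.]])
mask = R > 0

# Fill missing entries with the mean, take a rank-k SVD,
# and repeat until the fill-in stabilizes.
X = np.where(mask, R, R[mask].mean())
k = 2
for _ in range(100):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    low_rank = (U[:, :k] * s[:k]) @ Vt[:k, :]
    X = np.where(mask, R, low_rank)  # keep observed entries fixed

print(np.round(X, 2))  # completed matrix
```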

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| MovieLens | | | | |
| Jester | | | | |

Reinforcement Learning

The OpenAI Gym offers many environments for testing RL algorithms.
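For illustration, here is a random agent on one of these environments, using the classic Gym API (note that newer Gym/Gymnasium versions changed the reset and step signatures):

```python
import gym

env = gym.make('CartPole-v0')
for episode in range(3):
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()  # random policy, just to show the API
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print('Episode {}: total reward {}'.format(episode, total_reward))
env.close()
```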

| Challenge | Year | Score | Type | Paper |
|---|---|---|---|---|
| Chess | | 3395 | Elo + | Stockfish (stockfishchess.org) |
| Go | 2015 | 3,168 | Elo + | AlphaGo |
| Dota | 2018 | - | A few matches | OpenAI Five |

Control

| Dataset | Year | Score | Type | Paper |
|---|---|---|---|---|
| Cart Pole | | | | |

See also

  • Are we there yet ?
  • Some state-of-the-arts in natural language processing and their discussion
  • aclweb.org: State of the art - NLP tasks
  • wer_are_we: SotA in ASR
  • github.com/michalwols/ml-sota

More datasets

  • List of datasets for machine learning research
  • traffic-signs-dataset
  • Stanford Dogs
  • Awesome Public Datasets
  • archive.ics.uci.edu/ml/datasets.html
  • Tiny ImageNet Visual Recognition Challenge

Published

Feb 6, 2017
by Martin Thoma
