It is difficult to keep track of the current state of the art (SotA). Also, it might not be immediately clear which datasets are relevant. The following list should help. If you think some datasets, problems, or SotA results are missing, let me know in the comments or via e-mail ([email protected]) and I will update it.
Papers and blog posts which summarize a topic or give a good introduction are always welcome.
In the following, a + will indicate "higher is better" and a - will indicate "lower is better".
Computer Vision
Image Classification
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
ImageNet 2012 | 2015 | 3.08 % | Top-5 error - | [SIVA16] |
MNIST | 2013 | 0.21 % | error - | [WZZ+13] |
CIFAR-10 | 2017 | 2.72 % | error - | [G17] |
CIFAR-100 | 2016 | 15.85 % | error - | [G17] |
STL-10 | 2017 | 78.66 % | accuracy + | [Tho17-2] |
SVHN | 2016 | 1.54 % | error - | [ZK16] |
Caltech-101 | 2014 | 91.4 % | accuracy + | [HZRS14] |
Caltech-256 | 2014 | 74.2 % | accuracy + | [ZF14] |
HASYv2 | 2017 | 85.92 % | accuracy + | [Tho17-2] |
Graz-02 | 2010 | 78.98 % | accuracy + | [BMDP10] |
YFCC100m | | | | |
CUB-200-2011 Birds | 2015 | 84.1 % | accuracy + | [LRM15] |
DISFA | 2017 | 48.5 % | accuracy + | [LAZY17] |
BP4D | | | | |
EMNIST | 2017 | 50.93 % | accuracy + | [CATS17] |
Megaface | 2015 | 74.6 % | accuracy + | Google - FaceNet v8 |
CelebA | ? | ? | ? | ? |
GTSRB | 2017 | 99.51 % | accuracy + | [Tho17-2] |
The state of the art in this category is achieved by CNN models that use skip connections, either in the form of residual connections or dense connections.
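Such a skip connection simply adds a block's input back onto its output. A minimal sketch of a residual block in PyTorch (generic, not the architecture of any specific paper; the channel count and layer choices are made up):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = ReLU(F(x) + x), where x is the block input."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # the skip connection: add the input back on

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```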
The evaluation metrics are straightforward (a small code sketch follows this list):
- Accuracy: Count how many elements of the test dataset you got right, divided by the total number of elements in the test dataset. The accuracy is in \([0, 1]\). Higher is better.
- Error = 1 - accuracy. The error is in \([0, 1]\). Lower is better.
- Top-k accuracy: Sometimes there are either extremely similar classes or the application allows multiple guesses. Hence the top-1 guess of the network does not have to be right; instead, the correct label has to be within the top \(k\) guesses. The top-\(k\) accuracy is in \([0, 1]\). Higher is better.
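A minimal NumPy sketch of these metrics (the labels and scores below are made-up examples):

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly classified test samples; in [0, 1], higher is better."""
    return float(np.mean(y_true == y_pred))

def top_k_accuracy(y_true, scores, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    top_k = np.argsort(scores, axis=1)[:, -k:]            # indices of the k best guesses
    hits = [label in guesses for label, guesses in zip(y_true, top_k)]
    return float(np.mean(hits))

y_true = np.array([0, 2, 1])
scores = np.array([[0.6, 0.3, 0.1],   # top-1 guess is class 0: correct
                   [0.4, 0.1, 0.5],   # top-1 guess is class 2: correct
                   [0.7, 0.2, 0.1]])  # class 1 is only the second-best guess

print(accuracy(y_true, scores.argmax(axis=1)))  # 0.666..., so the error is 0.333...
print(top_k_accuracy(y_true, scores, k=2))      # 1.0
```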
Detection (Images)
Face recognition is a special case of detection.
Common metrics are:
- mAP (Mean Average Precision): A detection is successful if the \(\frac{\text{intersection}}{\text{union}}\) (IU, IoU) ratio of the predicted bounding box and the true bounding box is at least 0.5. Then the average precision \(\frac{TP}{TP + FP}\) is calculated for each class and the mean over all classes is taken (see Explanation, What does the notation mAP@[.5:.95] mean?). A minimal IoU computation is sketched after this list.
- MR (miss rate)
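A minimal sketch of the IoU computation for two axis-aligned boxes given as (x1, y1, x2, y2); the coordinates below are made up:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    intersection = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

# A prediction is counted as a true positive if its IoU with the ground truth is >= 0.5.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / (100 + 100 - 50) = 0.333...
```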
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
PASCAL VOC 2012 | 2015 | 75.9 | mAP@0.5 + | [RHGS15] |
PASCAL VOC 2011 | 2014 | 62.7 | mean IU + | [LSD14] |
PASCAL VOC 2010 | 2011 | 30.2 | mean accuracy + | [Kol11] |
PASCAL VOC 2007 | 2015 | 71.6 | mAP@0.5 + | [LAES+15] |
MS COCO | 2015 | 46.5 | mAP@0.5 + | [LAES+15] |
CityPersons | 2017 | 33.10 | MR - | [ZBS17] |
Detection (Videos)
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
YouTube-BoundingBoxes | | | | |
Person Re-Identification
Person re-identification (re-ID) is the task of recognizing a person who has already been seen in a video stream. Person following and multi-target multi-camera tracking (MTMCT) seem to be very similar, if not identical, problems.
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
Market-1501 | 2017 | 62.1 | mAP + | [SZDW17] |
CUHK03 | 2017 | 84.8 | mAP + | [SZDW17] |
DukeMTMC | 2017 | 56.8 | mAP + | [SZDW17] |
Semantic Segmentation
A summary of classical methods for semantic segmentation, more information on several datasets, and metrics for evaluation can be found in A Survey of Semantic Segmentation.
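As a rough sketch, the mean IU / mean IoU metric in the table below averages the per-class intersection over union of the predicted and the ground-truth label map (the tiny label maps here are made up):

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection over union over all classes, for pixel-wise label maps."""
    ious = []
    for c in range(num_classes):
        intersection = np.logical_and(y_true == c, y_pred == c).sum()
        union = np.logical_or(y_true == c, y_pred == c).sum()
        if union > 0:                       # skip classes absent from both maps
            ious.append(intersection / union)
    return float(np.mean(ious))

y_true = np.array([[0, 0, 1],
                   [1, 1, 2]])
y_pred = np.array([[0, 1, 1],
                   [1, 1, 2]])
print(mean_iou(y_true, y_pred, num_classes=3))  # (0.5 + 0.75 + 1.0) / 3 = 0.75
```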
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
MSRC-21 | 2011 | 84.7 | mean accuracy + | [Kol11] |
KITTI Road | | 96.69 | Max F1 + | |
NYUDv2 | 2014 | 34.0 | mean IU + | [Kol11] |
SIFT Flow | 2014 | 39.5 | mean IU + | [LSD14] |
DIARETDB1 | | | | |
Warwick-QU | | | | |
Ciona17 | 2017 | 51.36 % | mean IoU + | [GTRM17] |
Instance Segmentation
See [DHS15]
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
Cityscapes | | | | |
Action Recognition
Action recognition is a classification problem over a short video clip.
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
YouTube-8M | | | | |
Sports-1M | 2015 | 68.7 % | Clip Hit@1 accuracy + | [NHV+15] |
UCF-101 | 2015 | 70.8 % | Clip Hit@1 accuracy + | [NHV+15] |
KTH | 2015 | 95.6 % | EER + | [RMRMD15] |
UCF Sport | 2015 | 97.8 % | EER + | [RMRMD15] |
UCF-11 Human Action | 2015 | 89.5 % | EER + | [RMRMD15] |
Super Resolution
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
I'm not sure how super resolution is benchmarked. One way to do it would be to take high-resolution images, scale them down, feed them to the network, and measure the mean squared error (MSE) for each pixel:
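Assuming a single-channel image of height \(h\) and width \(w\) with ground-truth pixels \(y_{ij}\) and predicted pixels \(\hat{y}_{ij}\), that would be

\[\text{MSE} = \frac{1}{h \cdot w} \sum_{i=1}^{h} \sum_{j=1}^{w} \left( y_{ij} - \hat{y}_{ij} \right)^2\]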
However, this might be sensitive to the way the images were downsampled.
Lip Reading
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
GRID | 2016 | 95.2 % | accuracy + | [ASWF16] |
Other Datasets
For the following datasets, I was not able to find where to download them:
- Mapping global urban areas using MODIS 500-m data: New methods and datasets based on urban ecoregions
- TorontoCity: Seeing the World with a Million Eyes
ASR
ASR stands for Automatic Speech Recognition: transcribing spoken audio into text.
Sentence-Level
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
WSJ (eval92) | 2015 | 3.47 % | WER - | [CL15] |
Switchboard Hub5'00 | 2016 | 6.3 % | WER - | [XDSS+16] |
See Word Error Rate (WER) for an explanation of the metric.
Relevant papers might be
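As a quick illustration, the WER is the word-level edit distance (substitutions + insertions + deletions) between the recognized text and the reference transcript, divided by the number of words in the reference. A minimal sketch (the example sentences are made up):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance on the word level via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat mat"))  # 2 deletions / 6 words = 0.333...
```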
Phoneme-Level
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
TIMIT | 2013 | 17.7 % | error rate - | [GMH13] |
Language
Natural Language Processing (NLP) deals with how to represent language. It is related to, and often a part of, ASR.
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
WikiText-103 | 2016 | 48.7 | Perplexity - | [GJU16] |
Penn Treebank (PTB) | 2016 | 62.4 | Perplexity - | [ZL16] (summary) |
Stanford Sentiment Treebank | | | | |
The language modeling benchmarks above use perplexity to measure how good a result is; lower is better.
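As a quick sketch, the perplexity is the exponential of the average negative log-likelihood that the model assigns to the test tokens (the probabilities below are made up):

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-likelihood per token.

    log_probs holds the natural-log probability the model assigns to each
    token of the test corpus, given its preceding context.
    """
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 0.25 to every token has perplexity 4.
print(perplexity([math.log(0.25)] * 4))  # 4.0
```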
Translation
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
MT03 | 2003 | 35.76 | BLEU + | [OGKS+03] |
The BLEU score measures how close a system translation is to one or more reference translations; higher is better.
Another score is the Translation Edit Rate (TER) introduced by Snover et al., 2006.
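As a quick single-sentence illustration of BLEU, here is a sketch using NLTK; real evaluations compute corpus-level BLEU over the full test set, and the tokens below are made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]   # list of reference translations
hypothesis = ["the", "cat", "sat", "on", "the", "mat"]   # system output

# Default weights average 1- to 4-gram precisions; smoothing avoids a zero
# score when some higher-order n-grams have no match in a short sentence.
score = sentence_bleu(reference, hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(score)
```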
Matrix completion
Collaborative filtering is an application of matrix completion. More datasets are on entaroadun/gist:1653794.
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
MovieLens | | | | |
Jester | | | | |
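To illustrate the matrix completion setting behind collaborative filtering, here is a minimal low-rank factorization sketch trained by SGD on the observed entries only (the rating matrix, rank, and hyperparameters are made up; this is not any particular paper's method):

```python
import numpy as np

# Tiny user x item rating matrix; 0 marks a missing entry that should be predicted.
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0

rng = np.random.default_rng(0)
k = 2                                    # assumed rank of the completed matrix
U = rng.normal(scale=0.1, size=(R.shape[0], k))
V = rng.normal(scale=0.1, size=(R.shape[1], k))

lr, reg = 0.01, 0.02
for _ in range(5000):                    # SGD over the observed entries only
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - U[i] @ V[j]
        u_i = U[i].copy()
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * u_i - reg * V[j])

print(np.round(U @ V.T, 1))              # the zero entries are now filled in
```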
Reinforcement Learning
The OpenAI Gym offers many environments for testing RL algorithms.
Challenge | Year | Score | Type | Paper |
---|---|---|---|---|
Chess | | 3395 | ELO + | Stockfishchess |
Go | 2015 | 3,168 | ELO + | AlphaGo |
Dota | 2018 | - | A few matches | OpenAI Five |
Control
Dataset | Year | Score | Type | Paper |
---|---|---|---|---|
Cart Pole | | | | |
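A usage sketch of such an environment, assuming the classic `gym` API where `reset()` returns an observation and `step()` returns a 4-tuple (the newer `gymnasium` package differs slightly):

```python
import gym

env = gym.make("CartPole-v1")
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                    # a random policy as a baseline
    observation, reward, done, info = env.step(action)
    total_reward += reward
print("Episode return of the random policy:", total_reward)
env.close()
```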
See also
- Are we there yet?
- Some state-of-the-arts in natural language processing and their discussion
- aclweb.org: State of the art - NLP tasks
- wer_are_we: SotA in ASR
- github.com/michalwols/ml-sota
- More datasets