Machine learning algorithms for computer vision need huge amounts of data. Here are a few remarks on how to download such datasets.
- Make sure you have enough space (df -h)
- Get a download manager. I use aria2c (sudo apt-get install aria2)
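If you would rather check the free space from a script before starting a long download, something like the following might do. It is only a minimal sketch: the function names and the `required_gib` parameter are my own placeholders, and you have to fill in the size you actually need yourself.

```python
import shutil


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem containing `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.free / 2**30


def has_room_for(required_gib, path="."):
    """True if at least `required_gib` GiB are free (hypothetical helper)."""
    return free_gib(path) >= required_gib


if __name__ == "__main__":
    print(f"{free_gib('.'):.1f} GiB free in the current directory")
```

Keep in mind that you need room for both the downloaded archive and the extracted files, unless you delete the archive in between.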
For ImageNet, you have to register at image-net.org.
Download the files like this:
$ aria2c -x 16 -s 16 [URL]
After downloading the file, use
$ md5sum [Filename]
and compare the hash with the provided hash. If it differs, download the file again.
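The same check can be scripted, which is handy when there are several files to verify. This is a sketch using only the standard library; `md5_of_file` and `matches` are names I made up, and the expected hash is whatever the dataset provider lists. The file is read in chunks so that a very large archive does not have to fit into memory.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """True if the file's MD5 equals the provided hash (case-insensitive)."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call would look like `matches("[Filename]", "[provided hash]")`; if it returns False, download the file again.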
The ImageNet training data tar file contains 1000 files of the form
n01440764.tar, n01443537.tar, ...
Each of those files contains JPEGs of one class. You can look the class label up with
from nltk.corpus import wordnet as wn
print(wn.synset_from_pos_and_offset("n", 1440764))
print(wn.synset_from_pos_and_offset("n", 1443537))
which reveals
- Synset('tench.n.01')
- Synset('goldfish.n.01')
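To go straight from a file name to a label, the ID can be split into its part-of-speech letter and the numeric offset. The following is a rough sketch with helper names of my own choosing. It assumes NLTK and its WordNet corpus are installed and that your NLTK version has the public `synset_from_pos_and_offset` method; the parsing helpers themselves work without NLTK.

```python
import os
import re


def wnid_from_filename(fname):
    """'ILSVRC2012_img_train/n01440764.tar' -> 'n01440764'."""
    return os.path.splitext(os.path.basename(fname))[0]


def parse_wnid(wnid):
    """Split an ID such as 'n01440764' into ('n', 1440764)."""
    if not re.fullmatch(r"n[0-9]{8}", wnid):
        raise ValueError(f"not a noun WordNet ID: {wnid!r}")
    return wnid[0], int(wnid[1:])


def wnid_to_synset(wnid):
    """Look the ID up in WordNet (assumes NLTK plus the WordNet corpus)."""
    # Imported lazily so the helpers above work without NLTK installed
    from nltk.corpus import wordnet as wn

    pos, offset = parse_wnid(wnid)
    return wn.synset_from_pos_and_offset(pos, offset)
```

With that, `wnid_to_synset(wnid_from_filename("n01440764.tar"))` should give the same synset as the direct lookup.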
Extracting all 1000 of those tar files into one directory takes about 6 hours with a script like this:
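#!/usr/bin/env python
import glob
import tarfile


def untar(fname, target_dir):
    """Extract every member of the tar archive fname into target_dir."""
    with tarfile.open(fname) as tar:
        tar.extractall(path=target_dir)


# One tar file per class; all images end up in the same directory
files = glob.glob("ILSVRC2012_img_train/*.tar")
for f in files:
    untar(f, "extracted")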
This gives 1281170 files in total.
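After a run that takes hours, it is worth checking that the count really matches. Here is a small sketch for that. It assumes the images have the `.JPEG` suffix and that each file name starts with the class ID followed by an underscore (as in `n01440764_10026.JPEG`); the function names are mine.

```python
import os
from collections import Counter


def count_per_class(root, suffix=".JPEG"):
    """Count image files below `root`, grouped by the ID prefix of the name."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(suffix.lower()):
                # Assumption: 'n01440764_10026.JPEG' belongs to class 'n01440764'
                counts[name.split("_", 1)[0]] += 1
    return counts


def count_files(root, suffix=".JPEG"):
    """Total number of image files below `root`."""
    return sum(count_per_class(root, suffix).values())


if __name__ == "__main__":
    print(count_files("extracted"))
```

If the total is lower than expected, the per-class counts show which tar file to extract again.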