I really like Machine Learning. I like reading papers, understanding and
evaluating new ideas. But one part I always have to spend quite a bit of time
on is loading the data. It's always a mess to find the datasets, understand
where exactly I can download them and how they've packaged the information.
Just a few days ago I found skdata. It is a Python package which aims at
helping to load standard datasets. If I can trust the git commit message, then
the development was started in August 2011 by James Bergstra! This is before
Edit: Although it seemed to be a cool project, it appears to be dead, too. The last commit is from July 2015.
Usage
One way to use skdata is the following:
#!/usr/bin/env python
"""MNIST example with skdata."""
try:
    from skdata.mnist.view import OfficialVectorClassification
except ImportError:
    # Fallback, if you have an old version
    from skdata.mnist.views import OfficialVectorClassification
from sklearn.tree import DecisionTreeClassifier

# Load the data
view = OfficialVectorClassification()
train_idx = view.fit_idxs  # indices of training data
val_idx = view.val_idxs  # indices of validation data (not used below)
test_idx = view.tst_idxs  # indices of test data

# Fit a simple classifier
print("Start fitting DecisionTreeClassifier.")
clf = DecisionTreeClassifier(max_depth=5)
features = view.all_vectors[train_idx]  # select features of training data
targets = view.all_labels[train_idx]  # select labels of training data
clf.fit(features, targets)

# Evaluate the classifier on the test split
predict = clf.predict(view.all_vectors[test_idx])
accuracy = sum(predict == view.all_labels[test_idx]) / float(len(test_idx))
print("Fitted DecisionTreeClassifier has test accuracy of %0.4f." % accuracy)
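The pattern in the example above is that the view keeps all samples in one place and exposes the splits as index arrays. To make that explicit, here is a minimal sketch in plain Python that does not need skdata at all. `ToyView`, `select`, `accuracy` and the threshold "classifier" are hypothetical names and toy data I made up for illustration; only the attribute names (`all_vectors`, `all_labels`, `fit_idxs`, `val_idxs`, `tst_idxs`) are taken from the example above. With NumPy arrays, `select(values, idxs)` is simply `values[idxs]`.

```python
"""Toy illustration of index-based dataset splits (this is not skdata)."""


class ToyView:
    """Hypothetical stand-in for a dataset view; made-up data."""

    def __init__(self):
        # Ten one-dimensional samples; the label is 1 when the value is >= 5.
        self.all_vectors = [[float(i)] for i in range(10)]
        self.all_labels = [int(i >= 5) for i in range(10)]
        # Disjoint index lists for training, validation and test.
        self.fit_idxs = [0, 1, 2, 5, 6, 7]
        self.val_idxs = [4, 8]
        self.tst_idxs = [3, 9]


def select(values, idxs):
    """Pick the entries of `values` at the positions listed in `idxs`."""
    return [values[i] for i in idxs]


def accuracy(predicted, expected):
    """Fraction of positions where prediction and label agree."""
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / float(len(expected))


view = ToyView()
train_features = select(view.all_vectors, view.fit_idxs)
train_labels = select(view.all_labels, view.fit_idxs)

# A trivial "classifier": threshold halfway between the two class means.
mean0 = sum(x[0] for x, y in zip(train_features, train_labels) if y == 0)
mean0 /= float(train_labels.count(0))
mean1 = sum(x[0] for x, y in zip(train_features, train_labels) if y == 1)
mean1 /= float(train_labels.count(1))
threshold = (mean0 + mean1) / 2.0

# Evaluate on the test split only; the validation split stays untouched.
test_features = select(view.all_vectors, view.tst_idxs)
test_labels = select(view.all_labels, view.tst_idxs)
test_predictions = [int(x[0] >= threshold) for x in test_features]
test_accuracy = accuracy(test_predictions, test_labels)
print("Toy test accuracy: %0.4f" % test_accuracy)
```

The nice thing about this layout is that the data is stored once and every split is just a list of positions, so swapping the test split for the validation split is a one-word change.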
However, it is intended to be used like this:
from skdata.mnist.view import OfficialVectorClassification
from sklearn.tree import DecisionTreeClassifier
# Load the data
mnist_view = OfficialVectorClassification()
train_idx = mnist_view.fit_idxs
val_idx = mnist_view.val_idxs
test_idx = mnist_view.tst_idxs
# Fit a simple classifier
from skdata.base import SklearnClassifier
learning_algo = SklearnClassifier(DecisionTreeClassifier)
print(learning_algo.results["loss"][0]["task_name"] == "tst")
... but this doesn't work (for me)
Other Data-Loading Projects
You can access R data with rpy2:
from rpy2.robjects import r
from rpy2.robjects import pandas2ri
def data(name):
    # ri2py is the rpy2 2.x name; rpy2 3.x renamed it to rpy2py
    return pandas2ri.ri2py(r[name])
df = data("iris")
You can also load mnist, cifar10, imdb and reuters with keras:
from keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()