• Martin Thoma
  • Home
  • Categories
  • Tags
  • Archives
  • Support me

skdata

Contents

  • Usage
  • Other Data-Loading Projects
  • See also

I really like Machine Learning. I like reading papers, understanding and evaluating new ideas. But one part I always have to spend quite a bit of time on is loading the data. It's always a mess to find the datasets, understand where exactly I can download them and how they've packaged the information. Just a few days ago I found skdata. It is a Python package which aims at helping to load standard datasets. If I can trust the git commit message, then the development was started in August 2011 by James Bergstra! This is before AlexNet!

edit: Although it seemed to be a cool project, it seems to be dead, too. The last commit is from July 2015.

Usage

One way to use skdata is the following:

#!/usr/bin/env python

"""MNIST example with skdata."""

try:
    from skdata.mnist.view import OfficialVectorClassification
except ImportError:
    # Fallback, if you have an old version
    from skdata.mnist.views import OfficialVectorClassification
from sklearn.tree import DecisionTreeClassifier

# Load the data
view = OfficialVectorClassification()
train_idx = view.fit_idxs  # indices of training data
val_idx = view.val_idxs  # incices of validation data
test_idx = view.tst_idxs  # indices of test data

# Fit a simple classifier
print("Start fitting DecisionTreeClassifier.")
clf = DecisionTreeClassifier(max_depth=5)
features = view.all_vectors[train_idx]  # select features of training data
targets = view.all_labels[train_idx]  # select labels of training data
clf.fit(features, targets)

# Evaluate the classifier
predict = clf.predict(view.all_vectors[test_idx])
accuracy = sum(predict == view.all_labels[test_idx]) / float(len(test_idx))
print("Fitted DecisionTreeClassifier has test accuracy of %0.4f." % accuracy)

However, it is inteded to be used like this:

from skdata.mnist.view import OfficialVectorClassification
from sklearn.tree import DecisionTreeClassifier

# Load the data
mnist_view = OfficialVectorClassification()
train_idx = mnist_view.fit_idxs
val_idx = mnist_view.val_idxs
test_idx = mnist_view.tst_idxs

# Fit a simple classifier
from skdata.base import SklearnClassifier

learning_algo = SklearnClassifier(DecisionTreeClassifier)
mnist_view.protocol(learning_algo)
print(learn_algo.results["loss"][0]["task_name"] == "tst")

... but this doesn't work (for me)

Other Data-Loading Projects

You can access R data with rpy2.

from rpy2.robjects import r
from rpy2.robjects import pandas2ri


def data(name):
    return pandas2ri.ri2py(r[name])


df = data("iris")
print(df.describe())
  • quandl
  • fuel

You can also load mnist, cifar10,cifar100, imdb, reuters with keras:

from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

See also

  • skdata documentation
  • skdata on Github
  • How to Create a New Dataset Module
    • Protocol
  • List of datasets
  • fuel
  • kerosene
  • ImageNet

Published

Jan 30, 2017
by Martin Thoma

Category

Machine Learning

Tags

  • dataset 1
  • Machine Learning 81
  • Python 141

Contact

  • Martin Thoma - A blog about Code, the Web and Cyberculture
  • E-mail subscription
  • RSS-Feed
  • Privacy/Datenschutzerklärung
  • Impressum
  • Powered by Pelican. Theme: Elegant by Talha Mansoor