WiLI-2018

WiLI-2018, the Wikipedia Language Identification database, is a collection of sentences taken from Wikipedia in many different languages. It can be used as a benchmark for language identification, that is, to test how hard it is to tell different languages apart.

The data is available on zenodo.org and the publication on arXiv.org. If you want to give feedback, just comment below.

If you want to add to the errata, you have a couple of options:

  • Send me an e-mail ([email protected]). If I don't respond within 3 days, just ping me again. I get a lot of messages, and if I read yours during work I might not answer right away. I'm sorry if that happens.
  • Open an issue on GitHub
  • Make a Pull Request on GitHub. That is the best option, because it makes sure that you get credit for your work as well!

Errata

If you want to share corrected labels, please use the following semicolon-separated format, one correction per line:

file;line;wrong label;correct label;comment;contributor

Where:

  • file: either y_test.txt or y_train.txt
  • line: zero-based line number. Most editors count from 1, so your editor might show a different number!
  • wrong label: what is in y_test.txt / y_train.txt
  • correct label: what should be in y_test.txt / y_train.txt
  • comment: How you found it
  • contributor: your name / e-mail address / pseudonym - whatever you want, as long as it is not insulting or otherwise improper, such as advertising
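
To make the format concrete, here is a minimal Python sketch of how such errata lines could be parsed and applied to a list of labels. It is not an official tool of the dataset: the names Erratum, parse_errata and apply_errata are hypothetical, the labels in the example are made up, and it assumes that no field contains a semicolon and that an optional header row may be present.

```python
from typing import List, NamedTuple


class Erratum(NamedTuple):
    """One correction row (hypothetical helper type, not part of WiLI-2018)."""
    file: str            # "y_test.txt" or "y_train.txt"
    line: int            # zero-based line number within that file
    wrong_label: str     # label currently in the file
    correct_label: str   # label that should be in the file
    comment: str
    contributor: str


def parse_errata(text: str) -> List[Erratum]:
    """Parse semicolon-separated rows, skipping blank lines and a header row."""
    errata = []
    for raw in text.splitlines():
        row = raw.strip()
        if not row or row.startswith("file;line;"):
            continue
        fields = row.split(";")
        # Assumption: no field contains a semicolon, so there are exactly 6.
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {row!r}")
        file, line, wrong, correct, comment, contributor = fields
        errata.append(Erratum(file, int(line), wrong, correct, comment, contributor))
    return errata


def apply_errata(labels: List[str], errata: List[Erratum], file: str) -> List[str]:
    """Return a corrected copy of `labels` using the rows that target `file`."""
    fixed = list(labels)
    for e in errata:
        if e.file != file:
            continue
        # Guard against off-by-one mistakes: the old label must match.
        if fixed[e.line] != e.wrong_label:
            raise ValueError(
                f"{file} line {e.line}: found {fixed[e.line]!r}, "
                f"expected {e.wrong_label!r}"
            )
        fixed[e.line] = e.correct_label
    return fixed


if __name__ == "__main__":
    # Made-up labels and a made-up correction, for illustration only.
    labels = ["eng", "deu", "fra"]
    rows = "y_test.txt;1;deu;nld;read the text;Jane Doe\n"
    print(apply_errata(labels, parse_errata(rows), "y_test.txt"))
```

Checking the old label before overwriting it catches the most likely mistake, a line number copied from an editor that counts from 1 instead of 0.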

Published

Jan 13, 2019
by Martin Thoma

