WiLI-2018

WiLI-2018, the Wikipedia Language Identification database, is a collection of sentences from Wikipedia of different languages. It can be used to test how hard it is to distinguish different languages.

If you want to get to the data, go to zenodo.org. If you want to get to the publication, go to archive.org. If you want to give feedback or a comment, just comment below.

If you want to add to the errata, you have a couple of options:

Send me an e-mail ([email protected]) - if I don't respond within 3 days, just ping me again. I get a lot of messages and if I read it during work I might not answer right away. I'm sorry if that happens.
Add an issue on GitHub
Make a Pull Request on Github - that is best, because it makes sure that you get credit for your work as well!

Errata

If you want to share corrected labels, the following format is used:

file;line;wrong label;correct label;comment;contributor

Where:

file: either y_test.txt or y_train.txt
line: Zero-based - your editor might show something different!
wrong label: what is in y_test.txt / y_train.txt
correct label: what should be in y_test.txt / y_train.txt
comment: How you found it
contributor: Your name / e-mail address / pseudonym - whatever you want, as long as it is not insulting or otherwise improper such as advertisement

WiLI-2018

Errata

Published

Category

Tags

Contact