WiLI-2018, the Wikipedia Language Identification database, is a collection of sentences from Wikipedia of different languages. It can be used to test how hard it is to distinguish different languages.
If you want to get to the data, go to zenodo.org. If you want to get to the publication, go to archive.org. If you want to give feedback or a comment, just comment below.
If you want to add to the errata, you have a couple of options:
- Send me an e-mail ([email protected]) - if I don't respond within 3 days, just ping me again. I get a lot of messages and if I read it during work I might not answer right away. I'm sorry if that happens.
- Add an issue on GitHub
- Make a Pull Request on Github - that is best, because it makes sure that you get credit for your work as well!
Errata
If you want to share corrected labels, the following format is used:
file;line;wrong label;correct label;comment;contributor
Where:
file
: eithery_test.txt
ory_train.txt
line
: Zero-based - your editor might show something different!wrong label
: what is iny_test.txt
/y_train.txt
correct label
: what should be iny_test.txt
/y_train.txt
comment
: How you found itcontributor
: Your name / e-mail address / pseudonym - whatever you want, as long as it is not insulting or otherwise improper such as advertisement