Natural language processing (NLP) is a scientific field that deals with language in textual form.
Tasks ¶
- Classification:
- Is an e-mail spam or not?
- Topic: Is it about sports, science or religion?
- Language: Is it English, German or French?
- Sentence boundary: Is a character the boundary of a sentence or not?
- Author:
- Identify the author from a given set of authors
- Age of the author
- Gender of the author
- Sentiment analysis (Opinion mining, opinion extraction, sentiment mining, subjectivity analysis)
- Machine Translation (MT): Given a text in language A, return the same content in language B.
- Similarity calculation: Given a corpus of n texts and one text A as
input, find passages of the corpus which are similar to passages of A. This
can be used to detect if students copied content / copyright violation.
- Minimum edit distance: How many insertions, deletions and substitutions are needed to turn one string into another (see the sketch after this list).
- Spelling correction: Find and correct misspelled words.
- Word sense disambiguation: If "mouse" is in a sentence, is it about the computer mouse or the animal?
- POS Tagging: Detect adjectives, verbs, nouns in a sentence.
- Summarization / Paraphrasing
- Information extraction: For example, find the date in a calendar
application when the user enters the name of the event, or find dates in an e-mail
so that users can create a calendar entry.
- Named Entity Recognition (NER):
- Find names in a text
- Classify names into names of places, people, organizations and non-names.
- Relation Extraction
- Compound splitting: For German, "Donaudampfschiffskapitän" can be split into the compounds "Donau" (a river), "Dampfschiff" (steamboat) and "Kapitän" (captain).
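As referenced above, the minimum edit distance (Levenshtein distance) can be computed with dynamic programming; a minimal sketch with unit costs:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances between "" and prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution (or match)
                            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # 3
```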
Sentiment analysis ¶
Find out how users feel about something.
Sentiment Lexicons are compared by Christopher Potts ("Sentiment Tutorial", 2011):
- General Inquirer: free for research use; a list of categories and a spreadsheet are available
- Positive (1915 words) and negative (2291 words)
- strong vs weak, active vs passive, overstated vs understated
- pleasure, pain, virtue, vice, motivation, cognitive orientation, ...
- LIWC (Linguistic Inquiry and Word Count): 30 or 90 US dollar fee
- 2300 words, more than 70 classes
- affective process (negative and positive emotion)
- cognitive processes: Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
- Pronouns, Negation (no, never), Quantifiers (few, many)
- Bing Liu Opinion Lexicon
- 6786 words (2006 positive, 4783 negative)
- SentiWordNet
The positivity of a word can be inferred from reviews: reviews with many stars should contain positive words, reviews with only one or two stars should contain negative words.
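A minimal sketch of this idea, using smoothed log-odds over a made-up set of reviews:

```python
import math
import re
from collections import Counter

# Toy, made-up review data: (text, star rating from 1 to 5).
reviews = [
    ("great phone, amazing battery", 5),
    ("terrible screen, awful support", 1),
    ("amazing camera, great value", 5),
    ("awful battery, terrible camera", 2),
]

pos_counts, neg_counts = Counter(), Counter()
for text, stars in reviews:
    tokens = re.findall(r"\w+", text.lower())
    (pos_counts if stars >= 4 else neg_counts).update(tokens)

vocab = set(pos_counts) | set(neg_counts)

def polarity(word, alpha=1.0):
    """Smoothed log-odds of a word appearing in high-star vs. low-star reviews."""
    p = (pos_counts[word] + alpha) / (sum(pos_counts.values()) + alpha * len(vocab))
    n = (neg_counts[word] + alpha) / (sum(neg_counts.values()) + alpha * len(vocab))
    return math.log(p / n)

print(polarity("great"))     # > 0: mostly used in positive reviews
print(polarity("terrible"))  # < 0: mostly used in negative reviews
```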
Text generation ¶
Generate text in a given style / tone.
- Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks
Named Entity Recognition ¶
Named entities are sequences of word tokens. Each word token can either be other (O), the beginning of a named entity (B) or the continuation of a named entity (I):
Token | IO encoding | IOB encoding |
---|---|---|
Adam | PER | B-PER |
gives | O | O |
Berta | PER | B-PER |
Charlies | PER | B-PER |
Smith | PER | I-PER |
's | O | O |
table | O | O |
#NER | 2 | 3 |

With plain IO encoding the adjacent entities "Berta" and "Charlies Smith" get merged into one, so only 2 entities are found; IOB encoding keeps them apart and yields 3.
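A small sketch of how such tag sequences are decoded back into entity spans, using the toy sentence from the table:

```python
def iob_to_entities(tokens, tags):
    """Group (token, IOB tag) pairs into named-entity strings."""
    entities, current = [], []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:                       # close the previous entity
                entities.append(" ".join(current))
            current = [token]
        elif tag.startswith("I-") and current:
            current.append(token)             # continue the open entity
        else:                                 # "O" closes any open entity
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

tokens = ["Adam", "gives", "Berta", "Charlies", "Smith", "'s", "table"]
tags   = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O"]
print(iob_to_entities(tokens, tags))  # ['Adam', 'Berta', 'Charlies Smith'] -> 3 entities
```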
Relation Extraction ¶
- ACE (Automated Content Extraction): 17 relations from 2007 "Relation Extraction Task"
- UMLS (Unified Medical Language System): 134 entity types, 54 relations
One approach is to take seed relations and use them to find language patterns. For example, a
seed relation could be BORN-IN(Albert Einstein, Ulm). Now find all sentences
in a corpus which contain "Albert Einstein" and "Ulm". You might find
patterns like:
- Albert Einstein, born in Ulm, ...
- Albert Einstein (1879, Ulm) ...
- One son of Ulm is Albert Einstein.
Now you can extract language patterns:
- X, born in Y, ...
- X (?, Y), ...
- One son of Y is X.
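A rough sketch of this bootstrapping step; the corpus sentences are made up, and real systems would generalize and score the patterns much more carefully:

```python
import re

seed = ("Albert Einstein", "Ulm")  # seed pair for BORN-IN(X, Y)

# Hypothetical corpus sentences containing both seed entities.
corpus = [
    "Albert Einstein, born in Ulm, later moved to Bern.",
    "One son of Ulm is Albert Einstein.",
]

patterns = []
for sentence in corpus:
    x, y = seed
    if x in sentence and y in sentence:
        # Replace the seed entities by placeholders to obtain a pattern.
        patterns.append(sentence.replace(x, "X").replace(y, "Y"))

print(patterns)
# ['X, born in Y, later moved to Bern.', 'One son of Y is X.']

# The patterns can then be turned into regexes to find new (X, Y) pairs.
regex = re.compile(r"(?P<X>[A-Z][\w ]+), born in (?P<Y>[A-Z]\w+)")
m = regex.search("Max Planck, born in Kiel, received the Nobel Prize.")
print(m.group("X"), "-", m.group("Y"))  # Max Planck - Kiel
```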
Unsupervised Information Extraction (or Open Information Extraction) does not start with given relations or training data. The TextRunner algorithm is one way to do it.
Data sources and Corpora ¶
- Thesaurus: WordNet
- Wikipedia
- News websites
- Amazon Reviews
- AP Newswire
- IMDB: Polarity data 2.0 (sentiment analysis)
- Reuters newswire dataset
- DBPedia: 1 billion RDF triples
- Freebase: many relations
Name | Tokens | Types |
---|---|---|
Switchboard phone conversations | 2 400 000 | 20 000 |
Shakespeare | 884 000 | 31 000 |
Google N-Grams | 1 000 000 000 000 | 13 000 000 |
bAbI | ? | ? |
Libraries ¶
- NLTK: The natural language toolkit. Written in Python, for Python. (Book)
- spaCy: According to Reddit, it is cleaner than NLTK but less complete.
- TextBlob: A simple to start toolkit for Python.
- CoreNLP: Faster than NLTK (source?), written in Java, Python wrappers available
- gensim: topic modeling and document similarity analysis
- fastText: Facebook's library for efficient text classification and word representations
- DeepText: an NLP engine
- TensorFlow
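A minimal usage sketch for two of these libraries; it assumes the NLTK punkt tokenizer data and the spaCy en_core_web_sm model have been downloaded, and the printed labels are illustrative:

```python
import nltk
# nltk.download("punkt")  # one-time download (newer NLTK versions: "punkt_tab")
tokens = nltk.word_tokenize("NLTK splits this sentence into tokens.")
print(tokens)

import spacy
# python -m spacy download en_core_web_sm  # one-time model download
nlp = spacy.load("en_core_web_sm")
doc = nlp("Albert Einstein was born in Ulm.")
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Albert Einstein', 'PERSON'), ('Ulm', 'GPE')]
```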
Products ¶
Terminology / Methods ¶
- Backoff: Use the trigram if possible; if not, back off to the bigram (or unigram). Alternatively, use an interpolation of trigram, bigram and unigram
- Filled pauses: "uh" in English or "ähm" in German
- Fragment: A part of a word (e.g. if you transcribe spoken text and somebody stutters)
- Lemma: Two words belong to the same lemma if they have the same stem, belong to the same POS and have the same meaning.
- Lexer: One type of tokenizer
- Maxent classifiers
- n-gram model: Model language by counting word-tuples of length n.
- Naive Bayes
- OOV: Out of vocabulary; such words are typically replaced by an <UNK> token
- Porter's algorithm: a widely used stemming algorithm
- Regular expressions (see regexpal.com to test)
- sentence2vec: Similar to word2vec, but embeds whole sentences.
- Statistical parsing
- Stemming: Reduce a word to a normalized base form (the stem).
- Tokenization: Segment the text into tokens.
- Tokenizer: Splits a text into tokens.
- Viterbi Algorithm
- word2vec: Embed any word in a (high-dimensional) vector space. Allows vector arithmetic (see the gensim sketch after this list).
More might be in my ML Glossary.
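As an example of the vector arithmetic mentioned for word2vec, a minimal gensim sketch; the toy corpus is made up, so meaningful analogies require a real corpus or pretrained vectors:

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
sentences = [
    ["the", "king", "rules", "the", "country"],
    ["the", "queen", "rules", "the", "country"],
    ["a", "man", "walks"],
    ["a", "woman", "walks"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=50)

# Vector arithmetic: king - man + woman ~ queen (only with enough training data).
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```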
Smoothing ¶
A common task in NLP is estimating the probability of a word given the preceding words: $P(w_i | w_{i-1}, w_{i-2})$. You can do that by counting n-grams $(w_{i-2}, w_{i-1}, w_i)$:

$$P(w_i | w_{i-1}, w_{i-2}) = \frac{N((w_{i-2}, w_{i-1}, w_i))}{N((w_{i-2}, w_{i-1}))}$$
But you will quite often have the case that you did not see a 3-gram. How do you deal with that?
Smoothing is the answer. The simplest method is Laplace Smoothing (aka Add-one smoothing).
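With add-one smoothing, the trigram estimate becomes (where $V$ is the vocabulary size):

$$P_{\text{Laplace}}(w_i | w_{i-1}, w_{i-2}) = \frac{N((w_{i-2}, w_{i-1}, w_i)) + 1}{N((w_{i-2}, w_{i-1})) + V}$$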
How do you deal with words you've never seen? The Good-Turing smoothing method uses things you've seen once to estimate things you've never seen:
$$P^*_{GT}(\text{things never seen}) = \frac{N_1}{N}, \qquad P^*_{GT}(\text{thing seen } c \text{ times}) = \frac{c^*}{N} \quad\text{with}\quad c^* = \frac{(c+1)\, N_{c+1}}{N_c}$$

where $N_c$ is the number of n-grams seen exactly $c$ times and $N$ is the total number of observed n-grams. When $c$ becomes "large" (depends on the dataset), just replace $N_c$ by a best-fit power law.
Other smoothing methods:
- Interpolated Kneser-Ney
- Good-Turing Smoothing
- Stupid backoff: For very large N-grams
Another important concept is the continuation probability. While some words (like "a", "to", "the", ...) can be followed / preceded by many different words, others (like "San", "Angelo", "D.C.", "United States of", ...) almost always appear together with the same few words. The continuation probability quantifies how likely a word is to appear as a novel continuation, i.e. in how many different contexts it has been seen. Putting this together with absolute discounting gives the Kneser-Ney smoothing algorithm:
$$P_{KN}(w_i | w_{i-1}) = \frac{\max(N((w_{i-1}, w_i)) - d, 0)}{N(w_{i-1})} + \lambda(w_{i-1}) \, P_{\text{continuation}}(w_i)$$

where $d$ is the absolute discount and $\lambda(w_{i-1})$ is a normalizing constant which weights how important the continuation probability is.
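A minimal bigram sketch of these ideas (absolute discounting plus a continuation probability estimated from how many distinct words precede a word); this is a simplification, not a full interpolated Kneser-Ney implementation:

```python
from collections import Counter, defaultdict

def kneser_ney_bigram(tokens, d=0.75):
    """Return a function P(w | prev) using bigram Kneser-Ney-style smoothing."""
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(bigrams)
    context_counts = Counter(tokens[:-1])   # how often each word occurs as context
    followers = defaultdict(set)            # distinct words seen after a context
    predecessors = defaultdict(set)         # distinct contexts seen before a word
    for prev, w in bigrams:
        followers[prev].add(w)
        predecessors[w].add(prev)
    n_bigram_types = len(bigram_counts)

    def prob(w, prev):
        # Continuation probability: in how many distinct contexts was w seen?
        p_cont = len(predecessors[w]) / n_bigram_types
        lam = d * len(followers[prev]) / context_counts[prev]
        return max(bigram_counts[(prev, w)] - d, 0) / context_counts[prev] + lam * p_cont

    return prob

text = "san francisco is near san jose and francisco likes san francisco".split()
p = kneser_ney_bigram(text)
print(p("francisco", "san"))  # high: 'francisco' mostly follows 'san'
print(p("san", "near"))
```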
Data structures ¶
- Bloom filter: a space-efficient probabilistic data structure that is used to test whether an element is a member of a set
- Trie: A prefix-tree
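A minimal sketch of a trie (prefix tree) storing a small vocabulary:

```python
class Trie:
    """A minimal prefix tree storing strings character by character."""

    def __init__(self):
        self.children = {}   # char -> Trie
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            if ch not in node.children:
                return False
            node = node.children[ch]
        return node.is_word

trie = Trie()
for w in ["car", "cart", "cat"]:
    trie.insert(w)
print(trie.contains("cart"), trie.contains("ca"))  # True False
```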
Resources ¶
- KIT: The ASR course has some NLP content
- Reddit: /r/LanguageTechnology
- StackExchange: datascience.stackexchange.com
- Online Courses:
- Coursera: Introduction to Natural Language Processing
- Stanford: Natural Language Processing with Deep Learning
- Dan Jurafsky and Chris Manning on YouTube: Stanford NLP
- Oxford: Lecture Notes
- Machine Translation
- NLP for Hackers