
Natural Language Processing

Contents

  • Natural Language Processing
    • Tasks
      • Sentiment analysis
      • Text generation
      • Named Entity Recognition
      • Relation Extraction
    • Data sources and Corpora
    • Libraries
    • Products
    • Terminology / Methods
      • Smoothing
    • Data structures
    • Resources

Natural language processing (NLP) is a scientific field that deals with language in textual form.

Tasks

  • Classification:
    • Is an e-mail spam or not?
    • Topic: Is it about sports, science or religion?
    • Language: Is it English, German or French?
    • Sentence boundary: Is a character the boundary of a sentence or not?
    • Author:
      • Identify the author from a given set of authors
      • Age of the author
      • Gender of the author
    • Sentiment analysis (Opinion mining, opinion extraction, sentiment mining, subjectivity analysis)
  • Machine Translation (MT): Given a text in language A, return the same content in language B.
  • Similarity calculation: Given a corpus of n texts and one text A as input, find passages of the corpus which are similar to passages of A. This can be used to detect plagiarism and copyright violations.
    • Minimum Edit Distance (see the sketch after this list)
  • Spelling correction: Find places where the spelling or grammar needs to be fixed.
  • Word sense disambiguation: If "mouse" appears in a sentence, is it about the computer mouse or the animal?
  • POS Tagging: Detect adjectives, verbs, nouns in a sentence.
  • Summarization / Paraphrasing
  • Information extraction: For example, find the date in a calendar application when the user enters the name of an event, or find dates in an e-mail so that users can create a calendar entry.
    • Named Entity Recognition (NER):
      • Find names in a text
      • Classify names into names of places, people, organizations and non-names.
    • Relation Extraction
  • Compound splitting: For German, "Donaudampfschiffskapitän" can be split into the compounds "Donau" (a river), "Dampfschiff" (steamboat) and "Kapitän" (captain).
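
The minimum edit distance mentioned in the list can be computed with dynamic programming. A minimal sketch of the Levenshtein distance with unit costs:

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance between a and b with unit costs."""
        # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            prev_diag, dp[0] = dp[0], i
            for j, cb in enumerate(b, start=1):
                prev_diag, dp[j] = dp[j], min(
                    dp[j] + 1,               # deletion
                    dp[j - 1] + 1,           # insertion
                    prev_diag + (ca != cb),  # substitution (free if equal)
                )
        return dp[-1]

    print(edit_distance("intention", "execution"))  # 5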

Sentiment analysis

Find out how users feel about something.

  • Opinion mining and sentiment analysis

Sentiment Lexicons are compared by Christopher Potts ("Sentiment Tutorial", 2011):

  • General Inquirer is free for research use
    • List of categories
      • Positive (1915 words) and negative (2291 words)
      • strong vs weak, active vs passive, overstated vs understated
      • pleasure, pain, virtue, vice, motivation, cognitive orientation, ...
    • Spreadsheet
  • LIWC - Linguistic Inquiry and Word Count: fee of 30 or 90 US dollars
    • 2300 words, more than 70 classes
    • affective process (negative and positive emotion)
    • cognitive processes: Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
    • Pronouns, Negation (no, never), Quantifiers (few, many)
  • Bing Liu Opinion Lexicon
    • 6786 words (2006 positive, 4783 negative)
  • SentiWordNet

The positivity of a word can be inferred from reviews: reviews with many stars should contain positive words, reviews with only one or two stars should contain negative words.
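
A minimal sketch of lexicon-based scoring; the word sets below are tiny stand-ins for a real lexicon such as the Bing Liu Opinion Lexicon:

    # Tiny stand-in word lists; a real system would load a full lexicon
    # such as the Bing Liu Opinion Lexicon (6786 words) instead.
    POSITIVE = {"good", "great", "excellent", "love", "nice"}
    NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

    def sentiment_score(text: str) -> int:
        """Positive minus negative word count; > 0 means positive tone."""
        tokens = text.lower().split()
        return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

    print(sentiment_score("the camera is great but the battery is terrible"))  # 0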

Text generation

Generate text in a given style / tone.

  • Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks

Named Entity Recognition

Named entities are sequences of word tokens. Each word token can either be outside a named entity (O), the beginning of a named entity (B) or the inside / continuation of a named entity (I):

           IO-encoding   IOB-encoding
Adam       PER           B-PER
gives      O             O
Berta      PER           B-PER
Charlie    PER           B-PER
Smith      PER           I-PER
's         O             O
table      O             O
--------------------------------------
#NER       2             3
IOB encoding
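
A minimal sketch that reads entity spans from IOB tags; on the example above it finds all three persons, while the same logic on IO tags could not separate the adjacent "Berta" and "Charlie Smith":

    def extract_entities(tokens, tags):
        """Collect (type, text) spans from IOB tags like B-PER / I-PER / O."""
        entities, current = [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):            # a new entity starts here
                current = (tag[2:], [token])
                entities.append(current)
            elif tag.startswith("I-") and current:
                current[1].append(token)        # continue the running entity
            else:                               # O: no entity at this token
                current = None
        return [(etype, " ".join(words)) for etype, words in entities]

    tokens = ["Adam", "gives", "Berta", "Charlie", "Smith", "'s", "table"]
    tags   = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O"]
    print(extract_entities(tokens, tags))
    # [('PER', 'Adam'), ('PER', 'Berta'), ('PER', 'Charlie Smith')]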

Relation Extraction

  • ACE (Automated Content Extraction): 17 relations from 2007 "Relation Extraction Task"
  • UMLS (Unified Medical Language System): 134 entity types, 54 relations

One approach takes seed relations to find language patterns. For example, a seed relation could be BORN-IN(Albert Einstein, Ulm). Now find all sentences in a corpus which contain "Albert Einstein" and "Ulm". You might find patterns like:

  • Albert Einstein, born in Ulm, ...
  • Albert Einstein (1879, Ulm) ...
  • One son of Ulm is Albert Einstein.

Now you can extract language patterns:

  • X, born in Y, ...
  • X (?, Y), ...
  • One son of Y is X.
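
A minimal sketch of the matching step for the first pattern; the regular expression is a deliberately naive stand-in for NER-based matching:

    import re

    # The bootstrapped pattern "X, born in Y" as a regular expression.
    # The capture groups are deliberately simple; a real system would
    # match against NER output instead of capitalized words.
    PATTERN = re.compile(r"([A-Z][\w .]+), born in ([A-Z]\w+)")

    corpus = [
        "Albert Einstein, born in Ulm, developed the theory of relativity.",
        "Marie Curie, born in Warsaw, won two Nobel Prizes.",
    ]
    for sentence in corpus:
        match = PATTERN.search(sentence)
        if match:
            print("BORN-IN(%s, %s)" % match.groups())
    # BORN-IN(Albert Einstein, Ulm)
    # BORN-IN(Marie Curie, Warsaw)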

Unsupervised Information Extraction (or Open Information Extraction) does not start with given relations or training data. The TextRunner algorithm is one way to do it.

Data sources and Corpora

Thesaurus: WordNet

  • Twitter
  • Wikipedia
  • News websites
  • Amazon Reviews
  • AP Newswire
  • IMDB: Polarity data 2.0 (sentiment analysis)
  • Reuters newswire dataset
  • DBPedia: 1 billion RDF triples
  • Freebase: many relations
Name                              Tokens             Types
Switchboard phone conversations   2 400 000          20 000
Shakespeare                       884 000            31 000
Google N-Grams                    1 000 000 000 000  13 000 000
bAbI                              ?                  ?

Libraries

  • NLTK: The natural language toolkit. Written in Python, for Python. (Book)
  • SpaCy: According to Reddit, it is cleaner than NLTK but less complete.
  • TextBlob: An easy-to-start toolkit for Python.
  • CoreNLP: Faster than NLTK (source?), written in Java, Python wrappers available
  • gensim: topic modeling and document similarity analysis
  • fastText: a classifier on top of a sentence2vec model
  • DeepText: an NLP engine
  • Tensorflow
    • syntaxnet and Parsey McParseface (paper)
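
For example, a minimal NLTK session for tokenization and POS tagging (the model downloads are needed only once):

    import nltk

    # One-time download of the tokenizer and tagger models.
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('quick', 'JJ'), ...]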

Products

  • aspell
  • Google n-gram Viewer
  • Google Translate
  • Quora: Duplication detection
  • IBM Watson

Terminology / Methods

  • Backoff: Use the trigram if possible; if not, back off to the bigram (or unigram). Alternatively, use an interpolation of trigram, bigram and unigram.
  • Filled pauses: "uh" in English or "ähm" in German
  • Fragment: A part of a word (e.g. if you transcribe spoken text and somebody stutters)
  • Lemma: Two words belong to the same lemma if they have the same stem, belong to the same POS and have the same meaning.
  • Lexer: One type of tokenizer
  • Maxent classifiers
  • n-gram model: Model language by counting word-tuples of length n.
  • Naive Bayes
  • OOV: Out of vocabulary, <UNK> token
  • Porter's algorithm (a stemming algorithm)
  • Regular expressions (see regexpal.com to test)
  • sentence2vec: Similar to word2vec.
  • Statistical parsing
  • Stemming: Bring a word into a normalized form (the stem). Mostly for verbs.
  • Tokenization: Segment the text into tokens.
  • Tokenizer: Splits a text into tokens.
  • Viterbi Algorithm
  • word2vec: Embed any word in a (high-dimensional) vector space. Allows vector arithmetic (see the sketch below).
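
The vector arithmetic of word2vec can be tried with gensim; "vectors.bin" below is a placeholder for any pretrained embedding in word2vec format:

    from gensim.models import KeyedVectors

    # "vectors.bin" is a placeholder for pretrained word2vec vectors,
    # e.g. the GoogleNews embeddings.
    model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # king - man + woman should be close to queen
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))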

More can be found in my ML Glossary.

Smoothing

A common task in NLP is estimating the probability of a word given some other words: \(P(w_i | w_{i-1}, w_{i-2})\). You can do that by counting n-grams \((w_{i-2}, w_{i-1}, w_{i})\):

$$P(w_i | w_{i-1}, w_{i-2}) = \frac{N((w_{i-2}, w_{i-1}, w_i))}{N((w_{i-2}, w_{i-1}))}$$

But quite often you will encounter a 3-gram that never occurred in the training data. How do you deal with that?

Smoothing is the answer. The simplest method is Laplace Smoothing (aka Add-one smoothing).
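
It adds one to every count so that no n-gram gets probability zero; with vocabulary size \(V\):

$$P_{Laplace}(w_i | w_{i-1}, w_{i-2}) = \frac{N((w_{i-2}, w_{i-1}, w_i)) + 1}{N((w_{i-2}, w_{i-1})) + V}$$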

How do you deal with words you've never seen? The Good-Turing smoothing method uses the frequency of things you've seen once to estimate things you've never seen. With \(N_c\) denoting the number of distinct things seen exactly \(c\) times:

$$P^*_{GT}(\text{things never seen}) = \frac{N_1}{N}$$
$$P^*_{GT}(\text{things seen } c \text{ times}) = \frac{c^*}{N} \quad \text{with} \quad c^* = \frac{(c+1) N_{c+1}}{N_c}$$

When \(c\) becomes "large" (what counts as large depends on the dataset), just replace \(N_c\) by a best-fit power law.
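
A minimal sketch of this estimate, without the power-law fit for large \(c\):

    from collections import Counter

    def good_turing(counts):
        """Basic Good-Turing: return P(unseen) and a smoothed P per seen item."""
        N = sum(counts.values())
        # N_c = how many distinct items were seen exactly c times
        count_of_counts = Counter(counts.values())
        p_unseen = count_of_counts[1] / N
        probs = {}
        for item, c in counts.items():
            if count_of_counts[c + 1] == 0:
                probs[item] = c / N  # no N_{c+1}: fall back to the raw estimate
            else:
                c_star = (c + 1) * count_of_counts[c + 1] / count_of_counts[c]
                probs[item] = c_star / N
        return p_unseen, probs

    p0, probs = good_turing(Counter("abracadabra"))
    print(p0)  # probability mass reserved for unseen characters: 2/11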

Other smoothing methods

  • Interpolated Kneser-Ney
  • Good-Turing Smoothing
  • Stupid backoff: For very large N-grams

Another important concept is the continuation probability. While some words (like "a", "to", "the", ...) can be followed / preceded by many different words, others (like "San", "Angelo", "D.C.", "United States of", ...) occur only in a handful of fixed contexts. The continuation probability quantifies how likely a word is to appear as a novel continuation. Putting this together with absolute discounting gives the Kneser-Ney smoothing algorithm:

$$P_{KN}(w_i | w_{i-1}) = \frac{\max( c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda (w_{i-1}) P_{continuation}(w_i)$$

where \(d\) is the absolute discount and \(\lambda(w_{i-1})\) is a normalizing constant that weights how important the continuation probability is.
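
In the bigram case, the continuation probability counts context types rather than raw occurrences:

$$P_{continuation}(w_i) = \frac{|\{w' : c(w', w_i) > 0\}|}{|\{(w', w'') : c(w', w'') > 0\}|}$$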

Data structures

  • Bloom filter: a space-efficient probabilistic data structure that is used to test whether an element is a member of a set
  • Trie: A prefix-tree
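
A minimal trie sketch with insert and prefix lookup:

    class Trie:
        """A prefix tree: each node maps a character to a child node."""

        def __init__(self):
            self.children = {}
            self.is_word = False

        def insert(self, word):
            node = self
            for ch in word:
                node = node.children.setdefault(ch, Trie())
            node.is_word = True

        def has_prefix(self, prefix):
            node = self
            for ch in prefix:
                if ch not in node.children:
                    return False
                node = node.children[ch]
            return True

    trie = Trie()
    trie.insert("language")
    print(trie.has_prefix("lang"), trie.has_prefix("model"))  # True False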

Resources

  • KIT: The ASR course has some NLP content
  • Reddit: /r/LanguageTechnology
  • StackExchange: datascience.stackexchange.com
  • Online Courses:
    • Coursera: Introduction to Natural Language Processing
    • Stanford: Natural Language Processing with Deep Learning
    • Dan Jurafsky and Chris Manning on YouTube: Stanford NLP
    • Oxford: Lecture Notes
    • Machine Translation
  • NLP for Hackers

Published

May 24, 2017
by Martin Thoma

Category

Machine Learning

Tags

  • Machine Learning
  • NLP
