
Natural Language Processing

Contents

  • Natural Language Processing
    • Tasks
      • Sentiment analysis
      • Text generation
      • Named Entity Recognition
      • Relation Extraction
    • Data sources and Corpora
    • Libraries
    • Products
    • Terminology / Methods
      • Smoothing
    • Data structures
    • Resources

Natural language processing (NLP) is a scientific field that deals with language in textual form.

Tasks

  • Classification:
    • Is an e-mail spam or not?
    • Topic: Is it about sports, science or religion?
    • Language: Is it English, German or French?
    • Sentence boundary: Is a character the boundary of a sentence or not?
    • Author:
      • Identify the author from a given set of authors
      • Age of the author
      • Gender of the author
    • Sentiment analysis (Opinion mining, opinion extraction, sentiment mining, subjectivity analysis)
  • Machine Translation (MT): Given a text in language A, return the same content in language B.
  • Similarity calculation: Given a corpus of n texts and one text A as input, find passages of the corpus which are similar to passages of A. This can be used to detect plagiarism and copyright violations.
    • Minimum Edit Distance (see the sketch after this list)
  • Spelling correction: Find places where the spelling or grammar needs to be fixed.
  • Word sense disambiguation: If "mouse" appears in a sentence, is it about the computer mouse or the animal?
  • POS Tagging: Detect adjectives, verbs, nouns in a sentence.
  • Summarization / Paraphrasing
  • Information extraction: For example, find the date in a calendar application when the user enters the name of an event, or find dates in an e-mail so that users can create a calendar entry.
    • Named Entity Recognition (NER):
      • Find names in a text
      • Classify names into names of places, people, organizations and non-names.
    • Relation Extraction
  • Compound splitting: For German, "Donaudampfschiffskapitän" can be split into the compounds "Donau" (a river), "Dampfschiff" (steamboat) and "Kapitän" (captain).
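
The minimum edit distance mentioned in the list can be computed with dynamic programming. A minimal sketch of the Levenshtein distance with unit costs:

    def edit_distance(a: str, b: str) -> int:
        """Levenshtein distance between a and b with unit costs."""
        # dp[j] holds the distance between a[:i] and b[:j] for the current row i.
        dp = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            prev_diag, dp[0] = dp[0], i
            for j, cb in enumerate(b, start=1):
                prev_diag, dp[j] = dp[j], min(
                    dp[j] + 1,               # deletion
                    dp[j - 1] + 1,           # insertion
                    prev_diag + (ca != cb),  # substitution (free if equal)
                )
        return dp[-1]

    print(edit_distance("intention", "execution"))  # 5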

Sentiment analysis

Find out how users feel about something.

  • Opinion mining and sentiment analysis

Sentiment Lexicons are compared by Christopher Potts ("Sentiment Tutorial", 2011):

  • General Inquirer is free for research use
    • List of categories
      • Positive (1915 words) and negative (2291 words)
      • strong vs weak, active vs passive, overstated vs understated
      • pleasure, pain, virtue, vice, motivation, cognitive orientation, ...
    • Spreadsheet
  • LIWC - Linguistic Inquiry and Word Count: fee of 30 or 90 US dollars
    • 2300 words, more than 70 classes
    • affective process (negative and positive emotion)
    • cognitive processes: Tentative (maybe, perhaps, guess), Inhibition (block, constraint)
    • Pronouns, Negation (no, never), Quantifiers (few, many)
  • Bing Liu Opinion Lexicon
    • 6786 words (2006 positive, 4783 negative)
  • SentiWordNet

The positivity of a word can be inferred from reviews: reviews with many stars should contain positive words, reviews with only one or two stars should contain negative words.
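
A minimal sketch of lexicon-based scoring; the word sets below are tiny stand-ins for a real lexicon such as the Bing Liu Opinion Lexicon:

    # Tiny stand-in word lists; a real system would load a full lexicon
    # such as the Bing Liu Opinion Lexicon (6786 words) instead.
    POSITIVE = {"good", "great", "excellent", "love", "nice"}
    NEGATIVE = {"bad", "terrible", "awful", "hate", "poor"}

    def sentiment_score(text: str) -> int:
        """Positive minus negative word count; > 0 means positive tone."""
        tokens = text.lower().split()
        return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)

    print(sentiment_score("the camera is great but the battery is terrible"))  # 0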

Text generation

Generate text in a given style / tone.

  • Andrej Karpathy: The Unreasonable Effectiveness of Recurrent Neural Networks

Named Entity Recognition

Named entities are sequences of word tokens. Each word token can either be outside a named entity (O), the beginning of a named entity (B) or the inside / continuation of a named entity (I):

           IO-encoding   IOB-encoding
Adam       PER           B-PER
gives      O             O
Berta      PER           B-PER
Charlie    PER           B-PER
Smith      PER           I-PER
's         O             O
table      O             O
--------------------------------------
#NER       2             3
IOB encoding
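
A minimal sketch that reads entity spans from IOB tags; on the example above it finds all three persons, while the same logic on IO tags could not separate the adjacent "Berta" and "Charlie Smith":

    def extract_entities(tokens, tags):
        """Collect (type, text) spans from IOB tags like B-PER / I-PER / O."""
        entities, current = [], None
        for token, tag in zip(tokens, tags):
            if tag.startswith("B-"):            # a new entity starts here
                current = (tag[2:], [token])
                entities.append(current)
            elif tag.startswith("I-") and current:
                current[1].append(token)        # continue the running entity
            else:                               # O: no entity at this token
                current = None
        return [(etype, " ".join(words)) for etype, words in entities]

    tokens = ["Adam", "gives", "Berta", "Charlie", "Smith", "'s", "table"]
    tags   = ["B-PER", "O", "B-PER", "B-PER", "I-PER", "O", "O"]
    print(extract_entities(tokens, tags))
    # [('PER', 'Adam'), ('PER', 'Berta'), ('PER', 'Charlie Smith')]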

Relation Extraction

  • ACE (Automated Content Extraction): 17 relations from 2007 "Relation Extraction Task"
  • UMLS (Unified Medical Language System): 134 entity types, 54 relations

One approach takes seed relations to find language patterns. For example, a seed relation could be BORN-IN(Albert Einstein, Ulm). Now find all sentences in a corpus which contain "Albert Einstein" and "Ulm". You might find patterns like:

  • Albert Einstein, born in Ulm, ...
  • Albert Einstein (1879, Ulm) ...
  • One son of Ulm is Albert Einstein.

Now you can extract language patterns:

  • X, born in Y, ...
  • X (?, Y), ...
  • One son of Y is X.
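
A minimal sketch of the matching step for the first pattern; the regular expression is a deliberately naive stand-in for NER-based matching:

    import re

    # The bootstrapped pattern "X, born in Y" as a regular expression.
    # The capture groups are deliberately simple; a real system would
    # match against NER output instead of capitalized words.
    PATTERN = re.compile(r"([A-Z][\w .]+), born in ([A-Z]\w+)")

    corpus = [
        "Albert Einstein, born in Ulm, developed the theory of relativity.",
        "Marie Curie, born in Warsaw, won two Nobel Prizes.",
    ]
    for sentence in corpus:
        match = PATTERN.search(sentence)
        if match:
            print("BORN-IN(%s, %s)" % match.groups())
    # BORN-IN(Albert Einstein, Ulm)
    # BORN-IN(Marie Curie, Warsaw)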

Unsupervised Information Extraction (or Open Information Extraction) does not start with given relations or training data. The TextRunner algorithm is one way to do it.

Data sources and Corpora

Thesaurus: WordNet

  • Twitter
  • Wikipedia
  • News websites
  • Amazon Reviews
  • AP Newswire
  • IMDB: Polarity data 2.0 (sentiment analysis)
  • Reuters newswire dataset
  • DBPedia: 1 billion RDF triples
  • Freebase: many relations
Name                              Tokens             Types
Switchboard phone conversations   2 400 000          20 000
Shakespeare                       884 000            31 000
Google N-Grams                    1 000 000 000 000  13 000 000
bAbI                              ?                  ?

Libraries

  • NLTK: The natural language toolkit. Written in Python, for Python. (Book)
  • SpaCy: According to Reddit, it is cleaner than NLTK but less complete.
  • TextBlob: An easy-to-start toolkit for Python.
  • CoreNLP: Faster than NLTK (source?), written in Java, Python wrappers available
  • gensim: topic modeling and document similarity analysis
  • fastText: a classifier on top of a sentence2vec model
  • DeepText: an NLP engine
  • Tensorflow
    • syntaxnet and Parsey McParseface (paper)
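
For example, a minimal NLTK session for tokenization and POS tagging (the model downloads are needed only once):

    import nltk

    # One-time download of the tokenizer and tagger models.
    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog.")
    print(nltk.pos_tag(tokens))
    # [('The', 'DT'), ('quick', 'JJ'), ...]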

Products

  • aspell
  • Google n-gram Viewer
  • Google Translate
  • Quora: Duplication detection
  • IBM Watson

Terminology / Methods

  • Backoff: Use the trigram if possible; if not, back off to the bigram (or unigram). Alternatively, use an interpolation of trigram, bigram and unigram.
  • Filled pauses: "uh" in English or "ähm" in German
  • Fragment: A part of a word (e.g. if you transcribe spoken text and somebody stutters)
  • Lemma: Two words belong to the same lemma if they have the same stem, belong to the same POS and have the same meaning.
  • Lexer: One type of tokenizer
  • Maxent classifiers
  • n-gram model: Model language by counting word-tuples of length n.
  • Naive Bayes
  • OOV: Out of vocabulary, <UNK> token
  • Porter's algorithm (a stemming algorithm)
  • Regular expressions (see regexpal.com to test)
  • sentence2vec: Similar to word2vec.
  • Statistical parsing
  • Stemming: Bring a word into a normalized form (the stem). Mostly for verbs.
  • Tokenization: Segment the text into tokens.
  • Tokenizer: Splits a text into tokens.
  • Viterbi Algorithm
  • word2vec: Embed any word in a (high-dimensional) vector space. Allows vector arithmetic (see the sketch below).
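
The vector arithmetic of word2vec can be tried with gensim; "vectors.bin" below is a placeholder for any pretrained embedding in word2vec format:

    from gensim.models import KeyedVectors

    # "vectors.bin" is a placeholder for pretrained word2vec vectors,
    # e.g. the GoogleNews embeddings.
    model = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)

    # king - man + woman should be close to queen
    print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))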

More can be found in my ML Glossary.

Smoothing

A common task in NLP is estimating the probability of a word given some other words: \(P(w_i | w_{i-1}, w_{i-2})\). You can do that by counting n-grams \((w_{i-2}, w_{i-1}, w_{i})\):

$$P(w_i | w_{i-1}, w_{i-2}) = \frac{N((w_{i-2}, w_{i-1}, w_i))}{N((w_{i-2}, w_{i-1}))}$$

But quite often you will encounter a 3-gram that never occurred in the training data. How do you deal with that?

Smoothing is the answer. The simplest method is Laplace Smoothing (aka Add-one smoothing).
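
It adds one to every count so that no n-gram gets probability zero; with vocabulary size \(V\):

$$P_{Laplace}(w_i | w_{i-1}, w_{i-2}) = \frac{N((w_{i-2}, w_{i-1}, w_i)) + 1}{N((w_{i-2}, w_{i-1})) + V}$$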

How do you deal with words you've never seen? The Good-Turing smoothing method uses the frequency of things you've seen once to estimate things you've never seen. With \(N_c\) denoting the number of distinct things seen exactly \(c\) times:

$$P^*_{GT}(\text{things never seen}) = \frac{N_1}{N}$$
$$P^*_{GT}(\text{things seen } c \text{ times}) = \frac{c^*}{N} \quad \text{with} \quad c^* = \frac{(c+1) N_{c+1}}{N_c}$$

When \(c\) becomes "large" (what counts as large depends on the dataset), just replace \(N_c\) by a best-fit power law.
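
A minimal sketch of this estimate, without the power-law fit for large \(c\):

    from collections import Counter

    def good_turing(counts):
        """Basic Good-Turing: return P(unseen) and a smoothed P per seen item."""
        N = sum(counts.values())
        # N_c = how many distinct items were seen exactly c times
        count_of_counts = Counter(counts.values())
        p_unseen = count_of_counts[1] / N
        probs = {}
        for item, c in counts.items():
            if count_of_counts[c + 1] == 0:
                probs[item] = c / N  # no N_{c+1}: fall back to the raw estimate
            else:
                c_star = (c + 1) * count_of_counts[c + 1] / count_of_counts[c]
                probs[item] = c_star / N
        return p_unseen, probs

    p0, probs = good_turing(Counter("abracadabra"))
    print(p0)  # probability mass reserved for unseen characters: 2/11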

Other smoothing methods

  • Interpolated Kneser-Ney
  • Good-Turing Smoothing
  • Stupid backoff: For very large N-grams

Another important concept is the continuation probability. While some words (like "a", "to", "the", ...) can be followed / preceded by many different words, others (like "San", "Angelo", "D.C.", "United States of", ...) occur only in a handful of fixed contexts. The continuation probability quantifies how likely a word is to appear as a novel continuation. Putting this together with absolute discounting gives the Kneser-Ney smoothing algorithm:

$$P_{KN}(w_i | w_{i-1}) = \frac{\max( c(w_{i-1}, w_i) - d, 0)}{c(w_{i-1})} + \lambda (w_{i-1}) P_{continuation}(w_i)$$

where \(d\) is the absolute discount and \(\lambda(w_{i-1})\) is a normalizing constant that weights how important the continuation probability is.
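
In the bigram case, the continuation probability counts context types rather than raw occurrences:

$$P_{continuation}(w_i) = \frac{|\{w' : c(w', w_i) > 0\}|}{|\{(w', w'') : c(w', w'') > 0\}|}$$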

Data structures

  • Bloom filter: a space-efficient probabilistic data structure that is used to test whether an element is a member of a set
  • Trie: A prefix-tree
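
A minimal trie sketch with insert and prefix lookup:

    class Trie:
        """A prefix tree: each node maps a character to a child node."""

        def __init__(self):
            self.children = {}
            self.is_word = False

        def insert(self, word):
            node = self
            for ch in word:
                node = node.children.setdefault(ch, Trie())
            node.is_word = True

        def has_prefix(self, prefix):
            node = self
            for ch in prefix:
                if ch not in node.children:
                    return False
                node = node.children[ch]
            return True

    trie = Trie()
    trie.insert("language")
    print(trie.has_prefix("lang"), trie.has_prefix("model"))  # True False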

Resources

  • KIT: The ASR course has some NLP content
  • Reddit: /r/LanguageTechnology
  • StackExchange: datascience.stackexchange.com
  • Online Courses:
    • Coursera: Introduction to Natural Language Processing
    • Stanford: Natural Language Processing with Deep Learning
    • Dan Jurafsky and Chris Manning on YouTube: Stanford NLP
    • Oxford: Lecture Notes
    • Machine Translation
  • NLP for Hackers

Published

May 24, 2017
by Martin Thoma

Category

Machine Learning

Tags

  • Machine Learning
  • NLP
