One of the most widely used methods in natural language processing is n-gram modeling. This article explains what an n-gram model is, how it is computed, and what the probabilities of an n-gram model tell us.
What is an n-gram?
An n-gram is a contiguous sequence of n items from a given sequence of text.
Given a sentence, s, we can construct a list of n-grams from s by finding sequences of n words that occur next to each other. For example, given the sentence “I am Sam”, you can construct bigrams (n-grams of length 2) by finding consecutive pairs of words.
>>> s = "I am Sam."
>>> tokens = s.split(" ")
>>> bigrams = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
>>> bigrams
[('I', 'am'), ('am', 'Sam.')]
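The same idea extends to any n. Below is a minimal sketch of a general helper (the name ngrams is my own choice, not a library function) that slices out every window of n consecutive tokens:
>>> def ngrams(tokens, n):
...     # Each n-gram is a window of n consecutive tokens.
...     return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
...
>>> ngrams(tokens, 2)
[('I', 'am'), ('am', 'Sam.')]
>>> ngrams(tokens, 3)
[('I', 'am', 'Sam.')]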
Calculating n-gram Probability
Given a list of n-grams we can count the number of occurrences of each n-gram; this count determines the frequency with which an n-gram occurs throughout our document.
>>> from collections import Counter
>>> count = Counter(bigrams)
>>> count
Counter({('I', 'am'): 1, ('am', 'Sam.'): 1})
With this small corpus we only count one occurrence of each n-gram. Dividing each count by the total number of n-grams in our list gives a probability of 0.5 for each n-gram.
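Continuing the toy session above, that calculation might look like this (a minimal sketch, not a full language model):
>>> total = sum(count.values())
>>> probabilities = {bigram: c / total for bigram, c in count.items()}
>>> probabilities
{('I', 'am'): 0.5, ('am', 'Sam.'): 0.5}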
Let’s look at a larger corpus of words and see what the probabilities can tell us. The following bigram counts were computed from data downloaded from HC Corpora. The table lists the 20 most frequently encountered bigrams out of 97,810,566 bigrams in the entire corpus.
It shows the most frequently used pairs of words in the corpus along with the number of times they occur; a sketch of how such counts can be computed follows the table.
of the 421560
in the 380608
to the 207571
for the 190683
on the 184430
to be 153285
at the 128980
and the 114232
in a 109527
with the 99141
is a 99053
for a 90209
from the 82223
with a 78918
will be 78049
of a 78009
I was 76788
I have 76621
going to 75088
is the 70045
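For reference, counts of this kind can be produced with the same tools used in the toy example. The sketch below assumes the corpus has been concatenated into a single plain-text file (corpus.txt is a hypothetical name) and uses naive whitespace tokenization rather than a proper tokenizer:
>>> from collections import Counter
>>> with open("corpus.txt") as f:
...     corpus_tokens = f.read().split()
...
>>> bigram_counts = Counter(zip(corpus_tokens, corpus_tokens[1:]))
>>> bigram_counts.most_common(20)  # the 20 most frequent bigrams, as in the table above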
By consulting our frequency table of bigrams, we can tell that the sentence “There was heavy rain last night” is much more likely to be grammatically correct than the sentence “There was large rain last night”, because the bigram “heavy rain” occurs much more frequently than “large rain” in our corpus. Said another way, the probability of the bigram “heavy rain” is larger than the probability of the bigram “large rain”.
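Continuing the sketch above, that comparison could be made explicit by converting the two bigrams’ counts into relative frequencies (the actual counts of “heavy rain” and “large rain” are not reproduced here):
>>> total = sum(bigram_counts.values())
>>> p_heavy = bigram_counts[("heavy", "rain")] / total
>>> p_large = bigram_counts[("large", "rain")] / total
>>> p_heavy > p_large  # expected to be True for a typical English corpus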
Sentences as probability models
More precisely, we can use n-gram models to derive the probability of a sentence, W, as the joint probability of each individual word in the sentence, wi.
P(W) = P(w1, w2, ..., wn)
This can be reduced to a sequence of n-grams using the Chain Rule of conditional probability.
P(x1, x2, ..., xn) = P(x1)P(x2|x1)...P(xn|x1,...xn-1)
As a concrete example, let’s predict the probability of the sentence “There was heavy rain”.
P('There was heavy rain') = P('There', 'was', 'heavy', 'rain')
P('There was heavy rain') = P('There')P('was'|'There')P('heavy'|'There was')P('rain'|'There was heavy')
Each of the terms on the right-hand side of this equation is an n-gram probability that we can estimate using the counts of n-grams in our corpus. To calculate the probability of the entire sentence, we just need to look up the probability of each conditional term and multiply them together.
Unfortunately, this formula does not scale, since we cannot compute n-grams of every length. For example, if our model contains only bigrams, we have no way of estimating the probability P('rain'|'There was heavy') from bigram counts alone.
By using the Markov Assumption, we can simplify our equation by assuming that future states in our model depend only on the present state. For bigrams, this means each conditional probability is approximated by conditioning only on the immediately preceding word, so that
P('rain'|'There was heavy') ~ P('rain'|'heavy')
More generally, we can estimate the probability of a sentence by the probabilities of each component part. In the equation that follows, the probability of the sentence is reduced to the probabilities of the sentence’s individual bigrams.
P('There was heavy rain') ~ P('There')P('was'|'There')P('heavy'|'was')P('rain'|'heavy')
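Each term in this product can be estimated from corpus counts; for example, P('rain'|'heavy') is roughly count('heavy rain') divided by count('heavy'). The sketch below assumes unigram_counts and bigram_counts are Counters built from the corpus tokens as in the earlier sketch:
>>> unigram_counts = Counter(corpus_tokens)
>>> def bigram_probability(w_prev, w):
...     # Maximum-likelihood estimate: count of the bigram divided by count of the preceding word.
...     if unigram_counts[w_prev] == 0:
...         return 0.0
...     return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]
...
>>> def sentence_probability(sentence):
...     tokens = sentence.split()
...     # Estimate P(w1) as the relative frequency of the first word,
...     # then multiply in one bigram probability per remaining word.
...     p = unigram_counts[tokens[0]] / sum(unigram_counts.values())
...     for w_prev, w in zip(tokens, tokens[1:]):
...         p *= bigram_probability(w_prev, w)
...     return p
...
>>> sentence_probability("There was heavy rain")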
Applications
What can we use n-gram models for? Given the probability of a sentence, we can judge the likelihood that an automated machine translation is correct, predict the next most likely word in a sentence, convert speech to text, automate spelling correction, or determine the relative sentiment of a piece of text.
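As one small illustration, predicting the next most likely word can be done with the bigram counts alone (again a rough sketch, reusing the hypothetical bigram_counts from above):
>>> def most_likely_next(word, k=3):
...     # Rank the words that follow `word` in the corpus, most frequent first.
...     followers = Counter({w2: c for (w1, w2), c in bigram_counts.items() if w1 == word})
...     return followers.most_common(k)
...
>>> most_likely_next("heavy")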