One of the most widely used methods natural language is n-gram modeling. This article explains what an n-gram model is, how it is computed, and what the probabilities of an n-gram model tell us.
What is an n-gram?
An n-gram is a contiguous sequence of n items from a given sequence of text.
Given a sentence,
s, we can construct a list of n-grams from
s by finding
pairs of words that occur next to each other. For example, given the sentence
“I am Sam” you can construct bigrams (n-grams of length 2) by finding
consecutive pairs of words.
>>> s = "I am Sam." >>> tokens = s.split(" ") >>> bigrams = [(tokens[i],tokens[i+1]) for i in range(0,len(tokens)-1)] >>> bigrams [('I', 'am'), ('am', 'Sam.')]
Calculating n-gram Probability
Given a list of n-grams we can count the number of occurrences of each n-gram; this count determines the frequency with which an n-gram occurs throughout our document.
>>> from collections import Counter >>> count = Counter(bigrams) >>> count [(('am', 'Sam.'), 1), (('I', 'am'), 1)]
With this small corpus we only count one occurrence of each n-gram. By dividing these counts by the size of all n-grams in our list we would get a probability of 0.5 of each n-gram occurring.
Let’s look a larger corpus of words and see what the probabilities can tell us. The following sequence of bigrams was computed from data downloaded from HC Corpora. It lists the 20 most frequently encountered bigrams out of 97,810,566 bigrams in the entire corpus.
This data represents the most frequently used pairs of words in the corpus along with the number of times they occur.
of the 421560 in the 380608 to the 207571 for the 190683 on the 184430 to be 153285 at the 128980 and the 114232 in a 109527 with the 99141 is a 99053 for a 90209 from the 82223 with a 78918 will be 78049 of a 78009 I was 76788 I have 76621 going to 75088 is the 70045
By consulting our frequency table of bigrams, we can tell that the sentence
There was heavy rain last night is much more likely to be grammatically
correct than the sentence
There was large rain last night by the fact that the
heavy rain occurs much more frequently than
large rain in our corpus.
Said another way, the probability of the bigram
heavy rain is larger than the
probability of the bigram
Sentences as probability models
More precisely, we can use n-gram models to derive a probability of the sentence
W, as the joint probability of each individual word in the sentence,
P(W) = P(w1, w2, ..., wn)
This can be reduced to a sequence of n-grams using the Chain Rule of conditional probability.
P(x1, x2, ..., xn) = P(x1)P(x2|x1)...P(xn|x1,...xn-1)
As a concrete example, let’s predict the probability of the sentence
P('There was heavy rain') = P('There', 'was', 'heavy', 'rain') P('There was heavy rain') = P('There')P('was'|'There')P('heavy'|'There was')P('rain'|'There was heavy')
Each of the terms on the right hand side of this equation are n-gram probabilities that we can estimate using the counts of n-grams in our corpus. To calculate the probability of the entire sentence, we just need to lookup the probabilities of each component part in the conditional probability.
Unfortunately, this formula does not scale since we cannot compute n-grams of every length. For example, consider the case where we have solely bigrams in our model; we have no way of knowing the probability `P(‘rain’|‘There was’) from bigrams.
By using the Markov Assumption, we can simplify our equation by assuming that future states in our model only depend upon the present state of our model. This assumption means that we can reduce our conditional probabilities to be approximately equal so that
P('rain'|'There was heavy') ~ P('rain'|'heavy')
More generally, we can estimate the probability of a sentence by the probabilities of each component part. In the equation that follows, the probability of the sentence is reduced to the probabilities of the sentence’s individual bigrams.
P('There was heavy rain') ~ P('There')P('was'|'There')P('heavy'|'was')P('rain'|'heavy')
What can we use n-gram models for? Given the probabilities of a sentence we can determine the likelihood of an automated machine translation being correct, we could predict the next most likely word to occur in a sentence, we could automatically generate text from speech, automate spelling correction, or determine the relative sentiment of a piece of text.