Image Source: LunarTech
“Predicting the next word is a dance of knowledge and intuition, crafting stories one word at a time.” — Anonymous
In this blog you will read about the N-Gram Language Model, the most basic yet foundational language model. You will learn the mathematics, statistics, and intuition behind it: everything you need to understand more advanced Large Language Models such as GPT, BERT, T5, and others.
Imagine you’re trying to guess the next word in a sentence like, “After the movie, we all went to…” Most likely, you’re thinking of a place, perhaps ‘dinner’, ‘home’, or ‘a party’. This kind of prediction is what language models do, but on a much larger and more complex scale. They analyze vast amounts of text to learn how words typically come together.
For instance, consider the sentence, “The sun rises in the…” You probably thought ‘east’. This is a simple example of how language models use context to predict text. They would recognize that ‘east’ is a far more likely word to follow than ‘kitchen’ or ‘basement’.
Language models are also crucial in applications like voice assistants. When you ask your phone, “What’s the weather like in…”, it needs to understand that you’re likely looking for a city or a location, not a food item or a person’s name. This understanding is based on the probabilities of word sequences.
Want to discover everything about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job? Download this FREE Data Science and AI Career Handbook
FREE Data Science and AI Career Handbook
Want to learn Machine Learning from scratch, or refresh your memory? Download this FREE Machine Learning Fundamentals Handbook to get all Machine Learning fundamentals combined with examples in Python in one place.
FREE Machine Learning Fundamentals Handbook
Want to learn Java Programming from scratch, or refresh your memory? Download this FREE Java Programming Fundamentals Book to get all Java fundamentals combined with interview preparation and code examples.
FREE Java Programming Fundamentals Book
Language models are essential tools in the field of AI and Natural Language Processing (NLP), serving as the foundation for how computers understand and interact with human language. These models function by predicting and generating text, which is crucial for machines to communicate in a way that resembles human speech and writing.
You might have heard of GPT, BERT, and T5, as these are among the most popular Large Language Models. N-Grams are the cornerstone of these language models, and knowing N-Grams will be very helpful for anyone who wants to understand LLMs at their core.
N-grams operate on a simple yet effective principle:
predicting the likelihood of a word sequence based on historical data.
This approach not only makes N-Gram models an excellent educational tool for understanding the basics of language processing in AI but also highlights their significance in the broader context of Machine Learning and Generative AI.
Their contribution to the field lies in laying the groundwork for more advanced language models, making them a fundamental chapter in the story of AI’s evolving ability to understand and generate human language.
Models that assign probabilities to sequences of words are called Language Models or LMs.
An N-gram language model is a probabilistic language model used in NLP that predicts the probability of the next word (or sequence of words) in a sentence based on the occurrence of its preceding n-1 words.
Let’s look at an example to illustrate the concept of computing the probability of a word w given a history h using relative frequency counts. Suppose our history h is “the quick brown fox jumps over the” and we want to estimate the probability that the next word is “lazy”:
P( lazy ∣ the quick brown fox jumps over the)
To estimate the probability of the word “lazy”, we take a large corpus of text, count how often the full phrase “the quick brown fox jumps over the lazy” appears, and divide that by the count of the history “the quick brown fox jumps over the”:

P(lazy | the quick brown fox jumps over the) = C(the quick brown fox jumps over the lazy) / C(the quick brown fox jumps over the)

where C(·) denotes the number of times a phrase occurs in the corpus.
This formula calculates the probability of the word “lazy” coming after this phrase as a ratio of two counts: the numerator is the count of the entire phrase including the word “lazy”, and the denominator is the count of the initial part of the phrase without “lazy”. This ratio gives an estimate of how likely it is that the word following this given phrase will be “lazy” in natural language.
A higher value of this probability suggests that, in the given dataset, “lazy” is a common continuation of the phrase “the quick brown fox jumps over the”. In contrast, a lower value would indicate that “lazy” is an unlikely continuation of this phrase.
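To make the counting concrete, here is a minimal Python sketch of this relative-frequency estimate. The tiny `corpus` string is a hypothetical placeholder, not a real dataset; in practice the counts would come from a much larger corpus.

```python
# A minimal sketch of relative-frequency estimation; `corpus` is a
# tiny hypothetical placeholder for a large collection of text.
corpus = (
    "the quick brown fox jumps over the lazy dog . "
    "the quick brown fox jumps over the fence . "
    "the quick brown fox jumps over the lazy cat ."
)

history = "the quick brown fox jumps over the"
word = "lazy"

# Relative frequency: C(history + word) / C(history)
count_history = corpus.count(history)
count_history_word = corpus.count(history + " " + word)

print(count_history_word / count_history)  # 2 / 3 ≈ 0.67 in this toy corpus
```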
Chain Rule of Probabilities says:
The chain rule of probability allows us to decompose the probability of a sequence of words. For a sequence w_{1:n}, this is given by:

P(w_{1:n}) = P(w_1) P(w_2 | w_1) P(w_3 | w_{1:2}) … P(w_n | w_{1:n-1})

In a more general form, this expression can be written as:

P(w_{1:n}) = ∏_{k=1}^{n} P(w_k | w_{1:k-1})
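As a small worked example, applying the chain rule to the four-word sequence “the quick brown fox” gives:

```latex
% Chain-rule decomposition of the sequence "the quick brown fox"
\begin{aligned}
P(\text{the quick brown fox})
  = {} & P(\text{the}) \cdot P(\text{quick} \mid \text{the}) \\
       & \cdot P(\text{brown} \mid \text{the quick})
         \cdot P(\text{fox} \mid \text{the quick brown})
\end{aligned}
```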
The chain rule of probabilities, applied to our language model (specifically the bigram case), shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given the previous words.
N-Gram models are a fundamental part of NLP and are used to predict the probability of a word based on the preceding words in a given sequence. The basic idea behind N-gram LMs is to approximate the probability of a word given its entire history by considering only a limited number (N-1) of previous words. This approach significantly simplifies the computation and makes it feasible even for very large corpora.
A bigram is an N-gram with N=2, so it considers only the immediately preceding word when predicting a word’s conditional probability. For example, the probability of the word “lazy” following “the quick brown fox jumps over the” is approximated by considering only “the”.
A trigram is an N-gram with N=3, so it considers only the previous two words. For example, in the phrase “the quick brown fox jumps over the lazy”, the probability of “lazy” is predicted based on “over the”.
A 4-gram considers the last three words, and so on. For example, in the same phrase, the probability of “lazy” is predicted based on “jumps over the”.
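As an illustration, here is a short Python sketch of how such bigrams and trigrams can be extracted from a tokenized sentence; the simple whitespace tokenization is an assumption made only for this example.

```python
# A small sketch of extracting n-grams from a tokenized sentence;
# whitespace tokenization is an assumption made for illustration.
def extract_ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps over the lazy dog".split()

print(extract_ngrams(tokens, 2))  # bigrams:  ('the', 'quick'), ('quick', 'brown'), ...
print(extract_ngrams(tokens, 3))  # trigrams: ('the', 'quick', 'brown'), ...
```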
These N-gram language models are based on the assumption that:
the probability of a word can be reasonably approximated by a limited context, rather than needing the entire sentence or paragraph.
This intuition, or assumption, that we are making in N-grams is formally known as the Markov assumption. The Markov assumption states that the probability of a word depends only on the previous word (or, more generally, only on the previous N-1 words).
Markov models are a class of probabilistic models that are very popular in advanced statistics and, specifically, in Quantitative Finance!
Markov models assume that we can predict the probability of some future value without looking too far into the past.
Mathematically, the N-gram model intuition can be formulated as follows:

P(w_n | w_{1:n-1}) ≈ P(w_n | w_{n-N+1:n-1})

where N is the size of the N-gram, so for the bigram case this becomes P(w_n | w_{1:n-1}) ≈ P(w_n | w_{n-1}).
In language modeling, especially when dealing with n-grams, it is crucial to consider the beginning and end of sentences. So, we introduce special tokens: <SOS> (Start Of Sentence) and <EOS> (End Of Sentence). These are common industry abbreviations that are used in many LLM applications.
These tokens help in defining the boundaries of a sentence, allowing the model to learn the probability of a word occurring at the beginning or the end of a sentence.
For instance, in our corpus, each sentence would start with <SOS> and end with <EOS>. This inclusion ensures that transitions into the first word and out of the last word of a sentence are appropriately modelled, enhancing the accuracy and effectiveness of the n-gram model. Here are some examples of <SOS> and <EOS> usage in a corpus:
1. <SOS> The quick brown fox jumps over the lazy dog <EOS>
2. <SOS> The lazy dog sleeps <EOS>
3. <SOS> The fox and the dog are friends <EOS>
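Below is a minimal Python sketch of how these sentence-boundary tokens can be added to a small corpus before counting n-grams; the lowercasing step is a simplifying assumption made for this example.

```python
# A minimal sketch of adding sentence-boundary tokens to a small corpus;
# lowercasing and whitespace tokenization are simplifying assumptions.
SOS, EOS = "<SOS>", "<EOS>"

corpus = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog sleeps",
    "The fox and the dog are friends",
]

tokenized = [[SOS] + sentence.lower().split() + [EOS] for sentence in corpus]
print(tokenized[1])  # ['<SOS>', 'the', 'lazy', 'dog', 'sleeps', '<EOS>']
```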
Given the assumption we make in bigrams for the probability of an individual word, that is:

P(w_n | w_{1:n-1}) ≈ P(w_n | w_{n-1})

we can then compute the probability of a complete word sequence by substituting this into our total product of probabilities, that is:

P(w_{1:n}) = ∏_{k=1}^{n} P(w_k | w_{1:k-1})

which we saw before in the Chain Rule of Probabilities. After replacing these terms for each k, we get:

P(w_{1:n}) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
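For instance, applying this bigram approximation to the second corpus sentence shown earlier gives the following worked decomposition:

```latex
% Bigram approximation of the sentence "<SOS> The lazy dog sleeps <EOS>"
\begin{aligned}
P(\langle\text{SOS}\rangle\ \text{the lazy dog sleeps}\ \langle\text{EOS}\rangle)
  \approx {} & P(\text{the} \mid \langle\text{SOS}\rangle)
               \cdot P(\text{lazy} \mid \text{the})
               \cdot P(\text{dog} \mid \text{lazy}) \\
             & \cdot P(\text{sleeps} \mid \text{dog})
               \cdot P(\langle\text{EOS}\rangle \mid \text{sleeps})
\end{aligned}
```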
When it comes to estimating probabilities in n-gram models, a common method is Maximum Likelihood Estimation (MLE). [check out the Fundamentals to Machine Learning Course for this topic]. This approach involves counting occurrences in a corpus and normalizing these counts to fall between 0 and 1.
Given a hypothetical corpus with 3 sentences including the phrase “the quick brown fox jumps over the lazy dog”, we calculate some bigram probabilities using MLE.
Our Hypothetical Corpus:
1. <SOS> The quick brown fox jumps over the lazy dog <EOS>
2. <SOS> The lazy dog sleeps <EOS>
3. <SOS> The fox and the dog are friends <EOS>
we can compute some of the bigram probabilities as follows (treating “The” and “the” as the same token):

P(the | <SOS>) = C(<SOS> the) / C(<SOS>) = 3/3 = 1
P(lazy | the) = C(the lazy) / C(the) = 2/5
P(dog | lazy) = C(lazy dog) / C(lazy) = 2/2 = 1
P(<EOS> | dog) = C(dog <EOS>) / C(dog) = 1/3
For a general bigram, we calculate this by taking the count of the bigram C(w_{n-1} w_n) and normalizing it by the total count of all bigrams that start with w_{n-1}:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w)

This sum of bigram counts starting with w_{n-1} is equal to the unigram count of w_{n-1}. Therefore, the probability simplifies to:

P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})

For the general case of an N-gram with MLE parameter estimation, the probability of a word w_n given its preceding context of N-1 words is estimated as follows:

P(w_n | w_{n-N+1:n-1}) = C(w_{n-N+1:n-1} w_n) / C(w_{n-N+1:n-1})
This ratio is referred to as a relative frequency. As mentioned earlier, this approach of using relative frequencies to estimate probabilities is an example of MLE.
In MLE (Maximum Likelihood Estimation), as the name suggests, the resulting parameters are the ones that jointly maximise the likelihood of the training set T given the model M, i.e., P(T | M).
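Here is a minimal Python sketch that reproduces these bigram MLE estimates on the toy corpus above; the lowercasing and whitespace tokenization are simplifying assumptions made for illustration.

```python
from collections import Counter

# Bigram MLE estimation on the toy corpus; lowercasing is a simplifying assumption.
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "The lazy dog sleeps",
    "The fox and the dog are friends",
]

unigram_counts, bigram_counts = Counter(), Counter()
for sentence in sentences:
    tokens = ["<SOS>"] + sentence.lower().split() + ["<EOS>"]
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(prev_word, word):
    """MLE estimate P(word | prev_word) = C(prev_word word) / C(prev_word)."""
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

print(bigram_prob("<SOS>", "the"))  # 3/3 = 1.0
print(bigram_prob("the", "lazy"))   # 2/5 = 0.4
print(bigram_prob("lazy", "dog"))   # 2/2 = 1.0
```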
In language models, we commonly represent and compute these probabilities using their logarithmic form instead of pure probabilities, known as log probabilities.
This log-based approach is crucial because probabilities, by their very nature, are values in range [0,1]. When we multiply several such probabilities, as is often the case in language models, the resultant product of probabilities becomes exceedingly small.
This can lead to a computational issue called numerical underflow, where this product is so small that it is effectively treated as zero, which is something we want to avoid, so we use log probabilities instead.
In logarithmic space, multiplication of probabilities translates to addition of their logarithms (this is pure mathematics). This means that instead of multiplying n-gram probabilities, we can add their log values, which are more manageable numbers.
Also, computations in log space are more stable and efficient, and simpler to inspect. The final probabilities can easily be retrieved by taking the exponential (exp) of the accumulated log probabilities.
For example, the product of probabilities p1 × p2 × p3 × p4 can be computed as exp(log p1 + log p2 + log p3 + log p4). This method ensures the accuracy and efficiency of our calculations, making it a standard practice in evaluating language models.
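The following short Python sketch illustrates the underflow problem and the log-space workaround; the probability values are arbitrary numbers chosen only for illustration.

```python
import math

# Multiplying many small probabilities underflows to zero in floating point,
# while summing their logs stays perfectly representable.
probs = [0.01] * 200  # 200 arbitrary small probabilities

naive_product = math.prod(probs)           # mathematically 1e-400
log_sum = sum(math.log(p) for p in probs)  # 200 * log(0.01)

print(naive_product)  # 0.0  -> numerical underflow
print(log_sum)        # about -921.03, easy to work with
# exp(log_sum) would also underflow for a value this tiny,
# which is why language-model scores are usually kept in log space.
```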
Note: I will cover the evaluation of Language Models in upcoming blogs so stay tuned!
Github Repo: https://github.com/TatevKaren
I am Tatev, Senior Machine Learning and AI Researcher. I have had the privilege of working in Data Science across numerous countries, including the US, UK, Canada, and the Netherlands.
With an MSc and BSc in Econometrics under my belt, my journey in Machine Learning and AI has been nothing short of incredible. Drawing from my technical studies during my Bachelor's and Master's, along with over 5 years of hands-on experience in the Data Science industry, in Machine Learning and AI, including NLP, LLMs and GenAI, I’ve gathered this knowledge to share with you.
After gaining so much from this guide, if you’re keen to dive even deeper and structured learning is your style, consider joining us at LunarTech. Become a job-ready data scientist with The Ultimate Data Science Bootcamp, which has earned the recognition of being one of the Best Data Science Bootcamps of 2023 and has been featured in esteemed publications like Forbes, Yahoo, Entrepreneur and more. This is your chance to be a part of a community that thrives on innovation and knowledge. [Enroll in The Ultimate Data Science Bootcamp at LunarTech]
[The Data Science and AI Newsletter | Tatev Karen | Substack](https://tatevaslanyan.substack.com/ "tatevaslanyan.substack.com")
Where businesses meet breakthroughs, and enthusiasts transform to experts! From creator of 2023 top-rated Data Science…
Want to learn Machine Learning from scratch, or refresh your memory? Download this FREE Machine Learning Fundamentals Handbook
Want to discover everything about a career in Data Science, Machine Learning and AI, and learn how to secure a Data Science job? Download this FREE Data Science and AI Career Handbook.
Thank you for choosing this guide as your learning companion. As you continue to explore the vast field of machine learning, I hope you do so with confidence, precision, and an innovative spirit. Best wishes in all your future endeavors!