Predicting the Next Word in 5 Easy Steps Using Python

You may have observed that when you’re typing on your mobile phone, it predicts the next word you might want to use. It’s a feature that makes typing quicker and saves you time.

Whether you’re texting, searching the internet, or writing an email, predictive text is convenient. But have you ever wondered how your phone knows which word to suggest next?

In this guide, I’ll show you how to create word predictions using Python.

To follow along easily, having a basic understanding of NLP helps. If you’re new to this, don’t worry; you can quickly get the basics from our article “NLP Simplified,” where we explain it simply.

Applications of Next Word Prediction

Next word prediction improves the speed and accuracy of typing on mobile devices, making it highly beneficial for text messaging and communication apps.
Search engines use predictive text to suggest search queries, making it easier for users to find relevant information quickly.
It helps in auto-correcting misspelled words and reducing typing errors in various applications, including word processors and email clients.
Developers and programmers benefit from predictive text when writing code, as it suggests relevant functions, methods, and variable names.
Online platforms and streaming services use predictive text to recommend relevant content to users.

Let’s walk through this NLP technique one step at a time.

Data Preparation

First, we import the libraries the project needs. Then we define a sample text that the model will learn from. You can replace it with any text data of your choice.

You can also use a dataset with text data, which you can easily find on Kaggle or a similar platform.

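# Import Necessary Libraries

import nltk
from nltk import ngrams
from collections import defaultdict
import random

# word_tokenize relies on NLTK's Punkt tokenizer data. If it is missing,
# download it once with nltk.download("punkt") (newer NLTK releases may
# ask for "punkt_tab" instead).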

Next, define the sample text that the model will use to predict the next word.

# Sample Text Data
text = """
Once upon a luminous, starry night in the quaint, enigmatic town of Serendipity, 
a curious young explorer named Amelia embarked on an extraordinary adventure. 
With her trusty magnifying glass in hand and an indomitable spirit, she embarked on a quest to discover the elusive Elysian treasure hidden deep within the labyrinthine forest. 
As she ventured through the verdant woods, Amelia encountered an eccentric, talking squirrel named Percival, who spoke in riddles and guided her toward the treasure's whereabouts. 
The forest was resplendent with bioluminescent flora, illuminating her path with a kaleidoscope of colors. 
Amelia soon reached a precipice overlooking an awe-inspiring, cerulean waterfall, its cascading waters echoing a melodious serenade. 
Beside the waterfall stood a colossal, moss-covered stone with cryptic inscriptions. 
With Percival's guidance, she deciphered the ancient runes and uncovered the entrance to the treasure trove. 
Inside, she discovered an opulent chest adorned with intricate, golden filigree. 
Upon opening it, a symphony of shimmering jewels, radiant gemstones, and glistening artifacts greeted her with an ethereal glow. 
The Elysian treasure was hers, a testament to her dauntless courage and insatiable curiosity. 
Amelia's return to Serendipity was celebrated with jubilant revelry, and her remarkable journey became a legend, inspiring others to embark on their own adventures in the wondrous realm of imagination and discovery.
"""

Swap in your own text here if you prefer.

Tokenization

We will preprocess our text and tokenize it. Tokenization is the process of breaking the text into individual words or tokens. We use the nltk library in Python to tokenize our text.

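To ensure that our model focuses on words and ignores case or punctuation, we perform preprocessing. This step converts all words to lowercase and removes punctuation. Note that the isalnum() check keeps only tokens made entirely of letters and digits, so hyphenated words such as “awe-inspiring” are dropped along with the punctuation.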

import nltk

# Tokenize the text into words
words = nltk.word_tokenize(text)

# Preprocess the words (convert to lowercase, remove punctuation)
words = [word.lower() for word in words if word.isalnum()]

words

After preprocessing and tokenization, we will get all words in lowercase and without punctuation.

(Screenshot: Tokenized Words)
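
To make the effect of the filter concrete, here is a small sketch on a hand-written token list. The list only imitates what a tokenizer might return for part of the sample text (it is an assumption, not captured nltk output), and clean_keep_hyphens is a hypothetical alternative, not part of the model built in this guide.

```python
import string

# Hand-written tokens standing in for tokenizer output (an assumption for
# illustration; real nltk.word_tokenize output may differ in detail).
tokens = ["Amelia", "reached", "an", "awe-inspiring", ",", "cerulean", "waterfall", "."]

# The filter used in this guide: lowercase, keep only purely alphanumeric tokens.
strict = [t.lower() for t in tokens if t.isalnum()]
print(strict)  # "awe-inspiring" is gone, not just "," and "."


def clean_keep_hyphens(tokens):
    """Hypothetical alternative: strip surrounding punctuation, keep inner hyphens."""
    cleaned = []
    for t in tokens:
        t = t.lower().strip(string.punctuation)
        if t:  # skip tokens that were punctuation only
            cleaned.append(t)
    return cleaned


print(clean_keep_hyphens(tokens))
```

Which variant suits you depends on whether hyphenated words matter in your text. The rest of this guide keeps the stricter isalnum() filter.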

Building N-grams

In this step, we’re going to create N-grams, which are sequences of N consecutive words taken from the text.

In our code, we’re going to create bigrams, where N is equal to 2, so each N-gram is a pair of adjacent words.

This is a fundamental step in building a prediction model for the next word because it allows us to analyze word sequences and predict the next word based on the context provided by the previous N-1 words.

# Define the order of the N-gram model (N=2 for bigrams)
N = 2

# Create N-grams from the tokenized words
ngrams_list = list(ngrams(words, N))

# Create a defaultdict to store N-grams and their frequency
ngram_freq = defaultdict(int)
for ngram in ngrams_list:
    # Increment the count each time this N-gram occurs
    ngram_freq[ngram] += 1

These N-grams serve as the building blocks for training and implementing our next-word prediction model.
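
If you want to see what the counting step produces without installing anything, here is a minimal standard-library sketch of the same idea. It uses zip and Counter in place of nltk’s ngrams helper and a defaultdict, and the tiny word list is made up for illustration.

```python
from collections import Counter

# Made-up word list, already lowercased and stripped of punctuation.
words = ["the", "forest", "was", "dark", "and", "the", "forest", "was", "quiet"]

# Pair each word with its successor. zip stops at the shorter sequence,
# so the last word never starts a bigram.
bigrams = list(zip(words, words[1:]))

# Count how often each pair occurs.
bigram_counts = Counter(bigrams)

print(len(bigrams))                      # 8 pairs from 9 words
print(bigram_counts[("the", "forest")])  # 2
print(bigram_counts[("was", "dark")])    # 1
```

A text with W words always yields W - 1 bigrams, which is why small training texts give the model very little to count.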

Define Function

In this step, we create a function called ‘predict_next_word’ that guesses the next word in a sentence based on a provided prefix (a sequence of words).

This function is crucial in the next word prediction model, as it takes the context provided by the prefix and uses it to make a prediction about the most likely next word.

In simple terms, here is what the function does:

The function looks at all the word pairs (bigrams) in our text data that start with the provided prefix (the words before the missing word).
It counts how often each word appears in those pairs and sorts them by frequency, from most to least common.
The function then suggests the word that occurs most often as the next word after the given prefix.

# Define Function
def predict_next_word(prefix):
    # prefix must be a tuple of N - 1 lowercase words, e.g. ("the",) for bigrams
    # Filter N-grams that start with the given prefix
    matching_ngrams = [(ngram, freq) for ngram, freq in ngram_freq.items() if ngram[:-1] == prefix]

    if not matching_ngrams:
        return "No prediction available."

    # Sort N-grams by frequency in descending order
    sorted_ngrams = sorted(matching_ngrams, key=lambda x: x[1], reverse=True)

    # Select the N-gram with the highest frequency as the prediction
    prediction = sorted_ngrams[0][0][-1]

    return prediction

With this function in place, we can suggest a next word for any prefix that appears in the training text.
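
The function above scans every stored N-gram on each call. For larger texts, one possible variation is to group the counts by prefix once and then look predictions up directly. The sketch below is an assumption-laden illustration of that idea: build_index, top_predictions, and the toy counts are hypothetical names and data, not part of the code above.

```python
from collections import Counter, defaultdict


def build_index(ngram_counts):
    """Group n-gram counts by prefix: {prefix: Counter({next_word: count})}."""
    index = defaultdict(Counter)
    for ngram, count in ngram_counts.items():
        index[ngram[:-1]][ngram[-1]] += count
    return index


def top_predictions(index, prefix, k=3):
    """Return up to k most frequent next words for prefix (empty list if unseen)."""
    if prefix not in index:
        return []
    return [word for word, _ in index[prefix].most_common(k)]


# Toy counts in the same shape as ngram_freq (made up for illustration).
toy_counts = {
    ("the", "forest"): 3,
    ("the", "treasure"): 2,
    ("the", "waterfall"): 1,
    ("her", "path"): 1,
}
index = build_index(toy_counts)

print(top_predictions(index, ("the",), k=2))  # ['forest', 'treasure']
print(top_predictions(index, ("a",)))         # []
```

Returning several candidates instead of one is closer to what a phone keyboard shows, and an empty list is easier to handle in code than a message string.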

Testing

This code lets you test the model with your own input. You type a prefix, press Enter, and the model predicts the next word. If the input does not contain exactly N - 1 words (one word for bigrams), it prints a message asking for a valid prefix.

# You can use this code snippet to interactively test the model with user input
user_input = input("Enter a prefix for next-word prediction: ").lower().split()
if len(user_input) != N - 1:
    print("Please enter a valid prefix.")
else:
    prefix = tuple(user_input)
    prediction = predict_next_word(prefix)
    print(f"Next word prediction: {prediction}")

Running the code shows a prompt where you can type a prefix and press Enter.

(Screenshot: Enter Prefix)

After you press Enter, the predicted next word is printed.

(Screenshot: Predicted Word)

It’s a way to demonstrate how the next-word prediction model can be used in practice.
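
If you would like the prompt to repeat until the input is valid, a loop is one way to do it. The sketch below is a minimal illustration: read_prefix and its read and max_tries parameters are hypothetical names introduced here, and the reader is passed in only so the loop can be tried without a keyboard.

```python
def read_prefix(n, read=input, max_tries=3):
    """Ask for a prefix of n - 1 words, re-prompting on invalid input.

    `read` is injectable so the loop can be exercised without a keyboard.
    Returns a tuple of lowercase words, or None after max_tries failures.
    """
    for _ in range(max_tries):
        words = read("Enter a prefix for next-word prediction: ").lower().split()
        if len(words) == n - 1:
            return tuple(words)
        print(f"Please enter exactly {n - 1} word(s).")
    return None


# Simulated session: an invalid two-word entry followed by a valid one.
answers = iter(["once upon", "Amelia"])
prefix = read_prefix(2, read=lambda prompt: next(answers))
print(prefix)  # ('amelia',)
```

The returned tuple can be passed straight to predict_next_word, since that function compares the prefix against tuples.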

Challenges:

The accuracy of next-word prediction heavily depends on the size and quality of the training data. Limited or noisy data can lead to less accurate predictions.
If the prefix never appears in the training data, the model has no N-grams to match and cannot make a prediction.
Punctuation and contractions can affect prediction accuracy because they change how the text is split into tokens.
Incorrect tokenization or preprocessing can lead to incorrect predictions.
Many words have multiple meanings, and the context may not always disambiguate them.
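
The unseen-prefix problem is easy to reproduce. The sketch below shows one simple way to soften it, assuming a fallback to the most frequent word overall when a prefix has never been seen. The corpus and the predict_with_fallback name are made up for illustration, and this is only one of several possible strategies.

```python
from collections import Counter, defaultdict

# Made-up corpus, already lowercased and stripped of punctuation.
words = "the forest was dark and the forest was quiet and the path ended".split()

# Bigram successors and overall word frequencies from the same corpus.
successors = defaultdict(Counter)
for current, following in zip(words, words[1:]):
    successors[current][following] += 1
unigrams = Counter(words)


def predict_with_fallback(word):
    """Predict from bigram counts, else fall back to the most frequent word."""
    if word in successors:
        return successors[word].most_common(1)[0][0]
    return unigrams.most_common(1)[0][0]


print(predict_with_fallback("forest"))    # was
print(predict_with_fallback("squirrel"))  # the (fallback: "squirrel" was never seen)
```

The fallback always returns something, but it ignores context entirely, so treat it as a last resort rather than a real prediction.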

How to Improve Accuracy

Using a larger and more diverse dataset improves the model’s understanding of various contexts and words.
Consider using higher-order N-grams (e.g., trigrams) for more context, but balance it with data availability.
Collect user feedback and continuously improve the model based on real-world usage.
Regularly evaluate the model’s performance with appropriate metrics and adjust strategies accordingly.
You can implement neural network-based models, such as LSTM or Transformer, for more complex context modeling.
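
As a rough illustration of the higher-order N-gram suggestion, here is a minimal sketch of a trigram model that backs off to a bigram context when the longer context was never seen. The train and predict functions, the max_n parameter, and the corpus are assumptions made for this example. A practical model would also need smoothing and far more data.

```python
from collections import Counter, defaultdict


def train(words, max_n=3):
    """Count next words for every context of length 1 .. max_n - 1."""
    model = defaultdict(Counter)
    for size in range(1, max_n):
        for i in range(len(words) - size):
            context = tuple(words[i:i + size])
            model[context][words[i + size]] += 1
    return model


def predict(model, context, max_n=3):
    """Try the longest usable context first, then back off to shorter ones."""
    context = tuple(context)[-(max_n - 1):]
    while context:
        if context in model:
            return model[context].most_common(1)[0][0]
        context = context[1:]  # drop the oldest word and retry
    return None


# Made-up corpus, already lowercased and stripped of punctuation.
words = "she opened the chest and she opened the door and he closed the door".split()
model = train(words)

print(predict(model, ["she", "opened"]))     # the (two-word context found)
print(predict(model, ["quietly", "the"]))    # door (backed off to "the")
print(predict(model, ["unknown", "words"]))  # None (nothing matched)
```

Longer contexts give more specific predictions but match less often, which is the balance with data availability mentioned above.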

Final Words

In the world of Natural Language Processing, predicting the next word is a valuable skill. With these 5 simple Python steps, you’ve gained a powerful tool for faster communication and smarter technology.

Keep exploring and using this knowledge to enhance your language experiences. The journey has just begun!

