What is an N-gram?

An N-gram is a contiguous sequence of N items taken from a given sample of text or speech. These items can be characters, words, or symbols.

The term is also used loosely for the N-gram model, a probabilistic language model built on these sequences. An N-gram model predicts the next item in a sequence from the N-1 items that precede it.

For instance, when the items are words:

  • A unigram (N=1) would be a single word.
  • A bigram (N=2) would be a sequence of two adjacent words.
  • A trigram (N=3) would be a sequence of three adjacent words.
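
The list above can be illustrated with a short Python sketch. The helper name `ngrams` and the sample sentence are made up for this example and are not taken from any particular library; the function simply slides a window of length N over a sequence.

```python
def ngrams(items, n):
    """Return every contiguous window of length n over `items`, as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence shorter than n yields an empty list.
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Hypothetical sample text, split on whitespace into word items.
words = "the quick brown fox".split()

unigrams = ngrams(words, 1)   # single words
bigrams = ngrams(words, 2)    # pairs of adjacent words
trigrams = ngrams(words, 3)   # triples of adjacent words

# The same helper works on characters, since a string is also a sequence.
char_bigrams = ngrams("fox", 2)

print(bigrams)
print(trigrams)
print(char_bigrams)
```

Splitting on whitespace is the simplest possible tokenization; real text usually needs handling for punctuation and casing before N-grams are extracted.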

N-grams are commonly used in natural language processing (NLP) tasks such as text mining, sentiment analysis, machine translation, and speech recognition. Because they record which items occur next to each other, they capture local word order and co-occurrence patterns, which supports tasks like language modeling, predictive text input, and information retrieval. They do not capture dependencies that span more than N items.
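
To make the language-modeling and predictive-text use concrete, here is a minimal sketch of a bigram (N=2) model in Python. It assumes plain count-based (maximum-likelihood) estimates with no smoothing. The function names and the tiny corpus are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(words):
    """Count, for each word, how often each other word directly follows it."""
    follower_counts = defaultdict(Counter)
    for previous, current in zip(words, words[1:]):
        follower_counts[previous][current] += 1
    return follower_counts


def next_word_probabilities(model, previous):
    """Estimate P(next | previous) as count(previous, next) / count(previous, *)."""
    counts = model.get(previous)
    if not counts:
        return {}  # unseen context: this unsmoothed sketch has no estimate
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus; a real model would be trained on far more text.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(corpus)

probs = next_word_probabilities(model, "the")
suggestion = max(probs, key=probs.get)  # most likely next word after "the"
print(probs, suggestion)
```

In this corpus "the" is followed by "cat" twice and "mat" once, so the sketch suggests "cat". A practical model would add smoothing so that word pairs never seen in training do not receive zero probability.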