Experiment 2:
Demonstrate N-gram modeling to analyze and establish the probability distribution across sentences, and explore the use of unigrams, bigrams, and trigrams in diverse English sentences to illustrate the impact of varying n-gram orders on the calculated probabilities.
What is an N-gram?
An N-gram is a sequence of 'n' words that appear next to
each other in a sentence or a piece of text. The value of 'n' tells you how
many words are in that sequence. For example, if we have the sentence:
"I love programming"
- Unigram (1-gram): Each word is considered individually. The unigrams for this sentence are:
  - I
  - love
  - programming
- Bigram (2-gram): A bigram looks at pairs of consecutive words. For the same sentence, the bigrams are:
  - I love
  - love programming
- Trigram (3-gram): A trigram looks at triplets of consecutive words. For the sentence above, the only trigram is:
  - I love programming
As you can see, the number of words in the sequence
increases as you move from unigrams to bigrams to trigrams.
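As a sketch, the n-grams above can be generated with a few lines of Python using only the standard library (the `ngrams` helper below is written for this illustration, not a library function):

```python
def ngrams(words, n):
    """Return all n-grams (as tuples) from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "I love programming".split()
print(ngrams(words, 1))  # [('I',), ('love',), ('programming',)]
print(ngrams(words, 2))  # [('I', 'love'), ('love', 'programming')]
print(ngrams(words, 3))  # [('I', 'love', 'programming')]
```

A sentence of length L yields L - n + 1 n-grams, which is why the list shrinks as n grows.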
Why is 'n' important?
The number 'n' in N-grams tells us how many words we look at together. For instance:
- Unigrams focus on individual words.
- Bigrams help us understand how two words come together to form meaning.
- Trigrams take a closer look at how three words work together to convey meaning.
Increasing 'n' allows us to capture more context. The more
words we include, the better we can understand the relationships and meanings
behind them.
Examples of N-grams in Action
- Unigrams (1-grams):
  - Definition: Unigrams are individual words in a sentence.
  - Example: If the sentence is "I enjoy reading books," the unigrams are:
    - I
    - enjoy
    - reading
    - books
- Bigrams (2-grams):
  - Definition: Bigrams are pairs of consecutive words in a sentence.
  - Example: In the same sentence, the bigrams are:
    - I enjoy
    - enjoy reading
    - reading books
Bigrams help us understand the relationship between two
consecutive words. For example, “enjoy reading” makes more sense
together than the words “enjoy” or “reading” separately.
- Trigrams (3-grams):
  - Definition: Trigrams are triplets of consecutive words in a sentence.
  - Example: For the sentence "I enjoy reading books," the trigrams are:
    - I enjoy reading
    - enjoy reading books
Trigrams give us even more context: "enjoy reading books" carries more meaning as a unit than looking at "enjoy", "reading", or "books" separately.
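The examples above can be turned into simple frequency counts. The sketch below counts bigrams over a tiny made-up corpus (the sentences are invented for illustration) using `collections.Counter`:

```python
from collections import Counter

def ngrams(words, n):
    """Return all n-grams (as tuples) from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# A tiny illustrative corpus, made up for this sketch.
corpus = [
    "I enjoy reading books",
    "I enjoy reading",
    "reading books is fun",
]

bigram_counts = Counter()
for sentence in corpus:
    bigram_counts.update(ngrams(sentence.split(), 2))

# ('I', 'enjoy'), ('enjoy', 'reading'), and ('reading', 'books')
# each occur twice in this corpus.
print(bigram_counts.most_common(3))
```

Counts like these are the raw material for the probability estimates that n-gram models are built on.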
Why Are N-grams Important?
N-grams are crucial for understanding how words are
connected in sentences and for making sense of the language. In simple terms,
they help machines learn patterns in text, making it possible for computers to:
- Predict the next word: In tasks like autocomplete or chatbots, N-grams help computers predict which word is most likely to come next.
- Improve search engines: When searching for information, understanding common phrases (bigrams, trigrams) can help improve the results returned.
- Language translation: N-grams are used in translating text from one language to another, because they help the machine understand the relationships between words in both languages.
Applications of N-grams
- Speech Recognition: N-grams are used to recognize words in spoken language, for example in voice assistants like Siri or Alexa. They predict which words are likely to follow each other.
- Text Prediction: In predictive text systems, such as texting on smartphones or typing in word processors, N-grams help suggest the next word based on the previous ones.
- Sentiment Analysis: When analyzing the sentiment (positive or negative) of a sentence, N-grams can help detect patterns of words that typically indicate sentiment, like "love programming" (positive) or "hate bugs" (negative).
- Machine Translation: When translating one language to another, N-grams help machines understand word patterns and improve the accuracy of translations.
How Does Increasing the Value of 'n' Affect the Model?
- Unigrams (n=1): Only individual words are considered. While simple, this doesn't capture relationships between words. For instance, "I love programming" would be seen as three separate words.
- Bigrams (n=2): This captures some context, such as how "love programming" appears together frequently. It helps in understanding common word pairings.
- Trigrams (n=3): This gives a deeper level of understanding by capturing patterns of three words, such as "machine learning model" or "data science approach". This helps in predicting sequences of words that make sense together.
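To see how the choice of n changes the calculated probabilities, the sketch below scores the same sentence under a unigram and a bigram model, both estimated by simple maximum-likelihood counts over a tiny made-up corpus (the `<s>` start token and the corpus sentences are assumptions of this sketch):

```python
from collections import Counter

def ngrams(words, n):
    """Return all n-grams (as tuples) from a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Tiny made-up corpus; <s> marks the start of each sentence.
corpus = ["<s> I love programming", "<s> I love books", "<s> I hate bugs"]
tokens = [w for s in corpus for w in s.split()]

uni = Counter(ngrams(tokens, 1))
bi = Counter()
for s in corpus:
    bi.update(ngrams(s.split(), 2))

def p_unigram(w):
    # MLE estimate: count(w) / total number of tokens
    return uni[(w,)] / sum(uni.values())

def p_bigram(w, prev):
    # MLE estimate: count(prev w) / count(prev)
    return bi[(prev, w)] / uni[(prev,)]

sent = ["I", "love", "programming"]

p1 = 1.0
for w in sent:
    p1 *= p_unigram(w)        # (3/12) * (2/12) * (1/12)

p2, prev = 1.0, "<s>"
for w in sent:
    p2 *= p_bigram(w, prev)   # 1 * (2/3) * (1/2)
    prev = w

print(f"unigram: {p1:.4f}  bigram: {p2:.4f}")  # unigram: 0.0035  bigram: 0.3333
```

Because the bigram model conditions each word on its predecessor, it assigns a much higher probability to this sentence than the context-free unigram model, which is exactly the effect of increasing n that this experiment sets out to illustrate.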