Experiment 5
Given the following short movie reviews, each labelled with a genre, either comedy or action:
● fun, couple, love, love (comedy)
● fast, furious, shoot (action)
● couple, fly, fast, fun, fun (comedy)
● furious, shoot, shoot, fun (action)
● fly, fast, shoot, love (action)
and a new document D: fast, couple, shoot, fly.
Compute the most likely class for D. Assume a Naive Bayes classifier and use add-1 smoothing for the likelihoods.
Aim:
To classify the new document D (which contains the words "fast", "couple", "shoot", and "fly") using a Naive Bayes classifier, we calculate the posterior score of each class (comedy or action) and compare the results, using add-1 smoothing for the likelihoods.
Procedure:
1) Open Anaconda Navigator.
2) Click on Launch under Jupyter Notebook.
3) Once Jupyter Notebook opens in the browser, create a new notebook by selecting New -> Python 3.
4) Install the necessary libraries (e.g., nltk, spacy).
5) After completing your analysis, save your work: click File > Save and Checkpoint or use the keyboard shortcut Ctrl + S.
6) Export the notebook (optional): to share your notebook or convert it into another format (such as PDF or HTML), click File > Download as and select the format you wish to export to (e.g., PDF, HTML, Markdown).
7) Shut down Jupyter Notebook: close the Jupyter Notebook tab in your browser, or press Ctrl + C in the command line to stop the server.
Theory:
Step 1: Organize the Data
We have two classes, comedy and action, and the following labelled movie reviews:
Comedy class:
- fun, couple, love, love (comedy)
- couple, fly, fast, fun, fun (comedy)
Action class:
- fast, furious, shoot (action)
- furious, shoot, shoot, fun (action)
- fly, fast, shoot, love (action)
The document D consists of the words "fast", "couple", "shoot", and "fly".
Step 2: Calculate Class Prior Probabilities
The class prior probabilities are based on the number of documents in each class.
- Total number of documents = 5
- Number of comedy documents = 2
- Number of action documents = 3
The prior probabilities are:
P(comedy) = 2/5
P(action) = 3/5
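The prior computation above can be sketched in a few lines of Python; `fractions.Fraction` keeps the values exact rather than as floats (the list of labels mirrors the five training reviews in order):

```python
from fractions import Fraction

# Class labels of the five training reviews, in order
labels = ["comedy", "action", "comedy", "action", "action"]

# Prior for each class = (documents in the class) / (total documents)
priors = {c: Fraction(labels.count(c), len(labels)) for c in set(labels)}

print(priors["comedy"], priors["action"])  # 2/5 3/5
```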
Step 3: Compute the Word Likelihoods with Add-1 Smoothing
For each class, we calculate the likelihood of each word given the class. We apply add-1 smoothing (also known as Laplace smoothing) to account for words that do not appear in a class's training data.
Vocabulary:
We take the union of the words that appear across all reviews:
- Words in comedy reviews: "fun", "couple", "love", "fly", "fast"
- Words in action reviews: "fast", "furious", "shoot", "fun", "fly", "love"
Thus, the total vocabulary size V is 7 distinct words: ["fun", "couple", "love", "fly", "fast", "shoot", "furious"].
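The vocabulary union can be checked with a short sketch (the tokenized reviews below are simply the training data hand-copied into lists):

```python
# The five training reviews, tokenized and grouped by class
comedy_docs = [
    ["fun", "couple", "love", "love"],
    ["couple", "fly", "fast", "fun", "fun"],
]
action_docs = [
    ["fast", "furious", "shoot"],
    ["furious", "shoot", "shoot", "fun"],
    ["fly", "fast", "shoot", "love"],
]

# Vocabulary = union of the distinct words across all reviews
vocab = {w for doc in comedy_docs + action_docs for w in doc}
print(sorted(vocab))
print(len(vocab))  # 7
```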
Calculate Likelihoods for Comedy Class:
The raw word counts in the comedy documents are:
- "fun": appears 3 times.
- "couple": appears 2 times.
- "love": appears 2 times.
- "fly": appears 1 time.
- "fast": appears 1 time.
- "shoot": appears 0 times.
- "furious": appears 0 times.
Total words in comedy documents = 3 + 2 + 2 + 1 + 1 = 9 words.
With add-1 smoothing, each likelihood is P(w | comedy) = (count(w) + 1) / (9 + V) = (count(w) + 1) / 16.
Calculate Likelihoods for Action Class:
The raw word counts in the action documents are:
- "fun": appears 1 time.
- "couple": appears 0 times.
- "love": appears 1 time.
- "fly": appears 1 time.
- "fast": appears 2 times.
- "shoot": appears 4 times.
- "furious": appears 2 times.
Total words in action documents = 1 + 0 + 1 + 1 + 2 + 4 + 2 = 11 words.
With add-1 smoothing, each likelihood is P(w | action) = (count(w) + 1) / (11 + V) = (count(w) + 1) / 18.
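Putting the pieces together, the whole classification can be carried out with a minimal Python sketch; exact fractions make it easy to check the hand computation (the nested-list layout of the training data is an assumption of this sketch, not part of the exercise):

```python
from collections import Counter
from fractions import Fraction

# Training reviews, tokenized and grouped by class
classes = {
    "comedy": [["fun", "couple", "love", "love"],
               ["couple", "fly", "fast", "fun", "fun"]],
    "action": [["fast", "furious", "shoot"],
               ["furious", "shoot", "shoot", "fun"],
               ["fly", "fast", "shoot", "love"]],
}

vocab = {w for docs in classes.values() for doc in docs for w in doc}
V = len(vocab)                                            # 7 distinct words
total_docs = sum(len(docs) for docs in classes.values())  # 5 reviews

D = ["fast", "couple", "shoot", "fly"]                    # the new document

posteriors = {}
for c, docs in classes.items():
    counts = Counter(w for doc in docs for w in doc)
    n_words = sum(counts.values())               # 9 for comedy, 11 for action
    score = Fraction(len(docs), total_docs)      # class prior: 2/5 or 3/5
    for w in D:
        # Add-1 (Laplace) smoothed likelihood of w given the class
        score *= Fraction(counts[w] + 1, n_words + V)
    posteriors[c] = score

print(posteriors["comedy"])                      # 3/40960  (~7.3e-5)
print(posteriors["action"])                      # 1/5832   (~1.7e-4)
print(max(posteriors, key=posteriors.get))       # action
```

Because Naive Bayes only compares the classes, the scores are left unnormalised; dividing each by their sum would give the true posterior probabilities.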
Step 4: Compute the Posteriors
With V = 7, the smoothed denominators are 9 + 7 = 16 for comedy and 11 + 7 = 18 for action. For D = "fast, couple, shoot, fly":
P(comedy) · P(D | comedy) = 2/5 × 2/16 × 3/16 × 1/16 × 2/16 ≈ 7.3 × 10^-5
P(action) · P(D | action) = 3/5 × 3/18 × 1/18 × 5/18 × 2/18 ≈ 1.7 × 10^-4
Result:
Since the action score is higher, the Naive Bayes classifier with add-1 smoothing classifies the new document "fast, couple, shoot, fly" as action. The classification is based on the prior probabilities and the smoothed likelihoods derived from the training data.