Experiment 6:
● Study the various corpora – Brown, Inaugural, Reuters, UDHR – with methods like fileids(), raw(), words(), sents(), and categories().
● Create and use your own corpora (plaintext, categorical).
● Study conditional frequency distributions.
● Study tagged corpora with methods like tagged_sents() and tagged_words().
● Write a program to find the most frequent noun tags.
● Map words to properties using Python dictionaries.
● Study the rule-based tagger and the Unigram tagger.
● Find the different words in a given plain text written without spaces by comparing the text against a given corpus of words, and also find the score of each word.
Aim
The aim of this lab exercise is to explore and demonstrate various Natural Language Processing (NLP) techniques using the NLTK library. This includes studying and analyzing different corpora such as Brown, Inaugural, Reuters, and UDHR using methods like words(), sents(), categories(), and raw(). Additionally, the exercise involves creating custom corpora (both plaintext and categorical), studying Conditional Frequency Distributions (CFDs), and working with tagged corpora to extract the most frequent noun tags. The lab also covers mapping words to properties using Python dictionaries, implementing rule-based and unigram taggers, and comparing a given text with a corpus to identify matching words and calculate their frequency scores.
Procedure:
1) Open Anaconda Navigator.
2) Click on Launch under Jupyter Notebook.
3) Once Jupyter Notebook opens in the browser, create a new notebook by selecting New -> Python 3.
4) Install the necessary libraries (e.g., nltk, spacy).
5) After completing your analysis, save your work: click File > Save and Checkpoint, or use the keyboard shortcut Ctrl + S.
6) Export the notebook (optional): to share the notebook or convert it to another format (such as PDF or HTML), click File > Download as and select the format you wish to export to (e.g., PDF, HTML, Markdown).
7) Shut down Jupyter Notebook: close the Jupyter Notebook tab in your browser, or press Ctrl + C at the command line to stop the server.
Theory:
1. Exploring Various NLTK Corpora (Brown, Inaugural, Reuters, UDHR)
Corpora are large collections of text, often used for linguistic analysis or as training data for NLP models. NLTK provides access to several built-in corpora, each serving a different purpose. Let's take a closer look at four common corpora:
- Brown Corpus: The Brown Corpus is one of the most famous corpora, consisting of texts from various genres such as news, fiction, and academic writing. It contains over 1 million words categorized by genre.
- Inaugural Corpus: The Inaugural Corpus contains the presidential inaugural addresses of the United States. It is useful for analyzing political speech, trends over time, or stylistic changes in political rhetoric.
- Reuters Corpus: The Reuters Corpus is a collection of news documents, typically used for tasks like text classification and topic modeling.
- UDHR (Universal Declaration of Human Rights): The UDHR Corpus contains translations of the Universal Declaration of Human Rights in multiple languages. It is useful for linguistic studies and multilingual text processing.
2. Creating Custom Corpora (Plaintext and Categorical)
Custom corpora can be created from either plaintext data (simple text files) or categorized data (files belonging to predefined categories).
- Plaintext Corpus: NLTK provides a PlaintextCorpusReader that can read plain text files and treat them as a corpus. Suppose you have a directory containing text files; you can create a custom corpus by placing these files in a folder and pointing PlaintextCorpusReader at it.
- Categorical Corpus: You can organize your corpus into categories by placing texts into subdirectories, one for each category. NLTK's CategorizedPlaintextCorpusReader handles this.
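Both readers can be demonstrated end to end. A minimal sketch using throwaway directories and made-up file contents:

```python
import os
import tempfile
from nltk.corpus.reader import PlaintextCorpusReader, CategorizedPlaintextCorpusReader

# Build a small throwaway plaintext corpus (file names and texts are made up)
root = tempfile.mkdtemp()
with open(os.path.join(root, 'doc1.txt'), 'w') as f:
    f.write('NLTK makes corpus handling easy.')
with open(os.path.join(root, 'doc2.txt'), 'w') as f:
    f.write('Custom corpora are built from plain files.')

corpus = PlaintextCorpusReader(root, r'.*\.txt')   # every .txt file is one document
print(corpus.fileids())
print(list(corpus.words('doc1.txt')))

# A categorical corpus: one subdirectory per category
cat_root = tempfile.mkdtemp()
os.makedirs(os.path.join(cat_root, 'news'))
with open(os.path.join(cat_root, 'news', 'story.txt'), 'w') as f:
    f.write('Markets rose sharply today.')
cat_corpus = CategorizedPlaintextCorpusReader(
    cat_root, r'.*\.txt', cat_pattern=r'(\w+)/.*')  # category = directory name
print(cat_corpus.categories())
```

With cat_pattern, the category of each file is taken from its subdirectory name, so cat_corpus.words(categories='news') retrieves only the news texts.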
3. Studying Conditional Frequency Distributions
Conditional Frequency Distributions (CFDs) allow you to analyze how often certain words appear conditioned on some other attribute, such as a category. For example, you can count the frequency of certain words in different genres (categories) of the Brown Corpus.
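A CFD is built from (condition, event) pairs. A minimal sketch on a hand-made toy sample of genre/word pairs (the same construction works with pairs drawn from the Brown Corpus):

```python
from nltk import ConditionalFreqDist

# Toy (genre, word) pairs standing in for real corpus data
pairs = [
    ('news', 'market'), ('news', 'market'), ('news', 'election'),
    ('fiction', 'dragon'), ('fiction', 'market'),
]
cfd = ConditionalFreqDist(pairs)
print(cfd.conditions())         # the genres seen
print(cfd['news']['market'])    # how often 'market' occurs under 'news'
```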
4. Working with Tagged Corpora and Extracting the Most Frequent Noun Tags
Tagged corpora contain texts in which each word is labeled with its part-of-speech (POS) tag, such as "NN" (noun) or "VB" (verb). The Brown Corpus includes tagged versions of its sentences, which are useful for studying parts of speech.
5. Mapping Words to Properties (Frequency) Using Dictionaries
In NLP, it is common to map words to properties such as their frequency of occurrence. This can be done with a dictionary, where each key is a word and the value is its frequency.
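A minimal sketch of this mapping with a plain dictionary and a made-up sentence:

```python
# Map each word to its frequency of occurrence
text = "the cat sat on the mat the cat".split()
freq = {}
for w in text:
    freq[w] = freq.get(w, 0) + 1   # start at 0 for unseen words
print(freq)
```

The same idea is what nltk.FreqDist implements, with extras such as most_common().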
6. Implementing Rule-Based and Unigram Taggers
- Rule-based Tagger: A rule-based tagger uses predefined patterns (rules) to assign tags to words based on their shape or context.
- Unigram Tagger: A unigram tagger assigns to each word the tag that is most likely for that word in the training data. It can be trained using a tagged corpus.
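Both taggers can be sketched with NLTK's RegexpTagger and UnigramTagger. The regex rules and the tiny training corpus below are made up for illustration; in the lab, the unigram tagger would typically be trained on Brown tagged sentences:

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Rule-based tagger: word-shape patterns mapped to tags (a minimal rule set)
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$',  'VBD'),   # simple past
    (r'^\d+$',  'CD'),    # cardinal numbers
    (r'.*s$',   'NNS'),   # plural nouns
    (r'.*',     'NN'),    # default: noun
]
rule_tagger = RegexpTagger(patterns)
print(rule_tagger.tag('the dogs started running'.split()))

# Unigram tagger trained on a tiny hand-made tagged corpus,
# falling back to the rule-based tagger for unseen words
train = [[('the', 'DT'), ('dog', 'NN'), ('barked', 'VBD')],
         [('the', 'DT'), ('cat', 'NN'), ('slept', 'VBD')]]
uni_tagger = UnigramTagger(train, backoff=rule_tagger)
print(uni_tagger.tag(['the', 'dog', 'slept']))
```

The backoff chain is the usual design: the unigram tagger handles words it saw in training, and anything unseen falls through to the rules.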
7. Comparing a Given Text with a Corpus and Scoring Words by Frequency
This task involves comparing a given text (such as one written without spaces) against a corpus of words and scoring each word found based on its frequency in the corpus. This process helps identify the most likely words in the text based on their probability of occurrence.
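One simple way to realize this is greedy longest-match segmentation against a word list, reporting each recovered word's corpus frequency as its score. The word list and frequencies below are hypothetical; in practice they would come from a real corpus such as nltk.corpus.words or Brown:

```python
# Hypothetical corpus frequencies used as word scores
corpus_freq = {'this': 40, 'is': 50, 'a': 60, 'test': 30, 'the': 70}

def segment(text, freq):
    """Split space-free text by repeatedly taking the longest known word."""
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):     # longest candidate first
            if text[i:j] in freq:
                words.append(text[i:j])
                i = j
                break
        else:
            i += 1                             # skip a character with no match
    return words

found = segment('thisisatest', corpus_freq)
scores = {w: corpus_freq[w] for w in found}
print(found)    # words recovered from the space-free text
print(scores)   # each word's frequency score
```

Greedy matching is only a heuristic; a fuller solution would compare alternative segmentations by their total probability.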
Result:
In this exercise, we explored various NLP tasks using the NLTK library, focusing on information retrieval and text analysis. We examined multiple corpora – Brown, Inaugural, Reuters, and UDHR – learning to access and analyze text through methods such as words(), sents(), categories(), and raw(). We also created custom corpora using both plaintext and categorical methods. Conditional Frequency Distributions (CFDs) were used to analyze word occurrences across categories. We worked with tagged corpora and extracted the most frequent noun tags. Additionally, we mapped words to properties such as frequency using Python dictionaries, and implemented rule-based and unigram taggers for part-of-speech tagging. Finally, we developed a method to compare a given text with a corpus, identifying matching words and scoring them by frequency, demonstrating core NLP and information retrieval techniques.