
NLP Lab Program 6-BAI601

 

Experiment 6:

Demonstrate the following using an appropriate programming tool, illustrating the use of information retrieval in NLP:


● Study the various corpora – Brown, Inaugural, Reuters, UDHR – with various methods like fileids, raw, words, sents, categories

● Create and use your own corpora (plaintext, categorical)

● Study Conditional frequency distributions

● Study of tagged corpora with methods like tagged_sents, tagged_words

● Write a program to find the most frequent noun tags

● Map Words to Properties Using Python Dictionaries

● Study Rule based tagger, Unigram Tagger

Find the different words in a given plain text without any spaces by comparing the text with a given corpus of words. Also find the score of each word.

Aim

The aim of this lab exercise is to explore and demonstrate various Natural Language Processing (NLP) techniques using the NLTK library. This includes studying and analyzing different corpora such as Brown, Inaugural, Reuters, and UDHR using methods like words(), sents(), categories(), and raw(). Additionally, the exercise involves creating custom corpora (both plaintext and categorical), studying Conditional Frequency Distributions (CFD), and working with tagged corpora to extract the most frequent noun tags. The lab also focuses on mapping words to properties using Python dictionaries, implementing rule-based and unigram taggers, and comparing a given text with a corpus to identify matching words and calculate their frequencies.

Procedure:

1)    Open Anaconda Navigator.

2)   Click on Launch under Jupyter Notebook.

3)   Once Jupyter Notebook opens in the browser, create a new notebook by selecting New -> Python 3.

4)   Install necessary libraries (e.g., nltk, spacy).

5)   After completing your analysis, make sure to save your work. Click on File > Save and Checkpoint or use the keyboard shortcut Ctrl + S to save your Jupyter notebook.

6)   Export Notebook (Optional)

 

If you'd like to share your Jupyter notebook with others or convert it into another format (like PDF or HTML), you can do so by:

File > Download as and then select the format you wish to export to (e.g., PDF, HTML, Markdown).

7)   Shut Down Jupyter Notebook

To shut down your notebook server, simply close the Jupyter Notebook tab in your browser, or from the command line, press Ctrl + C to stop the server.

Theory:

1. Exploring Various NLTK Corpora (Brown, Inaugural, Reuters, UDHR)

Corpora are large collections of text, often used for linguistic analysis or as training data for NLP models. NLTK provides access to several built-in corpora, each serving different purposes. Let's take a deeper look at four common corpora:

• Brown Corpus: The Brown Corpus is one of the most famous corpora, consisting of texts from various genres, such as news, fiction, academic texts, and more. It contains over 1 million words categorized by genre.

• Inaugural Corpus: The Inaugural Corpus contains the presidential inaugural addresses of the United States. It's useful for analyzing political speech, trends over time, or stylistic changes in political rhetoric.

• Reuters Corpus: The Reuters Corpus is a collection of news documents, typically used for tasks like text classification and topic modeling.

• UDHR (Universal Declaration of Human Rights): The UDHR Corpus contains translations of the Universal Declaration of Human Rights in multiple languages. It's useful for linguistic studies or multilingual text processing.

2. Creating Custom Corpora (Plaintext and Categorical)

Custom corpora can be created using either plaintext data (simple text files) or categorized data (files belonging to predefined categories).

  • Plaintext Corpus: NLTK provides a PlaintextCorpusReader that can read plain text files and treat them as a corpus. Suppose you have a directory containing text files; you can create a custom corpus by placing these files in a folder and pointing PlaintextCorpusReader at it.

  • Categorical Corpus: You can organize your corpus into categories by placing texts into subdirectories, one for each category. NLTK's CategorizedPlaintextCorpusReader handles this.
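A minimal sketch of both readers, using a throw-away temporary directory and hypothetical file names (sports/match.txt, politics/vote.txt) in place of your real data:

```python
import os
import tempfile
from nltk.corpus.reader import PlaintextCorpusReader, CategorizedPlaintextCorpusReader

# Build a small corpus directory with one subdirectory per category
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sports"))
os.makedirs(os.path.join(root, "politics"))
with open(os.path.join(root, "sports", "match.txt"), "w") as f:
    f.write("The home team won the final match.")
with open(os.path.join(root, "politics", "vote.txt"), "w") as f:
    f.write("The parliament passed the new bill.")

# Plaintext corpus: every .txt file under root becomes part of the corpus
plain = PlaintextCorpusReader(root, r".*\.txt")
print(plain.fileids())
print(plain.words("sports/match.txt"))

# Categorical corpus: the first path component is treated as the category
cat = CategorizedPlaintextCorpusReader(root, r".*\.txt",
                                       cat_pattern=r"(\w+)/.*")
print(cat.categories())                  # ['politics', 'sports']
print(cat.words(categories="sports"))
```

The cat_pattern regex extracts the category from each file id, so the same directory doubles as both a plaintext and a categorical corpus.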

3. Studying Conditional Frequency Distributions

  • Conditional Frequency Distributions (CFDs) allow you to analyze how often certain words appear conditioned on some other attribute, such as a category. For example, you can count the frequency of certain words in different genres (or categories) from the Brown Corpus.
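A self-contained sketch, using hand-made (condition, word) pairs in place of a real corpus; with the Brown Corpus you would instead generate one (genre, word) pair per word of each genre:

```python
from nltk.probability import ConditionalFreqDist

# Hypothetical (category, word) pairs standing in for corpus data
pairs = [
    ("news", "the"), ("news", "election"), ("news", "the"),
    ("fiction", "the"), ("fiction", "dragon"),
]
cfd = ConditionalFreqDist(pairs)

print(cfd.conditions())            # the categories: 'news', 'fiction'
print(cfd["news"]["the"])          # 2 -- 'the' occurs twice under 'news'
print(cfd["fiction"].most_common())
```

Each condition maps to its own frequency distribution, so cfd["news"] behaves exactly like a FreqDist restricted to that category.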

4. Working with Tagged Corpora and Extracting Most Frequent Noun Tags

  • Tagged corpora contain texts where each word is labeled with its part-of-speech (POS) tag, such as "NN" (noun), "VB" (verb), and so on. The Brown Corpus includes tagged versions of its sentences, which can be useful for studying parts of speech.

5. Mapping Words to Properties (Frequency) Using Dictionaries

  • In NLP, it's common to map words to properties such as their frequency of occurrence. This can be done using a dictionary, where each key is a word and the value is its frequency.
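For example, a plain Python dictionary can accumulate word frequencies over a toy sentence:

```python
text = "the cat sat on the mat the cat slept"

# Map each word (key) to its frequency of occurrence (value)
freq = {}
for word in text.split():
    freq[word] = freq.get(word, 0) + 1

print(freq)         # {'the': 3, 'cat': 2, 'sat': 1, 'on': 1, 'mat': 1, 'slept': 1}
print(freq["the"])  # 3
```

The same pattern scales to whole corpora, and collections.Counter offers the identical mapping with most_common() added.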

6. Implementing Rule-based and Unigram Taggers

  • Rule-based Tagger: A rule-based tagger uses predefined patterns (rules) to assign tags to words based on their shape or context.

  • Unigram Tagger: A unigram tagger assigns tags based on the most likely tag for each word, using training data. It can be trained using a tagged corpus.
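Both taggers can be sketched with NLTK's RegexpTagger and UnigramTagger; the patterns and the tiny training sentences below are illustrative, not from the lab:

```python
from nltk.tag import RegexpTagger, UnigramTagger

# Rule-based tagger: regex patterns matched against the word's shape
patterns = [
    (r".*ing$", "VBG"),   # gerunds
    (r".*ed$",  "VBD"),   # past tense
    (r"^\d+$",  "CD"),    # cardinal numbers
    (r".*",     "NN"),    # default: tag everything else as noun
]
rule_tagger = RegexpTagger(patterns)
print(rule_tagger.tag(["running", "jumped", "42", "dog"]))
# [('running', 'VBG'), ('jumped', 'VBD'), ('42', 'CD'), ('dog', 'NN')]

# Unigram tagger: learns the most frequent tag per word from tagged data
train = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
         [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]]
uni_tagger = UnigramTagger(train, backoff=rule_tagger)
print(uni_tagger.tag(["the", "dog", "sleeps", "loudly"]))
```

The backoff argument chains the two: words unseen in training fall through to the rule-based tagger instead of being tagged None. In practice you would train on a tagged corpus such as brown.tagged_sents().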

7. Comparing a Given Text with a Corpus and Scoring Words by Frequency

  • This task involves comparing a given text (like one without spaces) against a corpus of words and scoring each word based on its frequency in the corpus. This process can help identify the most likely words in the text based on their probability of occurrence.
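One possible approach is a greedy longest-match segmentation, scored against a hypothetical corpus frequency table (the vocabulary and counts below are made up for illustration):

```python
# Hypothetical corpus: word -> frequency count
corpus_freq = {"this": 40, "is": 60, "a": 80, "test": 25}

def segment(text, freq):
    """Greedy longest-match segmentation against the corpus vocabulary."""
    words, i = [], 0
    while i < len(text):
        # Try the longest possible substring starting at position i
        for j in range(len(text), i, -1):
            if text[i:j] in freq:
                words.append(text[i:j])
                i = j
                break
        else:
            i += 1  # skip a character that starts no known word
    return words

found = segment("thisisatest", corpus_freq)
scores = {w: corpus_freq[w] for w in found}
print(found)   # ['this', 'is', 'a', 'test']
print(scores)  # {'this': 40, 'is': 60, 'a': 80, 'test': 25}
```

Greedy longest-match is simple but can mis-segment when a long word overlaps a better split; a dynamic-programming search that maximises total score handles those cases.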

 

 

Result:

In this exercise, we explored various NLP tasks using the NLTK library, focusing on information retrieval and text analysis. We examined multiple corpora like Brown, Inaugural, Reuters, and UDHR, learning to access and analyze text through methods such as words(), sents(), categories(), and raw(). We also created custom corpora using both plaintext and categorical methods. Conditional Frequency Distributions (CFDs) were used to analyze word occurrences across categories. We worked with tagged corpora and extracted the most frequent noun tags. Additionally, we mapped words to properties like frequency using Python dictionaries, and implemented rule-based and unigram taggers for part-of-speech tagging. Finally, we developed a method to compare a given text with a corpus, identifying matching words and scoring them based on frequency, demonstrating core NLP and information retrieval techniques.
