
NLP Lab_BAI701_ Program 1

 

Experiment 1

Aim:

Design and implement a neural network for generating word embeddings for the words in a document corpus.

Step 1: Install Required Libraries

  • spaCy → used for splitting the text into words (called tokenization).
  • torch → used to build and train the neural network.
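A minimal setup sketch, assuming spaCy's small English pipeline en_core_web_sm is used for tokenization (the exact model name is an assumption):

```python
# In a terminal (assumed commands):
#   pip install spacy torch
#   python -m spacy download en_core_web_sm

import spacy
import torch

nlp = spacy.load("en_core_web_sm")  # small English pipeline used only for tokenization
```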

Step 2: Tokenize Corpus Using spaCy

  • We take a small paragraph (the corpus).
  • Convert it to lowercase (corpus.lower()).
  • Use spaCy to split it into clean words, removing punctuation and similar noise (see the sketch below).
  • Example: ["neural", "networks", "are", "useful", ...]
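A possible tokenization step; the corpus string below is only a stand-in for the actual paragraph used in the lab:

```python
# Example corpus; the real lab text may differ.
corpus = ("Neural networks are useful for learning word embeddings. "
          "Word embeddings capture the meaning of words as vectors.")

doc = nlp(corpus.lower())

# Keep only alphabetic tokens, dropping punctuation and whitespace.
tokens = [tok.text for tok in doc if tok.is_alpha]
print(tokens)  # ['neural', 'networks', 'are', 'useful', ...]
```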

Step 3: Prepare Vocabulary and Training Data (Skip-Gram)

  • This step creates the training data for our neural network.
  • For each word in the text, we take the words near it (its context) and pair them (see the sketch after this list).
  • Example: if the sentence is "I love machine learning" and the window size is 2, the pairs for the center word "love" are ("love", "I"), ("love", "machine"), and ("love", "learning").
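A sketch of the vocabulary and skip-gram pair construction, assuming a window size of 2 and the tokens list from Step 2:

```python
window = 2

# Vocabulary: map each unique word to an integer index.
vocab = sorted(set(tokens))
word_to_ix = {w: i for i, w in enumerate(vocab)}

# Skip-gram pairs: (center word, context word) for every word within the window.
pairs = []
for i, center in enumerate(tokens):
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if j != i:
            pairs.append((center, tokens[j]))

print(pairs[:4])  # e.g. [('neural', 'networks'), ('neural', 'are'), ...]
```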


Step 4: Build the Word2Vec Neural Network

The network learns word vectors by trying to predict the context words for each center word.
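One possible PyTorch model for this step: an embedding layer for the center word followed by a linear layer that scores every vocabulary word as a candidate context word. The 50-dimensional size matches the vector mentioned in Step 6; the class and variable names are assumptions.

```python
import torch.nn as nn

EMBED_DIM = 50  # matches the 50-dimensional vectors mentioned in Step 6

class Word2Vec(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embed_dim)  # center-word vectors
        self.output = nn.Linear(embed_dim, vocab_size)         # scores over the vocabulary

    def forward(self, center_ix):
        # Look up the center word's vector and score every word as its context.
        return self.output(self.embeddings(center_ix))

model = Word2Vec(len(vocab), EMBED_DIM)
```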

Step 5: Train the Network

  • We train the model for 100 rounds (epochs).
  • Each training step does the following (sketched after this list):
  1. Pick a pair: a center word and a context word.
  2. Predict the context word from the center word.
  3. Check how wrong the prediction is (the loss).
  4. Adjust the network to improve (backpropagation).
  • Over time, the model learns good word vectors.
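A minimal training loop under the same assumptions (100 epochs, cross-entropy loss, Adam optimizer; the learning rate is a guess):

```python
import torch.optim as optim

loss_fn = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

for epoch in range(100):
    total_loss = 0.0
    for center, context in pairs:
        center_ix = torch.tensor([word_to_ix[center]])
        context_ix = torch.tensor([word_to_ix[context]])

        scores = model(center_ix)            # 1. predict context from center
        loss = loss_fn(scores, context_ix)   # 2. how wrong is the prediction?

        optimizer.zero_grad()
        loss.backward()                      # 3. backpropagation
        optimizer.step()                     # 4. adjust the weights
        total_loss += loss.item()

    if (epoch + 1) % 20 == 0:
        print(f"Epoch {epoch + 1}, loss: {total_loss:.4f}")
```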


Step 6: Get Word Embeddings

  • After training, we can get the embedding vector for any word (a possible helper is sketched below).
  • For example, get_embedding("neural") returns a 50-dimensional vector.
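A possible get_embedding helper, reusing the trained model and word_to_ix from the earlier steps:

```python
def get_embedding(word):
    """Return the learned embedding vector for a word in the vocabulary."""
    ix = torch.tensor([word_to_ix[word]])
    return model.embeddings(ix).detach().squeeze(0)

vec = get_embedding("neural")
print(vec.shape)  # torch.Size([50])
```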

These vectors can later be:

  • Used to measure word similarity
  • Fed into chatbots, classifiers, or text models
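For instance, word similarity can be measured with cosine similarity between two learned vectors (assuming both words appear in the corpus):

```python
import torch.nn.functional as F

sim = F.cosine_similarity(get_embedding("neural"), get_embedding("networks"), dim=0)
print(f"similarity(neural, networks) = {sim.item():.3f}")
```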

 