A comprehensive repository of 10 practical NLP implementations covering everything from basic text processing to advanced deep learning models.
Quick Start • Installation • Practicals • Contributing • License
| Field | Details |
|---|---|
| Name | PREXIT JOSHI |
| Roll Number | UE233118 |
| Branch | Computer Science and Engineering (CSE) |
| Institute | University Institute of Engineering and Technology, Punjab University (UIET, PU) |
| Email | 📧 prexitjoshi@gmail.com |
| GitHub | @intronep666 |
Get started in minutes:
# 1. Clone repository
git clone https://github.com/intronep666/Natural-Language-Processing.git
cd Natural-Language-Processing
# 2. Install dependencies
pip install -r requirements.txt
# 3. Download NLP data
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt')"
# 4. Launch Jupyter
jupyter notebook
# 5. Open and run practicals!
For detailed setup instructions, see GETTING_STARTED.md.
- Python 3.8 or higher
- pip/conda
- ~2GB disk space (for models)
- Virtual environment (recommended)
# Create virtual environment
python -m venv nlp_env
source nlp_env/bin/activate # On Windows: nlp_env\Scripts\activate
# Install all dependencies
pip install -r requirements.txt
# Download spaCy model
python -m spacy download en_core_web_sm
See GETTING_STARTED.md for detailed setup, troubleshooting, and next steps.
- What is NLP?
- Core Concepts
- NLP Processing Pipeline
- Key Techniques
- Applications
- Challenges
- Tools & Libraries
- Practical Implementations
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and linguistics that focuses on enabling computers to understand, interpret, and generate human language in a meaningful and useful way. It bridges the gap between human communication and computer understanding.
- 💬 Communication Bridge: Enables machines to understand human language naturally
- 🔍 Data Extraction: Extract valuable insights from unstructured text data
- 🤖 Automation: Automate language-based tasks at scale
- 📊 Business Intelligence: Analyze customer feedback, reviews, and sentiment
- 🌐 Global Reach: Break language barriers through translation
┌─────────────────────────────────────────┐
│ NLP Core Objectives │
├─────────────────────────────────────────┤
│ 1. Understanding (Comprehension) │
│ 2. Generation (Producing text) │
│ 3. Translation (Language to language) │
│ 4. Analysis (Extracting information) │
│ 5. Classification (Categorizing text) │
└─────────────────────────────────────────┘
Breaking down text into smaller units (words, sentences, or subwords).
Example:
Text: "Natural Language Processing is amazing!"
Tokens: ["Natural", "Language", "Processing", "is", "amazing", "!"]
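A minimal sketch with spaCy (assuming the `en_core_web_sm` model from the Quick Start is installed) that reproduces the tokens above:

```python
# Tokenization sketch using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is amazing!")
print([token.text for token in doc])
# ['Natural', 'Language', 'Processing', 'is', 'amazing', '!']
```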
| Stemming | Lemmatization |
|---|---|
| Removes suffixes mechanically | Uses vocabulary and morphology |
| Fast but may oversimplify | Accurate but slower |
| "running", "runs" → "run" | "running", "runs" → "run" |
Common words (the, is, and, etc.) that are often removed for efficiency.
Example:
Original: "The cat is on the mat"
After removal: "cat mat"
Labeling each word with its grammatical role.
The → DET (Determiner)
cat → NN (Noun)
runs → VB (Verb)
quickly → RB (Adverb)
Identifying and classifying named entities in text.
Text: "Apple Inc. is located in Cupertino, California"
Entities:
- "Apple Inc." → Organization
- "Cupertino" → Location
- "California" → Location
Understanding grammatical relationships between words.
"The cat chased the mouse"
↓
nsubj ↓ obj
subject: "cat"
action: "chased"
object: "mouse"
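These relationships can be read off spaCy's dependency parse; a sketch:

```python
# Dependency parsing sketch: print each token's relation and head word.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse")
for token in doc:
    print(f"{token.text:<7} {token.dep_:<6} head: {token.head.text}")
# cat     nsubj  head: chased
# chased  ROOT   head: chased
# mouse   dobj   head: chased   (the two determiners print as well)
```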
┌──────────────────┐
│ Raw Text │
└────────┬─────────┘
↓
┌──────────────────┐
│ Text Cleaning │ (Remove special characters, lowercasing)
└────────┬─────────┘
↓
┌──────────────────┐
│ Tokenization │ (Break into tokens)
└────────┬─────────┘
↓
┌──────────────────┐
│ Normalization │ (Stemming/Lemmatization)
└────────┬─────────┘
↓
┌──────────────────┐
│ Stop Word │ (Remove common words)
│ Removal │
└────────┬─────────┘
↓
┌──────────────────┐
│ Feature │ (Convert to numerical vectors)
│ Extraction │
└────────┬─────────┘
↓
┌──────────────────┐
│ ML/DL Model │ (Classification, clustering, etc.)
└────────┬─────────┘
↓
┌──────────────────┐
│ Prediction/ │ (Output results)
│ Analysis │
└──────────────────┘
Converts text into a vector of word counts, ignoring word order.
Sentence: "I love NLP, NLP is great"
BoW: {
"I": 1,
"love": 1,
"NLP": 2,
"is": 1,
"great": 1
}
Weighs words based on their importance in a document and corpus.
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Where:
- TF = frequency of term in document
- IDF = log(total documents / documents containing term)
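A worked sketch of this formula in plain Python (natural log, no smoothing), on a toy three-document corpus:

```python
# Plain-Python TF-IDF following the formula above (illustrative only).
import math

docs = [
    ["NLP", "is", "great"],
    ["NLP", "is", "fun"],
    ["Python", "is", "popular"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                  # term frequency in this document
    df = sum(term in d for d in corpus)   # documents containing the term
    idf = math.log(len(corpus) / df)      # inverse document frequency
    return tf * idf

print(tf_idf("NLP", docs[0], docs))   # 1 * ln(3/2) ≈ 0.41  (distinctive term)
print(tf_idf("is", docs[0], docs))    # 1 * ln(3/3) = 0.0   (appears everywhere)
```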
Sequences of N consecutive words.
Text: "Natural Language Processing"
Unigrams (1-gram):
["Natural"], ["Language"], ["Processing"]
Bigrams (2-gram):
["Natural", "Language"], ["Language", "Processing"]
Trigrams (3-gram):
["Natural", "Language", "Processing"]
- Captures semantic similarity between words
- Two models: CBOW (Continuous Bag of Words) and Skip-gram
- Output: Dense vector for each word
- Count-based embedding using word co-occurrence matrix
- Combines global statistics with local context
- Extension of Word2Vec
- Treats words as bags of character n-grams
- Can generate vectors for out-of-vocabulary words
- Contextual embeddings based on transformer architecture
- Understands context from both directions
- State-of-the-art for many NLP tasks
Determining the emotional tone or sentiment of text.
Positive Sentiment: "This movie is absolutely amazing!"
Negative Sentiment: "I hate waiting in long lines"
Neutral Sentiment: "The temperature is 25 degrees"
Assigning documents to predefined categories.
Common Algorithms:
- Naïve Bayes (probabilistic)
- Support Vector Machine (SVM)
- Neural Networks (Deep Learning)
- LSTM (Long Short-Term Memory)
Grouping similar documents without predefined labels.
Popular Method: K-Means
- Partitions documents into K clusters
- Minimizes within-cluster distance
- Maximizes between-cluster distance
- Long Short-Term Memory networks
- Handle sequential data (text)
- Maintain long-term dependencies
- Excellent for sentiment analysis and text generation
- Siri, Alexa, Google Assistant
- Customer support chatbots
- Conversational AI systems
- Filtering spam messages
- Identifying phishing emails
- Priority inbox management
- Google Translate
- Breaking language barriers
- Real-time translation
- Extract structured data from unstructured text
- Resume parsing
- Document analysis
- Monitoring brand reputation
- Analyzing customer reviews
- Social media monitoring
- Market research
- Search engines
- FAQ automation
- Knowledge base systems
- Search engines (Google, Bing)
- Document ranking
- Semantic search
- Person/Place/Organization identification
- Resume screening
- News article analysis
- Autocomplete (Gmail, predictive text)
- Content generation
- Paraphrasing tools
- News categorization
- Document organization
- Topic modeling
- Lexical Ambiguity: Words with multiple meanings
- "bank" (financial institution vs. river bank)
- Syntactic Ambiguity: Multiple grammatical interpretations
- "I saw the man with the telescope"
- Machines struggle with understanding nuanced meanings
- Sarcasm, idioms, and cultural references are difficult
- Different languages have different structures
- Dialects, slang, and informal speech
- Misspellings and typos
- Limited labeled data for training
- Low-resource languages
- Domain-specific terminology
- Understanding relationships between distant words
- Solved partially by LSTM and Transformers
- Training data may contain biases
- Results in biased models and unfair predictions
- Large language models require significant resources
- Training and inference can be expensive
| Library | Purpose | Features |
|---|---|---|
| NLTK | Natural Language Toolkit | Tokenization, POS tagging, stemming, NER |
| spaCy | Industrial-strength NLP | Fast, efficient, production-ready |
| TextBlob | Simple text processing | Sentiment analysis, POS tagging |
| Gensim | Topic modeling & word embeddings | Word2Vec, Doc2Vec, FastText |
| Transformers | Pre-trained models | BERT, GPT, T5 |
| scikit-learn | Machine learning | Text classification, clustering |
| TensorFlow/PyTorch | Deep learning frameworks | Neural networks, LSTM |
| Dataset | Purpose | Size |
|---|---|---|
| 20 Newsgroups | Text classification | ~19,000 documents |
| Movie Reviews | Sentiment analysis | 1,000 positive + 1,000 negative |
| Wikipedia Corpus | General knowledge | Millions of articles |
| Common Crawl | Web data | Petabytes of text |
| GLUE | Model evaluation | Multiple benchmark tasks |
📋 Overview
A complete end-to-end NLP pipeline demonstrating all fundamental linguistic analysis techniques using two powerful libraries: spaCy and NLTK.
🎯 Objectives
- Understand complete text processing workflow
- Learn multiple NLP techniques in one integrated example
- Perform comprehensive linguistic analysis on sample text
📚 Key Topics Covered
| Technique | Description | Library |
|---|---|---|
| Tokenization | Breaking text into individual words and sentences | spaCy |
| POS Tagging | Assigning grammatical roles to words | spaCy |
| Lemmatization | Converting words to base form using vocabulary | spaCy |
| Stemming | Reducing words to root form mechanically | NLTK |
| Stop Word Removal | Filtering common, less meaningful words | spaCy |
| Noun Phrase Chunking | Identifying meaningful noun phrases | spaCy |
| Dependency Parsing | Understanding grammatical relationships | spaCy |
| Named Entity Recognition | Identifying persons, places, organizations | spaCy |
💡 Practical Example
Input: "On May 13, 2025, the Israeli Air Force executed strikes on Gaza's European Hospital"
Processing:
- Tokenization: ["On", "May", "13", ",", "2025", ...]
- POS Tags: DET, PROPN, NUM, PUNCT, NUM, ...
- NER: "May" → DATE, "Israeli Air Force" → ORG, "Gaza" → LOC, "Hospital" → ORG
- Lemmatization: "executed" → "execute"
🎓 Learning Outcomes
- Master spaCy and NLTK libraries
- Perform complete linguistic analysis
- Understand relationship between different NLP tasks
- Handle real-world text data
📋 Overview
Explores n-gram models, a foundational technique in NLP for understanding word sequences, calculating probabilities, and predicting word patterns.
🎯 Objectives
- Understand tokenization and punctuation removal
- Generate n-grams of varying sizes
- Calculate frequency and probability distributions
📚 Key Topics Covered
| Concept | Definition | Use Case |
|---|---|---|
| Unigrams (1-grams) | Individual words | Word frequency analysis |
| Bigrams (2-grams) | Two consecutive words | Word associations |
| Trigrams (3-grams) | Three consecutive words | Phrase patterns |
| Frequency Counting | How often each n-gram appears | Statistical analysis |
| Probability Calculation | Relative frequency of n-grams | Language modeling |
💡 Practical Example
Text: "NLP is amazing. It is widely used in AI applications"
Unigrams (unique types): [NLP, is, amazing, It, widely, used, in, AI, applications]
Frequency: {is: 2, NLP: 1, amazing: 1, ...}
Bigrams: [(NLP, is), (is, amazing), (is, widely), (in, AI), ...]
Probability of "is": 2/10 = 0.2 (2 occurrences out of 10 tokens)
Trigrams: [(NLP, is, amazing), (is, amazing, It), ...]
🔢 Mathematical Foundation
Unigram Probability: P(w) = Count(w) / Total_words
Bigram Probability: P(w2|w1) = Count(w1, w2) / Count(w1)
Language Model: P(w1, w2, w3) = P(w1) × P(w2|w1) × P(w3|w1,w2)
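A small sketch computing these statistics with `collections.Counter` (punctuation already removed, matching the example above):

```python
# Unigram and bigram probabilities for the example sentence.
from collections import Counter

text = "NLP is amazing It is widely used in AI applications"
words = text.split()                               # 10 tokens, "is" appears twice
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

p_is = unigrams["is"] / len(words)                              # 2/10 = 0.2
p_amazing_given_is = bigrams[("is", "amazing")] / unigrams["is"]  # 1/2 = 0.5
print(p_is, p_amazing_given_is)
```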
🎓 Learning Outcomes
- Extract and analyze n-grams from text
- Calculate statistical probabilities
- Understand language modeling foundations
- Prepare for more advanced NLP techniques
📋 Overview
Demonstrates two fundamental feature extraction techniques that convert text into numerical vectors suitable for machine learning algorithms.
🎯 Objectives
- Convert text documents into numerical feature vectors
- Understand importance weighting mechanisms
- Compare simple frequency with intelligent weighting
📚 Key Topics Covered
- Simple word count approach
- Represents how often a word appears in a document
- Formula:
TF(t, d) = frequency of term t in document d
Example TF Matrix:
Document 1: "NLP is amazing, NLP is great"
NLP is amazing great
Doc 1 2 2 1 1
Document 2: "Machine learning is powerful"
NLP is learning powerful
Doc 2 0 1 1 1
- Weights terms based on importance across documents
- Reduces weight of common words
- Highlights distinctive terms
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
IDF(t) = log(Total_Documents / Documents_containing_t)
Comparison Example:
Word "is" (appears in most documents):
- TF: 2 (high count)
- IDF: log(4/3) ≈ 0.29 (low importance)
- TF-IDF: 2 × 0.29 ≈ 0.58 (low weight)
Word "NLP" (appears in few documents):
- TF: 2 (high count)
- IDF: log(4/1) ≈ 1.39 (high importance)
- TF-IDF: 2 × 1.39 ≈ 2.78 (high weight) ✓
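In practice both representations come from scikit-learn; a minimal sketch comparing them (scikit-learn uses a smoothed IDF, so the exact weights differ from the hand computation above):

```python
# Bag-of-Words counts vs. TF-IDF weights with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "NLP is amazing NLP is great",
    "Machine learning is powerful",
]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)
print(counts.toarray())            # raw term frequencies
print(tfidf.toarray().round(2))    # "is" down-weighted, "nlp" stands out
```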
🎓 Learning Outcomes
- Convert text to numerical vectors
- Understand importance weighting
- Choose appropriate feature extraction method
- Prepare data for ML algorithms
📋 Overview
Comprehensive exploration of modern word embedding techniques that capture semantic and syntactic relationships between words.
🎯 Objectives
- Learn multiple word embedding approaches
- Understand semantic relationships
- Compare different embedding methods
📚 Key Topics Covered
- Two architectures: CBOW (Continuous Bag of Words) and Skip-gram
- Predicts words from context (Skip-gram) or context from word (CBOW)
- Vector size: 50-300 dimensions
- Limitation: Cannot handle out-of-vocabulary words
Example:
Word: "king"
Vector: [0.2, -0.4, 0.1, 0.5, -0.2, ...]
Similar words: ["queen", "prince", "emperor"]
Vector distances measure similarity
- Count-based approach using global word-word co-occurrence
- Combines global statistics with local context
- Generally more stable than Word2Vec
- Pre-trained models available (Wikipedia, Common Crawl)
Matrix Factorization:
X[i,j] = count of word j in context of word i
GloVe decomposes this matrix into embeddings
- Extension of Word2Vec
- Treats words as bags of character n-grams
- Advantage: Can generate vectors for out-of-vocabulary words
- Better for morphologically rich languages
Example (OOV handling):
Training vocabulary: ["running", "runner", "run"]
Unknown word: "runs" (not in training)
Word2Vec: Cannot create vector ✗
FastText: Builds the vector from character n-grams of "runs", e.g. ["<ru", "run", "uns", "ns>"] ✓
- Contextual embeddings (word meaning changes with context)
- Bidirectional: understands context from both directions
- Pre-trained on massive corpus
- State-of-the-art for many tasks
Contextual Example:
Sentence 1: "I saw the bank by the river"
Sentence 2: "I deposited money at the bank"
Word: "bank"
- Embedding 1: Vector representing financial institution
- Embedding 2: Vector representing river bank
BERT generates different vectors based on context! ✓
Comparison Table:
| Method | Type | OOV Handling | Speed | Context |
|---|---|---|---|---|
| Word2Vec | Predictive | ✗ | Fast | Static |
| GloVe | Count-based | ✗ | Medium | Static |
| FastText | Hybrid | ✓ | Medium | Static |
| BERT | Neural | ✓ | Slow | Dynamic |
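A Gensim sketch illustrating the OOV difference between Word2Vec and FastText (a toy corpus for demonstration only; real models are trained on far more text):

```python
# Word2Vec vs. FastText on out-of-vocabulary words (Gensim 4.x API).
from gensim.models import Word2Vec, FastText

corpus = [["running", "runner", "run"], ["the", "runner", "is", "running"]]
w2v = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50)
ft = FastText(corpus, vector_size=50, min_count=1, epochs=50)

print("runs" in w2v.wv.key_to_index)   # False: Word2Vec has no vector for "runs"
print(ft.wv["runs"][:5])               # FastText composes one from character n-grams
```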
🎓 Learning Outcomes
- Train and use Word2Vec models
- Utilize pre-trained GloVe embeddings
- Handle OOV words with FastText
- Implement contextual embeddings with BERT
- Choose embeddings based on task requirements
📋 Overview
Implements two classic supervised learning algorithms for text categorization using the 20 Newsgroups dataset.
🎯 Objectives
- Build text classification models
- Compare probabilistic vs. geometric approaches
- Evaluate model performance with multiple metrics
📚 Key Topics Covered
Raw Text
↓
TF-IDF Vectorization (convert to numerical features)
↓
Train/Test Split (prepare data)
↓
Model Training (Naïve Bayes or SVM)
↓
Prediction & Evaluation
- Probabilistic classifier based on Bayes' Theorem
- Assumes feature independence (Naïve assumption)
- Fast training and prediction
- Works well with text (TF-IDF vectors)
Formula:
P(Category|Document) = P(Document|Category) × P(Category) / P(Document)
For text: P(category|words) ∝ P(category) × ∏ P(word|category)
Advantages:
- ✓ Fast training
- ✓ Good with high-dimensional data
- ✓ Effective for text
- ✓ Handles missing values well
Disadvantages:
- ✗ Independence assumption too strong
- ✗ May underestimate probabilities
- Geometric classifier finding optimal hyperplane
- Maximizes margin between classes
- Kernel trick for non-linear problems
- Linear kernel works well for text (TF-IDF)
Concept:
┌─────────────────────────────┐
│ Feature Space │
│ │
│ ● Class 1 (Spam) │
│ ● ● ════════ │ Optimal
│ ● ● ║ Margin │ Hyperplane
│ ● ║ │
│ ────────────────║──────── │
│ ║ Margin ║ │
│ ○ ○ ║ ○ │
│ ○ ○ ════════ │
│ ○ ○ Class 0 (Ham) │
└─────────────────────────────┘
Advantages:
- ✓ Effective in high dimensions
- ✓ Memory efficient
- ✓ Versatile (different kernels)
- ✓ Handles complex boundaries
Disadvantages:
- ✗ Slower training on large datasets
- ✗ Requires careful kernel selection
- ✗ Hard to interpret
- 18,846 documents
- 20 categories
- Real-world news articles
- Imbalanced distribution
Categories (sample):
alt.atheism, soc.religion.christian, comp.graphics, sci.med
Evaluation Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) (of predicted positive, how many correct)
Recall = TP / (TP + FN) (of actual positive, how many caught)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
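A sketch of the full pipeline with scikit-learn (downloads the 20 Newsgroups dataset on first run; the printed accuracies are whatever the run produces, not fixed numbers):

```python
# TF-IDF + Naive Bayes / Linear SVM on 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train.data, train.target)
    print(name, accuracy_score(test.target, model.predict(test.data)))
```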
🎓 Learning Outcomes
- Implement classification pipelines
- Train Naïve Bayes and SVM classifiers
- Evaluate models with multiple metrics
- Compare algorithm performance
- Make informed algorithm choices
📋 Overview
Unsupervised learning approach to automatically group similar documents into clusters based on their content.
🎯 Objectives
- Understand unsupervised learning
- Apply clustering to text documents
- Analyze cluster characteristics
📚 Key Topics Covered
An iterative algorithm that partitions documents into K clusters:
Algorithm Steps:
Step 1: Choose K (number of clusters)
↓
Step 2: Randomly initialize K centroids
↓
Step 3: Assign each document to nearest centroid (Euclidean distance)
↓
Step 4: Recalculate centroids as mean of assigned points
↓
Step 5: Repeat steps 3-4 until convergence
↓
Step 6: Analyze clusters
Visualization:
Iteration 1: Iteration 2: Final:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ ★ ●● │ │ ★ ●● │ │ ★ ●● │
│ ●●● ● ● │ │●●●● ● ● │ │●●●● ● ● │
│ ●● ▲ ● │→ │●● ▲ ● │→ │●● ▲ ● │
│ ★ │ │ ★ │ │ ★ │
└──────────────┘ └──────────────┘ └──────────────┘
Initial Converging Final Clusters
Documents → TF-IDF Vectorization → K-Means → Cluster Analysis
Example Output:
Documents:
1. "Machine learning provides systems ability to learn"
2. "Artificial intelligence and ML are related"
3. "Cricket is popular sport in India"
4. "Indian cricket team won match"
TF-IDF Vector Space (sparse)
↓
K-Means with K=2
↓
Cluster 0: [Doc 1, Doc 2] - ML/AI related
Cluster 1: [Doc 3, Doc 4] - Sports related
- Choosing K: How many clusters?
  - Elbow method
  - Silhouette analysis
  - Domain knowledge
- Convergence: May find local optima
  - Multiple runs with different initializations
  - Select best result
- Scalability: Slow on very large datasets
  - Mini-batch K-Means
  - Approximate methods
Top Terms per Cluster:
Cluster 0: ["machine", "learning", "model", "data", "algorithm"]
→ ML/AI cluster
Cluster 1: ["cricket", "team", "match", "player", "game"]
→ Sports cluster
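A compact sketch of the whole flow on the four example documents (cluster numbering can swap between runs; `random_state` pins the initialization):

```python
# TF-IDF vectorization + K-Means clustering + top terms per cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Machine learning provides systems ability to learn",
    "Artificial intelligence and ML are related",
    "Cricket is popular sport in India",
    "Indian cricket team won match",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)                                  # e.g. [0 0 1 1]

terms = vec.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]             # highest-weight terms
    print(f"Cluster {i}:", [terms[t] for t in top])
```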
🎓 Learning Outcomes
- Implement K-Means clustering
- Vectorize text for clustering
- Determine optimal number of clusters
- Interpret and analyze clusters
- Understand unsupervised learning concepts
📋 Overview
Assigns grammatical roles (parts of speech) to each word, enabling syntactic and semantic analysis.
🎯 Objectives
- Learn POS tagging concepts
- Implement using NLTK
- Understand grammatical relationships
📚 Key Topics Covered
Common POS tags in English:
| Tag | Meaning | Examples |
|---|---|---|
| NN | Noun | cat, dog, house |
| VB | Verb | run, jump, eat |
| JJ | Adjective | beautiful, quick, tall |
| RB | Adverb | quickly, carefully, very |
| DET | Determiner | the, a, an |
| IN | Preposition | in, on, at, by |
| PRP | Pronoun | he, she, it, they |
| CD | Cardinal Number | one, two, 42 |
Sentence: "The quick brown fox jumps over the lazy dog"
Words: [The quick brown fox jumps over the lazy dog]
│ │ │ │ │ │ │ │ │
Tags: [DET JJ JJ NN VB IN DET JJ NN]
- Rule-based: Hand-crafted linguistic rules
- Stochastic: Uses probabilistic models
- Neural: Deep learning approaches
- Hybrid: Combination of methods
Example Output:
Sentence: "Prexit submitted the practical on time"
Word POS Tag Description
─────────────────────────────────────────
Prexit NNP Proper Noun
submitted VBD Verb (past tense)
the DT Determiner
practical NN Noun
on IN Preposition
time NN Noun
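A minimal NLTK sketch (the tagger model must be downloaded once; on newer NLTK releases the resource is named `averaged_perceptron_tagger_eng`):

```python
# POS tagging sketch with NLTK.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model
tokens = nltk.word_tokenize("Prexit submitted the practical on time")
print(nltk.pos_tag(tokens))
# e.g. [('Prexit', 'NNP'), ('submitted', 'VBD'), ('the', 'DT'), ...]
# (exact tags can vary slightly with tagger version)
```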
- Information extraction
- Parsing and syntax analysis
- Named entity recognition (filter nouns)
- Spell checking (context-aware)
- Machine translation
- Speech recognition (disambiguation)
🎓 Learning Outcomes
- Understand linguistic grammatical concepts
- Implement POS tagging with NLTK
- Interpret POS tag sequences
- Prepare data for downstream NLP tasks
- Recognize word roles in sentences
📋 Overview
Introduces neural networks for NLP, specifically LSTM (Long Short-Term Memory) networks for sentiment classification.
🎯 Objectives
- Preprocess text for neural networks
- Build and train LSTM models
- Classify sentiment (positive/negative)
📚 Key Topics Covered
Text → Tokenization → Integer Sequences → Padding → Embedding → Neural Network
- Tokenization: Convert words to integers
Vocabulary: {love: 1, this: 2, hate: 3, bad: 4}
Text: "I love this"
Tokens: [1, 2] ("I" is not in the vocabulary, so it is skipped)
- Padding: Make all sequences same length
Original: [[1, 2], [3, 4, 5], [6]]
Padded: [[0, 1, 2],
[3, 4, 5],
[0, 0, 6]] (length=3)
- Embedding: Dense vector representation
Word: "love" (ID: 1)
Embedding: [0.2, -0.4, 0.1, 0.5] (50-300 dimensions)
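These steps map onto the legacy `keras.preprocessing` utilities; a sketch (newer Keras versions favour the `TextVectorization` layer instead):

```python
# Tokenize -> integer sequences -> padding, using Keras utilities.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["I love this", "I hate this bad movie"]
tok = Tokenizer(num_words=1000, oov_token="<OOV>")
tok.fit_on_texts(texts)

seqs = tok.texts_to_sequences(texts)    # words -> integer IDs
padded = pad_sequences(seqs, maxlen=5)  # pad with 0 / truncate to length 5
print(tok.word_index)
print(padded)
```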
Problem: Regular RNNs suffer from vanishing gradient
RNN: h_t = tanh(W_h * h_{t-1} + W_x * x_t)
Problem: Gradient → 0 over many time steps
Long-range dependencies lost
LSTM Solution: Memory cells + gates
Cell State (C_t): "Long-term memory" (relatively unchanged)
Hidden State (h_t): "Short-term output"
Three Gates:
1. Forget Gate: What to forget from previous cell state
2. Input Gate: What new information to add
3. Output Gate: What to output from cell state
LSTM Cell Equations:
Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Cell Update: C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
Cell State: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden: h_t = o_t ⊙ tanh(C_t)
Network Architecture:
Input Layer (Embedding)
↓
[Embedding Vectors] (text → 50-dim vectors)
↓
LSTM Layer
↓
[Hidden States] (sequential processing)
↓
Dense Layer
↓
Output Layer (Sigmoid)
↓
Sentiment: [0] Negative or [1] Positive
Text: "I love this product"
Label: positive (1)
Text: "This is the worst"
Label: negative (0)
1. Forward pass: Input → LSTM → Dense → Sigmoid → Prediction
2. Calculate loss: Binary Crossentropy
3. Backpropagation: Compute gradients
4. Update weights: Using Adam optimizer
5. Repeat for multiple epochs
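Putting the pieces together, a minimal Keras model matching this architecture (layer sizes are illustrative assumptions, not tuned values):

```python
# Embedding -> LSTM -> Dense -> Sigmoid sentiment classifier (sketch).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=50),   # vocabulary -> 50-dim vectors
    LSTM(64),                                    # sequential processing
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),              # P(positive sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(padded_sequences, labels, epochs=10, validation_split=0.2)
```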
🎓 Learning Outcomes
- Preprocess text for neural networks
- Understand LSTM architecture
- Build sentiment classification models
- Train deep learning models
- Handle sequential text data
📋 Overview
Enhanced version of LSTM sentiment classification with advanced techniques, including dropout regularization and an improved architecture.
🎯 Objectives
- Implement advanced regularization techniques
- Improve model performance
- Handle overfitting in neural networks
📚 Key Topics Covered
Training vs. validation loss behaviour:
- Good generalization: training loss and validation loss both keep decreasing together.
- Overfitting: training loss keeps falling while validation loss flattens and then rises, meaning the model memorizes the training data instead of learning general patterns.
Random deactivation of neurons during training to prevent co-adaptation.
Without Dropout: With Dropout (50%):
┌─────────────┐ ┌──────────────┐
│ ● ● ● ● │ │ ● ✗ ● ✗ │
│ ╲ │ ╱ │ │ ╲ │ ╱ │ (Some neurons
│ ╲│╱ │ → │ ╲│╱ │ randomly turned off)
│ ● │ │ ● │
└─────────────┘ └──────────────┘
Benefits:
- ✓ Prevents co-adaptation of neurons
- ✓ Forces learning of robust features
- ✓ Acts as ensemble of models
- ✓ Reduces overfitting
Implementation:
Dropout Rate: 0.5 (50% neurons dropped)
After Training: All neurons active, weights × (1 - dropout_rate)
Input Layer (Embedding)
↓
LSTM Layer 1 (64 units)
↓
Dropout (0.5) ← Prevents overfitting
↓
LSTM Layer 2 (32 units)
↓
Dropout (0.5) ← Additional regularization
↓
Dense Layer (16 units, ReLU)
↓
Output Layer (1 unit, Sigmoid)
↓
Sentiment Prediction
| Parameter | Purpose | Common Values |
|---|---|---|
| Embedding Dim | Vector size for words | 50, 100, 300 |
| LSTM Units | Hidden state size | 32, 64, 128, 256 |
| Dropout Rate | Fraction to drop | 0.2, 0.5, 0.7 |
| Learning Rate | Optimization step size | 0.001, 0.01, 0.1 |
| Batch Size | Samples per update | 16, 32, 64, 128 |
| Epochs | Training iterations | 10-100 |
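A sketch of the stacked architecture with dropout, using illustrative values from the table above (training it with `model.fit` produces logs like the sample below):

```python
# Stacked LSTM with dropout regularization (sketch, hyperparameters illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=10000, output_dim=100),
    LSTM(64, return_sequences=True),   # pass the full sequence to the next LSTM
    Dropout(0.5),                      # regularization between recurrent layers
    LSTM(32),
    Dropout(0.5),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```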
Epoch 1/50
Loss: 0.693, Accuracy: 0.50, Val_Loss: 0.691, Val_Accuracy: 0.50
Epoch 2/50
Loss: 0.620, Accuracy: 0.67, Val_Loss: 0.620, Val_Accuracy: 0.65
...
Epoch 50/50
Loss: 0.180, Accuracy: 0.95, Val_Loss: 0.320, Val_Accuracy: 0.88
🎓 Learning Outcomes
- Implement regularization techniques
- Build deeper neural networks
- Tune hyperparameters effectively
- Monitor training with metrics
- Improve model generalization
- Understand overfitting and solutions
📋 Overview
A complete real-world NLP application demonstrating spam detection using Bag-of-Words and Multinomial Naïve Bayes.
🎯 Objectives
- Develop a practical NLP application
- Preprocess diverse text data
- Classify messages as spam or legitimate (ham)
📚 Key Topics Covered
Binary Classification Task:
- Spam: Unsolicited, marketing, phishing messages
- Ham: Legitimate messages
Real-World Examples:
Spam Messages:
"Congratulations! You won a free lottery"
"Call now to claim your prize"
"Earn money fast by clicking this link"
"URGENT: Verify your account immediately"
Ham Messages:
"This is a meeting reminder"
"Let's have lunch tomorrow"
"Your appointment is scheduled"
"Thanks for your help!"
┌──────────────────────────┐
│ Raw Text Message │
│ "Congratulations! You │
│ won a free lottery" │
└────────────┬─────────────┘
↓
┌──────────────────────────┐
│ Text Preprocessing │
│ • Lowercase │
│ • Remove special chars │
│ • Strip whitespace │
└────────────┬─────────────┘
↓
│ "congratulations you won │
│ a free lottery" │
↓
┌──────────────────────────┐
│ Bag-of-Words (BoW) │
│ CountVectorizer │
└────────────┬─────────────┘
↓
│ {won: 1, free: 1, │
│ lottery: 1, ...} │
↓
┌──────────────────────────┐
│ Naïve Bayes Classifier │
└────────────┬─────────────┘
↓
│ P(Spam|Words) = ? │
│ P(Ham|Words) = ? │
↓
┌──────────────────────────┐
│ Prediction: SPAM ✓ │
└──────────────────────────┘
Step 1: Original
Input: "Congratulations! You won a free lottery"
Step 2: Lowercase
"congratulations! you won a free lottery"
Step 3: Remove non-letters (punctuation, numbers)
"congratulations you won a free lottery"
Step 4: Split into tokens on whitespace
["congratulations", "you", "won", "a", "free", "lottery"]
Vocabulary (from training):
{congratulations: 0, you: 1, won: 2, a: 3, free: 4, lottery: 5, ...}
Message 1: "Congratulations you won a free lottery"
BoW Vector: [1, 1, 1, 1, 1, 1, 0, 0, 0, ...]
Message 2: "Let's have lunch tomorrow"
BoW Vector: [0, 0, 0, 0, 0, 0, 1, 1, 1, ...]
Probability calculation:
P(Spam|Message) = P(Message|Spam) × P(Spam) / P(Message)
For Bag-of-Words:
P(Message|Spam) = ∏ P(word_i|Spam)
Decision:
If P(Spam|Message) > P(Ham|Message) → Classify as SPAM
Else → Classify as HAM
Example:
Message: "Win cash now!"
P(Spam|"win", "cash", "now") =
P("win"|Spam) × P("cash"|Spam) × P("now"|Spam) × P(Spam) / P(Message)
P(win|Spam) = 0.05 (5% of spam contain "win")
P(cash|Spam) = 0.08 (8% of spam contain "cash")
P(now|Spam) = 0.03 (3% of spam contain "now")
P(Spam) = 0.4 (40% of messages are spam)
Result (after normalizing against the same product computed under Ham): P(Spam|Message) = 0.8 > 0.2 = P(Ham|Message) → SPAM ✓
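A minimal end-to-end sketch of this classifier (a tiny hand-made dataset for illustration; the notebook would use a real labelled corpus):

```python
# Bag-of-Words + Multinomial Naive Bayes spam filter (toy data, sketch only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Congratulations! You won a free lottery",   # spam
    "Call now to claim your prize",              # spam
    "This is a meeting reminder",                # ham
    "Let's have lunch tomorrow",                 # ham
]
labels = [1, 1, 0, 0]                            # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["Win cash now!"]))          # expected [1] (spam) on this toy data
print(model.predict_proba(["Win cash now!"]))    # posterior probabilities per class
```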
Confusion Matrix:
Predicted Spam Predicted Ham
Actual Spam TP FN
Actual Ham FP TN
Metrics:
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP) (of predicted spam, how many correct)
Recall = TP / (TP + FN) (of actual spam, how many caught)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example Results:
TP = 95 (correctly identified spam)
FP = 5 (incorrectly marked ham as spam)
FN = 10 (missed spam messages)
TN = 90 (correctly identified ham)
Accuracy = (95 + 90) / 200 = 92.5%
Precision = 95 / (95 + 5) = 95%
Recall = 95 / (95 + 10) = 90.5%
F1-Score = 2 × (0.95 × 0.905) / (0.95 + 0.905) = 0.926
Test 1: "Win cash now!"
Prediction: SPAM (Probability: 92%)
Test 2: "Are we meeting today?"
Prediction: HAM (Probability: 88%)
Test 3: "Claim your free prize"
Prediction: SPAM (Probability: 95%)
Test 4: "See you at the meeting"
Prediction: HAM (Probability: 91%)
- ✓ Simple and interpretable
- ✓ Fast training and prediction
- ✓ Effective for spam detection
- ✓ Works with limited data
- ✓ Easy to update with new messages
- ✓ Good baseline for classification
Challenges:
1. Spam variations: Attackers constantly change messages
2. False positives: Legitimate messages marked as spam
3. False negatives: Spam gets through
4. Language evolution: New words, slang, emojis
5. Multiple languages: Different preprocessing needed
Solutions:
1. Regular model retraining
2. Balanced evaluation metrics
3. Combine with other features (sender, links, etc.)
4. Use ensemble methods
5. Handle multiple languages
🎓 Learning Outcomes
- Develop end-to-end NLP application
- Preprocess diverse text data
- Implement practical feature extraction
- Apply Naïve Bayes for binary classification
- Evaluate model performance
- Handle real-world spam detection problem
- Understand practical NLP deployment
After studying these practicals, you will understand:
✅ How to preprocess text data
✅ How to extract meaningful features from text
✅ How to train machine learning models for NLP tasks
✅ How word embeddings capture semantic meaning
✅ How to classify text using various algorithms
✅ How to cluster similar documents
✅ How to build deep learning models (LSTM) for NLP
✅ How to implement real-world NLP applications
1990s–2000s: Statistical methods (n-grams, HMMs)
↓
2010s: Machine learning (SVM, Naïve Bayes)
↓
2013: Word embeddings (Word2Vec)
↓
2015: Deep learning (RNN, LSTM)
↓
2017: Transformer architecture (Attention is All You Need)
↓
2018: BERT and contextual embeddings
↓
2020+: Large Language Models (GPT-3, T5, ELECTRA)
↓
2023+: Multimodal models, RAG, Fine-tuning
- Multimodal Learning: Combining text with images, audio, and video
- Few-Shot Learning: Learning from minimal examples
- Retrieval-Augmented Generation (RAG): Combining retrieval with generation
- Domain Adaptation: Transferring knowledge between domains
- Ethical NLP: Fair, transparent, and responsible AI
- Low-Resource Languages: Improving NLP for under-resourced languages
- Efficient Models: Smaller, faster models for edge devices
- Stanford CS224N: NLP with Deep Learning
- Andrew Ng's Deep Learning Specialization
- Hugging Face NLP Course
- "Speech and Language Processing" by Jurafsky & Martin
- "Natural Language Processing with Python" (NLTK Book)
- "Deep Learning for NLP" by Yoav Goldberg
- "Attention is All You Need" (Transformer)
- "BERT: Pre-training of Deep Bidirectional Transformers"
- "Sequence to Sequence Learning with Neural Networks"
For questions or clarifications regarding this summary or the practical implementations:
📧 Email: prexitjoshi@gmail.com
🎓 Institution: University Institute of Engineering and Technology, Punjab University (UIET, PU)
👤 Author: PREXIT JOSHI (Roll No. UE233118)
🏫 Department: Computer Science and Engineering (CSE)
Natural-Language-Processing/
├── 01_Comprehensive_NLP_Pipeline_Linguistic_Analysis.ipynb
├── 02_N_Gram_Analysis_Tokenization_Probability.ipynb
├── 03_Feature_Extraction_TF_TF-IDF.ipynb
├── 04_Word_Embeddings_Word2Vec_GloVe_FastText_BERT.ipynb
├── 05_Text_Classification_Naive_Bayes_SVM.ipynb
├── 06_K-Means_Text_Clustering.ipynb
├── 07_POS_Tagging_Part_of_Speech.ipynb
├── 08_Text_Processing_LSTM_Sentiment_Classification.ipynb
├── 09_Advanced_LSTM_Sentiment_Classification.ipynb
├── 10_Spam_Detection_Naive_Bayes_Application.ipynb
├── README.md # This file
├── GETTING_STARTED.md # Setup and quick start guide
├── CONTRIBUTING.md # Contribution guidelines
├── CHANGELOG.md # Version history
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
└── .gitignore # Git ignore rules
- NLTK - Natural Language Toolkit
- spaCy - Industrial-strength NLP
- Gensim - Word embeddings (Word2Vec, FastText)
- Transformers - Pre-trained models (BERT, GPT)
- scikit-learn - Classic ML algorithms
- TensorFlow/Keras - Deep learning framework
- PyTorch - Alternative DL framework
- Pandas - Data manipulation
- NumPy - Numerical computing
- Jupyter - Interactive notebooks
| Metric | Value |
|---|---|
| Total Practicals | 10 |
| Total Code Cells | 100+ |
| Documentation Lines | 1400+ |
| Code Examples | 50+ |
| Diagrams/Visualizations | 30+ |
| Topics Covered | 50+ |
| Estimated Learning Time | 30-40 hours |
We welcome contributions! See CONTRIBUTING.md for:
- How to report bugs
- How to suggest features
- Pull request process
- Coding standards
- Commit message guidelines
# 1. Fork the repository
# 2. Create feature branch
git checkout -b feature/amazing-addition
# 3. Make changes and commit
git commit -m "feat: add amazing NLP feature"
# 4. Push and create PR
git push origin feature/amazing-addition
- Issues: GitHub Issues
- Email: prexitjoshi@gmail.com
- Discussions: GitHub Discussions
This project is licensed under the MIT License - see LICENSE file for details.
MIT License - Free for personal, educational, and commercial use
with attribution required.
If you use this project in your research or work, please cite:
@misc{joshi2025nlp,
title={Natural Language Processing: Comprehensive Practicals},
author={Joshi, Prexit},
year={2025},
url={https://github.com/intronep666/Natural-Language-Processing}
}
Current Version: 1.0.0 (November 29, 2025)
See CHANGELOG.md for detailed version history and planned features.
- NLTK & spaCy Teams for exceptional NLP libraries
- Hugging Face for transformer models and community
- TensorFlow & PyTorch communities
- scikit-learn for ML tools
- All Contributors and supporters
- GitHub Repository: https://github.com/intronep666/Natural-Language-Processing
- Author GitHub: https://github.com/intronep666
- Institution: UIET, PU
Natural Language Processing is a rapidly evolving field that combines linguistics, computer science, and machine learning. From simple text preprocessing to advanced transformer-based models, NLP enables machines to understand and generate human language in increasingly sophisticated ways.
The practical implementations in this repository demonstrate fundamental and advanced NLP concepts, providing hands-on experience with real-world applications and techniques. Whether you're interested in sentiment analysis, text classification, machine translation, or information extraction, NLP offers powerful tools and methodologies to solve complex language-based problems.
This repository is designed to:
- ✅ Provide comprehensive, hands-on learning
- ✅ Cover beginner to intermediate concepts
- ✅ Include well-documented, runnable code
- ✅ Foster community contributions
- ✅ Serve as a portfolio project
Happy Learning! 🚀