A comprehensive repository of 10 practical NLP implementations covering everything from basic text processing to advanced deep learning models.
Quick Start • Installation • Practicals • Contributing • License
| Field | Details |
|---|---|
| Name | PREXIT JOSHI |
| Roll Number | UE233118 |
| Branch | Computer Science and Engineering (CSE) |
| Institute | University Institute of Engineering and Technology, Punjab University (UIET, PU) |
| Email | 📧 prexitjoshi@gmail.com |
| GitHub | @intronep666 |
Get started in minutes:
# 1. Clone repository
git clone https://github.com/intronep666/Natural-Language-Processing.git
cd Natural-Language-Processing
# 2. Install dependencies
pip install -r requirements.txt
# 3. Download NLP data
python -m spacy download en_core_web_sm
python -c "import nltk; nltk.download('punkt')"
# 4. Launch Jupyter
jupyter notebook
# 5. Open and run practicals!
For detailed setup instructions, see GETTING_STARTED.md.
- Python 3.8 or higher
- pip/conda
- ~2GB disk space (for models)
- Virtual environment (recommended)
# Create virtual environment
python -m venv nlp_env
source nlp_env/bin/activate # On Windows: nlp_env\Scripts\activate
# Install all dependencies
pip install -r requirements.txt
# Download spaCy model
python -m spacy download en_core_web_sm
See GETTING_STARTED.md for detailed setup, troubleshooting, and next steps.
- What is NLP?
- Core Concepts
- NLP Processing Pipeline
- Key Techniques
- Applications
- Challenges
- Tools & Libraries
- Practical Implementations
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) and linguistics that focuses on enabling computers to understand, interpret, and generate human language in a meaningful and useful way. It bridges the gap between human communication and computer understanding.
- 💬 Communication Bridge: Enables machines to understand human language naturally
- 🔍 Data Extraction: Extract valuable insights from unstructured text data
- 🤖 Automation: Automate language-based tasks at scale
- 📊 Business Intelligence: Analyze customer feedback, reviews, and sentiment
- 🌐 Global Reach: Break language barriers through translation
┌─────────────────────────────────────────┐
│ NLP Core Objectives │
├─────────────────────────────────────────┤
│ 1. Understanding (Comprehension) │
│ 2. Generation (Producing text) │
│ 3. Translation (Language to language) │
│ 4. Analysis (Extracting information) │
│ 5. Classification (Categorizing text) │
└─────────────────────────────────────────┘
Breaking down text into smaller units (words, sentences, or subwords).
Example:
Text: "Natural Language Processing is amazing!"
Tokens: ["Natural", "Language", "Processing", "is", "amazing", "!"]
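A minimal sketch with spaCy (assuming the `en_core_web_sm` model from the Quick Start is installed) that reproduces the tokens above:

```python
# Tokenization sketch using spaCy.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Natural Language Processing is amazing!")
print([token.text for token in doc])
# ['Natural', 'Language', 'Processing', 'is', 'amazing', '!']
```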
| Stemming | Lemmatization |
|---|---|
| Removes suffixes mechanically | Uses vocabulary and morphology |
| Fast but may oversimplify | Accurate but slower |
| "running", "runs" → "run" | "running", "runs" → "run" |
Common words (the, is, and, etc.) that are often removed for efficiency.
Example:
Original: "The cat is on the mat"
After removal: "cat mat"
Labeling each word with its grammatical role.
The → DET (Determiner)
cat → NN (Noun)
runs → VB (Verb)
quickly → RB (Adverb)
Identifying and classifying named entities in text.
Text: "Apple Inc. is located in Cupertino, California"
Entities:
- "Apple Inc." → Organization
- "Cupertino" → Location
- "California" → Location
Understanding grammatical relationships between words.
"The cat chased the mouse"
↓
nsubj ↓ obj
subject: "cat"
action: "chased"
object: "mouse"
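These relationships can be read off spaCy's dependency parse; a sketch:

```python
# Dependency parsing sketch: print each token's relation and head word.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The cat chased the mouse")
for token in doc:
    print(f"{token.text:<7} {token.dep_:<6} head: {token.head.text}")
# cat     nsubj  head: chased
# chased  ROOT   head: chased
# mouse   dobj   head: chased   (the two determiners print as well)
```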
┌──────────────────┐
│ Raw Text │
└────────┬─────────┘
↓
┌──────────────────┐
│ Text Cleaning │ (Remove special characters, lowercasing)
└────────┬─────────┘
↓
┌──────────────────┐
│ Tokenization │ (Break into tokens)
└────────┬─────────┘
↓
┌──────────────────┐
│ Normalization │ (Stemming/Lemmatization)
└────────┬─────────┘
↓
┌──────────────────┐
│ Stop Word │ (Remove common words)
│ Removal │
└────────┬─────────┘
↓
┌──────────────────┐
│ Feature │ (Convert to numerical vectors)
│ Extraction │
└────────┬─────────┘
↓
┌──────────────────┐
│ ML/DL Model │ (Classification, clustering, etc.)
└────────┬─────────┘
↓
┌──────────────────┐
│ Prediction/ │ (Output results)
│ Analysis │
└──────────────────┘
Converts text into a vector of word counts, ignoring word order.
Sentence: "I love NLP, NLP is great"
BoW: {
"I": 1,
"love": 1,
"NLP": 2,
"is": 1,
"great": 1
}
Weighs words based on their importance in a document and corpus.
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Where:
- TF = frequency of term in document
- IDF = log(total documents / documents containing term)
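A worked sketch of this formula in plain Python (natural log, no smoothing), on a toy three-document corpus:

```python
# Plain-Python TF-IDF following the formula above (illustrative only).
import math

docs = [
    ["NLP", "is", "great"],
    ["NLP", "is", "fun"],
    ["Python", "is", "popular"],
]

def tf_idf(term, doc, corpus):
    tf = doc.count(term)                  # term frequency in this document
    df = sum(term in d for d in corpus)   # documents containing the term
    idf = math.log(len(corpus) / df)      # inverse document frequency
    return tf * idf

print(tf_idf("NLP", docs[0], docs))   # 1 * ln(3/2) ≈ 0.41  (distinctive term)
print(tf_idf("is", docs[0], docs))    # 1 * ln(3/3) = 0.0   (appears everywhere)
```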
Sequences of N consecutive words.
Text: "Natural Language Processing"
Unigrams (1-gram):
["Natural"], ["Language"], ["Processing"]
Bigrams (2-gram):
["Natural", "Language"], ["Language", "Processing"]
Trigrams (3-gram):
["Natural", "Language", "Processing"]
- Captures semantic similarity between words
- Two models: CBOW (Continuous Bag of Words) and Skip-gram
- Output: Dense vector for each word
- Count-based embedding using word co-occurrence matrix
- Combines global statistics with local context
- Extension of Word2Vec
- Treats words as bags of character n-grams
- Can generate vectors for out-of-vocabulary words
- Contextual embeddings based on transformer architecture
- Understands context from both directions
- State-of-the-art for many NLP tasks
Determining the emotional tone or sentiment of text.
Positive Sentiment: "This movie is absolutely amazing!"
Negative Sentiment: "I hate waiting in long lines"
Neutral Sentiment: "The temperature is 25 degrees"
Assigning documents to predefined categories.
Common Algorithms:
- Naïve Bayes (probabilistic)
- Support Vector Machine (SVM)
- Neural Networks (Deep Learning)
- LSTM (Long Short-Term Memory)
Grouping similar documents without predefined labels.
Popular Method: K-Means
- Partitions documents into K clusters
- Minimizes within-cluster distance
- Maximizes between-cluster distance
- Long Short-Term Memory networks
- Handle sequential data (text)
- Maintain long-term dependencies
- Excellent for sentiment analysis and text generation
- Siri, Alexa, Google Assistant
- Customer support chatbots
- Conversational AI systems
- Filtering spam messages
- Identifying phishing emails
- Priority inbox management
- Google Translate
- Breaking language barriers
- Real-time translation
- Extract structured data from unstructured text
- Resume parsing
- Document analysis
- Monitoring brand reputation
- Analyzing customer reviews
- Social media monitoring
- Market research
- Search engines
- FAQ automation
- Knowledge base systems
- Search engines (Google, Bing)
- Document ranking
- Semantic search
- Person/Place/Organization identification
- Resume screening
- News article analysis
- Autocomplete (Gmail, predictive text)
- Content generation
- Paraphrasing tools
- News categorization
- Document organization
- Topic modeling
- Lexical Ambiguity: Words with multiple meanings
- "bank" (financial institution vs. river bank)
- Syntactic Ambiguity: Multiple grammatical interpretations
- "I saw the man with the telescope"
- Machines struggle with understanding nuanced meanings
- Sarcasm, idioms, and cultural references are difficult
- Different languages have different structures
- Dialects, slang, and informal speech
- Misspellings and typos
- Limited labeled data for training
- Low-resource languages
- Domain-specific terminology
- Understanding relationships between distant words
- Solved partially by LSTM and Transformers
- Training data may contain biases
- Results in biased models and unfair predictions
- Large language models require significant resources
- Training and inference can be expensive
| Library | Purpose | Features |
|---|---|---|
| NLTK | Natural Language Toolkit | Tokenization, POS tagging, stemming, NER |
| spaCy | Industrial-strength NLP | Fast, efficient, production-ready |
| TextBlob | Simple text processing | Sentiment analysis, POS tagging |
| Gensim | Topic modeling & word embeddings | Word2Vec, Doc2Vec, FastText |
| Transformers | Pre-trained models | BERT, GPT, T5 |
| scikit-learn | Machine learning | Text classification, clustering |
| TensorFlow/PyTorch | Deep learning frameworks | Neural networks, LSTM |
| Dataset | Purpose | Size |
|---|---|---|
| 20 Newsgroups | Text classification | ~19,000 documents |
| Movie Reviews | Sentiment analysis | 1,000 positive + 1,000 negative |
| Wikipedia Corpus | General knowledge | Millions of articles |
| Common Crawl | Web data | Petabytes of text |
| GLUE | Model evaluation | Multiple benchmark tasks |
📋 Overview
A complete end-to-end NLP pipeline demonstrating all fundamental linguistic analysis techniques using two powerful libraries: spaCy and NLTK.
🎯 Objectives
- Understand complete text processing workflow
- Learn multiple NLP techniques in one integrated example
- Perform comprehensive linguistic analysis on sample text
📚 Key Topics Covered
| Technique | Description | Library |
|---|---|---|
| Tokenization | Breaking text into individual words and sentences | spaCy |
| POS Tagging | Assigning grammatical roles to words | spaCy |
| Lemmatization | Converting words to base form using vocabulary | spaCy |
| Stemming | Reducing words to root form mechanically | NLTK |
| Stop Word Removal | Filtering common, less meaningful words | spaCy |
| Noun Phrase Chunking | Identifying meaningful noun phrases | spaCy |
| Dependency Parsing | Understanding grammatical relationships | spaCy |
| Named Entity Recognition | Identifying persons, places, organizations | spaCy |
💡 Practical Example
Input: "On May 13, 2025, the Israeli Air Force executed strikes on Gaza's European Hospital"
Processing:
- Tokenization: ["On", "May", "13", ",", "2025", ...]
- POS Tags: DET, PROPN, NUM, PUNCT, NUM, ...
- NER: "May" → DATE, "Israeli Air Force" → ORG, "Gaza" → LOC, "Hospital" → ORG
- Lemmatization: "executed" → "execute"
🎓 Learning Outcomes
- Master spaCy and NLTK libraries
- Perform complete linguistic analysis
- Understand relationship between different NLP tasks
- Handle real-world text data
📋 Overview
Explores n-gram models, a foundational technique in NLP for understanding word sequences, calculating probabilities, and predicting word patterns.
🎯 Objectives
- Understand tokenization and punctuation removal
- Generate n-grams of varying sizes
- Calculate frequency and probability distributions
📚 Key Topics Covered
| Concept | Definition | Use Case |
|---|---|---|
| Unigrams (1-grams) | Individual words | Word frequency analysis |
| Bigrams (2-grams) | Two consecutive words | Word associations |
| Trigrams (3-grams) | Three consecutive words | Phrase patterns |
| Frequency Counting | How often each n-gram appears | Statistical analysis |
| Probability Calculation | Relative frequency of n-grams | Language modeling |
💡 Practical Example
Text: "NLP is amazing. It is widely used in AI applications"
Unigrams (unique types): [NLP, is, amazing, It, widely, used, in, AI, applications]
Frequency: {is: 2, NLP: 1, amazing: 1, ...}
Bigrams: [(NLP, is), (is, amazing), (is, widely), (in, AI), ...]
Probability of "is": 2/10 = 0.2 (2 occurrences out of 10 tokens)
Trigrams: [(NLP, is, amazing), (is, amazing, It), ...]
🔢 Mathematical Foundation
Unigram Probability: P(w) = Count(w) / Total_words
Bigram Probability: P(w2|w1) = Count(w1, w2) / Count(w1)
Language Model: P(w1, w2, w3) = P(w1) × P(w2|w1) × P(w3|w1,w2)
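A small sketch computing these statistics with `collections.Counter` (punctuation already removed, matching the example above):

```python
# Unigram and bigram probabilities for the example sentence.
from collections import Counter

text = "NLP is amazing It is widely used in AI applications"
words = text.split()                               # 10 tokens, "is" appears twice
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))

p_is = unigrams["is"] / len(words)                              # 2/10 = 0.2
p_amazing_given_is = bigrams[("is", "amazing")] / unigrams["is"]  # 1/2 = 0.5
print(p_is, p_amazing_given_is)
```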
🎓 Learning Outcomes
- Extract and analyze n-grams from text
- Calculate statistical probabilities
- Understand language modeling foundations
- Prepare for more advanced NLP techniques
📋 Overview
Demonstrates two fundamental feature extraction techniques that convert text into numerical vectors suitable for machine learning algorithms.
🎯 Objectives
- Convert text documents into numerical feature vectors
- Understand importance weighting mechanisms
- Compare simple frequency with intelligent weighting
📚 Key Topics Covered
- Simple word count approach
- Represents how often a word appears in a document
- Formula:
TF(t, d) = frequency of term t in document d
Example TF Matrix:
Document 1: "NLP is amazing, NLP is great"
NLP is amazing great
Doc 1 2 2 1 1
Document 2: "Machine learning is powerful"
NLP is learning powerful
Doc 2 0 1 1 1
- Weights terms based on importance across documents
- Reduces weight of common words
- Highlights distinctive terms
Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
IDF(t) = log(Total_Documents / Documents_containing_t)
Comparison Example:
Word "is" (appears in most documents):
- TF: 2 (high count)
- IDF: log(4/3) ≈ 0.29 (low importance)
- TF-IDF: 2 × 0.29 ≈ 0.58 (low weight)
Word "NLP" (appears in few documents):
- TF: 2 (high count)
- IDF: log(4/1) ≈ 1.39 (high importance)
- TF-IDF: 2 × 1.39 ≈ 2.78 (high weight) ✓
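In practice both representations come from scikit-learn; a minimal sketch comparing them (scikit-learn uses a smoothed IDF, so the exact weights differ from the hand computation above):

```python
# Bag-of-Words counts vs. TF-IDF weights with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "NLP is amazing NLP is great",
    "Machine learning is powerful",
]
counts = CountVectorizer().fit_transform(docs)
tfidf = TfidfVectorizer().fit_transform(docs)
print(counts.toarray())            # raw term frequencies
print(tfidf.toarray().round(2))    # "is" down-weighted, "nlp" stands out
```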
🎓 Learning Outcomes
- Convert text to numerical vectors
- Understand importance weighting
- Choose appropriate feature extraction method
- Prepare data for ML algorithms
📋 Overview
Comprehensive exploration of modern word embedding techniques that capture semantic and syntactic relationships between words.
🎯 Objectives
- Learn multiple word embedding approaches
- Understand semantic relationships
- Compare different embedding methods
📚 Key Topics Covered
- Two architectures: CBOW (Continuous Bag of Words) and Skip-gram
- Predicts words from context (Skip-gram) or context from word (CBOW)
- Vector size: 50-300 dimensions
- Limitation: Cannot handle out-of-vocabulary words
Example:
Word: "king"
Vector: [0.2, -0.4, 0.1, 0.5, -0.2, ...]
Similar words: ["queen", "prince", "emperor"]
Vector distances measure similarity
- Count-based approach using global word-word co-occurrence
- Combines global statistics with local context
- Generally more stable than Word2Vec
- Pre-trained models available (Wikipedia, Common Crawl)
Matrix Factorization:
X[i,j] = count of word j in context of word i
GloVe decomposes this matrix into embeddings
- Extension of Word2Vec
- Treats words as bags of character n-grams
- Advantage: Can generate vectors for out-of-vocabulary words
- Better for morphologically rich languages
Example (OOV handling):
Training vocabulary: ["running", "runner", "run"]
Unknown word: "runs" (not in training)
Word2Vec: Cannot create vector ✗
FastText: Builds the vector from character n-grams of "runs", e.g. ["<ru", "run", "uns", "ns>"] ✓
- Contextual embeddings (word meaning changes with context)
- Bidirectional: understands context from both directions
- Pre-trained on massive corpus
- State-of-the-art for many tasks
Contextual Example:
Sentence 1: "I saw the bank by the river"
Sentence 2: "I deposited money at the bank"
Word: "bank"
- Embedding 1: Vector representing financial institution
- Embedding 2: Vector representing river bank
BERT generates different vectors based on context! ✓
Comparison Table:
| Method | Type | OOV Handling | Speed | Context |
|---|---|---|---|---|
| Word2Vec | Predictive | ✗ | Fast | Static |
| GloVe | Count-based | ✗ | Medium | Static |
| FastText | Hybrid | ✓ | Medium | Static |
| BERT | Neural | ✓ | Slow | Dynamic |
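A Gensim sketch illustrating the OOV difference between Word2Vec and FastText (a toy corpus for demonstration only; real models are trained on far more text):

```python
# Word2Vec vs. FastText on out-of-vocabulary words (Gensim 4.x API).
from gensim.models import Word2Vec, FastText

corpus = [["running", "runner", "run"], ["the", "runner", "is", "running"]]
w2v = Word2Vec(corpus, vector_size=50, min_count=1, epochs=50)
ft = FastText(corpus, vector_size=50, min_count=1, epochs=50)

print("runs" in w2v.wv.key_to_index)   # False: Word2Vec has no vector for "runs"
print(ft.wv["runs"][:5])               # FastText composes one from character n-grams
```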
🎓 Learning Outcomes
- Train and use Word2Vec models
- Utilize pre-trained GloVe embeddings
- Handle OOV words with FastText
- Implement contextual embeddings with BERT
- Choose embeddings based on task requirements
📋 Overview
Implements two classic supervised learning algorithms for text categorization using the 20 Newsgroups dataset.
🎯 Objectives
- Build text classification models
- Compare probabilistic vs. geometric approaches
- Evaluate model performance with multiple metrics
📚 Key Topics Covered
Raw Text
↓
TF-IDF Vectorization (convert to numerical features)
↓
Train/Test Split (prepare data)
↓
Model Training (Naïve Bayes or SVM)
↓
Prediction & Evaluation
- Probabilistic classifier based on Bayes' Theorem
- Assumes feature independence (Naïve assumption)
- Fast training and prediction
- Works well with text (TF-IDF vectors)
Formula:
P(Category|Document) = P(Document|Category) × P(Category) / P(Document)
For text: P(category|words) ∝ P(category) × ∏ P(word|category)
Advantages:
- ✓ Fast training
- ✓ Good with high-dimensional data
- ✓ Effective for text
- ✓ Handles missing values well
Disadvantages:
- ✗ Independence assumption too strong
- ✗ May underestimate probabilities
- Geometric classifier finding optimal hyperplane
- Maximizes margin between classes
- Kernel trick for non-linear problems
- Linear kernel works well for text (TF-IDF)
Concept:
┌─────────────────────────────┐
│ Feature Space │
│ │
│ ● Class 1 (Spam) │
│ ● ● ════════ │ Optimal
│ ● ● ║ Margin │ Hyperplane
│ ● ║ │
│ ────────────────║──────── │
│ ║ Margin ║ │
│ ○ ○ ║ ○ │
│ ○ ○ ════════ │
│ ○ ○ Class 0 (Ham) │
└─────────────────────────────┘
Advantages:
- ✓ Effective in high dimensions
- ✓ Memory efficient
- ✓ Versatile (different kernels)
- ✓ Handles complex boundaries
Disadvantages:
- ✗ Slower training on large datasets
- ✗ Requires careful kernel selection
- ✗ Hard to interpret
- 18,846 documents
- 20 categories
- Real-world news articles
- Imbalanced distribution
Categories (sample):
alt.atheism, soc.religion.christian, comp.graphics, sci.med
Evaluation Metrics:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP) (of predicted positive, how many correct)
Recall = TP / (TP + FN) (of actual positive, how many caught)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
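A sketch of the full pipeline with scikit-learn (downloads the 20 Newsgroups dataset on first run; the printed accuracies are whatever the run produces, not fixed numbers):

```python
# TF-IDF + Naive Bayes / Linear SVM on 20 Newsgroups.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

for name, clf in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train.data, train.target)
    print(name, accuracy_score(test.target, model.predict(test.data)))
```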
🎓 Learning Outcomes
- Implement classification pipelines
- Train Naïve Bayes and SVM classifiers
- Evaluate models with multiple metrics
- Compare algorithm performance
- Make informed algorithm choices
📋 Overview
Unsupervised learning approach to automatically group similar documents into clusters based on their content.
🎯 Objectives
- Understand unsupervised learning
- Apply clustering to text documents
- Analyze cluster characteristics
📚 Key Topics Covered
An iterative algorithm that partitions documents into K clusters:
Algorithm Steps:
Step 1: Choose K (number of clusters)
↓
Step 2: Randomly initialize K centroids
↓
Step 3: Assign each document to nearest centroid (Euclidean distance)
↓
Step 4: Recalculate centroids as mean of assigned points
↓
Step 5: Repeat steps 3-4 until convergence
↓
Step 6: Analyze clusters
Visualization:
Iteration 1: Iteration 2: Final:
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ ★ ●● │ │ ★ ●● │ │ ★ ●● │
│ ●●● ● ● │ │●●●● ● ● │ │●●●● ● ● │
│ ●● ▲ ● │→ │●● ▲ ● │→ │●● ▲ ● │
│ ★ │ │ ★ │ │ ★ │
└──────────────┘ └──────────────┘ └──────────────┘
Initial Converging Final Clusters
Documents → TF-IDF Vectorization → K-Means → Cluster Analysis
Example Output:
Documents:
1. "Machine learning provides systems ability to learn"
2. "Artificial intelligence and ML are related"
3. "Cricket is popular sport in India"
4. "Indian cricket team won match"
TF-IDF Vector Space (sparse)
↓
K-Means with K=2
↓
Cluster 0: [Doc 1, Doc 2] - ML/AI related
Cluster 1: [Doc 3, Doc 4] - Sports related
- Choosing K: How many clusters?
  - Elbow method
  - Silhouette analysis
  - Domain knowledge
- Convergence: May find local optima
  - Multiple runs with different initializations
  - Select best result
- Scalability: Slow on very large datasets
  - Mini-batch K-Means
  - Approximate methods
Top Terms per Cluster:
Cluster 0: ["machine", "learning", "model", "data", "algorithm"]
→ ML/AI cluster
Cluster 1: ["cricket", "team", "match", "player", "game"]
→ Sports cluster
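A compact sketch of the whole flow on the four example documents (cluster numbering can swap between runs; `random_state` pins the initialization):

```python
# TF-IDF vectorization + K-Means clustering + top terms per cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "Machine learning provides systems ability to learn",
    "Artificial intelligence and ML are related",
    "Cricket is popular sport in India",
    "Indian cricket team won match",
]
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(km.labels_)                                  # e.g. [0 0 1 1]

terms = vec.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[::-1][:3]             # highest-weight terms
    print(f"Cluster {i}:", [terms[t] for t in top])
```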
🎓 Learning Outcomes
- Implement K-Means clustering
- Vectorize text for clustering
- Determine optimal number of clusters
- Interpret and analyze clusters
- Understand unsupervised learning concepts
📋 Overview
Assigns grammatical roles (parts of speech) to each word, enabling syntactic and semantic analysis.
🎯 Objectives
- Learn POS tagging concepts
- Implement using NLTK
- Understand grammatical relationships
📚 Key Topics Covered
Common POS tags in English:
| Tag | Meaning | Examples |
|---|---|---|
| NN | Noun | cat, dog, house |
| VB | Verb | run, jump, eat |
| JJ | Adjective | beautiful, quick, tall |
| RB | Adverb | quickly, carefully, very |
| DET | Determiner | the, a, an |
| IN | Preposition | in, on, at, by |
| PRP | Pronoun | he, she, it, they |
| CD | Cardinal Number | one, two, 42 |
Sentence: "The quick brown fox jumps over the lazy dog"
Words: [The quick brown fox jumps over the lazy dog]
│ │ │ │ │ │ │ │ │
Tags: [DET JJ JJ NN VB IN DET JJ NN]
- Rule-based: Hand-crafted linguistic rules
- Stochastic: Uses probabilistic models
- Neural: Deep learning approaches
- Hybrid: Combination of methods
Example Output:
Sentence: "Prexit submitted the practical on time"
Word POS Tag Description
─────────────────────────────────────────
Prexit NNP Proper Noun
submitted VBD Verb (past tense)
the DT Determiner
practical NN Noun
on IN Preposition
time NN Noun
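A minimal NLTK sketch (the tagger model must be downloaded once; on newer NLTK releases the resource is named `averaged_perceptron_tagger_eng`):

```python
# POS tagging sketch with NLTK.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # tagger model
tokens = nltk.word_tokenize("Prexit submitted the practical on time")
print(nltk.pos_tag(tokens))
# e.g. [('Prexit', 'NNP'), ('submitted', 'VBD'), ('the', 'DT'), ...]
# (exact tags can vary slightly with tagger version)
```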
- Information extraction
- Parsing and syntax analysis
- Named entity recognition (filter nouns)
- Spell checking (context-aware)
- Machine translation
- Speech recognition (disambiguation)
🎓 Learning Outcomes
- Understand linguistic grammatical concepts
- Implement POS tagging with NLTK
- Interpret POS tag sequences
- Prepare data for downstream NLP tasks
- Recognize word roles in sentences
📋 Overview
Introduces neural networks for NLP, specifically LSTM (Long Short-Term Memory) networks for sentiment classification.
🎯 Objectives
- Preprocess text for neural networks
- Build and train LSTM models
- Classify sentiment (positive/negative)
📚 Key Topics Covered
Text → Tokenization → Integer Sequences → Padding → Embedding → Neural Network
- Tokenization: Convert words to integers
Vocabulary: {love: 1, this: 2, hate: 3, bad: 4}
Text: "I love this"
Tokens: [1, 2] ("I" is not in the vocabulary, so it is skipped)
- Padding: Make all sequences same length
Original: [[1, 2], [3, 4, 5], [6]]
Padded: [[0, 1, 2],
[3, 4, 5],
[0, 0, 6]] (length=3)
- Embedding: Dense vector representation
Word: "love" (ID: 1)
Embedding: [0.2, -0.4, 0.1, 0.5] (50-300 dimensions)
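These steps map onto the legacy `keras.preprocessing` utilities; a sketch (newer Keras versions favour the `TextVectorization` layer instead):

```python
# Tokenize -> integer sequences -> padding, using Keras utilities.
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["I love this", "I hate this bad movie"]
tok = Tokenizer(num_words=1000, oov_token="<OOV>")
tok.fit_on_texts(texts)

seqs = tok.texts_to_sequences(texts)    # words -> integer IDs
padded = pad_sequences(seqs, maxlen=5)  # pad with 0 / truncate to length 5
print(tok.word_index)
print(padded)
```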
Problem: Regular RNNs suffer from vanishing gradient
RNN: h_t = tanh(W_h * h_{t-1} + W_x * x_t)
Problem: Gradient → 0 over many time steps
Long-range dependencies lost
LSTM Solution: Memory cells + gates
Cell State (C_t): "Long-term memory" (relatively unchanged)
Hidden State (h_t): "Short-term output"
Three Gates:
1. Forget Gate: What to forget from previous cell state
2. Input Gate: What new information to add
3. Output Gate: What to output from cell state
LSTM Cell Equations:
Forget Gate: f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
Input Gate: i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
Cell Update: C̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
Cell State: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Output Gate: o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
Hidden: h_t = o_t ⊙ tanh(C_t)
Network Architecture:
Input Layer (Embedding)
↓
[Embedding Vectors] (text → 50-dim vectors)
↓
LSTM Layer
↓
[Hidden States] (sequential processing)
↓
Dense Layer
↓
Output Layer (Sigmoid)
↓
Sentiment: [0] Negative or [1] Positive
Text: "I love this product"
Label: positive (1)
Text: "This is the worst"
Label: negative (0)
1. Forward pass: Input → LSTM → Dense → Sigmoid → Prediction
2. Calculate loss: Binary Crossentropy
3. Backpropagation: Compute gradients
4. Update weights: Using Adam optimizer
5. Repeat for multiple epochs
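Putting the pieces together, a minimal Keras model matching this architecture (layer sizes are illustrative assumptions, not tuned values):

```python
# Embedding -> LSTM -> Dense -> Sigmoid sentiment classifier (sketch).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential([
    Embedding(input_dim=10000, output_dim=50),   # vocabulary -> 50-dim vectors
    LSTM(64),                                    # sequential processing
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),              # P(positive sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(padded_sequences, labels, epochs=10, validation_split=0.2)
```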
🎓 Learning Outcomes
- Preprocess text for neural networks
- Understand LSTM architecture
- Build sentiment classification models
- Train deep learning models
- Handle sequential text data
📋 Overview
Enhanced version of LSTM sentiment classification with advanced techniques, including dropout regularization and an improved architecture.
🎯 Objectives
- Implement advanced regularization techniques
- Improve model performance
- Handle overfitting in neural networks
📚 Key Topics Covered
Training vs. validation loss behaviour:
- Good generalization: training loss and validation loss both keep decreasing together.
- Overfitting: training loss keeps falling while validation loss flattens and then rises, meaning the model memorizes the training data instead of learning general patterns.
Random deactivation of neurons during training to prevent co-adaptation.
Without Dropout: With Dropout (50%):
┌─────────────┐ ┌──────────────┐
│ ● ● ● ● │ │ ● ✗ ● ✗ │
│ ╲ │ ╱ │ │ ╲ │ ╱ │ (Some neurons
│ ╲│╱ │ → │ ╲│╱ │ randomly turned off)
│ ● │ │ ● │
└─────────────┘ └──────────────┘
Benefits:
- ✓ Prevents co-adaptation of neurons
- ✓ Forces learning of robust features
- ✓ Acts as ensemble of models
- ✓ Reduces overfitting
Implementation:
Dropout Rate: 0.5 (50% neurons dropped)
After Training: All neurons active, weights × (1 - dropout_rate)
Input Layer (Embedding)
↓
LSTM Layer 1 (64 units)
↓
Dropout (0.5) ← Prevents overfitting
↓
LSTM Layer 2 (32 units)
↓
Dropout (0.5) ← Additional regularization
↓
Dense Layer (16 units, ReLU)
↓
Output Layer (1 unit, Sigmoid)
↓
Sentiment Prediction
| Parameter | Purpose | Common Values |
|---|---|---|
| Embedding Dim | Vector size for words | 50, 100, 300 |
| LSTM Units | Hidden state size | 32, 64, 128, 256 |
| Dropout Rate | Fraction to drop | 0.2, 0.5, 0.7 |
| Learning Rate | Optimization step size | 0.001, 0.01, 0.1 |
| Batch Size | Samples per update | 16, 32, 64, 128 |
| Epochs | Training iterations | 10-100 |
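A sketch of the stacked architecture with dropout, using illustrative values from the table above (training it with `model.fit` produces logs like the sample below):

```python
# Stacked LSTM with dropout regularization (sketch, hyperparameters illustrative).
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

model = Sequential([
    Embedding(input_dim=10000, output_dim=100),
    LSTM(64, return_sequences=True),   # pass the full sequence to the next LSTM
    Dropout(0.5),                      # regularization between recurrent layers
    LSTM(32),
    Dropout(0.5),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2)
```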
Epoch 1/50
Loss: 0.693, Accuracy: 0.50, Val_Loss: 0.691, Val_Accuracy: 0.50
Epoch 2/50
Loss: 0.620, Accuracy: 0.67, Val_Loss: 0.620, Val_Accuracy: 0.65
...
Epoch 50/50
Loss: 0.180, Accuracy: 0.95, Val_Loss: 0.320, Val_Accuracy: 0.88
🎓 Learning Outcomes
- Implement regularization techniques
- Build deeper neural networks
- Tune hyperparameters effectively
- Monitor training with metrics
- Improve model generalization
- Understand overfitting and solutions
📋 Overview
A complete real-world NLP application demonstrating spam detection using Bag-of-Words and Multinomial Naïve Bayes.
🎯 Objectives
- Develop a practical NLP application
- Preprocess diverse text data
- Classify messages as spam or legitimate (ham)
📚 Key Topics Covered
Binary Classification Task:
- Spam: Unsolicited, marketing, phishing messages
- Ham: Legitimate messages
Real-World Examples:
Spam Messages:
"Congratulations! You won a free lottery"
"Call now to claim your prize"
"Earn money fast by clicking this link"
"URGENT: Verify your account immediately"
Ham Messages:
"This is a meeting reminder"
"Let's have lunch tomorrow"
"Your appointment is scheduled"
"Thanks for your help!"
┌──────────────────────────┐
│ Raw Text Message │
│ "Congratulations! You │
│ won a free lottery" │
└────────────┬─────────────┘
↓
┌──────────────────────────┐
│ Text Preprocessing │
│ • Lowercase │
│ • Remove special chars │
│ • Strip whitespace │
└────────────┬─────────────┘
↓
│ "congratulations you won │
│ a free lottery" │
↓
┌──────────────────────────┐
│ Bag-of-Words (BoW) │
│ CountVectorizer │
└────────────┬─────────────┘
↓
│ {won: 1, free: 1, │
│ lottery: 1, ...} │
↓
┌──────────────────────────┐
│ Naïve Bayes Classifier │
└────────────┬─────────────┘
↓
│ P(Spam|Words) = ? │
│ P(Ham|Words) = ? │
↓
┌──────────────────────────┐
│ Prediction: SPAM ✓ │
└──────────────────────────┘
Step 1: Original
Input: "Congratulations! You won a free lottery"
Step 2: Lowercase
"congratulations! you won a free lottery"
Step 3: Remove non-letters (punctuation, numbers)
"congratulations you won a free lottery"
Step 4: Split into tokens on whitespace
["congratulations", "you", "won", "a", "free", "lottery"]
Vocabulary (from training):
{congratulations: 0, you: 1, won: 2, a: 3, free: 4, lottery: 5, ...}
Message 1: "Congratulations you won a free lottery"
BoW Vector: [1, 1, 1, 1, 1, 1, 0, 0, 0, ...]
Message 2: "Let's have lunch tomorrow"
BoW Vector: [0, 0, 0, 0, 0, 0, 1, 1, 1, ...]
Probability calculation:
P(Spam|Message) = P(Message|Spam) × P(Spam) / P(Message)
For Bag-of-Words:
P(Message|Spam) = ∏ P(word_i|Spam)
Decision:
If P(Spam|Message) > P(Ham|Message) → Classify as SPAM
Else → Classify as HAM
Example:
Message: "Win cash now!"
P(Spam|"win", "cash", "now") =
P("win"|Spam) × P("cash"|Spam) × P("now"|Spam) × P(Spam) / P(Message)
P(win|Spam) = 0.05 (5% of spam contain "win")
P(cash|Spam) = 0.08 (8% of spam contain "cash")
P(now|Spam) = 0.03 (3% of spam contain "now")
P(Spam) = 0.4 (40% of messages are spam)
Result (after normalizing against the same product computed under Ham): P(Spam|Message) = 0.8 > 0.2 = P(Ham|Message) → SPAM ✓
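A minimal end-to-end sketch of this classifier (a tiny hand-made dataset for illustration; the notebook would use a real labelled corpus):

```python
# Bag-of-Words + Multinomial Naive Bayes spam filter (toy data, sketch only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Congratulations! You won a free lottery",   # spam
    "Call now to claim your prize",              # spam
    "This is a meeting reminder",                # ham
    "Let's have lunch tomorrow",                 # ham
]
labels = [1, 1, 0, 0]                            # 1 = spam, 0 = ham

model = make_pipeline(CountVectorizer(lowercase=True), MultinomialNB())
model.fit(messages, labels)

print(model.predict(["Win cash now!"]))          # expected [1] (spam) on this toy data
print(model.predict_proba(["Win cash now!"]))    # posterior probabilities per class
```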
Confusion Matrix:
Predicted Spam Predicted Ham
Actual Spam TP FN
Actual Ham FP TN
Metrics:
Accuracy = (TP + TN) / Total
Precision = TP / (TP + FP) (of predicted spam, how many correct)
Recall = TP / (TP + FN) (of actual spam, how many caught)
F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Example Results:
TP = 95 (correctly identified spam)
FP = 5 (incorrectly marked ham as spam)
FN = 10 (missed spam messages)
TN = 90 (correctly identified ham)
Accuracy = (95 + 90) / 200 = 92.5%
Precision = 95 / (95 + 5) = 95%
Recall = 95 / (95 + 10) = 90.5%
F1-Score = 2 × (0.95 × 0.905) / (0.95 + 0.905) = 0.926
Test 1: "Win cash now!"
Prediction: SPAM (Probability: 92%)
Test 2: "Are we meeting today?"
Prediction: HAM (Probability: 88%)
Test 3: "Claim your free prize"
Prediction: SPAM (Probability: 95%)
Test 4: "See you at the meeting"
Prediction: HAM (Probability: 91%)
- ✓ Simple and interpretable
- ✓ Fast training and prediction
- ✓ Effective for spam detection
- ✓ Works with limited data
- ✓ Easy to update with new messages
- ✓ Good baseline for classification
Challenges:
1. Spam variations: Attackers constantly change messages
2. False positives: Legitimate messages marked as spam
3. False negatives: Spam gets through
4. Language evolution: New words, slang, emojis
5. Multiple languages: Different preprocessing needed
Solutions:
1. Regular model retraining
2. Balanced evaluation metrics
3. Combine with other features (sender, links, etc.)
4. Use ensemble methods
5. Handle multiple languages
🎓 Learning Outcomes
- Develop end-to-end NLP application
- Preprocess diverse text data
- Implement practical feature extraction
- Apply Naïve Bayes for binary classification
- Evaluate model performance
- Handle real-world spam detection problem
- Understand practical NLP deployment
After studying these practicals, you will understand:
✅ How to preprocess text data
✅ How to extract meaningful features from text
✅ How to train machine learning models for NLP tasks
✅ How word embeddings capture semantic meaning
✅ How to classify text using various algorithms
✅ How to cluster similar documents
✅ How to build deep learning models (LSTM) for NLP
✅ How to implement real-world NLP applications
1990s–2000s: Statistical methods (n-grams, HMMs)
↓
2010s: Machine learning (SVM, Naïve Bayes)
↓
2013: Word embeddings (Word2Vec)
↓
2015: Deep learning (RNN, LSTM)
↓
2017: Transformer architecture (Attention is All You Need)
↓
2018: BERT and contextual embeddings
↓
2020+: Large Language Models (GPT-3, T5, ELECTRA)
↓
2023+: Multimodal models, RAG, Fine-tuning
- Multimodal Learning: Combining text with images, audio, and video
- Few-Shot Learning: Learning from minimal examples
- Retrieval-Augmented Generation (RAG): Combining retrieval with generation
- Domain Adaptation: Transferring knowledge between domains
- Ethical NLP: Fair, transparent, and responsible AI
- Low-Resource Languages: Improving NLP for under-resourced languages
- Efficient Models: Smaller, faster models for edge devices
- Stanford CS224N: NLP with Deep Learning
- Andrew Ng's Deep Learning Specialization
- Hugging Face NLP Course
- "Speech and Language Processing" by Jurafsky & Martin
- "Natural Language Processing with Python" (NLTK Book)
- "Deep Learning for NLP" by Yoav Goldberg
- "Attention is All You Need" (Transformer)
- "BERT: Pre-training of Deep Bidirectional Transformers"
- "Sequence to Sequence Learning with Neural Networks"
For questions or clarifications regarding this summary or the practical implementations:
📧 Email: prexitjoshi@gmail.com
🎓 Institution: University Institute of Engineering and Technology, Punjab University (UIET, PU)
👤 Author: PREXIT JOSHI (Roll No. UE233118)
🏫 Department: Computer Science and Engineering (CSE)
Natural-Language-Processing/
├── 01_Comprehensive_NLP_Pipeline_Linguistic_Analysis.ipynb
├── 02_N_Gram_Analysis_Tokenization_Probability.ipynb
├── 03_Feature_Extraction_TF_TF-IDF.ipynb
├── 04_Word_Embeddings_Word2Vec_GloVe_FastText_BERT.ipynb
├── 05_Text_Classification_Naive_Bayes_SVM.ipynb
├── 06_K-Means_Text_Clustering.ipynb
├── 07_POS_Tagging_Part_of_Speech.ipynb
├── 08_Text_Processing_LSTM_Sentiment_Classification.ipynb
├── 09_Advanced_LSTM_Sentiment_Classification.ipynb
├── 10_Spam_Detection_Naive_Bayes_Application.ipynb
├── README.md # This file
├── GETTING_STARTED.md # Setup and quick start guide
├── CONTRIBUTING.md # Contribution guidelines
├── CHANGELOG.md # Version history
├── LICENSE # MIT License
├── requirements.txt # Python dependencies
└── .gitignore # Git ignore rules
- NLTK - Natural Language Toolkit
- spaCy - Industrial-strength NLP
- Gensim - Word embeddings (Word2Vec, FastText)
- Transformers - Pre-trained models (BERT, GPT)
- scikit-learn - Classic ML algorithms
- TensorFlow/Keras - Deep learning framework
- PyTorch - Alternative DL framework
- Pandas - Data manipulation
- NumPy - Numerical computing
- Jupyter - Interactive notebooks
| Metric | Value |
|---|---|
| Total Practicals | 10 |
| Total Code Cells | 100+ |
| Documentation Lines | 1400+ |
| Code Examples | 50+ |
| Diagrams/Visualizations | 30+ |
| Topics Covered | 50+ |
| Estimated Learning Time | 30-40 hours |
We welcome contributions! See CONTRIBUTING.md for:
- How to report bugs
- How to suggest features
- Pull request process
- Coding standards
- Commit message guidelines
# 1. Fork the repository
# 2. Create feature branch
git checkout -b feature/amazing-addition
# 3. Make changes and commit
git commit -m "feat: add amazing NLP feature"
# 4. Push and create PR
git push origin feature/amazing-addition
- Issues: GitHub Issues
- Email: prexitjoshi@gmail.com
- Discussions: GitHub Discussions
This project is licensed under the MIT License - see LICENSE file for details.
MIT License - Free for personal, educational, and commercial use
with attribution required.
If you use this project in your research or work, please cite:
@misc{joshi2025nlp,
title={Natural Language Processing: Comprehensive Practicals},
author={Joshi, Prexit},
year={2025},
url={https://github.com/intronep666/Natural-Language-Processing}
}
Current Version: 1.0.0 (November 29, 2025)
See CHANGELOG.md for detailed version history and planned features.
- NLTK & spaCy Teams for exceptional NLP libraries
- Hugging Face for transformer models and community
- TensorFlow & PyTorch communities
- scikit-learn for ML tools
- All Contributors and supporters
- GitHub Repository: https://github.com/intronep666/Natural-Language-Processing
- Author GitHub: https://github.com/intronep666
- Institution: UIET, PU
Natural Language Processing is a rapidly evolving field that combines linguistics, computer science, and machine learning. From simple text preprocessing to advanced transformer-based models, NLP enables machines to understand and generate human language in increasingly sophisticated ways.
The practical implementations in this repository demonstrate fundamental and advanced NLP concepts, providing hands-on experience with real-world applications and techniques. Whether you're interested in sentiment analysis, text classification, machine translation, or information extraction, NLP offers powerful tools and methodologies to solve complex language-based problems.
This repository is designed to:
- ✅ Provide comprehensive, hands-on learning
- ✅ Cover beginner to intermediate concepts
- ✅ Include well-documented, runnable code
- ✅ Foster community contributions
- ✅ Serve as a portfolio project
Happy Learning! 🚀