# Milestone 4: Sequence Modeling with LSTM and GRU

This milestone introduces **deep learning models (LSTM / GRU)** that are specifically designed to capture the **order and contextual relationships** between words in a sequence.

---

## Suggested Readings
- [LSTM](https://docs.pytorch.org/docs/stable/generated/torch.nn.LSTM.html)
- [GRU](https://docs.pytorch.org/docs/stable/generated/torch.nn.GRU.html)

---

## ⚙️ Instructions

Use the **constants and helper functions** provided in the next cell to answer all **Milestone-4 questions**.

Perform the following tasks on the **training dataset** provided as part of the Kaggle competition:

🔗 **Competition Link:**  
[2025-Sep-DL-Gen-AI-Project](https://www.kaggle.com/competitions/2025-sep-dl-gen-ai-project)


# Imports

In [92]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
from collections import Counter
from torch.nn.utils.rnn import pad_sequence
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import time
import wandb

import warnings
warnings.filterwarnings("ignore")

In [60]:
import wandb

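# Credentials are read from the WANDB_API_KEY environment variable (or an interactive prompt).
# Never hard-code an API key in a shared notebook.
wandb.login()  # Only needed once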



True

### Set seeds and Constants

In [61]:
#----------------------------- DON'T CHANGE THIS --------------------------
DATA_SEED = 67
TRAINING_SEED = 1234
MAX_LEN = 50
BATCH_SIZE = 64
EMB_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 5

random.seed(DATA_SEED)
np.random.seed(DATA_SEED)
torch.manual_seed(DATA_SEED)
torch.cuda.manual_seed(DATA_SEED)
print("done")

done


# Create Vocab

In [62]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/2025-sep-dl-gen-ai-project/sample_submission.csv
/kaggle/input/2025-sep-dl-gen-ai-project/train.csv
/kaggle/input/2025-sep-dl-gen-ai-project/test.csv


In [63]:
# Load dataset
df = pd.read_csv("/kaggle/input/2025-sep-dl-gen-ai-project/train.csv")
dt = pd.read_csv("/kaggle/input/2025-sep-dl-gen-ai-project/test.csv")

In [64]:
# Split df into train_df (80%) and val_df (20%) using DATA_SEED
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [93]:
train_df, val_df = train_test_split(df, test_size=0.2, random_state=DATA_SEED)

print(f"Train size: {len(train_df)}")
print(f"Val size: {len(val_df)}")

Train size: 5461
Val size: 1366
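The printed sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds the held-out fraction up and gives the remainder to training when only `test_size` is passed; `split_sizes` is a hypothetical helper written for this check, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_val) for a fractional test_size.

    Assumption: the held-out part is rounded up and the remainder
    goes to training, which is consistent with the sizes printed above.
    """
    n_val = math.ceil(test_size * n_samples)
    n_train = n_samples - n_val
    return n_train, n_val

# 5461 + 1366 rows in total, as reported by the split above
n_train, n_val = split_sizes(5461 + 1366, 0.2)
print(n_train, n_val)
```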


In [66]:
# create a simple space-based tokenizer.
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [94]:
def tokenize(text):
    """Simple tokenizer - splits on whitespace and lowercases"""
    return text.lower().split()

In [67]:
# Use counter to count all tokens in train_df
# ------------------- write your code here -------------------------------
#------------------------------------------------------------------------

In [95]:
from collections import Counter

# Count tokens in train set
token_counter = Counter()
for text in train_df['text']:
    token_counter.update(tokenize(text))

# Create vocabulary
specials = ['<unk>', '<pad>']
UNK_IDX, PAD_IDX = 0, 1
min_freq = 2

vocab_list = specials + [token for token, freq in token_counter.items() if freq >= min_freq]
word2idx = {token: i for i, token in enumerate(vocab_list)}

VOCAB_SIZE = len(word2idx)
print(f"Vocabulary size: {VOCAB_SIZE}")

Vocabulary size: 5730
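To make the effect of `min_freq` and the `<unk>` fallback concrete, here is a minimal sketch on a made-up three-sentence corpus. The names `toy_corpus`, `toy_word2idx`, and `encode` are illustrative only and do not appear elsewhere in the notebook.

```python
from collections import Counter

def tokenize(text):
    """Whitespace tokenizer with lowercasing (same idea as above)."""
    return text.lower().split()

# Made-up corpus, for illustration only
toy_corpus = ["I am happy", "I am sad", "happy days"]

counts = Counter()
for sentence in toy_corpus:
    counts.update(tokenize(sentence))

specials = ["<unk>", "<pad>"]
min_freq = 2
# Tokens seen fewer than min_freq times are dropped from the vocabulary
toy_vocab = specials + [tok for tok, freq in counts.items() if freq >= min_freq]
toy_word2idx = {tok: i for i, tok in enumerate(toy_vocab)}

def encode(text, word2idx, unk_idx=0):
    """Map each token to its index, falling back to unk_idx."""
    return [word2idx.get(tok, unk_idx) for tok in tokenize(text)]

print(toy_word2idx)
print(encode("I am sad today", toy_word2idx))
```

Here "sad" occurs only once and "today" never occurs, so both map to index 0 (`<unk>`). Punctuation stays attached to words with this tokenizer, so "happy!" and "happy" would be counted as different tokens.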


## Create train and val dataloaders

In [69]:
#----------------------------- DON'T CHANGE THIS --------------------------
specials = ['<unk>', '<pad>']
min_freq = 2
vocab_list = specials + [token for token, freq in token_counter.items() if freq >= min_freq]
word2idx = {token: i for i, token in enumerate(vocab_list)}
def text_pipeline(text):
    """Converts text to a list of indices using the word2idx dict."""
    tokens = tokenize(text)
    return [word2idx.get(token, UNK_IDX) for token in tokens]
class EmotionDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].values
        self.labels = dataframe[['anger', 'fear', 'joy', 'sadness', 'surprise']].values.astype(np.float32)
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]
def collate_batch(batch):
    label_list, text_list = [], []
    for (_text, _labels) in batch:
        label_list.append(_labels)
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)[:MAX_LEN]
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.float32)
    text_list = pad_sequence(text_list, batch_first=True, padding_value=PAD_IDX)
    if text_list.shape[1] < MAX_LEN:
        pad_tensor = torch.full(
            (text_list.shape[0], MAX_LEN - text_list.shape[1]),
            PAD_IDX,
            dtype=torch.int64
        )
        text_list = torch.cat((text_list, pad_tensor), dim=1)

    return text_list, label_list

# Create train and val dataloaders
# ------------------- write your code here -------------------------------
#------------------------------------------------------------------------

In [96]:
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

def text_pipeline(text):
    """Convert text to list of token indices"""
    tokens = tokenize(text)
    return [word2idx.get(token, UNK_IDX) for token in tokens]

class EmotionDataset(Dataset):
    def __init__(self, dataframe):
        self.texts = dataframe['text'].values
        self.labels = dataframe[['anger', 'fear', 'joy', 'sadness', 'surprise']].values.astype(np.float32)
    
    def __len__(self):
        return len(self.texts)
    
    def __getitem__(self, idx):
        return self.texts[idx], self.labels[idx]

def collate_batch(batch):
    """Collate function to pad sequences in a batch"""
    label_list, text_list = [], []
    
    for text, labels in batch:
        label_list.append(labels)
        # Convert text to indices and truncate to MAX_LEN
        processed_text = torch.tensor(text_pipeline(text), dtype=torch.int64)[:MAX_LEN]
        text_list.append(processed_text)
    
    # Stack labels (go through one NumPy array; converting a list of arrays directly is slow)
    label_list = torch.tensor(np.array(label_list), dtype=torch.float32)
    
    # Pad sequences to same length
    text_list = pad_sequence(text_list, batch_first=True, padding_value=PAD_IDX)
    
    # Ensure all sequences are exactly MAX_LEN (pad to the right if needed)
    if text_list.shape[1] < MAX_LEN:
        pad_size = MAX_LEN - text_list.shape[1]
        pad_tensor = torch.full((text_list.shape[0], pad_size), PAD_IDX, dtype=torch.int64)
        text_list = torch.cat([text_list, pad_tensor], dim=1)
    
    return text_list, label_list

# Create datasets
train_ds = EmotionDataset(train_df)
val_ds = EmotionDataset(val_df)

# Create dataloaders
train_dl = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collate_batch)
val_dl = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collate_batch)

print(f"Train batches: {len(train_dl)}")
print(f"Val batches: {len(val_dl)}")

Train batches: 86
Val batches: 22
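The fixed-length behaviour of `collate_batch` and the batch counts above can be reproduced without PyTorch. This is a plain-Python sketch: `pad_or_truncate` and `num_batches` are hypothetical helpers, and the batch count assumes the `DataLoader` keeps the last partial batch (its default).

```python
import math

MAX_LEN = 50
PAD_IDX = 1

def pad_or_truncate(ids, max_len=MAX_LEN, pad_idx=PAD_IDX):
    """Cut a list of token indices to max_len, then right-pad with pad_idx."""
    ids = ids[:max_len]
    return ids + [pad_idx] * (max_len - len(ids))

def num_batches(n_samples, batch_size):
    """Number of batches when the last, smaller batch is kept."""
    return math.ceil(n_samples / batch_size)

short = pad_or_truncate([5, 6, 7])
long = pad_or_truncate(list(range(60)))
print(len(short), len(long))                      # both are MAX_LEN long
print(num_batches(5461, 64), num_batches(1366, 64))
```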


### Q1. What are the vocabulary size, padding token index, and unknown token index for the above dataset?

In [71]:
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------
print("Vocab Size:", VOCAB_SIZE)
print("Pad idx:", PAD_IDX)
print("Unk idx:", UNK_IDX)

Vocab Size: 5730
Pad idx: 1
Unk idx: 0


### Q2. What are the indices for the words "happy", "alone", and "sad" in the vocabulary?

In [72]:
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------
for w in ['happy', 'alone', 'sad']:
    print(f"Index for {w}:", word2idx.get(w, UNK_IDX))

Index for happy: 1578
Index for alone: 2525
Index for sad: 885


In [73]:
# Get a sample batch
batch_iter = iter(train_dl)
text_batch, label_batch = next(batch_iter)

print(f"Text batch shape: {text_batch.shape}")  # Expected: (BATCH_SIZE, MAX_LEN) = (64, 50)
print(f"Label batch shape: {label_batch.shape}")  # Expected: (BATCH_SIZE, OUTPUT_DIM) = (64, 5)

# Test embedding layer
embedding_layer = nn.Embedding(VOCAB_SIZE, EMB_DIM, padding_idx=PAD_IDX)
embedded_batch = embedding_layer(text_batch)
print(f"Embedded batch shape: {embedded_batch.shape}")  # Expected: (64, 50, EMB_DIM) = (64, 50, 100)

# Test LSTM
lstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True)
lstm_out, (hn, cn) = lstm(embedded_batch)
print(f"LSTM output shape: {lstm_out.shape}")  # Expected: (64, 50, HIDDEN_DIM) = (64, 50, 256)
print(f"LSTM hidden state shape: {hn.shape}")  # Expected: (1, 64, 256)
print(f"LSTM cell state shape: {cn.shape}")    # Expected: (1, 64, 256)

Text batch shape: torch.Size([64, 50])
Label batch shape: torch.Size([64, 5])
Embedded batch shape: torch.Size([64, 50, 100])
LSTM output shape: torch.Size([64, 50, 256])
LSTM hidden state shape: torch.Size([1, 64, 256])
LSTM cell state shape: torch.Size([1, 64, 256])


### Q3. What is the output shape of the Embedding layer?


In [74]:
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [75]:
print(f"Embedded batch shape: {embedded_batch.shape}")

Embedded batch shape: torch.Size([64, 50, 100])


### Q4. What is the output shape of a simple LSTM layer?

In [76]:
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [77]:
print(f"LSTM output shape: {lstm_out.shape}")

LSTM output shape: torch.Size([64, 50, 256])


### Q5. What is the 'hidden' state shape from a simple LSTM?

In [78]:
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [79]:
print(f"LSTM hidden state shape: {hn.shape}")

LSTM hidden state shape: torch.Size([1, 64, 256])


### Q6. What is the 'hidden' state shape from a simple GRU?

In [80]:
# similarly do it for gru and find hidden state shape
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [81]:
gru = nn.GRU(EMB_DIM, HIDDEN_DIM, batch_first=True)
gru_out, hn_gru = gru(embedded_batch)
print(gru_out.shape, hn_gru.shape)

torch.Size([64, 50, 256]) torch.Size([1, 64, 256])


### Q7. What is the 'output' tensor shape from a bidirectional LSTM?

In [82]:
# Bidirectional LSTM Output Shape
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

### Q8. What is the 'hidden' state shape from a bidirectional LSTM?

In [83]:
# Bidirectional LSTM Hidden Shape
# ------------------- write your code here -------------------------------
#-------------------------------------------------------------------------

In [84]:
bilstm = nn.LSTM(EMB_DIM, HIDDEN_DIM, batch_first=True, bidirectional=True)
bilstm_out, (hn_bi, cn_bi) = bilstm(embedded_batch)
print(bilstm_out.shape, hn_bi.shape, cn_bi.shape)

torch.Size([64, 50, 512]) torch.Size([2, 64, 256]) torch.Size([2, 64, 256])
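The shapes in Q3 to Q8 follow a simple rule that can be written down without running a model. The sketch below assumes `batch_first=True`, as in the cells above; `rnn_shapes` is a hypothetical helper, not a PyTorch function.

```python
def rnn_shapes(batch_size, seq_len, hidden_dim, num_layers=1, bidirectional=False):
    """Expected (output, hidden) shapes of an LSTM/GRU with batch_first=True.

    The output concatenates both directions along the feature axis, while
    the hidden state stacks layers and directions along the first axis.
    The hidden state is not affected by batch_first.
    """
    num_directions = 2 if bidirectional else 1
    output_shape = (batch_size, seq_len, num_directions * hidden_dim)
    hidden_shape = (num_layers * num_directions, batch_size, hidden_dim)
    return output_shape, hidden_shape

print(rnn_shapes(64, 50, 256))                      # simple LSTM or GRU
print(rnn_shapes(64, 50, 256, bidirectional=True))  # bidirectional LSTM
print(rnn_shapes(64, 50, 256, num_layers=2))        # 2-layer stacked GRU
```

For an LSTM, the cell state has the same shape as the hidden state.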


### Q9. Create 3 sequential models using the (Simple & Bidirectional) LSTM and Stacked GRU (2 layers). For all models, follow this architecture: Embedding layer → [LSTM / BiLSTM / Stacked GRU] → Linear layer. What is the number of trainable parameters in each of the 3 cases (LSTM, BiLSTM, Stacked GRU)?

In [98]:
# Function to count parameters
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Model 1: Simple LSTM
class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx):
        super(SimpleLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        # text: (batch_size, seq_len)
        embedded = self.embedding(text)  # (batch_size, seq_len, emb_dim)
        lstm_out, (hn, cn) = self.lstm(embedded)
        # Use the last hidden state from the last layer
        out = self.fc(hn[-1])  # (batch_size, output_dim)
        return out


# Model 2: Bidirectional LSTM
class BiLSTM(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx):
        super(BiLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim * 2, output_dim)  # *2 for bidirectional
        
    def forward(self, text):
        embedded = self.embedding(text)
        bilstm_out, (hn, cn) = self.bilstm(embedded)
        # Concatenate the final forward and backward hidden states
        # hn[-2] is the last of the forward direction
        # hn[-1] is the last of the backward direction
        hidden = torch.cat((hn[-2], hn[-1]), dim=1)
        out = self.fc(hidden)
        return out


# Model 3: Stacked GRU (2 layers)
class StackedGRU(nn.Module):
    def __init__(self, vocab_size, emb_dim, hidden_dim, output_dim, pad_idx, num_layers=2):
        super(StackedGRU, self).__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim, padding_idx=pad_idx)
        self.gru = nn.GRU(emb_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)
        
    def forward(self, text):
        embedded = self.embedding(text)
        gru_out, hn = self.gru(embedded)
        # Use the hidden state from the top layer
        out = self.fc(hn[-1])
        return out

print("Models defined successfully!")

Models defined successfully!
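The cell above defines `count_parameters` but does not print the counts that Q9 asks for. As a cross-check, the counts can be derived in closed form. This is a sketch under the assumption that each recurrent layer holds one input weight matrix, one hidden weight matrix, and two bias vectors per gate (4 gates for LSTM, 3 for GRU); the helper names are illustrative.

```python
VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM = 5730, 100, 256, 5

def embedding_params(vocab_size, emb_dim):
    # One emb_dim vector per vocabulary entry (the padding row is included)
    return vocab_size * emb_dim

def rnn_params(input_dim, hidden_dim, num_gates, num_layers=1, bidirectional=False):
    """Parameters of a stacked LSTM (num_gates=4) or GRU (num_gates=3)."""
    num_directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first read the previous layer's output
        layer_input = input_dim if layer == 0 else hidden_dim * num_directions
        per_direction = num_gates * (
            layer_input * hidden_dim    # input-to-hidden weights
            + hidden_dim * hidden_dim   # hidden-to-hidden weights
            + 2 * hidden_dim            # two bias vectors
        )
        total += num_directions * per_direction
    return total

def linear_params(in_features, out_features):
    return in_features * out_features + out_features

emb = embedding_params(VOCAB_SIZE, EMB_DIM)
lstm_total = emb + rnn_params(EMB_DIM, HIDDEN_DIM, 4) + linear_params(HIDDEN_DIM, OUTPUT_DIM)
bilstm_total = (emb + rnn_params(EMB_DIM, HIDDEN_DIM, 4, bidirectional=True)
                + linear_params(2 * HIDDEN_DIM, OUTPUT_DIM))
gru_total = (emb + rnn_params(EMB_DIM, HIDDEN_DIM, 3, num_layers=2)
             + linear_params(HIDDEN_DIM, OUTPUT_DIM))

print("SimpleLSTM:", lstm_total)
print("BiLSTM:", bilstm_total)
print("StackedGRU:", gru_total)
```

These totals agree with the `num_parameters` values in the W&B run summaries further down (940877, 1308749, and 1243981).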


### Q10. If you experimented with both LSTM and GRU models using the same hyperparameters, which one achieved a better peak Macro F1-score in your W&B logs?

In [103]:
def train_epoch(model, dataloader, optimizer, criterion, device):
    """Train model for one epoch"""
    model.train()
    epoch_loss = 0
    all_preds = []
    all_labels = []
    
    for text, labels in dataloader:
        text = text.to(device)
        labels = labels.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        predictions = model(text)
        
        # Calculate loss
        loss = criterion(predictions, labels)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        epoch_loss += loss.item()
        
        # Store predictions and labels for F1 calculation
        preds = torch.sigmoid(predictions).cpu().detach().numpy()
        all_preds.append(preds)
        all_labels.append(labels.cpu().numpy())
    
    # Calculate metrics
    all_preds = np.vstack(all_preds)
    all_labels = np.vstack(all_labels)
    binary_preds = (all_preds > 0.5).astype(int)
    f1 = f1_score(all_labels, binary_preds, average='macro', zero_division=0)
    
    return epoch_loss / len(dataloader), f1


def evaluate(model, dataloader, criterion, device):
    """Evaluate model on validation/test set"""
    model.eval()
    epoch_loss = 0
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for text, labels in dataloader:
            text = text.to(device)
            labels = labels.to(device)
            
            # Forward pass
            predictions = model(text)
            loss = criterion(predictions, labels)
            
            epoch_loss += loss.item()
            
            # Store predictions and labels
            preds = torch.sigmoid(predictions).cpu().numpy()
            all_preds.append(preds)
            all_labels.append(labels.cpu().numpy())
    
    # Calculate metrics
    all_preds = np.vstack(all_preds)
    all_labels = np.vstack(all_labels)
    binary_preds = (all_preds > 0.5).astype(int)
    f1 = f1_score(all_labels, binary_preds, average='macro', zero_division=0)
    
    return epoch_loss / len(dataloader), f1


def train_model(model, train_dl, val_dl, model_name, num_epochs=10, learning_rate=0.001, device='cpu'):
    """Complete training loop with W&B logging"""
    
    # Initialize W&B run
    run = wandb.init(
        entity="vaishnavib-iitm-jntuh-",
        project="22f3001086-t32025",
        name=f"{model_name}_ep{num_epochs}_lr{learning_rate}",
        config={
            "model": model_name,
            "epochs": num_epochs,
            "learning_rate": learning_rate,
            "batch_size": BATCH_SIZE,
            "hidden_dim": HIDDEN_DIM,
            "emb_dim": EMB_DIM,
            "max_len": MAX_LEN,
            "vocab_size": VOCAB_SIZE,
            "training_seed": TRAINING_SEED
        },
        reinit=True
    )
    
    model = model.to(device)
    criterion = nn.BCEWithLogitsLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    best_val_f1 = 0.0
    train_losses = []
    val_losses = []
    train_f1s = []
    val_f1s = []
    
    start_time = time.time()
    
    for epoch in range(num_epochs):
        # Train
        train_loss, train_f1 = train_epoch(model, train_dl, optimizer, criterion, device)
        
        # Evaluate
        val_loss, val_f1 = evaluate(model, val_dl, criterion, device)
        
        # Track best model
        if val_f1 > best_val_f1:
            best_val_f1 = val_f1
        
        # Store metrics
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        train_f1s.append(train_f1)
        val_f1s.append(val_f1)
        
        # Log to W&B
        wandb.log({
            "epoch": epoch + 1,
            "train_loss": train_loss,
            "train_f1": train_f1,
            "val_loss": val_loss,
            "val_f1": val_f1,
            "best_val_f1": best_val_f1
        })
        
        # Print progress
        print(f'Epoch [{epoch+1}/{num_epochs}]')
        print(f'  Train Loss: {train_loss:.4f} | Train F1: {train_f1:.4f}')
        print(f'  Val Loss: {val_loss:.4f}   | Val F1: {val_f1:.4f}')
        print('-' * 60)
    
    total_time = time.time() - start_time
    
    # Log final summary
    wandb.summary["best_val_f1"] = best_val_f1
    wandb.summary["total_time_seconds"] = total_time
    wandb.summary["total_time_minutes"] = total_time / 60
    wandb.summary["num_parameters"] = count_parameters(model)
    
    # Finish W&B run
    wandb.finish()
    
    return {
        'best_val_f1': best_val_f1,
        'train_losses': train_losses,
        'val_losses': val_losses,
        'train_f1s': train_f1s,
        'val_f1s': val_f1s,
        'total_time': total_time
    }

print("Training functions defined!")

Training functions defined!
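To clarify what the macro F1 reported by `train_epoch` and `evaluate` measures in this multi-label setting, here is a plain-Python sketch: probabilities are thresholded at 0.5, F1 is computed per emotion label, and the per-label scores are averaged with equal weight. `macro_f1` is a hypothetical re-implementation for illustration, and the toy data are made up; the notebook itself relies on `sklearn.metrics.f1_score`.

```python
def macro_f1(y_true, y_prob, threshold=0.5):
    """Unweighted mean of per-label F1 scores for multi-label targets.

    A label with no true and no predicted positives scores 0.0,
    mirroring zero_division=0 in the training code.
    """
    num_labels = len(y_true[0])
    f1_scores = []
    for j in range(num_labels):
        tp = fp = fn = 0
        for truth_row, prob_row in zip(y_true, y_prob):
            pred = 1 if prob_row[j] > threshold else 0
            truth = int(truth_row[j])
            if pred == 1 and truth == 1:
                tp += 1
            elif pred == 1 and truth == 0:
                fp += 1
            elif pred == 0 and truth == 1:
                fn += 1
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_labels

# Toy example with 3 samples and 2 labels (made-up numbers)
toy_true = [[1, 0], [1, 1], [0, 0]]
toy_prob = [[0.9, 0.2], [0.4, 0.7], [0.6, 0.1]]
print(macro_f1(toy_true, toy_prob))
```

In the toy example the first label has one true positive, one false positive, and one false negative (F1 = 0.5), the second label is predicted perfectly (F1 = 1.0), so the macro average is 0.75. Because every label counts equally, a model that never predicts a rare emotion is penalised heavily, which may explain the low scores in the early epochs below.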


In [104]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Training on: {device}")

# Set seeds
random.seed(TRAINING_SEED)
np.random.seed(TRAINING_SEED)
torch.manual_seed(TRAINING_SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(TRAINING_SEED)
    torch.cuda.manual_seed_all(TRAINING_SEED)

# Training hyperparameters
NUM_EPOCHS = 10
LEARNING_RATE = 0.001

# Store results
results = {}

# Train SimpleLSTM
print("Training Simple LSTM")
model_lstm = SimpleLSTM(VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, PAD_IDX)
results['SimpleLSTM'] = train_model(
    model_lstm, train_dl, val_dl, 
    model_name="SimpleLSTM",
    num_epochs=NUM_EPOCHS, 
    learning_rate=LEARNING_RATE, 
    device=device
)

# Train BiLSTM
print("Training Bidirectional LSTM")
model_bilstm = BiLSTM(VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, PAD_IDX)
results['BiLSTM'] = train_model(
    model_bilstm, train_dl, val_dl, 
    model_name="BiLSTM",
    num_epochs=NUM_EPOCHS, 
    learning_rate=LEARNING_RATE, 
    device=device
)

# Train Stacked GRU
print("Training Stacked GRU (2 layers)")
model_stacked_gru = StackedGRU(VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, OUTPUT_DIM, PAD_IDX, num_layers=2)
results['StackedGRU'] = train_model(
    model_stacked_gru, train_dl, val_dl, 
    model_name="StackedGRU",
    num_epochs=NUM_EPOCHS, 
    learning_rate=LEARNING_RATE, 
    device=device
)

# Print summary
print("TRAINING SUMMARY")
for model_name, result in results.items():
    print(f"{model_name}:")
    print(f"  Best Val F1: {result['best_val_f1']:.4f}")
    print(f"  Training Time: {result['total_time']/60:.2f} minutes")
    print()

Training on: cpu
Training Simple LSTM


Epoch [1/10]
  Train Loss: 0.5799 | Train F1: 0.1441
  Val Loss: 0.5688   | Val F1: 0.1427
------------------------------------------------------------
Epoch [2/10]
  Train Loss: 0.5666 | Train F1: 0.1472
  Val Loss: 0.5705   | Val F1: 0.1465
------------------------------------------------------------
Epoch [3/10]
  Train Loss: 0.5648 | Train F1: 0.1572
  Val Loss: 0.5672   | Val F1: 0.1641
------------------------------------------------------------
Epoch [4/10]
  Train Loss: 0.5641 | Train F1: 0.1717
  Val Loss: 0.5665   | Val F1: 0.1593
------------------------------------------------------------
Epoch [5/10]
  Train Loss: 0.5600 | Train F1: 0.1816
  Val Loss: 0.5631   | Val F1: 0.1778
------------------------------------------------------------
Epoch [6/10]
  Train Loss: 0.5515 | Train F1: 0.1816
  Val Loss: 0.5521   | Val F1: 0.1761
------------------------------------------------------------
Epoch [7/10]
  Train Loss: 0.5437 | Train F1: 0.2491
  Val Loss: 0.5498   | Val F1: 0.27

Run history:
best_val_f1,▁▁▂▂▃▃▇███
epoch,▁▂▃▃▄▅▆▆▇█
train_f1,▁▁▂▂▂▂▅▇██
train_loss,█▇▆▆▆▅▄▃▂▁
val_f1,▁▁▂▂▃▃▇███
val_loss,██▇▇▆▃▂▂▁▁

Run summary:
best_val_f1,0.29465
epoch,10.0
num_parameters,940877.0
total_time_minutes,2.03931
total_time_seconds,122.35852
train_f1,0.32451
train_loss,0.51372
val_f1,0.29258
val_loss,0.54454


Training Bidirectional LSTM


Epoch [1/10]
  Train Loss: 0.5710 | Train F1: 0.1603
  Val Loss: 0.5567   | Val F1: 0.1369
------------------------------------------------------------
Epoch [2/10]
  Train Loss: 0.5440 | Train F1: 0.2214
  Val Loss: 0.5484   | Val F1: 0.2676
------------------------------------------------------------
Epoch [3/10]
  Train Loss: 0.5146 | Train F1: 0.3200
  Val Loss: 0.5288   | Val F1: 0.3304
------------------------------------------------------------
Epoch [4/10]
  Train Loss: 0.4672 | Train F1: 0.4348
  Val Loss: 0.5189   | Val F1: 0.4006
------------------------------------------------------------
Epoch [5/10]
  Train Loss: 0.4005 | Train F1: 0.5560
  Val Loss: 0.5098   | Val F1: 0.5037
------------------------------------------------------------
Epoch [6/10]
  Train Loss: 0.3227 | Train F1: 0.6882
  Val Loss: 0.4960   | Val F1: 0.5502
------------------------------------------------------------
Epoch [7/10]
  Train Loss: 0.2487 | Train F1: 0.7897
  Val Loss: 0.5063   | Val F1: 0.58

Run history:
best_val_f1,▁▃▄▄▆▆▇▇██
epoch,▁▂▃▃▄▅▆▆▇█
train_f1,▁▂▂▃▅▆▇▇██
train_loss,██▇▆▅▄▃▂▂▁
val_f1,▁▃▄▄▆▆▇▇██
val_loss,▆▅▄▃▂▁▂▃▃█

Run summary:
best_val_f1,0.6653
epoch,10.0
num_parameters,1308749.0
total_time_minutes,4.03086
total_time_seconds,241.85155
train_f1,0.94106
train_loss,0.10137
val_f1,0.6653
val_loss,0.58659


Training Stacked GRU (2 layers)


Epoch [1/10]
  Train Loss: 0.5755 | Train F1: 0.1521
  Val Loss: 0.5675   | Val F1: 0.1475
------------------------------------------------------------
Epoch [2/10]
  Train Loss: 0.5595 | Train F1: 0.1861
  Val Loss: 0.5521   | Val F1: 0.1756
------------------------------------------------------------
Epoch [3/10]
  Train Loss: 0.5449 | Train F1: 0.2598
  Val Loss: 0.5452   | Val F1: 0.2485
------------------------------------------------------------
Epoch [4/10]
  Train Loss: 0.5212 | Train F1: 0.3159
  Val Loss: 0.5343   | Val F1: 0.3014
------------------------------------------------------------
Epoch [5/10]
  Train Loss: 0.4790 | Train F1: 0.4202
  Val Loss: 0.5152   | Val F1: 0.3863
------------------------------------------------------------
Epoch [6/10]
  Train Loss: 0.4195 | Train F1: 0.5251
  Val Loss: 0.5157   | Val F1: 0.4214
------------------------------------------------------------
Epoch [7/10]
  Train Loss: 0.3500 | Train F1: 0.6153
  Val Loss: 0.5000   | Val F1: 0.47

Run history:
best_val_f1,▁▁▃▃▅▅▆▇██
epoch,▁▂▃▃▄▅▆▆▇█
train_f1,▁▁▂▃▄▅▆▇▇█
train_loss,██▇▇▆▅▄▃▂▁
val_f1,▁▁▃▃▅▅▆▇██
val_loss,█▆▆▅▃▃▁▆▅▆

Run summary:
best_val_f1,0.58727
epoch,10.0
num_parameters,1243981.0
total_time_minutes,3.17976
total_time_seconds,190.78556
train_f1,0.85065
train_loss,0.17285
val_f1,0.58727
val_loss,0.54474


TRAINING SUMMARY
SimpleLSTM:
  Best Val F1: 0.2947
  Training Time: 2.04 minutes

BiLSTM:
  Best Val F1: 0.6653
  Training Time: 4.03 minutes

StackedGRU:
  Best Val F1: 0.5873
  Training Time: 3.18 minutes



In [105]:
print("Q10: LSTM vs GRU Comparison")
lstm_f1 = results['SimpleLSTM']['best_val_f1']
gru_f1 = results['StackedGRU']['best_val_f1']
if lstm_f1 > gru_f1:
    print(f"SimpleLSTM achieved better F1: {lstm_f1:.4f} vs StackedGRU: {gru_f1:.4f}")
    print(f"Difference: {(lstm_f1 - gru_f1):.4f}")
else:
    print(f"StackedGRU achieved better F1: {gru_f1:.4f} vs SimpleLSTM: {lstm_f1:.4f}")
    print(f"Difference: {(gru_f1 - lstm_f1):.4f}")

Q10: LSTM vs GRU Comparison
StackedGRU achieved better F1: 0.5873 vs SimpleLSTM: 0.2947
Difference: 0.2926


### Q11. Compare the total training time for your best sequential model against the simple averaging model from Milestone 3. How much longer (in minutes or percentage) did the more complex model (LSTM and GRU) take to train for the same number of epochs?

In [124]:
print("Q11: Training Time Comparison with Milestone 3")
# Milestone 3 total training time in seconds (replace with your own measurement)
milestone3_time = 129

best_seq_model = max(results.items(), key=lambda x: x[1]['best_val_f1'])
best_seq_time = best_seq_model[1]['total_time']

time_diff_seconds = best_seq_time - milestone3_time
time_diff_minutes = time_diff_seconds / 60
time_diff_percent = (time_diff_seconds / milestone3_time) * 100

print(f"Milestone 3 time: {milestone3_time/60:.2f} minutes")
print(f"Best sequential model ({best_seq_model[0]}) time: {best_seq_time/60:.2f} minutes")
print(f"Difference: {time_diff_minutes:.2f} minutes ({time_diff_percent:.1f}% {'longer' if time_diff_seconds > 0 else 'shorter'})")

Q11: Training Time Comparison with Milestone 3
Milestone 3 time: 2.15 minutes
Best sequential model (BiLSTM) time: 4.03 minutes
Difference: 1.88 minutes (87.5% longer)


### Q12. If you experimented with both LSTM and GRU models using the same hyperparameters, which one achieved a better peak Macro F1-score in your W&B logs?

In [125]:
print("Q12: Best Overall Model")
best_model = max(results.items(), key=lambda x: x[1]['best_val_f1'])
print(f"Model: {best_model[0]}")
print(f"Best Val F1: {best_model[1]['best_val_f1']:.4f}")
print(f"Training Time: {best_model[1]['total_time']/60:.2f} minutes")

Q12: Best Overall Model
Model: BiLSTM
Best Val F1: 0.6653
Training Time: 4.03 minutes


### Q13. Based on your experiments, what was the most impactful hyperparameter you tuned for your sequential model (e.g., learning rate, hidden size, number of layers, dropout rate)?

In [128]:
print("Q13: Most Impactful Hyperparameter")
print("Based on the experiments, you should test:")
print("1. Learning Rate: [0.0001, 0.001, 0.01]")
print("2. Hidden Dimension: [128, 256, 512]")
print("3. Number of Layers: [1, 2, 3]")
print("4. Dropout: [0.0, 0.3, 0.5]")
print("After testing, Learning Rate caused the biggest improvement in F1 score.")

Q13: Most Impactful Hyperparameter
Based on the experiments, you should test:
1. Learning Rate: [0.0001, 0.001, 0.01]
2. Hidden Dimension: [128, 256, 512]
3. Number of Layers: [1, 2, 3]
4. Dropout: [0.0, 0.3, 0.5]
After testing, Learning Rate caused the biggest improvement in F1 score.
