# Sybil Embedding Extraction Pipeline

This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It is designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.

## Features

- ✅ **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- ✅ **Multi-GPU Support**: Processes scans in parallel across multiple GPUs
- ✅ **Smart Filtering**: Automatically filters out localizer/scout scans
- ✅ **PID-Based Extraction**: Extracts embeddings for specific patient cohorts
- ✅ **Checkpoint System**: Saves progress every N scans to prevent data loss
- ✅ **Timepoint Detection**: Automatically detects T0, T1, T2, ... from scan dates
- ✅ **Directory Caching**: Caches directory scans for 100x faster reruns

## Quick Start

### Installation

```bash
# Install required packages (pyarrow provides the Parquet engine pandas uses)
pip install huggingface_hub torch numpy pandas pydicom pyarrow
```

### Basic Usage

```bash
# Extract embeddings from all scans
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --output-dir embeddings_output
```

### Extract Specific Patient Cohort

```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --pid-csv subsets/train_pids.csv \
    --output-dir embeddings_train
```
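
The PID CSV only needs a `pid` column (see Command Line Arguments below). A minimal sketch for producing one with pandas, assuming your cohort's patient IDs are already in a list (the IDs here are placeholders):

```python
from pathlib import Path
import pandas as pd

# Hypothetical patient IDs for illustration; replace with your cohort.
train_pids = [100012, 100045, 100078]

# extract-embeddings.py expects a CSV with a "pid" column.
Path("subsets").mkdir(exist_ok=True)
pd.DataFrame({"pid": train_pids}).to_csv("subsets/train_pids.csv", index=False)
```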

## Command Line Arguments

### Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)

### Optional - Data Selection
- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)

### Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Number of scans processed simultaneously (default: 1; 1-4 recommended)
- `--num-workers`: Parallel workers for directory scanning (default: 4; 4-12 recommended)
- `--checkpoint-interval`: Save a checkpoint every N scans (default: 1000)

## Expected Directory Structure

Your DICOM data should follow this structure:
```
/path/to/NLST/
└── NLST/
    ├── <PID_1>/
    │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
    │   │   ├── <series_id>/
    │   │   │   ├── *.dcm
    │   │   │   └── ...
    │   │   └── ...
    │   └── ...
    ├── <PID_2>/
    └── ...
```
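
Before a long run, it can help to confirm your data matches this layout. A minimal sketch, assuming exactly the nesting shown above (adjust the depths if your tree differs):

```python
from pathlib import Path

root = Path("/path/to/NLST/NLST")

# Walk PID -> scan -> series and count DICOM slices per series.
for pid_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for scan_dir in (p for p in pid_dir.iterdir() if p.is_dir()):
        for series_dir in (p for p in scan_dir.iterdir() if p.is_dir()):
            n_slices = len(list(series_dir.glob("*.dcm")))
            print(f"{pid_dir.name}/{scan_dir.name}/{series_dir.name}: {n_slices} slices")
```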

## Output Format

### Embeddings File: `all_embeddings.parquet`

Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2, ... (year-based, e.g., 1999 → T0, 2000 → T1)
- `dicom_directory`: Full path to the scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index into the embedding array
- `embedding`: 512-dimensional embedding array

### Metadata File: `dataset_metadata.json`

Complete metadata, including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
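
The full JSON schema isn't documented here beyond the bullets above, but a quick way to review failures after a run is to read the file directly. A sketch, assuming a top-level `failed_scans` entry (the key named under Troubleshooting below):

```python
import json

with open("embeddings_output/dataset_metadata.json") as f:
    meta = json.load(f)

# Print each failed-scan record whole rather than guessing its field names.
failed = meta.get("failed_scans", [])
print(f"{len(failed)} scans failed")
for record in failed:
    print(record)
```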

## Performance Tips

### For Large Datasets (>10K scans)

```bash
# Use the cached directory list and multi-GPU processing
python extract-embeddings.py \
    --root-dir /data/NLST \
    --num-gpus 4 \
    --num-parallel 4 \
    --num-workers 12 \
    --checkpoint-interval 500
```

**Memory Requirements**: ~10 GB VRAM per parallel scan
- `--num-parallel 1`: safe for 16 GB GPUs
- `--num-parallel 2`: safe for 24 GB GPUs
- `--num-parallel 4`: requires 40 GB+ GPUs

### For Subset Extraction (Train/Test Split)

```bash
# Extract the training set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv train_pids.csv \
    --output-dir embeddings_train \
    --num-workers 12

# Extract the test set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv test_pids.csv \
    --output-dir embeddings_test \
    --num-workers 12
```

**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds, roughly a 100x speedup over an unfiltered scan.

## Loading Embeddings for Training

```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Stack the per-scan embedding arrays into one matrix
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```
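
Downstream code can filter on the metadata columns directly. For example, a minimal sketch that keeps only baseline (T0) scans, reusing `df` from above:

```python
# Select baseline scans via the documented `timepoint` column.
t0 = df[df['timepoint'] == 'T0']
t0_embeddings = np.stack(t0['embedding'].values)  # Shape: (num_T0_scans, 512)
print(t0_embeddings.shape)
```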

## Troubleshooting

### Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`

### Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (~100x speedup)
- Reruns automatically use the cached directory list

### Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999 → T0, 2000 → T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as sketched below
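
A minimal fallback sketch, reusing `df` from "Loading Embeddings for Training". The script's own year-matching pattern isn't shown here, so the regex below is an assumption based on the `MM-DD-YYYY-NLST-LSS-<scan_id>` naming from "Expected Directory Structure" and the 1999 → T0 mapping above:

```python
import re

def timepoint_from_path(dicom_directory):
    # Assumes MM-DD-YYYY-NLST-LSS-<scan_id> directory names; returns None
    # when no year is found, mirroring the script's behavior.
    match = re.search(r'\b\d{2}-\d{2}-(\d{4})-NLST', dicom_directory)
    return None if match is None else f"T{int(match.group(1)) - 1999}"

# Fill only the rows where extraction left `timepoint` empty.
missing = df['timepoint'].isna()
df.loc[missing, 'timepoint'] = df.loc[missing, 'dicom_directory'].map(timepoint_from_path)
```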

### Failed Scans
- Check the `failed_scans` section of `dataset_metadata.json`
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata

## Federated Learning Integration

This script is designed for **privacy-preserving federated learning**:

1. **Each site runs extraction locally** on its own DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with the federated learning system
4. **The central server trains a model** on the embeddings without ever accessing raw data

### Workflow for Sites

```bash
# 1. Download the extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for the train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share the embeddings with the federated learning system
#    (embeddings are much smaller and preserve privacy better than raw DICOM)
```

## Citation

If you use this extraction pipeline, please cite the Sybil model:

```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```

## Support

For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: contact your FL system administrator