# Sybil Embedding Extraction Pipeline

This script extracts 512-dimensional embeddings from chest CT DICOM scans using the Sybil lung cancer risk prediction model. It is designed for **federated learning** deployments where sites need to generate embeddings locally without sharing raw medical images.

## Features

- ✅ **Automatic Model Download**: Downloads the Sybil model from HuggingFace automatically
- ✅ **Multi-GPU Support**: Processes scans in parallel across multiple GPUs
- ✅ **Smart Filtering**: Automatically filters out localizer/scout scans
- ✅ **PID-Based Extraction**: Extracts embeddings for specific patient cohorts
- ✅ **Checkpoint System**: Saves progress every N scans to prevent data loss
- ✅ **Timepoint Detection**: Automatically detects T0, T1, T2, ... from scan dates
- ✅ **Directory Caching**: Caches directory scans for 100x faster reruns

## Quick Start

### Installation

```bash
# Install required packages (pyarrow provides the Parquet engine pandas uses)
pip install huggingface_hub torch numpy pandas pydicom pyarrow
```

### Basic Usage

```bash
# Extract embeddings from all scans
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --output-dir embeddings_output
```

### Extract Specific Patient Cohort

```bash
# Extract only patients listed in a CSV file
python extract-embeddings.py \
    --root-dir /path/to/NLST/data \
    --pid-csv subsets/train_pids.csv \
    --output-dir embeddings_train
```
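
The PID CSV only needs a `pid` column (see Command Line Arguments below). A minimal sketch for producing one with pandas, assuming your cohort's patient IDs are already in a list (the IDs here are placeholders):

```python
from pathlib import Path
import pandas as pd

# Hypothetical patient IDs for illustration; replace with your cohort.
train_pids = [100012, 100045, 100078]

# extract-embeddings.py expects a CSV with a "pid" column.
Path("subsets").mkdir(exist_ok=True)
pd.DataFrame({"pid": train_pids}).to_csv("subsets/train_pids.csv", index=False)
```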

## Command Line Arguments

### Required
- `--root-dir`: Root directory containing DICOM files (e.g., `/data/NLST`)

### Optional - Data Selection
- `--pid-csv`: CSV file with a "pid" column to filter specific patients
- `--max-subjects`: Limit to N subjects (useful for testing)
- `--output-dir`: Output directory (default: `embeddings_output`)

### Optional - Performance Tuning
- `--num-gpus`: Number of GPUs to use (default: 1)
- `--num-parallel`: Number of scans processed simultaneously (default: 1; 1-4 recommended)
- `--num-workers`: Parallel workers for directory scanning (default: 4; 4-12 recommended)
- `--checkpoint-interval`: Save a checkpoint every N scans (default: 1000)

## Expected Directory Structure

Your DICOM data should follow this structure:
```
/path/to/NLST/
└── NLST/
    ├── <PID_1>/
    │   ├── MM-DD-YYYY-NLST-LSS-<scan_id>/
    │   │   ├── <series_id>/
    │   │   │   ├── *.dcm
    │   │   │   └── ...
    │   │   └── ...
    │   └── ...
    ├── <PID_2>/
    └── ...
```
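
Before a long run, it can help to confirm your data matches this layout. A minimal sketch, assuming exactly the nesting shown above (adjust the depths if your tree differs):

```python
from pathlib import Path

root = Path("/path/to/NLST/NLST")

# Walk PID -> scan -> series and count DICOM slices per series.
for pid_dir in sorted(p for p in root.iterdir() if p.is_dir()):
    for scan_dir in (p for p in pid_dir.iterdir() if p.is_dir()):
        for series_dir in (p for p in scan_dir.iterdir() if p.is_dir()):
            n_slices = len(list(series_dir.glob("*.dcm")))
            print(f"{pid_dir.name}/{scan_dir.name}/{series_dir.name}: {n_slices} slices")
```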

## Output Format

### Embeddings File: `all_embeddings.parquet`

Parquet file with columns:
- `case_number`: Patient ID (PID)
- `subject_id`: Same as `case_number`
- `scan_id`: Unique scan identifier
- `timepoint`: T0, T1, T2, ... (year-based, e.g., 1999 → T0, 2000 → T1)
- `dicom_directory`: Full path to the scan directory
- `num_dicom_files`: Number of DICOM slices
- `embedding_index`: Index into the embedding array
- `embedding`: 512-dimensional embedding array

### Metadata File: `dataset_metadata.json`

Complete metadata, including:
- Dataset info (total scans, embedding dimensions)
- Model info (Sybil ensemble, extraction layer)
- Per-scan metadata (paths, statistics)
- Failed scans with error messages
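
The full JSON schema isn't documented here beyond the bullets above, but a quick way to review failures after a run is to read the file directly. A sketch, assuming a top-level `failed_scans` entry (the key named under Troubleshooting below):

```python
import json

with open("embeddings_output/dataset_metadata.json") as f:
    meta = json.load(f)

# Print each failed-scan record whole rather than guessing its field names.
failed = meta.get("failed_scans", [])
print(f"{len(failed)} scans failed")
for record in failed:
    print(record)
```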

## Performance Tips

### For Large Datasets (>10K scans)

```bash
# Use the cached directory list and multi-GPU processing
python extract-embeddings.py \
    --root-dir /data/NLST \
    --num-gpus 4 \
    --num-parallel 4 \
    --num-workers 12 \
    --checkpoint-interval 500
```

**Memory Requirements**: ~10 GB VRAM per parallel scan
- `--num-parallel 1`: safe for 16 GB GPUs
- `--num-parallel 2`: safe for 24 GB GPUs
- `--num-parallel 4`: requires 40 GB+ GPUs

### For Subset Extraction (Train/Test Split)

```bash
# Extract the training set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv train_pids.csv \
    --output-dir embeddings_train \
    --num-workers 12

# Extract the test set
python extract-embeddings.py \
    --root-dir /data/NLST \
    --pid-csv test_pids.csv \
    --output-dir embeddings_test \
    --num-workers 12
```

**Speed**: With PID filtering, scanning 100K subjects for 100 PIDs takes ~5 seconds, roughly a 100x speedup over an unfiltered scan.

## Loading Embeddings for Training

```python
import pandas as pd
import numpy as np

# Load embeddings
df = pd.read_parquet('embeddings_output/all_embeddings.parquet')

# Stack the per-scan embedding arrays into one matrix
embeddings = np.stack(df['embedding'].values)  # Shape: (num_scans, 512)

# Access metadata
pids = df['case_number'].values
timepoints = df['timepoint'].values
```
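
Downstream code can filter on the metadata columns directly. For example, a minimal sketch that keeps only baseline (T0) scans, reusing `df` from above:

```python
# Select baseline scans via the documented `timepoint` column.
t0 = df[df['timepoint'] == 'T0']
t0_embeddings = np.stack(t0['embedding'].values)  # Shape: (num_T0_scans, 512)
print(t0_embeddings.shape)
```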

## Troubleshooting

### Out of Memory (OOM) Errors
- Reduce `--num-parallel` to 1 or 2
- Use fewer GPUs with `--num-gpus 1`

### Slow Directory Scanning
- Increase `--num-workers` (try 8-12 for fast storage)
- Use `--pid-csv` to filter early (~100x speedup)
- Reruns automatically use the cached directory list

### Missing Timepoints
- Timepoints are extracted from the year in the scan path (1999 → T0, 2000 → T1)
- If `timepoint` is None, no year pattern was found in the path
- You can manually map scans to timepoints using the `dicom_directory` column, as sketched below
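
A minimal fallback sketch, reusing `df` from "Loading Embeddings for Training". The script's own year-matching pattern isn't shown here, so the regex below is an assumption based on the `MM-DD-YYYY-NLST-LSS-<scan_id>` naming from "Expected Directory Structure" and the 1999 → T0 mapping above:

```python
import re

def timepoint_from_path(dicom_directory):
    # Assumes MM-DD-YYYY-NLST-LSS-<scan_id> directory names; returns None
    # when no year is found, mirroring the script's behavior.
    match = re.search(r'\b\d{2}-\d{2}-(\d{4})-NLST', dicom_directory)
    return None if match is None else f"T{int(match.group(1)) - 1999}"

# Fill only the rows where extraction left `timepoint` empty.
missing = df['timepoint'].isna()
df.loc[missing, 'timepoint'] = df.loc[missing, 'dicom_directory'].map(timepoint_from_path)
```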

### Failed Scans
- Check the `failed_scans` section of `dataset_metadata.json`
- Common causes: corrupted DICOM files, insufficient slices, invalid metadata

## Federated Learning Integration

This script is designed for **privacy-preserving federated learning**:

1. **Each site runs extraction locally** on its own DICOM data
2. **Embeddings are saved** (not raw DICOM images)
3. **Sites share embeddings** with the federated learning system
4. **The central server trains a model** on the embeddings without ever accessing raw data

### Workflow for Sites

```bash
# 1. Download the extraction script
wget https://huggingface.co/Lab-Rasool/sybil/resolve/main/extract-embeddings.py

# 2. Extract embeddings for the train/test splits
python extract-embeddings.py --root-dir /local/NLST --pid-csv train_pids.csv --output-dir train_embeddings
python extract-embeddings.py --root-dir /local/NLST --pid-csv test_pids.csv --output-dir test_embeddings

# 3. Share the embeddings with the federated learning system
#    (embeddings are much smaller and preserve privacy better than raw DICOM)
```

## Citation

If you use this extraction pipeline, please cite the Sybil model:

```bibtex
@article{sybil2023,
  title={A Deep Learning Model to Predict Lung Cancer Risk from Chest CT Scans},
  author={...},
  journal={...},
  year={2023}
}
```

## Support

For issues or questions:
- Model issues: https://huggingface.co/Lab-Rasool/sybil
- Federated learning: contact your FL system administrator