Hierarchical Training on Partial Annotations Enables Density-Robust Crowd Counting and Localization

Abstract

Reliable crowd analysis requires both accurate counting and precise head-point localization under severe density and scale variation. In practice, dense scenes exhibit heavy occlusion and perspective distortion, while the same camera can undergo abrupt distribution shifts over time due to zoom and viewpoint changes or event dynamics. We present PET-Finetuned, obtained by fine-tuning the Point Query Transformer (PET) on a curated, multi-source dataset with partial and heterogeneous annotations. Our training recipe combines (i) a hierarchical iterative loop that aligns count distributions across partial ground truth, fine-tuned predictions, and the pre-trained baseline to guide outlier-driven data refinement, (ii) multi-patch-resolution training (128x128, 256x256, and 512x512) to reduce scale sensitivity, (iii) count-aware patch sampling to mitigate long-tailed density skew, and (iv) adaptive background-query loss weighting to prevent resolution-dependent background dominance. This approach improves F1@4px and F1@8px on ShanghaiTech Part A (SHHA), ShanghaiTech Part B (SHHB), JHU-Crowd++, and UCF-QNRF, and exhibits more stable behavior during sparse-to-dense density transitions.

For details on data curation and the training recipe, refer to the accompanying technical report.
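As a rough, self-contained illustration of components (ii) and (iii) of the recipe, the sketch below crops a training patch at a randomly chosen resolution (128/256/512) and biases selection toward candidate crops containing more annotated heads. The helper names, the candidate-crop weighting rule, and the data layout are assumptions for illustration only, not the release training code.

```python
import random
import numpy as np

# Patch sizes follow the recipe above; everything else is illustrative.
PATCH_SIZES = [128, 256, 512]

def random_crop(image, points, size, rng):
    """Crop a size x size patch and keep the head points that fall inside it."""
    h, w = image.shape[:2]
    size = min(size, h, w)  # guard against images smaller than the patch
    x0 = rng.randint(0, w - size)
    y0 = rng.randint(0, h - size)
    inside = (points[:, 0] >= x0) & (points[:, 0] < x0 + size) & \
             (points[:, 1] >= y0) & (points[:, 1] < y0 + size)
    return image[y0:y0 + size, x0:x0 + size], points[inside] - np.array([x0, y0])

def sample_training_patch(image, points, num_candidates=4, rng=random):
    """Multi-resolution, count-aware patch sampling (illustrative).

    image:  H x W x 3 array; points: N x 2 array of (x, y) head annotations.
    A patch size is drawn at random, several candidate crops are taken, and one
    is kept with probability proportional to (1 + head count) to favor dense regions.
    """
    size = rng.choice(PATCH_SIZES)
    candidates = [random_crop(image, points, size, rng) for _ in range(num_candidates)]
    weights = [1 + len(pts) for _, pts in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```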

Evaluation and Results

Across the four benchmarks, PET-Finetuned shows the strongest overall transfer, with consistent gains in both counting and localization on SHHB, UCF-QNRF, and JHU-Crowd++. On SHHB, it reduces MAE/MSE to 13.794/22.163 from 19.472/29.651 (PET-SHHA) and 19.579/28.398 (APGCC-SHHA), while increasing F1@8px to 0.820. The same pattern holds on UCF-QNRF (MAE 105.772, MSE 199.544, F1@8px 0.738) and JHU-Crowd++ (MAE 74.778, MSE 271.886, F1@8px 0.698), where PET-Finetuned outperforms both references by clear margins. On SHHA, counting error is higher than for PET-SHHA and APGCC-SHHA (MAE 62.742 vs. 48.879/48.725), but localization is the best in the table (F1@4px 0.614, F1@8px 0.794), indicating a stronger precision-recall balance for head-point prediction at both matching thresholds.

Note (evaluation protocol): The PET-SHHA and APGCC-SHHA numbers in this section can differ from the values reported in the original papers. The original works typically train one model per target dataset and evaluate in-domain. In contrast, PET-Finetuned (Ours) is initialized from PET-SHHA weights and fine-tuned in our framework. For the cross-dataset baseline comparison, we use the best public ShanghaiTech Part A checkpoints released by the respective authors for PET-SHHA and APGCC-SHHA (APGCC publicly provides only its SHHA-best checkpoint). Therefore, the PET-SHHA and APGCC-SHHA rows in the tables below reflect transfer from SHHA initialization rather than per-dataset retraining. All metrics in this section are evaluated at a prediction confidence threshold of 0.5.
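For reference, the localization metrics match predicted head points one-to-one with ground-truth points and count a match as a true positive when its distance is at most the pixel threshold (4 px or 8 px). A minimal sketch of this computation, assuming Hungarian matching over Euclidean distances (illustrative only, not our evaluation code):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def point_precision_recall_f1(pred_pts, gt_pts, threshold_px):
    """pred_pts, gt_pts: arrays of shape (P, 2) and (G, 2) holding (x, y) points."""
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return 0.0, 0.0, 0.0
    # Pairwise Euclidean distances, then a one-to-one assignment.
    dists = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(dists)
    tp = int((dists[rows, cols] <= threshold_px).sum())  # matches within 4 px or 8 px
    precision = tp / len(pred_pts)
    recall = tp / len(gt_pts)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1
```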

ShanghaiTech Part A (SHHA)

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|---|---|---|---|---|---|---|---|---|
| PET-Finetuned (Ours) | 62.742 | 102.996 | 0.615 | 0.613 | 0.614 | 0.796 | 0.793 | 0.794 |
| PET-SHHA | 48.879 | 76.520 | 0.596 | 0.604 | 0.600 | 0.781 | 0.792 | 0.786 |
| APGCC-SHHA | 48.725 | 76.721 | 0.439 | 0.428 | 0.433 | 0.773 | 0.754 | 0.764 |

ShanghaiTech Part B (SHHB)

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|---|---|---|---|---|---|---|---|---|
| PET-Finetuned (Ours) | 13.794 | 22.163 | 0.666 | 0.596 | 0.629 | 0.869 | 0.777 | 0.820 |
| PET-SHHA | 19.472 | 29.651 | 0.640 | 0.547 | 0.590 | 0.847 | 0.724 | 0.781 |
| APGCC-SHHA | 19.579 | 28.398 | 0.517 | 0.441 | 0.476 | 0.837 | 0.714 | 0.771 |

UCF-QNRF

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|---|---|---|---|---|---|---|---|---|
| PET-Finetuned (Ours) | 105.772 | 199.544 | 0.533 | 0.505 | 0.519 | 0.759 | 0.719 | 0.738 |
| PET-SHHA | 123.135 | 240.943 | 0.495 | 0.487 | 0.491 | 0.708 | 0.696 | 0.702 |
| APGCC-SHHA | 126.763 | 228.998 | 0.311 | 0.284 | 0.297 | 0.638 | 0.583 | 0.609 |

JHU-Crowd++

| Model | MAE | MSE | AP@4px | AR@4px | F1@4px | AP@8px | AR@8px | F1@8px |
|---|---|---|---|---|---|---|---|---|
| PET-Finetuned (Ours) | 74.778 | 271.886 | 0.467 | 0.491 | 0.479 | 0.681 | 0.715 | 0.698 |
| PET-SHHA | 115.861 | 393.281 | 0.379 | 0.449 | 0.411 | 0.582 | 0.690 | 0.632 |
| APGCC-SHHA | 102.461 | 331.883 | 0.303 | 0.330 | 0.316 | 0.578 | 0.630 | 0.603 |

Qualitative Analysis

Full-resolution qualitative comparisons in the report use horizontally stacked panels ordered as PET-Finetuned (Ours), PET-SHHA, and APGCC-SHHA, with point colors green, yellow, and red, respectively. Inference for these comparisons uses threshold = 0.5 and upper_bound = -1. Qualitatively, PET-Finetuned (Ours) shows fewer false positives in sparse scenes, stronger recall in dense scenes under occlusion, and more stable localization under perspective and scale variation.
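A minimal sketch of how such stacked panels can be produced with Pillow, assuming each model's predictions are available as a list of (x, y) points (the function and input format are illustrative, not the report's rendering code):

```python
from PIL import Image, ImageDraw

# Panel order and colors follow the description above.
COLORS = [("PET-Finetuned", "green"), ("PET-SHHA", "yellow"), ("APGCC-SHHA", "red")]

def stacked_comparison(image_path, predictions, radius=3):
    """predictions: dict mapping model name -> list of (x, y) head points."""
    panels = []
    for name, color in COLORS:
        panel = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(panel)
        for x, y in predictions.get(name, []):
            draw.ellipse([x - radius, y - radius, x + radius, y + radius], fill=color)
        panels.append(panel)
    w, h = panels[0].size
    canvas = Image.new("RGB", (w * len(panels), h))
    for i, panel in enumerate(panels):
        canvas.paste(panel, (i * w, 0))
    return canvas

# Example: stacked_comparison("image.jpg", {"PET-Finetuned": [(120, 80), (200, 95)]}).save("panel.jpg")
```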

Qualitative comparisons in the report cover five Pexels images: pexels-558331748-30295833, pexels-ilyasajpg-7038431, pexels-peter-almario-388108-19472286, pexels-rafeeque-kodungookaran-374579689-18755903, and pexels-wendywei-4945353.

Model Inference

Use the official PET repository to run single-image inference with this release model.

  1. Clone PET and move into the repository root.
    git clone https://github.com/cxliu0/PET.git
    cd PET
    
  2. Install dependencies.
    pip install -r requirements.txt
    pip install safetensors pillow
    
  3. Copy test.py from this release folder into the PET repository root.
  4. Place PET_Finetuned.safetensors in the PET repository root.
  5. Run inference (dummy example).
    python test.py \
      --image_path path/to/image.jpg \
      --resume PET_Finetuned.safetensors \
      --device cpu \
      --output_json outputs/prediction.json \
      --output_image outputs/prediction.jpg 
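
Optionally, the released weights can be inspected or loaded manually. A minimal sketch, assuming the .safetensors file stores a flat PyTorch state_dict; build_pet_model is a hypothetical placeholder for the model construction done inside the PET repository:

```python
from safetensors.torch import load_file

state_dict = load_file("PET_Finetuned.safetensors")  # dict: tensor name -> torch.Tensor
print(f"{len(state_dict)} tensors in checkpoint")

# Hypothetical: build PET as the repository does, then load the weights.
# model = build_pet_model(args)
# missing, unexpected = model.load_state_dict(state_dict, strict=False)
# print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```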
    

Summary

We present a practical adaptation of PET for density-robust crowd counting and head-point localization under partial and heterogeneous annotations. The training framework combines a hierarchical iterative fine-tuning loop with outlier-driven data refinement, mixed patch-resolution optimization (128x128/256x256/512x512), count-aware sampling for dense-scene emphasis, and adaptive background-query loss weighting to stabilize supervision across scales.
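As one concrete (but hypothetical) reading of the adaptive background-query weighting, the sketch below scales the background term of a per-query classification loss by the foreground/background ratio, so that large patches with many background queries do not dominate the loss; the exact rule used in training is described in the technical report.

```python
import torch
import torch.nn.functional as F

def query_classification_loss(logits, labels, base_bg_weight=0.1):
    """Illustrative only. logits: (Q, 2); labels: (Q,) long, 1 = head query, 0 = background."""
    num_fg = (labels == 1).sum().clamp(min=1).float()
    num_bg = (labels == 0).sum().clamp(min=1).float()
    # Down-weight background queries more strongly when they vastly outnumber heads.
    bg_weight = base_bg_weight * (num_fg / num_bg).clamp(max=1.0)
    per_query = F.cross_entropy(logits, labels, reduction="none")
    weights = torch.where(labels == 1, torch.ones_like(per_query), bg_weight)
    return (weights * per_query).mean()
```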

Under the reported cross-dataset transfer protocol from SHHA initialization, the model achieves the strongest overall transfer on SHHB, UCF-QNRF, and JHU-Crowd++, while maintaining the best localization balance on SHHA at both matching thresholds. Qualitative evidence is consistent with these trends, showing fewer sparse-scene false positives and stronger dense-scene recall under occlusion and perspective variation.
