| Recognizing pediatric wrist pathologies from radiographs is challenging because normal anatomy changes rapidly with development: evolving carpal ossification and open physes can resemble pathology, and maturation timing differs by sex. Image-only models trained on limited medical datasets therefore risk confusing normal developmental variation with true pathology. We address this by framing pediatric wrist diagnosis as a fine-grained visual recognition (FGVR) problem and proposing a demographic-aware hybrid convolution--transformer model that fuses X-rays with patient age and sex. To leverage demographic context while avoiding shortcut reliance on it, we introduce progressive metadata masking during training. We evaluate on a curated dataset that mirrors the constraints typical of real-world medical studies. The hybrid FGVR backbone outperforms both traditional and modern CNNs, and demographic fusion yields additional gains. Finally, we show that initializing from a fine-grained pretraining source improves transfer relative to standard ImageNet initialization, suggesting that label granularity, even in non-medical data, can be a key driver of generalization for subtle radiographic findings. |
*** Title, author list and abstract as submitted during Camera-Ready version delivery. Small changes that may have occurred during processing by Springer may not appear in this window.