In this work, we fuse imaging features from chest X-ray (CXR) scans with audio features from a radiologist's dictations to improve thoracic disease classification. Recent deep learning-based disease classification methods mostly rely on imaging modalities, yet a radiologist's dictation audio contains rich auxiliary disease-related contextual information. The central hypothesis of this work is that leveraging complementary imaging and audio representations improves disease classification. We use shifted window (Swin) transformer architectures as encoders for both the visual and audio modalities and fuse the resulting feature representations with a cross-correlational feature-multiplication strategy. The fused representation is fed to a classification head for downstream disease classification. We experimentally show that the proposed fused model outperforms the individual-modality models for multi-class thoracic disease classification covering normal, pneumonia, and congestive heart failure cases. For the fused modalities, we report F1-scores of 0.5415 and 0.5353 for the Swin transformer base and small architectures, respectively, while the corresponding baselines are 0.5046 and 0.5076 for the audio modality and 0.4676 and 0.5261 for the imaging modality.
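The fusion step described above can be sketched minimally as an element-wise product of the two encoders' pooled feature vectors, followed by a linear classification head over the three classes. All names and dimensions below (e.g. a 768-d pooled feature, random stand-in weights) are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: both Swin encoders pool to 768-d feature vectors.
feat_dim, num_classes = 768, 3  # classes: normal, pneumonia, congestive heart failure

# Stand-ins for pooled encoder outputs for one example.
img_feat = rng.standard_normal(feat_dim)    # from the imaging (CXR) Swin encoder
audio_feat = rng.standard_normal(feat_dim)  # from the audio Swin encoder

# Cross-correlational feature multiplication: element-wise product of the two vectors.
fused = img_feat * audio_feat

# Linear classification head (random weights here; learned during training in practice).
W = rng.standard_normal((num_classes, feat_dim)) * 0.01
b = np.zeros(num_classes)
logits = W @ fused + b

# Softmax over the three thoracic disease classes.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
pred = int(np.argmax(probs))
```

The element-wise product keeps the fused representation the same size as each encoder's output while letting agreement between modalities amplify a feature and disagreement suppress it.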