Audio-Visual Class Association Based on Two-stage Self-supervised Contrastive Learning towards Robust Scene Analysis | IEEE Conference Publication | IEEE Xplore