This paper describes the BIT (Beijing Institute of Technology) system submitted to the Conversational Speaker Diarization Challenge. We first present the details of the front-end system, which comprises a Speech Activity Detection (SAD) module and a speaker embedding extraction module. Then, starting from the results of the clustering-based module, we investigate two iterative back-end models with a multi-scale similarity measure: a Support Vector Classifier (SVC) system and a U-Net system. Finally, the DOVER algorithm is adopted for model fusion. Experimental results show that our system yields a diarization error rate (DER) of 5.18% in the challenge, a relative improvement of 34% over the baseline system provided by the organizer. Our system won first place among all submitted systems without using any additional embedding extraction model.
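For readers unfamiliar with the metric, the following minimal sketch shows how a DER and a relative improvement are conventionally computed. It assumes the standard definition of DER (false alarm, missed speech, and speaker confusion time divided by total reference speech time); the function names and all durations are hypothetical illustrations, not values or scoring code from the challenge.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Standard DER: summed error durations over total reference speech time.

    All arguments are durations in seconds. The names are illustrative
    assumptions, not part of any challenge scoring tool.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


def relative_improvement(baseline_der, system_der):
    """Fraction by which system_der is lower than baseline_der."""
    if baseline_der <= 0:
        raise ValueError("baseline_der must be positive")
    return (baseline_der - system_der) / baseline_der


# Hypothetical example: 1000 s of reference speech with 50 s of total error.
example_der = diarization_error_rate(
    false_alarm=10.0, missed=25.0, confusion=15.0, total_speech=1000.0
)

# Hypothetical example: a system at 6% DER against a baseline at 8% DER.
example_gain = relative_improvement(baseline_der=0.08, system_der=0.06)
```

Under this definition, a relative improvement is measured against the baseline's DER rather than as an absolute difference in percentage points.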
Cite as: Hu, C., Zhan, Q., Liu, M., Xie, X. (2022) BIT Submission for the Conversational Speaker Diarization Challenge. Proc. The Speaker and Language Recognition Workshop (Odyssey 2022), 148-155, doi: 10.21437/Odyssey.2022-21
@inproceedings{hu22_odyssey,
  author={Chenguang Hu and Qingran Zhan and Miao Liu and Xiang Xie},
  title={{BIT Submission for the Conversational Speaker Diarization Challenge}},
  year=2022,
  booktitle={Proc. The Speaker and Language Recognition Workshop (Odyssey 2022)},
  pages={148--155},
  doi={10.21437/Odyssey.2022-21}
}