JHU Diarization System Description

Huang, Zili; García-Perera, L. Paola; Villalba, Jesús; Povey, Daniel; Dehak, Najim

doi:10.21437/IberSPEECH.2018-49

We present the JHU system for the IberSPEECH-RTVE Speaker Diarization Evaluation. This evaluation combines Spanish-language speech and broadcast audio in the same recordings, conditions under which our system had not been tested before. To tackle this problem, our general pipeline, developed entirely in Kaldi, comprises acoustic feature extraction, speech activity detection (SAD), an embedding extractor, a PLDA and a clustering stage. This pipeline was used for both the open and the closed conditions (described in the evaluation plan). All the proposed solutions use wide-band data (16 kHz) and MFCCs as their input. For the closed condition, the system trains a DNN SAD using the Albayzin2016 data. Due to the small amount of data available, i-vector embedding extraction was the only approach explored for this task. The PLDA is trained on Albayzin data, and Agglomerative Hierarchical Clustering (AHC) is then applied to obtain the speaker segmentation. The open condition employs the DNN SAD obtained in the closed condition. Four types of embeddings were extracted: x-vector-basic, x-vector-factored, i-vector-basic and BNF-i-vector. The x-vector-basic is a TDNN trained on augmented VoxCeleb1 and VoxCeleb2. The x-vector-factored is a factored TDNN (F-TDNN) trained on SRE12-micphn, MX6-micphn, VoxCeleb and SITWdev-core. The i-vector-basic was trained on VoxCeleb1 and VoxCeleb2 data (no augmentation). The BNF-i-vector is a BNF-posterior i-vector trained with the same data as x-vector-factored. The PLDA training for the new scenario uses the Albayzin2016 data. The four systems were fused at the score level. Once again, AHC computed the final speaker segmentation. We tested our systems on the dev2 data and observed that the SAD plays an important role in improving the results. Moreover, we noticed that x-vectors performed better than i-vectors, as already observed in previous experiments.
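
The last two stages of the open-condition pipeline, score-level fusion followed by AHC, can be illustrated with a small sketch. The abstract does not state the fusion weights, the linkage criterion or the stopping threshold, so the code below assumes equal-weight averaging of the per-system pairwise score matrices, average linkage, and a threshold of 0. The function names `fuse_scores` and `ahc` and the toy scores are hypothetical; this is not the Kaldi implementation used by the authors.

```python
from itertools import combinations


def fuse_scores(matrices, weights=None):
    """Score-level fusion: weighted average of same-sized pairwise score matrices.

    Equal weights are an assumption; the abstract does not give the weights.
    """
    if weights is None:
        weights = [1.0 / len(matrices)] * len(matrices)
    n = len(matrices[0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) for j in range(n)]
        for i in range(n)
    ]


def ahc(scores, threshold):
    """Average-linkage AHC on a symmetric similarity matrix (higher = more similar).

    Repeatedly merges the two clusters with the highest mean pairwise score and
    stops once that score falls below `threshold`. Returns one label per segment.
    """
    clusters = [[i] for i in range(len(scores))]

    def linkage(a, b):
        # Mean similarity over all cross-cluster segment pairs.
        return sum(scores[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        (x, y), best = max(
            (((x, y), linkage(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        # x < y always holds, so deleting index y leaves index x valid.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]

    labels = [0] * len(scores)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy pairwise scores for four segments from two hypothetical systems.
system_a = [
    [0.0, 5.0, -4.0, -6.0],
    [5.0, 0.0, -3.0, -5.0],
    [-4.0, -3.0, 0.0, 4.0],
    [-6.0, -5.0, 4.0, 0.0],
]
system_b = [
    [0.0, 3.0, -2.0, -4.0],
    [3.0, 0.0, -5.0, -3.0],
    [-2.0, -5.0, 0.0, 6.0],
    [-4.0, -3.0, 6.0, 0.0],
]

fused = fuse_scores([system_a, system_b])
labels = ahc(fused, threshold=0.0)
print(labels)  # [0, 0, 1, 1]
```

In this toy case the fused scores group segments 0 and 1 into one speaker and segments 2 and 3 into another. In practice the threshold would be tuned on development data, since it controls how many speakers the clustering returns.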

Cite as: Huang, Z., García-Perera, L.P., Villalba, J., Povey, D., Dehak, N. (2018) JHU Diarization System Description. Proc. IberSPEECH 2018, 236-239, doi: 10.21437/IberSPEECH.2018-49

@inproceedings{huang18_iberspeech,
  author={Zili Huang and L. Paola García-Perera and Jesús Villalba and Daniel Povey and Najim Dehak},
  title={{JHU Diarization System Description}},
  year=2018,
  booktitle={Proc. IberSPEECH 2018},
  pages={236--239},
  doi={10.21437/IberSPEECH.2018-49}
}