ABSTRACT
Classification quality criteria such as precision, recall, and F-measure are generally the basis for evaluating contributions in automatic speaker recognition. Comparisons are mostly carried out via mean values estimated on a set of media. Whilst this approach is relevant for assessing improvement with respect to the state of the art, or for ranking participants in an automatic annotation challenge, it gives system designers little insight into how to improve algorithms, formulate hypotheses, or display evidence. This paper presents a design study of a visual and interactive approach to analyzing errors made by automatic annotation algorithms. A timeline-based tool emerged from prior steps of this study. A critical review, driven by user interviews, exposes caveats and refines user objectives. The next step of the study is then initiated by sketching designs that combine elements of the current prototype with principles newly identified as relevant.
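To make the limitation of mean values concrete, the following is a minimal sketch, not taken from the paper, of how precision, recall, and F-measure might be computed per media file and then averaged. The media names and the true-positive, false-positive, and false-negative counts are hypothetical, chosen only to show how a reasonable-looking mean can hide one badly annotated file.

```python
from statistics import mean


def prf(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical (tp, fp, fn) counts of speaker labels for three media files.
counts = {
    "show_a": (90, 10, 10),
    "show_b": (80, 20, 20),
    "show_c": (10, 40, 40),
}

# Per-media scores, then the single mean value usually reported.
per_media = {name: prf(*c) for name, c in counts.items()}
mean_f = mean(f for _, _, f in per_media.values())

# The media file with the lowest F-measure, which the mean alone does not reveal.
worst = min(per_media, key=lambda name: per_media[name][2])

for name, (p, r, f) in per_media.items():
    print(f"{name}: precision={p:.2f} recall={r:.2f} F={f:.2f}")
print(f"mean F={mean_f:.2f}, worst media={worst}")
```

In this illustration the mean F-measure sits between the per-file scores and says nothing about which file failed or why, which is the kind of gap a per-media, timeline-based view is meant to address.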
A Visual Analytics Approach to Finding Factors Improving Automatic Speaker Identifications