Search for speaker identity in historical oral archives

Multimedia Tools and Applications

Abstract

We present our ongoing research on speaker recognition in historical oral archives. This research is part of our long-term effort to enable versatile access to the archive of the Czech Radio (CRo). From a manually annotated partition of the archive, we compiled a database covering a time span of more than 30 years for our experimental study, which allowed us to investigate the impact of various factors that make historical data challenging to process. We show how the scores of target (genuine) speaker trials shift with speaker aging, with the signal-to-noise ratio, and with the variable amount of enrollment and test data. Scores for speaker detection trials were produced by a system based on the i-vector paradigm and probabilistic linear discriminant analysis (PLDA). We also assessed the performance of this system on an evaluation database containing contemporary recordings collected over a time span of approximately 4 years. Although the system uses state-of-the-art techniques capable of dealing with nuisance inter-session variability, its performance degrades markedly on the evaluation containing historical data compared with the one containing contemporary data only. Specifically, the Equal Error Rate (EER) of the system rose from 1.93 % to 8.27 %. This difference shows that compensation techniques need to be employed to cope with the additional variability introduced into the historical data by various sources.
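
For readers unfamiliar with the metric, the EER quoted in the abstract is the operating point at which the miss rate on target trials equals the false-alarm rate on non-target trials. The following is a minimal sketch of how it can be approximated from two lists of trial scores. The function names and the toy scores are assumptions made for illustration; this is not the evaluation code used in the study, which relied on the Bosaris toolkit for its DET curves.

```python
from typing import Sequence, Tuple


def error_rates(target: Sequence[float],
                nontarget: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (miss rate, false-alarm rate) when scores >= threshold are accepted."""
    miss = sum(s < threshold for s in target) / len(target)
    false_alarm = sum(s >= threshold for s in nontarget) / len(nontarget)
    return miss, false_alarm


def equal_error_rate(target: Sequence[float],
                     nontarget: Sequence[float]) -> float:
    """Approximate the EER by trying every observed score as a threshold
    and keeping the point where the two error rates are closest."""
    best_gap, eer = float("inf"), 1.0
    for threshold in sorted(set(target) | set(nontarget)):
        miss, false_alarm = error_rates(target, nontarget, threshold)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer


# Toy scores, invented for illustration only (not data from the paper).
target_scores = [2.1, 1.7, 0.9, 0.4, -0.2]
nontarget_scores = [0.6, -0.5, -1.1, -1.8, -2.4]
print(equal_error_rate(target_scores, nontarget_scores))
```

With these toy scores, one target trial falls below the best threshold and one non-target trial lies above it, so both error rates are 1 in 5. A shift of target scores towards the non-target distribution, as reported for the historical data, moves this crossing point upwards.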

Notes

  1. The original name of the company was the Radiojournal company.

  2. http://www.nist.gov/itl/iad/mig/sre.cfm

  3. In [19], the nuisance variability is expressed as the sum of an intra-speaker variability confined to a lower-dimensional subspace and a residual noise, which is assumed to be Gaussian with a diagonal covariance matrix. The full covariance matrix used in our case is thus simply a generalization of the model.

  4. We used a relevance factor of 16.0 in our experiments.

  5. We used the Bosaris toolkit available at https://sites.google.com/site/bosaristoolkit/ to plot our DET curves.

  6. Please note that the results presented in [23] and in this work are not directly comparable. In [23] we used the test database in a two-fold cross-validation setup with one fold used for calibration training and the second for testing, and vice versa. Here we pooled all the test data together. Furthermore, different development data sets were used in [23] and in this work. In the former study, the available data was much more limited, particularly for estimation of intra-speaker variability, requiring recordings from multiple sessions per speaker.

  7. Let us stress that we strictly distinguished between different excerpts and different sessions. Hence, no model was trained for a speaker whose available excerpts were all drawn from a single session.

  8. The linear regression fit curves displayed in the figure are not intended to represent a true dependency but just its trend.
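
Note 4 gives a relevance factor without saying how it is applied. As a hedged illustration, the sketch below assumes it refers to relevance-MAP adaptation of Gaussian mixture means in the style of [21], where the adapted mean interpolates between the prior (background model) mean and the data mean with weight n / (n + r). The function name, its arguments, and the single-component setting are hypothetical simplifications, not the authors' implementation.

```python
from typing import List, Sequence


def map_adapt_mean(prior_mean: Sequence[float],
                   frames: Sequence[Sequence[float]],
                   posteriors: Sequence[float],
                   relevance: float = 16.0) -> List[float]:
    """Relevance-MAP update of one Gaussian component mean.

    posteriors[t] is the component's responsibility for frames[t].
    The default relevance of 16.0 follows note 4; the rest is assumed.
    """
    soft_count = sum(posteriors)
    if soft_count == 0.0:
        # No data assigned to this component: keep the prior mean.
        return list(prior_mean)
    dim = len(prior_mean)
    data_mean = [
        sum(p * x[d] for p, x in zip(posteriors, frames)) / soft_count
        for d in range(dim)
    ]
    # Data-dependent interpolation weight: more data means more trust in it.
    alpha = soft_count / (soft_count + relevance)
    return [alpha * m + (1.0 - alpha) * mu
            for m, mu in zip(data_mean, prior_mean)]


# 16 frames fully assigned to the component give alpha = 0.5.
adapted = map_adapt_mean([0.0], [[1.0]] * 16, [1.0] * 16)
print(adapted)
```

Under this assumption, a component that sees a soft count equal to the relevance factor moves exactly halfway from the prior mean to the data mean, which is why short enrollment excerpts leave a model close to the background model.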

References

  1. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, SODA ’07. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 1027–1035

  2. Boháč M, Blavka K (2013) Text-to-speech alignment for imperfect transcriptions. In: Habernal I, Matoušek V (eds) Text, speech, and dialogue, lecture notes in computer science, vol 8082. Springer, Berlin Heidelberg, pp 536–543

  3. Brümmer N (2009) EM for JFA. Tech. rep., South Africa. Available at https://sites.google.com/site/nikobrummer/EMforJFA.pdf?attredirects=0

  4. Brümmer N, Burget L, Kenny P, Matějka P, de Villiers E, Karafiát M, Kockmann M, Glembek O, Plchot O, Baum D, Senoussaoui M (2010) ABC system description for NIST SRE 2010. In: Proc. NIST 2010 speaker recognition evaluation. Brno University of Technology, pp 1–20

  5. Chaloupka J, Nouza J, Červa P, Málek J (2013) Downdating lexicon and language model for automatic transcription of Czech historical spoken documents. In: Habernal I, Matoušek V (eds) Text, speech, and dialogue, lecture notes in computer science, vol 8082. Springer, Berlin Heidelberg, pp 201–208

  6. Dehak N, Kenny P, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19 (4):788–798

  7. Doddington GR, Przybocki MA, Martin AF, Reynolds DA (2000) The NIST speaker recognition evaluation - overview, methodology, systems, results, perspective. Speech Commun 31 (2–3):225–254

  8. Ferrer L, Graciarena M, Zymnis A, Shriberg E (2008) System combination using auxiliary information for speaker verification. In: IEEE international conference on acoustics, speech and signal processing - ICASSP 2008, Las Vegas, pp 4853–4856

  9. Garcia-Romero D, Espy-Wilson CY (2011) Analysis of i-vector length normalization in speaker recognition systems. In: Interspeech’11. Florence, pp 249–252

  10. Kanagasundaram A, Dean D, Gonzalez-Dominguez J, Sridharan S, Ramos D, Gonzalez-Rodriguez J (2013) Improving the PLDA based speaker verification in limited microphone data conditions. In: Interspeech 2013. International Speech Communication Association (ISCA), Lyon, pp 3674–3678

  11. Kelly F, Drygajlo A, Harte N (2012) Speaker verification with long-term ageing data. In: 2012 5th IAPR international conference on biometrics (ICB), pp 478–483

  12. Kelly F, Harte N (2011) Effects of long-term ageing on speaker verification. In: Vielhauer C, Dittmann J, Drygajlo A, Juul N, Fairhurst M (eds) Biometrics and ID management, lecture notes in computer science, vol 6583. Springer, Berlin Heidelberg, pp 113–124

  13. Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13 (3):345–354

  14. Kenny P, Stafylakis T, Ouellet P, Alam M, Dumouchel P (2013) PLDA for speaker verification with utterances of arbitrary duration. In: 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp 7649–7653

  15. Kim C, Stern RM (2008) Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis. In: INTERSPEECH. ISCA, pp 2598–2601

  16. Kinnunen T, Li H (2010) An overview of text-independent speaker recognition: from features to supervectors. Speech Commun 52:12–40

  17. Matveev Y (2013) The problem of voice template aging in speaker recognition systems. In: Železný M, Habernal I, Ronzhin A (eds) Speech and computer, lecture notes in computer science, vol 8113. Springer International Publishing, pp 345–353

  18. Nouza J, Blavka K, Bohac M, Cerva P, Zdansky J, Silovsky J, Prazak J (2012) Voice technology to enable sophisticated access to historical audio archive of the Czech Radio. In: Grana C, Cucchiara R (eds) Multimedia for cultural heritage, communications in computer and information science, vol 247. Springer, Berlin Heidelberg, pp 27–38

  19. Prince SJD, Elder JH (2007) Probabilistic linear discriminant analysis for inferences about identity. In: Proceedings ICCV 2007, Rio de Janeiro, Brazil, pp 1–8

  20. Rajan P, Kinnunen T, Hautamäki V (2013) Effect of multicondition training on i-vector PLDA configurations for speaker recognition. In: Interspeech’13. Lyon, pp 3694–3697

  21. Reynolds DA, Quatieri TF, Dunn RB (2000) Speaker verification using adapted Gaussian mixture models. Digit Signal Process 10 (1–3):19–41

  22. Sarkar AK, Matrouf D, Bousquet P-M, Bonastre J-F (2012) Study of the effect of i-vector modeling on short and mismatch utterance duration for speaker verification. In: INTERSPEECH’12. ISCA, Portland, OR, USA

  23. Silovsky J, Cerva P, Zdansky J (2009) Comparison of generative and discriminative approaches for speaker recognition with limited data. Radioengineering 18 (3):307–316

  24. Silovsky J, Zdansky J, Nouza J, Cerva P, Prazak J (2012) Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams. In: MMSP’12. Banff, pp 118–123

  25. van Leeuwen DA, Brummer N (2007) An introduction to application-independent evaluation of speaker recognition systems. Lect Notes Comput Sci 4343/2007:330–353

Acknowledgments

This research work was supported by the Czech Ministry of Culture (project no. DF11P01OVV013 in program NAKI).

Author information

Correspondence to Jan Silovsky.

About this article

Cite this article

Silovsky, J., Nouza, J. & Kucharova, M. Search for speaker identity in historical oral archives. Multimed Tools Appl 75, 3767–3786 (2016). https://doi.org/10.1007/s11042-014-2067-2

  • DOI: https://doi.org/10.1007/s11042-014-2067-2
