research-article

Text Prompted Speaker Verification Based on Phoneme Clustering with Earth Mover's Distance and Cauchy-Schwarz Divergence

Authors:
Zhuzi Chen, Tsinghua University, Haidian, Beijing, China
Yi Liu, Tsinghua University, Haidian, Beijing, China
Jia Liu, Tsinghua University, Haidian, Beijing, China

ICACS '18: Proceedings of the 2nd International Conference on Algorithms, Computing and Systems, July 2018, Pages 84–88, https://doi.org/10.1145/3242840.3242873

Published: 27 July 2018

ABSTRACT

In short-duration text-prompted speaker verification, the enrollment data available for each speaker model is limited, which makes it hard to obtain a robust speaker representation. In such short-utterance conditions, i-vector/GMM approaches perform even worse than the traditional GMM-MAP modeling method. Content matching within the GMM/HMM framework is one of the state-of-the-art paradigms for short-duration text-dependent speaker verification: models for individual lexical units such as words, syllables, or phonemes are built for both the background and the speaker to compensate for mismatch. However, some phonemes that occur in the test recordings never occur in enrollment, and most phonemes have different preceding and succeeding phonemes, which leads to coarticulation differences. These two problems are called lexical mismatch and context mismatch, respectively. In this work, to overcome the lexical and context mismatch caused by data sparseness, phoneme states are clustered using the Earth Mover's Distance and the Cauchy-Schwarz divergence as metrics. With the Earth Mover's Distance metric, EER fell by 6.2% and minDCF08 fell by 1.9%; with the Cauchy-Schwarz divergence metric, EER fell by 3.7% while minDCF08 rose by 1.9%.
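
To make the Cauchy-Schwarz metric concrete, the following is a minimal sketch of the divergence between two Gaussian mixture models, the kind of pairwise score a state-clustering procedure could use. It is not the authors' implementation: the abstract does not give their formulation, so the diagonal covariances, the `(weights, means, variances)` tuple layout, and all function names here are assumptions made for illustration. The sketch relies on the standard identity that the integral of the product of two Gaussians is itself a Gaussian density evaluated at one mean.

```python
import numpy as np


def gaussian_overlap(mu_a, var_a, mu_b, var_b):
    """Integral of the product of two diagonal Gaussians: N(mu_a; mu_b, var_a + var_b)."""
    var = var_a + var_b
    diff = mu_a - mu_b
    log_value = -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var)
    return np.exp(log_value)


def mixture_overlap(p, q):
    """Integral of p(x) * q(x) for two GMMs given as (weights, means, variances)."""
    total = 0.0
    for w_i, mu_i, var_i in zip(*p):
        for w_j, mu_j, var_j in zip(*q):
            total += w_i * w_j * gaussian_overlap(mu_i, var_i, mu_j, var_j)
    return total


def cs_divergence(p, q):
    """Cauchy-Schwarz divergence: -log(int pq) + 0.5*log(int pp) + 0.5*log(int qq)."""
    return (-np.log(mixture_overlap(p, q))
            + 0.5 * np.log(mixture_overlap(p, p))
            + 0.5 * np.log(mixture_overlap(q, q)))


# Two hypothetical single-component, one-dimensional state models (illustrative values only).
state_a = (np.array([1.0]), np.array([[0.0]]), np.array([[1.0]]))
state_b = (np.array([1.0]), np.array([[1.0]]), np.array([[1.0]]))

print(cs_divergence(state_a, state_b))
```

For these two unit-variance Gaussians whose means are one unit apart, the divergence is 0.25, and it is zero when a model is compared with itself. With real high-dimensional acoustic features the overlap terms can underflow, so a practical version would likely keep the sums in the log domain.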

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition

Published in

ICACS '18: Proceedings of the 2nd International Conference on Algorithms, Computing and Systems
July 2018
245 pages
ISBN:9781450365093
DOI:10.1145/3242840

Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Cauchy-Schwarz divergence
Earth Mover's distance
Phoneme clustering
Text prompted speaker verification
Qualifiers
- research-article
- Research
- Refereed limited