Speech corpora subset selection based on time-continuous utterances features

Dong, Luobing; Guo, Qiumin; Wu, Weili

doi:10.1007/s10878-018-0350-2

Published: 21 September 2018

Journal of Combinatorial Optimization, Volume 37, pages 1237–1248 (2019)

429 Accesses
40 Citations

Abstract

An extremely large corpus with rich acoustic properties is very useful for training new speech recognition and semantic analysis models. However, it also comes at a cost, because the complexity of acoustic model training usually depends on the size of the corpus. In this paper, we propose a corpus subset selection method that considers the data contributions of time-continuous utterances and multi-label constraints that are not limited to single-scale metrics. Our goal is to extract a sufficiently rich subset from a large corpus under certain meaningful constraints. In addition, taking into account the uniform coverage of the target subset and its internal properties, we design a constrained subset selection algorithm. Specifically, a fast subset selection algorithm is designed by introducing n-gram models. Experiments conducted on a very large real speech corpus database validate the effectiveness of our method.
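The abstract does not spell out the selection algorithm, so the following is only a generic sketch of budgeted subset selection driven by n-gram coverage, not the authors' method. It assumes each utterance is a token sequence with a positive duration, that the constraint is a total-duration budget, and that the objective is the number of distinct n-grams covered; the names `ngrams`, `greedy_select`, and the toy corpus are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(tokens: Sequence[str], n: int) -> Set[NGram]:
    """Return the set of distinct n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def greedy_select(
    utterances: Dict[str, Tuple[List[str], float]],
    budget: float,
    n: int = 2,
) -> Tuple[List[str], Set[NGram]]:
    """Greedily pick utterances by newly covered n-grams per unit duration.

    `utterances` maps an id to (tokens, duration); durations are assumed
    to be positive. Selection stops when no remaining utterance fits in
    the budget or adds any new n-gram.
    """
    grams = {uid: ngrams(toks, n) for uid, (toks, _) in utterances.items()}
    covered: Set[NGram] = set()
    selected: List[str] = []
    remaining = budget
    candidates = set(utterances)

    while candidates:
        best_id, best_ratio = None, 0.0
        # Sorted iteration makes tie-breaking deterministic.
        for uid in sorted(candidates):
            duration = utterances[uid][1]
            if duration > remaining:
                continue
            gain = len(grams[uid] - covered)
            ratio = gain / duration
            if ratio > best_ratio:
                best_id, best_ratio = uid, ratio
        if best_id is None:
            break
        selected.append(best_id)
        covered |= grams[best_id]
        remaining -= utterances[best_id][1]
        candidates.remove(best_id)

    return selected, covered


# Hypothetical toy corpus: id -> (tokens, duration in seconds).
corpus = {
    "u1": (["a", "b", "c", "d"], 4.0),
    "u2": (["a", "b"], 1.0),
    "u3": (["c", "d", "e"], 2.0),
    "u4": (["a", "b", "c"], 2.0),
}
selected, covered = greedy_select(corpus, budget=5.0, n=2)
```

The gain-per-cost rule is a common heuristic for coverage objectives under a budget; the paper's multi-label constraints and time-continuity features would need additional terms that this sketch does not attempt to model.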

Acknowledgements

This work is partly supported by the National Science Foundation under Grant 1747818 and by the Fundamental Research Funds for the Central Universities (JB161004).

Author information

Authors and Affiliations

School of Computer Science and Technology, Xidian University, No. 2 South Taibai Road, Xi’an, 710071, Shaanxi, China
Luobing Dong
School of Science, Beijing University of Chemical Technology, Beijing, China
Qiumin Guo
Department of Computer Science, University of Texas at Dallas, Dallas, TX, USA
Weili Wu

Corresponding author

Correspondence to Luobing Dong.

About this article

Cite this article

Dong, L., Guo, Q. & Wu, W. Speech corpora subset selection based on time-continuous utterances features. J Comb Optim 37, 1237–1248 (2019). https://doi.org/10.1007/s10878-018-0350-2

Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s10878-018-0350-2
