Abstract
This paper describes several text-independent speech segmentation methods. State-of-the-art applications and prospective uses of automatic speech segmentation techniques are presented, including the direct applicability of automatic segmentation to recognition, coding, and speech corpora annotation, which is a central issue in today’s speech technology. Moreover, a novel parametric segmentation algorithm is presented, and its performance is evaluated against other text-independent speech segmentation methods proposed in the literature.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Esposito, A., Aversano, G. (2005). Text Independent Methods for Speech Segmentation. In: Chollet, G., Esposito, A., Faundez-Zanuy, M., Marinaro, M. (eds) Nonlinear Speech Modeling and Applications. NN 2004. Lecture Notes in Computer Science, vol 3445. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11520153_12
DOI: https://doi.org/10.1007/11520153_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27441-4
Online ISBN: 978-3-540-31886-6
eBook Packages: Computer Science, Computer Science (R0)