Abstract
Recovering sentence boundaries from speech and its transcripts is essential for readability and for downstream speech and language processing tasks. In this paper, we propose using a deep recurrent neural network to detect sentence boundaries in broadcast news by modeling rich prosodic and lexical features extracted at each inter-word position. We introduce an unsupervised word embedding, learned with the Continuous Bag-of-Words (CBOW) model, to represent word identity as an effective feature for the sentence boundary detection task. This word embedding contains syntactic information that is essential for the task. In addition, we propose two further low-dimensional word embeddings, learned in a supervised manner from a neural network that incorporates class and context information: one is extracted from the projection layer, and the other from the last hidden layer. Furthermore, we propose a deep bidirectional Long Short-Term Memory (LSTM) architecture with Viterbi decoding for sentence boundary detection. Under this framework, the long-range dependencies of prosodic and lexical information in temporal sequences are modeled effectively. Compared with the previous state-of-the-art DNN-CRF method, the proposed LSTM approach reduces the NIST SU error by a relative 24.8% on reference transcripts and 9.8% on recognition transcripts.
Notes
In the word2vec tool, the energy function is simply defined as \(E(A, C) = -(A \cdot C)\), where A is the vector of a word and C is the sum of the context vectors of A. The probability is then \(p(A|C)=\frac {e^{-E(A,C)}}{{\sum }_{v=1}^{V}e^{-E(W_{v},C)}}\), where V is the vocabulary size and \(W_{v}\) is the vector of the v-th word.
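The energy-based probability in the note above is a softmax over dot products between each word vector and the summed context vector. The following is a minimal sketch of that computation, not the word2vec implementation itself: the function names and the toy vectors are hypothetical, and it evaluates the full sum over the vocabulary directly, whereas practical training typically relies on approximations such as hierarchical softmax or negative sampling.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cbow_probabilities(word_vectors, context_vectors):
    """Return p(W_v | C) for every word v in the vocabulary.

    word_vectors: list of V word vectors (hypothetical toy vocabulary).
    context_vectors: vectors of the context words; C is their sum.
    """
    dim = len(context_vectors[0])
    # C is the sum of the context vectors, as stated in the note.
    c = [sum(vec[i] for vec in context_vectors) for i in range(dim)]
    # -E(W_v, C) = W_v . C, so each score is a plain dot product.
    scores = [dot(w, c) for w in word_vectors]
    # Subtracting the maximum score leaves the softmax unchanged
    # but avoids overflow in exp().
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up 2-dimensional vectors for a three-word vocabulary.
vocab = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
context = [[0.5, 0.0], [0.5, 0.0]]
probs = cbow_probabilities(vocab, context)
```

With these toy values the summed context vector is (1, 0), so the first and third words receive the same probability and the second word receives a lower one.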
The corresponding Wikipedia data set with sentence boundaries is used.
LDC2005S16, LDC2004S08 for speech data and LDC2005T24, LDC2004T12 for reference transcriptions.
Available at: http://www.cs.waikato.ac.nz/ml/weka/index.html.
Available at: https://code.google.com/p/crfpp/.
The initial dimension parameter of the tool is equal to each vector’s size. The perplexity parameter is 50.
Cite this article
Xu, C., Xie, L. & Xiao, X. A Bidirectional LSTM Approach with Word Embeddings for Sentence Boundary Detection. J Sign Process Syst 90, 1063–1075 (2018). https://doi.org/10.1007/s11265-017-1289-8
DOI: https://doi.org/10.1007/s11265-017-1289-8