
A Bidirectional LSTM Approach with Word Embeddings for Sentence Boundary Detection

Published in: Journal of Signal Processing Systems

Abstract

Recovering sentence boundaries from speech and its transcripts is essential for readability and for downstream speech and language processing tasks. In this paper, we propose a deep recurrent neural network to detect sentence boundaries in broadcast news by modeling rich prosodic and lexical features extracted at each inter-word position. We introduce an unsupervised word embedding, learned with the Continuous Bag-of-Words (CBOW) model, as an effective feature representing word identity in the sentence boundary detection task. This word embedding carries syntactic information that is essential for the detection task. In addition, we propose two further low-dimensional word embeddings, derived by supervised learning from a neural network that incorporates class and context information: one is extracted from the projection layer, the other from the last hidden layer. Furthermore, we propose a deep bidirectional Long Short-Term Memory (LSTM) architecture with Viterbi decoding for sentence boundary detection. Under this framework, long-range dependencies of prosodic and lexical information in temporal sequences are modeled effectively. Compared with the previous state-of-the-art DNN-CRF method, the proposed LSTM approach reduces the NIST SU error by 24.8% and 9.8% relative on reference and recognition transcripts, respectively.
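The Viterbi decoding step mentioned in the abstract can be sketched in a few lines. The code below is a minimal, illustrative implementation of Viterbi decoding over per-position class scores (such as the boundary/no-boundary posteriors a bidirectional LSTM would emit at each inter-word position); the score values and the zero transition matrix are toy assumptions, not the authors' actual model.

```python
import math

def viterbi(emissions, transitions):
    """Find the most likely label sequence.

    emissions: list of per-position log-score vectors, e.g.
               [score_no_boundary, score_boundary] at each inter-word position.
    transitions: square matrix of log transition scores, where
                 transitions[i][j] scores moving from label i to label j.
    Returns the highest-scoring label index sequence.
    """
    n_labels = len(emissions[0])
    # scores[j]: best log score of any path ending in label j so far
    scores = list(emissions[0])
    backpointers = []
    for emit in emissions[1:]:
        step = []
        new_scores = []
        for j in range(n_labels):
            # Best previous label to transition into label j
            best_i = max(range(n_labels),
                         key=lambda i: scores[i] + transitions[i][j])
            step.append(best_i)
            new_scores.append(scores[best_i] + transitions[best_i][j] + emit[j])
        scores = new_scores
        backpointers.append(step)
    # Trace the best path back from the final position
    best = max(range(n_labels), key=lambda j: scores[j])
    path = [best]
    for step in reversed(backpointers):
        best = step[best]
        path.append(best)
    path.reverse()
    return path
```

With a zero transition matrix the decoder reduces to a per-position argmax; non-zero transition scores let it discourage implausible patterns such as two consecutive sentence boundaries.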

[Figures 1–4]


Notes

  1. https://catalog.ldc.upenn.edu/LDC2005T24.

  2. http://www.darpa.mil/iao/EARS.htm.

  3. http://www.nist.gov/speech/tests/rt/.

  4. In the word2vec tool, the energy function is simply defined as \(E(A,C) = -(A \cdot C)\), where A is the vector of a word and C is the sum of the context vectors of A. The probability is then \(p(A|C)=\frac{e^{-E(A,C)}}{\sum_{v=1}^{V}e^{-E(W_v,C)}}\).
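The softmax probability defined in this note is easy to check numerically. The sketch below computes \(p(A|C)\) from the energy \(E(A,C)=-(A \cdot C)\) over a toy vocabulary of hand-picked 2-dimensional vectors; all vectors are illustrative assumptions, not trained embeddings.

```python
import math

def cbow_probability(word_vec, context_vec, vocab_vecs):
    """p(A|C) = exp(-E(A,C)) / sum_v exp(-E(W_v,C)), with E(A,C) = -(A . C)."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    numerator = math.exp(dot(word_vec, context_vec))
    denominator = sum(math.exp(dot(w, context_vec)) for w in vocab_vecs)
    return numerator / denominator

# Toy vocabulary of three 2-dimensional word vectors (illustrative only)
vocab = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
context = [1.0, 0.0]  # sum of the context word vectors
probs = [cbow_probability(w, context, vocab) for w in vocab]
```

A word vector aligned with the context sum receives the highest probability, and the probabilities over the vocabulary sum to one, as the normalization requires.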

  5. http://mattmahoney.net/dc/text8.zip.

  6. https://catalog.ldc.upenn.edu/LDC2004T12.

  7. https://catalog.ldc.upenn.edu/LDC2005T24.

  8. https://code.google.com/p/word2vec/.

  9. The corresponding Wikipedia data set with sentence boundaries is used.

  10. LDC2005S16, LDC2004S08 for speech data and LDC2005T24, LDC2004T12 for reference transcriptions.

  11. http://www.itl.nist.gov/iad/mig/tests/rt/2003-fall/.

  12. See http://www.itl.nist.gov/iad/894.01/tests/rt/2004-fall/.

  13. Available at: http://www.cs.waikato.ac.nz/ml/weka/index.html.

  14. Available at: https://code.google.com/p/crfpp/.

  15. http://sourceforge.net/projects/currennt/.

  16. The initial dimension parameter of the tool is equal to each vector’s size. The perplexity parameter is 50.


Author information


Corresponding author

Correspondence to Chenglin Xu.


About this article


Cite this article

Xu, C., Xie, L. & Xiao, X. A Bidirectional LSTM Approach with Word Embeddings for Sentence Boundary Detection. J Sign Process Syst 90, 1063–1075 (2018). https://doi.org/10.1007/s11265-017-1289-8

