Improving Efficiency of Sentence Boundary Detection by Feature Selection

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9622)

Abstract

The goal of sentence boundary detection (SBD) is to predict the presence or absence of a sentence boundary in an unstructured word sequence that contains no punctuation. In this paper, we propose a feature selection approach to obtain more effective features for the SBD classifier. Specifically, each observed word is assessed for its correlation with the sentence boundary, measured by pointwise mutual information, before it is used as a feature of the classifier. Using a linear-chain CRF model to predict the sentence boundaries of a text sequence, experiments on a part of the English Gigaword Second Edition corpus show that the proposed method reduces the number of model parameters by up to 44.87% while maintaining an F1-score comparable to that of the original model.
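The selection step described in the abstract can be illustrated with a minimal sketch. It uses the standard definition of pointwise mutual information, \(\mathrm{PMI}(w, b) = \log_2 \frac{p(w, b)}{p(w)\,p(b)}\), where \(w\) is a word and \(b\) is the event that a sentence boundary follows it. The abstract does not state the exact selection criterion, so the labelling scheme, the function names `pmi_with_boundary` and `select_words`, and the threshold value are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def pmi_with_boundary(tokens, labels):
    """Return {word: PMI(word, boundary)} for words seen before a boundary.

    tokens: list of words.
    labels: parallel list of booleans, True when a sentence boundary
            follows the word (a hypothetical labelling scheme).
    Words never observed before a boundary have PMI of minus infinity
    and are simply left out of the result.
    """
    n = len(tokens)
    word_counts = Counter(tokens)
    joint_counts = Counter(w for w, b in zip(tokens, labels) if b)
    p_boundary = sum(labels) / n

    scores = {}
    for word, joint in joint_counts.items():
        p_word = word_counts[word] / n
        p_joint = joint / n
        scores[word] = math.log2(p_joint / (p_word * p_boundary))
    return scores


def select_words(scores, threshold=0.0):
    """Keep words whose PMI exceeds the (assumed) threshold."""
    return {w for w, s in scores.items() if s > threshold}


# Toy data: "b" is always followed by a boundary, "a" never is.
tokens = ["a", "b", "a", "b"]
labels = [False, True, False, True]
scores = pmi_with_boundary(tokens, labels)
selected = select_words(scores, threshold=0.5)
```

In this toy input, only "b" would be kept as a word feature for the classifier; words that are filtered out contribute no parameters to the model, which is the mechanism by which a selection step of this kind can shrink the parameter count.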

Notes

  1. The formula is explained in the documentation of the CRF++ toolkit: https://taku910.github.io/crfpp/.

  2. https://catalog.ldc.upenn.edu/LDC2005T12.

Acknowledgements

This work is supported by the DSO-funded project MAISON DSOCL14045.

Author information

Corresponding author

Correspondence to Thi-Nga Ho.

Copyright information

© 2016 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ho, TN., Chong, T.Y., Do, V.H., Pham, V.T., Chng, E.S. (2016). Improving Efficiency of Sentence Boundary Detection by Feature Selection. In: Nguyen, N.T., Trawiński, B., Fujita, H., Hong, TP. (eds) Intelligent Information and Database Systems. ACIIDS 2016. Lecture Notes in Computer Science, vol 9622. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49390-8_58

  • DOI: https://doi.org/10.1007/978-3-662-49390-8_58

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-49389-2

  • Online ISBN: 978-3-662-49390-8

  • eBook Packages: Computer Science, Computer Science (R0)
