An Iterative Approach to Text Segmentation

  • Conference paper
Advances in Information Retrieval (ECIR 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6611)

Abstract

We present divSeg, a novel method for text segmentation that iteratively splits a portion of text at its weakest point in terms of the connectivity strength between two adjacent parts. To search for the weakest point, we apply two different measures: one is based on language modeling of text segmentation and the other on the interconnectivity between two segments. Our solution produces a deep and narrow binary tree – a dynamic object that describes the structure of a text and that is fully adaptable to a user’s segmentation needs. We treat it as a separate task to flatten the tree into a broad and shallow hierarchy, either through supervised learning on a document set or through explicit input of how a text should be segmented. The rich structure of the resulting tree further allows us to segment documents at varying levels such as topic, sub-topic, etc. We evaluated our new solution on a set of 265 articles from Discover magazine where the topic structures are unknown and need to be discovered. Our experimental results show that the iterative approach has the potential to generate better segmentation results than several leading baselines, and the separate flattening step allows us to adapt the results to different levels of detail and user preferences.
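
The split-then-flatten idea described above can be illustrated with a minimal Python sketch. This is not the divSeg implementation: the abstract does not define its two connectivity measures, so the sketch assumes plain cosine similarity between the bag-of-words vectors of the two adjacent parts as a stand-in. The names `Node`, `build_tree`, and `flatten` are hypothetical, and the flattening rule (expand the weakest remaining split until `k` segments exist) is an assumed substitute for the supervised or user-driven flattening the abstract mentions.

```python
from collections import Counter
from dataclasses import dataclass
from math import sqrt
from typing import List, Optional


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two bag-of-words vectors (assumed stand-in measure)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass(eq=False)  # compare nodes by identity, not by field values
class Node:
    start: int                       # first sentence index (inclusive)
    end: int                         # last sentence index (exclusive)
    strength: float = 0.0            # connectivity at this node's split point
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(sents: List[List[str]], start: int, end: int) -> Node:
    """Recursively split sents[start:end] at its weakest boundary."""
    node = Node(start, end)
    if end - start < 2:              # a single sentence cannot be split further
        return node
    best_cut, best_score = start + 1, None
    for cut in range(start + 1, end):
        left = Counter(w for s in sents[start:cut] for w in s)
        right = Counter(w for s in sents[cut:end] for w in s)
        score = cosine(left, right)
        if best_score is None or score < best_score:
            best_cut, best_score = cut, score
    node.strength = best_score
    node.left = build_tree(sents, start, best_cut)
    node.right = build_tree(sents, best_cut, end)
    return node


def flatten(root: Node, k: int) -> List[int]:
    """Flatten the tree into at most k segments; return the boundary indices."""
    frontier = [root]
    while len(frontier) < k:
        splittable = [n for n in frontier if n.left is not None]
        if not splittable:
            break
        weakest = min(splittable, key=lambda n: n.strength)
        i = frontier.index(weakest)
        frontier[i:i + 1] = [weakest.left, weakest.right]
    return [n.start for n in frontier[1:]]


if __name__ == "__main__":
    # Toy input: each sentence is already tokenized.
    sentences = [
        ["cats", "purr", "cats"],
        ["cats", "sleep", "purr"],
        ["stars", "shine", "orbit"],
        ["stars", "orbit", "planets"],
    ]
    tree = build_tree(sentences, 0, len(sentences))
    print(flatten(tree, 2))
```

Keeping the full binary tree and deferring the choice of `k` to `flatten` mirrors the separation the abstract draws between building the structure and adapting it to a desired level of detail.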

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Song, F., Darling, W.M., Duric, A., Kroon, F.W. (2011). An Iterative Approach to Text Segmentation. In: Clough, P., et al. Advances in Information Retrieval. ECIR 2011. Lecture Notes in Computer Science, vol 6611. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20161-5_63

  • DOI: https://doi.org/10.1007/978-3-642-20161-5_63

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20160-8

  • Online ISBN: 978-3-642-20161-5

  • eBook Packages: Computer Science, Computer Science (R0)
