Effect of Preprocessing on Extractive Summarization with Maximal Frequent Sequences

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5317)

Abstract

Extractive summarization produces a summary by extracting a subset of text segments, such as sentences, from the original text and concatenating them. Sentences are selected according to the terms they contain, which can be single words or multiword expressions. In previous work, we proposed Maximal Frequent Sequences as such terms. In this paper, we investigate how preprocessing affects the selection of these sequences. Our results suggest that, contrary to expectations, preprocessing does not seriously affect the accuracy of the method, which is both bad and good news, as we show.
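
As a rough illustration of the idea in the abstract, the sketch below finds maximal frequent word sequences and uses them as terms to pick sentences. It is not the authors' implementation: the function names, the whitespace tokenizer, the optional stop-word filter (standing in for a preprocessing step), the occurrence-based frequency count, and the length-weighted sentence score are all assumptions made for this example.

```python
from collections import Counter


def frequent_ngrams(sentences, n, threshold):
    """Contiguous n-grams occurring at least `threshold` times overall."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram for gram, c in counts.items() if c >= threshold}


def maximal_frequent_sequences(sentences, threshold=2):
    """Frequent n-grams that are not part of any longer frequent n-gram."""
    levels = []
    n = 1
    while True:
        level = frequent_ngrams(sentences, n, threshold)
        if not level:
            break  # no frequent n-grams means no frequent (n+1)-grams
        levels.append(level)
        n += 1
    maximal = set()
    for k, level in enumerate(levels):
        longer = levels[k + 1] if k + 1 < len(levels) else set()
        # An n-gram is non-maximal iff it is a prefix or suffix
        # of some frequent (n+1)-gram.
        covered = {g[:-1] for g in longer} | {g[1:] for g in longer}
        maximal |= level - covered
    return maximal


def contains(tokens, seq):
    """True if `seq` occurs contiguously in `tokens`."""
    n = len(seq)
    return any(tuple(tokens[i:i + n]) == seq
               for i in range(len(tokens) - n + 1))


def summarize(sentences, k=1, threshold=2, stop_words=frozenset()):
    """Return the k best-scoring sentences in their original order.

    `stop_words` is a hypothetical preprocessing hook: listed words are
    dropped before the sequences are mined.
    """
    tokenized = [[w for w in s.lower().split() if w not in stop_words]
                 for s in sentences]
    terms = maximal_frequent_sequences(tokenized, threshold)
    # Assumed scoring: sum of the lengths of the terms a sentence contains.
    scores = [sum(len(t) for t in terms if contains(tokens, t))
              for tokens in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]


docs = ["the cat sat on the mat", "a dog barked", "the cat sat on the sofa"]
print(summarize(docs, k=2))
```

Running `summarize` with and without a stop-word set on the same input is one simple way to observe how a preprocessing step changes the mined sequences and the selected sentences.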

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ledeneva, Y. (2008). Effect of Preprocessing on Extractive Summarization with Maximal Frequent Sequences. In: Gelbukh, A., Morales, E.F. (eds) MICAI 2008: Advances in Artificial Intelligence. MICAI 2008. Lecture Notes in Computer Science, vol 5317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88636-5_11

  • DOI: https://doi.org/10.1007/978-3-540-88636-5_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-88635-8

  • Online ISBN: 978-3-540-88636-5

  • eBook Packages: Computer Science, Computer Science (R0)