Abstract
Extractive summarization produces a summary by extracting a subset of text segments, such as sentences, from the original text and concatenating them. Sentences are selected according to the terms they contain, which can be single words or multiword expressions. In previous work, we proposed Maximal Frequent Sequences as such terms. In this paper, we investigate how preprocessing affects the selection of these sequences. Our results suggest that, contrary to expectations, preprocessing does not seriously affect the accuracy of the method, which is both bad and good news, as we show.
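To make the idea concrete, the following is a minimal sketch of sentence selection driven by maximal frequent sequences, not the authors' implementation. It assumes that sequences are contiguous word n-grams, that frequency is the number of occurrences in the text, and that a sentence is scored by the total length of the maximal frequent sequences it contains. The stop-word list, the threshold `min_freq`, and all function names are hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not taken from the paper).
STOP_WORDS = {"the", "a", "of", "is", "and", "in", "to", "on"}

def tokenize(sentence, remove_stop_words=False):
    """Lowercase a sentence and split it into words; optionally drop stop words."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    if remove_stop_words:
        words = [w for w in words if w not in STOP_WORDS]
    return words

def is_contiguous_subsequence(short, long):
    """True if the tuple `short` occurs as a contiguous run inside the tuple `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))

def maximal_frequent_sequences(tokenized, min_freq=2):
    """Return word n-grams occurring at least `min_freq` times that are not
    contained in any longer frequent n-gram (brute force, for small inputs)."""
    counts = Counter()
    for words in tokenized:
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                counts[tuple(words[i:j])] += 1
    frequent = {seq for seq, c in counts.items() if c >= min_freq}
    return {s for s in frequent
            if not any(s != t and is_contiguous_subsequence(s, t) for t in frequent)}

def summarize(sentences, k=1, min_freq=2, remove_stop_words=False):
    """Pick the k highest-scoring sentences and return them in original order."""
    tokenized = [tokenize(s, remove_stop_words) for s in sentences]
    mfs = maximal_frequent_sequences(tokenized, min_freq)

    def score(words):
        t = tuple(words)
        return sum(len(seq) for seq in mfs if is_contiguous_subsequence(seq, t))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `summarize(sentences, k=2)` and `summarize(sentences, k=2, remove_stop_words=True)` on the same input lets one compare the selected sentences with and without one preprocessing step. The brute-force enumeration of n-grams is quadratic in sentence length and is meant only for toy inputs.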
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ledeneva, Y. (2008). Effect of Preprocessing on Extractive Summarization with Maximal Frequent Sequences. In: Gelbukh, A., Morales, E.F. (eds) MICAI 2008: Advances in Artificial Intelligence. MICAI 2008. Lecture Notes in Computer Science, vol 5317. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88636-5_11
DOI: https://doi.org/10.1007/978-3-540-88636-5_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88635-8
Online ISBN: 978-3-540-88636-5
eBook Packages: Computer Science, Computer Science (R0)