Abstract
Linear text segmentation has been an ongoing focus of NLP research for the last decade, with potential applications in document summarization, information retrieval, and text understanding. Two problems remain critical: automatically detecting segment boundaries and automatically determining the number of segments in a document. In this paper, we propose a new domain-independent statistical model for linear text segmentation. The model uses a Multiple Discriminant Analysis (MDA) criterion function to search globally for the best segmentation, defined as the one with the largest word similarity within segments and the smallest word similarity between segments. Genetic algorithms (GAs) are used to alleviate the high computational cost of this search. Comparative experiments show that our method based on MDA criterion functions achieves a higher Pk measure (Beeferman) than the baseline system using the TextTiling algorithm.
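For readers unfamiliar with the evaluation metric named above, the following is a minimal Python sketch of Pk as it is commonly defined following Beeferman et al.: slide a probe of width k over the text and count how often the reference and the hypothesis disagree on whether the two probe ends fall in the same segment. The function names, the representation of a segmentation as a list of segment lengths, and the default choice of k (half the mean reference segment length) are assumptions made for illustration and are not taken from the paper. Under this definition Pk is an error rate, with 0 meaning agreement on every probed pair; the abstract does not state whether the paper reports it in this form or a complemented form.

```python
from typing import List, Optional


def lengths_to_labels(segment_lengths: List[int]) -> List[int]:
    """Expand segment lengths such as [2, 3] into per-unit segment ids [0, 0, 1, 1, 1]."""
    labels: List[int] = []
    for seg_id, length in enumerate(segment_lengths):
        labels.extend([seg_id] * length)
    return labels


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Fraction of probe pairs (i, i + k) on which the two segmentations disagree.

    Both arguments are lists of segment lengths over the same number of text units
    (an illustrative input format, assumed here).
    """
    ref = lengths_to_labels(reference)
    hyp = lengths_to_labels(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("segmentations must cover the same number of units")
    if k is None:
        # Common convention (assumed here): half the mean reference segment length.
        k = max(1, round(len(ref) / len(reference) / 2))
    probes = len(ref) - k
    if probes <= 0:
        raise ValueError("text is too short for the chosen window size k")
    errors = 0
    for i in range(probes):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / probes


if __name__ == "__main__":
    # Two 5-unit segments in the reference; the hypothesis misses the boundary.
    print(pk([5, 5], [10], k=2))
```

Passing k explicitly, as in the usage line, keeps results comparable across systems when the reference segment lengths vary.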
References
Salton, G., Singhal, A., Buckley, C., Mitra, M.: Automatic text decomposition using text segments and text themes. In: Proceedings of the seventh ACM conference on Hypertext, Bethesda, Maryland, United States, pp. 53–65 (1996)
Hearst, M.A.: Multi-paragraph segmentation of expository text. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Las Cruces, New Mexico, pp. 9–16 (1994)
Hearst, M.A.: TextTiling: segmenting text into multi-paragraph subtopic passages. Computational Linguistics 23(1), 33–64 (1997)
Youmans, G.: A new tool for discourse analysis: The vocabulary management profile. Language 67(4), 763–789 (1991)
Morris, J., Hirst, G.: Lexical cohesion computed by thesauri relations as an indicator of the structure of text. Computational Linguistics 17(1), 21–42 (1991)
Kozima, H.: Text segmentation based on similarity between words. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Student Session, pp. 286–288 (1993)
Reynar, J.C.: An automatic method of finding topic boundaries. In: Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, Student Session, Las Cruces, New Mexico, pp. 331–333 (1994)
Beeferman, D., Berger, A., Lafferty, J.: Text segmentation using exponential models. In: Proceedings of the Second Conference on Empirical Methods in Natural Language Processing, Providence, Rhode Island, pp. 35–46 (1997)
Passonneau, R., Litman, D.J.: Intention-based segmentation: Human reliability and correlation with linguistic cues. In: Proceedings of the 31st Meeting of the Association for Computational Linguistics, pp. 148–155 (1993)
Ponte, J.M., Croft, W.B.: Text segmentation by topic. In: Proceedings of the first European conference on research and advanced technology for digital libraries. U.Mass. Computer Science Technical Report TR97-18 (1997)
Reynar, J.C.: Statistical models for topic segmentation. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pp. 357–364 (1999)
Hirschberg, J., Grosz, B.: Intentional features of local and global discourse. In: Proceedings of the Workshop on Spoken Language Systems, pp. 441–446 (1992)
Choi, F.Y.Y.: Advances in domain independent linear text segmentation. In: Proc. of NAACL-2000 (2000)
Choi, F.Y.Y., Wiemer-Hastings, P., Moore, J.: Latent semantic analysis for text segmentation. In: Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing, pp. 109–117 (2001)
Blei, D.M., Moreno, P.J.: Topic segmentation with an aspect hidden Markov model. Tech. Rep. CRL 2001-07, COMPAQ Cambridge Research Lab (2001)
Yaari, Y.: Segmentation of expository texts by hierarchical agglomerative clustering. In: Proceedings of the conference on recent advances in natural language processing, pp. 59–65 (1997)
Heinonen, O.: Optimal multi-paragraph text segmentation by dynamic programming. In: Proceedings of 17th international conference on computational linguistics, pp. 1484–1486 (1998)
Utiyama, M., Isahara, H.: A statistical model for domain-independent text segmentation. In: Proceedings of the 9th conference of the European chapter of the association for computational linguistics, pp. 491–498 (2001)
Kehagias, A., Fragkou, P., Petridis, V.: Linear Text Segmentation using a Dynamic Programming Algorithm. In: Proceedings of 10th Conference of European chapter of the association for computational linguistics (2003)
Duda, R., Hart, P., Stork, D.: Pattern Classification, 2nd edn. John Wiley & Sons, Chichester (2001)
Tou, J.T., Gonzalez, R.C.: Pattern recognition principles. Addison-Wesley Publishing Company, Reading (1974)
Mitchell, T.M.: Machine Learning. McGraw-Hill, New York (1997)
Tianshun, Y., Jingbo, Z., Li, Z., Ying, Y.: Natural language processing - research on making computers understand human languages. Tsinghua University Press (2002)
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Jingbo, Z., Na, Y., Xinzhi, C., Wenliang, C., Tsou, B.K. (2005). Using Multiple Discriminant Analysis Approach for Linear Text Segmentation. In: Dale, R., Wong, KF., Su, J., Kwong, O.Y. (eds) Natural Language Processing – IJCNLP 2005. IJCNLP 2005. Lecture Notes in Computer Science, vol 3651. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11562214_26
DOI: https://doi.org/10.1007/11562214_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29172-5
Online ISBN: 978-3-540-31724-1
eBook Packages: Computer Science (R0)