Abstract
With the rapid growth of the Web, there is a need for high-performance techniques for document collection and classification. The goal of our research is to develop a platform that discovers English, traditional Chinese, and simplified Chinese documents on the Web in the Greater China area and classifies them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added and the features of subject classes change constantly. Second, the documents must be classified within a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses a PAT tree as a working structure to extract keyterms from the documents in each subject class and then updates the features of the class accordingly. This feedback contributes immediately to the classification of incoming documents. In addition, we use a manually constructed set of keyterms as the basis for document classification in a large-scale taxonomy. Two sets of experiments were conducted to evaluate classification performance in a dynamic collection and in a large-scale taxonomy, respectively. Both experiments yielded encouraging results. We further suggest an approach, extended from the PAT-tree-based working structure, to deal with the classification of multilingual documents.
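The feedback loop described in the abstract (extract keyterms per class, update the class features, classify the next document with the updated features) can be illustrated with a minimal sketch. This is not the authors' implementation: a plain per-class counter of character n-grams stands in for the PAT tree, and the class name `IncrementalKeytermClassifier`, the `top_k` cutoff, the n-gram lengths, and the overlap score are all assumptions made for illustration. Character n-grams are used only because they need no word segmentation, so the same code accepts English and Chinese text.

```python
from collections import Counter, defaultdict


def ngrams(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


class IncrementalKeytermClassifier:
    """Hypothetical stand-in: per-class n-gram counts instead of a PAT tree."""

    def __init__(self, top_k=50):
        self.top_k = top_k                  # assumed cutoff for "keyterms"
        self.counts = defaultdict(Counter)  # class label -> n-gram frequencies

    def update(self, label, text):
        """Fold one labelled document into the counts of its class."""
        self.counts[label].update(ngrams(text))

    def keyterms(self, label):
        """Return the top_k most frequent n-grams currently seen for a class."""
        return {term for term, _ in self.counts[label].most_common(self.top_k)}

    def classify(self, text):
        """Pick the class whose current keyterms overlap most with the document."""
        doc_terms = set(ngrams(text))
        scores = {label: len(doc_terms & self.keyterms(label))
                  for label in self.counts}
        return max(scores, key=scores.get) if scores else None
```

In use, each newly labelled document is passed to `update(label, text)`, and the next call to `classify(text)` already sees the changed keyterm set, which mirrors the immediate feedback the abstract describes. A real PAT tree would additionally support efficient substring frequency lookups and more selective keyterm extraction than raw frequency counts.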
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chien, LF., Huang, CK., Chiao, HC., Lin, SJ. (2002). Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_50
DOI: https://doi.org/10.1007/3-540-47887-6_50
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive