
Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2002)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)


With the rapid growth of the Web, there is a need for high-performance techniques for document collection and classification. The goal of our research is to develop a platform that discovers English, traditional Chinese, and simplified Chinese documents on the Web in the Greater China area and classifies them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added and the features of subject classes change constantly. Second, the documents must be classified into a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses a PAT tree as a working structure to extract keyterms from the documents in each subject class and then updates the features of the class accordingly. This feedback contributes immediately to the classification of incoming documents. In addition, we use a manually constructed set of keyterms as the basis for document classification in a large-scale taxonomy. Two sets of experiments were conducted to evaluate classification performance in a dynamic collection and in a large-scale taxonomy, respectively. Both experiments yielded encouraging results. We further suggest an approach, extended from the PAT-tree-based working structure, to deal with the classification of multilingual documents.
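The feedback loop described in the abstract (extract keyterms per class, update the class features, classify the next incoming document) can be illustrated with a minimal sketch. It is not the authors' method: a plain character n-gram counter stands in for the PAT tree, and the class name `IncrementalKeytermClassifier`, the `min_count` threshold, and the overlap-based score are all assumptions made for illustration. Character n-grams are used only because they apply to Chinese and English text without a word segmenter.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max in text."""
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


class IncrementalKeytermClassifier:
    """Hypothetical stand-in for the PAT-tree working structure.

    Each class keeps running n-gram counts; an n-gram seen at least
    min_count times in a class is treated as a keyterm of that class.
    """

    def __init__(self, min_count=2):
        self.min_count = min_count
        self.counts = defaultdict(Counter)  # class label -> n-gram counts

    def add_document(self, label, text):
        # Incremental update: only the new document is scanned.
        self.counts[label].update(char_ngrams(text))

    def keyterms(self, label):
        return {t for t, c in self.counts[label].items() if c >= self.min_count}

    def classify(self, text):
        # Score each class by how many of its keyterms occur in the text.
        grams = set(char_ngrams(text))
        scores = {label: len(grams & self.keyterms(label)) for label in self.counts}
        return max(scores, key=scores.get) if scores else None


clf = IncrementalKeytermClassifier(min_count=2)
clf.add_document("finance", "股票市場上漲")
clf.add_document("finance", "股票市場下跌")
clf.add_document("sports", "籃球比賽結果")
clf.add_document("sports", "籃球比賽開始")
print(clf.classify("今日股票市場"))
```

Because each call to `add_document` changes the counts in place, a newly added document affects the very next `classify` call, which is the property the abstract emphasizes. A real PAT tree would additionally support efficient lookup of arbitrary-length repeated strings, which this sketch does not attempt.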





Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chien, LF., Huang, CK., Chiao, HC., Lin, SJ. (2002). Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg.


  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43704-8

  • Online ISBN: 978-3-540-47887-4

  • eBook Packages: Springer Book Archive
