Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web

Chien, Lee-Feng; Huang, Chien-Kang; Chiao, Hsin-Chen; Lin, Shih-Jui

doi:10.1007/3-540-47887-6_50

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2336)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

With the rapid growth of the Web, there is a need for high-performance techniques for document collection and classification. The goal of our research is to develop a platform that discovers English, traditional Chinese, and simplified Chinese documents on the Web in the Greater China area and classifies them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added and the features of subject classes change constantly. Second, the documents should be classified into a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses a PAT tree as a working structure to extract keyterms from the documents in each subject class and then updates the features of the class accordingly. This feedback contributes immediately to the classification of incoming documents. In addition, we make use of a manually constructed set of keyterms as the basis for document classification in a large-scale taxonomy. Two sets of experiments were conducted to evaluate classification performance in a dynamic collection and in a large-scale taxonomy, respectively. Both experiments yielded encouraging results. We further suggest an approach, extended from the PAT-tree-based working structure, to deal with the classification of multilingual documents.
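
The feedback loop described in the abstract (extract keyterms per class, update the class features, classify the next document) can be illustrated with a minimal sketch. It is not the authors' implementation: a plain per-class counter of character n-grams stands in for the PAT tree, and the names `ngrams`, `IncrementalKeytermIndex`, `min_count`, and the overlap-based scoring are assumptions made only for this illustration. Character n-grams are used because they need no word segmentation, which suits mixed Chinese and English text.

```python
from collections import Counter, defaultdict


def ngrams(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max (hypothetical helper)."""
    text = text.lower()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


class IncrementalKeytermIndex:
    """Per-class n-gram counts, updated as documents arrive.

    A Counter replaces the PAT tree here; this is an assumption of the sketch.
    """

    def __init__(self, min_count=2):
        self.min_count = min_count          # frequency threshold for a keyterm
        self.counts = defaultdict(Counter)  # class label -> n-gram -> frequency

    def add_document(self, label, text):
        # Incremental update: only the new document is scanned.
        self.counts[label].update(ngrams(text))

    def keyterms(self, label):
        # Keyterms are the n-grams seen at least min_count times in the class.
        return {t for t, c in self.counts[label].items() if c >= self.min_count}

    def classify(self, text):
        # Score each class by how many of its keyterms occur in the text.
        grams = set(ngrams(text))
        scores = {label: len(grams & self.keyterms(label)) for label in self.counts}
        return max(scores, key=scores.get) if scores else None


index = IncrementalKeytermIndex(min_count=2)
index.add_document("sports", "basketball game tonight")
index.add_document("sports", "the basketball league")
index.add_document("finance", "stock market rally")
index.add_document("finance", "the stock market fell")
print(index.classify("basketball scores"))
```

Because each class keeps only running counts, a newly classified document can be folded back in with `add_document`, and the changed keyterm set takes effect on the very next call to `classify`.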

Author information

Authors and Affiliations

Institute of Information Science, Academia Sinica, China
Lee-Feng Chien, Hsin-Chen Chiao & Shih-Jui Lin
Department of Computer Science and Information Engineering, National Taiwan University, Taiwan
Chien-Kang Huang

Editor information

Editors and Affiliations

EE Department, National Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei, Taiwan, ROC
Ming-Syan Chen
IBM Thomas J. Watson Research Center, 30 Sawmill River Road, Hawthorne, NY, 10532, USA
Philip S. Yu
School of Computing, National University of Singapore, Lower Kent Ridge Road, Singapore, 119260
Bing Liu

About this paper

Cite this paper

Chien, LF., Huang, CK., Chiao, HC., Lin, SJ. (2002). Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web. In: Chen, MS., Yu, P.S., Liu, B. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2002. Lecture Notes in Computer Science, vol 2336. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47887-6_50

DOI: https://doi.org/10.1007/3-540-47887-6_50
Published: 29 April 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43704-8
Online ISBN: 978-3-540-47887-4
eBook Packages: Springer Book Archive
