Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information

Park, Seong-Bae; Zhang, Byoung-Tak

doi:10.1007/3-540-36175-8_9

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Most document classification systems consider only the distribution of content words in documents and ignore the underlying syntactic information, even though it is also an important factor. In this paper, we present an approach for classifying large-scale unstructured documents that incorporates both lexical and syntactic information. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small amount of labeled data is augmented by a large amount of unlabeled data. Since lexical and syntactic information can serve as separate views of unstructured documents, the co-training algorithm improves document classification performance by using both views together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and the TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both lexical and syntactic information.
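
The co-training idea described in the abstract can be illustrated with a minimal sketch. This is a generic co-training loop, not the authors' implementation: the chapter body is not available here, so the stand-in classifier (a tiny naive Bayes named `TinyNB`), the confidence measure (log-score margin), the helper names `co_train` and `classify`, and the toy documents are all assumptions made for illustration. Each document is a pair of token lists, one per view: lexical tokens and syntactic tokens (here, made-up part-of-speech tags).

```python
import math
from collections import Counter

class TinyNB:
    """Multinomial naive Bayes with add-one smoothing (hypothetical stand-in classifier)."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def log_scores(self, tokens):
        n = sum(self.prior.values())
        scores = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c] / n)
            for t in tokens:
                if t in self.vocab:  # tokens never seen in training are skipped
                    s += math.log((self.counts[c][t] + 1) / total)
            scores[c] = s
        return scores

    def predict(self, tokens):
        """Return (label, margin); the margin is used as a confidence score."""
        scores = self.log_scores(tokens)
        best = max(scores, key=scores.get)
        others = [s for c, s in scores.items() if c != best]
        return best, (scores[best] - max(others) if others else 0.0)

def train_views(labeled):
    """Train one classifier per view (0 = lexical, 1 = syntactic)."""
    ys = [y for _, y in labeled]
    return [TinyNB().fit([x[v] for x, _ in labeled], ys) for v in (0, 1)]

def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """Grow the labeled set: each view labels its most confident pool documents."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        picked = {}
        for v, clf in enumerate(train_views(labeled)):
            preds = [(clf.predict(x[v]), i) for i, x in enumerate(pool)]
            preds.sort(key=lambda p: p[0][1], reverse=True)
            for (label, _), i in preds[:per_round]:
                picked.setdefault(i, label)  # on conflict, the first view's label wins
        labeled += [(pool[i], y) for i, y in picked.items()]
        pool = [x for i, x in enumerate(pool) if i not in picked]
    return train_views(labeled), labeled

def classify(clfs, doc):
    """Combine both views by summing their per-class log scores."""
    total = {}
    for v, clf in enumerate(clfs):
        for c, s in clf.log_scores(doc[v]).items():
            total[c] = total.get(c, 0.0) + s
    return max(total, key=total.get)

# Toy data (invented): (lexical tokens, syntactic tokens)
labeled = [((["oil", "price"], ["NN", "VBD"]), "econ"),
           ((["match", "goal"], ["NN", "JJ"]), "sport")]
unlabeled = [(["oil", "barrel"], ["VBD", "RB"]),
             (["goal", "team"], ["JJ", "CC"])]

clfs, grown = co_train(labeled, unlabeled)
print(len(grown), classify(clfs, (["barrel"], ["RB"])))
```

In this toy run, the word "barrel" and the tag "RB" never occur in the labeled documents; they become informative only after the unlabeled documents are pseudo-labeled and added to the training set, which is the effect co-training relies on.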


Author information

Authors and Affiliations

School of Computer Science and Engineering, Seoul National University, Seoul, 151-742, Korea
Seong-Bae Park & Byoung-Tak Zhang

Editor information

Editors and Affiliations

Computer Science Department, Korea Advanced Institute of Science and Technology, 373-1 Koo-Sung Dong, Yoo-Sung Ku, Daejeon, 305-701, Korea
Kyu-Young Whang
Department of Statistics, Seoul National University, Sillimdong Kwanakgu, Seoul, 151-742, Korea
Jongwoo Jeon
School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
Kyuseok Shim
Department of Computer Science and Engineering, University of Minnesota, 200 Union St SE, Minneapolis, MN, 55455, USA
Jaideep Srivastava

About this paper

Cite this paper

Park, SB., Zhang, BT. (2003). Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_9

DOI: https://doi.org/10.1007/3-540-36175-8_9
Published: 30 April 2003
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive
