Abstract
Most document classification systems consider only the distribution of content words in a document and ignore its underlying syntactic information, even though syntax is also an important factor. In this paper, we present an approach for classifying large-scale unstructured documents that incorporates both lexical and syntactic information. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small amount of labeled data is augmented by a large amount of unlabeled data. Since lexical and syntactic information can serve as the two separate views of an unstructured document, the co-training algorithm improves classification performance by using both views together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and the TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both lexical and syntactic information.
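The co-training loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the base learner, so a tiny Naive Bayes classifier stands in for it, and the toy documents, the tag sequences used as a "syntactic view", and the names `NaiveBayes`, `co_train`, and `predict_combined` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial Naive Bayes over token lists (add-one smoothing).
    Assumption: a stand-in base learner, not the one used in the paper."""

    def fit(self, docs, labels):
        self.priors = Counter(labels)
        self.counts = defaultdict(Counter)  # label -> token counts
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.counts.values() for t in c}
        return self

    def scores(self, tokens):
        """Return an unnormalised log-probability for every known label."""
        n = sum(self.priors.values())
        out = {}
        for y, prior in self.priors.items():
            total = sum(self.counts[y].values()) + len(self.vocab) + 1
            out[y] = math.log(prior / n) + sum(
                math.log((self.counts[y][t] + 1) / total) for t in tokens
            )
        return out

    def predict(self, tokens):
        """Return (best label, margin over the runner-up) as a confidence."""
        s = self.scores(tokens)
        ranked = sorted(s, key=s.get, reverse=True)
        margin = s[ranked[0]] - s[ranked[1]] if len(ranked) > 1 else 0.0
        return ranked[0], margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: [((lexical_tokens, syntactic_tokens), label), ...]
    unlabeled: [(lexical_tokens, syntactic_tokens), ...]"""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        for view in (0, 1):  # 0 = lexical view, 1 = syntactic view
            if not pool:
                break
            clf = NaiveBayes().fit(
                [x[view] for x, _ in labeled], [y for _, y in labeled]
            )
            # Move the examples this view is most confident about into the
            # shared labeled set, so the other view can learn from them.
            pool.sort(key=lambda x: clf.predict(x[view])[1], reverse=True)
            for x in pool[:per_round]:
                labeled.append((x, clf.predict(x[view])[0]))
            del pool[:per_round]
    labels = [y for _, y in labeled]
    lex = NaiveBayes().fit([x[0] for x, _ in labeled], labels)
    syn = NaiveBayes().fit([x[1] for x, _ in labeled], labels)
    return lex, syn


def predict_combined(lex, syn, doc):
    """Combine both views by adding their log scores."""
    s_lex, s_syn = lex.scores(doc[0]), syn.scores(doc[1])
    return max(s_lex, key=lambda y: s_lex[y] + s_syn[y])


# Invented toy data: (word tokens, chunk-tag tokens).
seed = [
    ((["bank", "rate", "profit"], ["NP", "NP", "NP"]), "finance"),
    ((["match", "goal", "win"], ["VP", "VP", "NP"]), "sports"),
]
unlabeled_docs = [
    (["bank", "loan", "rate"], ["NP", "NP", "PP"]),
    (["goal", "team", "win"], ["VP", "NP", "VP"]),
    (["loan", "profit", "share"], ["NP", "PP", "NP"]),
    (["team", "match", "score"], ["VP", "VP", "VP"]),
]
lex, syn = co_train(seed, unlabeled_docs)
print(predict_combined(lex, syn, (["loan", "share"], ["NP", "PP"])))
print(predict_combined(lex, syn, (["team", "score"], ["VP", "VP"])))
```

The words "loan", "share", "team", and "score" never occur in the two seed documents; the classifiers can only learn them from unlabeled documents that one of the views labeled with confidence, which is the effect the abstract attributes to unlabeled data.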
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Park, SB., Zhang, BT. (2003). Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_9
DOI: https://doi.org/10.1007/3-540-36175-8_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-04760-5
Online ISBN: 978-3-540-36175-6
eBook Packages: Springer Book Archive