Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2003)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2637)

Abstract

Most document classification systems consider only the distribution of content words in a document and ignore its underlying syntactic information, although syntax is also an important factor. In this paper, we present an approach to classifying large-scale unstructured documents that incorporates both the lexical and the syntactic information of documents. For this purpose, we use the co-training algorithm, a partially supervised learning algorithm in which two separate views of the training data are employed and a small number of labeled examples is augmented by a large number of unlabeled examples. Since lexical and syntactic information can serve as two separate views of unstructured documents, the co-training algorithm improves document classification performance by using both views together with a large number of unlabeled documents. Experimental results on the Reuters-21578 corpus and the TREC-7 filtering documents show the effectiveness of unlabeled documents and of using both lexical and syntactic information.
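
The co-training loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the base learner or the feature extraction, so the small naive Bayes classifier (`TinyNB`), the `co_train` and `predict_both` helpers, the toy documents, and the tag-like tokens in the second view are all assumptions made for illustration. The selection rule is also simplified to "take the most confident example per view per round". Each document is a pair of token lists, one per view; in every round a classifier is trained on each view and its most confident predictions on unlabeled documents are added to the labeled set.

```python
import math
from collections import Counter


class TinyNB:
    """Multinomial naive Bayes over token lists (assumed stand-in base learner)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for tokens, y in zip(docs, labels):
            self.counts[y].update(tokens)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        return self

    def scores(self, tokens):
        """Log-probability of each class, with add-one smoothing."""
        n = sum(self.priors.values())
        out = {}
        for c in self.classes:
            total = sum(self.counts[c].values()) + len(self.vocab) + 1
            s = math.log(self.priors[c] / n)
            for t in tokens:
                s += math.log((self.counts[c][t] + 1) / total)
            out[c] = s
        return out

    def predict(self, tokens):
        s = self.scores(tokens)
        return max(s, key=s.get)

    def confidence(self, tokens):
        """Gap between the best and second-best class score."""
        ranked = sorted(self.scores(tokens).values(), reverse=True)
        return ranked[0] - ranked[1] if len(ranked) > 1 else 0.0


def train_views(labeled):
    """Train one classifier per view on the current labeled set."""
    ys = [y for _, y in labeled]
    return [TinyNB().fit([x[v] for x, _ in labeled], ys) for v in (0, 1)]


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: [((view0, view1), label)]; unlabeled: [(view0, view1)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        for v, clf in enumerate(train_views(labeled)):
            # Each view labels the pool documents it is most confident about.
            pool.sort(key=lambda x: clf.confidence(x[v]), reverse=True)
            labeled += [(x, clf.predict(x[v])) for x in pool[:per_round]]
            pool = pool[per_round:]
    return train_views(labeled), labeled


def predict_both(clfs, x):
    """Combine the two views by summing their per-class log scores."""
    s0, s1 = clfs[0].scores(x[0]), clfs[1].scores(x[1])
    return max(s0, key=lambda c: s0[c] + s1[c])


# Toy data (hypothetical): view 0 = content words, view 1 = tag-like tokens.
labeled = [
    ((["profit", "rose", "quarter"], ["NN", "VBD", "CD", "NN"]), "earn"),
    ((["team", "won", "final"], ["NNP", "VBD", "DT", "NN"]), "sport"),
]
unlabeled = [
    (["profit", "dividend", "shares"], ["NN", "CD", "NN", "CD"]),
    (["team", "coach", "season"], ["NNP", "NNP", "DT", "NN"]),
    (["dividend", "shares", "fell"], ["CD", "NN", "VBD", "CD"]),
    (["coach", "season", "won"], ["DT", "NN", "NNP", "VBD"]),
]

clfs, augmented = co_train(labeled, unlabeled)
print(predict_both(clfs, (["dividend", "shares"], ["CD", "CD"])))
print(predict_both(clfs, (["coach", "season"], ["NNP", "DT"])))
```

In this sketch, words such as "dividend" and "coach" never occur in the two labeled documents; they become usable only after the unlabeled documents have been pseudo-labeled, which is the effect the abstract attributes to unlabeled data. A real system would replace the toy token lists with features extracted from actual text and would likely use a stronger base learner.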

Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Park, SB., Zhang, BT. (2003). Large Scale Unstructured Document Classification Using Unlabeled Data and Syntactic Information. In: Whang, KY., Jeon, J., Shim, K., Srivastava, J. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2003. Lecture Notes in Computer Science, vol 2637. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-36175-8_9

  • DOI: https://doi.org/10.1007/3-540-36175-8_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-04760-5

  • Online ISBN: 978-3-540-36175-6

  • eBook Packages: Springer Book Archive
