Sampling the Web as Training Data for Text Classification

Wei-Yen Day, Chun-Yi Chi, Ruey-Cheng Chen, Pu-Jen Cheng
Copyright: © 2010 | Volume: 1 | Issue: 4 | Pages: 19
ISSN: 1947-9077 | EISSN: 1947-9085 | EISBN13: 9781613502785 | DOI: 10.4018/jdls.2010100102
Cite Article

MLA

Day, Wei-Yen, et al. "Sampling the Web as Training Data for Text Classification." IJDLS, vol. 1, no. 4, 2010, pp. 24-42. http://doi.org/10.4018/jdls.2010100102

APA

Day, W., Chi, C., Chen, R., & Cheng, P. (2010). Sampling the Web as Training Data for Text Classification. International Journal of Digital Library Systems (IJDLS), 1(4), 24-42. http://doi.org/10.4018/jdls.2010100102

Chicago

Day, Wei-Yen, et al. "Sampling the Web as Training Data for Text Classification." International Journal of Digital Library Systems (IJDLS) 1, no. 4 (2010): 24-42. http://doi.org/10.4018/jdls.2010100102

Abstract

Data acquisition is a major concern in text classification. Conventional methods of building a high-quality training collection demand extensive human effort, which is not always available to researchers. In this paper, the authors investigate how training data can be collected automatically by sampling the Web, given only a set of class names. The basic idea is to generate appropriate keywords and submit them as queries to search engines to acquire training data. Two methods are presented: the first samples the concepts common among classes, and the second samples the discriminative concepts for each class. Experiments carried out independently on two different datasets show that the proposed methods significantly improve classifier performance even without manually labeled training data. The authors' strategy for retrieving Web samples also substantially helps conventional document classification in terms of accuracy and efficiency.
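
The basic idea in the abstract (turn class names into keyword queries, send them to a search engine, and label each result with the class that retrieved it) can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration: the search engine is replaced by a hypothetical in-memory stub (`fake_search`), the keyword lists are hand-picked rather than derived by either of the paper's two sampling methods, and the classifier is a toy word-count scorer, not the one the authors evaluated.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical search interface: query string in, list of text snippets out.
SearchFn = Callable[[str], List[str]]


def build_queries(class_name: str, keywords: List[str]) -> List[str]:
    """Form one query from the class name alone and one per keyword."""
    return [class_name] + [f"{class_name} {kw}" for kw in keywords]


def sample_training_data(
    class_keywords: Dict[str, List[str]], search: SearchFn
) -> List[Tuple[str, str]]:
    """Label every retrieved snippet with the class whose query found it."""
    samples = []
    for class_name, keywords in class_keywords.items():
        for query in build_queries(class_name, keywords):
            for text in search(query):
                samples.append((text, class_name))
    return samples


def train(samples: List[Tuple[str, str]]) -> Dict[str, Counter]:
    """Toy model (an assumption): per-class word counts."""
    model: Dict[str, Counter] = {}
    for text, label in samples:
        model.setdefault(label, Counter()).update(text.lower().split())
    return model


def classify(model: Dict[str, Counter], text: str) -> str:
    """Pick the class whose training words overlap most with the text."""
    words = text.lower().split()
    return max(model, key=lambda label: sum(model[label][w] for w in words))


# Stand-in for a real search engine (hypothetical data, no network access).
FAKE_INDEX = {
    "sports": ["local team wins football match"],
    "sports league": ["league season opens with football match"],
    "finance": ["central bank raises interest rates"],
    "finance stock": ["stock market prices fell as investors sold shares"],
}


def fake_search(query: str) -> List[str]:
    return FAKE_INDEX.get(query, [])


# Hand-picked keywords; the paper derives them automatically.
class_keywords = {"sports": ["league"], "finance": ["stock"]}
samples = sample_training_data(class_keywords, fake_search)
model = train(samples)
```

With this stub, `classify(model, "investors watch stock market")` selects the `finance` class, because only the finance snippets share words with the input. In a real setting the stub would be swapped for an actual search API, and the retrieved pages would typically need cleaning before training.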
