research-article

Improving classification accuracy using automatically extracted training data

Authors:

Andrew B. Goldberg,

Rakesh Agrawal,

Panayiotis Tsaparas,

John Shafer

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1145 - 1154

https://doi.org/10.1145/1557019.1557143

Published: 28 June 2009

Abstract

Classification is a core task in knowledge discovery and data mining, and substantial research effort has gone into developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training data sets running into billions of words have reportedly been used.

We explore how to apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost, and we study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against that obtained using manually labeled training data. Our results show that automatically extracted training data can yield large accuracy gains at much lower cost.
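As a rough illustration of the labeling idea described in the abstract, the following minimal sketch assigns a label to each query according to the site that was clicked and then trains a simple word-based classifier on those labels. It is not the authors' implementation: the abstract does not specify the log format, the features, or the classifier. The site lists, the `(query, url)` log layout, the toy log entries, and the use of naive Bayes are all assumptions made for this example.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed lists; the abstract names only Amazon and Wikipedia as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}

# Hypothetical toy click log of (query, clicked_url) pairs.
TOY_LOG = [
    ("buy digital camera", "https://www.amazon.com/dp/x"),
    ("cheap laptop deals", "https://www.amazon.com/s?k=laptop"),
    ("history of photography", "https://en.wikipedia.org/wiki/Photography"),
    ("laptop invention history", "https://en.wikipedia.org/wiki/Laptop"),
    ("weather today", "https://example.com/weather"),
]


def site_of(url):
    """Return the last two host labels, e.g. 'en.wikipedia.org' -> 'wikipedia.org'.

    This is a simplification; it mishandles suffixes such as 'co.uk'.
    """
    host = urlparse(url).netloc.lower()
    return ".".join(host.split(".")[-2:])


def auto_label(click_log):
    """Map clicks on listed sites to (query, label); 1 = commercial, 0 = non-commercial."""
    labeled = []
    for query, url in click_log:
        site = site_of(url)
        if site in COMMERCIAL_SITES:
            labeled.append((query, 1))
        elif site in NONCOMMERCIAL_SITES:
            labeled.append((query, 0))
        # Clicks on unlisted sites carry no label and are skipped.
    return labeled


def train_naive_bayes(labeled):
    """Collect per-class word counts for a multinomial naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    class_counts = Counter()
    for query, label in labeled:
        class_counts[label] += 1
        word_counts[label].update(query.lower().split())
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, class_counts, vocab


def predict(model, query):
    """Return the more likely label for a query, using add-one smoothing."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in (0, 1):
        denom = sum(word_counts[label].values()) + len(vocab)
        score = math.log((class_counts[label] + 1) / (total + 2))
        for word in query.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


model = train_naive_bayes(auto_label(TOY_LOG))
print(predict(model, "buy cheap camera"))
print(predict(model, "history of laptop"))
```

The labels produced this way are noisy, since not every query that leads to a commercial site has commercial intent. That is the trade-off the abstract raises: whether a large volume of cheap labels can offset their lower quality.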

Index Terms

Improving classification accuracy using automatically extracted training data
1. Information systems
  1. Information systems applications
    1. Data mining

Published In

KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining

June 2009

1426 pages

ISBN:9781605584959

DOI:10.1145/1557019

General Chairs: John Elder (Elder Research, Inc., USA), Françoise Soulié Fogelman (KXEN, France)

Program Chairs: Peter Flach (University of Bristol, UK), Mohammed Zaki (RPI, USA)

Copyright © 2009 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD09: The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

June 28 - July 1, 2009

Paris, France

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
