research-article

Improving classification accuracy using automatically extracted training data

Published: 28 June 2009

Abstract

Classification is a core task in knowledge discovery and data mining, and there has been substantial research effort in developing sophisticated classification models. In a parallel thread, recent work from the NLP community suggests that for tasks such as natural language disambiguation, even a simple algorithm can outperform a sophisticated one if it is provided with large quantities of high-quality training data. In those applications, training data occurs naturally in text corpora, and high-quality training data sets running into billions of words have reportedly been used.
We explore how we can apply the lessons from the NLP community to KDD tasks. Specifically, we investigate how to identify data sources that can yield training data at low cost, and we study whether the quantity of the automatically extracted training data can compensate for its lower quality. We carry out this investigation for the specific task of inferring whether a search query has commercial intent. We mine toolbar and click logs to extract queries from sites that are predominantly commercial (e.g., Amazon) and non-commercial (e.g., Wikipedia). We compare the accuracy obtained using such training data against manually labeled training data. Our results show that we can achieve large accuracy gains using automatically extracted training data at much lower cost.
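
As a rough illustration of the labeling idea described in the abstract, the sketch below assigns a label to each query according to the sites its clicks landed on. It is not the authors' pipeline: the seed site sets, the `(query, url)` log format, the two-label host truncation, and the majority-vote rule are all assumptions made for this example, and the function names are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical seed lists; the abstract names only Amazon and Wikipedia as examples.
COMMERCIAL_SITES = {"amazon.com"}
NONCOMMERCIAL_SITES = {"wikipedia.org"}


def site_of(url):
    """Return the last two labels of the URL's host (a crude site key)."""
    host = urlparse(url).netloc.lower().split(":")[0]
    return ".".join(host.split(".")[-2:])


def label_queries(click_log):
    """Label each query by majority vote over the seed sites it led to.

    click_log is an iterable of (query, clicked_url) pairs (assumed format).
    Queries with no clicks on a seed site, or with a tied vote, stay unlabeled.
    """
    votes = {}
    for query, url in click_log:
        site = site_of(url)
        if site in COMMERCIAL_SITES:
            label = "commercial"
        elif site in NONCOMMERCIAL_SITES:
            label = "noncommercial"
        else:
            continue  # click on a site outside both seed lists: no evidence
        votes.setdefault(query, Counter())[label] += 1

    labeled = {}
    for query, counts in votes.items():
        if counts["commercial"] != counts["noncommercial"]:
            labeled[query] = counts.most_common(1)[0][0]
    return labeled


if __name__ == "__main__":
    log = [
        ("digital camera", "https://www.amazon.com/dp/1"),
        ("digital camera", "https://www.amazon.com/dp/2"),
        ("digital camera", "https://en.wikipedia.org/wiki/Camera"),
        ("french revolution", "https://en.wikipedia.org/wiki/French_Revolution"),
        ("weather", "https://example.com/"),
    ]
    print(label_queries(log))
```

Labels produced this way are noisy by construction, which is the trade-off the abstract studies: whether a large volume of cheaply labeled queries can make up for their lower quality relative to manual labels.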

    Published In

    KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
    June 2009
    1426 pages
    ISBN: 9781605584959
    DOI: 10.1145/1557019
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. automatically labeled data
    2. classification
    3. query intent

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
