Abstract
Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The ‘cost quantification’ variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
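The prevalence-correction idea at the heart of quantification can be illustrated with the "Adjusted Count" approach, one of the methods the paper evaluates: take the raw fraction of test cases the classifier flags as positive, then correct it using the classifier's true-positive and false-positive rates estimated on training data. The following minimal sketch is illustrative; the function name and the example rates are assumptions, not taken from the paper.

```python
def adjusted_count(observed_pos_rate, tpr, fpr):
    """Estimate true positive-class prevalence from a raw classifier count.

    observed_pos_rate: fraction of test cases the classifier labeled positive.
    tpr, fpr: the classifier's true/false positive rates, typically estimated
    on the training set via cross-validation.
    """
    # Observed rate = p*tpr + (1-p)*fpr; solve for the true prevalence p.
    estimate = (observed_pos_rate - fpr) / (tpr - fpr)
    # Classifier error can push the estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))
```

For example, a classifier with an estimated 80% true-positive rate and 10% false-positive rate that flags 17% of the test set yields an adjusted prevalence estimate of (0.17 − 0.10) / (0.80 − 0.10) = 10%, even though its raw count overstates the positive class.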
Responsible editor: Gary M. Weiss.
Forman, G. Quantifying counts and costs via classification. Data Min Knowl Disc 17, 164–206 (2008). https://doi.org/10.1007/s10618-008-0097-y