Quantifying counts and costs via classification

Data Mining and Knowledge Discovery

Abstract

Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The ‘cost quantification’ variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
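To make the bias described in the abstract concrete, the sketch below compares the naive "classify and count" estimate with a standard correction that uses the classifier's true positive rate and false positive rate. It is illustrative only: the function names are hypothetical, the error rates are assumed to have been estimated elsewhere (for example by cross-validation on the training set), and the abstract does not say that this particular correction is among the methods the paper evaluates.

```python
def classify_and_count(predictions):
    """Naive estimate: the fraction of cases the classifier labels positive."""
    return sum(predictions) / len(predictions)


def expected_observed_rate(prevalence, tpr, fpr):
    """Positive rate an imperfect classifier reports at a given true prevalence."""
    return tpr * prevalence + fpr * (1.0 - prevalence)


def adjusted_count(observed_rate, tpr, fpr):
    """Solve observed = tpr * p + fpr * (1 - p) for the true prevalence p.

    tpr and fpr are assumed to be known estimates of the classifier's
    true positive rate and false positive rate.
    """
    if tpr <= fpr:
        raise ValueError("the correction requires tpr > fpr")
    p = (observed_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))  # clip to a valid proportion


if __name__ == "__main__":
    # Hypothetical rare-class scenario: 1% true positives,
    # a classifier with 80% recall and a 5% false positive rate.
    observed = expected_observed_rate(0.01, tpr=0.8, fpr=0.05)
    print("classify-and-count estimate:", observed)
    print("adjusted estimate:", adjusted_count(observed, tpr=0.8, fpr=0.05))
```

With these assumed rates, classifying and counting reports roughly 5.75% positives when the true prevalence is 1%, because false positives from the large negative class swamp the true positives. The correction recovers the 1% figure here only because the example uses the exact error rates; with rates estimated from scarce training data the adjusted estimate would carry its own error.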

Author information

Corresponding author

Correspondence to George Forman.

Additional information

Responsible editor: Gary M. Weiss.

About this article

Cite this article

Forman, G. Quantifying counts and costs via classification. Data Min Knowl Disc 17, 164–206 (2008). https://doi.org/10.1007/s10618-008-0097-y

  • DOI: https://doi.org/10.1007/s10618-008-0097-y
