Abstract
Many business applications track changes over time, for example, measuring the monthly prevalence of influenza incidents. In situations where a classifier is needed to identify the relevant incidents, imperfect classification accuracy can cause substantial bias in estimating class prevalence. The paper defines two research challenges for machine learning. The ‘quantification’ task is to accurately estimate the number of positive cases (or class distribution) in a test set, using a training set that may have a substantially different distribution. The ‘cost quantification’ variant estimates the total cost associated with the positive class, where each case is tagged with a cost attribute, such as the expense to resolve the case. Quantification has a very different utility model from traditional classification research. For both forms of quantification, the paper describes a variety of methods and evaluates them with a suitable methodology, revealing which methods give reliable estimates when training data is scarce, the testing class distribution differs widely from training, and the positive class is rare, e.g., 1% positives. These strengths can make quantification practical for business use, even where classification accuracy is poor.
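The prevalence-correction idea at the heart of quantification can be illustrated with the "Adjusted Count" approach, one of the methods the paper evaluates: take the raw fraction of test cases the classifier flags as positive, then correct it using the classifier's true-positive and false-positive rates estimated on training data. The following minimal sketch is illustrative; the function name and the example rates are assumptions, not taken from the paper.

```python
def adjusted_count(observed_pos_rate, tpr, fpr):
    """Estimate true positive-class prevalence from a raw classifier count.

    observed_pos_rate: fraction of test cases the classifier labeled positive.
    tpr, fpr: the classifier's true/false positive rates, typically estimated
    on the training set via cross-validation.
    """
    # Observed rate = p*tpr + (1-p)*fpr; solve for the true prevalence p.
    estimate = (observed_pos_rate - fpr) / (tpr - fpr)
    # Classifier error can push the estimate outside [0, 1]; clip it.
    return min(1.0, max(0.0, estimate))
```

For example, a classifier with an estimated 80% true-positive rate and 10% false-positive rate that flags 17% of the test set yields an adjusted prevalence estimate of (0.17 − 0.10) / (0.80 − 0.10) = 10%, even though its raw count overstates the positive class.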
Responsible editor: Gary M. Weiss.
Forman, G. Quantifying counts and costs via classification. Data Min Knowl Disc 17, 164–206 (2008). https://doi.org/10.1007/s10618-008-0097-y