Medical data mining: insights from winning two competitions

Data Mining and Knowledge Discovery

Abstract

Two major data mining competitions in 2008 posed challenges in medical domains: KDD Cup 2008, which concerned cancer detection from mammography data, and the INFORMS Data Mining Challenge 2008, which dealt with the diagnosis of pneumonia based on patient information from hospital files. Our team won both competitions, and in this paper we share the lessons we learned and the insights we gained. We emphasize the aspects that pertain to the general practice and methodology of medical data mining rather than the specifics of each modeling competition. We concentrate on three topics: information leakage and its effect on competitions and proof-of-concept projects; the consideration of real-life model performance measures in model construction and evaluation; and relational learning approaches to medical data mining tasks.

Author information

Corresponding author

Correspondence to Claudia Perlich.

Additional information

Communicated by R. Bharat Rao and Romer Rosales.

About this article

Cite this article

Rosset, S., Perlich, C., Świrszcz, G. et al. Medical data mining: insights from winning two competitions. Data Min Knowl Disc 20, 439–468 (2010). https://doi.org/10.1007/s10618-009-0158-x

  • DOI: https://doi.org/10.1007/s10618-009-0158-x
