Cost-based quality measures in subgroup discovery

Journal of Intelligent Information Systems

Abstract

We consider data where examples are not only labeled in the classical sense (positive or negative), but also have costs associated with them. In this sense, each example has two target attributes, and we aim to find clearly defined subsets of the data where the values of these two targets have an unusual distribution. In other words, we focus on a Subgroup Discovery task with a somewhat unusual target concept, and investigate quality measures that take into account both the binary and the cost target. In defining such quality measures, we aim to produce an interpretable valuation of a subgroup, such that data analysts can directly appreciate the findings and relate them to monetary gains or losses. Our work is particularly relevant in the domain of health care fraud detection. In this domain, the binary target identifies the patients of a specific medical practitioner under investigation, and the cost target specifies the money spent on each patient. When looking for differences in claim behavior, we need to take into account both the ‘positive’ examples (patients of the practitioner) and the ‘negative’ examples (other patients), as well as information about the costs of all patients. A typical subgroup will list a number of treatments, together with the behavioral difference of the target practitioner’s patients in both treatment prevalence and associated costs. An additional angle is the Local Subgroup Discovery task, where subgroups are judged according to the difference with a local reference group instead of the entire dataset. We show how the cost-based analysis of data specifically fits this local focus.
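
To make the idea of scoring a subgroup on both a binary and a cost target concrete, the following minimal Python sketch may help. It is an illustration under stated assumptions, not the quality measures defined in the article: `wracc` is the standard weighted relative accuracy commonly used in Subgroup Discovery, while `excess_cost`, the `Example` record, and the toy data are hypothetical names and values introduced here only to show how a score can be expressed in monetary units.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Example:
    """One patient record (hypothetical layout, for illustration only)."""
    treatments: frozenset  # descriptive attributes used to define subgroups
    positive: bool         # True if the patient belongs to the target practitioner
    cost: float            # money spent on this patient


Cover = Callable[[Example], bool]


def wracc(data: Sequence[Example], covers: Cover) -> float:
    """Standard weighted relative accuracy on the binary target only."""
    sub = [e for e in data if covers(e)]
    if not data or not sub:
        return 0.0
    p_sub = sum(e.positive for e in sub) / len(sub)
    p_all = sum(e.positive for e in data) / len(data)
    return len(sub) / len(data) * (p_sub - p_all)


def excess_cost(data: Sequence[Example], covers: Cover) -> float:
    """Hypothetical cost-based score (an assumption, not the article's measure).

    Within the subgroup, compare the mean cost of positives with the mean
    cost of negatives, and scale by the number of positives. The result is
    an amount of money: how much more the positives cost in total than they
    would have at the negatives' mean cost.
    """
    sub = [e for e in data if covers(e)]
    pos = [e.cost for e in sub if e.positive]
    neg = [e.cost for e in sub if not e.positive]
    if not pos or not neg:
        return 0.0
    return len(pos) * (sum(pos) / len(pos) - sum(neg) / len(neg))


# Toy data: treatment codes "A" and "B" are made up.
data = [
    Example(frozenset({"A"}), True, 300.0),
    Example(frozenset({"A"}), True, 500.0),
    Example(frozenset({"A"}), False, 200.0),
    Example(frozenset({"B"}), False, 100.0),
    Example(frozenset({"B"}), True, 100.0),
    Example(frozenset({"B"}), False, 150.0),
    Example(frozenset({"A", "B"}), False, 200.0),
    Example(frozenset(), False, 50.0),
]


def covers_a(e: Example) -> bool:
    """Subgroup description: the patient received treatment "A"."""
    return "A" in e.treatments


print(wracc(data, covers_a))        # 0.0625
print(excess_cost(data, covers_a))  # 400.0
```

On this toy data, the subgroup "received treatment A" covers four patients, two of them positive. The binary score only says that positives are over-represented, whereas the cost-based score states a monetary difference (400.0 in the toy currency), which is the kind of directly interpretable valuation the abstract aims for. A local variant could be sketched by passing only a local reference group as `data` instead of the entire dataset.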

References

  • Atzmueller, M., & Lemmerich, F. (2009). Fast subgroup discovery for continuous target concepts. In J. Rauch, Z. W. Raś, P. Berka, T. Elomaa (Eds.), Foundations of intelligent systems. Lecture notes in computer science (Vol. 5722, pp. 35–44). Berlin: Springer.

  • Bay, S., & Pazzani, M. (2001). Detecting group differences: mining contrast sets. Data Mining and Knowledge Discovery, 5(3), 213–246.

  • Chan, R., Yang, Q., Shen, Y.-D. (2003). Mining high utility itemsets. In Third IEEE international conference on data mining, 2003 (pp. 19–26). IEEE.

  • Dong, G., & Li, J. (1999). Efficient mining of emerging patterns: discovering trends and differences. In Proceedings of KDD ’99 (pp. 43–52). New York.

  • Elkan, C. (2001). The foundations of cost-sensitive learning. In International joint conference on artificial intelligence (Vol. 17, pp. 973–978). Citeseer.

  • Grosskreutz, H. (2010). Cascaded subgroups discovery with an application to regression. In LeGo-08 - from local patterns to global models: ECML/PKDD-08 workshop (p. 16).

  • Grosskreutz, H., Rüping, S., Wrobel, S. (2008). Tight optimistic estimates for fast subgroup discovery. In W. Daelemans, B. Goethals, K. Morik (Eds.), Machine learning and knowledge discovery in databases. Lecture notes in computer science (Vol. 5211, pp. 440–456). Berlin: Springer.

  • Hernández-Orallo, J., Flach, P. A., Ramirez, C. F. (2011). Technical note: towards ROC curves in cost space. CoRR, ArXiv abs/1107.5930.

  • Jorge, A. M., Azevedo, P. J., Pereira, F. (2006). Distribution rules with numeric attributes of interest. In J. Fürnkranz, T. Scheffer, M. Spiliopoulou (Eds.), Knowledge discovery in databases: PKDD 2006. Lecture notes in computer science (Vol. 4213, pp. 247–258). Berlin: Springer.

  • Knobbe, A., & Ho, E. (2006). Pattern teams. In J. Fürnkranz, T. Scheffer, M. Spiliopoulou (Eds.), Knowledge discovery in databases: PKDD 2006. Lecture notes in computer science (Vol. 4213, pp. 577–584). Berlin: Springer.

  • Konijn, R. M., & Kowalczyk, W. (2012). Hunting for fraudsters in random forests. In E. Corchado, V. Snasel, A. Abraham, M. Wozniak, M. Grana, S.-B. Cho (Eds.), Hybrid artificial intelligent systems. Lecture notes in computer science (Vol. 7208, pp. 174–185). Berlin: Springer.

  • Konijn, R. M., Duivesteijn, W., Kowalczyk, W., Knobbe, A. (2013a). Discovering local subgroups, with an application to fraud detection. In J. Pei, V. Tseng, L. Cao, H. Motoda, G. Xu (Eds.), Advances in knowledge discovery and data mining. Lecture notes in computer science (Vol. 7818, pp. 1–12). Berlin: Springer.

  • Konijn, R. M., Duivesteijn, W., Meeng, M., Knobbe, A. (2013b). Cost-based quality measures in subgroup discovery. In New frontiers in applied data mining - PAKDD 2013 international workshops - QIMIE 2013.

  • Lavrač, N., Flach, P., Zupan, B. (1999). Rule evaluation measures: a unifying view. In S. Džeroski, P. Flach (Eds.), Inductive logic programming. Lecture notes in computer science (Vol. 1634, pp. 174–185). Berlin: Springer.

  • Liu, Y., Liao, W. K., Choudhary, A. (2005). A fast high utility itemsets mining algorithm. In Proceedings of the 1st international workshop on utility-based data mining (pp. 90–99).

  • Meeng, M., & Knobbe, A. (2011). Flexible enrichment with Cortana (software demo). In Proceedings Benelearn (pp. 117–120).

  • Pieters, B.F.I., Knobbe, A., Džeroski, S. (2010). Subgroup discovery in ranked data, with an application to gene set enrichment. In Proceedings preference learning workshop (PL 2010).

  • Reid, A. A., Tayebi, M. A., Frank, R. (2013). Exploring the structural characteristics of social networks in a large criminal court database. In Proceedings of the IEEE intelligence and security informatics conference (ISI 2013) (pp. 209–214).

  • Wrobel, S. (1997). An algorithm for multi-relational discovery of subgroups. In Proceedings of PKDD (pp. 78–87).

Author information

Corresponding author

Correspondence to Rob M. Konijn.

Additional information

This paper is an extended version of the paper with the same title (Konijn et al. 2013b), which appeared in New Frontiers in Applied Data Mining - PAKDD 2013 International Workshops, 2013.

About this article

Cite this article

Konijn, R.M., Duivesteijn, W., Meeng, M. et al. Cost-based quality measures in subgroup discovery. J Intell Inf Syst 45, 337–355 (2015). https://doi.org/10.1007/s10844-014-0313-8

  • DOI: https://doi.org/10.1007/s10844-014-0313-8
