
Rule induction for uncertain data

  • Regular Paper
Knowledge and Information Systems

Abstract

Data uncertainty is common in real-world applications and can be caused by many factors, such as imprecise measurements, network latency, outdated sources and sampling errors. When mining knowledge from these applications, data uncertainty needs to be handled with caution; otherwise, unreliable or even wrong mining results may be obtained. In this paper, we propose a rule induction algorithm, called uRule, to learn rules from uncertain data. The key problem in learning rules is to efficiently identify the optimal cut points from training data. For uncertain numerical data, we propose an optimization mechanism that merges adjacent bins that have equal class distributions, and we prove its soundness. For uncertain categorical data, we also propose a new method to select cut points based on possible world semantics. We then present the uRule algorithm in detail. Our experimental results show that the uRule algorithm can generate rules from uncertain numerical data with potentially higher accuracies, and that the proposed optimization method is effective in cut point selection for both certain and uncertain numerical data. Furthermore, uRule has quite stable performance when mining uncertain categorical data.
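
The bin-merging idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the uRule implementation, whose details are not reproduced in this preview. It assumes that each bin of a numerical attribute carries expected (possibly fractional) class counts, and that two adjacent bins are mergeable when their relative class frequencies agree within a small tolerance. The bin layout, the function names and the tolerance are all hypothetical.

```python
from math import isclose

def normalize(counts):
    """Turn (possibly fractional) class counts into relative frequencies."""
    total = sum(counts)
    return [c / total for c in counts] if total > 0 else None

def same_distribution(a, b, tol=1e-9):
    """True when two bins have the same relative class frequencies."""
    pa, pb = normalize(a), normalize(b)
    if pa is None or pb is None:
        # Assumption of this sketch: an empty bin is absorbed by its neighbour.
        return True
    return all(isclose(x, y, abs_tol=tol) for x, y in zip(pa, pb))

def merge_bins(bins):
    """Merge adjacent bins with equal class distributions.

    bins: list of (upper_bound, class_counts), sorted by upper_bound.
    Returns the merged list in the same format.
    """
    merged = []
    for upper, counts in bins:
        if merged and same_distribution(merged[-1][1], counts):
            prev_counts = merged[-1][1]
            # Extend the previous bin to the new upper bound and add the counts.
            merged[-1] = (upper, [p + c for p, c in zip(prev_counts, counts)])
        else:
            merged.append((upper, list(counts)))
    return merged

# Hypothetical expected class counts for two classes in five bins.
bins = [
    (1.0, [2.0, 1.0]),
    (2.0, [4.0, 2.0]),
    (3.0, [0.5, 1.5]),
    (4.0, [1.0, 3.0]),
    (5.0, [3.0, 0.0]),
]
merged = merge_bins(bins)
# Every surviving boundary except the last is a candidate cut point.
cut_points = [upper for upper, _ in merged[:-1]]
print(merged)
print(cut_points)
```

In this toy input, five bins collapse to three, so only two candidate cut points remain to be evaluated instead of four. Merging preserves the relative class frequencies of the combined bin, which is why a chain of equal-distribution bins collapses into one.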

Author information

Corresponding author

Correspondence to Biao Qin.

About this article

Cite this article

Qin, B., Xia, Y. & Prabhakar, S. Rule induction for uncertain data. Knowl Inf Syst 29, 103–130 (2011). https://doi.org/10.1007/s10115-010-0335-7

  • DOI: https://doi.org/10.1007/s10115-010-0335-7
