Abstract
Data uncertainty is common in real-world applications and can be caused by many factors, such as imprecise measurements, network latency, outdated sources, and sampling errors. When mining knowledge from these applications, data uncertainty needs to be handled with caution; otherwise, unreliable or even wrong mining results may be obtained. In this paper, we propose a rule induction algorithm, called uRule, to learn rules from uncertain data. The key problem in learning rules is to efficiently identify the optimal cut points from training data. For uncertain numerical data, we propose an optimization mechanism that merges adjacent bins with equal class distributions, and we prove its soundness. For uncertain categorical data, we also propose a new method to select cut points based on possible world semantics. We then present the uRule algorithm in detail. Our experimental results show that the uRule algorithm can generate rules from uncertain numerical data with potentially higher accuracies, and that the proposed optimization method is effective in cut point selection for both certain and uncertain numerical data. Furthermore, uRule has quite stable performance when mining uncertain categorical data.
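The bin-merging idea mentioned for uncertain numerical data can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not define how bins or class distributions are represented, so the sketch assumes each bin is a list of per-class (possibly fractional) counts and treats two distributions as equal when their normalized class proportions match within a tolerance. The names `normalize`, `same_distribution`, and `merge_equal_bins` are hypothetical.

```python
from math import isclose


def normalize(counts):
    """Return class proportions for a bin, or None if the bin is empty."""
    total = sum(counts)
    if total == 0:
        return None
    return [c / total for c in counts]


def same_distribution(a, b, tol=1e-9):
    """Assumed equality test: normalized class proportions match within tol."""
    pa, pb = normalize(a), normalize(b)
    if pa is None or pb is None:
        # Assumption: an empty bin carries no class information, so merge it.
        return True
    return all(isclose(x, y, abs_tol=tol) for x, y in zip(pa, pb))


def merge_equal_bins(bins):
    """Merge runs of adjacent bins whose class distributions are equal.

    `bins` is an ordered list of per-class count lists. Counts may be
    fractional, e.g. expected counts derived from uncertain values.
    """
    merged = []
    for counts in bins:
        if merged and same_distribution(merged[-1], counts):
            # Same class mix as the previous bin: fold the counts together.
            merged[-1] = [x + y for x, y in zip(merged[-1], counts)]
        else:
            merged.append(list(counts))
    return merged


bins = [[1.0, 1.0], [2.0, 2.0], [3.0, 0.0], [1.5, 0.0], [0.5, 1.5]]
print(merge_equal_bins(bins))
```

Under these assumptions, only the boundaries between the merged bins remain as candidate cut points, which is how merging would shrink the search space.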
Cite this article
Qin, B., Xia, Y. & Prabhakar, S. Rule induction for uncertain data. Knowl Inf Syst 29, 103–130 (2011). https://doi.org/10.1007/s10115-010-0335-7
DOI: https://doi.org/10.1007/s10115-010-0335-7