Rule induction for uncertain data

Qin, Biao; Xia, Yuni; Prabhakar, Sunil

doi:10.1007/s10115-010-0335-7


Regular Paper
Published: 21 August 2010

Volume 29, pages 103–130, (2011)
Knowledge and Information Systems

Biao Qin¹,²,
Yuni Xia³ &
Sunil Prabhakar⁴


Abstract

Data uncertainty is common in real-world applications and can be caused by many factors, such as imprecise measurements, network latency, outdated sources and sampling errors. When mining knowledge from these applications, data uncertainty needs to be handled with caution; otherwise, unreliable or even wrong mining results may be obtained. In this paper, we propose a rule induction algorithm, called uRule, to learn rules from uncertain data. The key problem in learning rules is to efficiently identify the optimal cut points from training data. For uncertain numerical data, we propose an optimization mechanism that merges adjacent bins that have equal classifying class distribution, and we prove its soundness. For uncertain categorical data, we also propose a new method to select cut points based on possible world semantics. We then present the uRule algorithm in detail. Our experimental results show that the uRule algorithm can generate rules from uncertain numerical data with potentially higher accuracies, and that the proposed optimization method is effective in cut point selection for both certain and uncertain numerical data. Furthermore, uRule has quite stable performance when mining uncertain categorical data.
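The bin-merging idea for uncertain numerical data can be illustrated with a minimal sketch. This is not the uRule implementation, which the abstract does not give. It assumes each uncertain value is an interval (lo, hi) with hi > lo and a uniform density, that bins are delimited by the sorted interval endpoints, and that two bins have "equal" class distribution when their normalized expected class counts match within a tolerance. The function names are hypothetical.

```python
from collections import defaultdict
from math import isclose


def expected_class_counts(intervals, edges):
    """Expected per-class count in each bin [edges[i], edges[i+1]).

    Assumption: each value is uniformly distributed on (lo, hi), hi > lo,
    so a bin receives the fraction of the interval that overlaps it.
    """
    bins = [defaultdict(float) for _ in range(len(edges) - 1)]
    for lo, hi, label in intervals:
        width = hi - lo
        for i in range(len(edges) - 1):
            overlap = min(hi, edges[i + 1]) - max(lo, edges[i])
            if overlap > 0:
                bins[i][label] += overlap / width
    return bins


def same_distribution(a, b, tol=1e-9):
    """True if two bins have the same normalized class distribution."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if total_a == 0 or total_b == 0:
        # Assumption: an empty bin only matches another empty bin.
        return total_a == total_b
    classes = set(a) | set(b)
    return all(
        isclose(a.get(c, 0.0) / total_a, b.get(c, 0.0) / total_b, abs_tol=tol)
        for c in classes
    )


def candidate_cut_points(intervals):
    """Keep only the bin boundaries where the class distribution changes.

    Adjacent bins with equal distribution are merged, so the boundary
    between them is dropped from the list of candidate cut points.
    """
    edges = sorted({e for lo, hi, _ in intervals for e in (lo, hi)})
    bins = expected_class_counts(intervals, edges)
    return [
        edges[i + 1]
        for i in range(len(bins) - 1)
        if not same_distribution(bins[i], bins[i + 1])
    ]


if __name__ == "__main__":
    data = [(0, 2, "A"), (1, 3, "A"), (4, 6, "B")]
    print(candidate_cut_points(data))
```

In this toy input the interior boundaries are 1, 2, 3 and 4. The three bins between 0 and 3 contain only class A in expectation, so they merge, and only 3 and 4 remain as candidates for a rule learner to evaluate.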



Author information

Authors and Affiliations

Department of Computer Science, Renmin University of China, Beijing, China
Biao Qin
Key Labs of Data Engineering and Knowledge Engineering, MOE, Beijing, China
Biao Qin
Department of Computer and Information Science, Indiana University-Purdue University Indianapolis, Indianapolis, IN, USA
Yuni Xia
Department of Computer Science, Purdue University, West Lafayette, IN, USA
Sunil Prabhakar

Corresponding author

Correspondence to Biao Qin.

About this article

Cite this article

Qin, B., Xia, Y. & Prabhakar, S. Rule induction for uncertain data. Knowl Inf Syst 29, 103–130 (2011). https://doi.org/10.1007/s10115-010-0335-7

Received: 23 October 2009
Revised: 11 July 2010
Accepted: 17 July 2010
Published: 21 August 2010
Issue Date: October 2011
DOI: https://doi.org/10.1007/s10115-010-0335-7
