Abstract
A hard-to-catch erroneous data value is one that looks perfectly legitimate on its own. Yet when we examine it in conjunction with other attribute values, the value appears questionable. Detecting such dubious values is a major problem in data cleaning. This paper presents a framework to automatically detect dubious data values in datasets. Data is first pre-processed by data smoothing and mapping. Next, interval association rules are generated: the data is partitioned via clustering, and the rules are then generated using an Apriori algorithm. Finally, these rules are used to identify data values that fall outside the expected intervals. Experimental results show that the proposed framework is able to accurately and efficiently identify dubious values in large datasets.
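The idea of flagging a value that falls outside an expected interval can be illustrated with a minimal sketch. This is not the paper's implementation: the gap-based partitioning stands in for the clustering step, the rules have a single antecedent filtered by support and confidence rather than being mined with Apriori, and every name here (`gap_intervals`, `mine_interval_rules`, `find_dubious`, the `job` and `salary` attributes, the thresholds) is hypothetical.

```python
from collections import defaultdict


def gap_intervals(values, max_gap):
    """Split sorted values into intervals wherever two neighbours differ by
    more than max_gap (an assumed stand-in for the clustering step)."""
    ordered = sorted(values)
    intervals = []
    low = prev = ordered[0]
    for v in ordered[1:]:
        if v - prev > max_gap:
            intervals.append((low, prev))
            low = v
        prev = v
    intervals.append((low, prev))
    return intervals


def mine_interval_rules(records, key, target, max_gap,
                        min_support, min_confidence):
    """Return {key_value: [(low, high), ...]}, read as the rule
    'key == key_value  =>  target lies in one of the intervals'."""
    n = len(records)
    by_key = defaultdict(list)
    for r in records:
        by_key[r[key]].append(r[target])

    rules = {}
    for k, vals in by_key.items():
        kept = []
        for low, high in gap_intervals(vals, max_gap):
            count = sum(low <= v <= high for v in vals)
            support = count / n             # fraction of all records
            confidence = count / len(vals)  # fraction of records with this key
            if support >= min_support and confidence >= min_confidence:
                kept.append((low, high))
        if kept:
            rules[k] = kept
    return rules


def find_dubious(records, rules, key, target):
    """Return records whose key matches a rule but whose target value
    falls outside every interval that rule expects."""
    dubious = []
    for r in records:
        intervals = rules.get(r[key])
        if intervals and not any(lo <= r[target] <= hi for lo, hi in intervals):
            dubious.append(r)
    return dubious


# Made-up data: a salary of 9000 is plausible in isolation,
# but questionable when the job attribute is "intern".
records = (
    [{"job": "intern", "salary": s} for s in (1000, 1100, 1200, 1050, 9000)]
    + [{"job": "manager", "salary": s} for s in (8000, 8500, 9000, 9500)]
)
rules = mine_interval_rules(records, "job", "salary", max_gap=600,
                            min_support=0.2, min_confidence=0.7)
dubious = find_dubious(records, rules, "job", "salary")
```

With this toy data, the manager record with salary 9000 passes, while the intern record with the same salary is flagged, which is the kind of context-dependent error the abstract describes. The thresholds are arbitrary and would need tuning on real data.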
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lu, R., Lee, M.L., Hsu, W. (2004). Using Interval Association Rules to Identify Dubious Data Values. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_53
DOI: https://doi.org/10.1007/978-3-540-27772-9_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive