Using Interval Association Rules to Identify Dubious Data Values

Lu, Ren; Lee, Mong Li; Hsu, Wynne

doi:10.1007/978-3-540-27772-9_53

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3129)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

A hard-to-catch erroneous data value is one that looks perfectly legitimate on its own. Yet when we examine it in conjunction with other attribute values, the value appears questionable. Detecting such dubious values is a major problem in data cleaning. This paper presents a framework to automatically detect dubious data values in datasets. Data is first pre-processed by data smoothing and mapping. Next, interval association rules are generated: the data is partitioned via clustering, and the rules are then generated using an Apriori algorithm. Finally, these rules are used to identify data values that fall outside the expected intervals. Experimental results show that the proposed framework is able to accurately and efficiently identify dubious values in large datasets.
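To make the idea of "values that fall outside the expected intervals" concrete, here is a minimal Python sketch. It is not the authors' implementation, and it simplifies heavily: the smoothing and mapping steps are skipped, a simple gap-based 1-D grouping stands in for the clustering step, and a single-antecedent confidence check stands in for Apriori rule generation. The function names, the `job` and `salary` attributes, and the `max_gap` and `min_conf` parameters are all assumptions made for illustration.

```python
from collections import defaultdict


def cluster_1d(values, max_gap):
    """Gap-based 1-D grouping (an assumed stand-in for the clustering step).

    Values are sorted; a new cluster starts whenever the gap to the
    previous value exceeds max_gap.
    """
    ordered = sorted(values)
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters


def learn_interval_rules(records, key, target, max_gap, min_conf):
    """Learn rules of the form `key = k  =>  target in [lo, hi]`.

    A cluster becomes a rule interval only if the fraction of records with
    key = k that fall into it (the rule's confidence) is at least min_conf.
    """
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r[target])

    rules = {}
    for k, vals in groups.items():
        intervals = []
        for c in cluster_1d(vals, max_gap):
            if len(c) / len(vals) >= min_conf:
                intervals.append((c[0], c[-1]))  # clusters are already sorted
        rules[k] = intervals
    return rules


def find_dubious(records, rules, key, target):
    """Return records whose target value lies outside every learned interval."""
    dubious = []
    for r in records:
        intervals = rules.get(r[key], [])
        if intervals and not any(lo <= r[target] <= hi for lo, hi in intervals):
            dubious.append(r)
    return dubious


# Hypothetical toy data: every salary looks plausible in isolation.
records = (
    [{"job": "engineer", "salary": s} for s in (5000, 5200, 5100, 4900, 5300, 500)]
    + [{"job": "clerk", "salary": s} for s in (2000, 2100, 1900, 2050, 5200)]
)

rules = learn_interval_rules(records, "job", "salary", max_gap=500, min_conf=0.5)
dubious = find_dubious(records, rules, "job", "salary")

for r in dubious:
    print(r)
```

In this toy data a salary of 5200 is unremarkable for an engineer but is flagged for a clerk, which is the kind of value the abstract describes: legitimate on its own, questionable in conjunction with another attribute.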

Author information

Authors and Affiliations

Department of Computer Science, National University of Singapore, Singapore
Ren Lu, Mong Li Lee & Wynne Hsu

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Shenyang Liaoning, Northeastern University, 110004, China
Guoren Wang
Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng

About this paper

Cite this paper

Lu, R., Lee, M.L., Hsu, W. (2004). Using Interval Association Rules to Identify Dubious Data Values. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_53

DOI: https://doi.org/10.1007/978-3-540-27772-9_53
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive
