Using Interval Association Rules to Identify Dubious Data Values

  • Conference paper
Advances in Web-Age Information Management (WAIM 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3129)

Abstract

A hard-to-catch erroneous data value is one that looks perfectly legitimate on its own. Yet when we examine it in conjunction with other attribute values, the value appears questionable. Detecting such dubious values is a major problem in data cleaning. This paper presents a framework to automatically detect dubious data values in datasets. Data is first pre-processed by data smoothing and mapping. Next, interval association rules are generated: the data is partitioned via clustering, and the rules are then generated using an Apriori algorithm. Finally, these rules are used to identify data values that fall outside the expected intervals. Experimental results show that the proposed framework is able to accurately and efficiently identify dubious values in large datasets.
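The idea of checking a value against an interval learned from related attributes can be illustrated with a minimal Python sketch. It is not the paper's implementation: the smoothing and mapping steps are skipped, the clustering step is replaced by a simple gap-based grouping of one numeric attribute, and Apriori is replaced by a confidence count over a single antecedent. The function names (`cluster_1d`, `mine_interval_rules`, `find_dubious`), the parameters `max_gap` and `min_conf`, and the sample records are all hypothetical.

```python
from collections import defaultdict


def cluster_1d(values, max_gap):
    """Group sorted values into (low, high, count) intervals.

    Assumption: a new interval starts whenever the gap to the previous
    value exceeds max_gap. This stands in for the clustering step.
    """
    ordered = sorted(values)
    intervals = []
    low = high = ordered[0]
    count = 1
    for v in ordered[1:]:
        if v - high > max_gap:
            intervals.append((low, high, count))
            low = high = v
            count = 1
        else:
            high = v
            count += 1
    intervals.append((low, high, count))
    return intervals


def mine_interval_rules(records, max_gap, min_conf):
    """Build rules of the form: key -> value lies in [low, high].

    An interval is kept only if the fraction of the key's records that
    fall in it (the rule's confidence) is at least min_conf.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    rules = {}
    for key, values in groups.items():
        total = len(values)
        rules[key] = [
            (low, high)
            for low, high, n in cluster_1d(values, max_gap)
            if n / total >= min_conf
        ]
    return rules


def find_dubious(records, rules):
    """Return records whose value lies outside every interval for its key."""
    return [
        (key, value)
        for key, value in records
        if not any(low <= value <= high for low, high in rules.get(key, []))
    ]


# Hypothetical (job title, monthly salary) records.
records = [
    ("manager", 5200), ("manager", 5400), ("manager", 5100),
    ("manager", 5300), ("manager", 1500),
    ("clerk", 1500), ("clerk", 1600), ("clerk", 1400), ("clerk", 1550),
]
rules = mine_interval_rules(records, max_gap=500, min_conf=0.5)
print(find_dubious(records, rules))  # [('manager', 1500)]
```

A salary of 1500 is a legitimate value in isolation (it is typical for a clerk in this toy data), but it falls outside the interval associated with "manager", which is the kind of value the abstract calls dubious.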

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lu, R., Lee, M.L., Hsu, W. (2004). Using Interval Association Rules to Identify Dubious Data Values. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_53

  • DOI: https://doi.org/10.1007/978-3-540-27772-9_53

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22418-1

  • Online ISBN: 978-3-540-27772-9

  • eBook Packages: Springer Book Archive
