Cleansing Data for Mining and Warehousing

  • Conference paper
Database and Expert Systems Applications (DEXA 1999)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 1677))

Abstract

Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the “garbage in, garbage out” principle. “Dirty” data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought to a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
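
The idea of pre-processing records before sorting, so that likely duplicates end up near each other, can be illustrated with a minimal Python sketch. It is not the system described in the paper: the abbreviation table, the token-sorting key, the window size, the similarity threshold, and the use of difflib.SequenceMatcher as the matching function are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table (an assumption, not taken from the paper).
ABBREVIATIONS = {"st": "street", "rd": "road", "dr": "doctor"}


def make_key(record):
    """Pre-process a record: lowercase, drop punctuation,
    expand abbreviations, and sort the tokens alphabetically."""
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(sorted(tokens))


def find_duplicates(records, window=3, threshold=0.85):
    """Sort records by their pre-processed key, then compare each record
    only with the next (window - 1) records in sorted order."""
    keyed = sorted((make_key(r), i) for i, r in enumerate(records))
    pairs = []
    for pos, (key_a, idx_a) in enumerate(keyed):
        for key_b, idx_b in keyed[pos + 1 : pos + window]:
            # SequenceMatcher stands in for whatever field-matching
            # function a real cleansing system would use.
            if SequenceMatcher(None, key_a, key_b).ratio() >= threshold:
                pairs.append((min(idx_a, idx_b), max(idx_a, idx_b)))
    return sorted(pairs)


records = [
    "John Smith, 12 Main St.",
    "Smith John 12 Main Street",
    "Mary Jones, 4 Oak Rd",
]
print(find_duplicates(records))
```

In this sketch the first two records receive the same key after pre-processing, so they sort next to each other and fall inside the same comparison window even though their raw text differs in word order, punctuation, and abbreviation.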

Copyright information

© 1999 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, M.L., Lu, H., Ling, T.W., Ko, Y.T. (1999). Cleansing Data for Mining and Warehousing. In: Bench-Capon, T.J., Soda, G., Tjoa, A.M. (eds) Database and Expert Systems Applications. DEXA 1999. Lecture Notes in Computer Science, vol 1677. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48309-8_70

  • DOI: https://doi.org/10.1007/3-540-48309-8_70

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-66448-2

  • Online ISBN: 978-3-540-48309-0

  • eBook Packages: Springer Book Archive