Abstract
Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the “garbage in, garbage out” principle. “Dirty” data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought to a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
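The general idea of pre-processing records, sorting them, and comparing only near neighbours can be illustrated with a minimal sketch. The abstract does not specify the pre-processing techniques or the matching function, so everything below is an assumption made for illustration: the token-sorting normalisation in `preprocess`, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the `window` and `threshold` values. It is not the authors' system.

```python
import re
from difflib import SequenceMatcher


def preprocess(record: str) -> str:
    """Hypothetical normalisation: lowercase, drop punctuation, sort tokens.

    Sorting tokens makes "Smith, John" and "John Smith" produce the same key,
    so the two records end up adjacent after sorting.
    """
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    return " ".join(sorted(tokens))


def find_duplicates(records, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, of records judged to be duplicates.

    Records are sorted by their pre-processed key, then each record is
    compared only with the next (window - 1) records in sorted order.
    """
    keys = [preprocess(r) for r in records]
    order = sorted(range(len(records)), key=lambda i: keys[i])
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            # Illustrative similarity measure; any field matcher could be used.
            if SequenceMatcher(None, keys[i], keys[j]).ratio() >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


records = [
    "Smith, John, 12 Oak Street",
    "Lee, Mary, 5 Hill Road",
    "John Smith 12 Oak Street",
    "Mary Lee, 5 Hill Rd.",
]
duplicates = find_duplicates(records)
print(duplicates)
```

Without the pre-processing step, sorting the raw strings would place "Smith, John, ..." and "John Smith ..." far apart, and a small window would miss the pair. A larger `window` catches more candidates at the cost of more comparisons.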
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lee, M.L., Lu, H., Ling, T.W., Ko, Y.T. (1999). Cleansing Data for Mining and Warehousing. In: Bench-Capon, T.J., Soda, G., Tjoa, A.M. (eds) Database and Expert Systems Applications. DEXA 1999. Lecture Notes in Computer Science, vol 1677. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48309-8_70
DOI: https://doi.org/10.1007/3-540-48309-8_70
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66448-2
Online ISBN: 978-3-540-48309-0
eBook Packages: Springer Book Archive