Cleansing Data for Mining and Warehousing

Lee, Mong Li; Lu, Hongjun; Ling, Tok Wang; Ko, Yee Teng

doi:10.1007/3-540-48309-8_70

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 1677)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Given the rapid growth of data, it is important to extract, mine and discover useful information from databases and data warehouses. The process of data cleansing is crucial because of the “garbage in, garbage out” principle. “Dirty” data files are prevalent because of incorrect or missing data values, inconsistent value naming conventions, and incomplete information. Hence, we may have multiple records referring to the same real-world entity. In this paper, we examine the problem of detecting and removing duplicate records. We present several efficient techniques to pre-process the records before sorting them so that potentially matching records will be brought to a close neighbourhood. Based on these techniques, we implement a data cleansing system which can detect and remove more duplicate records than existing methods.
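
The abstract gives the general idea (pre-process, sort, then compare records that land near each other) but not the paper's actual techniques. The following Python sketch illustrates that general idea only, under stated assumptions: the abbreviation table, the token-sorting step, the window size, the similarity threshold, and the sample records are all invented for illustration and are not taken from the paper.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table (an assumption, not the paper's rule set).
ABBREVIATIONS = {"st": "street", "rd": "road"}


def normalise(record: str) -> str:
    """Lower-case, drop punctuation, expand abbreviations, and sort tokens."""
    tokens = re.findall(r"[a-z0-9]+", record.lower())
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(sorted(tokens))


def find_duplicates(records, window=3, threshold=0.9):
    """Return index pairs (i, j), i < j, of records judged to be duplicates.

    Records are sorted by their normalised key, and each record is compared
    only with the next (window - 1) records in sorted order.
    """
    keyed = sorted((normalise(r), i) for i, r in enumerate(records))
    pairs = set()
    for pos, (key, idx) in enumerate(keyed):
        for other_key, other_idx in keyed[pos + 1 : pos + window]:
            if SequenceMatcher(None, key, other_key).ratio() >= threshold:
                pairs.add((min(idx, other_idx), max(idx, other_idx)))
    return sorted(pairs)


# Invented sample records: three spellings of one entity plus one distinct record.
records = [
    "Tan Ah Kow, 12 Kent Ridge Rd",
    "Lim Mei Ling, 5 Orchard St.",
    "tan ah kow, 12 kent ridge road",
    "Ah Kow Tan, 12 Kent Ridge Road",
]
print(find_duplicates(records))
```

Without the normalisation step, the record beginning "Ah Kow Tan" would sort away from the two beginning "Tan Ah Kow" and could fall outside the comparison window. Sorting on the pre-processed key places all three next to each other, which is the effect the abstract describes as bringing potentially matching records into a close neighbourhood.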

Author information

Authors and Affiliations

School of Computing, National University of Singapore, Singapore
Mong Li Lee, Hongjun Lu, Tok Wang Ling & Yee Teng Ko

Editor information

Editors and Affiliations

Department of Computer Science, University of Liverpool, P.O. Box 147, Liverpool, L49 3BX, UK
Trevor J.M. Bench-Capon
Department of Systems and Computers, University of Florence, Via S. Marta, 3, I-50139, Florence, Italy
Giovanni Soda
IFS, Technical University of Vienna, Resselgasse 3, A-1040, Vienna, Austria
A Min Tjoa

About this paper

Cite this paper

Lee, M.L., Lu, H., Ling, T.W., Ko, Y.T. (1999). Cleansing Data for Mining and Warehousing. In: Bench-Capon, T.J., Soda, G., Tjoa, A.M. (eds) Database and Expert Systems Applications. DEXA 1999. Lecture Notes in Computer Science, vol 1677. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48309-8_70

DOI: https://doi.org/10.1007/3-540-48309-8_70
Published: 18 June 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66448-2
Online ISBN: 978-3-540-48309-0
eBook Packages: Springer Book Archive
