Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Hernández, Mauricio A.; Stolfo, Salvatore J.

doi:10.1023/A:1009761603038

Published: January 1998

Data Mining and Knowledge Discovery, volume 2, pages 9–37 (1998)

Abstract

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate entries about the same entities that are difficult to bring together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when the data is processed multiple times, using a different key for sorting on each successive pass. Combining the results of the individual passes by computing the transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
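The multi-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the abstract does not say how each sorted pass is scanned, so the sketch assumes a small sliding window over the sorted records, uses a union-find structure as one common way to compute the transitive closure, and substitutes a toy `is_match` rule for the domain-dependent equational theory. The record fields (`first`, `last`, `addr`), the sample records, and all function names are hypothetical.

```python
def find(parent, i):
    """Return the representative of i's group, compressing the path as we go."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the groups containing a and b (smaller index becomes the root)."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def is_match(a, b):
    """Toy stand-in for an equational theory (an assumption, not the paper's rules):
    same first initial, plus either the same last name or the same address."""
    same_initial = a["first"][:1] == b["first"][:1]
    return same_initial and (a["last"] == b["last"] or a["addr"] == b["addr"])


def merge_purge(records, key_funcs, window=2):
    """Run one sorted pass per key, then group records by transitive closure."""
    parent = list(range(len(records)))
    for key in key_funcs:
        # Sort record indices by this pass's key.
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            # Compare each record only with the (window - 1) records before it.
            for j in order[max(0, pos - window + 1):pos]:
                if is_match(records[i], records[j]):
                    union(parent, i, j)
    # Records sharing a root are duplicates of one another, directly or transitively.
    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())


# Hypothetical sample data: record 2 has a misspelled last name.
RECORDS = [
    {"first": "Mauricio", "last": "Hernandez", "addr": "500 W 120th St"},
    {"first": "M.", "last": "Hernandez", "addr": "1214 Amsterdam Ave"},
    {"first": "Mauricio", "last": "Hernadez", "addr": "1214 Amsterdam Ave"},
    {"first": "Salvatore", "last": "Stolfo", "addr": "500 W 120th St"},
]


def by_last(record):
    return record["last"]


def by_addr(record):
    return record["addr"]
```

In this sample, a single pass sorted on last name links records 0 and 1 only, because the misspelling sorts record 2 away from its duplicates and the toy rule rejects the one comparison it gets. A second pass sorted on address places records 1 and 2 next to each other, and the closure then joins all three, even though records 0 and 2 never match each other directly. Calling `merge_purge(RECORDS, [by_last, by_addr])` returns the groups as lists of record indices.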

Author information

Authors and Affiliations

Department of Computer Science, Columbia University, New York, NY, 10027
Mauricio A. Hernández & Salvatore J. Stolfo

About this article

Cite this article

Hernández, M.A., Stolfo, S.J. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2, 9–37 (1998). https://doi.org/10.1023/A:1009761603038
