Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Data Mining and Knowledge Discovery

Abstract

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to bring together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes by computing the transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
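
The multi-pass idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' system: the record fields, the two sort keys, the toy matching rule `same_entity` (standing in for an equational theory), and the fixed-size comparison window are all assumptions made for illustration. Each pass sorts the records on a different key and compares only nearby records; a union-find structure then takes the transitive closure over the pairs found by all passes.

```python
def find(parent, i):
    """Return the representative of i's cluster (with path halving)."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    """Merge the clusters of a and b; the smaller index becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def same_entity(r1, r2):
    """Hypothetical toy rule: same zip and (same last name or same first name)."""
    if r1["zip"] != r2["zip"]:
        return False
    return (r1["last"].lower() == r2["last"].lower()
            or r1["first"].lower() == r2["first"].lower())


def sorted_neighborhood_pass(records, key, window, match):
    """One pass: sort on `key`, compare each record with the window-1 before it."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def merge_purge(records, keys, window, match):
    """Run one pass per key, then take the transitive closure of all matches."""
    parent = list(range(len(records)))
    for key in keys:
        for a, b in sorted_neighborhood_pass(records, key, window, match):
            union(parent, a, b)
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())


# Illustrative data only (not from the paper).
RECORDS = [
    {"first": "Michael", "last": "Smith", "zip": "10027"},
    {"first": "M.", "last": "Smith", "zip": "10027"},
    {"first": "M.", "last": "Smyth", "zip": "10027"},
    {"first": "Ann", "last": "Jones", "zip": "94110"},
]


def by_last(r):
    return (r["last"].lower(), r["first"].lower())


def by_zip(r):
    return (r["zip"], r["first"].lower())


clusters = merge_purge(RECORDS, [by_last, by_zip], window=2, match=same_entity)
```

With this toy data, the pass sorted on last name finds only the pair (0, 1) and the pass sorted on zip code finds only (1, 2). Records 0 and 2 never match directly under the rule, yet the closure places records 0, 1, and 2 in one cluster. The small window keeps each pass cheap, and the combination of passes recovers matches that any single pass would miss.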

Cite this article

Hernández, M.A., Stolfo, S.J. Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem. Data Mining and Knowledge Discovery 2, 9–37 (1998). https://doi.org/10.1023/A:1009761603038

  • DOI: https://doi.org/10.1023/A:1009761603038
