Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4251)

Abstract

A major bottleneck in data matching is the rapid deterioration of both running time and accuracy as the amount of data to be processed increases. One reason for this deterioration is the cost that data matching systems incur when comparing data records to determine their similarity (or dissimilarity). Approaches such as blocking and concatenation of data attributes have been used to reduce the comparison cost. In this paper, we present and analyse Keyword and Digram clustering as alternatives for enhancing the performance of data matching systems. We compare these clustering techniques in terms of the potential savings in the number of comparisons and their accuracy in correctly clustering similar data. Our results on a sample of a database of London Stock Exchange listed companies show that the clustering techniques can improve accuracy as well as save time in data matching systems.
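
To illustrate the general idea of using digrams to cut comparison cost, the following is a minimal Python sketch and not the procedure used in the paper. It assumes that a digram is a pair of adjacent characters in a normalised name, that similarity is scored with the Dice coefficient, and that only records sharing at least one digram are worth comparing. The function names (`digrams`, `dice`, `candidate_pairs`) and the company names are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def digrams(text):
    """Return the set of adjacent character pairs in a normalised string."""
    # Assumption: lower-case the text and keep only letters and digits.
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between two digram sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def candidate_pairs(records):
    """Return index pairs of records that share at least one digram."""
    # Inverted index: digram -> ids of the records that contain it.
    index = defaultdict(set)
    for rid, name in enumerate(records):
        for g in digrams(name):
            index[g].add(rid)
    pairs = set()
    for ids in index.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative names only, not taken from the paper's data set.
records = ["Barclays PLC", "Barclays Bank", "Vodafone Group", "Tesco"]
all_pairs = len(records) * (len(records) - 1) // 2
for i, j in sorted(candidate_pairs(records)):
    score = dice(digrams(records[i]), digrams(records[j]))
    print(records[i], "|", records[j], "|", round(score, 2))
print(len(candidate_pairs(records)), "of", all_pairs, "pairs compared")
```

With these four names, only the two Barclays entries share any digram, so one pair is scored instead of all six. On real data a single shared digram is a very loose filter, and a stricter threshold would likely be needed to obtain meaningful savings.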


Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Apeh, E.T., Gabrys, B. (2006). Clustering for Data Matching. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892960_146

  • DOI: https://doi.org/10.1007/11892960_146

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46535-5

  • Online ISBN: 978-3-540-46536-2

  • eBook Packages: Computer Science, Computer Science (R0)
