Abstract
A major bottleneck in data matching is the rapid deterioration of both run time and accuracy as the amount of data to be processed increases. One reason for this deterioration is the cost that data matching systems incur when comparing data records to determine their similarity (or dissimilarity). Approaches such as blocking and concatenation of data attributes have been used to reduce this comparison cost. In this paper, we present and analyse Keyword and Digram clustering as alternatives for enhancing the performance of data matching systems. We compare these clustering techniques in terms of their potential savings in performing comparisons and their accuracy in correctly clustering similar data. Our results on a sample from a database of London Stock Exchange listed companies show that the clustering techniques can improve accuracy as well as save time in data matching systems.
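To make the idea of digram clustering concrete, the following is a minimal sketch and not the method evaluated in the paper, whose details are not given in this abstract. It assumes a digram is a pair of adjacent characters, uses the Dice coefficient as the similarity measure, and assigns each record greedily to the first sufficiently similar cluster. The function names, the 0.6 threshold, and the sample company names are all hypothetical.

```python
def digrams(text):
    """Return the set of adjacent character pairs in a normalised string."""
    # Assumption: lower-case and keep only letters and digits before pairing.
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient between two digram sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster_by_digrams(names, threshold=0.6):
    """Greedy single-pass clustering on digram similarity.

    Each name is compared only with one representative per cluster
    (the first member), not with every other record.
    """
    clusters = []  # list of (representative digram set, member names)
    for name in names:
        grams = digrams(name)
        for representative, members in clusters:
            if dice(grams, representative) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append((grams, [name]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    sample = ["Barclays PLC", "Barclays P.L.C.", "Vodafone Group", "Vodafone Grp"]
    for group in cluster_by_digrams(sample):
        print(group)
```

In this sketch, detailed record comparison would then be restricted to records that share a cluster, which is where the savings in comparisons would come from.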
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Apeh, E.T., Gabrys, B. (2006). Clustering for Data Matching. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892960_146
DOI: https://doi.org/10.1007/11892960_146
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46535-5
Online ISBN: 978-3-540-46536-2
eBook Packages: Computer Science (R0)