Clustering for Data Matching

Apeh, Edward Tersoo; Gabrys, Bogdan

doi:10.1007/11892960_146


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4251)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems


Abstract

A major bottleneck in data matching is the rapid deterioration of both speed and accuracy as the amount of data to be processed increases. One reason for this deterioration is the cost that data matching systems incur when comparing data records to determine their similarity (or dissimilarity). Approaches such as blocking and concatenation of data attributes have been used to reduce this comparison cost. In this paper, we present and analyse Keyword and Digram clustering as alternatives for enhancing the performance of data matching systems. We compare these clustering techniques in terms of the potential savings in the number of comparisons performed and their accuracy in correctly clustering similar data. Our results on a sample of a database of London Stock Exchange listed companies show that using these clustering techniques can improve accuracy and save time in data matching systems.
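
The abstract does not describe the clustering procedure itself, so the following is only a minimal sketch of one common way digram-based clustering can cut down record comparisons. It is not the authors' method: the Dice coefficient, the greedy single-pass assignment, the 0.6 threshold, the function names and the sample company names are all assumptions chosen for illustration.

```python
def digrams(text: str) -> set[str]:
    """Return the set of adjacent character pairs of a normalised string."""
    # Normalisation (an assumption): lower-case and keep only letters and digits.
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a: set[str], b: set[str]) -> float:
    """Dice coefficient between two digram sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster(names: list[str], threshold: float = 0.6) -> list[list[str]]:
    """Greedy single-pass clustering.

    Each name joins the first cluster whose representative (the digram set
    of the cluster's first member) is at least `threshold` similar;
    otherwise it starts a new cluster.
    """
    clusters: list[tuple[set[str], list[str]]] = []
    for name in names:
        grams = digrams(name)
        for representative, members in clusters:
            if dice(grams, representative) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((grams, [name]))
    return [members for _, members in clusters]


def pairwise_comparisons(groups: list[list[str]]) -> int:
    """Number of record pairs compared if matching is done only within groups."""
    return sum(len(g) * (len(g) - 1) // 2 for g in groups)


# Hypothetical company names, not taken from the paper's dataset.
names = ["Barclays PLC", "Barclays P.L.C.", "Tesco PLC", "Tesco plc", "Vodafone Group"]
groups = cluster(names)
print(groups)
print(pairwise_comparisons([names]), "comparisons without clustering")
print(pairwise_comparisons(groups), "comparisons within clusters")
```

Under these assumptions, detailed matching only has to compare records that share a cluster, which is where the potential saving in comparisons comes from; the trade-off is that a poorly chosen threshold can separate true matches into different clusters.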



Author information

Authors and Affiliations

School of Design, Engineering and Computing, Computational Intelligence Research Group, Bournemouth University, Talbot Campus, Fern Barrow, Poole, BH12 5BB, UK
Edward Tersoo Apeh & Bogdan Gabrys
QGate Software Limited, D2 Fareham Heights, Standard Way, Fareham, Hampshire, PO16 8XT, UK
Edward Tersoo Apeh

Editor information

Editors and Affiliations

School of Design, Engineering and Computing, Bournemouth University, UK
Bogdan Gabrys
Centre for SMART Systems, School of Environment and Technology, University of Brighton, BN2 4GJ, Brighton, UK
Robert J. Howlett
School of Electrical and Information Engineering, Knowledge Based Intelligent Engineering Systems Centre, University of South Australia, Mawson Lakes, 5095, SA, Australia
Lakhmi C. Jain

Cite this paper

Apeh, E.T., Gabrys, B. (2006). Clustering for Data Matching. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4251. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11892960_146

DOI: https://doi.org/10.1007/11892960_146
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-46535-5
Online ISBN: 978-3-540-46536-2
eBook Packages: Computer Science, Computer Science (R0)
