An incremental clustering scheme for data de-duplication

Data Mining and Knowledge Discovery

Abstract

We propose an incremental technique for discovering duplicates in large databases of textual sequences, i.e., syntactically different tuples that refer to the same real-world entity. The problem is approached from a clustering perspective: given a set of tuples, the objective is to partition them into groups of duplicate tuples. Each newly arrived tuple is assigned to an appropriate cluster via nearest-neighbor classification. This is achieved by means of a suitable hash-based index that maps any tuple to a set of indexing keys and assigns tuples with high syntactic similarity to the same buckets. Hence, the neighbors of a query tuple can be efficiently identified by simply retrieving those tuples that appear in the buckets associated with the query tuple itself, without scanning the entire database. Two alternative schemes for computing indexing keys are discussed and compared. An extensive experimental evaluation on both synthetic and real data shows the effectiveness of our approach.
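
The bucket-based neighbor lookup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how indexing keys are computed, so the sketch assumes character 3-grams, min-hash values as indexing keys, Jaccard similarity, and a fixed threshold. All names (`qgrams`, `index_keys`, `IncrementalDeduplicator`) and parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def qgrams(text, q=3):
    """Return the set of character q-grams of a whitespace-normalized string."""
    s = " ".join(text.lower().split())
    if len(s) < q:
        return {s}
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def index_keys(grams, num_keys=8):
    """Assumed key scheme: one min-hash value per seeded hash function."""
    keys = []
    for seed in range(num_keys):
        smallest = min(
            hashlib.md5(f"{seed}:{g}".encode("utf-8")).hexdigest()
            for g in grams
        )
        keys.append((seed, smallest))
    return keys


class IncrementalDeduplicator:
    """Assigns each arriving tuple to the cluster of its nearest indexed neighbor."""

    def __init__(self, threshold=0.5, num_keys=8):
        self.threshold = threshold          # assumed similarity cut-off
        self.num_keys = num_keys
        self.buckets = defaultdict(set)     # indexing key -> ids of stored tuples
        self.grams = []                     # tuple id -> q-gram set
        self.cluster_of = []                # tuple id -> cluster id
        self.num_clusters = 0

    def add(self, text):
        """Index a new tuple and return the id of the cluster it joins."""
        g = qgrams(text)
        keys = index_keys(g, self.num_keys)

        # Candidate neighbors: only tuples sharing at least one bucket.
        candidates = set()
        for k in keys:
            candidates |= self.buckets.get(k, set())

        # Nearest neighbor among the candidates, not the whole database.
        best_id, best_sim = None, 0.0
        for c in sorted(candidates):
            sim = jaccard(g, self.grams[c])
            if sim > best_sim:
                best_id, best_sim = c, sim

        if best_id is not None and best_sim >= self.threshold:
            cluster = self.cluster_of[best_id]
        else:
            cluster = self.num_clusters     # no close neighbor: open a new cluster
            self.num_clusters += 1

        new_id = len(self.grams)
        self.grams.append(g)
        self.cluster_of.append(cluster)
        for k in keys:
            self.buckets[k].add(new_id)
        return cluster
```

Calling `add` twice with the same string returns the same cluster id, while a string that shares no 3-grams with any stored tuple opens a new cluster. Near-duplicates are found only if they share at least one key with a stored tuple, so in this sketch the number of keys trades recall against the number of candidates compared.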

Author information

Corresponding author

Correspondence to Giuseppe Manco.

Additional information

Responsible editor: R. Bayardo.

About this article

Cite this article

Costa, G., Manco, G. & Ortale, R. An incremental clustering scheme for data de-duplication. Data Min Knowl Disc 20, 152–187 (2010). https://doi.org/10.1007/s10618-009-0155-0

  • DOI: https://doi.org/10.1007/s10618-009-0155-0
