An n-gram-based approach for detecting approximately duplicate database records

Tian, Zengping; Lu, Hongjun; Ji, Wenyun; Zhou, Aoying; Tian, Zhong

doi:10.1007/s007990100044


Published: May 2002

Volume 3, pages 325–331, (2002)

International Journal on Digital Libraries

Zengping Tian¹, Hongjun Lu², Wenyun Ji¹, Aoying Zhou¹ & Zhong Tian³


Abstract.

Detecting and eliminating duplicate records is one of the major tasks for improving data quality. The task, however, is not as trivial as it seems since various errors, such as character insertion, deletion, transposition, substitution, and word switching, are often present in real-world databases. This paper presents an n-gram-based approach for detecting duplicate records in large databases. Using the approach, records are first mapped to numbers based on the n-grams of their field values. The obtained numbers are then clustered, and records within a cluster are taken as potential duplicate records. Finally, record comparisons are performed within clusters to identify true duplicate records. The unique feature of this method is that it does not require preprocessing to correct syntactic or typographical errors in the source data in order to achieve high accuracy. Moreover, sorting the source data file is unnecessary. Only a fixed number of database scans is required. Therefore, compared with previous methods, the algorithm is more time efficient.
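The three stages named in the abstract (map records to numbers, cluster the numbers, compare records within each cluster) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the algorithm from the paper: the abstract does not give the mapping, the clustering rule, or the comparison measure. Here the key is a plain sum of character codes over bigrams, clusters are formed by splitting the sorted keys at large gaps, and candidates are confirmed with a normalized Levenshtein distance. The names `record_key`, `cluster_keys`, `find_duplicates` and the parameters `max_gap` and `threshold` are all hypothetical.

```python
from itertools import combinations


def ngrams(text, n=2):
    """Character n-grams of a field value, ignoring case and whitespace."""
    s = "".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def record_key(record, n=2):
    """Map a record (tuple of strings) to one number.

    Assumption: a sum of character codes over all n-grams stands in for
    the paper's mapping. It is a crude key, but small edits move it only
    slightly, so near-duplicates tend to land close together.
    """
    return sum(ord(c) for field in record for g in ngrams(field, n) for c in g)


def cluster_keys(keys, max_gap=100):
    """Group record indices whose keys are close (assumed gap-based rule).

    Only the in-memory keys are ordered here, not the source data.
    """
    if not keys:
        return []
    order = sorted(range(len(keys)), key=keys.__getitem__)
    clusters, current = [], [order[0]]
    for prev, idx in zip(order, order[1:]):
        if keys[idx] - keys[prev] <= max_gap:
            current.append(idx)
        else:
            clusters.append(current)
            current = [idx]
    clusters.append(current)
    return clusters


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def find_duplicates(records, max_gap=100, threshold=0.8):
    """Return index pairs judged duplicates after in-cluster comparison."""
    keys = [record_key(r) for r in records]
    pairs = []
    for cluster in cluster_keys(keys, max_gap):
        for a, b in combinations(cluster, 2):
            sa, sb = " ".join(records[a]), " ".join(records[b])
            sim = 1 - edit_distance(sa, sb) / max(len(sa), len(sb), 1)
            if sim >= threshold:
                pairs.append(tuple(sorted((a, b))))
    return sorted(pairs)


records = [
    ("john smith", "new york"),
    ("jonh smith", "new york"),  # transposition error
    ("mary jones", "boston"),
]
duplicates = find_duplicates(records)
```

With these three records, the first two receive the same key and are compared with each other, while the third falls into its own cluster and is never compared. A summed key like this one discriminates poorly on real data; it is chosen only to keep the sketch short.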


Author information

Authors and Affiliations

Department of Computer Science, Fudan University, Shanghai, 200433, P.R. China; E-mail: {zptian, wyji, ayzhou}@fudan.edu.cn
Zengping Tian, Wenyun Ji & Aoying Zhou
Department of Computer Science, Hong Kong University of Science and Technology, Hong Kong, P.R. China; E-mail: luhj@cs.ust.hk
Hongjun Lu
IBM China Research Laboratory, Beijing, P.R. China; E-mail: tianz@cn.ibm.com
Zhong Tian


Additional information

Published online: 22 August 2001


About this article

Cite this article

Tian, Z., Lu, H., Ji, W. et al. An n-gram-based approach for detecting approximately duplicate database records. Int J Digit Libr 3, 325–331 (2002). https://doi.org/10.1007/s007990100044


Issue Date: May 2002
DOI: https://doi.org/10.1007/s007990100044

Key words: Duplicate elimination – N-gram – Edit distance – Data quality
