A Multilevel and Domain-Independent Duplicate Detection Model for Scientific Database

Song, Jie; Bao, Yubin; Yu, Ge

doi:10.1007/978-3-642-14246-8_69

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Duplicate detection is one of the technical difficulties in data cleaning. The data volume of scientific databases is increasing rapidly, which brings new challenges to duplicate detection. In a scientific database, a duplicate detection model should be suitable for massive numerical data, should be independent of the domain, should take the relationships among tables into account, and should focus on the characteristics that scientific databases have in common. This paper proposes a multilevel duplicate detection model for scientific databases that handles numerical data and general usage well. First, the challenges are identified by analyzing the duplicate-related characteristics of scientific data. Second, the similarity measures of the proposed model are defined. Third, the multilevel detection algorithms are described in detail. Finally, experiments and applications show that the proposed model is more domain-independent and effective, and is suitable for duplicate detection in scientific databases.
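
The abstract does not give the similarity measure itself, so the following is only a minimal illustrative sketch of the general idea of domain-independent duplicate detection on numerical records, not the authors' model. It assumes a range-normalized per-field similarity, an optionally weighted mean over fields, and a fixed threshold; the names `field_similarity`, `record_similarity`, and `find_duplicates` are hypothetical, and the multilevel (cross-table) part of the proposed model is not covered.

```python
def field_similarity(a, b, value_range):
    """Similarity in [0, 1] for two numeric values, scaled by the field's range.

    Assumption: range normalization stands in for whatever measure the paper defines.
    """
    if value_range == 0:
        return 1.0 if a == b else 0.0
    return max(0.0, 1.0 - abs(a - b) / value_range)


def record_similarity(r1, r2, ranges, weights=None):
    """Weighted mean of per-field similarities for two numeric records."""
    weights = weights or [1.0] * len(r1)
    weighted = sum(
        w * field_similarity(a, b, rng)
        for a, b, rng, w in zip(r1, r2, ranges, weights)
    )
    return weighted / sum(weights)


def find_duplicates(records, threshold=0.95):
    """Return index pairs whose similarity meets the threshold (naive O(n^2) scan)."""
    # Per-field ranges are derived from the data, so no domain knowledge is needed.
    ranges = [max(col) - min(col) for col in zip(*records)]
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], ranges) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample data: the first two records are near-duplicates.
records = [(10.0, 200.0), (10.1, 200.5), (50.0, 900.0)]
duplicate_pairs = find_duplicates(records)
```

Deriving the field ranges from the data keeps the sketch free of domain-specific rules, which is the property the abstract emphasizes; a pairwise scan like this would not scale to massive tables without some form of blocking or indexing.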

This work is supported by the National Natural Science Foundation of China (No. 60773222).

Author information

Authors and Affiliations

Northeastern University, Shenyang, 110004, China
Jie Song, Yubin Bao & Ge Yu

Editor information

Editors and Affiliations

Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Lei Chen
Computer Department, Sichuan University, 610064, Chengdu, China
Changjie Tang
Department of Computer Science, Duke University, Box 90129, NC 27708-0129, Durham, USA
Jun Yang
College of Computer Science, Zhejiang University, 388 Yuhangtang Road, 310058, Hangzhou, China
Yunjun Gao

About this paper

Cite this paper

Song, J., Bao, Y., Yu, G. (2010). A Multilevel and Domain-Independent Duplicate Detection Model for Scientific Database. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_69

DOI: https://doi.org/10.1007/978-3-642-14246-8_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science (R0)
