Abstract
Duplicate detection is one of the technical difficulties in data cleaning. The data volume of scientific databases is increasing rapidly, which brings new challenges to duplicate detection. For a scientific database, a duplicate detection model should be suitable for massive, numerical data, should be independent of the domain, should take the relationships among tables into account, and should focus on the characteristics common to scientific databases. In this paper, a multilevel duplicate detection model for scientific databases is proposed that handles numerical data and general usage well. First, the challenges are identified by analyzing the duplicate-related characteristics of scientific data. Second, the similarity measures of the proposed model are defined. Then the details of the multilevel detection algorithms are introduced. Finally, experiments and applications show that the proposed model is more domain-independent and effective, and is suitable for duplicate detection in scientific databases.
This work is supported by the National Natural Science Foundation of China (No. 60773222).
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Song, J., Bao, Y., Yu, G. (2010). A Multilevel and Domain-Independent Duplicate Detection Model for Scientific Database. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_69
DOI: https://doi.org/10.1007/978-3-642-14246-8_69
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14245-1
Online ISBN: 978-3-642-14246-8
eBook Packages: Computer Science (R0)