A Multilevel and Domain-Independent Duplicate Detection Model for Scientific Database

  • Conference paper
Web-Age Information Management (WAIM 2010)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6184)

Abstract

Duplicate detection is one of the technical difficulties in data cleaning. The data volume of scientific databases is increasing rapidly, which brings new challenges to duplicate detection. For scientific databases, a duplicate detection model should be suitable for massive, numerical data, should be independent of the domain, should take the relationships among tables into account, and should focus on the characteristics that scientific databases have in common. In this paper, a multilevel duplicate detection model for scientific databases is proposed, which handles numerical data and general usage well. First, the challenges are identified by analyzing the duplicate-related characteristics of scientific data. Second, the similarity measures of the proposed model are defined. Then, the details of the multilevel detection algorithms are introduced. Finally, experiments and applications show that the proposed model is more domain-independent and effective, and is suitable for duplicate detection in scientific databases.
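The abstract does not give the similarity measures or the detection algorithms themselves, so the following is only a minimal sketch of the general idea of threshold-based duplicate detection on numerical records. It is not the authors' model. The function names, the relative-difference field measure, the weights, and the 0.95 threshold are all assumptions chosen for illustration; the two levels shown (field, then record) are a simplification of whatever levels the paper defines.

```python
from typing import List, Sequence, Tuple


def field_similarity(a: float, b: float) -> float:
    """Hypothetical field-level measure: 1 minus the relative difference, clipped to [0, 1]."""
    if a == b:
        return 1.0
    scale = max(abs(a), abs(b))  # non-zero here because a != b
    return max(0.0, 1.0 - abs(a - b) / scale)


def record_similarity(r1: Sequence[float], r2: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Hypothetical record-level measure: weighted mean of the field similarities."""
    total = sum(weights)
    return sum(w * field_similarity(x, y)
               for w, x, y in zip(weights, r1, r2)) / total


def find_duplicates(records: List[Sequence[float]], weights: Sequence[float],
                    threshold: float = 0.95) -> List[Tuple[int, int]]:
    """Return index pairs whose record similarity reaches the (assumed) threshold."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs


# Made-up example data: the first two records differ only slightly.
records = [(10.0, 200.0), (10.1, 199.0), (50.0, 3.0)]
print(find_duplicates(records, weights=[1.0, 1.0]))
```

The pairwise loop is quadratic in the number of records, which is one reason a model aimed at massive data would need more than this naive comparison.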

This work is supported by the National Natural Science Foundation of China (No. 60773222).

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Song, J., Bao, Y., Yu, G. (2010). A Multilevel and Domain-Independent Duplicate Detection Model for Scientific Database. In: Chen, L., Tang, C., Yang, J., Gao, Y. (eds) Web-Age Information Management. WAIM 2010. Lecture Notes in Computer Science, vol 6184. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14246-8_69

  • DOI: https://doi.org/10.1007/978-3-642-14246-8_69

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14245-1

  • Online ISBN: 978-3-642-14246-8

  • eBook Packages: Computer Science, Computer Science (R0)
