
Named Entity Matching in Publication Databases

A Case Study of PubMed in SONCA

  • Conference paper
Rough Sets and Current Trends in Computing (RSCTC 2012)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7413)

Abstract

We present a case study in approximate data matching for a database system that stores information about scientific publications. Approximate matching is meant to determine whether several records in the database are in fact repeated instances of the same real-world object. In our case study we match instances of objects such as XML documents, persons’ names, affiliations, and journal names. The data we work with is a representation of the PubMed Central document corpus within the data warehouse that is part of the SONCA system. The SONCA system is being developed as one of the components of the general scientific information platform SYNAT.
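To make the idea of approximate matching concrete, the following is a minimal sketch, not the method used in the paper. It assumes that records are plain name strings, that similarity is measured with the standard-library `difflib.SequenceMatcher`, and that a greedy grouping with a fixed threshold is acceptable. The function names (`normalize`, `similarity`, `group_duplicates`) and the threshold value 0.85 are hypothetical choices made for illustration only.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, strip diacritics and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks left over after decomposition (e.g. accents).
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter, digit, or space with a space.
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_marks)
    return " ".join(cleaned.lower().split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two strings after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_duplicates(records, threshold=0.85):
    """Greedily group records whose similarity to a group's first
    member reaches the (assumed) threshold."""
    groups = []
    for record in records:
        for group in groups:
            if similarity(record, group[0]) >= threshold:
                group.append(record)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([record])
    return groups


if __name__ == "__main__":
    names = ["Ślęzak, D.", "Slezak D", "Nguyen, H.S."]
    for group in group_duplicates(names):
        print(group)
```

A greedy pass like this is order-dependent and compares each record only with the first member of each group; a real matching pipeline would likely need blocking, field-specific similarity measures, and a more careful clustering step.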

This work was supported by grant N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland, by Polish National Science Centre grant 2011/01/B/ST6/03867, and by the Polish National Centre for Research and Development (NCBiR) under grant No. SP/I/1/77065/10, within the framework of the strategic scientific research and experimental development program “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Szczuka, M., Betliński, P., Herba, K. (2012). Named Entity Matching in Publication Databases. In: Yao, J., et al. Rough Sets and Current Trends in Computing. RSCTC 2012. Lecture Notes in Computer Science, vol 7413. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32115-3_20

  • DOI: https://doi.org/10.1007/978-3-642-32115-3_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32114-6

  • Online ISBN: 978-3-642-32115-3

  • eBook Packages: Computer Science, Computer Science (R0)
