Semantic Clustering of Scientific Articles Using Explicit Semantic Analysis

Szczuka, Marcin; Janusz, Andrzej

doi:10.1007/978-3-642-36505-8_6

Part of the book series: Lecture Notes in Computer Science (TRS, volume 7736)

Abstract

This paper summarizes our recent research on semantic clustering of scientific articles. We present a case study focused on the analysis of papers related to rough set theory. The proposed method groups documents on the basis of their content, with the assistance of the DBpedia knowledge base. The text corpus is first processed using Natural Language Processing tools to produce vector representations of the content. In the second step, the articles are matched against a collection of concepts retrieved from DBpedia. As a result, a new representation that better reflects the semantics of the texts is constructed. With this new representation, the documents are hierarchically clustered to form a partitioning of the papers into semantically related groups. The steps in textual data preparation, the utilization of DBpedia, and the employed clustering methods are explained and illustrated with experimental results. The quality of the resulting clustering is then assessed using feedback from human experts combined with typical cluster quality measures. These results are discussed in the context of a larger framework that aims to facilitate search and information extraction from large textual repositories.
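The three steps named in the abstract (vector representation, matching against concepts, hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation. The concept descriptions are short hand-written stand-ins for text that would come from DBpedia, and the TF-IDF weighting, the cosine similarity, the average-linkage merging rule, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (no stemming)."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict: term -> weight) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    df = Counter(term for c in counts for term in c)
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(key, 0.0) for key, w in u.items())


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is empty."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0


def esa_vectors(docs, concepts):
    """Re-express each document as weights over concepts instead of terms."""
    names = list(concepts)
    concept_vecs = tfidf_vectors([concepts[name] for name in names])
    result = []
    for doc_vec in tfidf_vectors(docs):
        weights = {name: dot(doc_vec, c) for name, c in zip(names, concept_vecs)}
        result.append({name: w for name, w in weights.items() if w > 0})
    return result


def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the highest average similarity.
        _, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return sorted(sorted(c) for c in clusters)


# Hypothetical concept descriptions standing in for DBpedia content.
concepts = {
    "Rough set": "rough set approximation lower upper boundary indiscernibility",
    "Cluster analysis": "cluster clustering grouping partition hierarchical",
    "Information retrieval": "retrieval search query index document ranking",
}

# Toy documents, invented for this example.
docs = [
    "Lower and upper approximation of a rough set",
    "Indiscernibility and the boundary region in rough set theory",
    "Ranking documents for a search query",
    "Building an index for document retrieval",
]

esa = esa_vectors(docs, concepts)
clusters = agglomerate(esa, k=2)
print(clusters)
```

In this toy input the last two documents share only the word "for", so a plain bag-of-words comparison finds them nearly unrelated. Both map onto the same concept, so the concept-based representation places them in one cluster, which is the effect the abstract attributes to the new representation.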

This work was supported by grant N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland, by the Polish National Science Centre grant 2011/01/B/ST6/03867, and by the Polish National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 within the strategic scientific research and experimental development program “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.

Author information

Authors and Affiliations

Faculty of Mathematics, Informatics, and Mechanics, The University of Warsaw, Banacha 2, 02-097, Warsaw, Poland
Marcin Szczuka & Andrzej Janusz

Editor information

Editors and Affiliations

University of Manitoba, Winnipeg, MB, Canada
James F. Peters
University of Warsaw, Poland
Andrzej Skowron
University of Winnipeg, MB, Canada
Sheela Ramanna
University of Rzeszów, Poland
Zbigniew Suraj
University of Calgary, AB, Canada
Xin Wang

About this chapter

Cite this chapter

Szczuka, M., Janusz, A. (2013). Semantic Clustering of Scientific Articles Using Explicit Semantic Analysis. In: Peters, J.F., Skowron, A., Ramanna, S., Suraj, Z., Wang, X. (eds) Transactions on Rough Sets XVI. Lecture Notes in Computer Science, vol 7736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36505-8_6

DOI: https://doi.org/10.1007/978-3-642-36505-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36504-1
Online ISBN: 978-3-642-36505-8
eBook Packages: Computer Science, Computer Science (R0)
