Abstract
This paper summarizes our recent research on semantic clustering of scientific articles. We present a case study focused on the analysis of papers related to rough set theory. The proposed method groups the documents on the basis of their content, with the assistance of the DBpedia knowledge base. The text corpus is first processed using Natural Language Processing tools in order to produce vector representations of the content. In the second step, the articles are matched against a collection of concepts retrieved from DBpedia. As a result, a new representation that better reflects the semantics of the texts is constructed. With this new representation, the documents are hierarchically clustered in order to partition the papers into semantically related groups. The steps in textual data preparation, the utilization of DBpedia and the employed clustering methods are explained and illustrated with experimental results. The quality of the resulting clustering is assessed using feedback from human experts combined with typical cluster quality measures. These results are then discussed in the context of a larger framework that aims to facilitate search and information extraction from large textual repositories.
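The three steps named in the abstract (vector representation, matching against concepts, hierarchical clustering) can be illustrated with a minimal sketch. It is not the authors' implementation: the concept descriptions and documents are invented toy data standing in for DBpedia content and the paper corpus, the weighting is a plain smoothed TF-IDF, and average-linkage clustering on cosine distance is an assumed choice. All function names are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer that keeps alphabetic tokens only."""
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf_vectors(texts):
    """Return one sparse TF-IDF dict per text, with smoothed idf over `texts`."""
    bags = [Counter(tokenize(t)) for t in texts]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)
    return [{t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in bag.items()}
            for bag in bags]

def concept_representation(doc, concept_vecs):
    """ESA-style vector: one association score per concept for `doc`."""
    bag = Counter(tokenize(doc))
    return [sum(tf * vec.get(term, 0.0) for term, tf in bag.items())
            for vec in concept_vecs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(vectors, k):
    """Average-linkage agglomerative clustering, stopped at k clusters."""
    clusters = [[i] for i in range(len(vectors))]

    def dist(a, b):
        total = sum(1 - cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda p: dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy stand-ins for concept descriptions retrieved from a knowledge base.
concepts = {
    "Rough set": "rough set approximation lower upper boundary reduct",
    "Cluster analysis": "cluster clustering grouping hierarchical partition similarity",
    "Gene expression": "gene expression microarray dna protein",
}
# Toy stand-ins for article texts.
docs = [
    "reduct computation for rough set approximation",
    "lower and upper approximation of a rough set",
    "hierarchical clustering with a similarity measure",
    "grouping documents by clustering",
    "microarray gene expression data",
]

concept_vecs = tfidf_vectors(list(concepts.values()))
doc_vecs = [concept_representation(d, concept_vecs) for d in docs]
groups = sorted(sorted(c) for c in agglomerate(doc_vecs, k=3))
print(groups)
```

Because each document is described by its association with named concepts rather than by raw terms, two texts that share no words can still be grouped together when they relate to the same concept. That is the intuition behind the representation described above.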
This work was supported by grant N N516 077837 from the Ministry of Science and Higher Education of the Republic of Poland, by the Polish National Science Centre grant 2011/01/B/ST6/03867, and by the Polish National Centre for Research and Development (NCBiR) under Grant No. SP/I/1/77065/10 within the strategic scientific research and experimental development program “Interdisciplinary System for Interactive Scientific and Scientific-Technical Information”.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Szczuka, M., Janusz, A. (2013). Semantic Clustering of Scientific Articles Using Explicit Semantic Analysis. In: Peters, J.F., Skowron, A., Ramanna, S., Suraj, Z., Wang, X. (eds) Transactions on Rough Sets XVI. Lecture Notes in Computer Science, vol 7736. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36505-8_6
DOI: https://doi.org/10.1007/978-3-642-36505-8_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36504-1
Online ISBN: 978-3-642-36505-8
eBook Packages: Computer Science, Computer Science (R0)