Knowledge-based framework for estimating the relevance of scientific articles

https://doi.org/10.1016/j.eswa.2020.113692

Highlights

  • Framework for automatic mentoring for the scientific community.

  • Relevance-based lexicon generation.

  • Trending topics detection and evolution in science.

  • Calculation of reputation for scientific publications.

  • Evaluation of scientific publications according to their importance for the community.

Abstract

The volume of papers published by the scientific community has increased drastically in recent years, and the range of topics covered by these publications has grown considerably with it. Although these topics were usually regarded as cutting-edge when they were presented in conferences and journals, the restless evolution of science may have diminished their relative importance over the years. This poses a major challenge to researchers who gather information to enrich their own background. Consequently, a system able to automatically organize scientific papers and assign relevance to them should play a crucial role in addressing this problem. In this paper, the Webelance framework is presented. It makes use of a lexicon and Machine Learning techniques to accomplish these tasks, and it has been built using metrics specific to the scientific domain to measure the relative importance of papers. Several experiments using more than 50,000 articles from the medicine domain have been carried out to illustrate the viability of the proposal. The obtained results confirm both the usability of the system and its good performance.

Introduction

It is widely known that a large number of documents exists all over the world. The content of these documents is very heterogeneous, with information that can be organized into multiple topics. Moreover, this information is usually static (e.g., videos or texts) and cannot be easily updated or upgraded. These facts can create problems for users trying to select the right documents to obtain knowledge from.

In the case of the scientific community, this situation has been aggravated by the increase in the number of journals and conferences (Ware & Mabe, 2015). This leads to scenarios where outdated texts and low-quality studies are found alongside well-known topics (Shojania et al., 2007; Pattanittum et al., 2012). Thus, new scientists aiming to improve their skills and background in a specific domain usually encounter difficulties. In these cases, the figure of a mentor who provides guidelines and performs a first filtering of the texts becomes essential (Williamson, German, Weiss, Skinner, & Bowes, 1989). Furthermore, this filtering process can be useful for scientific experts in a specific topic or a particular scientific community, as the time and effort needed to carry out research can be considerably reduced.

The relevance of a scientific article is its value and usefulness to scientists in their work. Nevertheless, discriminating texts according to their relevance has always been a hard task because of the various factors that influence the process. Some of these factors are the initial importance of the considered topics, their evolution through time (they could become obsolete), the reputation of the authors, the affected domains, and the importance of the document for the community. Notice that the assessment of some of these factors can be biased by the background, opinions and skills of the person making it in relation to the considered domains and topics (Kumar, 2016).

For this reason, developing a system to support and assist this process becomes a key issue. The system has to be able to objectively measure the relevance of a text in a specific domain. In addition, it should carry out broad research by studying several corpora of documents, processing the gathered information and organizing the knowledge (simulating the background and skills of a mentor). Finally, the system should include a measure to rank and discriminate the texts.

In this paper, the Webelance framework is introduced. It makes use of two main types of artifacts to measure the relevance of scientific papers: a relevance-based lexicon and Machine Learning (ML) models. The lexicon is built by processing a large number of papers belonging to a specific domain. This process extracts the concepts and also measures their relevance. The relevance metrics are based on the occurrence of the concepts and on the paper reputation. This reputation is based on previous objective measures used by the scientific community (Fernández-Isabel et al., 2018). Thus, the scientific community acts as the expert that generates the knowledge used to train the system. The ML models complete the framework by predicting the relevance of concepts not covered by the lexicon. Webelance therefore follows a well-known workflow in the Text Mining domain (Cambria, 2016).
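The interplay between the lexicon and the ML models can be illustrated with a minimal sketch. It is not the Webelance implementation: the scoring rule (mean reputation of the papers in which a concept occurs), the assumption that reputation lies in [0, 1], and the names `build_lexicon`, `score_document` and `fallback` are all hypothetical choices made for illustration. The `fallback` callable merely stands in for an ML model that predicts the relevance of concepts missing from the lexicon.

```python
from collections import defaultdict


def build_lexicon(papers):
    """Map each concept to the mean reputation of the papers it occurs in.

    `papers` is an iterable of (concepts, reputation) pairs, where `concepts`
    lists every concept occurrence in a paper and `reputation` is assumed to
    be a number in [0, 1]. This weighting is an illustrative assumption.
    """
    reputation_sum = defaultdict(float)
    occurrences = defaultdict(int)
    for concepts, reputation in papers:
        for concept in concepts:
            reputation_sum[concept] += reputation
            occurrences[concept] += 1
    return {c: reputation_sum[c] / occurrences[c] for c in occurrences}


def score_document(concepts, lexicon, fallback=None):
    """Average the relevance of a document's concepts.

    Concepts missing from the lexicon are scored by `fallback`, a hypothetical
    stand-in for the ML models; without it they are skipped.
    """
    scores = []
    for concept in concepts:
        if concept in lexicon:
            scores.append(lexicon[concept])
        elif fallback is not None:
            scores.append(fallback(concept))
    return sum(scores) / len(scores) if scores else 0.0


# Toy corpus with made-up reputations, for illustration only.
corpus = [
    (["sepsis", "antibiotic"], 0.9),
    (["sepsis"], 0.5),
    (["bloodletting"], 0.1),
]
lexicon = build_lexicon(corpus)
print(score_document(["sepsis", "bloodletting", "crispr"], lexicon,
                     fallback=lambda concept: 0.5))
```

Keeping the fallback as a plain callable separates the lexicon lookup from the prediction step, so any trained model could be plugged in without changing the scoring code.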

The experiments carried out in this paper are oriented to validate the proposal. Different values for the internal parameters of the system are configured in order to test the performance of the framework. First, an experiment with neutral values of the parameters is performed. Then, a second experiment in which the values of the parameters are tuned by the experts evaluates the improvement achieved when the domain knowledge is considered. Finally, a third experiment validates the results provided by the system over time.

The system is evaluated by means of a test battery of documents previously labeled as relevant or non-relevant by the experts. The medicine domain has been selected for the experiments. This decision is motivated by three reasons. First, it is one of the most important domains for human beings (More, 2016). Second, it is constantly being updated and improved with the advances made by researchers, which implies that trends change within a short period of time and leave several manuscripts outdated (Castiglioni, 2019). Finally, it is one of the most explored fields in the scientific community (Richta, 2018), which facilitates the document gathering process required to create a wide corpus organized by year.

The rest of the paper is organized as follows. Section 2 situates the proposal in the domain, highlighting its foundations. Section 3 presents the developed framework, detailing its modules and components. Section 4 proposes several experiments on the medicine domain to illustrate the viability of the system. Section 5 concludes and outlines some future guidelines.

Section snippets

Background

This section introduces the foundations of the Webelance framework. The first subsection overviews the concept of relevance, both by defining it and also by reviewing its evolution in the research field over the last years. Secondly, some techniques for automatic dictionary generation are introduced. Finally, given the fact that the article relevance analysis presented in this work is inspired by the classic Sentiment Analysis approach (de Diego, Fernández-Isabel, Ortega, & Moguerza, 2018),

Proposed framework

The main purpose of the Webelance framework is to estimate the relevance of scientific texts in a predefined application domain. Webelance is an expert system whose development relies on information obtained from experts in the scientific community. Thus, concepts such as the number of citations and the reputation of authors based on their importance in the field of application are used. Notice that the system needs other independent experts in the selected application domain to be evaluated and to show

Experiments

The medicine domain has been selected in order to exemplify and validate the proposal.

Given a set of documents previously assessed by field experts (i.e., medical researchers), the experiments focus on both evaluating the accuracy of the Webelance framework in estimating document relevance and verifying the overall sensitivity of the system to variations in the values of its internal parameters.
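The two goals of the evaluation (accuracy against expert labels and sensitivity to parameter values) can be sketched as follows. This is a hedged illustration only: the snippet does not name the internal parameters of Webelance, so a single decision `threshold` is used here as a hypothetical stand-in, and the function names, scores and labels are invented for the example.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of documents whose thresholded score matches the expert label.

    `scores` are relevance estimates, `labels` are booleans given by experts
    (True = relevant). `threshold` is a hypothetical parameter.
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and aligned")
    predictions = [score >= threshold for score in scores]
    hits = sum(pred == label for pred, label in zip(predictions, labels))
    return hits / len(labels)


def sensitivity_sweep(scores, labels, thresholds):
    """Accuracy obtained for each candidate value of the parameter."""
    return {t: accuracy(scores, labels, t) for t in thresholds}


# Made-up scores and expert labels for four documents.
scores = [0.8, 0.6, 0.3, 0.1]
labels = [True, True, False, True]
for threshold, acc in sensitivity_sweep(scores, labels, [0.05, 0.5, 0.7]).items():
    print(f"threshold={threshold}: accuracy={acc:.2f}")
```

A flat accuracy curve across the sweep would suggest low sensitivity to the parameter, while sharp changes would indicate that expert tuning matters.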

Four different groups of 10 documents (for the first and second experiments) and 5 specific documents related to

Conclusions

This paper has introduced the Webelance framework to provide a solution to an open issue in the scientific community: estimating the relevance of scientific articles. Relevance (i.e., the degree to which something is related or useful to what is happening or being talked about) is a subjective concept, and therefore it is almost impossible to determine exactly at any given time. However, there are many objective measures used to estimate it. These estimations are objective, but they

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Alberto Fernández-Isabel: Methodology, Writing - original draft, Writing - review & editing. Adrián A. Barriuso: Data curation, Software, Writing - original draft. Javier Cabezas: Investigation, Writing - original draft. Isaac Martín de Diego: Supervision, Writing - review & editing. J.F.J. Viseu Pinheiro: Data curation, Conceptualization, Validation.

Acknowledgments

Research supported by a grant from the Spanish Ministry of Economy and Competitiveness, under the Retos-Colaboración program: SABERMED (Ref: RTC-2017-6253-1); medical corpus provided by MMG and donation of the Titan V GPU by NVIDIA Corporation.

References (87)

  • M. Alfano et al. Development and practical use of a medical vocabulary-thesaurus-dictionary for patient empowerment
  • Allen Institute for Artificial Intelligence and Semantic Scholar (2018). Semantic Scholar API....
  • M. Anandarajan et al. Term-document representation
  • Baccianella, S., Esuli, A., & Sebastiani, F. (2010). Sentiwordnet 3.0: an enhanced lexical resource for sentiment...
  • A. Bandhakavi et al. Lexicon generation for emotion detection from text. IEEE Intelligent Systems (2017)
  • K. Bhavsar et al. Natural language processing with python cookbook: Over 60 recipes to implement text analytics solutions using deep learning principles (2017)
  • P.A. Bloching et al. Assessing the scientific relevance of a single publication over time. South African Journal of Science (2013)
  • D.B. Bracewell. Semi-automatic creation of an emotion dictionary using wordnet and its evaluation
  • E. Cambria. Affective computing and sentiment analysis. IEEE Intelligent Systems (2016)
  • Cambria, E., Poria, S., Gelbukh, A., & Kwok, K. (2014). Sentic api: a common-sense based api for concept-level...
  • A. Castiglioni. A history of medicine (2019)
  • Chandramouli, A. (2018). Domain-specific stopword removal from unstructured computer text using a neural network. US...
  • Q. Chen et al. Sentence similarity measures revisited: Ranking sentences in pubmed documents
  • Chen, Y., Beynon, J. A., Perlov, B., Ghatare, S. P., Bolivar, A., Parikh, N., et al. (2014). Methods and apparatus for...
  • Z. Chen et al. Long-tail vocabulary dictionary extraction from the web
  • Y. Cheng et al. Research and development of domain dictionary construction system
  • Chikersal, P., Poria, S., & Cambria, E. (2015). Sentu: sentiment analysis of tweets by combining a rule-based...
  • Clement Levallois. (2016). Lists of academic stopwords. URL:https://github.com/seinecle/Stopwords (Online: accessed...
  • D. Deng et al. Topic-adaptive sentiment lexicon construction
  • D. Deng et al. Sentiment lexicon construction with hierarchical supervision topic model. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2019)
  • Q. Deng et al. Building an environmental sustainability dictionary for the it industry
  • I.M. de Diego et al. A visual framework for dynamic emotional web analysis. Knowledge-Based Systems (2018)
  • I. Donoso-Guzmán et al. An interactive relevance feedback interface for evidence-based health care
  • C. Fellbaum. Wordnet
  • N. Fiorini et al. Best match: new relevance search for pubmed. PLoS Biology (2018)
  • C.O. Freitas et al. Study of perceptual similarity between different lexicons. International Journal of Pattern Recognition and Artificial Intelligence (2004)
  • X. Fu et al. Lexicon-enhanced lstm with attention for general sentiment analysis. IEEE Access (2018)
  • G. Goeckenjan et al. Pubmed results. Pneumologie (2011)
  • L. Goeuriot et al. Sentiment lexicons for health-related opinion mining
  • C. Gormley et al. Elasticsearch: The definitive guide: A distributed real-time search and analytics engine (2015)
  • Gupta, S. (2015). Distantly supervised information extraction using bootstrapped patterns. Ph.D. thesis Stanford...
  • H. Han et al. Generate domain-specific sentiment lexicon for review sentiment analysis. Multimedia Tools and Applications (2018)
  • J. Han et al. Survey on nosql database

1 www.datasciencelab.es.