Knowledge-based framework for estimating the relevance of scientific articles

doi:10.1016/j.eswa.2020.113692

Expert Systems with Applications

Volume 161, 15 December 2020, 113692

Highlights

• Framework for automatic mentoring for scientific community.
• Relevance-based lexicon generation.
• Trending topics detection and evolution in science.
• Calculation of reputation for scientific publications.
• Evaluation of scientific publications according to their importance for community.

Abstract

The volume of papers published by the scientific community has increased drastically in recent years, and with it the range of topics these publications cover. Although the topics were usually regarded as cutting-edge when they appeared in conferences and journals, the restless evolution of science may have eroded their relative importance over the years. This poses a serious challenge to researchers who want to gather information to enrich their own background. A system able to automatically organize scientific papers and estimate their relevance should therefore play a crucial role in addressing this problem. In this paper, the Webelance framework is presented. It uses a lexicon and Machine Learning techniques to accomplish these tasks, and it has been built using metrics specific to the scientific domain to measure the relative importance of papers. Several experiments using more than 50,000 articles from the medicine domain have been carried out to illustrate the viability of the proposal. The results confirm both the usability of the system and its good performance.

Introduction

It is widely known that a very large number of documents exist all over the world. Their content is very heterogeneous and can be organized into multiple topics. Moreover, this information is usually static (e.g., videos or texts) and cannot be easily updated or upgraded. These facts can cause problems for users trying to select the right documents from which to obtain knowledge.

In the case of the scientific community, this situation has been aggravated by the growing number of journals and conferences (Ware & Mabe, 2015). As a result, outdated texts and low-quality studies are found alongside well-known topics (Shojania et al., 2007; Pattanittum et al., 2012). New scientists aiming to improve their skills and background in a specific domain therefore often run into difficulties. In these cases, the figure of a mentor who provides guidelines and performs a first filtering of the texts becomes essential (Williamson, German, Weiss, Skinner, & Bowes, 1989). Furthermore, this filtering process can be useful for scientific experts in a specific topic or for a given scientific community, as the time and effort needed to carry out research can be considerably reduced.

The relevance of a scientific article is its value and usefulness to scientists in their work. Nevertheless, discriminating texts according to their relevance has always been a hard task because of the various factors that influence the process. Some of these factors are: the initial importance of the topics considered, their evolution through time (they could become obsolete), the reputation of the authors, the domains affected and the importance of the document for the community. Notice that human judgment of some of these factors can be biased by the background, opinions and skills of the evaluator in the domains and topics considered (Kumar, 2016).

For this reason, developing a system to support and assist this process becomes a key issue. The system has to be able to objectively measure the relevance of a text in a specific domain. In addition, it should carry out broad research by studying several corpora of documents, processing the gathered information and organizing the knowledge (simulating the background and skills of a mentor). Finally, the system should include a measure to rank and discriminate the texts.

In this paper, the Webelance framework is introduced. It makes use of two main types of artifacts to measure the relevance of scientific papers: a relevance-based lexicon and Machine Learning (ML) models. The lexicon is built by processing a large number of papers belonging to a specific domain. This process extracts the concepts and also measures their relevance. The relevance metrics are based on the occurrence of the concepts and on the reputation of the papers in which they appear. This reputation is based on objective measures already used by the scientific community (Fernández-Isabel et al., 2018). Thus, the scientific community acts as the set of experts that generates the knowledge used to train the system. The ML models complete the framework by predicting the relevance of concepts not covered by the lexicon. Thus, Webelance follows a well-known workflow in the Text Mining domain (Cambria, 2016).
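As a rough illustration of the lexicon idea described above, the following minimal sketch weights each concept by how often it occurs and by the reputation of the paper it occurs in. The exact Webelance metrics are not given in this text, so the weighting formula, the function names `build_lexicon` and `score_paper`, and the toy data are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_lexicon(papers):
    """Build a concept -> relevance map from (concepts, reputation) pairs.

    Assumption: relevance is the reputation-weighted occurrence count,
    rescaled so that the most relevant concept gets 1.0.
    """
    raw = defaultdict(float)
    for concepts, reputation in papers:
        for concept, count in Counter(concepts).items():
            raw[concept] += count * reputation
    top = max(raw.values(), default=0.0)
    return {c: (v / top if top else 0.0) for c, v in raw.items()}


def score_paper(concepts, lexicon):
    """Average the lexicon relevance of the known concepts of a paper.

    Concepts missing from the lexicon are ignored here; in the framework
    described above they would be handled by the ML models instead.
    """
    known = [lexicon[c] for c in concepts if c in lexicon]
    return sum(known) / len(known) if known else 0.0


# Hypothetical toy corpus: (extracted concepts, paper reputation in [0, 1]).
toy_corpus = [
    (["immunotherapy", "tumor", "immunotherapy"], 0.9),
    (["tumor", "bloodletting"], 0.2),
]
lexicon = build_lexicon(toy_corpus)
```

In this toy corpus, a concept that appears repeatedly in a reputable paper ends up with a higher lexicon value than one that appears once in a low-reputation paper, which is the qualitative behavior the paragraph above describes.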

The experiments carried out in this paper are oriented to validate the proposal. Different values for the internal parameters of the system are configured in order to test the performance of the framework. First, an experiment with neutral values of the parameters is performed. Then, a second experiment in which the values of the parameters are tuned by the experts evaluates the improvement achieved when the domain knowledge is considered. Finally, a third experiment validates the results provided by the system over time.

The system is evaluated by means of a test battery of documents previously labeled as relevant or non-relevant by the experts. The medicine domain has been selected to conduct the experiments. This decision is motivated by three reasons. First, it is one of the most important domains for human beings (More, 2016). Second, it is constantly being updated and improved with the advances made by researchers, which implies that trends change in a short period of time and leave many manuscripts outdated (Castiglioni, 2019). Finally, it is one of the most explored fields in the scientific community (Richta, 2018), which facilitates the document gathering process required to create a wide corpus organized by year.

The rest of the paper is organized as follows. Section 2 situates the proposal in its domain, highlighting its foundations. Section 3 presents the developed framework, detailing its modules and components. Section 4 proposes several experiments in the medicine domain to illustrate the viability of the system. Section 5 concludes and provides some guidelines for future work.

Section snippets

Background

This section introduces the foundations of the Webelance framework. The first subsection overviews the concept of relevance, both by defining it and also by reviewing its evolution in the research field over the last years. Secondly, some techniques for automatic dictionary generation are introduced. Finally, given the fact that the article relevance analysis presented in this work is inspired by the classic Sentiment Analysis approach (de Diego, Fernández-Isabel, Ortega, & Moguerza, 2018),

Proposed framework

The main purpose of the Webelance framework is to estimate the relevance of scientific texts in a predefined application domain. Webelance is an expert system whose development relies on information obtained from experts in the scientific community. Thus, concepts such as the number of citations and the reputation of authors, based on their importance in the field of application, are used. Notice that the system needs other independent experts in the selected application domain to be evaluated and to show

Experiments

The medicine domain has been selected to exemplify and validate the proposal.

Given a set of documents previously assessed by field experts (i.e., medical researchers), the experiments focus both on evaluating the accuracy of the Webelance framework in estimating document relevance and on verifying the sensitivity of the overall system to variations in the values of its internal parameters.
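The kind of check described above can be sketched as follows: relevance scores are compared against expert labels, and the comparison is repeated for several values of one parameter to see how sensitive the result is. The internal parameters of Webelance are not listed in this text, so the decision threshold used here, the function name `accuracy`, and the scores and labels are hypothetical stand-ins.

```python
def accuracy(scores, labels, threshold):
    """Fraction of documents whose thresholded score matches the expert label."""
    predictions = [score >= threshold for score in scores]
    hits = sum(pred == label for pred, label in zip(predictions, labels))
    return hits / len(labels)


# Hypothetical system scores and expert labels (True = relevant).
scores = [0.82, 0.35, 0.67, 0.10]
expert_labels = [True, False, True, False]

# Sensitivity check: recompute accuracy for several values of the parameter.
sweep = {t: accuracy(scores, expert_labels, t) for t in (0.3, 0.5, 0.7)}
```

A flat `sweep` would suggest the result is robust to the parameter, while large swings would suggest that tuning by domain experts matters.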

Four different groups of 10 documents (for the first and second experiments) and 5 specific documents related to

Conclusions

This paper has introduced the Webelance framework to provide a solution to an existing open issue in the scientific community: to estimate the relevance of scientific articles. The relevance (i.e., the degree to which something is related or useful to what is happening or being talked about) is a subjective concept, and therefore, it is almost impossible to learn exactly at any given time. However, there are many objective measures used to estimate it. These estimations are objective, but they

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Alberto Fernández-Isabel: Methodology, Writing - original draft, Writing - review & editing. Adrián A. Barriuso: Data curation, Software, Writing - original draft. Javier Cabezas: Investigation, Writing - original draft. Isaac Martín de Diego: Supervision, Writing - review & editing. J.F.J. Viseu Pinheiro: Data curation, Conceptualization, Validation.

Acknowledgments

Research supported by a grant from the Spanish Ministry of Economy and Competitiveness, under the Retos-Colaboración program: SABERMED (Ref: RTC-2017-6253-1); medical corpus provided by MMG; and donation of the Titan V GPU by NVIDIA Corporation.

References (87)

G. Abramo et al. Predicting publication long-term impact through a combination of early citations and journal impact factor. Journal of Informetrics (2019)

L. Averell et al. The form of the forgetting curve and the fate of memories. Journal of Mathematical Psychology (2011)

F. Benedetti et al. Computing inter-document similarity with context semantic analysis. Information Systems (2019)

R.L. Cecchini et al. Topic relevance and diversity in information retrieval from large datasets: A multi-objective evolutionary algorithm approach. Applied Soft Computing (2018)

K. Chen et al. Turning from tf-idf to tf-igm for term weighting in text classification. Expert Systems with Applications (2016)

A. Fernández-Isabel et al. A unified knowledge compiler to provide support the scientific community. Knowledge-Based Systems (2018)

S. Garrido-Jurado et al. Generation of fiducial marker dictionaries using mixed integer linear programming. Pattern Recognition (2016)

S.M. Rezaeinia et al. Sentiment analysis based on improved pre-trained word embeddings. Expert Systems with Applications (2019)

S. Zhang et al. Sentiment analysis of chinese micro-blog text based on extended sentiment dictionary. Future Generation Computer Systems (2018)

J. Akram et al. Lexicon and heuristics based approach for identification of emotion in text

M. Alfano et al. Development and practical use of a medical vocabulary-thesaurus-dictionary for patient empowerment

Allen Institute for Artificial Intelligence and Semantic Scholar (2018). Semantic Scholar API....

M. Anandarajan et al. Term-document representation

Baccianella, S., Esuli, A., & Sebastiani, F. (2010). Sentiwordnet 3.0: an enhanced lexical resource for sentiment...

A. Bandhakavi et al. Lexicon generation for emotion detection from text. IEEE Intelligent Systems (2017)

K. Bhavsar et al. Natural language processing with python cookbook: Over 60 recipes to implement text analytics solutions using deep learning principles (2017)

P.A. Bloching et al. Assessing the scientific relevance of a single publication over time. South African Journal of Science (2013)

D.B. Bracewell. Semi-automatic creation of an emotion dictionary using wordnet and its evaluation

E. Cambria. Affective computing and sentiment analysis. IEEE Intelligent Systems (2016)

Cambria, E., Poria, S., Gelbukh, A., & Kwok, K. (2014). Sentic api: a common-sense based api for concept-level...

A. Castiglioni. A history of medicine (2019)

Chandramouli, A. (2018). Domain-specific stopword removal from unstructured computer text using a neural network. US...

Q. Chen et al. Sentence similarity measures revisited: Ranking sentences in pubmed documents

Chen, Y., Beynon, J. A., Perlov, B., Ghatare, S. P., Bolivar, A., Parikh, N., et al. (2014). Methods and apparatus for...

Z. Chen et al. Long-tail vocabulary dictionary extraction from the web

Y. Cheng et al. Research and development of domain dictionary construction system

Chikersal, P., Poria, S., & Cambria, E. (2015). Sentu: sentiment analysis of tweets by combining a rule-based...

Clement Levallois. (2016). Lists of academic stopwords. URL:https://github.com/seinecle/Stopwords (Online: accessed...

D. Deng et al. Topic-adaptive sentiment lexicon construction

D. Deng et al. Sentiment lexicon construction with hierarchical supervision topic model. IEEE/ACM Transactions on Audio, Speech, and Language Processing (2019)

Q. Deng et al. Building an environmental sustainability dictionary for the it industry

I.M. de Diego et al. A visual framework for dynamic emotional web analysis. Knowledge-Based Systems (2018)

I. Donoso-Guzmán et al. An interactive relevance feedback interface for evidence-based health care

C. Fellbaum. Wordnet

N. Fiorini et al. Best match: new relevance search for pubmed. PLoS Biology (2018)

C.O. Freitas et al. Study of perceptual similarity between different lexicons. International Journal of Pattern Recognition and Artificial Intelligence (2004)

X. Fu et al. Lexicon-enhanced lstm with attention for general sentiment analysis. IEEE Access (2018)

G. Goeckenjan et al. Pubmed results. Pneumologie (2011)

L. Goeuriot et al. Sentiment lexicons for health-related opinion mining

C. Gormley et al. Elasticsearch: The definitive guide: A distributed real-time search and analytics engine (2015)

Gupta, S. (2015). Distantly supervised information extraction using bootstrapped patterns. Ph.D. thesis Stanford...

H. Han et al. Generate domain-specific sentiment lexicon for review sentiment analysis. Multimedia Tools and Applications (2018)

J. Han et al. Survey on nosql database

¹: www.datasciencelab.es.
