Graph based model for information retrieval using a stochastic local search

doi:10.1016/j.patrec.2017.09.019

Pattern Recognition Letters

Volume 105, 1 April 2018, Pages 234-239

https://doi.org/10.1016/j.patrec.2017.09.019

Highlights

• Graph based model for information retrieval.
• Stochastic local search (SLS) for index construction.
• SLS method to extract subgraphs.
• SLS model versus Cosine model for information retrieval.
• SLS graph based model performance on the CACM collection.

Abstract

Graphs have become increasingly important for modeling complicated structures and data such as chemical compounds and social networks. Recent advances in machine learning research have produced a number of graph-based models, and information retrieval research has benefited as well, since many of these models have been validated on different information retrieval tasks. In this paper, we investigate the issues of indexing graphs and define a novel solution that applies a stochastic local search (SLS) method to extract the subgraphs used in the information retrieval process. To reduce the size of the index, we take into account the size of the query and the set of frequent subgraphs. In other words, the subgraphs used to create the index have the same size as the query, which reduces the search space and the execution time as much as possible. The proposed method is evaluated on the CACM collection, which contains the titles, authors and abstracts (where available) of a set of scientific articles, together with a set of queries and relevance judgments, and is compared to the vector-based cosine model proposed in the literature. Our method is able to discover frequent subgraphs that serve to build the index and to return relevant documents as the output of the information retrieval process. The numerical results show that our method provides competitive results and finds high-quality solutions (documents) compared to the relevant documents listed for the CACM collection.

Introduction

The development of information and communication technologies and the growing number of sectors of human activity have resulted in the production of an unprecedented volume of information: database sizes have increased in the past few years from a few gigabytes to a thousand exabytes [9]. In addition, the progressive interconnection of sites via large computer networks such as the Internet, as well as the standardization of the techniques for accessing and producing this information (URL, HTML, XML, ...), makes documents available to any Internet user.

With so much information available, finding the relevant part of it becomes difficult. This difficulty has given rise to several information retrieval tools that aim to help users find the relevant information they are looking for. These include tools that search by keyword (search engines), by theme (thematic search tools), by region (geographic search tools), or through several search engines at once (meta-engines).

Graphs have become increasingly important in modeling complicated structures and schemaless data such as social networks [1], chemical molecule structures [2] and XML documents [15]. A large number of such databases are available on the Web. Data mining and search methods for structured data are needed for users to quickly identify a small subset of relevant data for further analysis and experiments.

The classical graph query problem can be described as follows: given a graph database D = {G₁, G₂, ..., Gₙ} and a graph query q, find all the graphs in which q is a subgraph. The core of the problem is the complexity of subgraph isomorphism [5]: a sequential scan is very costly because subgraph isomorphism is NP-complete [14].
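To make the cost of the sequential scan concrete, the following is a minimal Python sketch of the problem statement above, not the paper's method. It assumes small, unlabeled, undirected graphs without self-loops, given as (node list, edge list) pairs; the function names `is_subgraph` and `sequential_scan` are illustrative. The containment test tries every injective node mapping, which is why it becomes impractical as the query grows.

```python
from itertools import permutations

def is_subgraph(q_nodes, q_edges, g_nodes, g_edges):
    """Brute-force test: is there an injective mapping of q's nodes into g's
    nodes that sends every edge of q onto an edge of g? The number of
    mappings tried grows factorially with the number of query nodes."""
    g_edge_set = {frozenset(e) for e in g_edges}
    q_nodes = list(q_nodes)
    for image in permutations(g_nodes, len(q_nodes)):
        mapping = dict(zip(q_nodes, image))
        if all(frozenset((mapping[u], mapping[v])) in g_edge_set
               for u, v in q_edges):
            return True
    return False

def sequential_scan(database, query):
    """Return the positions of all graphs in `database` that contain `query`."""
    q_nodes, q_edges = query
    return [i for i, (g_nodes, g_edges) in enumerate(database)
            if is_subgraph(q_nodes, q_edges, g_nodes, g_edges)]

# Toy database: the first graph is a triangle, the second a path on three nodes.
D = [
    (["a", "b", "c"], [("a", "b"), ("b", "c"), ("a", "c")]),
    (["x", "y", "z"], [("x", "y"), ("y", "z")]),
]
triangle = ([1, 2, 3], [(1, 2), (2, 3), (1, 3)])
path = ([1, 2, 3], [(1, 2), (2, 3)])

print(sequential_scan(D, triangle))
print(sequential_scan(D, path))
```

Every graph in the database is tested against the query, so the exponential containment test is paid once per graph. An index is meant to shrink the set of graphs that need this test.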

In this work, we propose a new model based on the stochastic local search (SLS) [10] meta-heuristic. The proposed SLS is used to extract certain subgraphs (frequent subgraphs [8]) from the set of graphs in the database, and these subgraphs are used to build the index. The extraction process is based on the query size and the support of the subgraphs. Then, a graph index is built. Finally, for a given subgraph query, all the indexed subgraphs of the query are determined, and the index is looked up with these subgraphs to obtain a candidate set of graphs containing the indexed subgraphs. The concept of query size is introduced to reduce the complexity of index construction. The proposed method for the graph querying problem is evaluated on the CACM collection.
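The extraction step can be illustrated with a generic stochastic local search. The sketch below is an assumption-laden illustration, not the authors' algorithm: this section does not give the paper's neighborhood, objective, or parameters. Here each graph is a set of edges between term labels, a candidate is a set of `size` edges (connectivity is not enforced, a simplification), its support is the number of graphs containing all of its edges, and `walk_prob`, `max_iters` and `seed` are hypothetical parameters.

```python
import random

def support(candidate, database):
    """Number of graphs whose edge set contains every edge of `candidate`."""
    return sum(1 for g in database if candidate <= g)

def sls_frequent_subgraph(database, size, max_iters=200, walk_prob=0.3, seed=0):
    """Search for a set of `size` edges with high support.

    Each step swaps one edge of the current candidate for an unused edge:
    a random swap with probability `walk_prob` (diversification), otherwise
    the swap giving the highest support (intensification). The best
    candidate seen so far is remembered and returned.
    """
    rng = random.Random(seed)
    all_edges = sorted(set().union(*database))
    if not 1 <= size <= len(all_edges):
        raise ValueError("size must be between 1 and the number of distinct edges")
    current = set(rng.sample(all_edges, size))
    best, best_support = set(current), support(current, database)
    for _ in range(max_iters):
        unused = [e for e in all_edges if e not in current]
        if not unused:
            break  # the candidate already uses every edge; nothing to swap
        swaps = [(out_e, in_e) for out_e in sorted(current) for in_e in unused]
        if rng.random() < walk_prob:
            out_e, in_e = rng.choice(swaps)
        else:
            out_e, in_e = max(
                swaps,
                key=lambda s: support((current - {s[0]}) | {s[1]}, database),
            )
        current = (current - {out_e}) | {in_e}
        current_support = support(current, database)
        if current_support > best_support:
            best, best_support = set(current), current_support
    return best, best_support

# Toy database: each graph is a set of edges, each edge a sorted pair of terms.
toy_db = [
    {("graph", "index"), ("graph", "search"), ("index", "query")},
    {("graph", "index"), ("graph", "search"), ("local", "search")},
    {("graph", "index"), ("graph", "search")},
    {("index", "query")},
]
best, best_support = sls_frequent_subgraph(toy_db, size=2)
print(best, best_support)
```

Fixing `size` to the query size mirrors the idea stated above: only subgraphs as large as the query are explored, which keeps the search space small.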

The paper is organized as follows. Section 2 defines the graph query problem. Section 3 describes the proposed method. Section 4 discusses the results of an experimental study. Finally, Section 5 provides conclusions and future work.

Section snippets

Definitions and problem formulation

In this section, we first give some basic definitions and then describe our graph query problem.

Proposed approach

In this paper, we propose an index construction method for the problem of graph query. In the following, we detail our proposed method.
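As a rough illustration of how such an index could be built and queried, the Python sketch below maps each extracted subgraph to the set of graphs that contain it, then intersects those sets for the indexed subgraphs found in a query. The data layout (graphs and features as edge sets) and the names `build_index` and `candidates` are assumptions made for this example; the paper's actual index structure is not shown in this section.

```python
def build_index(database, features):
    """Map each feature (a frozenset of edges) to the ids of graphs containing it."""
    return {f: {i for i, g in enumerate(database) if f <= g} for f in features}

def candidates(index, query, n_graphs):
    """Intersect the id sets of every indexed feature contained in the query.

    With no matching feature, nothing can be pruned and every graph stays
    a candidate.
    """
    result = set(range(n_graphs))
    for feature, graph_ids in index.items():
        if feature <= query:
            result &= graph_ids
    return result

# Toy database: each graph is a set of edges between term labels.
db = [
    {("graph", "index"), ("graph", "search"), ("index", "query")},
    {("graph", "index"), ("graph", "search"), ("local", "search")},
    {("graph", "index"), ("graph", "search")},
    {("index", "query")},
]
features = [
    frozenset({("graph", "index"), ("graph", "search")}),
    frozenset({("index", "query")}),
]
index = build_index(db, features)

query = {("graph", "index"), ("graph", "search"), ("index", "query")}
print(candidates(index, query, len(db)))
```

The result is only a candidate set: graphs in it contain the indexed subgraphs of the query, and a final containment check against the full query would still be needed.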

Experimental study

To evaluate the performance of our method, we implemented the proposed algorithm in Java and ran it on a Windows machine with an i5-4570 3.20 GHz processor and 4 GB of RAM.

The developed algorithm was tested on 52 queries, and the results were compared to the relevance judgments given in the qrels.txt file attached to the CACM collection. The evaluation of the final result is based on the classical IR metrics, precision (p) and recall (r):

• Recall rate: measures the ability of an Information Retrieval System to
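For reference, the two classical metrics named above can be computed from a set of retrieved documents and a set of relevant documents. This is a minimal Python sketch using the standard textbook definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant); the function name and the document ids are illustrative and are not taken from the CACM judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    Both arguments are collections of document ids. An empty retrieved or
    relevant set yields 0.0 for the corresponding metric.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 judged relevant, 2 in common.
p, r = precision_recall([1, 2, 3, 4], [2, 4, 5])
print(p, r)
```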

Conclusion

In this study, we have presented a graph-based IR system. We proposed an index construction method based on frequent subgraphs and query size. It applies an SLS method to extract subgraphs, which are then used to reduce the search space of the information retrieval process. Our method offers several advantages, such as saving running time and eliminating irrelevant information. Experiments on synthetic data show that the developed algorithm provides good

References (15)

B. Chen et al., Temporal and social network based blogging behavior prediction in blogspace, Proc. ICDM (2007)
B. Sun et al., Extraction and search of chemical formulae in text documents on the web, Proc. WWW (2007)
D. Cook et al., Graph-based data mining, IEEE Intell. Syst. (2000)
R. Diestel, Graph Theory (2005)
S. Fortin, The Graph Isomorphism Problem, Technical Report TR96-20 (1996)
J. Savoy, D. Vrajitoru, Evaluation of learning schemes used in information retrieval,...
J. Yang et al., Query improvement in information retrieval using genetic algorithms: a report on the experiments of the TREC project, Proceedings of the 1st Text Retrieval Conference (TREC-1) (1993)

There are more references available in the full text version of this article.

Cited by (10)

Document-level relation extraction via graph transformer networks and temporal convolutional networks
2021, Pattern Recognition Letters
Citation Excerpt :
It can predict whether a relation is included in the given text or not, and which relation class is contained in the given ontology indicated by the text [7]. RE is an important task of information retrieval in natural language processing, which has attracted extensive attentions [8,21,23,24,33,44,45] and can be used for many applications including machine reading comprehension [4,30], question answering [32], and text generation [20]. Most existing studies for RE focused on extracting entity relationships from a single input sentence and have made great progress in improving the inference capability and anti-noise ability [21,44,46].
Relation Extraction (RE) aims at extracting meaningful relation facts between entities in texts. It is an important semantic processing task in the field of natural language processing (NLP) and has many applications. Traditional RE focuses on extracting entity relationships from a single input sentence. Recently, the research scope has been extended from sentence level to document level. However, compared with sentence-level RE, document-level RE, which needs to identify the inter-sentence relations from entities scattered in different sentences, is more complex and still lacks of solutions. To solve this problem, we propose a novel document-level RE method based on Heterogeneous Graph Neural Networks in this paper. Concretely, to obtain token embeddings containing long-distance dependency signals well, we encode the document with Temporal Convolutional Networks, whose dilated convolution and residual structure allow the effective and efficient preservation of historical information. To better describe the interaction between different elements, we construct the input documents as heterogeneous graphs with different node and edge types and utilize Graph Transformer Networks to generate semantic paths. Numerical experiments on two document-level biomedical datasets demonstrate the effectiveness of the proposed method.
DeepCADRME: A deep neural model for complex adverse drug reaction mentions extraction
2021, Pattern Recognition Letters
Citation Excerpt :
The task of extracting ADR mentions can be considered as a biomedical named entity recognition (BNER) problem [14]. The BNER has shown a growing interest in many text mining applications such as information retrieval [8] and question answering [22–26]. The detection of ADR mentions is one of the most important task of ADR systems as the overall performance of such systems is heavily depending on the effectiveness of the integrated ADR mentions extraction system: if an ADR mentions extraction system fails to identify ADR mentions, further processing steps to extract potential relationships between them will inevitably fail too.
Extracting mentions of Adverse Drug Reaction (ADR) from biomedical texts, aiming to support pharmacovigilance and drug safety surveillance, remains a challenging task as many ADR mentions are nested, discontinuous and overlapping. To solve these issues, in this paper, we propose a deep neural model for Complex Adverse Drug Reaction Mentions Extraction, called DeepCADRME. It first transforms the ADR mentions extraction problem as an N-level tagging sequence. Then, it feeds the sequences to an N-level model based on contextual embeddings where the output of the pre-trained model of the current level is used to build a new deep contextualized representation for the next level. This allows the DeepCADRME system to transfer knowledge between levels. Experimental results performed on the TAC 2017 ADR dataset, show the effectiveness of DeepCADRME which leads to a new state-of-the-art performance by reaching a F1 of 85.35% and 85.41% with and without mention types, respectively. The evaluation results also highlight the benefits of exploring language model to effectively extract different types of ADR mentions.
Folksonomy-based user profile enrichment using clustering and community recommended tags in multiple levels
2018, Neurocomputing
Citation Excerpt :
The precision of information retrieved by search engines starts decreasing as they are inefficient to handle such a big volume of data and satisfy user information need. Farhi and Boughaci [2] designed an approach for information retrieval where stochastic modeling was used to extract the desired subgraph from a large web graph at a comparatively low computational cost. But even after embedding the approach of Farhi and Boughaci to the search engines, key issue remains the same.
Folksonomy (aka Collaborative tagging) systems provide a platform to the users where they can annotate a web resource by using any tag of interest. It is a first-hand information directly given by user without any middleman modification, therefore, it is more reliable than any other means. This paper proposes a novel methodology to construct a strong User Interest Profile (UIP) by exploiting user’s own activities and other activities occurring in user’s social network. UIP will provide a complete list of user preferences along with his level of interest in that preference. The proposed methodology is different from other strategies used for UIP enrichment as user’s own tags are not enough to construct a strong UIP. In the current research work, two strategies have been employed for the enrichment of UIP. First one is clustering of tags based on the concept of semantic relatedness between two tags in the real world. This has been measured using Word2vec model. The second one is the utilization of user’s real friendship network. It is believed that the present work is the first one to integrate the concept of semantic relatedness for tag clustering. The performance of proposed methodology has been evaluated on the basis of evaluation metrics i.e. MRR, imp, completeness and P@k using a dataset of del.icio.us. To analyse the impact of parameters, similarity measure and number of clusters in cluster set, on the performance of UIP constructed by proposed methodology extensive experiments are performed. The results reveal that the proposed methodology outperforms all the state of the art methodologies in terms of accurate and efficient UIP construction for every value of the parameters under consideration.
Information Retrieval in XML Document: State of the Art
2024, Lecture Notes in Networks and Systems
Optimization of the results of a multilingual search engine using a fuzzy recommendation approach
2023, Journal of Information and Organizational Sciences
How Can Graph Neural Networks Help Document Retrieval: A Case Study on CORD19 with Concept Map Generation
2022, arXiv
