An effective High Recall Retrieval method

https://doi.org/10.1016/j.datak.2017.07.006

Abstract

The High Recall Retrieval (HRR) problem is a fundamental task in many applications such as patent retrieval, legal search, medical search, marketing research, tax assessment and collection, and literature review. Given the data set obtained by a user's query, the HRR problem is to find the full set of relevant documents while requiring as little review effort as possible. Reviewing a large number of documents is expensive, since most reviewers are experts in specific fields, such as patent attorneys, lawyers, and marketing or medical professionals. However, the existing HRR methods are far from satisfactory at enumerating all relevant documents. This is because the sheer volume of documents inevitably includes noise (non-relevant documents), and because threshold measurements have been inadequately adopted. To deal with these problems, we propose a novel solution that efficiently finds all the relevant documents among a large set of results. It consists of two steps: (a) effectively classifying the entire set of documents and (b) selecting the representative documents in each class. We formalize the problem and theoretically verify the upper bound of our method. In the experiments, our method is more efficient than the state-of-the-art query expansion methods.

Introduction

High precision has traditionally been the goal of information retrieval systems, which aim to find the most relevant targets; it may even be acceptable to omit some of the best results. In Patent IR (PaIR), however, recall is the critical issue, since a prior-art search should be conducted without exception before making a full-scale business investment decision. A patent that is missed in the search procedure, and the infringement of that missing patent, can create an enormous risk, typically including a huge settlement cost, product indemnification, and the cost of a degraded corporate identity value.

The High Recall Retrieval (HRR) problem is a fundamental task in many applications such as patent retrieval, legal search, medical search, marketing research, tax assessment and collection, and literature review. These can be exemplified by situations such as a patent examiner who needs to identify all relevant patents; a lawyer who needs to find every piece of evidence related to his/her case among documents under a legal hold; a scientist who does not want to miss any prior work related to his/her ongoing research; or the National Tax Service, which must impose duties on every taxpayer without exception.

Conventional information retrieval systems have not satisfied the HRR problem; their main purpose has long been to maximize precision. In the HRR problem, to reduce the review effort for users while not missing any relevant results, HRR experts have inevitably resorted to query expansion techniques, in which queries consist of many keywords and go through many reformulation steps. This process requires tedious effort from domain experts, because the retrieval quality depends on their ability to skillfully construct query variations and then to laboriously investigate the retrieved documents. Unfortunately, the larger the target collection, the more cumbersome the investigation effort and the lower the resulting quality.

The keyword-based approaches of traditional IR methodologies have not been successful on patent databases. The main reasons can be summarized as follows: (1) new keywords are coined in each and every patent, since a patent is, by definition, the invention of a new idea or new technology; (2) patents are written to intentionally hide their core keywords, so that they are retrieved by keyword search engines as rarely as possible; and (3) similarly, patents try to hide important concepts by using 'common keywords' shuffled with many noise keywords. Thus, traditional keyword-based IR technologies, as well as cutting-edge ones, fall far short of patent retrieval requirements.

As an example, consider Fig. 1 (nodes only; the links will be discussed in the next paragraph) with a total of six documents, and assume that node 6 is the relevant document. The conventional relevance feedback approach, the Rocchio method [1], will determine that nodes 1, 2, and 3 are individually "not relevant". Accordingly, the IR system will look for the node farthest from the average of nodes 1, 2, and 3 (0.67, 0.67, 0.67), excluding documents already judged, so node 4 will be selected; it is also not relevant, so the process is repeated and no relevant document is found until all nodes have been enumerated. This means that a Rocchio-based IR system checks all documents independently and cannot find the right solution, since the relevance feedback returns an inexact threshold. Essentially, this phenomenon comes from the non-relevant document decision governed by the parameter γ in the Rocchio formula.
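
To make this enumeration behaviour concrete, the following minimal sketch replays the scenario in code. The six document vectors are hypothetical stand-ins for the nodes of Fig. 1 (the figure is not reproduced here); they are chosen only so that the centroid of nodes 1, 2, and 3 equals (0.67, 0.67, 0.67), as stated above, and node 6 plays the role of the only relevant document.

```python
import numpy as np

# Hypothetical feature vectors for the six nodes of Fig. 1; only the centroid
# of nodes 1-3, (0.67, 0.67, 0.67), is taken from the text above.
docs = {
    1: np.array([0.0, 1.0, 1.0]),
    2: np.array([1.0, 0.0, 1.0]),
    3: np.array([1.0, 1.0, 0.0]),
    4: np.array([0.0, 0.0, 1.0]),
    5: np.array([0.5, 0.5, 0.5]),
    6: np.array([0.8, 0.5, 0.5]),   # assumed to be the only relevant document
}

judged_nonrelevant = [1, 2, 3]
centroid = np.mean([docs[d] for d in judged_nonrelevant], axis=0)

# Mimic the gamma (non-relevant) term of the Rocchio update: the next document
# offered for review is the unjudged one farthest from the non-relevant centroid.
candidates = [d for d in docs if d not in judged_nonrelevant]
next_doc = max(candidates, key=lambda d: np.linalg.norm(docs[d] - centroid))
print(np.round(centroid, 2), next_doc)   # [0.67 0.67 0.67] 4 -> node 4, again non-relevant
```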

In contrast, the method we propose is able to find relevant documents as quickly as possible by focusing on central (representative) nodes and a distance threshold. Now consider Fig. 1 as a graph model (nodes and links together) and assume that distances are calculated using the edit distance; our method is not restricted to a particular distance metric. Suppose that the distance threshold is ε = 0.5, that node 5 is the central node, and that the distances between selected pairs of nodes, d(1,4), d(1,5), and d(5,6), are 0.33, 0.5, and 0.1, respectively (d(1,4) = (|0−0|+|0−1|+|1−1|)/3 = 0.33, d(1,5) = (|0−0.5|+|1−0.5|+|1−0.5|)/3 = 0.5, d(5,6) = 0.1). Based on the given threshold and the computed distances, nodes 1 and 6 are not identifiable from each other, since the distance between them does not satisfy the threshold: d(1,6) = 0.6 > 0.5. Thus node 6 (the relevant document) is found without checking all documents independently, since it is identifiable from the central node (node 5).
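
The distance computations above can be checked with a few lines of code. The coordinates assigned to nodes 1, 5, and 6 below are assumptions, chosen only so that the quoted values d(1,5) = 0.5, d(5,6) = 0.1, and d(1,6) = 0.6 are reproduced under a normalized L1 distance; as noted above, any other distance metric could be plugged in.

```python
def dist(a, b):
    """Normalized L1 distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

# Hypothetical coordinates, chosen only to reproduce the distances quoted above.
nodes = {
    1: (0.0, 1.0, 1.0),   # a document already judged non-relevant
    5: (0.5, 0.5, 0.5),   # the central (representative) node
    6: (0.8, 0.5, 0.5),   # the relevant document we want to reach
}
eps = 0.5  # distance threshold

print(round(dist(nodes[1], nodes[5]), 2))   # 0.5  -> node 1 lies within the threshold of node 5
print(round(dist(nodes[5], nodes[6]), 2))   # 0.1  -> node 6 is close to the central node
print(dist(nodes[1], nodes[6]) > eps)       # True -> nodes 1 and 6 are not identifiable from each other
```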

Our approach is the first to address the HRR problem with respect to 100% recall. Existing research has mostly focused on increasing the recall value, but it has not been applied to the HRR problem, nor has it tackled the reduction of examination costs. In this study, we solve the HRR problem and verify our method by its examination cost, that is, the metric of how quickly all relevant documents are detected.

The HRR problem has been addressed with supervised learning that separates relevant documents from non-relevant documents by bisecting the document hyperplane. In supervised learning, each document in a pre-selected set (the "training set") is labeled as relevant or not and used to train a machine-learning algorithm, which then classifies or ranks the documents in a corpus (the "test set") according to their likelihood of relevance. The fundamental limitation of the supervised learning approach [2] is that it is valid only in the binary case of "relevant or not", so the multiple topics of patents cannot be covered. Multiple topics and multiple categories are a prerequisite for real patents, since almost all patents are relevant to multiple IPC (International Patent Classification) codes. Another weakness is that, for high-recall tasks, the opinion of an expert is required, giving his/her personal judgment of relevance; however, obtaining authoritative opinions for even a small set of documents may fluctuate, or may incur unacceptable cost and time.
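
As a concrete illustration of the binary setup just described (not the specific system of [2]), the sketch below trains a linear SVM on a tiny made-up labeled corpus and ranks unseen documents by their signed distance to the separating hyperplane; all documents and labels are hypothetical, and scikit-learn is used only for convenience.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Hypothetical training corpus with binary relevance labels (the limitation
# discussed above: only "relevant or not", no multiple topics or IPC codes).
train_docs = ["battery electrode lithium cell",
              "wireless antenna transmission power",
              "lithium ion battery charging circuit",
              "image sensor pixel array readout"]
train_labels = [1, 0, 1, 0]            # 1 = relevant, 0 = non-relevant

test_docs = ["battery cell electrode material",
             "antenna array beamforming"]

vec = TfidfVectorizer()
X_train = vec.fit_transform(train_docs)
X_test = vec.transform(test_docs)

# Train the classifier and rank test documents by their estimated likelihood
# of relevance (signed distance to the separating hyperplane).
clf = LinearSVC().fit(X_train, train_labels)
for doc, score in sorted(zip(test_docs, clf.decision_function(X_test)),
                         key=lambda t: -t[1]):
    print(f"{score:+.3f}  {doc}")
```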

To the best of our knowledge, this kind of research is unprecedented. The closest work to ours is ReQ–ReC (ReQuery–ReClassify) [3]. That research considered a scenario in which a searcher requires both high precision and high recall from an interactive retrieval process. When accessing the entire data set, an active learning loop is used to ask for additional relevance feedback labels to refine the classifier. The model uses a representational relevance feedback method, Rocchio [1], together with a machine learning method, the Support Vector Machine (SVM). The method is restricted by its kernel function with respect to multiple dimensions, whereas HRR targets such as patent documents are definitely related to multiple classifications. ReQ–ReC, however, is only valid for a single classification, which is not realistic.
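
The sketch below is a schematic, runnable rendering of the kind of query/classify loop described above; it is not the authors' ReQ–ReC implementation, and it simplifies the re-query step to certainty sampling with a logistic regression classifier over a toy tf-idf corpus. The collection, the hidden oracle labels, and the number of feedback rounds are all made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical collection and hidden (oracle) relevance judgments.
corpus = ["lithium battery electrode", "battery charging circuit",
          "antenna beamforming array", "image sensor readout",
          "battery cell separator", "wireless power transmission"]
hidden_labels = [1, 1, 0, 0, 1, 0]

X = TfidfVectorizer().fit_transform(corpus)
labeled = {0: 1, 3: 0}            # seed judgments: one relevant, one non-relevant

for _ in range(3):                # three interactive feedback rounds
    idx = list(labeled)
    clf = LogisticRegression().fit(X[idx], [labeled[i] for i in idx])
    unlabeled = [i for i in range(len(corpus)) if i not in labeled]
    if not unlabeled:
        break
    # Re-query: ask the reviewer (oracle) for a label on the unlabeled document
    # the current classifier believes is most likely to be relevant ...
    pick = unlabeled[int(np.argmax(clf.predict_proba(X[unlabeled])[:, 1]))]
    labeled[pick] = hidden_labels[pick]
    # ... and re-classify: the next round re-trains on the enlarged label set.

print(sorted(labeled.items()))    # documents judged so far, with their labels
```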

Contributions: To overcome the difficulties mentioned above, we propose a dynamic and effective method for the HRR problem. Our main contributions are summarized as follows:

  • We develop an effective dynamic retrieval technique for the HRR problem.

  • We formalize a diverse retrieval method based on a graph-theoretic approach.

  • We provide an efficient algorithm that removes documents which cannot be among the k relevant ones, minimizing the reviewing time.

  • The benefits of the above features are verified through experiments on various datasets.

Organization: The rest of this paper is organized as follows. An overview of the High Recall Retrieval framework and its key components is given in Section 2. High Recall Retrieval with a Single Step (HRR 1) and High Recall Retrieval with a Double Step (HRR 2) are described in Sections 3 and 4, respectively. Section 5 describes the evaluation metrics for high recall retrieval. Experimental studies are given in Section 6. We discuss related work in Section 7. Finally, we conclude the paper in Section 8.

Problem statements

Let D = {d₁, d₂, …, dₙ} denote the result set obtained by a user query, and let D̃ = {d̃₁, d̃₂, …, d̃ₖ} be the set of relevant documents, where d̃ᵢ ∈ D̃, D̃ ⊆ D, and 1 ≤ k ≤ n. For each dᵢ ∈ D, the relevancy of the document is unknown unless the document is reviewed. The relevance score of dᵢ is denoted rel(dᵢ), which equals 1 if dᵢ is a relevant document and 0 otherwise.

Problem 1 (High Recall Retrieval Problem)

Given the data set D, high recall retrieval is represented by S(D); then |S(D)| is the number of documents that must be reviewed in order to obtain the k relevant documents, for k ≤ l
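
A minimal sketch of the quantity |S(D)| follows, under the assumption that documents are reviewed in the order a retrieval system presents them and that reviewing stops as soon as the k relevant documents have been found; the ranking and the relevance labels are hypothetical.

```python
def review_effort(ranked_docs, rel, k):
    """Return |S(D)|: how many documents are reviewed until k relevant ones are found."""
    found = 0
    for i, d in enumerate(ranked_docs, start=1):
        found += rel(d)                  # rel(d) is 1 if d is relevant, 0 otherwise
        if found == k:
            return i
    return len(ranked_docs)              # not all k relevant documents were found

relevant = {"d2", "d5"}                  # hypothetical ground truth (k = 2)
ranking = ["d1", "d2", "d3", "d4", "d5", "d6"]
print(review_effort(ranking, lambda d: int(d in relevant), k=2))   # -> 5
```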

High Recall Retrieval with a single-step (HRR 1)

Before presenting our main HRR 2 (High Recall Retrieval with a Double Step) method, we first describe a conceptually simple scheme for a better understanding. We call this scheme HRR 1 (High Recall Retrieval with a Single Step); it works well for small data sets. We then present HRR 2 in Section 4, which is applicable to large-scale datasets.

The HRR 1 method (in Algorithm 1) consists of two stages. The first is to compute the Minimum ε-ReS Diverse set within the entire collection, which is the
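
The full description of this stage is not included in this excerpt. Under the assumption that the Minimum ε-ReS Diverse set acts as an ε-cover of the collection (every document lies within distance ε of some chosen representative), a greedy selection might look like the sketch below; this is an illustrative guess rather than the authors' algorithm, and the distance function and vectors are placeholders reusing the Fig. 1 toy data.

```python
def l1_dist(a, b):
    """Normalized L1 distance, as in the Fig. 1 example."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def greedy_eps_representatives(vectors, eps, dist):
    """Greedily pick representatives until every document is within eps of one of them."""
    uncovered = set(vectors)
    reps = []
    while uncovered:
        rep = min(uncovered)             # pick any uncovered document (lowest id here)
        reps.append(rep)
        uncovered = {u for u in uncovered if dist(vectors[rep], vectors[u]) > eps}
    return reps

# Hypothetical vectors reusing the Fig. 1 toy data from the introduction.
vectors = {1: (0.0, 1.0, 1.0), 2: (1.0, 0.0, 1.0), 3: (1.0, 1.0, 0.0),
           4: (0.0, 0.0, 1.0), 5: (0.5, 0.5, 0.5), 6: (0.8, 0.5, 0.5)}
print(greedy_eps_representatives(vectors, eps=0.5, dist=l1_dist))   # [1, 2, 3]
```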

High Recall Retrieval with a double-step (HRR 2)

As mentioned in Section 3, HRR 1 is not applicable to large-scale datasets. The size of the ReS is often too large, which leads to a very expensive review and collection process. To tackle this problem, we introduce High Recall Retrieval with a Double Step (HRR 2), which partitions a data graph into multiple clusters. Many clustering methods have been proposed in the literature (e.g., [10], [11], [12], [13]). Since our main focus is on the retrieval process and representative sampling, we will not
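
As a schematic sketch of the double-step idea, the code below first partitions a toy collection into clusters and then takes the member nearest each centroid as that cluster's representative for review. The clustering algorithm (k-means here) and the vectors are placeholders; as noted above, the paper does not commit to a particular clustering method.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical document vectors (the Fig. 1 toy data again, rows 0-5).
X = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 1.0], [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0], [0.5, 0.5, 0.5], [0.8, 0.5, 0.5]])

# Step 1: partition the collection into clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: pick the member nearest each centroid as the cluster representative,
# i.e. the document offered for review on behalf of its cluster.
representatives = []
for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    nearest = members[np.argmin(np.linalg.norm(X[members] - km.cluster_centers_[c], axis=1))]
    representatives.append(int(nearest))
print(representatives)
```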

Evaluation metrics

The simplest evaluation measure for assessing retrieval performance is recall. The problem with recall alone, however, is that it fails to reflect how early a system retrieves the relevant documents, so the amount of user review effort cannot be counted. Table 2 shows an illustrative example of how different metrics behave for four different IR systems when a collection is searched with a given query. In this case, there are four relevant documents in the results, and it is assumed
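
The sketch below illustrates why recall alone is insufficient: two hypothetical ranked lists reach the same recall, but the position at which the last relevant document appears, and hence the review effort, differs greatly. The rankings and judgments are made up and are not the systems of Table 2.

```python
def recall(ranked, relevant):
    """Fraction of the relevant documents appearing anywhere in the ranking."""
    return len(set(ranked) & relevant) / len(relevant)

def effort_to_full_recall(ranked, relevant):
    """Number of documents reviewed before every relevant one has been seen."""
    found = set()
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            found.add(d)
            if found == relevant:
                return i
    return None                          # full recall is never reached

relevant = {"r1", "r2", "r3", "r4"}      # four relevant documents
system_a = ["r1", "r2", "r3", "r4", "n1", "n2", "n3", "n4"]
system_b = ["n1", "n2", "n3", "r1", "r2", "n4", "r3", "r4"]

for name, run in [("A", system_a), ("B", system_b)]:
    print(name, recall(run, relevant), effort_to_full_recall(run, relevant))
# Both rankings reach recall 1.0, but A needs 4 reviews and B needs 8.
```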

Experiment results

In this section, we present empirical experiments that evaluate the effectiveness of HRR 2 compared with ReQ–ReC (ReQuery–ReClassify) and RF (Relevance Feedback). ReQ–ReC [3] is the state-of-the-art query expansion method (Rocchio) combined with an SVM. RF is the Rocchio relevance feedback method [1]. Two data sets (Yeast and 20-newsgroup) were used in the evaluation. The criterion for selecting suitable datasets is that the whole data set must contain relevant information, and this is because

Related work

Research on patent information retrieval is mainly divided into patent search and patent analysis. The first, patent search, is concerned with finding all filed patents relevant to a given patent application. The queries in patent retrieval are typically very long, since they take the form of a patent claim, or even a full patent application in the case of a prior-art patent search. These lines of research are called query formulation and query expansion [16], [17], [18].

The second is patent

Conclusion

We presented algorithms named HRR 1 and HRR 2, which suit the high recall retrieval problem without sacrificing precision and which minimize the review effort. We also theoretically proved that our approach can reduce the upper bound effectively. Given a certain precision, the effort to achieve the full recall level can be managed so that the HRR 1 and HRR 2 algorithms find the most promising region and decide to move to the next promising region dynamically based on the

Acknowledgment

This work was supported by Inha University.

Justin JongSu Song received his B.Sc. and M.Sc. in Industrial Engineering, with high honors, from Inha University, Korea, in 2012. He is currently a Ph.D. candidate at Inha University. His research interests include the Team Formation Problem, Social Networks, Information Retrieval, and Patent Analysis.

References (36)

  • Sebastiani, F. Machine learning in automated text categorization. ACM Comput. Surv. (2002)

  • Li, C., et al. ReQ-ReC: High recall retrieval with query pooling and interactive classification.

  • Garey, M.R., et al. Computers and Intractability: A Guide to the Theory of NP-Completeness (1990)

  • Berge, C., et al. Graphs and Hypergraphs, vol. 7 (1973)

  • Blidia, M., et al. Bounds on the k-independence and k-chromatic numbers of graphs. Ars Combin. (2014)

  • Bollobás, B., et al. Graph-theoretic parameters concerning domination, independence, and irredundance. J. Graph Theory (1979)

  • Drineas, P., et al. Clustering large graphs via the singular value decomposition. Mach. Learn. (2004)

  • Croft, W. Cluster-based retrieval using language models. Inf. Retr. (2004)

Wookey Lee received the B.S., M.S., and Ph.D. degrees from Seoul National University, Korea, and the M.S.E. degree from Carnegie Mellon University, USA. He is currently a Professor at Inha University, Korea. He has served as a chair and PC member for many conferences, such as CIKM, DASFAA, IEEE DEST, VLDB, BigComp, and EDB. He is currently one of the Executive Committee members of IEEE TCDE. He won best paper awards at IEEE TCSC, KORMS, and KIISE. He is the Editor-in-Chief of the Journal of Information Technology and Architecture and an associate editor of the WWW Journal. His research interests include Cyber–Physical Systems, Graph and Mobile Systems, Data Anonymization, and Patent Information.

Jafar Afshar was born in Tehran, Iran, in 1985. He received the B.Sc. (2009) and M.Sc. (2014) degrees in Industrial Engineering from Azad University, Iran, and Universiti Teknologi Malaysia, Malaysia, respectively. He is currently a Ph.D. candidate at Inha University. His research interests lie in the Team Formation Problem, Information Retrieval, Patent Analysis, and Social Networks.
