1 Introduction

Patent search is an active sub-domain of the research field known as information retrieval (IR; Tait 2008). A common task in patent IR is the prior-art search. This type of search is usually performed by individuals who need to ensure originality before applying for, or granting, a new patent. These individuals use IR systems to search databases containing previously filed patents. The entire patent application, or some subset of words extracted from it, is typically used as the query (Mahdabi et al. 2012; Piroi et al. 2011; Xue and Croft 2009).

At the time of writing, there are various state-of-the-art patent IR systems (e.g., Becks et al. 2011; Lopez and Romary 2010; Magdy and Jones 2010; Mahdabi et al. 2011). All of these systems use single query representations of the patent application. In this paper, we describe an approach to prior-art search that uses multiple query representations. Given a patent application, we generate a set of similar queries. Each of these queries is an alternative representation of the information contained in the application. This set of queries is submitted to an IR engine. We treat each batch of search results as a set of ‘ratings’. We analyse these ratings using collaborative filtering (CF) algorithms. Subsequently, we merge this pseudo-collaborative feedback with a set of standard IR results to achieve a final document ranking.

The remainder of this paper is organized as follows. In Sect. 2, we summarise related work from the fields of patent IR, collaborative filtering and data fusion. In Sect. 3, we describe a CF-based implementation of patent prior-art search. In Sect. 4, we discuss an extension of our technique called iterative refinement. Section 5 documents the experiments we used to evaluate our algorithms. Section 6 presents our results. Section 7 concludes the paper and proposes future work.

2 Related work

2.1 Patent IR

One of the defining challenges in patent IR is the problem of representing a long, technical document as a query. Early systems mimicked the approach taken by professional patent examiners, who (at the time) valued high frequency words as query terms (Itoh et al. 2003; Iwayama et al. 2003). Recently, use of the entire patent application (or a large set of terms automatically extracted from it) has become popular. Automatically extracting appropriate query terms from a patent application is a difficult task. These documents usually contain a large volume of text sectioned into multiple fields (e.g. Title, Abstract, Description, Claims, etc.). This being the case, how do we extract the ‘right’ terms, and which field(s) do we extract those terms from?

Xue and Croft (2009) examined query terms taken from different fields of a patent application. In an experiment using data published by the United States Patent and Trademark Office (USPTO), they found the best performance was obtained using high frequency terms extracted from the raw text of the Description field. These results were subsequently confirmed by other research teams. Magdy et al. (2011) produced the second best run of CLEF-IP 2010 (Piroi 2010) using patent numbers extracted from the Description field, and Mahdabi et al. confirmed this finding when experimenting with Language Models (LM; Mahdabi et al. 2011, 2012).

The status of phrases in patent IR is somewhat uncertain. One study has suggested that retrieval performance can be improved by including noun phrases (obtained via a global analysis of the patent corpus) in the query (Mahdabi et al. 2012). Another study, using a different patent collection (CLEF 2011 rather than CLEF 2010), found quite the opposite (Becks et al. 2011).

There are several key differences between patent search and conventional IR. Patent queries, which typically contain several hundred terms, are obviously much longer than standard IR queries. This makes high precision retrieval very difficult, as the information need is quite diffuse. Techniques which work well in conventional search do not always translate gracefully to patent IR. Pseudo-relevance feedback (PRF), for example, performs very poorly in this particular context (Ganguly et al. 2011; Mahdabi et al. 2012). In patent IR, precision is relatively low even in the top-ranking results. Expanding a query using terms extracted from top-ranked patents tends to produce additional noise, rather than focus the information need.

2.2 Collaborative filtering

Collaborative filtering is a technique commonly used by commercial recommender systems. Recommender systems make predictions about the likelihood that a user u will like an item i. A prerequisite for this operation is a matrix relating items to ratings (Shardanand and Maes 1995). These ratings are awarded by u and his/her peers (i.e., the user community). Assuming the availability of this matrix, the recommendation process works as follows:

  • Find the subset of users whose ratings of other items agree with the ratings awarded by u

  • Use ratings awarded by like-minded users to predict items for u

Given a large enough matrix, this process quickly becomes computationally expensive. There are a number of memory- and model-based algorithms designed to optimise the process. Memory-based algorithms [e.g., item-based and user-based systems (Resnick et al. 1994; Sarwar et al. 2001)] exploit the whole matrix when computing predictions. Generally, these predictions are calculated from the ratings of neighbours (i.e. users or items that are similar to the active user/item). In contrast, model-driven techniques make predictions based on user behaviour models. The parameters of the models are estimated offline. Techniques exploiting singular value decomposition (SVD; Billsus and Pazzani 1998) and probabilistic methods [e.g., latent class models (Hofmann 2004)] are common in this context.

A number of CF algorithms use graph-based analysis to calculate item predictions. A common approach involves modelling the users as nodes in an undirected weighted graph, wherein edges represent the degree of similarity between users based on rating activity (Aggarwal et al. 1999; Luo et al. 2008). There are a number of variations on this basic pattern. For example, Wang et al. proposed a recommendation scheme based on item graphs (Wang et al. 2006). In this scheme, items are nodes and edges represent pairwise item relationships. Huang et al. advanced this idea, proposing a bipartite graph comprising item nodes and user nodes (Huang et al. 2004). In this scheme, ratings are modelled as links connecting nodes from the disjoint sets. Transitive associations between the nodes are subsequently used to generate item predictions.

It is worth noting that memory- and model-based algorithms both experience difficulties when the ratings matrix is sparsely populated. Accurately recommending products to new users (i.e., the ‘cold start’ problem) is also challenging (Cacheda et al. 2011). In the past, collaborative feedback algorithms have been combined effectively with conventional IR models (Zhou et al. 2013). Researchers have exploited IR rankings and click-through logs to improve the performance of CF algorithms (Cao et al. 2010; Liu and Yang 2008; Weimer et al. 2007). However, to date, there has been no attempt to combine CF algorithms and IR models in the field of patent search.

2.3 Data fusion algorithms

Our work combines multiple result sets retrieved using alternative query representations (i.e., data fusion). There are two general approaches to fusing search results. The first approach is unsupervised. Shaw and Fox have proposed a number of successful algorithms in this context, including CombSUM and CombMNZ (Shaw and Fox 1994). Other unsupervised algorithms, developed for monolingual and multilingual search, include CombRSV, CombRSVNorm (Powell et al. 2000) and CORI (Callan et al. 1995; see also Savoy 2004, 2005). The second approach to data fusion is supervised (Sheldon et al. 2011; Si and Callan 2005; Tsai et al. 2008). The supervised technique involves two steps. In step one, the quality of various result sets is ‘learnt’ from relevance judgements. In step two, unseen result sets are merged using predictions based on step one. Optional pre-processing (e.g., systemic bias, query ‘gating’) may be applied during this stage (Sheldon et al. 2011; Si and Callan 2005; Tsai et al. 2008).

3 Collaborative patent prior-art search (CPAC)

In this section, we explain our approach to patent prior-art search. We begin with an explanation of our notation. Our technique deals with a finite set of queries, \(Q=\{q_a, q_1, q_2 \ldots q_n\}\), and a finite set of documents \(D=\{d_1, d_2 \ldots d_m\}\) aggregated from documents retrieved by Q. Each query \(q\in Q\) is associated with a profile, which consists of a set of documents retrieved by submitting that query to a standard IR engine, \(D_q\subseteq D\), and the corresponding retrieval scores. Note that we treat these retrieval scores as CF ratings. These ratings, denoted R, will always correspond to real numbers. The first query we send to the IR system is denoted \(q_a\). The subset of queries that have retrieved a certain document d is defined as \(Q_d\subseteq Q\). Note that q and d (used in the subscript) vary over the sets \(\{q_a, q_1, q_2 \ldots q_n\}\) and \(\{d_1, d_2 \ldots d_m\}\).

Using the query profiles, we construct a rating matrix V. V will contain |Q| rows and |D| columns. Each element of V, \(v_{qd}\in R\cup \varnothing\), denotes the rating given by query \(q\in Q\) to document \(d\in D\). A value of \(\varnothing\) for \(v_{qd}\) indicates that the query q has not yet retrieved the document d. We process this matrix using a CF algorithm (see Algorithm 2). The goal of this algorithm is to predict the value v for documents which have not been retrieved. Let us denote the prediction for \(d\in D\) by query \(q\in Q\) as \(p_{qd}\in R\cup \varnothing\) (\(p_{ad}\) for \(q_a\)). If our CF algorithm is not able to make this prediction, then we set \(p_{qd}=\varnothing\). For later use, we define the subset of document ratings for the query q as \(v_{q\cdot}=\{v_{qd}\in V \mid d\in D_q\}\), and the subset of query ratings for the document d as \(v_{\cdot d}=\{v_{qd}\in V \mid q\in Q_d\}\). We also denote the document mean rating for a query q as \(\overline{v_{q\cdot}}\) (\(\overline{v_{a\cdot}}\) for \(q_a\)), and the query mean rating for a document d as \(\overline{v_{\cdot d}}\).

Now, assume our system receives a single query \(q_a\). First, we obtain a set of results for this query from a standard IR engine. Next, we generate a set of queries \(Q'=\{q_1, q_2 \ldots q_n\}\) similar to \(q_a\). Each of these auto-generated queries is an alternative representation of the information contained in \(q_a\) (see Sect. 3.2 for the query representations used in this study). We retrieve the top-ranked documents for each query in Q (where \(Q \leftarrow \{q_a\} \cup Q'\)), using this data to construct our ratings matrix. This process is described in Algorithm 1. In this algorithm, x is the number of top-ranked documents we retrieve for each query q, and RATE() returns the score from the IR engine. Note that the top-ranked documents \(\{d_1, \ldots, d_x\}\) are likely to be different for each \(q_i \in Q\). We cache the documents returned by IR-RETRIEVE(), together with the scores, for later use.

Algorithm 1 Construct the rating matrix V from the top-x results of each query in Q
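To make the construction step concrete, the following Python sketch mirrors Algorithm 1 under stated assumptions: the helper ir_retrieve(q, x) is hypothetical and stands in for a call to the underlying IR engine returning the top-x (document, score) pairs, and the returned scores play the role of RATE().

```python
def build_rating_matrix(q_a, similar_queries, ir_retrieve, x=10):
    """Populate the sparse rating matrix V (sketch of Algorithm 1).

    ir_retrieve(q, x) is an assumed helper returning the top-x
    (doc_id, score) pairs for query q; the scores are treated as
    CF 'ratings'. Missing entries of V stand for the empty rating.
    """
    queries = [q_a] + list(similar_queries)      # Q = {q_a} union Q'
    V = {}                                       # V[q][d] = rating
    for q in queries:
        V[q] = {}
        for doc_id, score in ir_retrieve(q, x):  # cached top-x results per query
            V[q][doc_id] = score                 # RATE(): the IR engine's score
    return V
```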
$$V=\left[\begin{array}{lllllll} & d_1 & d_2 & d_3 & d_4 & d_5 & d_6\\ q_a & \times & \times & \diamond & \diamond & \diamond & \diamond\\ q_1 & \diamond & \times & \times & \diamond & \diamond & \diamond\\ q_2 & \diamond & \diamond & \times & \times & \diamond & \diamond\\ q_3 & \times & \diamond & \diamond & \diamond & \diamond & \times \end{array}\right]$$

An example of a populated matrix is shown above. In this matrix, \(d_i\) denotes the sequence number of the document in the entire corpus. Here, we assume that x = 2. The RATE() function in Algorithm 1 returns real numbers, but we have replaced all real numbers with a symbol (×) to simplify the diagram. A (⋄) symbol indicates that a query \(q_i\) has not retrieved that specific document \(d_i\).

Having built the rating matrix V, we predict the relevance \(p_{ad}\) of each document \(d \in D\) to \(q_a\). This procedure is described in Algorithm 2. In this algorithm, we iterate through all documents in D (excluding those documents which were retrieved using the original query \(q_a\)) to produce a vector of predictions \(\overrightarrow{p_{a\cdot}}\). These predictions are calculated using a collaborative filtering algorithm (Cacheda et al. 2011). In our experiment, we try four different CF algorithms, as follows:

User-based

$$p_{ad} = \overline{v_{a\cdot}} + \sigma_a \frac{\sum_{q\in neigh_a} \left[\frac{v_{qd}-\overline{v_{q\cdot}}}{\sigma_q}\,s(a,q)\right]}{\sum_{q\in neigh_a}s(a,q)}$$

Item-based

$$p_{ad} = \frac{\sum_{d^{\prime}}s(d^{\prime},d)\,v_{ad^{\prime}}}{\sum_{d^{\prime}}|s(d^{\prime},d)|}$$

SVD

$$p_{ad} = \overline{v_{a\cdot}}+U_\gamma\cdot\sqrt{S_\gamma}^T(a)\cdot \sqrt{S_\gamma}\cdot R_\gamma^T(d)$$

SlopeOne

$$p_{ad} = \frac{\sum_{d^{\prime} \in D_a\setminus\left\{ d \right\}}\left(\sum_{x\in S_{dd^{\prime}}}\frac{v_{xd}-v_{xd^{\prime}}}{|S_{dd^{\prime}}|}+v_{ad^{\prime}}\right)|S_{dd^{\prime}}|}{\sum_{d^{\prime} \in D_a\setminus\left\{ d \right\}}|S_{dd^{\prime}}|}$$

Algorithm 2 Compute the CF prediction vector \(\overrightarrow{p_{a\cdot}}\) for documents not retrieved by \(q_a\)
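As a rough illustration of Algorithm 2, the sketch below iterates over the aggregated document set D and fills the prediction vector; the predict argument is a placeholder for any of the CF predictors described in this section (for example the user-based sketch given later), and the dict-of-dicts matrix follows the Algorithm 1 sketch above.

```python
def predict_all(V, a, predict):
    """Sketch of Algorithm 2: CF predictions for documents q_a has not retrieved.

    V: dict-of-dicts rating matrix, a: the original query q_a,
    predict(V, a, d): any CF predictor returning a score or None.
    Returns p_a. as a dict doc_id -> prediction (empty predictions omitted).
    """
    all_docs = {d for q in V for d in V[q]}   # D, aggregated over all queries
    predictions = {}
    for d in all_docs - set(V[a]):            # skip documents already retrieved by q_a
        p = predict(V, a, d)
        if p is not None:                     # None stands for p_ad = empty
            predictions[d] = p
    return predictions
```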

In the user-based algorithm, we use Pearson’s correlation coefficient to measure the similarity between \(q_a\) and \(q\in Q\) (denoted as s(a, q)) as follows:

$$s(a,q)= \frac{\sum_{d\in D_{a}\cap D_{q}}(v_{ad}-\overline{v_{a\cdot}})(v_{qd}-\overline{v_{q\cdot}})}{\sqrt{\sum_{d\in D_{a}\cap D_{q}}(v_{ad}-\overline{v_{a\cdot}})^{2}\sum_{d\in D_{a}\cap D_{q}}(v_{qd}-\overline{v_{q\cdot}})^{2}}}$$

where \(D_a\) denotes the documents retrieved for \(q_a\) and \(D_q\) denotes the documents retrieved for \(q\in Q\). After calculating the similarity between different queries, we calculate predictions by considering the contribution of each neighbour in the matrix (\(neigh_a\)), weighted by its similarity to \(q_a\). We use the technique suggested by Herlocker et al. (2002), taking into account the mean \(\overline{v_{a\cdot}}\), as well as the standard deviations \(\sigma_a\) and \(\sigma_q\) of the ratings for the queries \(q_a\) and q in Q. Similarly, we define the similarity between different documents (denoted as s(d′, d)) for the item-based algorithm as:

$$s(d^{\prime},d)= \frac{\sum_{q\in Q}(v_{qd}-\overline{v_{q\cdot}})(v_{qd^{\prime}}-\overline{v_{q\cdot}})}{\sqrt{\sum_{q\in Q}(v_{qd}-\overline{v_{q\cdot}})^{2}\sum_{q\in Q}(v_{qd^{\prime}}-\overline{v_{q\cdot}})^{2}}}$$
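A minimal sketch of the user-based variant (the item-based variant is analogous, with the roles of queries and documents swapped), assuming the dict-of-dicts rating matrix used in the earlier sketches. It computes Pearson's correlation s(a, q) and the mean- and deviation-normalised prediction; using absolute similarities in the denominator and skipping explicit neighbourhood pruning are implementation choices, not details taken from the text.

```python
import math

def pearson(V, a, q):
    """s(a, q): Pearson correlation over documents rated by both queries."""
    common = set(V[a]) & set(V[q])
    if len(common) < 2:
        return 0.0
    mean_a = sum(V[a].values()) / len(V[a])
    mean_q = sum(V[q].values()) / len(V[q])
    num = sum((V[a][d] - mean_a) * (V[q][d] - mean_q) for d in common)
    den = math.sqrt(sum((V[a][d] - mean_a) ** 2 for d in common) *
                    sum((V[q][d] - mean_q) ** 2 for d in common))
    return num / den if den else 0.0

def predict_user_based(V, a, d):
    """p_ad: deviation-normalised weighted average over neighbours that rated d."""
    mean_a = sum(V[a].values()) / len(V[a])
    sigma_a = math.sqrt(sum((v - mean_a) ** 2 for v in V[a].values()) / len(V[a]))
    num = den = 0.0
    for q in V:
        if q == a or d not in V[q]:
            continue
        mean_q = sum(V[q].values()) / len(V[q])
        sigma_q = math.sqrt(sum((v - mean_q) ** 2 for v in V[q].values()) / len(V[q])) or 1.0
        s = pearson(V, a, q)
        num += ((V[q][d] - mean_q) / sigma_q) * s
        den += abs(s)                      # absolute weights: a common variant
    if den == 0.0:
        return None                        # stands for p_ad = empty
    return mean_a + sigma_a * num / den
```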

In the weighted SlopeOne algorithm, \(S_{dd'}\) is the set of queries that have ‘rated’ both documents d and d′. In the SVD algorithm [referred to as ‘LSI/SVD’ in Cacheda et al. (2011)], we use a matrix factorization technique that converts V into three matrices:

$$V=U \cdot S \cdot R^T$$

where U and R are orthogonal matrices, and S is a diagonal matrix of size k × k (where k is the rank of V). This matrix is iteratively reduced by discarding the smallest singular values, to produce a matrix \(S_\gamma\) with γ < k. The reconstructed matrix, \(V_\gamma=U_\gamma \cdot S_\gamma \cdot R_\gamma^T\), is the best rank-γ approximation of the rating matrix V. We calculate CF predictions from this (reduced dimension) matrix using the formula stated above.
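A small numpy sketch of the rank-γ reconstruction, assuming the empty entries of V have already been filled (for example with each query's mean rating, a pre-processing choice that is not specified in the text):

```python
import numpy as np

def svd_predictions(V_filled, gamma):
    """Rank-gamma reconstruction V_gamma = U_gamma . S_gamma . R_gamma^T (sketch).

    V_filled: |Q| x |D| array with the empty ratings pre-filled.
    The entries of the reconstructed matrix serve as CF predictions.
    """
    U, s, Rt = np.linalg.svd(V_filled, full_matrices=False)
    U_g, s_g, Rt_g = U[:, :gamma], s[:gamma], Rt[:gamma, :]  # keep the gamma largest singular values
    return U_g @ np.diag(s_g) @ Rt_g
```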

We choose these four CF algorithms because they are very popular and have produced good results (Cacheda et al. 2011). Other CF algorithms could be used instead. Whichever algorithm is used, the output of this stage will be another ranked set of documents. In the final stage of our procedure, we fuse a set of IR-generated results with the CF-generated results. This procedure is described in Algorithm 3, where we combine the documents returned for \(q_a\) by the IR engine with the vector of predictions produced in Algorithm 2. We tried a number of combination methods. CombRSVNorm seems to work best in this context (see further Sect. 6.1). It is usually defined in the following way:

$$\mathrm{COMBRSVNORM}=\sum_i\left[\frac{RSV_i-\mathrm{MIN}_{RSV}}{\mathrm{MAX}_{RSV}-\mathrm{MIN}_{RSV}}\right]$$

where RSV denotes the retrieval status value (i.e., the score). Now we have three sets of document rankings (IR scores, CF scores and COMBRSVNORM scores). We sort the documents using all three scores (sort precedence as listed above, descending order) to produce a final ranking.

Algorithm 3 Fuse the IR results for \(q_a\) with the CF predictions to produce the final ranking
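The fusion step could be sketched as follows, with ir_scores and cf_scores as dicts mapping document identifiers to the IR scores for \(q_a\) and the CF predictions from Algorithm 2; treating missing scores as 0 and the exact tie-breaking order are illustrative assumptions rather than details fixed by the text.

```python
def minmax(scores):
    """CombRSVNorm-style normalisation: (RSV - MIN) / (MAX - MIN)."""
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def fuse_and_rank(ir_scores, cf_scores):
    """Sketch of Algorithm 3: CombRSVNorm fusion of IR and CF scores for q_a."""
    ir_n, cf_n = minmax(ir_scores), minmax(cf_scores)
    docs = set(ir_n) | set(cf_n)
    fused = {d: ir_n.get(d, 0.0) + cf_n.get(d, 0.0) for d in docs}
    # sort by IR score, then CF score, then fused score (all descending),
    # following the precedence listed in the text
    return sorted(docs,
                  key=lambda d: (ir_n.get(d, 0.0), cf_n.get(d, 0.0), fused[d]),
                  reverse=True)
```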

3.1 Possible weaknesses of our technique

As mentioned above, CF algorithms have two known weaknesses:

  1. Sparsity—In a typical recommender system, most users will rate only a small subset of the available items. This means that most of the cells in the rating matrix will be empty.

  2. Cold start—CF algorithms struggle to generate recommendations for users recently introduced into the system.

Neither problem affects our technique. We can ensure that the matrix is sufficiently populated by manipulating the size of the top-ranked results lists (via parameter x). And ‘cold start’ does not occur because ‘similar’ queries are auto-generated (see below).

3.2 Query representations

In this section, we introduce the various query representations used in our technique. The first query representation, denoted ALL, is the Description field of a full-length patent application (stop words and numbers removed, terms stemmed). Note that the stop word list used is not patent-specific (unlike Becks et al. 2011; Mahdabi et al. 2011) and phrases are ignored.

The second query representation, denoted LM, adopts the unigram model proposed by Mahdabi et al. (2012). Applying the unigram model involves estimating the importance of each term in the patent application according to a weighted log-likelihood approach, as follows:

$$P(t|q_{LM})=Z_tP(t|\Uptheta_q)\log\frac{P(t|\Uptheta_q)}{P(t|\Uptheta_C)}$$

where \(Z_t = 1/\sum_t P(t|q)\) is the normalization factor. \(\Uptheta_q\) is a model describing the query language. \(\Uptheta_C\) is a model describing the language used in the corpus. These models are defined in the following way:

$$P(t|\Uptheta_q)=(1-\lambda)\cdot P_{ML}(t|d)+\lambda\cdot P_{ML}(t|C)$$

where the maximum likelihood estimation of a term t in a document d, \(P_{ML}(t|d)\), is defined as \(\frac{n(t,d)}{\sum_{t'}n(t',d)}\). \(0<\lambda<1\) is a parameter used to control the influence of each estimation, and n(t, d) is the term frequency of term t in document d.
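A sketch of the term-selection step for the LM representation, assuming the corpus model \(P(t|\Uptheta_C)\) is approximated by a maximum-likelihood estimate over the collection; the value of λ and the number of selected terms are illustrative defaults (Sect. 5.5 reports 50 terms).

```python
import math
from collections import Counter

def lm_query_terms(doc_terms, corpus_counts, corpus_size, lam=0.5, k=50):
    """Weighted log-likelihood term selection for the LM representation (sketch).

    doc_terms: pre-processed terms of the Description field;
    corpus_counts / corpus_size: collection term counts and total token count.
    """
    tf = Counter(doc_terms)
    doc_len = sum(tf.values())
    scores = {}
    for t, n in tf.items():
        p_ml_d = n / doc_len                            # P_ML(t|d)
        p_ml_c = corpus_counts.get(t, 0) / corpus_size  # corpus estimate for P(t|Theta_C)
        if p_ml_c == 0.0:
            continue
        p_q = (1 - lam) * p_ml_d + lam * p_ml_c         # P(t|Theta_q)
        scores[t] = p_q * math.log(p_q / p_ml_c)        # un-normalised weighted log-likelihood
    # Z_t is constant for a given query, so it does not change the term ranking
    return sorted(scores, key=scores.get, reverse=True)[:k]
```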

The third query representation, denoted LMIPC, considers International Patent Classifications (IPC) (Mahdabi et al. 2012). We build a relevance model \(\Uptheta_{LMIPC}\) specifically for this purpose. The resulting model is defined as:

$$P(t|q_{LMIPC})=(1-\lambda)\cdot P(t|\Uptheta_{LMIPC})+\lambda\cdot P(t|q_{LM})$$

where \(P(t|\Uptheta_{LMIPC})\) is calculated using:

$$P(t|\Uptheta_{LMIPC})=\sum_{d\in LMIPC}P(t|d)\cdot P(d|\Uptheta_{LMIPC})$$

and

$$P(d|\Uptheta_{LMIPC})=Z_d\sum_{t}P(t|\Uptheta_d)\log \frac{P(t|\Uptheta_{LMIPC})}{P(t|\Uptheta_C)}$$

where \(Z_d=1/\sum_{d \in LMIPC}P(d|\Uptheta_{LMIPC})\) is a document-specific normalization factor. Next, we have three query representations exploiting standard IR weighting schemes, denoted TF, TFIDF, and BM25 respectively:

$$P(t|q_{TF})=\frac{n(t,d)}{\max_{t'}n(t',d)}$$
$$P(t|q_{TFIDF})=\frac{n(t,d)}{\max_{t'}n(t',d)}\cdot\log\frac{|D|}{df_{t}}$$
$$P(t|q_{BM25})=w_{t}\frac{(k_1+1)n(t,d)}{K+n(t,d)}\cdot\frac{(k_3+1)n(t,d)}{k_3+n(t,d)}$$

where |D| is the total number of documents, \(df_t\) is the document frequency of term t, \(w_t=\log\frac{|D|-df_t+0.5}{df_t+0.5}\) is the inverse document frequency weight of term t and \(K=k_1\cdot((1-w_b)+w_b\cdot\frac{|d|}{avg|d|})\) includes the impact of the average document length. \(k_1\), \(k_3\) and \(w_b\) are free parameters (see Sect. 5.5 for the values used in this experiment). The final query representation, denoted UFT, is the raw text of the Description field minus unit frequency terms (i.e., terms which occur only once in the patent query).
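For illustration, a sketch of the three weighting schemes applied to one patent query, with parameter values taken from Sect. 5.5; the input dictionaries tf_d and df, and the per-term output layout, are assumptions made for the example.

```python
import math

def term_weights(tf_d, df, n_docs, doc_len, avg_doc_len,
                 k1=1.2, w_b=0.75, k3=7.0):
    """TF, TFIDF and BM25 weights for the terms of a patent query d (sketch).

    tf_d: dict term -> n(t, d); df: dict term -> document frequency.
    """
    max_tf = max(tf_d.values())
    K = k1 * ((1 - w_b) + w_b * doc_len / avg_doc_len)
    weights = {}
    for t, n in tf_d.items():
        tf = n / max_tf                                         # P(t|q_TF)
        tfidf = tf * math.log(n_docs / df[t])                   # P(t|q_TFIDF)
        w_t = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))  # IDF weight of t
        bm25 = w_t * ((k1 + 1) * n / (K + n)) * ((k3 + 1) * n / (k3 + n))
        weights[t] = {"TF": tf, "TFIDF": tfidf, "BM25": bm25}
    return weights
```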

We chose these query representations because they are popular and produce good results. Other query representations could be substituted. Throughout, we followed standard procedures when calculating the top weighted terms in the patent (Becks et al. 2011; Ganguly et al. 2011; Mahdabi et al. 2012; Piroi 2010).

4 Iterative refinement

In this section, we describe a method which improves the basic performance of CPAC. This method assumes that CF-generated results and IR-generated results mutually reinforce one another. Updating one set of scores should iteratively propagate to the other set of scores via pairwise document relationships (i.e., associations created when the same document is rated by different query representations). To exploit these relationships, we adjust the CF-generated scores and IR-generated scores using a function which regularizes the smoothness of document associations over a connected graph. These document associations are easy to model within our CF framework. We construct an undirected weighted graph describing documents that are ‘rated’ by queries. In this graph, nodes represent documents and edges represent pairwise document relationships.

Let G = (D, E) be a connected graph, wherein the nodes D correspond to the |D| documents rated/retrieved by the different queries, and the edges E correspond to the pairwise relationships between documents. The weights on these edges are calculated using the ‘ratings’ assigned by queries, derived by multiplying the transpose of the ratings matrix V (\(V^T\)) with itself (\(V^T V\)). Further, assume an n × n symmetric weight matrix B on the edges of the graph, so that \(b_{ij}\) denotes the weight between documents \(d_i\) and \(d_j\). We further define M as a diagonal matrix with entries

$$M_{ii}=\sum_j b_{ij}$$

We also define an n × 2 matrix F with

$$F=\left[\,\overrightarrow{p_{a\cdot}}\;\;\overrightarrow{IR_{a\cdot}}\,\right]$$

where \(\overrightarrow{IR_{a\cdot}}\) is the vector of IR scores retrieved for \(q_a\), and f(a, d) denotes the rating that \(q_a\) assigns to d.

Thereafter, we develop a regularization framework for adjusting the CF-generated and IR-generated scores. Formally, the cost function ℜ(F, a, G) in a joint regularization framework is defined as:

$$\Re(F,a,G) = \frac{1}{2}\sum_{i,j=1}^{n} b_{ij} \left\| \frac{f(a,d_i)}{\sqrt{M_{ii}}} - \frac{f(a,d_j)}{\sqrt{M_{jj}}} \right\|^{2} + \mu \sum_{i=1}^{n} \left\| f(a,d_i) - f^{0}(a,d_i) \right\|^{2}$$

where μ > 0 is the regularisation parameter, \(f^0(a,d_i)\) is the initial rating of the test query \(q_a\) for the document \(d_i\), and F and \(F^0\) are the refined matrix and the initial matrix, respectively. The first term on the right-hand side of the cost function is the global consistency constraint. This constraint ensures that the weighting function does not change too much between nearby points. In our setting, ‘nearby points’ are documents that are strongly connected in the graph, so their refined scores should remain close. The second term on the right-hand side of the cost function is the fitting constraint. This constraint ensures that the ratings assigned to documents fit the initial ratings. The trade-off between these two constraints is controlled by the parameter μ.

Given the above, the final weighting function is defined as:

$$F^*= \mathop{\arg\min}_{F \in \mathcal{F}} \Re(F,a,G)$$

where arg min stands for the argument of the minimum, \(\mathcal{F}\) denotes the set of n × 2 matrices and \(F\in \mathcal{F}\). After simplification, we can derive the following closed-form solution:

$$F^*= \mu_2(I-\mu_1S)^{-1}F^0$$

where:

$$\mu_1=\frac{1}{1+\mu}$$
$$\mu_2=\frac{\mu}{1+\mu}$$
$$S=M^{-\frac{1}{2}}BM^{-\frac{1}{2}}$$

and I is an identity matrix (see further Zhou et al. 2004; Zhu et al. 2003). Note that S is the symmetrically normalized weight matrix (the associated normalized graph Laplacian is I - S). Given the refined weighting matrix F, we can extract the refined \(\overrightarrow{p_{a\cdot}}\) and \(\overrightarrow{IR_{a\cdot}}\) scores. This refinement method is described in Algorithm 4. In the iteration step of this algorithm, each node receives information from its neighbours while retaining its initial information. When the iterate F(s) converges, it is equivalent to the closed-form solution F* [refer to Zhou et al. (2004) for proof].

Algorithm 4 Iterative refinement of the CF-generated and IR-generated scores
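A numpy sketch of the closed-form refinement, assuming the rating matrix V is given as a dense array with missing ratings set to 0 and F0 stacks the CF and IR score vectors column-wise; removing self-links from B is an implementation choice not specified in the text.

```python
import numpy as np

def refine_scores(V, F0, mu=0.99):
    """Closed-form solution F* = mu2 (I - mu1 S)^(-1) F0 (Algorithm 4 sketch).

    V:  |Q| x |D| rating matrix (missing ratings as 0);
    F0: |D| x 2 matrix holding the initial CF and IR score vectors.
    """
    B = V.T @ V                              # pairwise document weights b_ij
    np.fill_diagonal(B, 0.0)                 # drop self-links (implementation choice)
    degrees = np.maximum(B.sum(axis=1), 1e-12)
    M_inv_sqrt = np.diag(1.0 / np.sqrt(degrees))
    S = M_inv_sqrt @ B @ M_inv_sqrt          # S = M^(-1/2) B M^(-1/2)
    mu1, mu2 = 1.0 / (1.0 + mu), mu / (1.0 + mu)
    n = B.shape[0]
    return mu2 * np.linalg.solve(np.eye(n) - mu1 * S, F0)
```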

5 Evaluation

In this section, we describe a series of experiments designed to answer the following questions:

  1. Does our technique outperform state-of-the-art patent IR systems?

  2. How effective is the refinement method described in Sect. 4?

  3. Which query representation is most effective?

  4. Which collaborative filtering algorithm performs best?

5.1 Experimental data

The text corpus used in our evaluation was built using the CLEF-IP 2011 test collection. This collection contains 3.5 million XML-encoded patent documents, relating to approximately 1.5 million individual patents. These documents were extracted from the MAREC data corpus. We used the CLEF-IP 2011 query set, which contains 1,351 topics (English subset). Each topic is a patent application comprising several fields. We built all queries using the Description field. Prior to indexing and retrieval, a suffix stemmer (Porter 1997) and a stop word list were applied to all documents and queries. We also removed all numbers. Citation information was ignored. Relevance judgements were produced by CLEF campaign organizers. Judgements were extracted from published search reports.

5.2 Evaluation metrics

We used the following evaluation metrics in this experiment:

  • The precision computed after 10, 50 and 100 documents were retrieved (P@10, P@50 and P@100)

  • Normalized discounted cumulative gain (NDCG; Järvelin and Kekäläinen 2000)

  • The recall computed after 10, 50 and 100 documents were retrieved (R@10, R@50 and R@100)

  • Mean average precision (MAP).

Unless otherwise stated, results indicate average performance across all topics. Statistically-significant differences in performance were determined using a paired t test at a confidence level of 95 %.

5.3 Retrieval systems

All information retrieval functions in our experiment were handled by the Terrier open source platform (Ounis et al. 2006). We used the BM25 retrieval model as it achieved (slightly) better results during set-up.

5.4 Baseline systems

We used a number of baseline systems to evaluate our technique. The first seven baseline systems relate to the seven query representations described in Sect. 3.2 (i.e., we used each query representation in isolation as a performance baseline). We also used the phrase-based model described in Mahdabi et al. (2012), denoted LMIPCNP, and the query reduction method presented in Ganguly et al. (2011), denoted QR. LMIPCNP extends the LMIPC method, adding key phrases with similar semantics to the patent query. These phrases are extracted using the noun phrase patterns defined in Mahdabi et al. (2012). QR reduces a patent query by comparing segments of that query to top ranked documents using Language Models. The least similar segments are subsequently removed (Ganguly et al. 2011; Mahdabi et al. 2012). To measure the effectiveness of PRF in this context, we also carried out two retrieval runs using pseudo-relevance feedback (denoted ALLPRF and UFTPRF). We used the implementations provided by the Terrier platform for these two baselines (see further Robertson 1991).

5.5 Parameter settings

We used the training topics provided by the CLEF-IP 2011 organizers to empirically set all of the parameters used in this experiment, including those used by the baseline systems. First, we set the parameters for our own method. We conducted a number of runs with different values for x (i.e., the number of top-ranked documents used when generating ratings). As shown in Fig. 1, there were no significant changes in MAP scores between 15 and 50 documents. The optimal value was obtained when x = 10. This is a relatively low figure, but it is consistent with the search environment. In patent search, precision is typically much lower than in web search, often dropping off fairly quickly. Too many documents (e.g., x = 100) and the noise disrupts our technique. Too few (e.g., x = 5) and we miss relevant documents. The parameter μ was set to 0.99, consistent with prior work (Zhou et al. 2004, 2012).

Fig. 1 Varying the number of top-ranked documents (parameter x)

The parameters for the baseline systems were set as described in Mahdabi et al. (2012) and Ganguly et al. (2011). Parameters \(k_1\), \(w_b\) and \(k_3\) (part of the BM25 model) were set to 1.2, 0.75 and 7 respectively. The number of phrases (used in LMIPCNP) was set to 10. The number of pseudo-relevant documents (used in QR) was set to 20. We used the training set to fix the number of query terms for TF, TFIDF, BM25, LM and LMIPC. We tried every number in the range \(\{5, 10, 15, 20, \ldots 100\}\) [following a study placing the effective upper bound at 100 (Xue and Croft 2009)]. We found that 50 query terms was the most effective.

6 Results

In our first evaluation, we compared the performance of our technique with the baseline systems listed in Sect. 5.4. The results are shown in Table 1. Our multiple query technique, CPAC, performed extremely well. It achieved statistically significant improvements over the top performing baselines, including ALLQR and LMIPCNP. Notably, it scored a 19.32 % improvement in MAP over LM (Mahdabi et al. 2012). These results support our assertion that multiple query representations are more effective than single query representations in patent search.

Table 1 Precision of collaborative patent search and various baselines

As shown in Table 1, our refinement method (CPACRegu) recorded statistically significant improvements over CPAC. In terms of MAP, CPACRegu scored 4.18 % higher than CPAC, and improved on ALL (the highest scoring baseline) by 10.59 %. A similar trend emerged in terms of NDCG, where CPACRegu exceeded CPAC by 6.42 % and ALL by 10.48 %. The performance of the CPACRegu method measured by P@10, P@50 and P@100 was particularly strong, showing improvements over ALL of 20.33, 11.49 and 12.21 % respectively. These results support our earlier claim that CF-generated results and IR-generated results mutually reinforce each other.

Figure 2, which plots the precision-recall curves for the various systems, suggests that the gains achieved using our methods are consistent. Interestingly, the refined version of our technique outscores the baselines on almost all of the evaluation metrics, despite being specifically tuned for MAP. Comparing our system to the results published for CLEF-IP 2011, we note that CPAC is only fractionally lower than the best performing run, while CPACRegu outperforms it (Piroi et al. 2011). To summarise, the results described above indicate that our technique is well suited to patent prior-art search, and that it is capable of state-of-the-art performance.

Fig. 2 Precision-recall curves for the top-performing systems

6.1 Comparison with standard data fusion

In this section, by way of comparison, we examine an alternative approach to multiple query patent search that does not use CF-based analysis. In this approach, we create multiple query representations of each patent application as described above. Then we submit these queries to a standard IR engine, combining the search results using conventional data fusion algorithms (see Sect. 3). We wanted to know if we could outperform CPAC using this simpler technique.

The first task was to determine which type of data fusion algorithm to use (i.e., supervised or unsupervised). We evaluated 5 unsupervised methods and one supervised method (see Tables 2, 3) using a subset of the CLEF-IP 2011 English query set (675 topics). We calculated CombMNZ (an unsupervised method) by multiplying the sum of the scores for a document by the number of lists that contained that document. To apply the supervised technique (LAMBDAMERGE) we split the query set into two subsets (training and testing), selecting gating features appropriate to our query representations (Sheldon et al. 2011).
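For reference, a sketch of the two simplest unsupervised fusion methods as described above, where each result list is assumed to be a dict mapping document identifiers to scores:

```python
def comb_sum(result_lists):
    """CombSUM: sum a document's scores across all result lists."""
    fused = {}
    for scores in result_lists:                 # each: dict doc_id -> score
        for d, s in scores.items():
            fused[d] = fused.get(d, 0.0) + s
    return fused

def comb_mnz(result_lists):
    """CombMNZ: CombSUM multiplied by the number of lists containing the document."""
    sums = comb_sum(result_lists)
    hits = {d: sum(d in scores for scores in result_lists) for d in sums}
    return {d: sums[d] * hits[d] for d in sums}
```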

Table 2 A summary of six data fusion methods
Table 3 Recall of collaborative patent search and various baselines

Figure 3 compares the retrieval performance of the algorithms. CombSUM and CombMNZ were the lowest scoring techniques. Interestingly, the supervised technique, LAMBDAMERGE, was outperformed by two unsupervised methods (CombRSVNorm and CombRSV). This unexpected result may be due to the diverse query representations used in our study, which produced highly dissimilar result sets. This was a challenging environment for data fusion algorithms, one which clearly did not suit LAMBDAMERGE. Figures 4 and 5 report the results obtained when we compared our CF-based technique to the unsupervised data fusion algorithms (entire query set). CPAC and CPACRegu outperformed the top-scoring fusion algorithm, CombRSVNorm, by statistically significant margins. This finding confirms our suspicions: our CF-based technique produces results that we cannot replicate with simple data fusion.

Fig. 3 Comparison of supervised and unsupervised data fusion methods (using a subset of the CLEF-IP 2011 query set)

Fig. 4 Precision of collaborative patent search and unsupervised data fusion algorithms (entire query set)

6.2 Baseline systems

In this section, we evaluate the performance of the baseline systems used in our experiment. Overall, the best performing baseline was ALL (i.e., the entire pre-processed Description field). This finding is consistent with work published at CLEF 2010 and CLEF 2011. UFT also performed well, probably because it closely resembles ALL. Filtering out the unit frequency terms from the Description field leaves most of the original terms intact. A similar effect was observed in the QR run.

We found that TF produced the worst performance on the CLEF-IP test collections. This result conflicts with previous work showing positive results on the USPTO corpus (Xue and Croft 2009). These results are possibly due to the citation practices common to that corpus. As expected, the effect of pseudo-relevance feedback on the top performing baselines was negative (see Figs. 5, 6). The use of IPC information improved overall performance (i.e., LMIPC scored better than LM). Consistent with Becks et al. (2011), adding phrases (LMIPCNP) led to a modest (i.e., not statistically significant) improvement in retrieval effectiveness.

Fig. 5 Recall of collaborative patent search and unsupervised data fusion algorithms (entire query set)

Fig. 6 The effect of PRF on the top-performing baselines

6.3 CF algorithms

We studied the performance of the different CF algorithms. The results are shown in Fig. 7. The model-based CF algorithms (SVD and Weighted SlopeOne) produced marginally better results than the memory-based alternatives (User-based and Item-based). Memory-based CF algorithms tend to perform poorly when the rating matrix is sparse. Model-based algorithms are generally less sensitive. Nevertheless, all of the CF algorithms produced broadly equivalent results. This supports our assertion that collaborative patent search is less sensitive to the problem of matrix sparsity.

Fig. 7 The effect of changing the CF algorithm

6.4 Per-query analysis

We performed a per-query analysis comparing the results produced by ALL and CPACRegu. We found that 61 % of all queries (824 out of 1,351) benefited from our refined, multiple query representation technique. Only 18.7 % of queries (252 out of 1,351) were better off with a single query representation of the Description field.

6.5 Recall

In addition to the precision-based measurements described above, we evaluated our algorithms using recall-based metrics. Given the context, this is quite fitting: patent prior-art search is a recall-oriented task wherein the primary focus is to retrieve relevant documents at early ranks. We found that CPAC and CPACRegu achieved better recall than the baseline systems. These improvements were quite stable across all evaluation metrics (see Fig. 8; Table 3). The high recall performance of our technique has an intuitive explanation: relevant documents are being crowd-sourced (i.e., ‘found’ by other query representations).

Fig. 8 Recall of collaborative patent search and various baselines

7 Conclusion and further work

In this paper, we have described a pseudo-collaborative approach to patent IR which combines results lists from multiple query representations. We have also proposed an iterative method for refining its performance. In a multi-stage evaluation using CLEF-IP data, our experimental system delivered statistically significant improvements over state-of-the-art baseline systems. In future work, we intend to explore the differences between IP test collections. We also plan to evaluate the use of citation information alongside more selective query generation techniques. Further scrutiny of data fusion algorithms, and their application to our technique, is another obvious extension.