1 Introduction

With the rapid development of the Internet, a vast amount of information has become available on the Web. Search engines play an important role in how people search for information and gain knowledge in daily life. The goal of information retrieval is to return a ranked list of results that satisfies a user's information need for a given query. As the number of information sources continues to increase, people's search needs become more diverse and may no longer be satisfied by a single source.

Distributed Information Retrieval (DIR), also known as Federated Search, is the task of searching over multiple distributed collections (resources) [1]. By forwarding queries to appropriate content providers and merging the results from different collections, a DIR system can return more relevant and diverse results. DIR has three key phases: resource description, resource selection, and results merging. In the resource description phase, the DIR system uses as much information as possible to describe the content of each resource. Resource selection aims to select a subset of resources relevant to the query. Finally, in the results merging phase, the DIR system merges the retrieval results from the selected resources and returns a single ranked list to the user.

Resource selection plays a key role in distributed information retrieval. By forwarding queries to only a small number of relevant resources instead of all of them, a DIR system can significantly improve retrieval efficiency. Considerable effort has been devoted to resource selection in recent years; existing approaches can be divided into large-document methods, small-document methods, and supervised methods. Large-document methods treat each resource as one big bag of words and select resources according to their similarity with the query. Small-document methods use the sample ranking produced by a centralized sample index (CSI) to estimate the complete ranking of each resource, and rank resources based on the ranks or scores of the sampled documents. Supervised methods train a classifier or a ranking function for resource selection.

Different resource selection algorithms rank resources by different factors, which are mainly limited to term matching similarity or the scores of sampled documents. Combining multiple factors can often improve the performance of resource selection. Moreover, although learning to rank algorithms have been widely used to rank documents in web search, few of them have been applied to resource selection.

In this paper, we propose a learning to rank based resource selection algorithm named LTRRS. Using term matching features, CSI-based features, and topical relevance features, we train a LambdaMART model that combines these features and directly optimizes the NDCG metric of the resource ranking list. Experimental results show that LTRRS significantly outperforms classical algorithms.

2 Related Work

Resource selection methods can be broadly categorized into three types: large-document methods, small-document methods, and supervised methods. This section discusses each in turn.

2.1 Large-Document Methods

Large-document methods treat each resource as one big document containing a large bag of words and rank resources according to the similarity between this large document and the query. Similarity calculation and ranking methods from traditional document retrieval are adopted to rank resources, such as term frequency matching or the query likelihood method from language modeling. The CORI algorithm [2] adopts the inference network of INQUERY, combining factors such as the document frequency and collection frequency of the query terms and the collection size to calculate a query likelihood probability. Xu and Croft [3] used a document clustering method to build a language model for each resource; the similarity between the query and a resource is then measured by the KL divergence between the query language model and the resource language model.

In this kind of method, the documents in a resource are not distinguished; the resource is treated as a whole when calculating its similarity with the query. However, as the number of documents in a resource grows, it becomes difficult to identify relevant resources from this overall similarity alone.

2.2 Small-Document Methods

Small-document methods use a sample ranking to estimate the relevance of each resource. In this type of method, the DIR system first samples a small number of documents from every resource to form the central sample documents. A centralized sample index (CSI) is then built over these documents. Finally, the score of each resource is calculated from the scores or ranks of its documents in the CSI.

In the ReDDE algorithm [4], the system first retrieves a document result list from the CSI, and the ranks of the top n documents in this list are used to calculate the score of each resource. The ReDDE.top algorithm [4] improves on ReDDE by using the retrieval scores of the documents in the CSI result list, instead of their rank information, to estimate the relevance of each resource. Similarly, the CRCS algorithm [5] uses the ranks of the documents in the CSI to calculate resource scores.
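As an illustration, the core of rank-based ReDDE and score-based ReDDE.top can be sketched as follows; this is a minimal sketch, and the CSI result list, the document-to-resource mapping, and the resource/sample size estimates are hypothetical inputs.

```python
from collections import defaultdict

def redde_scores(csi_ranking, doc_to_resource, resource_size, sample_size, top_n=100):
    """Rank-based ReDDE: each of the top-n CSI documents votes for its source
    resource, weighted by how many collection documents one sampled document
    represents (|r_i| / |sample_i|)."""
    scores = defaultdict(float)
    for doc_id in csi_ranking[:top_n]:
        r = doc_to_resource[doc_id]
        scores[r] += resource_size[r] / sample_size[r]
    return dict(scores)

def redde_top_scores(csi_results, doc_to_resource, resource_size, sample_size, top_n=100):
    """ReDDE.top: use the CSI retrieval scores of the top-n documents
    instead of their ranks."""
    scores = defaultdict(float)
    for doc_id, score in csi_results[:top_n]:  # (doc_id, score) pairs
        r = doc_to_resource[doc_id]
        scores[r] += score * resource_size[r] / sample_size[r]
    return dict(scores)
```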

2.3 Supervised Methods

Supervised methods use machine learning to train models for resource selection. There are three main kinds: query classification methods, resource classification methods, and learning to rank methods.

Query classification methods use training data to learn a classifier over queries. Kang and Kim [6] combined query features and document features to classify queries.

Resource classification methods train a model that decides, for each resource, whether it should be selected. Arguello et al. [7] proposed such a classification method for resource selection. In their work, three categories of features are used to train logistic regression models: score features from traditional resource selection algorithms, topical category features of the query, and click rate related features.

Learning to rank methods train ranking models or ranking functions for resource selection. Xu and Li [8] proposed features for collection selection and used SVM and RankingSVM to learn ranking functions. Dai et al. [9] trained an SVMrank model to rank resources in selective search, combining query-independent features, term-based features, and CSI-based features.

Furthermore, some resource selection methods [10, 11] focus on improving the efficiency of resource selection itself, for example through load balancing strategies.

3 Framework

3.1 Definitions

Given a query \( q \) and a set of resources \( R = \{ r_1, r_2, \ldots, r_n \} \), where \( n \) is the total number of resources, let \( v(q, r_i) \) denote the feature vector extracted from the pair of query \( q \) and the ith resource \( r_i \). The goal of resource selection is to select the top \( k \) resources from \( R \). Learning to rank methods for resource selection aim to learn a ranking function \( F(v(q, r_i)) \) to rank the resources.
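Under this formulation, resource selection reduces to scoring every resource with the learned function and keeping the best \( k \); a minimal sketch, where the feature extractor and the ranking function are assumed to be given:

```python
def select_resources(query, resources, v, F, k):
    """Score each resource r_i with the learned ranking function F applied to
    its feature vector v(q, r_i), then return the top-k resources."""
    ranked = sorted(resources, key=lambda r: F(v(query, r)), reverse=True)
    return ranked[:k]
```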

3.2 Architecture

Figure 1 shows the architecture of the proposed LTRRS algorithm. The architecture is divided into two stages: an offline stage and an online stage. The offline stage covers the computation and preparation needed for the online stage, while the online stage ranks resources in real time. The offline stage is further divided into four modules: the preprocess module, the resource description module, query expansion, and the learning module. The rest of this section describes each part in detail.

Fig. 1. The architecture of LTRRS

3.3 Preprocess Module

The preprocess module is mainly responsible for data preparation. In an uncooperative environment, where the resources are typically independent search engines, we need to obtain sample documents from each search engine. In a cooperative environment, such as large-scale collection retrieval, we instead need to partition the collection into a number of resources.

3.4 Resource Description Module

The resource description module is responsible for using information from the collection to describe each resource. It consists of four parts: the LDA model part, the word2vec part, the CSI part, and the term statistics part. Latent Dirichlet Allocation (LDA) is a topic model [12], and word2vec [13] is a word embedding technique. An LDA model is trained on the central sample documents to obtain the topic distribution vector of each resource centroid. The word2vec part trains a word2vec model on the central sample documents for later use in query expansion. The CSI part builds an index over the central sample documents and retrieves from it to obtain per-document scores, while the term statistics part computes the lexical statistics of each resource offline.

Each part extracts features used to train the learning model that ranks resources. Details of the feature calculations are given in Sect. 4.
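A rough sketch of the offline computations in this module, assuming gensim 4.x (document loading and tokenization are elided; the paper trains word2vec on document titles, for which the tokenized sample documents stand in here):

```python
import numpy as np
from gensim import corpora
from gensim.models import LdaModel, Word2Vec

def describe_resources(sample_docs_by_resource, num_topics=100):
    """Train the LDA and word2vec parts on the central sample documents and
    derive a topic-distribution centroid for each resource."""
    all_docs = [d for docs in sample_docs_by_resource.values() for d in docs]
    dictionary = corpora.Dictionary(all_docs)
    bow = [dictionary.doc2bow(d) for d in all_docs]
    lda = LdaModel(bow, id2word=dictionary, num_topics=num_topics)

    # Resource centroid = mean topic distribution of its sampled documents.
    centroids = {}
    for r, docs in sample_docs_by_resource.items():
        mat = np.array([
            [p for _, p in lda.get_document_topics(dictionary.doc2bow(d),
                                                   minimum_probability=0.0)]
            for d in docs
        ])
        centroids[r] = mat.mean(axis=0)

    # word2vec model, later used for query expansion (dimension 100, Sect. 4.3).
    w2v = Word2Vec(all_docs, vector_size=100)
    return lda, centroids, w2v
```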

3.5 Learning Module

In the learning module, all features are combined to train a ranking model for resource selection. In this work, we use the LambdaMART model [14], a listwise learning to rank algorithm that optimizes a listwise loss. By optimizing the NDCG metric of the resource ranking list, we improve the effectiveness of resource selection.
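LTRRS uses the RankLib implementation of LambdaMART (see Sect. 6.3). Purely as an illustrative stand-in, LightGBM's lambdarank objective optimizes the same NDCG-driven listwise loss; a sketch, where the feature matrix, labels, and per-query group sizes are assumed to exist:

```python
import lightgbm as lgb

def train_resource_ranker(X, y, group):
    """Train a LambdaMART-style ranker. X has one feature row per
    (query, resource) pair, y holds graded resource relevance labels, and
    group gives the number of resources per query, in row order."""
    ranker = lgb.LGBMRanker(
        objective="lambdarank",  # listwise loss with NDCG-based lambdas
        metric="ndcg",
        ndcg_eval_at=[20],       # mirror LTRRS's NDCG@20 target
        learning_rate=0.05,      # learning rate used in the experiments
        n_estimators=300,
    )
    ranker.fit(X, y, group=group)
    return ranker                # ranker.predict(X_new) scores resources
```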

4 Multi-scale Features

Many factors influence the performance of resource selection. This section discusses the features used in this paper: term matching features, CSI-based features, and topical relevance features. The term matching and CSI-based features are adopted from Dai et al. [9], while the topical relevance features are proposed in this work.

4.1 Term Matching Features

The term matching features are obtained from term statistics of the resources and the query: the degree to which the query terms match the terms in a resource gives their similarity. In the resource description module of the offline stage, the DIR system calculates the term statistics of each resource. After this precomputation, the term matching features can be calculated efficiently in the online stage.

Query Likelihood Features:

These features calculate the likelihood that a resource's language model generates the query. Unigram and bigram term sequences are applied to the document title and body, yielding 4 features in total. The calculation is given in Eqs. (1)–(3).

$$ \log P(q \mid r_i) = \sum\nolimits_{w \in q} \log \left( \lambda P(w \mid r_i) + (1 - \lambda) P(w \mid G) \right) $$
(1)
$$ P(w \mid r_i) = \frac{1}{|r_i|} \sum\nolimits_{d_j \in r_i} \frac{TF(w, d_j)}{LEN(d_j)} $$
(2)
$$ P(w \mid G) = \frac{1}{|R|} \sum\nolimits_{r_i \in R} P(w \mid r_i) $$
(3)

where \( \log P(q \mid r_i) \) is the log likelihood of the ith resource \( r_i \) generating the query, and \( P(w \mid r_i) \) is the probability of \( r_i \) generating the term \( w \) of query \( q \). \( P(w \mid G) \) is the probability of all resources generating \( w \), used to smooth \( P(w \mid r_i) \) and avoid the zero-probability problem, and \( \lambda \) is a smoothing parameter, set to 0.8 in this work. \( TF(w, d_j) \) is the term frequency of \( w \) in document \( d_j \), \( LEN(d_j) \) is the length of \( d_j \), \( |r_i| \) is the number of documents in the resource, and \( |R| \) is the number of resources.
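A minimal sketch of Eqs. (1)–(3) over precomputed statistics; the probability tables are hypothetical placeholders built in the offline stage:

```python
import math

def p_w_given_r(w, docs):
    """Eq. (2): average relative frequency of w over the documents of r_i,
    where docs is a list of token lists."""
    return sum(d.count(w) / len(d) for d in docs) / len(docs)

def query_likelihood(query_terms, p_w_r, p_w_g, lam=0.8):
    """Eq. (1): log-likelihood of resource r_i generating the query.
    p_w_r[w] is P(w|r_i); p_w_g[w] is the collection-wide smoothing model
    P(w|G) of Eq. (3), which keeps the argument of the log above zero."""
    return sum(
        math.log(lam * p_w_r.get(w, 0.0) + (1 - lam) * p_w_g.get(w, 0.0))
        for w in query_terms
    )
```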

Query Term Statistics Features:

These features are created from statistics of the query terms within a resource, including the maximum and minimum resource term frequency and TF-IDF values of the query terms. As before, unigram and bigram term sequences are used for the document title and body, yielding 16 features in total. The details are as follows:

$$ tf_{max}(q, r_i) = \max_{w \in q} tf(w, r_i) $$
(4)
$$ tf_{min}(q, r_i) = \min_{w \in q} tf(w, r_i) $$
(5)
$$ tfidf_{max}(q, r_i) = \max_{w \in q} tf(w, r_i) \cdot idf(w, r_i) $$
(6)
$$ tfidf_{min}(q, r_i) = \min_{w \in q} tf(w, r_i) \cdot idf(w, r_i) $$
(7)
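Eqs. (4)–(7) reduce to max/min aggregations over the query terms; a minimal sketch assuming precomputed per-resource tf and idf lookup tables (hypothetical names):

```python
def term_stat_features(query_terms, tf, idf):
    """Eqs. (4)-(7): max/min term frequency and TF-IDF of the query terms
    within one resource; tf and idf map a term to its statistics."""
    tfs = [tf.get(w, 0) for w in query_terms]
    tfidfs = [tf.get(w, 0) * idf.get(w, 0.0) for w in query_terms]
    return {
        "tf_max": max(tfs), "tf_min": min(tfs),
        "tfidf_max": max(tfidfs), "tfidf_min": min(tfidfs),
    }
```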

4.2 CSI-Based Features

CSI-based features are calculated using the centralized sample index. In the offline stage, the DIR system builds the CSI from the central sample documents in the resource description module. In the online stage, it retrieves from the CSI and calculates the features in real time.

ReDDE Features:

We create ReDDE features using the ReDDE and ReDDE.top resource selection algorithms [4]; the ReDDE and ReDDE.top scores are used as two features. In addition, we use the inverse rank of each resource under its ReDDE.top score as a third feature, yielding 3 features in total. The inverse rank is defined as follows.

$$ inverseRank = \frac{1}{rank + k} $$
(8)

where \( rank \) is the rank position of the resource under its ReDDE.top score, and \( k \) is a parameter, set to 10 in this work.

Centroid Distance Features:

Centroid distance features capture the distance between each resource centroid and the top k documents retrieved from the CSI. The underlying assumption is that the closer a resource centroid is to the top-k documents retrieved from the CSI, the more relevant the resource is to the query. We therefore compute the KL divergence and the cosine similarity between the average topic vector of the top k documents and the resource centroid vector; as elsewhere in this work, we use the inverse of the KL divergence as the feature. Computing these with \( k = \{10, 50, 100\} \) yields 6 features in total:

$$ kl\_CentDist(q, r_i) = 1 / KL\left( mean(dt(docs_{topk})), cen_{r_i} \right) $$
(9)
$$ cos\_CentDist(q, r_i) = cosine\left( mean(dt(docs_{topk})), cen_{r_i} \right) $$
(10)

where \( dt(docs_{topk}) \) denotes the topic distributions of the top k documents retrieved from the CSI, and \( cen_{r_i} \) is the topic distribution of the ith resource \( r_i \).
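A sketch of Eqs. (9)–(10), assuming the topic distributions are numpy arrays; scipy.stats.entropy with two arguments computes the KL divergence, and a small epsilon guards against zero entries:

```python
import numpy as np
from scipy.stats import entropy
from scipy.spatial.distance import cosine

def centroid_distance_features(topk_doc_topics, centroid, eps=1e-12):
    """Eqs. (9)-(10): inverse KL divergence and cosine similarity between the
    mean topic distribution of the top-k CSI results (a (k, num_topics) array)
    and a resource centroid (a (num_topics,) array)."""
    mean_topics = topk_doc_topics.mean(axis=0)
    kl = entropy(mean_topics + eps, centroid + eps)  # KL(mean || centroid)
    return {
        "kl_CentDist": 1.0 / (kl + eps),
        "cos_CentDist": 1.0 - cosine(mean_topics, centroid),
    }
```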

4.3 Topical Relevance Features

Topic models are widely used to capture the abstract topics of documents in a collection; Latent Dirichlet Allocation (LDA) is one example [12]. Topical relevance features are based on the topical similarity between the query and the documents of a resource. We train an LDA model on the sample documents and apply k-means clustering to their topic distributions to obtain the centroid of each resource. The topical relevance features are then obtained by measuring the similarity between the topic distribution of each resource centroid and that of the query.

Since a query is usually too short to carry reliable topic information, we expand it using word2vec. We train a word2vec model on the titles of the central sample documents, with the embedding dimension set to 100 and both single words and phrases used in training. In the online stage, for each term \( w \) in a query, the trained word2vec model selects the top 20 words most similar to \( w \) as its expansion words. As before, both unigram and bigram sequences of the query terms are used to form \( w \). The expansion words of all query terms together constitute the query expansion words.

After query expansion, the trained LDA model infers a topic distribution for the query expansion words. The topical relevance features are the KL divergence and cosine similarity between the topic distributions of the resource centroid and the query; as before, we use the inverse of the KL divergence as the feature. Unigram and bigram query term sequences are handled separately, yielding 4 features in total. The inverse KL divergence and cosine similarity features are computed as follows:

$$ kl\_sim(q, r_i) = 1 / KL\left( dt(qexpand), cen_{r_i} \right) $$
(11)
$$ cos\_sim(q, r_i) = cosine\left( dt(qexpand), cen_{r_i} \right) $$
(12)

where \( qexpand \) denotes the query expansion words of query \( q \), \( dt(qexpand) \) denotes their topic distribution, and \( cen_{r_i} \) is the topic distribution of the ith resource \( r_i \).
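Combining the expansion and inference steps, Eqs. (11)–(12) can be sketched as follows, reusing word2vec and LDA models as trained in the resource description module (all names are illustrative):

```python
import numpy as np

def topical_relevance(query_terms, w2v, lda, dictionary, centroid,
                      topn=20, eps=1e-12):
    """Eqs. (11)-(12): expand the query with word2vec, infer its topic
    distribution with the trained LDA model, and compare it to the resource
    centroid via inverse KL divergence and cosine similarity."""
    # The top-20 most similar words per query term form the expansion set.
    qexpand = [
        sim_word
        for w in query_terms if w in w2v.wv
        for sim_word, _ in w2v.wv.most_similar(w, topn=topn)
    ]
    topics = lda.get_document_topics(dictionary.doc2bow(qexpand),
                                     minimum_probability=0.0)
    q_dist = np.array([p for _, p in topics])
    kl = np.sum((q_dist + eps) * np.log((q_dist + eps) / (centroid + eps)))
    cos = np.dot(q_dist, centroid) / (np.linalg.norm(q_dist) * np.linalg.norm(centroid))
    return {"kl_sim": 1.0 / (kl + eps), "cos_sim": cos}
```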

5 The Proposed Algorithm

The goal of resource selection is to select the \( k \) most relevant resources from all resources. LTRRS uses LambdaMART as its learning to rank model and optimizes the NDCG@20 metric of the resource ranking list during learning, so that the model learns to produce an optimal resource ranking list. Other evaluation metrics, such as Precision@k, could equally be used.

The proposed LTRRS algorithm is summarized in Algorithm 1 and is divided into an offline stage and an online stage. In the offline stage, we build the CSI from the central sample documents and use the retrieval results from the CSI to compute the CSI-based features. Meanwhile, we gather term statistics from each resource, including unigram and bigram term frequencies and document frequencies over the body and title of resource documents, and compute the term matching features from them. For the topical relevance features, an LDA model and a word2vec model are trained on the central sample documents, and the topical similarity between the query and each resource is computed. Finally, all features are combined to train a LambdaMART model that optimizes NDCG@20 on the resource ranking list. In the online stage, the information and models computed offline are used to score resources in real time.

Algorithm 1. The LTRRS algorithm

6 Experiments

6.1 Dataset

We use the Sogou-QCL dataset [15], which is sampled from the query logs of the commercial search engine Sogou. Each query-document pair carries five relevance labels derived from click models, and each document record contains the title, body, HTML page, frequency, and relevance labels. Sogou-QCL also includes a small dataset of 2,000 queries and about 50 thousand documents annotated by crowdsourcing; we use this small dataset to evaluate the effectiveness of resource selection in this work. Models were trained with 5-fold cross-validation. For a given query, the relevance score of each resource is the sum of the relevance scores of its documents.

We use the partitioning method of the preprocess module to construct the resources. First, we randomly sample 2% of the documents from the 7,736,480 query-document pairs in the Sogou-QCL dataset. We then train an LDA model to obtain topic distributions for the sampled documents and run the K-means clustering algorithm on these distributions to obtain 100 cluster centroids, which serve as the resource centroids. Finally, we partition all documents into 100 resources according to the distance of each document to the 100 centroids. The statistics are shown in Table 1.

Table 1. Statistics of resources
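A sketch of this partitioning step with scikit-learn; the LDA topic matrix and the sample mask are assumed to come from the preceding steps:

```python
import numpy as np
from sklearn.cluster import KMeans

def partition_collection(doc_topic_matrix, sample_mask, n_resources=100):
    """Cluster the topic distributions of the sampled documents into 100
    centroids, then assign every document to its nearest centroid so that
    each document belongs to exactly one resource."""
    km = KMeans(n_clusters=n_resources, random_state=0)
    km.fit(doc_topic_matrix[sample_mask])       # centroids from the 2% sample
    assignments = km.predict(doc_topic_matrix)  # partition all documents
    return assignments, km.cluster_centers_
```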

6.2 CSI Setup

The Indri search engine was used to index and retrieve from the CSI. For a given query, the top 200 documents retrieved from the CSI are used to calculate the score of each resource. A language model with Dirichlet smoothing is used as the retrieval model, with the smoothing parameter \( \mu \) set to its default value of 2500. Each query is constructed with the sequential dependence model (SDM): the similarity between the query and a document is the weighted sum of three components, with weights 0.5, 0.25, and 0.25 assigned to the unigram, ordered bigram, and unordered-window bigram parts respectively, and the unordered window size set to 8.
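For reference, an SDM query in the Indri query language can be assembled as below; this is a sketch, with #1 denoting Indri's ordered-window operator and #uw8 an unordered window of size 8:

```python
def sdm_query(terms, weights=(0.5, 0.25, 0.25), window=8):
    """Build an Indri sequential dependence model query from a term list:
    weighted unigrams, exact bigrams (#1), and unordered-window bigrams (#uwN)."""
    bigrams = list(zip(terms, terms[1:]))
    uni = " ".join(terms)
    if not bigrams:  # single-term query: unigram component only
        return f"#combine({uni})"
    ordered = " ".join(f"#1({a} {b})" for a, b in bigrams)
    unordered = " ".join(f"#uw{window}({a} {b})" for a, b in bigrams)
    return (f"#weight( {weights[0]} #combine({uni}) "
            f"{weights[1]} #combine({ordered}) "
            f"{weights[2]} #combine({unordered}) )")

# e.g. sdm_query(["distributed", "retrieval"]) returns the single-line query:
# '#weight( 0.5 #combine(distributed retrieval) 0.25 #combine(#1(distributed retrieval)) 0.25 #combine(#uw8(distributed retrieval)) )'
```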

6.3 Result Analysis

The baseline methods in the experiments are the ReDDE algorithm [4], the ReDDE.top algorithm [4], and a method that uses LDA [12] to calculate topical relevance as in this work, denoted LDA. We use the RankLib implementations of AdaRank and LambdaMART in LTRRS, with default parameters except that the learning rate is set to 0.05. The evaluation metrics are NDCG@k and P@k.

6.3.1 Performance Comparison

We compare LTRRS with the baselines in this section. As can be seen from Table 2, LTRRS significantly outperforms the baselines on all metrics. ReDDE and ReDDE.top perform similarly, while the LDA method performs worse. The results show that LTRRS has a clear advantage when selecting the top 5 to 20 resources. We also compared the AdaRank and LambdaMART ranking models; LTRRS based on LambdaMART achieves better results.

Table 2. Comparison between LTRRS and baselines

6.3.2 Feature Analysis

Building on the above results, we investigate the effect of different kinds of features on LTRRS. Table 3 compares the contributions of the three kinds of features: LTRRS_csi uses only the CSI-based features, LTRRS_term only the term matching features, LTRRS_topic only the topical relevance features, and LTRRS_all uses all features. As can be seen from Table 3, LTRRS_all performs better than any variant using a single type of features. This indicates that the three types of features cover different aspects of information and make complementary contributions to the LTRRS algorithm.

Table 3. Performance of LTRRS using different feature sets

Among the three kinds of features, LTRRS_term performs best, LTRRS_csi ranks second, and LTRRS_topic performs worst. This suggests that the term matching features make the largest contribution to the LTRRS algorithm, and that combining the query likelihood method from language modeling with other term matching signals remains very helpful for resource selection.

The three types of features have different computational costs. For the CSI-based features, the central sample index must be built in the offline stage, and in the online stage the DIR system still needs to retrieve results from the CSI before computing the features, so the cost is high in both stages. For the term matching features, the term statistics are computed offline, and in the online stage the system only needs to match the query against the per-resource statistics; the cost is therefore high offline but low online. Similarly, the topical relevance features are cheap online once the offline computation is done.

Overall, the term matching features have a low online computational cost and the best performance. This suggests that combining features derived from the term statistics of each resource can effectively improve the performance of a resource selection algorithm.

7 Conclusion

In this paper, we present a learning to rank based resource selection algorithm named LTRRS. Many factors affect the performance of resource selection, and learning to rank can effectively combine evidence from these various aspects. By combining term matching features, topical relevance features, and CSI-based features, LTRRS trains a listwise learning to rank model, LambdaMART, to optimize the resource ranking list. Experiments on the Sogou-QCL dataset show that LTRRS effectively combines all features and outperforms classical resource selection algorithms.