1 Introduction

Nowadays, it is widely acknowledged that information retrieval (IR) is not always a solitary activity but may be undertaken collaboratively by teams. Prominent examples exist, e.g., in work settings when colleagues write a jointly-authored report, in academic settings when students or scientists work together on a project, but also in everyday life when family members plan travel together (Morris and Horvitz 2007). These scenarios are characterized by participants who share the same information need and explicitly work together to satisfy that need.

Work in modern organizations is to a large extent performed collaboratively by multidisciplinary teams (Fidel et al. 2000) that aim to benefit from the often complementary skills, experiences and abilities of the individual team members (Cummings 2004). It is therefore not surprising that many empirical studies have identified collaborative information retrieval (CIR) as an everyday work pattern. These studies have been conducted in many fields of work, such as engineering (Poltrock et al. 2003), healthcare (Reddy and Spence 2008), legal practice (Attfield et al. 2010), and the intellectual property domain (Hansen and Järvelin 2005).

Searching in those professional settings has its own unique characteristics. In patent retrieval tasks, for example, it is important not to miss a relevant document, i.e., those tasks are recall-oriented. Therefore, patent engineers examine up to several hundred documents retrieved from the IR system (Joho et al. 2010). This is in contrast to, e.g., Web-search, where users typically examine only the top few items of a query result (Jansen et al. 2000).

Previous research in the field of collaborative IR (CIR) has developed and evaluated tools and systems for use at each stage of the information searching process: (1) query construction, (2) obtaining results for inspection, and (3) assessing results (Böhm et al. 2013). However, recent empirical studies showed that, despite the increasing availability of tools specifically designed to support CIR, these technologies are not used in professional practice (Morris 2013). Instead, search systems and interfaces designed for individual usage are repurposed. A fundamental issue of collaborative search is that the underlying IR models are not designed to be aware of collaboration. Thus, the user support implemented by these systems concentrates on the individual rather than on the team level. That is, users are provided with ranked lists of documents for inspection or query terms for expansion, with the top-ranked items being the strongest suggestions.

Optimum ranking for individuals has been extensively investigated in IR research. The Probability Ranking Principle (PRP) states that a retrieval system performs optimally, i.e., cost minimizing, if a list of documents is ranked according to decreasing probabilities of relevance (Robertson 1977). Recently, the PRP has been enhanced by considering IR as an interactive process and by relaxing the assumption of independence between documents (Fuhr 2008). Furthermore, approaches based on the Portfolio Theory (Zuccon et al. 2010; Wang and Zhu 2009) and Quantum Theory (Zuccon and Azzopardi 2010) aimed to increase novelty, diversity as well as to cope with interdependent document relevance. However, information searchers are still assumed to be individual actors. Little work has been done in developing a general ranking criterion for collaborative search sessions of multidisciplinary teams, i.e., estimating which document should be inspected by whom. We refer to such estimations as activity suggestions (provided by the system).

Whereas the PRP is justified by minimizing costs for an individual (Robertson 1977), our research hypothesis is that in collaborative search sessions, minimizing the costs for a team as a whole is more effective than minimizing the costs for each team member individually.

In this paper, we develop a formal, theoretically sound model that provides activity suggestions in collaborative search sessions. Our approach is justified by Bayesian decision theory and accounts for differences in knowledge and skills within the team and shifts documents among team members accordingly.

The rest of the paper is structured as follows: In Sect. 2, we review related work. Section 3 describes our probabilistic model and Sect. 4 presents an experimental evaluation. Section 5 analyzes the results of our experiments and Sect. 6 summarizes this paper and gives a brief outlook on future research.

2 Background and related work

The concept CIR is overloaded with several meanings. Most prominent examples are collaborative filtering systems, such as the ones provided by online Web shops (Linden et al. 2003). This is regarded as implicit collaboration, since although people may be generally aware that their results are based in part on data obtained from other users, they may not know who those users were or what information need they had. Thus, collaboration here exists because the search engine used historical data as a source of evidence for document relevance. Explicit collaboration is conceptually distinct by having two or more people who share the same information need and explicitly set out together to satisfy that need (Golovchinsky et al. 2009).

Corresponding CIR support systems provide an environment in which collaboration between team members is mediated at different layers, also referred to as the depth of mediation (Golovchinsky et al. 2009).

Communication In its simplest form, collaboration is mediated using generic communication and data exchange tools such as e-mail or instant messaging. Among others, the empirical studies of Twidale et al. (1997) and Morris and Horvitz (2007) observed collaborative activities among searchers who utilized systems although these systems were not designed for collaborative usage. Common to both studies is that multiple people were engaged in information retrieval activities and combined their efforts in pursuit of a common, or at least similar, information need. Searchers communicated about the search process and the search products, but neither the user interface nor the search engine (or digital library respectively) was aware that people intended to collaborate. In these settings, CIR was mediated by face-to-face communication and computer-based communication tools.

Front-end mediation Using front-end mediation, collaboration is supported via integrated functions in the user interface that focus on supporting explicit interaction between team members. There are many CIR support systems, such as Daffodil (Klas et al. 2008), SearchTogether (Morris and Horvitz 2007), Coagmento (Shah and Marchionini 2010) and CollabSearch (Yue et al. 2014), that allow for exchanging queries and search results through a shared user interface. Additional awareness information, e.g., provision of the query history and visitation history of team mates, as well as integrated communication channels help participants of collaborative sessions to better coordinate team activities themselves. Common to these approaches is that searchers collaborate at the user interface level, but the search engine itself does not support collaboration. Moreover, attention to others' queries and results is required, and searchers must manually reconcile their activities with those of their team mates.

Back-end mediation Using back-end mediation, each participant's activities are tracked and logged. Integrated algorithms in the search engine evaluate these activities and combine them algorithmically to produce retrieval effects that follow some defined strategy. The intention is to allow team members to work independently but still be influenced by their team mates by incorporating their activities into the result-sets (merging, splitting) and queries (term re-weighting).

Joho et al. (2009) as well as Foley and Smeaton (2010) explored the potential benefit of adopting several IR techniques, such as search result division and relevance feedback, to support division of labor and sharing of knowledge among team members. For example, the search engine implemented a division of labor strategy by excluding from a user's query response those documents that had already been inspected by a team member. Sharing of knowledge was realized by re-weighting query terms that occurred in documents judged as relevant by team mates.

Other approaches explored regrouping of search results based on user roles that were manually predefined. In Pickens et al. (2008), back-end mediation was implemented based on team members fulfilling asymmetric roles (Prospector and Miner) in a synchronous search session. The search engine performed a re-ranking of a result list based on the judgments of all participants. This re-ranking was based on the two measures relevance (the ratio of relevant to non-relevant documents in a response list) and freshness (the ratio of inspected to non-inspected documents in a result list). Documents not inspected by the Prospector were forwarded for examination by the Miner. Similarly, Shah et al. (2010) also investigated merging and splitting of query results among team members with asymmetric roles (Gatherer and Surveyor). However, these studies restricted user roles to predefined categories. To provide more flexible CIR support, Soulier et al. (2014) aimed at mining such user roles in a collaborative search session to leverage the diverse sets of knowledge present in the team. Common to all these examples is that the search system realized an information flow between the participants, so that they do not have to manually decide how to divide the IR task and which documents to inspect.

Motivation Previous attempts towards CIR support systems were conducted in experimental settings that either focused on algorithmic support of collaboration (back-end mediation) or improving human-human and human-computer interaction by facilitating communication, coordination and awareness mechanisms (front-end mediation). Little work has been carried out on the IR model level.

Our approach differs from those in terms of methodology. We take into account that recent research indicates that, in professional practice, collaborative users employ the tools of their daily work routines for collaboration (Böhm et al. 2013; Morris 2013). Thus, we aim to develop a formal IR model to support collaborative sessions. Missing awareness support of team members' activities calls for flexible and adjustable activity coordination among team members. Our probabilistic approach addresses this by making estimations not only about document relevance, but also about team members' information activities and the resulting document redundancy.

3 Modeling CIR support

Our approach is based on modeling a collaborative search task by describing the document sets, such as (electronic) sources, retrieved, viewed, inspected, and assessed documents, associated with team members over the course of search. In this way, the model refers to the system's representation of information and information needs, as is common in probabilistic models (Fuhr 2008).

Our model adopts the general collaboration framework of Baeza-Yates and Pino (1997), who described a collaborative task to be performed by a team T consisting of N members and divided into L stages. Team members \(\tau _i \in T\) perform iterative search sessions independently and the relevant search results of all team members are accumulated. We use the Information Dialog developed by Landwich et al. (2000) to describe an individual's information searching process, i.e., cycles of dialog stages. Different document sets and activities are associated with these stages (a minimal data-structure sketch follows the list):

  1.

    Activities of Access produce a result set of documents from a given source in response to a query. The elements of this set have been determined by the IR system based on an estimated probability of relevance, i.e., system relevance (Saracevic 2007). Let D be the set of documents contained in an information source. The result in response to a query \(q_i\) issued by team member \(\tau _i \in T\) is defined as the subset \(R(q_i) \subseteq D\).

  2.

    Activities of Orientation create result subsets which reach the field of vision of the user. For example, a user might decide to scroll through the result list until a certain rank or request another page of the result set. In both cases, the user captures the document’s representation (e.g., as Rich Snippet) cognitively. Finally, a subset of documents is selected by the user for inspection. That is, the user might request the abstract or full text of a document to eventually read it. Let \(R(q_i) \subseteq D\) be a query result. The set of documents viewed by team member \(\tau _i \in T\) is defined as the subset \(V(q_i) \subseteq R(q_i)\) and the set of documents inspected by team member \(\tau _i \in T\) is defined as \(I(q_i) \subseteq V(q_i)\).

  3.

    Activities of Assessment identify useful documents for the task at hand. For example, the inspection of a document allows the user to assess its affective relevance (Saracevic 2007). Let \(A(q_i) \subseteq I(q_i)\) be these documents.
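To make these set relations concrete, the following minimal Python sketch (illustrative only; the class and attribute names are ours and not part of the formal model) tracks the document sets of one team member for one query and checks the inclusion chain \(A(q_i) \subseteq I(q_i) \subseteq V(q_i) \subseteq R(q_i)\):

```python
from dataclasses import dataclass, field

@dataclass
class DialogState:
    """Document sets of one team member tau_i for one query q_i."""
    retrieved: set = field(default_factory=set)   # R(q_i): result set of the Access stage
    viewed: set = field(default_factory=set)      # V(q_i): documents seen during Orientation
    inspected: set = field(default_factory=set)   # I(q_i): documents opened for inspection
    assessed: set = field(default_factory=set)    # A(q_i): documents judged useful (Assessment)

    def is_consistent(self) -> bool:
        # The dialog stages imply A ⊆ I ⊆ V ⊆ R.
        return self.assessed <= self.inspected <= self.viewed <= self.retrieved
```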

Fig. 1 Schematic visualization of search process using Venn diagrams: a document sets associated with team member's activities, b projection of overlapping subsets

The main characteristic of a collaborative search task is that it involves multiple users aiming at collaboratively solving a shared information need (Golovchinsky et al. 2009). For the setting of interest in this paper, each team member is assumed to perform several cycles of the Information Dialog individually and independently. For constructing a schematic visualization of such search sessions, we use Venn diagrams of document sets, as proposed in Landwich et al. (2000). Figure 1a depicts an example of a general schematic visualization. We see the information activities of two collaborating users in temporal order. Each of them issued a query, viewed, inspected, and assessed results. When team members search to satisfy the same information need, they often use the same or very similar queries (Foley and Smeaton 2010), which is likely to result in very similar result sets returned by the IR system. Figure 1b depicts the projection of the query results associated with both team members. As can be seen in Fig. 1b, their document sets can overlap, which may lead to less coverage and less productivity due to redundant work.

In this paper, we introduce the notion of activity suggestions that allocate documents to team members for inspection and assessment. In our example, the union of result documents \(D^{(l)}\) at time point l, with size \(M=|D^{(l)}|\), represents the basis for activity suggestions. Optimizing the individual contributions calls for ranking the available documents with respect to the activities of the team and suggesting an appropriate subset \(D^{(l)}_{i} \subseteq D^{(l)}\) to each team member \(\tau _i \in T\).

3.1 Problem statement

A key motivation for collaboration is the provision of a diverse skill set by team members, whose backgrounds may come from different fields (Cummings 2004). Such asymmetry in user knowledge (Golovchinsky et al. 2009) includes a varying level of familiarity with a specific domain. It has been recognized that such varying domain knowledge can affect the construction of search requests, i.e., users with profound domain knowledge generally use a more specific vocabulary (White et al. 2009), and it can affect the quality of relevance assessments (Bailey et al. 2008). This becomes even more vital if an information need covers two or more domains, as is often the case, for example, in the intellectual property domain (Hansen and Järvelin 2005).

One way to support different levels of familiarity with domains is to acknowledge that experts of distinct domains, or experts and novices of the same domain (Golovchinsky et al. 2009), may (1) issue their own queries to address the shared information need and (2) judge the relevance or non-relevance of the same document \(d_j \in D^{(l)}\) differently.

Under this premise, a nuanced approach is required which aims at increasing the chances that team members will identify relevant documents (based on their specific expertise) that otherwise would be lost. In the example depicted in Fig. 1, the union of documents \(D^{(l)} = R(q_1) \cup R(q_2)\) at time point l represents the basis of our approach. Without any awareness information and with limited communication about team members' information activities, several team members might decide to inspect and assess the same document \(d_j \in D^{(l)}\). However, in the setting considered in this paper, information activities of team mates are not observable.

3.2 A cost model for CIR

Decision theory has often been used in IR research for coming up with solutions or criteria for various IR tasks. This covers, e.g., database selection in networked IR (Fuhr 1999), the justification of the PRP (Robertson 1977), and the development of the Probability Ranking Principle for Interactive IR (Fuhr 2008).

Decision theory is concerned with determining which decision, from a set of possible alternatives, is optimal. The decision is characterized by several alternatives and the consequences resulting from a choice are imperfectly known, i.e., the decisions are made in the face of uncertainty (North 1968).

Each decision will incur costs that are quantified by a loss function. For example, for the justification of the classical PRP, Robertson (1977) introduced a simple loss function that defined the costs associated with the decision as to whether or not to retrieve a document depending on its expected relevance with regard to a searcher's information need. Retrieving a relevant document incurred abstract costs which were assumed to be smaller than the costs incurred by retrieving a non-relevant one. This assumption allowed the cost optimality of the PRP to be proven (see also Fuhr 1992).

In our model, decisions are made about suggesting documents to members of a team who are assumed to inspect and assess them. For our attempt towards an initial cost model for CIR, however, we wanted to maintain the clarity of Robertson’s approach and adjust the corresponding cost model only slightly to make it applicable to CIR. Additionally, since we aimed at developing a general framework, we did not further differentiate the various interaction costs, since those depend on the actual interface design and features of the underlying IR system. We introduced the following adjustments:

Assessment costs Our cost model acknowledges that the costs of assessments may vary among members of a team due to the domain knowledge and experience with which they contribute to team work. For example, a domain expert may require less effort to assess a document that lies within his or her area of expertise than a domain novice. Generally speaking, the decision to suggest a document to a particular team member will incur some initial costs in terms of the effort required by the user, and this effort is specific to the user \(\tau _i\) and the particular document \(d_j\). We therefore denote these costs with \(C_{i,j}\).

Costs of redundancy Our model incorporates the costs produced by found documents that are relevant with regard to the shared information need but redundant for the team. With the aim of minimizing abstract costs of the team, we shall consider the wasted efforts of users who assess documents that have been examined by a team mate already.

We will use the term benefit to refer to negative costs. The assessment of a relevant and non-redundant document will provide the team with the benefit B in return. However, without any awareness information about team members' information activities, several team members might decide to examine this document. Let the constant \({\bar{B}}\) denote the benefit accrued from a relevant but redundant document.

Our probabilistic model for CIR is characterized by introducing a second probabilistic parameter \(\delta _{\lnot {i},j}\), which is an estimate about the information activities of other team members. Besides the relevance relation between document and information need (expressed by the formalized query \(q_i\) issued by team member \(\tau _{i}\)), documents may relate to other team mates \(\tau _{\lnot {i}}\) according to \(\delta _{\lnot {i},j}\), that is, how likely it is that another team mate \(\tau _{\lnot {i}}\) discovered the corresponding document during the course of search. Moreover, let \(\rho _{i,j}\) be the probability of a document \(d_j\) being relevant for a specific user \(\tau _i\). We can summarize the expected costs that are incurred by suggesting a document to a team member by building the sum of the defined costs multiplied by the probabilities of these costs occurring:

$$EC(\tau _i, d_j) = C_{i,j} + \rho _{i,j}(1-\delta _{\lnot {i},j})B + \rho _{i,j}\delta _{\lnot {i},j}{\bar{B}}$$
(1)

3.2.1 Discussion of the cost model components

For our cost model, we introduced different benefits, B and \({\bar{B}}\), that reflect the non-redundancy and redundancy of relevant documents. Please note that the ratio of these two constants may depend on the specific collaboration scenario. In this paper, we aim at minimizing redundant work in professional search tasks and, thus, we assume \(B/{\bar{B}}<1\), i.e., the costs incurred by a non-redundant document are lower than those incurred by a redundant one. However, in other collaboration scenarios, the aim might be to validate team members' assessments to increase confidence in the search outcome and, thus, one would assume \(B/{\bar{B}}>1\). This might be conceivable, e.g., in higher education when domain novices, such as undergraduate students, collaborate in a project to explore an unfamiliar field. However, in this paper, we make the former assumption about the ratio of the benefit constants.

Equation 1 represents a cost function that depends on two parameters, i.e., a specific team member and a particular document. Also, the specified team member \(\tau _i\) implicitly defines \(\tau _{\lnot {i}}\). The pairing of team member and document parametrizes the probabilistic parameters but also the assessment costs \(C_{i,j}\). These costs cover document properties, such as the length, but may also reflect the relation between the user's domain knowledge and the document's subject. For example, Soulier et al. (2013) proposed to estimate this relation using the cosine similarity between a user's knowledge profile and the subject covered by the document, where both document and knowledge profile were represented using weighted topic vectors generated with the LDA algorithm. However, we acknowledge that user- and document-specific assessment costs are still a simplification. For example, in an empirical study conducted by Villa and Halvey (2013), it was found that document properties, such as the length, as well as the degree of relevance influence the effort of relevance assessment. For example, it was reported that relevant documents require more effort to assess than highly relevant documents. However, our cost model does not consider different degrees of relevance and the resulting efforts, so this extension is left for future work and model refinement.

So, Eq. 1 represents a rather simple cost model but it does capture the main elements of interest and is similar to the cost model originally proposed for the justification of the PRP (Robertson 1977). Equation 1 suggests that, in order to minimize the expected costs, a system should allocate documents that satisfy the information need and at the same time have not been discovered by another team member yet. However, please note that in this paper, we do not address the issue of estimating the probability distribution \(\delta _{\lnot {i},j}\) as this is a subject of investigation on its own and will be left for future work.
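For concreteness, a minimal sketch of Eq. 1 in Python follows (the numeric values are hypothetical; benefits are encoded as negative costs, as defined in Sect. 3.2):

```python
def expected_cost(c_ij, rho_ij, delta_ij, benefit=-10.0, benefit_redundant=-2.0):
    """Expected cost of suggesting document d_j to team member tau_i (Eq. 1).

    c_ij              -- assessment costs C_{i,j}
    rho_ij            -- probability that d_j is relevant for tau_i
    delta_ij          -- probability that a team mate already discovered d_j
    benefit           -- B: benefit (negative cost) of a relevant, non-redundant document
    benefit_redundant -- B-bar: benefit of a relevant but redundant document
    """
    return c_ij + rho_ij * (1.0 - delta_ij) * benefit + rho_ij * delta_ij * benefit_redundant

# A likely relevant document that a team mate has probably not seen yet
# yields a low (i.e., favourable) expected cost:
print(expected_cost(c_ij=1.0, rho_ij=0.8, delta_ij=0.1))   # 1.0 - 7.2 - 0.16 = -6.36
```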

3.3 Activity suggestions

Based on the formal cost model for collaborative search sessions, in this section, we derive a formal criterion that describes optimum collaboration strategies in IR, i.e., estimates about which document should be inspected by whom. In order to simplify the following derivation, we introduce an approximation of \(C_{i,j}\). That is, we consider the costs of document assessment to be specific to a team member but uniform across documents, i.e., \(C_{i,j} \approx C_i\). This may roughly apply if all documents belong to the same subject, e.g., if they were retrieved from a thematically specialized digital library, and have a comparable length.

Additionally, we consider that in practice, the notion of a budget limits the amount of time to search for information (Tait 2014), i.e., assessment costs are important given budgetary constraints. Let the product \(N \cdot F\) represent an abstract budget, where N is the number of team members and F is, for example, the time frame provided to the team to accomplish the search task. Each human assessor may examine a different number of documents within this time frame; let \(K_i\) be this number. Hence, \(C_i = F/K_i\) represents the (average) assessment costs per document of team member \(\tau _i\).
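As a purely hypothetical numerical illustration of this budget (the figures are ours and not taken from any study):

$$F = 120\ \text{min}, \quad K_i = 60\ \text{documents} \;\Rightarrow\; C_i = F/K_i = 2\ \text{min per document}.$$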

This paper introduces the notion of activity suggestions that allocate specific subsets of documents \(D_i^{(l)} \subseteq D^{(l)}\) to each team member \(\tau _i \in T\) for inspection and assessment. An IR system has a set of options to choose from, which can be described by \(T \times {\mathcal {P}}(D^{(l)})\). That is, the IR system may allocate some subset of \(D^{(l)}\) (an element of \({\mathcal {P}}(D^{(l)})\)) to any team member \(\tau _i \in T\). Optimizing the individual contributions calls for allocating the available documents with respect to the skills and experiences of the team members and suggesting an appropriate set of documents \(D_i^{(l)} \in {\mathcal {P}}(D^{(l)})\) to team member \(\tau _i \in T\).

This is described by a simple mapping (which we call the suggestion mapping) \(s : T \times D^{(l)} \rightarrow \{0,1\}\). If a document \(d_j\) is suggested to a team member \(\tau _i\), the tuple (i, j) is mapped to 1; in case of no suggestion it is mapped to 0. Moreover, we define \(M=|D^{(l)}|\). This allows for describing each subset \(D_i^{(l)}\) as follows:

$$D_i^{(l)} = \{ d_j \in D^{(l)} | s_{i,j}=1 \}$$
(2)

If we assume that the subsets \(D_1^{(l)}, \ldots , D_N^{(l)}\) are known, we can formulate the expected costs considering the whole team T using Eq. 1 as follows:

$$EC( D_1^{(l)}, \ldots , D_N^{(l)} ) = \sum _{i=1}^N \sum _{d_j \in D_i^{(l)}} EC( \tau _i , d_j )$$
(3)

However, for the purpose of involving the suggestion mapping in this equation, we furthermore introduce the following two constraints. In our scenario, each team member is assumed to have only a limited capacity \(K_i\) to accept and assess documents, which results from the budgetary constraint. Furthermore, an additional constraint ensures that \(\delta _{\lnot {i},j}\) is zero: the number of times a document is suggested to users is limited to one (to avoid redundant work). This allows describing the overall costs for a team involving the suggestion mapping:

$$\begin{aligned} EC( D_1^{(l)}, \ldots , D_N^{(l)} | s )&= \sum _{i=1}^N \sum _{j=1}^M s_{i,j} [ C_i + \rho _{i,j} B ] \nonumber \\ {\text {subject \, to }}&\sum _{i=1}^N s_{i,j} \le 1,\quad \forall j \nonumber \\ {\text {and }}&\sum _{j=1}^M s_{i,j} \le K_i,\quad \forall i \end{aligned}$$
(4)

3.3.1 Optimum suggestions

We now want to develop a criterion (or rule) that ensures that the costs resulting from the suggestion mapping, as given by Eq. 4, are minimized. Since the constraints introduced with the suggestion mapping restrict the number of documents each team member is provided with to \(K_i\), and because the term \(C_i \cdot K_i\) is uniform across all team members (\(F = C_i K_i\) holds), we can further simplify Eq. 4 to:

$$\begin{aligned} EC( D_1^{(l)}, \ldots , D_N^{(l)} | s )&= \sum _{i=1}^N \left[ K_i C_i + \sum _{j=1}^M s_{i,j} \rho _{i,j} B \right] \nonumber \\ &= N F + B \sum _{i=1}^N \sum _{j=1}^M s_{i,j} \rho _{i,j} \nonumber \\ {\text {subject \, to }}&\sum _{i=1}^N s_{i,j} \le 1,\quad \forall j \nonumber \\ {\text {and }}&\sum _{j=1}^M s_{i,j} \le K_i,\quad \forall i \end{aligned}$$
(5)

The suggestion mapping should minimize the expected costs for the whole team considering all suggestions. In Eq. 5, one can easily see that, apart from constant terms, the cost function depends only on the parameters \(s_{i,j}\) and \(\rho _{i,j}\), and that the term \(\rho _{i,j}B\) is strictly monotonically decreasing in \(\rho _{i,j} \in [0,1]\) (recall that the benefit B is a negative cost). Thus, minimizing the expected costs \(EC( D_1^{(l)}, \ldots , D_N^{(l)} | s )\) corresponds to the following maximization problem consisting of an objective function and constraints that together represent an integer linear program (ILP):

$$\begin{aligned} \max&\sum _{i=1}^N \sum _{j=1}^M s_{i,j} \rho _{i,j} \nonumber \\ {\text {subject \, to }}&\sum _{i=1}^N s_{i,j} \le 1,\quad \forall j \nonumber \\ {\text {and }}&\sum _{j=1}^M s_{i,j} \le K_i,\quad \forall i \end{aligned}$$
(6)

So, in conclusion, we can make the following statement which we denote with Optimum Criterion for Collaborative Search: In order to maximize the productivity of a collaborative search task, an IR system should allocate documents to team members according to Eq. 6.
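To illustrate that the criterion is directly computable, the following sketch formulates and solves Eq. 6 with the PuLP library for Python (an illustration only; our own simulator used lp_solve via the javailp wrapper, see Sect. 4.1). The relevance estimates and capacities are assumed to be given:

```python
import pulp

def optimum_suggestions(rho, capacities):
    """Solve the ILP of Eq. 6: maximize sum_{i,j} s_ij * rho_ij subject to
    each document being suggested at most once and each member i receiving
    at most K_i documents. Returns the suggested document indices per member."""
    N, M = len(rho), len(rho[0])
    prob = pulp.LpProblem("activity_suggestions", pulp.LpMaximize)
    s = [[pulp.LpVariable(f"s_{i}_{j}", cat="Binary") for j in range(M)] for i in range(N)]
    prob += pulp.lpSum(s[i][j] * rho[i][j] for i in range(N) for j in range(M))  # objective
    for j in range(M):                                   # each document at most once
        prob += pulp.lpSum(s[i][j] for i in range(N)) <= 1
    for i in range(N):                                   # capacity K_i per member
        prob += pulp.lpSum(s[i][j] for j in range(M)) <= capacities[i]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [[j for j in range(M) if pulp.value(s[i][j]) > 0.5] for i in range(N)]

# Hypothetical relevance estimates for N=2 members and M=4 documents:
rho = [[0.9, 0.2, 0.6, 0.4],
       [0.3, 0.8, 0.7, 0.1]]
print(optimum_suggestions(rho, capacities=[2, 2]))
# -> [[0, 3], [1, 2]]: every document goes to the member with the higher
#    relevance estimate, and no document is suggested twice.
```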

3.4 Model discussion

From the cost model introduced in Sect. 3.2, we derived the notion of Activity Suggestions that represent a formal criterion that can be used for determining optimum collaboration strategies of teams, e.g., result division among team members.

It is important to note that the derived criterion is declarative, i.e., it describes how an optimum result division is characterized, but it does not explicitly state how this optimum is reached or can be computed. This is in contrast to previous imperative approaches where a scoring function (Pickens et al. 2008) or algorithm (Soulier et al. 2013) was hypothesized to result in better retrieval performance for a team.

In this criterion (Eq. 6), the suggestion mapping \(s_{i,j}\) represents the unknowns of the ILP to be determined and \(\rho _{i,j}\) represents the user-specific relevance probabilities that need to be estimated beforehand. For estimating the parameter \(\rho _{i,j}\), IR research provides a large body of knowledge covering many approaches for computing query expansion terms or re-ranking of results, each with the aim of increasing the quality of search results towards resolving the actual information need. As an example, research in the field of search personalization, e.g., Bennett et al. (2012), provides useful approaches that are based on building a user profile from the search history and incorporating this profile into the ranking function. However, in this paper we assume that, besides an issued query, there is no additional information about the user available to the IR system. Hence, in the next section, we will estimate \(\rho _{i,j}\) using BM25 (Robertson et al. 2004). Moreover, the user-specific parameter \(K_i\) still needs to be estimated. A real IR system could estimate this based on the user's history, e.g., using the average number of documents the user has assessed per unit of time in the past. In this paper, however, we will use data gathered by empirical studies that provide typical average values for the number of documents examined per query by professional searchers (see Table 2 in Joho et al. 2010).
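As a sketch of how \(\rho_{i,j}\) could be obtained in this way (using the rank_bm25 Python package for the BM25 scoring; the min-max normalization that maps raw scores into [0, 1] is our own simplifying assumption, since BM25 scores are not probabilities):

```python
from rank_bm25 import BM25Okapi

def estimate_rho(queries, documents):
    """Estimate user-specific relevance probabilities rho[i][j] from the
    formalized query q_i of each team member over the shared result set.

    queries   -- list of token lists, one query per team member
    documents -- list of token lists, the documents in D^(l)
    """
    bm25 = BM25Okapi(documents)
    rho = []
    for query in queries:
        scores = bm25.get_scores(query)                 # raw BM25 score per document
        lo, hi = float(min(scores)), float(max(scores))
        span = (hi - lo) or 1.0                         # avoid division by zero
        rho.append([(float(s) - lo) / span for s in scores])
    return rho
```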

It is interesting to note that, although the cost function introduced in Sect. 3.2 depends on several parameters, the derived criterion has a rather simple structure and that the solution of the ILP can be computed easily using a numeric solver for ILPs.

A second interesting issue is that, if we assume a team size \(N = 1\), we can derive the PRP as a special case of our optimum criterion, as this leads to: \(\max \sum _{j=1}^M a_{j} \rho _{i,j}\) subject to \(\sum _{j=1}^M a_{j} \le K_i\), with \(a : D^{(l)} \rightarrow \{0,1\}\). This equation represents an alternative formulation of the PRP: for each rank, or for each given number of documents \(K_i\) requested by a user, the documents with maximum probability of relevance are allocated. Hence, our optimum criterion represents a generalization of the well-known PRP but also inherits the same limitations, such as the assumed independence of documents.

4 Experimental evaluation

Simulations of user’s interactions have been used extensively in IR research. This includes evaluation of CIR support systems with back-end mediated collaboration: For example, Pickens et al. (2008) showed how their algorithm could achieve an effective collaboration by way of simulation, Shah et al. (2010) demonstrated how search processes that were virtually combined could result in achieving results that are both relevant and diverse. Foley and Smeaton (2010) as well as Soulier et al. (2013) demonstrated the effectiveness of their models by simulating users searching together synchronously based on interaction logs of individual users from the TREC interactive track experiments.

In this paper, we also apply simulation as experimental methodology to explore different search result division strategies that could be employed by a collaborating team and to study the effects of different levels of domain knowledge with which team members contribute towards task completion.

4.1 Experimental setup

Data sets Experiments were conducted using two IR test-collections covering two domains of interest: the intellectual property domain and the medical domain. We used the OHSUMED test-collection (Hersh et al. 1994) to simulate a retrieval task requesting disease information from a medical literature corpus. The OHSUMED corpus is composed of 348,566 MEDLINE documents from 270 journals published between 1987 and 1991. This test-collection includes 106 topics. OHSUMED contains relevance assessments manually annotated using three relevance levels (definitely relevant, possibly relevant, and not relevant). We considered both definitely and possibly relevant documents as 'relevant'.

The task for the intellectual property domain is a patentability search, which aims to find patents that constitute prior art and may conflict with a new patent (Joho et al. 2010). As patent corpus, we used the CLEF-IP corpus (Roda et al. 2010), which consisted of 1,958,955 patent documents pertaining to 1,022,388 patents with publication dates between 1985 and 2000. In general, one patent (identified by a unique patent number) corresponds to several patent documents generated at different stages of the patent's life-cycle. We indexed the documents according to the CLEF-IP track guideline, that is, we combined the patent documents into a 'virtual' publication by taking each field from the latest publication and indexed this 'virtual' patent. Please note that we only considered patent documents with English texts for indexing and skipped documents where no English texts were available. The CLEF-IP test-collection also contains topic definitions and relevance assessments. Topics name the patent for which prior art is to be identified. The relevance assessments list the patents constituting the prior art. Notably, relevance is measured on the patent level, not on the patent-document level.

We decided to use these two test-collections since both are freely available, which allows for easier reproduction of our scientific results. Also, simulation of professional search in the medical domain as well as in the intellectual property domain has recently been conducted using the OHSUMED collection (Kim et al. 2011), although Kim et al. (2011) used a different patent collection for the patent retrieval task.

From the topics of the test-collections, we selected only those topics that provided enough relevant documents to be able to create at least three distinct topical clusters from them. This cut-off ensured that we had enough subtopics to evaluate different assessment behavior of simulated users (see also paragraph Relevance Assessments). This left 83 topics out of the OHSUMED collection and 231 out of the CLEF-IP collection.

Statistics about the considered topics are summarized in Table 1.

Table 1 Statistics about relevant documents of remaining topics

Measures and tools We used overall recall of the team (also called group recall in Baeza-Yates and Pino 1997) as the measure of the retrieval performance throughout the experiments because our research interest in this paper covered recall-oriented tasks.
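A minimal sketch of this group recall measure (our set-based reading of the definition: the union of the relevant documents found by all team members, related to the total number of relevant documents):

```python
def group_recall(found_per_member, relevant):
    """Fraction of all relevant documents found by the team as a whole.

    found_per_member -- iterable of sets of document ids, one per team member
    relevant         -- set of document ids judged relevant for the topic
    """
    found_by_team = set().union(*found_per_member)
    return len(found_by_team & set(relevant)) / len(relevant)
```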

Our simulator is a Java-based tool that uses Apache Lucene 4.9 as search engine. As numeric solver for integer linear programs, we used lp_solve with the Java wrapper javailp. For document clustering, we used carrot2. Documents have been indexed using Porter's stemming and a standard stop-word list for English text.

Collaboration In the CIR domain, there are neither official test-collections nor baselines to be used for the evaluation and comparison of techniques. Therefore, to obtain results that allow for comparison with prior experiments in the field of CIR, we chose a simulation procedure used for evaluation in Shah et al. (2010) and similarly in Joho et al. (2009). It consisted of the following steps: (1) each simulated user issued a query. (2) The documents of all query responses were merged into a shared result-set using the CombSUM algorithm, which combines the scores of the users' queries (Shah et al. 2010). (3) From this shared result-set, each simulated user was provided with a result-page for assessment consisting of \(K_i\) documents. In our experiments, result-pages were extracted from the shared result-set either by applying one of the baseline procedures (see below) or by applying our optimum criterion (see Eq. 6), which we denote with ILP in the results section.
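The merging step (2) can be sketched as follows (assuming each query response is a mapping from document id to retrieval score; score normalization, which CombSUM variants sometimes apply, is omitted here):

```python
from collections import defaultdict

def combsum_merge(responses):
    """Merge per-user query responses into one shared result-set (CombSUM).

    responses -- iterable of dicts {doc_id: score}, one per issued query
    Returns document ids ranked by the summed score, highest first.
    """
    combined = defaultdict(float)
    for response in responses:
        for doc_id, score in response.items():
            combined[doc_id] += score
    return sorted(combined, key=combined.get, reverse=True)
```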

Relevance assessments For our experiments, we chose a team size of two, as this has been empirically identified as a typical size for collaborative search teams (Morris and Horvitz 2007). We wanted to investigate to what extent different levels of familiarity with a domain affect the collaborative search process and its outcome, and how such diversity within the team can be leveraged by different search result division strategies.

We set up four teams of two, each characterized by the differences in the sets of documents that simulated users would assess as relevant or non-relevant, as illustrated in Fig. 2. Figure 2 schematically depicts four conditions of team members' positive assessment outcomes (i.e., relevant documents) using Venn diagrams. In Fig. 2a, we see that these assessment outcomes are disjoint, i.e., the team members are perfect complements. Conversely, Fig. 2b depicts the condition where team members are perfect substitutes, i.e., their assessment outcomes are identical. A condition that lies in between the aforementioned ones is depicted in Fig. 2c, where the document sets assessed as relevant by the team members overlap. Finally, Fig. 2d depicts the condition where the assessment outcomes of one team member are fully covered by those of the other one, but not vice versa.

Fig. 2 Schematic visualization of theoretical relevance assessment outcomes of two team members using Venn diagrams. a Disjoint assessments. b Equal assessments. c Overlapping assessments. d Covered assessments

We implemented these conditions by partitioning the documents annotated as relevant for a topic of a test-collection into clusters. In this way, we created several sub-topics per test-collection topic. We assigned these clusters to users to simulate familiarity with the corresponding sub-topics and to obtain the desired assessment behavior. This was done as follows (a minimal sketch of this partitioning is given below):

  1.

    Disjoint assessments We created two topical clusters using k-means and assigned each cluster to one simulated user. This simulated collaboration between two experts of different, non-related (i.e., non-intersecting) domains.

  2.

    Equal assessments All relevant documents of a topic have been assigned to both simulated users. This simulated collaboration between two (equally skilled) experts of the same domain.

  3.

    Overlapping assessments We created three topical clusters using k-means and assigned each of the first two clusters to one simulated user and the third cluster to both of the simulated users. In this way, the third cluster represented the overlap. This simulated collaboration between two experts of different, but somewhat related (i.e., intersecting) domains.

  4.

    Covered assessments We created two topical clusters using k-means, assigned the first cluster to both simulated users and the second cluster to only one of the simulated users. This simulated collaboration between an expert and a novice of the same domain.

Documents appearing in the selected result-page were counted as relevant documents found by a team member only if they were contained in the cluster of relevant documents assigned to the corresponding simulated user. This is different from previous simulation work, e.g., Joho et al. (2009) and Shah et al. (2010), where all relevant documents appearing in a result-page were counted as documents found by a team member.
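The partitioning of a topic's relevant documents into sub-topic clusters can be sketched as follows (using scikit-learn's TF-IDF vectorizer and k-means for illustration; our simulator used carrot2, so this is not the original implementation):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_relevant_documents(texts, n_clusters):
    """Partition the texts of a topic's relevant documents into sub-topic
    clusters; returns one list of document indices per cluster."""
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(vectors)
    return [[idx for idx, label in enumerate(labels) if label == c] for c in range(n_clusters)]

# Example: the 'overlapping assessments' condition uses three clusters; the
# first two are assigned to one simulated user each, the third to both.
```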

Table 2 summarizes statistics about the clusters of relevant documents assigned to simulated users.

Table 2 Statistics about the clusters of relevant documents assigned to simulated users for each assessment condition

Query construction Recent developments in formal models for simulating user querying behavior allow generating queries which achieve performance similar to that of actual user queries (Azzopardi et al. 2010; Azzopardi 2009). We used the query generation process examined by Azzopardi (2009), who modeled a user who selects terms from an imagined ideal relevant document: \(P(t|query) = (1-\lambda ) P(t|topic) + \lambda P(t)\). Sometimes, chosen terms will be on topic, P(t|topic), while other times terms will be off topic, P(t). The distribution P(t|topic) describes the occurrences of terms in the ideal relevant document and relates to the user's background knowledge. For estimating P(t|topic), we used the strategy called Frequent (Azzopardi 2009), which assumes that users are likely to select terms that stand out in some way, so that more frequent terms are more likely to be used as query terms. In Azzopardi et al. (2010), it was shown that queries created using this strategy (called Popular in Azzopardi et al. (2010)) were similar to real users' queries and also delivered performance that was most like that obtained from real queries. Because we needed to model two different users, we decided to vary the \(\lambda\) parameter slightly, i.e., \(\lambda \in \{0.1, 0.3\}\), which also reflects the amount of noise observed in real queries (Azzopardi et al. 2010).

For each topic of the test-collection, this allowed us to generate weighted term vectors w(d, t) for each of the simulated users, based on the cluster of relevant documents assigned. We ranked the terms by w(d, t) from highest to lowest and used the first \(k=10\) terms as the query (see the paragraph below for the justification of the choice \(k=10\)).
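A sketch of this query construction follows (a deterministic simplification: terms of the assigned relevant-document cluster are weighted by the mixture \((1-\lambda) P(t|topic) + \lambda P(t)\) and the top-k terms are taken; the helper below is ours, not the simulator's exact implementation):

```python
from collections import Counter

def generate_query(topic_terms, collection_terms, lam=0.1, k=10):
    """Build a simulated query from the cluster of relevant documents.

    topic_terms      -- tokens of the relevant-document cluster assigned to the user
    collection_terms -- tokens of the whole collection (background model P(t))
    """
    topic_counts, background_counts = Counter(topic_terms), Counter(collection_terms)
    n_topic, n_background = sum(topic_counts.values()), sum(background_counts.values())
    weights = {t: (1 - lam) * topic_counts[t] / n_topic
                  + lam * background_counts[t] / n_background
               for t in topic_counts}        # candidate terms come from the topic model
    return sorted(weights, key=weights.get, reverse=True)[:k]
```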

Query length and result-page size There is ample empirical evidence which suggests that professional search tasks, such as a patentability search, differ from standard retrieval tasks, such as Web-search, in many ways and have their own unique characteristics. For example, in contrast to Web-search, where users typically inspect only the top few search results (Jansen et al. 2000), professional searchers will carefully examine up to several hundred documents retrieved from the IR system (Joho et al. 2010). Moreover, queries generated for a patentability search task comprise up to hundreds of terms, as opposed to Web-search, where queries are very short and typically consist of only three terms (Arampatzis and Kamps 2008). The experiments conducted for this paper aimed to reflect both of these characteristics. We considered the following assessment capacities \(K_i\), which represent average values resulting from different professional search tasks as presented in Table 2 of Joho et al. (2010): \(K_i \in \{10, 50, 75, 100, 150, 200\}\).

Query generation for professional search has been the topic of many investigations (Xue and Croft 2009; Jochim et al. 2010; Becks et al. 2011). For example, Xue and Croft (2009) reported that query terms extracted from the abstract of patent documents resulted in the best performing queries. However, generally, queries generated in professional search tasks are relatively long. For example, Jochim et al. (2010) reported that they used queries generated from patent document abstracts and that the resulting queries consisted of (on average) 20 or 46.3 terms, respectively, depending on the generation procedure. Also, Kim et al. reported that they extracted between 10 and 20 terms from several patent features and combined them into one query. Considering that longer queries result in better retrieval effectiveness (Becks et al. 2011), setting \(k=10\) seemed to represent a reasonable lower bound for query length.

Baselines We used the following two baselines for our experiments. (1) To extract a result-page from the shared result-set, for each user, we re-ranked the whole shared result-set according to the user's formalized query (estimated using BM25) and applied a cut-off after \(K_i\) documents. This approach simulated the case of team members employing search tools designed for individual usage, i.e., results are optimized towards an individual. We called this baseline PRP. The obvious disadvantage is that the result-pages created are likely to overlap. (2) To avoid this overlap, we also used a baseline employed in Shah et al. (2010) and Joho et al. (2009): the shared result-set is split using a Round Robin procedure, i.e., one user gets documents 1, 3, 5, etc., and the other user documents 2, 4, 6, etc. Both distinct halves of the shared result-set are then re-ranked towards the team member's formalized query and cut off after \(K_i\) documents. We called this baseline RR. This baseline both avoided redundancy and optimized search results towards an individual. Please note that prior experiments also employed a k-means clustering baseline, but the Round Robin procedure has been reported as the stronger baseline (Joho et al. 2009).
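The Round Robin baseline can be sketched as follows (a minimal illustration; the per-user re-ranking is represented only by a scoring callback, e.g., BM25 against the user's formalized query):

```python
def round_robin_split(shared_results, score_for_user, capacities):
    """Baseline RR: alternate the documents of the shared result-set between
    two users, re-rank each half towards the user's query, and cut off.

    shared_results -- ranked list of document ids of the shared result-set
    score_for_user -- callable (user_index, doc_id) -> score
    capacities     -- [K_1, K_2]: result-page sizes per user
    """
    halves = [shared_results[0::2], shared_results[1::2]]   # documents 1,3,5,... and 2,4,6,...
    pages = []
    for i, half in enumerate(halves):
        ranked = sorted(half, key=lambda d: score_for_user(i, d), reverse=True)
        pages.append(ranked[:capacities[i]])
    return pages
```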

4.2 Results

Figures 3, 4, 5 and 6 present the results of our experiments for both test-collections used, OHSUMED and CLEF-IP. Diagrams in these figures depict curves resulting from our ILP approach, along with the two baselines introduced above. Each data point in the diagrams is an average of 83 samples and 231 samples, respectively (one per topic considered).

5 Analysis

This section provides a detailed analysis of the results gathered in Sect. 4. As can be seen in the diagrams, generally, the overall recall rates gathered using the CLEF-IP collection are lower than those gathered using the OHSUMED collection. This might be due to a sub-optimal query generation procedure (we used relatively short queries). However, results gathered using both test-collections indicate the same trends that we discuss in more detail below.

Fig. 3 Plot of overall recall as the number of assessed documents per user change: Disjoint assessments. a OHSUMED collection. b CLEF-IP collection

Figure 3a, b shows the retrieval performance for teams with disjoint relevance assessments. As can be seen in both diagrams, the ILP approach provided only modest increases in performance over baseline PRP. In fact, the performance of the two approaches ILP and PRP became similar at page-sizes larger than 100. In Fig. 3a, we can even see that the performance of baseline PRP was slightly better at page-sizes 150 and 200. Clearly, baseline RR resulted in the lowest retrieval performance, since it does not account for the complementing expertise of team members and, thus, prevents simulated users from assessing documents that only they can judge correctly. Results of baseline PRP show that in this condition, optimizing a result-page towards a particular expert is a sufficient strategy, since those result-pages are not likely to have an overlap and, hence, create no redundant work.

Fig. 4 Plot of overall recall as the number of assessed documents per user change: Equal assessments. a OHSUMED collection. b CLEF-IP collection

Figure 4a, b shows the retrieval performance of teams with equal relevance assessments. An expected result in this condition was the performance of the baseline RR, which benefits from the avoided overlap between the result-pages. In Fig. 4a, b, we can see that with this simple division of labor strategy, baseline RR performed (nearly) equivalently to the ILP approach. Due to the equal relevance assessments of team members, it makes no difference how to split and distribute a search-result, because each simulated user would identify all relevant documents appearing in a result-page. However, as can be seen in both Fig. 4a, b, the baseline PRP resulted in the lowest retrieval performance, which was caused by the overlap between the result-pages provided to the simulated users.

Fig. 5 Plot of overall recall as the number of assessed documents per user change: Overlapping assessments. a OHSUMED collection. b CLEF-IP collection

Finally, the results depicted in Fig. 5a, b as well as in Fig. 6a, b show that the ILP approach was successful at improving the performance over the baselines PRP and RR in both conditions: overlapping and covered assessments. However, in the case of the OHSUMED collection, Figs. 5a and 6a, this performance improvement decreased as the number of assessed documents per user grew.

In these two conditions, there were documents in the result-set that could be assessed as relevant by only one team member, and there were documents in the result-set that could be assessed as relevant by each member of the team. The differences in retrieval performance resulting from the two baselines and the ILP approach highlight the importance of allocating documents to the corresponding experts and, at the same time, avoiding redundancies between the created result-pages. Both are ensured by our ILP approach.

To test the statistical significance of the performance improvements of the ILP approach over the best performing baseline (i.e., PRP) in these two conditions, we performed a paired Student's t test. Tables 3 and 4 show the statistical data, which cover the averages of the recall values (avg.), along with the standard deviations (SD), the ratio between the recall values of ILP and PRP, and finally the t value.

Fig. 6 Plot of overall recall as the number of assessed documents per user change: Covered assessments. a OHSUMED collection. b CLEF-IP collection

Table 3 ILP approach compared against baseline PRP using OHSUMED collection for two selected conditions
Table 4 ILP approach compared against baseline PRP using CLEF-IP collection for two selected conditions

From these results, we can draw some overall conclusions: generally, in all four conditions, the ILP approach allowed a team to find more or just as many relevant documents as both baselines did. The results show that a simple division of labor strategy (baseline RR) or an optimization towards a particular expert (baseline PRP) can both be sufficient strategies, but only under special conditions, that is, for teams consisting of perfect substitutes or perfect complements, respectively. However, such conditions are rather artificial, and in real world settings one cannot expect teams to meet these characteristics.

For diverse teams, the ILP approach clearly outperformed both baselines, as it allocated documents to the corresponding experts and, at the same time, avoided redundancies between the created result-pages. Hence, our ILP approach allowed a team to benefit from the expertise brought in by the different participants, i.e., it leveraged the diversity within the team. However, the performance improvement of our approach decreased as the number of assessed documents per user grew. Our approach is most beneficial if team members tend to request relatively small result pages, as is the case, for example, in patentability search tasks, where the average page-size is 50 (see Table 2 in Joho et al. 2010).

Although the parameters of our simulation were well justified, real world user behavior is more complex and our simulation represents an idealized collaborative session. For example, we simulated synchronous collaboration, as was the focus of past CIR research (Pickens et al. 2008; Foley and Smeaton 2010). Also, like Shah et al. (2010), we only considered a single-iteration process. Simulating multi-iteration processes, however, would require further assumptions about user behavior, which we wanted to avoid, because we felt that a thorough examination of the influence of users' specific domain knowledge required keeping the number of variables small to be able to analyze the subject of interest as precisely as possible.

Thus, the results depict the optimum performance improvement that could be achieved using our approach. Whereas in practice such ideal conditions may not apply, the conducted experiments indicate that our optimum criterion provides the potential for more (retrieval-)effective collaboration in search sessions.

6 Summary and outlook

In professional practice, IR is often performed in collaboration by teams that utilize a broad set of tools and services that are not specifically designed for collaborative usage. Thus, professionals typically perform their search and collaboration activities loosely coupled and independently. The objective of the research presented in this paper was to develop and formalize a model of system-based CIR support for focused and (potentially) geographically distributed teams of professional searchers that aim at resolving a shared information need.

In this paper, we presented an approach complementing prior research in the area of CIR. Previous research approaches were either observational or experimental. That is, observational works were based on empirical field studies, such as the ones presented in Poltrock et al. (2003), Hansen and Järvelin (2005), Reddy and Spence (2008), and Attfield et al. (2010). Those works aimed at capturing CIR activities at the various stages of the search process and are helpful for describing how people interact and behave under various circumstances. Conversely, in experimental works, study participants were provided with a set of tools implementing various collaboration services and were asked to perform some predefined tasks (Morris and Horvitz 2007; Shah and Marchionini 2010; Pickens et al. 2008). Such works mainly aimed at assessing collaborative tools or setups.

Our approach differed from those in terms of methodology. We developed a formal, theoretically sound model for supporting a team during the collaborative performance of IR activities in the technical environments of today's professional practice. We developed a formal cost model from which we derived activity suggestions for collaborative users, that is, a general criterion that describes optimum collaboration strategies in IR as the solution of an integer linear program (ILP). We demonstrated the practicability of the developed formal criterion by means of search result division among team members in two professional search tasks. The influence of different domain knowledge and the resulting relevance assessments of team members was studied in four different conditions. The results yielded improvements of the potential retrieval effectiveness of recall-oriented search tasks. That is, generally, our ILP approach allowed a team to find more or as many relevant documents as the baselines did.

However, it is important to note that the contribution of this paper is not the provision of a novel search result division technique, even though this was the focus of the experimental part of this paper. The main idea of the developed optimum criterion was to provide a general, declarative model of CIR, that is, a model that describes formally how optimum collaboration is characterized. This was achieved by formulating CIR as an ILP, which is novel in the field of IR. However, there can be different strategies for approximating the solution of that ILP. To this end, the chosen baselines (taken from Joho et al. 2009; Shah et al. 2010) can be considered as rather rough approximations of the solution of the ILP. Moreover, algorithms other than the chosen baselines are conceivable which could provide more precise approximations of the solution. However, the employment of a numerical solver produces a very precise solution of the same problem and, hence, the corresponding experimental results outperformed the baselines. Moreover, previous approaches towards CIR support were concerned with the investigation of the influence of different search roles [e.g., Prospector and Miner (Pickens et al. 2008)] fulfilled by the team members. Generally, such role assignments are also covered by our model if the user-specific estimations of the relevance probabilities are adapted in accordance with the role definitions.

While the contributions summarized above are important ones, future research will need to explore how the achieved improvements of retrieval performance can be translated into real user benefit. However, our results allow for confirmation of our initial research hypothesis and, along with results obtained from prior experiments, support the view that IR can benefit from systems specifically designed to support collaboration in the back-end rather than making users repurpose systems designed for single-user usage.