1 Introduction

Accurate relevance assessments are vital when developing and evaluating information retrieval (IR) systems such as search engines. Since its inception in 1992, the Text REtrieval Conference (TREC) has played an important role in the IR community, creating reusable test collections and relevance assessments for a wide variety of IR tasks. These collections have been underpinned by robust relevance assessments, produced either by specialist TREC assessors or by the participating groups themselves.

However, this style of assessment has some notable limitations. Most importantly, judgement by TREC assessors is expensive in terms of time and resources, and does not scale well (Alonso and Mizzaro 2009). Furthermore, while engaging the participants to judge is free, the volume of judgements that can be produced is limited by the number of participants in the task in question.

On the other hand, crowdsourcing (Howe 2010) has been championed as a viable method for creating relevance assessments, and indeed, as an alternative to traditional TREC assessment (Alonso and Mizzaro 2009). The reputed advantages of crowdsourcing are fourfold: judging can be performed quickly, cheaply, at a larger scale and with redundancy to achieve sufficient quality (Alonso et al. 2008). However, crowdsourcing has also been the subject of much controversy as to its effectiveness, in particular with regard to the lower quality of work produced (Atwood 2010), the lack of motivation for workers due to below-market wages (Callison-Burch 2009) and susceptibility to malicious workers (Downs et al. 2010).

In TREC 2010, the Blog track examined real-time news story ranking within the blogosphere. In particular, the task incorporated two distinct sub-tasks, namely, to rank news stories by their newsworthiness on a day of interest, referred to as story ranking, and to produce a diverse ranking of blog posts relating to each of those news stories, referred to as blog post ranking. The story ranking sub-task represents the ranking of news stories for display on the home or category pages of major news websites, e.g. Reuters or the BBC. The blog post ranking sub-task represents a user’s subsequent search for blog posts related to a top news story.

In this paper, we aim to determine how feasible and effective crowdsourcing is at relevance assessment for a modern TREC task. In particular, we describe our experience when crowdsourcing relevance assessments for both sub-tasks of the TREC 2010 Blog track top news stories task. Notably, our study is the first example where crowdsourced relevance assessments have been used at TREC; indeed, the assessments detailed in this paper were used to rank participant systems in the TREC 2010 Blog track, without input from TREC assessors. Moreover, our work differs markedly from the previous TREC-related crowdsourcing studies by Alonso and Mizzaro (2009) and Alonso and Baeza-Yates (2011). In particular, we tackle the modern (2010) story and blog post ranking tasks, which use recent newswire and blog post corpora from 2008, rather than older newswire corpora (TREC disks 4–5) from 1991 to 1994. Moreover, we experiment with crowdsourcing at scale, generating full relevance assessments for the TREC Blog track totalling in excess of 30,000 topic-document judgements. Indeed, to our knowledge, the largest prior TREC-related crowdsourcing study was by Alonso and Baeza-Yates (2011), totalling only 1,950 topic-document judgements.

The contributions of this paper are fourfold: (1) we detail the first successful instance of crowdsourcing at TREC, (2) we describe and analyse the two different crowdsourcing approaches used to generate relevance assessments for the two aforementioned sub-tasks, identifying the design aspects that had the most impact, (3) we quantitatively assess both the crowdsourcing jobs themselves, as well as the judgements produced and (4) we conclude upon the suitability of crowdsourcing for TREC relevance assessment, and propose new best practices based upon the experience gained. Indeed, we believe that this paper, as an analysis of crowdsourcing for relevance assessment in a TREC environment, will be of particular use to future and previous participants of the Crowdsourcing track currently being run at TREC.

The structure of this paper is as follows. In Sect. 2, we describe relevance assessment at TREC, while in Sect. 3, we motivate crowdsourcing for relevance assessment and detail prior work in the field of crowdsourcing. Section 4 provides an overview of crowdsourcing marketplaces and Amazon’s Mechanical Turk in particular. In Sect. 5, we describe in more detail the TREC sub-tasks for which we generate relevance assessments, as well as formulate the crowdsourcing task for each. Section 6 describes our crowdsourcing design for the news story ranking task, while we evaluate the assessments produced for it in Sect. 7. Similarly, in Sect. 8, we discuss our crowdsourcing design for the blog post ranking task, while in Sect. 9, we empirically evaluate the assessments produced. In Sect. 10, we discuss our findings, while in Sect. 11 we provide concluding remarks and best practices.

2 TREC relevance assessment

Information Retrieval (IR) has a long history of improvement through experimentation. The Text REtrieval Conference (TREC) is a collection of IR workshops sponsored by the National Institute of Standards and Technology (NIST) and the Disruptive Technology Office of the US Department of Defence. Unlike standard IR conferences, however, TREC was designed to encourage evaluation within the IR community by providing the infrastructure necessary for large-scale evaluation (Voorhees et al. 2005). Each year, TREC runs a number of tracks, each of which represents a topic of research. A single track contains one or more tasks in which groups can participate, with each task providing a test collection suitable for its evaluation. At TREC, the test collections are created using a pooling technique (Sparck-Jones and van Rijsbergen 1975), whereby all the participating groups provide initial document rankings over the corpus and topics, known as runs. Only the top n documents from each of these runs are judged by human assessors; these judged documents are then merged to create the relevance judgements/assessments. The idea is that, by using runs from multiple diverse IR systems, the relevance judgements will not be biased toward any single system or algorithm.
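To make the pooling procedure concrete, the following minimal sketch forms a depth-n pool for a single topic; the run representation and the depth are illustrative assumptions rather than the exact TREC tooling.

# Minimal depth-n pooling sketch (illustrative; not the actual TREC implementation).
# Each run is an ordered list of document identifiers retrieved for one topic.
def build_pool(runs_for_topic, depth=100):
    """Union of the top-`depth` documents from every run for a single topic."""
    pool = set()
    for ranked_docs in runs_for_topic:
        pool.update(ranked_docs[:depth])
    return pool

# Example: three toy runs for one topic, pooled to depth 2.
runs = [["d1", "d2", "d3"], ["d2", "d4", "d1"], ["d5", "d1", "d2"]]
print(sorted(build_pool(runs, depth=2)))  # ['d1', 'd2', 'd4', 'd5']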

However, assessing large numbers of documents is time consuming and expensive. For example, if a document takes 30 seconds to assess (Voorhees et al. 2005), then judging each of the 19,381 pooled documents for the TREC 2011 Web track would take in excess of 161 man-hours, or around 23 working days for a single assessor (assuming a 7-hour working day). Indeed, assuming a minimum wage of $7.25 (US dollars) per hour, the cost of recreating the TREC Web track relevance assessments totals $1,170.88. TREC, sponsored by NIST, has traditionally paid a group of specialist assessors to judge documents for the participants (Voorhees et al. 2005). However, NIST has a limited amount of funds to support the tracks that it runs. Indeed, in rare cases, TREC tracks have been known to use the participants to judge documents when NIST could not supply sufficient funding (Macdonald et al. 2009). However, such an approach is limited, as the number of documents that can be judged is determined by the number of participants. Furthermore, the size of the document pools used to assess systems, in comparison to the size of the collections examined, i.e. the completeness of the produced relevance assessments, has been diminishing almost year-on-year (He et al. 2008). This violates the completeness assumption of TREC-style assessment (Voorhees et al. 2005) to an ever greater degree, increasing the probability of error during evaluation. For these reasons, it is important to find alternative, cheaper and faster methods to perform relevance assessment at a larger scale.
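For transparency, the arithmetic behind these figures can be written out explicitly (a back-of-the-envelope calculation under the stated assumptions of 30 seconds per document, a 7-hour working day and a $7.25 hourly wage):

\[ 19{,}381 \times 30\,\text{s} = 581{,}430\,\text{s} \approx 161.5\,\text{h}, \qquad \frac{161.5\,\text{h}}{7\,\text{h/day}} \approx 23\,\text{days}, \qquad 161.5\,\text{h} \times \$7.25/\text{h} \approx \$1{,}171. \]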

In this paper, we investigate crowdsourcing as a possible alternative to using traditional TREC assessors or the participants for relevance assessment. To the best of our knowledge, this is the first study detailing how relevance assessments were crowdsourced for a TREC track instead of using paid TREC assessors. Indeed, the relevance assessments produced using the techniques described in this paper were used to evaluate participant systems for the TREC 2010 Blog track. In the next section, we describe crowdsourcing in general, motivate why it is suitable for TREC-style relevance assessment and summarise related work in the field.

3 Crowdsourcing for relevance assessment

Crowdsourcing in general is the act of outsourcing tasks, traditionally performed by a specialist person or group, to a large undefined group of people or community (referred to as the “crowd”), through an open call (Howe 2010). There are many motivations for crowdsourcing tasks. For example, simple tasks can be completed at a relatively small cost, and often very quickly (Alonso et al. 2008). Moreover, by employing a crowd of ‘users’ to do assessments as opposed to a few ‘experts’, a wider range of talent can be accessed and expert bias avoided (Howe 2008). Indeed, in a TREC setting, it has been observed that relevance assessments by TREC assessors often differ from those produced by external assessors (Bailey et al. 2008). However, crowdsourcing has also been the subject of much controversy as to its effectiveness. In particular, the work produced is known to often be of low quality (Atwood 2010) and results can be rendered meaningless by random or malicious work (Downs et al. 2010).

Nevertheless, crowdsourcing remains an attractive alternative to using specialist assessors or engaging the participants when creating relevance assessments for TREC and other tasks/tracks. Indeed, Alonso et al. (2008) first suggested crowdsourcing as an alternative to specialist assessors when creating relevance assessments for a TREC-style ad-hoc test collection. They noted the potential for crowdsourcing to provide a cheap, fast, effective and flexible way to generate relevance assessments. However, they did not empirically show whether this is indeed the case for a real TREC task.

Later, Alonso and Mizzaro (2009) and Alonso and Baeza-Yates (2011) built upon this early work by crowdsourcing relevance assessments for the 1998 TREC-7 and 1999 TREC-8 ad-hoc tasks, respectively. In particular, Alonso and Mizzaro selected only a single TREC-7 topic (011), sampled a small subset of 29 documents pooled for that topic, and crowdsourced alternative relevance assessments for each of those 29 documents. Alonso and Baeza-Yates crowdsourced a larger set of 1,950 relevance assessments spanning an 11-topic subset of the TREC-8 ad-hoc task. In contrast to these small early studies, we crowdsource complete topic relevance assessments for two modern TREC tasks run during 2010, spanning in excess of 15,000 unique documents and 30,000 individual assessments. Moreover, we provide a detailed analysis of the design of our crowdsourced assessment task, and empirically evaluate the assessments produced.

Other research in the field of crowdsourcing for document assessment has focused on techniques to assure the quality of the work produced outside of a TREC setting. Snow et al. (2008) and Callison-Burch (2009) investigated the accuracy of crowdsourced labels generated for natural language processing tasks. They concluded that ‘expert’ levels of labelling quality can be achieved by having three or five workers complete each crowdsourcing job and taking the majority (most commonly selected) label. Notably, to produce our TREC relevance assessments, we also use a majority-based judging approach to improve quality.
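A minimal sketch of this majority aggregation is given below; the label values and the assumption of one label list per document are purely illustrative.

from collections import Counter

def majority_label(labels):
    """Return the most frequently assigned label; ties are broken arbitrarily."""
    return Counter(labels).most_common(1)[0][0]

# Example: three workers label one document.
print(majority_label(["relevant", "relevant", "not relevant"]))  # relevant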

Kittur et al. (2008) highlighted the need to validate the output produced by each worker, showing how label quality could be markedly improved by introducing questions with verifiable, quantitative answers, often referred to as ‘gold judgements’ or a ‘honey-pot’. The aim of these questions is to validate a worker’s output in an online manner. In this way, poorly performing workers can be detected and ejected from tasks early on in the evaluation, saving money and hopefully improving the quality of the final labels produced. Indeed, in this paper, we experiment with both a variant of gold judgement validation where we account for worker impact, as well as a novel manual validation technique based upon visual work summaries.

Notably, subsequent research by Ipeirotis (2011) has indicated that, within a Web page classification context, gold judgements are unnecessary when larger numbers of assessors work on each task. In particular, the results indicate that when 10 assessors work on each task and a majority result is taken, no appreciable gains in accuracy are observed when gold judgement validation is added. However, having 10 assessors work on each task adds a large monetary overhead. In this work, we use a combination of three workers per task and gold judgement validation to keep costs low.

In this paper, we aim to determine how feasible and effective crowdsourcing is at relevance assessment for a modern TREC task at scale. In particular, we crowdsourced relevance assessments for the two sub-tasks of the TREC 2010 Blog track top news stories identification task. Indeed, as noted earlier, our study is the first example where crowdsourced relevance assessments have been used at TREC. We detail the crowdsourcing strategies that we employed, empirically evaluate the assessments produced and conclude upon the overall suitability of crowdsourcing for these sub-tasks. In the next section, we describe the online marketplaces that facilitate crowdsourcing, including that which we use in our later experiments.

4 Crowdsourcing marketplaces and Amazon’s Mechanical Turk

Crowdsourcing is facilitated by a number of online marketplaces. These marketplaces allow individuals or organisations to post tasks to be crowdsourced, and provide a means for the ‘crowd’ to find and complete these tasks. There is a growing number of competing marketplaces upon which tasks can be posted. Currently, the largest marketplace is Amazon’s Mechanical Turk (MTurk).

On MTurk, the individual or organisation who has work to be performed is referred to as the requester. People who sign up to perform tasks on MTurk are known as workers. A single task is comprised of smaller units, called Human Intelligence Tasks, or HITs. A HIT represents a small sub-task to be completed by one or more workers; a HIT tackled by one worker is known as an assignment. Each HIT has a small payment associated with it and an allotted completion time. Workers can see an example of any HIT that they are considering, along with the amount they will be paid for completing it and the time allotted.

MTurk provides requesters with a simple web service API. This API facilitates the submission of HITs to MTurk, the approval of completed work, and access to the answers generated (Alonso et al. 2008). Notably, poor quality work need not be accepted. In particular, HITs can be rejected without payment and workers can be blocked from completing further tasks.
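For illustration, the sketch below shows what such requester-side operations look like using the present-day boto3 Python client; the API in use at the time differed, and the HIT parameters and question file are assumptions for the example rather than our actual settings.

import boto3

# Connect to the MTurk requester sandbox (use the production endpoint for real jobs);
# AWS credentials are read from the standard configuration.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# Publish a HIT; relevance_hit.xml is a hypothetical file holding a valid
# question document (e.g. a QuestionForm) describing the judging form.
hit = mturk.create_hit(
    Title="Judge the newsworthiness of 32 news stories",
    Description="Label each story as newsworthy or not for a given day and category",
    Reward="0.50",                      # US dollars, passed as a string
    MaxAssignments=3,                   # three redundant workers per HIT
    AssignmentDurationInSeconds=3600,
    LifetimeInSeconds=7 * 24 * 3600,
    Question=open("relevance_hit.xml").read(),
)

# Retrieve submitted assignments and approve (or reject) them after validation.
response = mturk.list_assignments_for_hit(HITId=hit["HIT"]["HITId"])
for assignment in response["Assignments"]:
    mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])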

An important feature of MTurk workers is that they are anonymous to the requester. However, requesters can specify filters regarding the nationality of workers and their prior acceptance rate on other HITs. It is noteworthy that the usefulness of a worker’s prior acceptance rate as an indicator of work quality/commitment is contested, as high acceptance rates can be generated artificially (Eickhoff and de Vries 2011). Indeed, in our later experiments, we do not rely upon worker acceptance rates to assure the quality of our relevance assessments.

Other crowdsourcing marketplaces either provide similar functionality to MTurk, e.g. CloudCrowd, or implement a wrapper around MTurk that provides enhanced services for a cost, e.g. CrowdFlower. Still other crowdsourcing platforms use known workers instead of anonymous ones, for example oDesk. In this work, we use the most popular marketplace, MTurk, as the alternatives are either more costly or lack the large worker-base needed for our large-scale crowdsourcing experiments. In the next section, we describe the two tasks that we crowdsource using Amazon’s Mechanical Turk.

5 TREC assessment and the top news stories identification task

The TREC Blog track top news stories identification task investigated the news dimension of the blogosphere. In particular, it comprises two distinct sub-tasks, namely news story ranking and blog post ranking. We crowdsourced relevance assessments for both of these sub-tasks. In this section, we describe both sub-tasks from the participant and evaluator standpoints, i.e. the task that the participants were to address and the evaluation task that is to be crowdsourced. In particular, Sect. 5.1 details the news story ranking sub-task, while Sect. 5.2 describes the blog post ranking sub-task.

5.1 News story ranking

The news story ranking sub-task addresses whether the blogosphere can be used to identify the most important news stories for a given day within each of five news categories, namely: US, World, Sport, Business/Financial and Science/Technology news (Macdonald et al. 2010). It can be considered as answering the question “what are the most important news stories today?”. This can be seen as a ranking task, where current news stories are ranked by their newsworthiness for placement on the homepage or category pages of a news website. Important news stories will receive prominent placement on the page, while lesser stories are displayed less prominently or not at all. For a set of 50 days of interest (topic days), the participants ranked news stories published on each day by their newsworthiness.

Participating systems submitted a number of runs. For each run, a system ranked news stories from the Thomson Reuters news corpus (TRC2) (Leidner 2010), each represented by an article headline and associated content, by their newsworthiness on each of the 50 topic days. Notably, in contrast to a typical TREC task, where a document is ranked based upon its content, for the story ranking sub-task, systems rank a news story by its newsworthiness based upon related discussion within the blogosphere, represented by the Blogs08 blog post corpus (Macdonald et al. 2010). The statistics of the TRC2 and Blogs08 corpora are shown in Table 1. Each run was evaluated based upon the number and placement of the newsworthy stories ranked. The runs from the participating systems were sampled using statMAP sampling (Aslam and Pavlu 2007), to a depth of 32 stories per day and category, resulting in 160 stories per day to be judged and 8,000 stories in total (50 topic days * 5 news categories * 32 news stories) (Macdonald et al. 2010).

Table 1 Statistics for the TRC2 news corpus and Blogs08 blog post corpus used during the TREC 2010 Blog track top news stories identification task

The evaluation task, i.e. that which is to be crowdsourced, is to judge each of the 8,000 news stories by their newsworthiness on the appropriate topic day. Specifically, for a given news story, a day of interest and a news category, the crowdsourced workers should judge that story as either newsworthy or not for that day and category from an editorial perspective, i.e. such that a participant system’s ranking based upon the blogosphere can be compared to that produced by a newspaper editor.

Within a ranking of news stories for a news category and day, a story can either belong to that news category or not, since it was up to the participating systems to classify stories into categories. Only stories that belong to the named category should be judged newsworthy. Hence, workers should judge each news story as belonging to one of three classes:

  • Newsworthy and correct category: The story is newsworthy for the topic day and news category.

  • Not newsworthy but correct category: The story is not particularly newsworthy for the topic day, but does match the news category.

  • Incorrect category: The story belongs to a different news category.

Notably, for the purposes of the final assessments produced, the ‘Not newsworthy but correct category’ and ‘Incorrect category’ classes are both considered non-newsworthy, and hence are merged to create binary judgements. News story ranking is an interesting evaluation problem for two reasons. Firstly, a news story’s newsworthiness is not wholly dependent upon the story itself, but also depends on the other stories published on the same day. As a result, the assessor needs to know about the other stories published on the same day. Secondly, unlike a traditional document relevance judging task, newsworthiness varies over time. We describe our HIT design, validation and setting for news story ranking in Sect. 6, while we empirically evaluate the assessments produced in Sect. 7.
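The collapse from the three worker-facing classes to the official binary labels is a direct mapping; a trivial sketch (the label strings are assumed for illustration):

# Collapse the three worker-facing classes into the official binary labels.
TO_BINARY = {
    "newsworthy_correct_category": "newsworthy",
    "not_newsworthy_correct_category": "not_newsworthy",
    "incorrect_category": "not_newsworthy",
}
print(TO_BINARY["incorrect_category"])  # not_newsworthy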

5.2 Blog post ranking

The TREC Blog track blog post ranking sub-task represents a user’s subsequent search for blog posts related to a top news story. It answers the question “find me current and diverse blog posts relating to news story X”. This is a ranking task where blog posts are ranked by their relevance to a news story, represented by a news article, for the day that the news story was published. This ranking is subsequently diversified based on a set of perspectives that each blog post might cover, e.g. Republican or Democrat viewpoints for political stories, the quality and/or depth of the blog posts, etc.

In particular, participating systems were provided with 68 news stories selected from the Thomson Reuters news corpus (TRC2) (Leidner 2010), each with a specified publication date. For each of the 68 news stories, every participating system produced three rankings of blog posts: one containing only posts published before the news story, one containing posts published up to a day after the story, and one containing posts published up to a week after the story. This represents a system ranking blog posts for a news story at different times, i.e. as the news story matures. In each case, blog posts were ranked from the Blogs08 blog post corpus (Macdonald et al. 2010). These three rankings together form a single run. A run was evaluated based upon the number, ranking and diversity of the relevant blog posts it contained. To create the set of blog posts to be judged by assessors, we pooled the top 20 blog posts from the preferred run submitted by each group. This resulted in a pool of 7,975 blog posts to be judged.

The crowdsourced evaluation task is to judge each of the pooled blog posts as relevant, possibly relevant or not relevant to a news story, and also to indicate the perspectives that describe each blog post. In particular, each of the 7,975 pooled blog posts should be judged as to its relevance to the corresponding news story, as shown below.

  • Relevant: Story is discussed.

  • Possibly relevant: Post could be discussing the story.

  • Not relevant: Story is not discussed.

To evaluate the range of perspectives that each blog post ranking covers, for the 68 stories, each blog post should also be assessed in terms of the following perspectives:

  • Factual account: The post simply describes the facts as they are.

  • Opinionated positive: The post expresses a viewpoint endorsing some aspect of the story.

  • Opinionated negative: The post criticises some aspect of the story.

  • Opinionated mixed: The post expresses both positive and negative opinions.

  • Short summary/quick bites: The post contains only a sentence or two about the story.

  • Live blog: The post was continually updated at the time about the story.

  • In-depth analysis: The post goes into significant detail about the story.

  • Aftermath: The post gives a round-up or retrospective account of the story.

  • Predictions: The post was written before the story and discusses what might happen.

Blog post ranking holds similarities to traditional TREC-style evaluation tasks, whereby Web pages (here, blog posts) are judged as relevant or not to an information need, in this case a news story. As such, the lessons learned from crowdsourcing this sub-task will be closely applicable to other TREC relevance assessment tasks. Indeed, a similar relevance assessment task is the subject of the TREC Crowdsourcing track. We describe our HIT design, validation strategy and setting for blog post ranking in Sect. 8, while we empirically evaluate the assessments produced in Sect. 9.

6 News story ranking assessment

In this section, we detail our HIT design for the news story ranking sub-task of the TREC Blog track top news stories identification task. Section 6.1 describes the design of our HIT, while our methodology for validating the work produced is detailed in Sect. 6.2. We provide details regarding the MTurk job setup in Sect. 6.3.

6.1 HIT design for story ranking

Recall that the crowdsourcing task is to judge each of the 8,000 news stories as newsworthy or not for one of the five news categories. As described previously, we use Amazon’s online marketplace, Mechanical Turk (MTurk), to perform our judging. In particular, each MTurk Human Intelligence Task (HIT) covers the 32 top stories sampled for a single day and news category. For these stories, we ask workers to judge each as either: (1) Important and of the correct category, (2) Not important but of the correct category, or (3) Of the wrong category. To inform this judgement, we present to the worker the news story, represented by an associated news article with both a headline and article content. To facilitate judging, a HIT must be designed that enables the viewing of individual news stories and the saving of the resulting judgement by each worker.

Figure 1 shows an instance of the HIT we designed. As can be seen, the current category and day of interest are shown at the top left. Down the left-hand side, we provide a listing of the news stories that are to be judged in this HIT. This summary is colour coded: a blue [?] indicates that the story is yet to be judged, a green [+] indicates that the story has been judged as newsworthy for the stated category, an orange [−] denotes that the story was judged as not important but part of the correct category, while a red [x] indicates that the story does not belong to this category. On the right-hand side, the current story to be judged is shown, including the headline and content of the article representing that story. A worker must judge each of the 32 stories in turn.

Fig. 1: A screenshot of the external judging interface shown to workers within the instructions

The above HIT has two aspects that differentiate it from the HITs typically seen on MTurk. Firstly, each HIT spans 32 story judgements rather than a single one, which is much larger than a typical MTurk HIT. The reasoning behind using such a large HIT is two-fold. Firstly, the relative nature of importance in this context requires that the worker hold some background knowledge of the other news stories of the day when judging. To this end, we asked workers to make two passes over the stories. During the first and longer pass, the worker judges each story based on the headline and content of that story and the stories judged before it, while on the second pass, the worker can revise their judgement for any story now that they have knowledge of more news stories from that day. It is of note that the interface does not force workers to make the second pass over the stories. However, we did observe delays between the last story being judged and the task being submitted, indicating that workers were reviewing their judgements. The second reason is one of best practice. In particular, when submitting large jobs with thousands of required judgements, we have previously shown that it is advantageous to retain workers over many judgements to maintain consistency in judging (McCreadie et al. 2010). By increasing the HIT size, we have each worker perform at least 32 judgements.

The second distinguishing feature of our HIT is that the interface that workers interact with is hosted externally. Typical MTurk HITs host questions using an HTML form. When a worker accepts a HIT, the form is loaded for the worker to complete. Once it has been completed, the form is submitted and the results are saved by Amazon. Requesters subsequently review the results by downloading them from the MTurk website. However, the top news stories identification task was run previously during TREC 2009. For that 2009 task, we had the participating groups judge news stories using a custom Web-based interface. Rather than design a new HTML form for MTurk from scratch, we instead integrate this existing interface with MTurk. We note that MTurk provides a related function in the form of an ExternalQuestion HIT. Such HITs redirect workers to an external HTML form that collects the judgements and then sends them back to MTurk. We did not use this function, as we wished to maintain the local assessment aggregation and output functions of the custom Web-based interface, rather than return the assessments to MTurk and thereby add an additional layer of processing.

In particular, instead of loading a typical form, we load the existing interface within an HTML iframe for each worker, along with the standard submit button. Indeed, this is similar to the approach of the startup CrowdFlower, which provides a wrapper service for creating jobs on MTurk and similar crowdsourcing marketplaces. Figure 2 illustrates this approach. Notably, this interface is hosted on our own servers and is completely external to MTurk. The worker interacts directly with our interface, and all judgements made are immediately stored on our local servers. In addition to the reduced cost that reusing an existing interface brings, we noted a key advantage that this approach has over using traditional HTML form HITs. In particular, we can better log the activity of workers and hence gain insights into how the workers are tackling each HIT. For example, we can count the number of workers who view but do not accept, or who view and accept but do not complete, each HIT. This is useful for identifying when tasks are difficult or unappealing and hence need improvement.

Fig. 2: Information flow when using our externally hosted interface

Furthermore, the use of an externally hosted interface holds another advantage. In particular, our previous experience with crowdsourcing indicates that there are bots exploiting common HTML form components, e.g. single-entry radio buttons/text boxes, to automatically attempt jobs on MTurk (McCreadie et al. 2010). The degree of user interaction that our external interface requires makes this less likely to be an issue, as a bot would have to iteratively make and submit 32 judgements before pressing the submit button.

6.2 Work validation

Following best practices in crowdsourcing, we had three individual workers perform each HIT (Snow et al. 2008). From these three judgements, we take the majority vote for each story to create the final newsworthiness assessment for that news story. Moreover, as noted earlier in Sect. 3, to assure the quality of the resulting judgements, requesters typically use a set of gold judgements or a ‘honey-pot’ to detect workers that perform poorly on the task (Kittur et al. 2008). However, to do so, the requester must create these gold judgements manually, at a not inconsiderable cost in terms of time and effort. We hypothesise that, for tasks spanning only hundreds of HITs, it may be possible to validate results just as quickly in a manual fashion. To test this, we used the colour-coded summaries of the stories and the judgements that each worker produced to manually validate whether they were doing an acceptable job. In particular, we qualitatively assessed each of the 750 HIT instances based on three criteria (the first two of which could also be screened automatically, as sketched after the list below):

  1. Are all 32 stories judged?

  2. Are the judgements similar across the three redundant workers?

  3. Are the stories marked as important sensible choices?
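As noted above, criteria (1) and (2) lend themselves to an automated pre-screen that flags suspect HITs for closer manual inspection; the sketch below assumes a simple data layout (one label list per redundant worker) and an illustrative agreement threshold, and leaves criterion (3) to the human validator.

def flag_hit(worker_labels, expected=32, min_agreement=0.5):
    """Return reasons to inspect a HIT more closely; an empty list means it looks fine.

    worker_labels: a list of three label lists, one per redundant worker.
    """
    reasons = []
    # Criterion (1): every worker judged all stories.
    if any(len(labels) != expected for labels in worker_labels):
        reasons.append("incomplete judgements")
    # Criterion (2): pairwise agreement between the three redundant workers.
    agreements = [
        sum(a == b for a, b in zip(worker_labels[i], worker_labels[j])) / expected
        for i, j in ((0, 1), (0, 2), (1, 2))
    ]
    if min(agreements) < min_agreement:
        reasons.append("low inter-worker agreement")
    return reasons

# Example: a bot-like third 'worker' who judged only one story is flagged.
w1 = ["important"] * 32
w2 = ["important"] * 32
w3 = ["important"]
print(flag_hit([w1, w2, w3]))  # ['incomplete judgements', 'low inter-worker agreement']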

Figure 3 illustrates how the judgements by three workers for a single HIT from the ‘world news’ category can be viewed as a colour-coded summary. As we can observe, in clear-cut cases, such as ‘NSE details S&P CNX Nifty Inde...’ shown in Fig. 3, which is clearly from the incorrect Business/Finance category, there is strong agreement between workers. In less clear cases, such as ‘Indian shares up as settlement...’, where the story could belong to either the World or Business/Finance categories, we observe some level of disagreement. However, in this case, it is clear that the workers were completing the HIT in ‘good faith’ and hence the HIT was approved and paid for. On the other hand, Fig. 4 shows a HIT that we believe was attempted by a bot, as only the first judgement was made before the HIT was submitted.

Fig. 3: Displayed summary of three workers’ judgements for a single HIT

Fig. 4: HIT possibly completed by a bot

Although this validation strategy appears to involve a considerable volume of work, it took no longer than 5 hours for one person to validate all 750 HIT instances, which we estimate is comparable to the time required to create a recommended gold-standard set of 5% of the full workload size. This speed is due to the fact that the colour coding of the judgements facilitates assessment of criteria (1) and (2) at ‘a glance’, while only a small proportion of judgements need be examined under criterion (3). Moreover, this approach is advantageous both because one does not have to spend judgements on validation, and because, by assessing manually, we can have greater confidence that the workers are judging correctly. Indeed, overall, the assessed work was of good quality, with less than 5% of HITs rejected. On the other hand, such an assessment approach may not be scalable when the number of assessments required numbers in the hundreds of thousands or greater. However, although not investigated in this work, it is possible that validation of this form could itself be crowdsourced at scale.

In general, the result of using a summary-based interface is that the load on the validator is markedly reduced. This is achieved because the overall aim of the validator is not to assess documents but to assess workers; hence, through block assessment on a per-HIT basis, the validation workload can be reduced. Moreover, one of the key advantages that the summary-based interface provides is that it enables comparison between workers for the same task, allowing the validator to target his/her efforts on the more difficult cases.

6.3 MTurk job setup

The entire task totals 24,000 story judgements (8,000 news stories * 3 workers per HIT) spread over 750 HIT instances. We paid our workers $0.50 (US dollars) per HIT (32 judgements), totalling $412.50 (including Amazon’s 10% fees). For this task, we only used workers from the US; our reasoning is that international workers would likely not be able to accurately judge the importance of US news stories. Any incomplete HITs were rejected, such that we collected exactly three judgements per story.
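The headline cost figures quoted here and in Sect. 7.1 follow directly from these settings (a simple check of the arithmetic, using the values stated in the text):

hits = 750               # HIT instances (250 day/category pairs x 3 workers)
reward = 0.50            # US dollars paid per HIT
fee_rate = 0.10          # Amazon's requester fee at the time

total = hits * reward * (1 + fee_rate)
per_judgement = (hits * reward) / (hits * 32)    # excluding fees
print(f"total cost: ${total:.2f}")               # total cost: $412.50
print(f"per judgement: ${per_judgement:.4f}")    # per judgement: $0.0156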

Following an iterative design methodology (Alonso et al. 2008), we submitted our HITs in six distinct batches, allowing feedback to be accumulated and HIT improvements to be made. As with the blog post ranking task described later, between batches we made minor modifications to the judging interface and updated the instructions based upon feedback from the workers. In the next section, we empirically evaluate the story ranking judgements produced. Screenshots of the instructions given to each worker are provided in Appendix 1.

7 Evaluating news story ranking assessments

In this section, we analyse our crowdsourcing job and the relevance assessments produced. We aim to determine how successful the crowdsourcing for this task was, and where improvements can be made. In particular, in the following sub-sections, we investigate four research questions, before drawing conclusions in Sect. 7.6. The four research questions are:

  1. Is crowdsourcing actually fast and cheap? (Sect. 7.1)

  2. Are the resulting relevance assessments of sufficient quality for crowdsourcing to be an alternative to traditional TREC assessments? (Sect. 7.2)

  3. Is having three redundant workers judge each story necessary? (Sects. 7.3 and 7.4)

  4. If we use worker agreement to introduce multiple levels of story importance, would this affect the final ranking of systems at TREC? (Sect. 7.5)

7.1 Crowdsourcing analysis

Before analysing the actual relevance assessments produced, it is useful to examine the salient features of the crowdsourcing job. Prior to launching our job, we estimated that judging the 32 stories of one HIT would take approximately 15 minutes, accounting for the one-off time to read the instructions and the time taken to read each story. Based upon an estimated hourly rate (amount paid per hour of work completed) of $2, we paid a fixed rate of $0.50 per HIT. Table 2 reports the hourly rate paid to workers during each of the six batches. There are two points of interest. Firstly, the hourly rate is higher than expected ($3.28–$6.06), indicating that workers took less time than estimated to complete each HIT. Secondly, we observe an upward trend in the hourly rate in later batches. This shows that, in general, HITs in these batches took the same or less time to complete (although there are exceptions). We believe that there are two reasons for this: firstly, between batches we iteratively improved the instructions, making the task easier, and secondly, we observed a high degree of worker retention between batches and, as such, the workers had the opportunity to become familiar with the task. Moreover, Fig. 5 reports the number of HITs completed by each of the 96 workers. From this, we observe that the majority of the HITs were completed by only three workers, showing that we retained a core of committed workers.

Table 2 Average amount paid per hour to workers and work composition for each batch of HITs
Fig. 5: The number of HITs completed by each of our workers

In terms of the time taken by our batches, we observed a quick uptake by MTurk workers. For each of the 6 batches, the first HITs were often accepted within 10 minutes of launch, whilst the time to complete all HITs in each batch never exceeded 5 hours. Overall, crowdsourcing took a total of 8 working days to accumulate the 24,000 judgements required, including time taken by worker validation and interface improvements.

In general, we conclude that crowdsourcing judgements can be both inexpensive, at $0.0156 per judgement (excluding Amazon’s fees), and fast to complete. Moreover, we believe that this task could have been completed around 38% more cheaply, as we paid above-average rates for the work.

7.2 Relevance assessment quality

In line with best practices in crowdsourcing, we had three individual workers judge each HIT. To determine the quality of our judgements, we measure the agreement between our workers. Table 3 reports the percentage of judgements for each relevance label and the between-worker agreement in terms of Fleiss’ Kappa (Fleiss 1971), on average, as well as for each of the five news categories. In general, we observe that agreement on average is reasonably high (69%). Indeed, this, together with our US worker restriction and manual validation strategy, lends confidence to the quality of the judgements produced. However, it is of interest that agreement varies markedly between news categories. In particular, the Science/Technology and Sport categories exhibit the highest agreement, with 83% and 78% respectively, while the US and World categories show less agreement. Based upon the class distribution for these categories, the disparity in agreement indicates that distinguishing science from non-science stories is easier than distinguishing US or World stories. This is intuitive, as the US and World categories suffer from a much higher story overlap. For example, for the story “President meets world leaders regarding climate change”, it is unclear whether it is a World and/or US story. Hence, workers may disagree over whether it should receive the ‘important’ or ‘wrong category’ label.

Table 3 Judgement distribution and agreement on a per category basis
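For reference, Fleiss’ Kappa over N stories, n = 3 workers per story and k label classes is defined as

\[ P_i = \frac{1}{n(n-1)}\Big(\sum_{j=1}^{k} n_{ij}^{2} - n\Big), \qquad p_j = \frac{1}{Nn}\sum_{i=1}^{N} n_{ij}, \qquad \kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}, \]

where \(n_{ij}\) is the number of workers assigning story i to class j, \(\bar{P} = \frac{1}{N}\sum_{i} P_i\) is the mean observed agreement and \(\bar{P}_e = \sum_{j} p_j^{2}\) is the agreement expected by chance.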

Overall, we conclude that based upon the high level of agreement observed, the relevance labels produced are of sufficient quality. Indeed, our agreement is greater than that observed in many studies of TREC assessments (Al-Maskari et al. 2008). Hence, crowdsourcing appears to be a viable alternative to traditional TREC assessments for the Blog track top stories task.

7.3 Redundant judgements

We use three workers to judge each HIT as a means of improving the quality of the relevance assessments produced. Using more than three workers per assessment was precluded on cost grounds, and was deemed unnecessary given that all HITs were subject to validation. Indeed, previous work has shown that three workers per assessment leads to work of acceptable quality (Snow et al. 2008). However, it is important to determine to what extent redundant judging is necessary, as this is an area where costs can be dramatically decreased. To investigate this, we examine the effect of using only a single judgement per story on the ranking of the runs submitted to TREC 2010. If the run ranking changes little, then there may be no need to have multiple workers judge each HIT, thereby cutting the cost by two thirds. To this end, we group the first, second and third completed assignments of each HIT into three ‘meta workers’. Table 4 reports the ranking of all 18 runs submitted to the TREC 2010 top stories ranking task, when ranking by the majority of the three workers (the official qrels) and when ranking by the judgements produced by each of the three meta workers individually. Furthermore, similarly to Voorhees (2001), Table 4 also reports Kendall’s τ correlation between the run rankings produced by the majority and single-worker judgements. Interestingly, we observe that there are significant changes in the relative ranking of runs in terms of their statMAP score. Indeed, this is especially pronounced at the top of the ranking, where the ordering of the top three systems is not stable. As such, we conclude that redundant judging is necessary for this task.

Table 4 Run rankings using majority of three judgements against single judgements
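Given two orderings of the same runs, the correlation is computed directly from their scores; a small sketch using scipy, with hypothetical run names and statMAP values:

from scipy.stats import kendalltau

runs = ["runA", "runB", "runC", "runD"]
# statMAP scores under the official (majority) qrels and under one meta worker's qrels.
majority = {"runA": 0.31, "runB": 0.28, "runC": 0.25, "runD": 0.22}
single = {"runA": 0.27, "runB": 0.30, "runC": 0.24, "runD": 0.21}

# Kendall's tau compares the two induced rankings of the same runs.
tau, p_value = kendalltau([majority[r] for r in runs], [single[r] for r in runs])
print(round(tau, 3))  # 0.667 for this toy example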

7.4 2+1 Assessment

It has been proposed that, instead of using three workers to assess each HIT, only two workers should initially be used; only if the first two disagree is a third assessment sought as a tie-breaker. This has the advantage of saving judging effort, time and money on assessments that are not required. In this sub-section, we investigate how much we could have saved had we chosen such a 2+1 assessment approach. In particular, we consider the judgements by the 1st and 2nd meta workers defined in the previous sub-section to contribute the first two assessments, while the 3rd meta worker provides the tie-breaker (if required). We measure how many assessments we could have avoided making and how much the resultant ranking of systems would have been affected as a result.

In particular, by using a 2+1 assessment approach, only 19.96% of topic/story pairs required a tie-breaker. As a result, we could have saved $110.06, or 26.7% of the overall job cost. Moreover, the resultant aggregated relevance assessments under the 2+1 assessment approach are identical to those produced under majority voting over all three judgements. This indicates that 2+1 assessment produces assessments equally as accurate as 3-redundancy judging.
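A sketch of the 2+1 logic, assuming per-story label tuples in meta-worker order (toy data); note that whenever the first two workers agree, the outcome necessarily equals the three-way majority, which is why the aggregated assessments are unchanged:

from collections import Counter

def two_plus_one(first, second, third):
    """Return the final label and whether the tie-breaking third judgement was needed."""
    if first == second:
        return first, False
    # Disagreement: the third meta worker acts as tie-breaker
    # (a three-way split falls back to the first worker's label).
    return Counter([first, second, third]).most_common(1)[0][0], True

# Toy example over four stories.
judgements = [("imp", "imp", "not"), ("imp", "not", "not"),
              ("not", "not", "not"), ("imp", "not", "imp")]
finals, tie_breaks = zip(*(two_plus_one(*j) for j in judgements))
print(finals)                              # ('imp', 'not', 'not', 'imp')
print(sum(tie_breaks) / len(judgements))   # 0.5 of stories needed a third judgement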

7.5 Graded judgements

Recall that we have our crowdsourced workers judge each news story as newsworthy or not for a day of interest, creating a binary judgement. However, one of the advantages of using redundant judgements is that one can infer judgement confidence based on worker agreement. In particular, although not used during TREC 2010, we also created an alternative assessment set, where a news story’s importance was measured on a three-level graded scale (Voorhees 2001). If all three workers judged a story important, then the story was assigned a new ‘highly important’ label, two out of three workers resulted in an ‘important’ label, while one or no workers resulted in a ‘not important’ label, again following the worker majority. This differs from the official binary relevance assessments, which distinguish ‘important’ from ‘not important’ only. In this section, we examine how the two-level (binary) judgements compare to this three-level graded alternative. We aim to determine whether using this additional agreement evidence adversely affects the ranking of the TREC 2010 participants.
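The graded label follows mechanically from the number of ‘important’ votes; a trivial sketch:

def graded_label(votes_important):
    """Map the number of 'important' votes (0-3) to the three-level graded scale."""
    if votes_important == 3:
        return "highly important"
    if votes_important == 2:
        return "important"
    return "not important"

print([graded_label(v) for v in range(4)])
# ['not important', 'not important', 'important', 'highly important']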

Table 5 reports Kendall’s τ correlation between the rankings of runs under the binary and graded relevance judgements, using the statMAP and statMNDCG@10 evaluation measures (Aslam and Pavlu 2007). A high correlation indicates that the participating runs were not affected by the addition of a ‘highly important’ category, while a low correlation indicates that some runs favoured highly important stories more than others. From Table 5, we observe that the rankings produced by the binary and graded relevance assessments are not particularly well correlated, especially under statMNDCG@10. This indicates that the ranking of runs is affected by the addition of a ‘highly important’ category. An investigation into the reasons for this result is out of the scope of this paper; we leave it for future work.

Table 5 Kendall’s τ correlation between binary and graded relevance judgements under statMAP and statMNDCG@10 measures over the cross-category mean

7.6 Story ranking conclusions

From our evaluation of the TREC Blog track story ranking sub-task, we conclude that crowdsourcing is a viable alternative to using TREC assessors for relevance assessment. Indeed, the levels of agreement between our workers are superior to those observed between TREC assessors for previous tasks (Voorhees 2000). Furthermore, we have confirmed the importance of redundant judging, as stated in the literature, for relevance assessment in a TREC setting. Moreover, we have shown that expanding the binary relevance assessments using worker agreement can strongly affect the overall ranking of participating runs, and that, by using a 2+1 assessment strategy, the overall job cost could have been reduced by 26.7% with no change in the resultant assessments.

8 Blog post relevance assessment

In this section, we describe our HIT design and validation for creating relevance assessments for the blog post ranking sub-task of the top news stories identification task. In particular, Sect. 8.1 describes the design of our HIT, while our methodology for validating the work produced is detailed in Sect. 8.2. We provide details regarding the MTurk job setup in Sect. 8.3.

8.1 HIT design for blog post ranking

The evaluation task for blog post ranking is to judge a set of 7,975 blog posts from the Blogs08 corpus as relevant or not to a news story, and as representing zero or more perspectives selected from a pre-defined list. As before, we use Amazon’s online marketplace Mechanical Turk (MTurk) to perform our judging. In particular, each MTurk Human Intelligence Task (HIT) covers 20 blog posts retrieved for one of the 68 news stories. For these blog posts, we ask workers to judge each as either: (1) Relevant, (2) Possibly relevant or (3) Not relevant, and to select zero or more perspectives describing that post.

To develop the HIT for blog post assessment, we followed the iterative design methodology proposed by Alonso and Baeza-Yates (2011). Using a small test set comprised of 200 blog posts returned for two news stories that had been manually assessed, we iteratively developed a prototype HIT, evaluated worker performance and subsequently made improvements. An example of the resulting HIT design is shown in Fig. 6. Importantly, during our iterative design process, we made the following two observations.

Fig. 6: A screenshot of the external judging interface shown to workers within the instructions

Firstly, each HIT should encompass multiple judgements. In particular, we evaluated the overall time taken to judge all 200 blog posts when combining 20 blog posts into a single HIT, and when judging only one blog post per HIT; the amount paid overall remained constant. Our results show that by combining 20 posts per HIT, completion time is reduced by 79.2% over judging one blog post per HIT. This reduction stems from both the reduced time spent judging each post, i.e. less loading/selecting of different HITs, as well as higher levels of worker retention, i.e. workers completed multiple of the larger HITs in succession.

Secondly, when judging HTML documents like blog posts, the rendering of pages can prove difficult. In particular, the blog posts within the Blogs08 corpus were crawled during 2008. However, rendering the content of these posts at a later point in time can result in a badly mangled page: while the actual page content does not change, any linked files, such as the website template (CSS), images or JavaScript, may have been modified or removed. Indeed, a proportion of our workers left comments indicating that pages were difficult to read in some cases. For example, Fig. 7a illustrates an example blog post when presented in its original HTML form. As we can see, the page template has changed markedly, such that the main page content is no longer visible. To counteract this, as well as to decrease loading delays, by default we show workers a cleaned version of the Web page, created by extracting only the text within <h> or <p> tags. Fig. 7b shows the cleaned version of the same page. However, we do provide a full HTML rendering that can be loaded by pressing the ‘Full Post’ button on the left frame of the HIT (see Fig. 6).

Fig. 7: An example blog post rendered (a) as the original HTML and (b) as the cleaned version
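A minimal sketch of this kind of text extraction, using Python's standard html.parser; the tag whitelist mirrors the <h>/<p> rule described above, while real blog posts would need more robust handling (encoding detection, malformed markup, etc.):

from html.parser import HTMLParser

class CleanText(HTMLParser):
    """Keep only the text inside heading (<h1>-<h6>) and paragraph (<p>) tags."""
    KEEP = {"p", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level inside kept tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return "\n".join(self.chunks)

parser = CleanText()
parser.feed("<html><body><script>var x;</script><h1>Title</h1>"
            "<div>sidebar menu</div><p>Post body.</p></body></html>")
print(parser.text())  # prints "Title" then "Post body."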

8.2 Work validation

As prior work in the field of crowdsourcing relevance assessment has shown, it is important to validate the work produced by the crowd (Kittur et al. 2008). To this end, to assure the quality of the relevance assessments produced, we use a form of gold judgement validation. Typical gold standard validation involves the prior creation of a gold-standard judgement set of around 5% of the total evaluation size, with which to test workers. This form of validation has two notable disadvantages. Firstly, it incurs the excess cost of having workers judge the gold standard. Secondly, depending on how the gold judgements are selected, the background distribution of answers may make gold judgement validation unreliable, e.g. if the majority of posts are relevant, then under a random selection approach most of the gold judgements will be relevant as well.

We believe that it is more effective to perform gold judgement validation after all the documents have been assessed. Indeed, MTurk allows this by withholding payment for tasks until the work has been approved by the requester. The advantage of creating the gold judgements after the workers have completed the task is that we then have knowledge of the judgement each worker assigned to each blog post. We use this to ensure that we validate each worker across the range of possible judgements, i.e. for each worker, we sample an equal number of blog posts judged as relevant, possibly relevant and not relevant, respectively, for use as gold judgements. In this way, we avoid issues with the background answer distribution, as we take an even spread across the possible answers.

In particular, for each judged HIT containing 20 blog posts, we selected three posts to be validated against a gold standard, i.e. one judged relevant, one judged possibly relevant and one judged not relevant. For each of these selected posts, the track organisers assessed its relevance to the associated news story, forming the gold standard. If more than one of these did not match the gold standard, then the HIT was rejected as a whole and re-posted for another worker to complete. Notably, this resulted in a gold standard covering roughly 15% of the full judgement set (1,197/7,975), which took in the region of 8 hours to create. This is naturally longer than it would take to create a normal 5% gold standard set. However, by using a larger and more evenly distributed gold standard, we have greater confidence in the reliability of the gold standard, and hence in the resulting judgements.
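A sketch of this post-hoc selection and the rejection rule, under assumed data structures (a completed HIT is a dict mapping post identifiers to the worker's labels; the organiser-supplied gold labels are filled in afterwards):

import random

def select_for_gold(hit_judgements):
    """Pick one post per label class from a completed HIT for gold checking."""
    selected = []
    for label in ("relevant", "possibly relevant", "not relevant"):
        candidates = [post for post, l in hit_judgements.items() if l == label]
        if candidates:
            selected.append(random.choice(candidates))
    return selected

def accept_hit(worker_labels, gold_labels):
    """Reject the HIT if more than one gold-checked post disagrees with the gold."""
    mismatches = sum(worker_labels[post] != gold for post, gold in gold_labels.items())
    return mismatches <= 1

# Toy example: two of the three checked posts disagree, so the HIT is rejected.
worker = {"post1": "relevant", "post2": "possibly relevant", "post3": "not relevant"}
print(select_for_gold(worker))   # ['post1', 'post2', 'post3']
gold = {"post1": "not relevant", "post2": "not relevant", "post3": "not relevant"}
print(accept_hit(worker, gold))  # False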

8.3 MTurk job setup

The entire task totals 7,975 blog posts from 68 news stories spread over 433 HIT instances. We paid our workers $0.50 (US dollars) per HIT (20 judgements), totalling $238.70 (including Amazon’s 10% fees). Notably, according to best practices in crowdsourcing, each blog post should be judged by three workers and the majority result taken. However, to do so, the crowdsourcing cost would have exceeded $600, which was beyond the budget set aside for the task. Instead, we have only a single worker judge each blog post. Later, in Sect. 9.2, we empirically evaluate the judgements produced, to determine whether this had an adverse effect on the quality of the relevance assessments. We did not restrict worker selection based on geography; however, only workers with a prior 95% acceptance rate were accepted for this task, although, as noted earlier in Sect. 4, whether this has any marked effect on the resulting judgements is contested.

Continuing with an iterative methodology (Alonso et al. 2008), we submitted our HITs in six distinct batches, allowing feedback to be accumulated and HIT improvements to be made. The first five batches were comprised of HITs containing 20 blog posts to be judged; the sixth batch contained all of the remaining blog posts for each news story. As suggested by Le et al. (2010), between batches we updated the instructions based upon feedback from the workers. Indeed, we maintained a ‘Guidelines’ section specifically for questions frequently asked by the workers. The majority of the guidelines added related to corner cases that were not covered by the instructions. Four examples are given below:

  • Any inappropriate pages (porn) should be marked as not relevant.

  • If a page is not in English mark it as not relevant.

  • A blog post can still be highly relevant even if it was posted before the story, i.e. expectations about the story may still be of interest.

  • You should NOT consider any comments in the post when judging.

We do not expect that the guideline additions markedly affected the judgement quality of later batches, as the occurrence of such corner cases in this task was relatively small. In the next section, we empirically evaluate the blog post ranking judgements produced. Screenshots of the instructions given to each worker are provided in Appendix 2.

9 Evaluating blog post ranking assessments

In this section, we analyse our crowdsourcing job and the resultant relevance assessments for blog post ranking. We aim to determine how successful crowdsourcing this second task was, and where improvements can be made. In particular, we wish to examine how cheap and fast crowdsourcing was for assessing blog posts (rather than news stories), in addition to evaluating the quality of the results produced. In each of the following three sub-sections, we investigate a related research question, followed by conclusions in Sect. 9.4. The three research questions are:

  1. Is crowdsourcing blog post relevance assessments a fast and cheap alternative to TREC assessments? (Sect. 9.1)

  2. Are the resulting relevance assessments of sufficient quality for crowdsourcing to be an alternative to traditional TREC assessments, and how do the results compare to the small earlier studies by Alonso and Mizzaro (2009) and Alonso and Baeza-Yates (2011)? (Sect. 9.2)

  3. When assessing relevance for news-related topics, are blog posts from particular news categories more difficult to judge than others? (Sect. 9.3)

9.1 Crowdsourcing analysis

It has been suggested that the crowdsourcing of relevance assessments can be completed at little cost, and often very quickly (Alonso et al. 2008). For the story ranking sub-task, we concluded that crowdsourcing was indeed a fast and cheap alternative. However, this may not be the case when assessing blog posts rather than news stories. We begin by investigating whether crowdsourcing is also a cheap and fast alternative for the blog post ranking sub-task of the TREC 2010 Blog track.

Based upon the tests run during the HIT design, we estimated that it would take approximately 45 seconds to judge each blog post, roughly double the time estimated to judge each story in the previous task. This longer estimate stems from the length of many blog posts: to gain sufficient information about a post, such that its different perspectives can be assessed, a worker must in many cases scan through the entire post. Moreover, as we are dealing with HTML pages, there are delays introduced in loading, particularly if a worker wishes to see the non-cleaned version containing advertising and/or images. We based HIT payment on an hourly rate (amount paid per hour of work completed) of $2. As such, given 45 seconds per judgement and 20 judgements per HIT, $0.50 was paid for each HIT. Table 6 reports the hourly rate paid to workers during each of the six batches. Batch 6 is reported separately, as its statistics differ: its HITs did not each contain exactly 20 blog posts to be judged.
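The derivation of the $0.50 reward, and the effect of workers judging faster than estimated, can be sketched as follows (an illustrative calculation; the 25-second figure is hypothetical, chosen only to show how an effective rate above the target can arise):

    # Back-of-the-envelope derivation of the $0.50 reward per HIT from the $2/hour target rate.
    SECONDS_PER_JUDGEMENT = 45   # estimated during the HIT design tests
    JUDGEMENTS_PER_HIT = 20
    TARGET_HOURLY_RATE = 2.00    # US dollars per hour of work

    estimated_hours_per_hit = SECONDS_PER_JUDGEMENT * JUDGEMENTS_PER_HIT / 3600  # 0.25 hours
    reward_per_hit = TARGET_HOURLY_RATE * estimated_hours_per_hit                # $0.50
    print(f"Reward per HIT: ${reward_per_hit:.2f}")

    # If workers actually judge faster, the same reward translates into a higher
    # effective hourly rate, e.g. at a hypothetical 25 seconds per judgement:
    actual_hours_per_hit = 25 * JUDGEMENTS_PER_HIT / 3600
    print(f"Effective rate: ${reward_per_hit / actual_hours_per_hit:.2f}/hour")  # ~$3.60/hour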

Table 6 Average amount paid per hour to workers and work composition for each batch of HITs

From Table 6, we observe that beyond the first, very small, batch, the amount paid per worker is higher than anticipated, at around $3.50–$4 per hour against our $2 target. As with the story ranking task described previously, it appears we over-paid our workers, indicating that our estimate from the iterative design phase underestimated the speed at which workers could complete the HIT. This reinforces how difficult it is to estimate the time required to complete a task, as small test jobs may not be representative.

Prior work has shown that a high degree of worker retainment is an indicator that the assessments produced are of good quality, as it reflects worker commitment to the task (McCreadie et al. 2010). For the previous story ranking task, we observed a high degree of worker retainment. Figure 8 shows the distribution of assessments produced by each individual worker. As for the story ranking task, we observe that the majority of the HITs were completed by only three workers. Hence, we retained a relatively small number of committed workers for the task, a positive indicator that the resulting judgements will be of good quality.

Fig. 8 The number of HITs completed by each of our workers

We next examine the speed at which each batch of our crowdsourcing task was completed. In particular, we observed an initially high uptake, with HITs being accepted within minutes of each job being submitted. Figure 9 shows both the number of judgements and the overall time taken by each batch (in minutes). We observe that, in general, the time to complete is roughly dependent on batch size. However, batch 3 is the exception, taking markedly longer. We have no concrete evidence as to why this batch took so much longer to complete, but from analysis of the data we note that it was the only batch submitted during late evening hours in Europe. Due to time-zone differences, the large pool of Indian workers, who comprise a meaningful proportion of the overall MTurk workforce,Footnote 10 would have been unavailable at that time. The total time taken to complete all batches was just under two weeks.

Fig. 9 The number of judgements and the overall time taken by each batch (in minutes)

In general, we conclude that crowdsourcing judgements for the blog post ranking task is inexpensive, at $0.025 per judgement, in comparison to using TREC assessors. Indeed, assuming a similar assessment rate between crowdsourced workers and TREC assessors, and a conservative $7.25 minimum wage for TREC assessors, crowdsourcing cost less than a third as much. Moreover, although not as quick as for the story ranking task, two weeks is a reasonable timescale for generating assessments for a TREC task.
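The cost comparison assumed above can be made explicit with a short sketch (all quantities are the stated assumptions, not measured values):

    # Rough per-judgement cost comparison under the stated assumptions: a TREC
    # assessor paid the $7.25 minimum wage, judging at the same 45 s per post.
    SECONDS_PER_JUDGEMENT = 45
    MINIMUM_WAGE = 7.25                 # US dollars per hour
    CROWD_COST = 0.50 / 20 * 1.10       # $0.025 per judgement plus Amazon's 10% fee

    assessor_cost = MINIMUM_WAGE * SECONDS_PER_JUDGEMENT / 3600   # ~$0.091 per judgement
    print(f"TREC assessor: ${assessor_cost:.3f}, crowd: ${CROWD_COST:.4f}, "
          f"ratio: {assessor_cost / CROWD_COST:.1f}x")            # ratio above 3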

9.2 Relevance assessment quality

For the blog post ranking task, we had a single worker judge each blog post. As such, we cannot use inter-worker agreement to estimate the final quality. Instead, to determine the quality of the resulting relevance assessments, we compare them against assessments produced by the track organisers. In particular, we randomly sampled approximately 5% of the blog posts judged (360/7,975) and manually assessed each post in terms of its relevance to the associated news story. Note that this is not the same as the gold standard used to validate the workers during the production of the relevance assessments (see Sect. 8.2), but a different set used to evaluate the quality of the final relevance assessments produced. Table 7 reports the accuracy of the crowdsourced blog post relevance judgements in comparison to the aforementioned track organiser judgements, both overall (All) and in terms of each relevance grade.

Table 7 Accuracy of blog post assessments in comparison to the track organiser judgements

We observe a reasonable overall accuracy of 66.63%, i.e. two-thirds of the crowdsourced judgements exactly matched our track organiser judgements. This indicates that, using a single crowdsourced worker, we can still generate assessments of reasonable quality with the gold standard validation method described in Sect. 8.2. Furthermore, we see that a disproportionate number of incorrect judgements belonged to the ‘Probably Relevant’ category. This is to be expected, as blog posts falling into this category were also difficult for our track organisers to assess. To account for this, we collapse the three-way graded assessments into a binary form by combining the ‘Probably Relevant’ and ‘Not Relevant’ categories. By doing so, we observe a markedly higher worker accuracy of 76.73%. Based upon this higher level of accuracy, and the resilience of Cranfield-style evaluation when using many topics to combat variances in assessment quality (Bailey et al. 2008), we conclude that the relevance labels produced are of sufficient quality for crowdsourcing to be a viable alternative for TREC relevance assessment.
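To make the evaluation procedure concrete, the following sketch shows how the graded and binary-collapsed accuracies can be computed, assuming each sampled post carries one crowd label and one track organiser label on a three-grade scale (the grade names and the toy sample below are illustrative only):

    def accuracy(pairs):
        """Fraction of (crowd, organiser) label pairs that match exactly."""
        return sum(crowd == gold for crowd, gold in pairs) / len(pairs)

    def to_binary(label):
        """Collapse the graded scale: 'Probably Relevant' folds into 'Not Relevant'."""
        return "Relevant" if label == "Highly Relevant" else "Not Relevant"

    def binary_accuracy(pairs):
        return accuracy([(to_binary(c), to_binary(g)) for c, g in pairs])

    # Toy sample of (crowd label, organiser label) pairs; the real study used 360 sampled posts.
    sample = [("Highly Relevant", "Highly Relevant"),
              ("Probably Relevant", "Not Relevant"),
              ("Not Relevant", "Not Relevant")]
    print(accuracy(sample), binary_accuracy(sample))   # ~0.67 graded vs. 1.0 binary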

The small earlier studies of crowdsourcing relevance assessments by Alonso and Mizzaro (2009) and Alonso and Baeza-Yates (2011) also examined the relevance assessments produced in terms of agreement at different relevancy levels. In particular, both studies measured worker agreement with TREC assessors for both relevant and non-relevant documents, concluding that workers disagree more often on non-relevant documents than on relevant ones. Although we do not have TREC assessors to compare against, it is reasonable to use the track organiser judgements as a surrogate, as they should be of similarly high quality. By doing so, we investigate whether their conclusion, i.e. that workers disagree more often on non-relevant documents than relevant ones, holds for our larger study.

From Table 7, we unexpectedly observe that agreement on the relevant category was 62%, while agreement on the non-relevant category was 75%. From this, it does not appear that workers disagreed more on the non-relevant blog posts for the blog post ranking task; if anything, the opposite appears to be the case. We believe that this discrepancy may be attributed to relevant blog posts about certain news stories being particularly easy to miss, due to inconsistencies between the posts and the story. For example, we observed that stories like ‘Israel kills 52 Palestinians in Gaza, 2 soldiers dead’ are difficult to judge blog posts for, as the story was often reported with incorrect numbers, either due to misinformation or for political reasons.

9.3 Topic analysis

The blog posts judged by our workers were spread over 68 different news stories published in 2008. As noted above, blog posts for certain types of news story may be difficult to judge. To examine this in more detail, we manually categorised each of the 68 news stories into five news categories (US news, World news, Sport, Business/Finance and Science/Technology); notably, these are the same news categories used earlier for the story ranking task. Table 8 reports the judgement distribution and the accuracy between the workers and track organisers for the blog post assessments in each of these news categories. We observe that the workers achieved higher judging accuracy on the Sport and Business categories, indicating that blog posts returned for these categories were easier to assess. Indeed, supporting our earlier observation that the ‘Probably Relevant’ grade was a major source of error, the percentage of ‘Probably Relevant’ judgements was lower than average for the Sport and Business categories.

Table 8 Judgement distribution and accuracy for the blog post assessments, overall as well as when categorised by topics belonging to a specific news category

9.4 Blog post ranking conclusions

From our evaluation of the blog post ranking sub-task of the TREC Blog track, crowdsourcing appears to be both a feasible and a cheap method for creating relevance assessments. Indeed, we estimate that crowdsourcing relevance assessments for this task was around three times cheaper than hiring TREC assessors. On the other hand, assessment took markedly longer than expected, requiring two weeks to create only 7,975 judgements. We also compared our agreement figures with those reported by earlier studies of crowdsourced TREC assessments. Surprisingly, and in contrast to those studies, crowdsourced workers were not more likely to disagree with our expert track organiser judgements on non-relevant blog posts than on relevant ones.

10 Discussion

In this work, we crowdsourced relevance assessments for two different TREC sub-tasks. In this section, we compare and contrast our observations and, with hindsight, discuss what we might have done differently.

10.1 Feasibility

The primary investigation undertaken in this paper was to examine whether it is feasible to crowdsource relevance assessments in a real TREC setting, which requires thousands of individual assessments to be made. Our success in terms of the quality of the assessments produced, and their low cost in comparison to hiring TREC assessors, attests to the feasibility of crowdsourcing relevance assessments for TREC. Indeed, the assessments generated using the techniques described in this paper were used to compare the participating systems in the TREC 2010 Blog track top news stories identification task. However, in terms of the time taken to produce all judgements, crowdsourcing does not seem to be particularly fast. In particular, for the blog post ranking task, it took two weeks to assess just under 8,000 blog posts. Indeed, we only completed the blog post ranking assessments during the weeks following the TREC conference.

10.2 HIT size

When crowdsourcing relevance assessments, the HIT design is of paramount importance. From our experience, it is clear that each HIT should span multiple judgements: not only is judgement consistency improved, but so is worker retainment. We also observed for the story ranking task that high degrees of worker retainment may lead to workers learning to become better at the task, in turn increasing the rate of judging.

10.3 External hosting

Furthermore, in this work, we experimented with an externally hosted interface for our HITs, based on previously developed judging software. We found the additional ability to log worker activity useful as a form of implicit feedback. In particular, counting the number of workers who viewed a task but did not accept it was highly useful when determining whether the interface should be improved. However, one needs to be careful when providing external resources, as slow response times may deter workers from completing the tasks. We thoroughly tested our system on the MTurk sandbox from machines outside our office to check for both availability and latency.

10.4 Validation

In this work, we tested two different non-standard methods for validating the work produced by the crowd: post-assessment gold judgements and a novel manual technique. The post-assessment gold judgement method was designed as a means to improve the quality of the relevance assessments when only one worker judges each document. Our experience indicates that when redundant judging is not an option, normally due to excessive cost, one needs to be stricter about the quality of the work accepted. However, stricter quality control increases reliance on effective validation. As the validation effort available is also limited, it is important to target validation where it makes the most impact. Creating gold judgements post-assessment allows the validator to target validation effort at the workers who have the greatest impact on the final assessment set, i.e. those who do the most work, rather than spreading it evenly over all workers.
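A minimal sketch of this targeting strategy is given below; the function and parameter names are our own, and the thresholds are illustrative rather than those used in the study:

    import random
    from collections import Counter

    def select_for_gold_validation(judgements, top_n=3, per_worker=10, seed=0):
        """judgements: list of (worker_id, doc_id, label) tuples collected from the crowd.
        Returns a sample of judgements, drawn from the heaviest contributors, for which
        gold labels will be created after the fact."""
        rng = random.Random(seed)
        counts = Counter(worker for worker, _, _ in judgements)
        heavy = [worker for worker, _ in counts.most_common(top_n)]
        selected = []
        for worker in heavy:
            own = [j for j in judgements if j[0] == worker]
            selected.extend(rng.sample(own, min(per_worker, len(own))))
        return selected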

The manual, summary-based validation approach developed naturally as a function of the judging interface, and required a similar amount of time to gold-judgement validation. At least for tasks spanning hundreds of HITs, manual assessment holds merit in that it allows the requester to better analyse the types of error that workers are making and to improve the HIT design accordingly. Moreover, we found that summarising the work completed by multiple workers on a per-HIT basis was helpful, as it allows easy identification of workers who are acting outside the norm and hence should be investigated further.

From our experience, the combination of redundancy and summary-based validation is superior to 1-assessment judging, even with post-assessment gold judgements, and should be used where possible. However, in some cases 1-assessment judging is all that can be afforded; in these cases, we believe that creating gold judgements post-assessment is better than generating a gold-judgement set beforehand.

10.5 Redundancy

For our experiments, we used three workers to judge each HIT for the story ranking task, but on cost grounds only one worker per HIT for the blog post ranking task. The lack of inter-worker agreement as an evaluation measure made it difficult to determine whether the final judgements for the blog post ranking task were of good quality. Indeed, we resorted to manually assessing additional blog posts, at a not inconsiderable cost in time. With hindsight, it may have been better to decrease the amount paid to each worker or to decrease the size of the document pool, thereby releasing funds to spend upon redundancy.

Also of interest is that we have shown that, had we used a 2+1 assessment strategy instead of a 3-assessment approach, the overall job cost could have been reduced by 26.7% with no change in the resultant assessments. Indeed, we could have saved over $110, which could have been used to add redundancy for the second task.
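For clarity, a 2+1 strategy can be expressed as follows; request_judgement is a placeholder for whatever collection mechanism (e.g. an additional MTurk assignment) is used, and this sketch is our own illustration rather than the exact procedure analysed above:

    from collections import Counter

    def two_plus_one(doc_id, request_judgement):
        """Collect two judgements; only pay for a third when the first two disagree."""
        first = request_judgement(doc_id)
        second = request_judgement(doc_id)
        if first == second:
            return first                     # agreement: the third judgement (and its cost) is avoided
        third = request_judgement(doc_id)    # tie-breaker
        return Counter([first, second, third]).most_common(1)[0][0]

Assuming cost scales with the number of judgements, a 26.7% saving relative to 3-assessment would correspond to roughly 80% of documents being resolved by the first two judgements alone.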

10.6 Payment

For both tasks, we overestimated how long it would take workers to complete the HITs posted, resulting in wasted funds that could have been spent elsewhere. This not only shows the difficulty of estimating completion time from small test jobs, but also highlights the need to re-cost HITs as workers become more proficient.

11 Conclusions and best practices

In this paper, we have described our crowdsourcing approach for creating relevance judgements for the two sub-tasks of the TREC 2010 Blog track top news stories identification task, namely blog post ranking and news story ranking. Indeed, this is the first example where crowdsourced relevance assessments have been used at TREC. Based upon the high levels of agreement, whether with track organiser judgements or between our workers themselves, in addition to the manual validation that we performed, we believe that crowdsourcing is a highly viable alternative to traditional TREC assessment, proving to be both cheap and effective, if not particularly fast.

Furthermore, we have detailed our experiences of running this large crowdsourced evaluation, spanning over 30,000 relevance assessments, and have discussed the lessons learned. In particular, we compared our assessor agreement with that reported by other studies of crowdsourced TREC assessments. Surprisingly, and in contrast to those studies, crowdsourced workers were not more likely to disagree on non-relevant documents than on relevant ones. Furthermore, we have confirmed the importance of redundant judging for relevance assessment in a TREC setting, and shown that expanding the binary relevance assessments using worker agreement can strongly affect the overall ranking of participating groups.

Based upon this first successful example of crowdsourcing relevance assessments for a TREC task, we recommend the following five best practices, in addition to those documented in Kittur et al. (2008), Callison-Burch (2009), McCreadie et al. (2010) and Le et al. (2010), for organisers of future TREC tracks considering a crowdsourced alternative, for participants of the TREC Crowdsourcing track Footnote 11 and for the wider crowdsourcing community:

  1. Don’t be afraid to use larger HITs: As long as the workers perceive that the reward is worth the work, uptake on the jobs will still be high. Indeed, we observed a 10-times speed-up by combining 20 judgements into each HIT.

  2. If you have an existing interface, integrate it with MTurk: There is often no need to build a new evaluation interface for MTurk; with a few tweaks and sufficient instruction, workers can use existing software.

  3. Gold judgements are not mandatory: While worker validation is essential, there are viable alternatives. We successfully validated all HITs manually with the aid of colour-coded summaries.

  4. Re-cost your HITs as necessary: As workers become familiar with the task, they will become more proficient and take less time. You may wish to revise the cost of your HITs accordingly if cost is an issue.

  5. Use 2+1 assessment: A 2+1 assessment strategy can provide similar performance to 3-redundancy assessment while reducing the number of judgements required.