1 Introduction

In the last few years, crowdsourcing has been used for a wide range of applications, including information retrieval relevance evaluation (Alonso and Mizzaro 2012), machine translation (Callison-Burch 2009), and natural language processing (Snow et al. 2008), to name a few. The lower cost of running experiments, combined with the flexibility of the editorial approach at a larger scale, makes crowdsourcing very attractive for quickly testing new ideas. It also makes it possible to introduce experimentation early in the system development cycle.

Now that crowdsourcing is being adopted by industry and academia, people are noticing that deploying it in practice is not that simple. Tasks have to be designed carefully, with special emphasis on the interface and instructions, and quality control is crucial. As in any emerging field, best practices and methodologies for leveraging crowdsourcing are still lacking.

There are many different experiments that we can design and implement, depending on which components of the information retrieval system require evaluation. Our approach to implementing crowdsourcing-based experimentation is similar to software development, with a focus on both design and operational issues. However, there is one important distinction: tasks are performed by human computers, not machines. Also, due to the nature of crowdsourcing platforms, the pool of workers may vary on any given day.

Although the experiments differ (e.g., classification, relevance assessment, ranking, spelling correction, etc.), they share a common workflow that includes defining what to test, data sampling, experimental design, execution, and data analysis.

In the case of information retrieval, the goals of experimentation are to evaluate the performance of a new technique and to construct training sets. Producers and consumers of such data need to design experiments and gather results with care, since the data they produce will be used for decision making and input to other components.

An analysis of previously published crowdsourcing experiments reveals a number of commonalities. Four main beneficial properties stand out: speed, cost, quality, and diversity.

  • Well-designed experiments tend to complete very quickly, usually producing initial results in less than 24 hours.

  • The cost of implementing and running experiments is usually low. Paying a few cents per task, a whole experiment may cost around $25. Even if the experiment has deficiencies and needs to be improved, the cost of debugging and testing is affordable.

  • The output is usually of good quality. This does not mean that there is no need to deal with unreliable workers, but with some quality control mechanisms in place one can obtain good results.

  • Diversity of workers is a desired property when evaluating IR systems that are used by millions of people.

In the specific case of quality control, most current research takes an adversarial approach: identifying bad workers who game the system to earn money without performing good work. While the presence of spammers is a reality in any system that involves a monetary incentive, assuming that the experiment design itself has no defects is an equally serious flaw.

We are interested in tasks that a developer or experiment designer needs to run continuously over extended periods of time in an industrial environment, rather than proofs of concept. In a commercial product, search evaluation and experimentation are ongoing activities where measurement is conducted periodically. Data quality is king when constructing training sets and when using test data sets to evaluate system performance. State-of-the-art infrastructure and processes for extracting and using high-quality data are key components in industrial settings.

In this paper, we explore a number of aspects that influence the outcome of a task and that, if treated with quality in mind as a whole, can produce better results in terms of data quality. Some of these aspects may look obvious to the reader, but crowdsourcing spans many areas, so it is essential to understand our own limitations and to be open to collaborating with other experts when help is needed.

The contributions of this paper are threefold: (a) we provide a development framework and process for implementing crowdsourcing experiments in the context of information retrieval with an industrial setting in mind, (b) we identify factors that may influence the results, and (c) we provide recommendations that should be useful to designers, regardless of the crowdsourcing platform of choice, whenever human computers (workers or hired editors) are needed.

In the rest of this paper we use relevance assessment as an example of the many experiments that are conducted on a daily basis to assess the overall quality of information retrieval systems. We present examples of IR tasks using a commercially available platform. Similar studies and experiments were conducted using the proposed framework with an in-house tool for large data sets.

This paper is organized as follows. First, in Sect. 2 we present an overview of the related work in this area. Second, we describe our development framework in Sect. 3. We examine operational considerations for experiments in production in Sect. 4. Section 5 covers experimental design, including survey design, user interface guidelines, and metadata considerations. Then, we address quality control issues in Sect. 6. We discuss content presentation and worker feedback in Sect. 7. We end with some final remarks and research directions in Sect. 8.

2 Related work

Relevance evaluation in IR and related applications are areas in which crowdsourcing is being used by companies, researchers, and start-ups to test and evaluate different techniques. Although in its infancy, crowdsourcing for IR is a very active area; workshops have been organized at premier conferences to discuss the latest developments and research findings (Chandrasekar et al. 2010; Lease et al. 2010, 2011; Lease and Yilmaz 2011). The new crowdsourcing task at TREC shows a lot of promise to continue testing different techniques for assessment and quality control (Lease and Kazai 2011).

The notion of using people for computing tasks is not new and goes back many years as presented in the excellent book by Grier (2006). The problems of managing people, incentives, attrition, and quality are still relevant today but in a Web context.

One of the first user studies with crowdsourcing that provided initial findings in terms of quality was performed by Kittur et al. (2008) using a very small set of Wikipedia pages. A larger-scale study in the NLP context is the work by Snow et al. (2008), which examines the quality of workers on four different NLP tasks, namely affect recognition, word similarity, textual entailment, and event temporal ordering. Machine translation is an area that relies heavily on multilingual expertise, and the work by Callison-Burch’s team showed how crowdsourcing can be used for evaluating machine translation (Bloodgood and Callison-Burch 2010; Callison-Burch 2009; Zaidan and Callison-Burch 2011).

There is previous work on using crowdsourcing for information retrieval evaluation. Alonso and Mizzaro (2012) reported a series of experiments using twelve topics from TREC-7 and TREC-8. Besides validating crowdsourcing for relevance assessment, they found that workers were as good as the original assessors, and in some cases they were able to detect errors in the gold standard. In a related approach, Alonso and Baeza-Yates (2011) concentrate on design and implementation issues with a fixed budget of $100 using TREC as example. Blanco et al. (2011) propose to use crowdsourcing for new evaluation campaigns such as Web object retrieval and show evidence that the approach is repeatable and reliable. Carterette and Soboroff (2011) study assessor errors in the context of crowdsourcing for large collections.

Book search has been an area with substantial crowdsourcing usage. Kazai et al. (2009) propose a method for gathering relevance assessments by promoting communication between assessors and self re-assessment by individual assessors. A study of different parameters (worker quality, required effort, and monetary reward) is presented in Kazai (2010). More recent results on crowdsourcing and book search are reported in Kazai (2011).

There is a lot of research on quality for crowdsourcing platforms (Ipeirotis et al. 2010; Kapelner and Chandler 2010; Kern et al. 2010; Law et al. 2011; Sheng et al. 2008; Smucker and Jethani 2011). Grady and Lease (2010) focused their work on human factors for crowdsourcing assessments. Testing whether a task is crowdsourcable is a new research direction (Eickhoff and de Vries 2011). Smucker uses detection theory to assess work quality, measuring high sensitivity (good ability to discriminate) and low sensitivity (poor ability) and producing statistics on correct and incorrect recognition (Smucker and Jethani 2011; Smucker 2011). We discuss some of the most popular approaches in Sect. 6.

Mason and Watts (2009) found that increased financial incentives increase the quantity, but not the quality, of work performed by workers. They explain the difference as an “anchoring” effect: people who were paid more also perceived the value of their work to be greater, and thus were no more motivated than people who were paid less.

General techniques for conducting experiments on Mechanical Turk, a crowdsourcing platform, are presented by Mason and Suri (2011) in the context of behavioral research. Paolacci et al. (2010) report a demographic study of Mechanical Turk, and Chilton et al. (2010) studied the search behavior of its workers. In a more peripheral application, Tang and Sanderson (2010) used crowdsourcing to evaluate user preferences on spatial diversity.

Finally, there are a couple of recent tutorials that provide the state of the art on the many techniques, experimental results, platforms, tools, and current research (Alonso and Lease 2011; Ipeirotis and Paritosh 2011).

3 Development framework

Using the same terminology that is presented in Law and von Ahn (2011), we define human computation as a computation (or task) that is performed by a human. A human computation system is a system that organizes human efforts to carry out computation and crowdsourcing is a tool that a human computation system can use to distribute tasks.

Most of the approaches for crowdsourcing-based relevance experimentation are based on Amazon Mechanical Turk (MTurk, AMT). MTurk is the most popular commercial crowdsourcing platform, and users are fairly familiar with the service. The individual or organization who has work to be performed is known as the requester. A person who signs up to perform work is described in the system as a worker. The unit of work to be performed is called a HIT (Human Intelligence Task). A detailed description of features, documentation, and other technical details can be found on the developers’ website.
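
As a concrete illustration, the sketch below creates a single HIT programmatically. It uses the present-day boto3 MTurk client rather than the API available at the time of writing; the title, reward, timeouts, and file name are hypothetical, and the question body is assumed to be prepared separately in the platform's question XML format.

```python
import boto3

# Hypothetical requester-side sketch; the sandbox endpoint avoids spending real money.
mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)

# The question body (instructions, document, rating scale) is assumed to live in a
# separate XML file in the platform's question format.
question_xml = open("relevance_question.xml").read()

hit = mturk.create_hit(
    Title="Judge the relevance of a document to a topic",
    Description="Read a short document and decide whether it is relevant to the topic shown.",
    Keywords="relevance, search, evaluation",
    Reward="0.05",                    # payment per assignment, in dollars, as a string
    MaxAssignments=3,                 # number of unique workers per HIT
    LifetimeInSeconds=24 * 60 * 60,   # how long the HIT remains visible to workers
    AssignmentDurationInSeconds=600,  # time a worker has once the HIT is accepted
    Question=question_xml,
)
print(hit["HIT"]["HITId"])
```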

A task may also enforce some rules before it is considered complete. Tasks are very diverse in size and nature, requiring from seconds to minutes to complete. Typical compensation ranges from one cent to less than a dollar per task and is usually correlated with task complexity.

Crowdsourcing is not a synonym for Mechanical Turk. Some of the terminology introduced by MTurk, like HIT or requester, has been adopted by other similar services. Competition in this area is growing rapidly, and there are a number of emerging companies that provide similar functionality or implement wrappers around MTurk. A feature comparison among the different services is out of scope for this paper; however, these services are all fairly similar at the conceptual level. The CrowdConf conference tracks the latest industrial activity.

Current commercial crowdsourcing platforms offer limited tools. This is to be expected in a new area, and much better functionality should become available over time. As with any emerging technology, developers have to build components to leverage work on demand in the best way possible. If an ad hoc process is used, the implementation can be expensive and very difficult to maintain in the long run.

We introduce our development framework in the context of continuous evaluation, that is, running repeatable tasks over long periods of time. This is very close to the typical industrial setting, where search quality evaluation and monitoring are ongoing activities.

The proposed structure is currently used in several IR evaluation experiments and is independent of the crowdsourcing platform of choice, the workforce, and the task type. At a high level, we describe a sequence of actions over time that can be seen as an incremental series of steps for running experiments, as presented in Fig. 1. Early stages of the method emphasize design; later stages focus more closely on work quality. We now describe each part in more detail.

Fig. 1 Stages in the development framework. After each phase there is a quality checkpoint q before moving on to the next phase or repeating the current one. We start with an emphasis on the design and, over time, the focus shifts to work quality

3.1 Prototype development

The very first step is to select a particular crowdsourcing platform, sign up and do some work for a period of time. The goal here is for the developer or experiment designer to get an idea of the kind of tasks that are available and to experience what an average worker has to do to complete work. As a worker, we can observe a number of things: some tasks do not require a lot of knowledge and/or training; some are ill-defined; the pay is low for the workload; the allocated time to complete the task is not sufficient; and so on. These are important considerations that can shape the overall result of an experiment, so it is important to do work for other requesters and take note of issues encountered. Familiarity with available functionality and hands-on experience can give the designer a good overview of what to expect from future workers and how to design the experiment.

The next step is to select a task like document assessment and prepare a small data set. The data set preparation is straightforward: choose a subset of a document collection like TREC or Wikipedia, select the topics, and for each topic, select a few documents. For the task, we can start with a high-level goal like “we would like to assess if a document is relevant to a given topic” and then describe the question(s) and flow in more detail. We can think of this step as the task specification.
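
To make this step concrete, the following sketch captures a pilot task specification and samples a small data set; the field names, label set, and sampling sizes are illustrative assumptions, not prescriptions.

```python
import random

# Hypothetical task specification for a small pilot run.
task_spec = {
    "goal": "Assess whether a document is relevant to a given topic",
    "question": "Is this document relevant to the topic shown above?",
    "labels": ["relevant", "not relevant", "I don't know"],
    "workers_per_assignment": 3,
}

def sample_pilot(docs_by_topic, topics_wanted=5, docs_per_topic=10, seed=42):
    """Pick a few topics and a few documents per topic for the prototype data set.

    docs_by_topic: dict topic_id -> list of document ids
    """
    rng = random.Random(seed)
    topics = rng.sample(sorted(docs_by_topic), min(topics_wanted, len(docs_by_topic)))
    return {t: rng.sample(docs_by_topic[t], min(docs_per_topic, len(docs_by_topic[t])))
            for t in topics}
```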

Next, we need to implement the task specification in the task description language of the selected crowdsourcing platform. Usually all platforms provide a web form-like mechanism to create tasks, so all that is needed is to take the specification document and create the appropriate version. Similar to traditional software development, the experiment designer implements the task and tests the functionality using the sample data set.

Once the task is ready, it is time to engage the first type of crowd: the internal team. The goal of this step is to test and debug the experiment with the experts who are most knowledgeable in the area. The internal team performs the work and the experiment designer takes notes on what is working and which parts of the experiment are difficult: for example, the experiment is hard to read, the respondent has trouble understanding some element (instructions, task, examples, a question), or the respondent has trouble answering a question. This exercise is very important, so it is desirable that all members of the team suggest improvements to the entire experience.

Following the completion of the test run, we should calculate inter-rater agreement among the internal team and go over the feedback. The designer needs to revise questions that cause difficulty and fix all other issues. If the changes are substantial, the same team needs to re-test the experiment to ensure that the new version has improved. This step should yield higher agreement and a consensus that the experiment satisfies the designer’s goals. It is important to note that, as in programming, more bugs can appear over time. As we will see later, worker feedback is a very useful tool for identifying issues in the task; therefore, having an optional open-ended question is usually valuable.

3.2 Early stage production

The goal of this phase is to test the experiment with a much bigger crowd and use the results for calibration purposes and other adjustments.

At the beginning we use the same experiment that was developed in the previous phase, with the same data set, and involve an external crowd. Typically, we can use a simple setup such as three unique workers per task and a minimum payment as the incentive. In this first run, the main goal is to gather data points to answer a few questions: how long does the task take to complete? Do workers understand the instructions? Do the results look as expected?

One way to measure the performance of both crowds (internal and external) is to compute inter-rater agreement, label distributions and other metrics that we are interested in. We examine the results by hand and look for outliers, performing an error analysis to identify cases in which the answers are wrong. In many cases, the problem is not that the workers are wrong, but rather that there are problems with the task itself. As in the previous step, the instructions may not be clear or the examples may not be representative.
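
A minimal sketch of this comparison, assuming both crowds label the same items with the same label set, is shown below; the label names are illustrative.

```python
from collections import Counter

def label_distribution(labels):
    """Relative frequency of each label for one crowd (internal or external)."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}

def split_decisions(votes_by_item):
    """Items on which the crowd has no clear majority; good candidates for error analysis."""
    flagged = []
    for item, votes in votes_by_item.items():
        top_count = Counter(votes).most_common(1)[0][1]
        if top_count <= len(votes) / 2:
            flagged.append(item)
    return flagged

internal = ["relevant", "relevant", "not relevant", "relevant"]
external = ["relevant", "not relevant", "not relevant", "relevant"]
print(label_distribution(internal), label_distribution(external))
```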

Small changes may affect the understanding of the task, so having a technical writer review the instructions and examples is usually a good idea at this stage. Those items should be easy to fix and to confirm through another run. If the task requires more effort than anticipated, or if workers mention via feedback that the incentive is low, the payment can be increased. The focus of this step is to make sure that the experiment can be completed without problems by people who are not experts.

3.3 Repeatable production

The previous two stages were helpful in the design, development, and debugging of the task. Now we are in production mode where the emphasis is on monitoring, work quality, making sure that tasks get done on time, and that attrition is low.

One mechanism that helps improve quality is to use qualification tests. A less invasive alternative is to create questions with known answers in advance (also known as honey pots) and make sure that workers answer these pre-defined questions correctly. Workers who tend to fail on the honey pots should be discarded and marked as not suitable for this task. It is also possible to combine qualification tests and honey pots. In Sect. 6 we discuss quality control in more detail.

So far, we have been testing and adjusting our task with small data sets (e.g., 50 documents, 100 query-url pairs, etc.). A small data set allows us to iterate faster and save money. However, in practice, much larger data sets are usually needed; requesters should be careful to scale data size and workers independently, focusing first on data size since workers are the scarcer resource. Producing a much bigger data set is usually straightforward, but finding a large pool of workers for our experiment may pose a problem. For example, if we tested 100 documents with 3 workers per assignment, the next step would be to increase the size to 1,000 documents while keeping the same 3 workers, and only then increase the number of workers. Early work by Snow et al. showed that 5 workers per assignment should be enough to achieve good performance on MTurk (Snow et al. 2008). However, certain tasks are more difficult than others, so more workers per assignment may be needed regardless of the crowdsourcing platform of choice.
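
The arithmetic behind this scaling is simple but worth writing down before launching a batch; in the sketch below the platform fee is an assumed placeholder, not a quoted rate.

```python
def batch_size_and_cost(num_items, workers_per_item, price_per_assignment, fee_rate=0.20):
    """Number of assignments and rough cost of a batch; fee_rate is an assumed platform commission."""
    assignments = num_items * workers_per_item
    cost = assignments * price_per_assignment * (1 + fee_rate)
    return assignments, round(cost, 2)

# 1,000 documents judged by 3 workers each at $0.02 per assignment
print(batch_size_and_cost(1000, 3, 0.02))   # -> (3000, 72.0)
```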

4 Operational considerations for experiments in production

When to schedule a task is very important if large data sets are involved. On any platform, at any given time, there are probably thousands of tasks being executed. When is the best time to schedule a task, and how large should it be in terms of data size and duration?

Similarly to a web search engine, where competing pages try to rank higher on the search results page, we would like to place our experiment on the first page of the crowdsourcing platform of choice. Initially, we can launch the first task, monitor it very carefully, and see how workers react. Using the dashboard (or whatever interactive mechanism is available), we can see after a few minutes whether workers have accepted work and submitted a few samples. This attention effect is extremely important, as it can be used to sense whether we will interest enough workers in our task. If the task fails to attract workers initially, then it is unlikely that it will finish. The problem may be that the incentive is insufficient, that the instructions are not clear, or that the task is boring, among other things. If workers are slow to accept the task, we should stop the experiment immediately and return to the design.
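
A sketch of this early monitoring loop is shown below; get_completed_count is a hypothetical callable wrapping whatever dashboard or API the chosen platform exposes, and the thresholds are illustrative.

```python
import time

def monitor_uptake(get_completed_count, window_minutes=30, min_completed=5, poll_seconds=60):
    """Poll a freshly launched batch; return False if early uptake is too weak to continue."""
    deadline = time.time() + window_minutes * 60
    while time.time() < deadline:
        if get_completed_count() >= min_completed:
            return True        # enough early interest: let the batch run
        time.sleep(poll_seconds)
    return False               # weak uptake: stop the experiment and revisit the design
```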

To showcase the importance of the first few assignments, Fig. 2 presents the completion time (submission time) for three versions of the same INEX relevance evaluation task. The first chart demonstrates that workers were not very interested in the task; the few completed assignments were submitted hours after the creation time. This lag in task uptake may be attributed to a low incentive coupled with poorly written instructions. The second chart shows an improvement: assignments were completed very quickly. In this second version, the incentive was increased and the instructions were improved. However, the second experiment did not have a qualification test, so when such quality control was introduced (see the third chart), assignments were completed more slowly.

Fig. 2 Completion time for the first batch for three different versions of the same INEX relevance task. The chart on top shows that there is almost no interest in the task. The chart in the middle shows a task that received a lot of attention; the first hundred or so assignments were completed in under 3 min. The chart on the bottom shows worker interest in the presence of a qualification test; note that completion was over an order of magnitude slower. An assignment specifies how many people can work on a HIT

Depending on the end goal, the experimenter must make trade-offs between work quality and completion time. There is a delicate balance between compensation and filtering with regard to the experiment’s duration: low compensation and/or a strict filtering procedure will drastically reduce the number of interested workers and hence significantly increase the experiment’s completion time.

One solution is to split long tasks and submit the smaller tasks in parallel. This approach has several advantages. First, the waiting time will decrease even though the total time spent on the tasks may not. Second, the overall time spent may also decrease because shorter tasks usually attract more workers. So in summary, as common sense suggests, it is better to have many small tasks in the system than one very large task.
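
A minimal sketch of the splitting step, assuming the work items are independent:

```python
def split_into_batches(items, batch_size):
    """Split one large unit of work into smaller batches that can be submitted in parallel."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

query_url_pairs = [f"pair-{i}" for i in range(1000)]   # placeholder work items
batches = split_into_batches(query_url_pairs, 50)
print(len(batches), "batches of up to 50 items each")  # -> 20 batches
```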

To schedule experiments effectively, it is better to submit shorter tasks first. This way, important implicit or explicit feedback from the experiment can be used to re-design larger experiments. This strategy is also helpful for debugging the experiment in the long run. Also, if something goes wrong, deleting tasks from a crowdsourcing platform is usually expensive in terms of both time and money. From a time perspective, the system needs to eliminate transactions based on the work that has been allocated. Furthermore, there are financial implications because the requester must pay for the portion of the work that has been completed.

When managing large data sets, it is useful to check if the same object needs to be re-labeled. For some applications, gathering fresh labels on duplicates may be useful. In others, re-assessing the same query-url pair is a waste of time and money.

Finally, there is a performance issue. Some tasks may take longer than expected or, in the worst case, may not attract workers at all. This might be an indication that a particular task is not crowdsourcable in its current form.

5 Experiment design

The quality of workers’ answers depends on the quality of the experiment. The design is the most important part of the experiment; no amount of statistical analysis can compensate for a poorly designed survey or questionnaire (Lohr 1999). A growing number of researchers use crowdsourcing for evaluation, but they do not provide enough details about how their experiments were conducted. This lack of details makes it impossible to assess the quality of the results. Experimentation must be repeatable.

In this section we present guidelines on how to ask questions so instructions are clearly understood. At first, this looks trivial but we have to remember that workers are not domain experts, so the task designer should provide clear instructions that state how to perform the given task. Unlike machines, human computers may answer the same question differently (Law and von Ahn 2011). In the context of relevance evaluation, workers’ preferences, cultural background, and knowledge play a major role in assessing quality. Also, we would like to retain workers across tasks so a first impression that we are serious and professional does matter. This step should not be taken lightly. Early work on the Cranfield paradigm contains a lot of evidence on the importance of the questions, scales, and labels (Cleverdon 1970; Harman 2011).

We can think of the task description and instructions as the source code of the experiment. It is fine to try different versions and measure which one gets better results or attracts more workers.

5.1 Asking questions

A key aspect of the experimental design process is how to ask the right questions in the best way possible. Most of the work on survey design stems from the social sciences and market research, so we need to adapt some procedures to our needs. But in general, the core principles are the same. Fowler (1995) presents standard guidelines for survey and questionnaire design.

Workers need to understand questions consistently. Furthermore, what constitutes a good answer should be communicated to them. A necessary step is ensuring that all workers have a shared, common understanding of the meaning of the question. We have to ask questions that workers are able to answer in the terms required by the question. A question should be as specific as possible and should use words that virtually all workers will understand. There is no need to use special terms unless all workers would be expected to know them or the term is explained in the question.

IR research often uses a Likert scale or something similar (e.g., perfect, excellent, good, fair, bad) to elicit answers. However, the scale is domain dependent, so it is desirable to use one that best fits the experiment. Cultural differences may make labels confusing; sometimes it is preferable to replace labels with a numeric scale.

Adding a separate category, “I do not know”, allows a worker to say that he or she does not have the background or experience to answer the question. In the case of relevance assessment, this option is useful since we are interested in an honest answer and not a guess. Obviously, the number of “I do not know” responses for a given experiment should be low.

Equally important is standardizing the responses across workers. That means clearly defining the dimension or continuum respondents are to use in their rating task and giving them a reasonable way to place themselves, or whatever else they are rating, on that continuum. Generally speaking, the more categories respondents are asked to use, the better. However, research has shown that most subjects cannot reliably distinguish among more than 6 or 7 levels of response. This can create undesirable cognitive overhead; more options increase the number of decisions a worker must make, and may influence how long it takes to complete the task (Fowler 1995).

5.2 Interface design

Workers interact with a website so it is important to use established usability techniques for presenting information in a user interface, such as those proposed by Nielsen (1994). Analyzing these user interface guidelines is out of the scope of this paper, but some basic design principles include:

  • Present clear and consistent instructions to the workers on what they have to do.

  • Show examples.

  • Use highlighting, bold, italics, typefaces, and color when needed to improve the content presentation. Relevance experiments require reading text so instructions and content have to be legible. Always make the text clear based on the size, typeface, and spacing of the characters used.

  • Minimize the effort to accomplish a task.

In TREC, relevance assessments are performed by hired editors; the instructions are very technical (Harman 2011; Voorhees and Harman 2005). If we look at one track like Ad-hoc, the original instructions are four pages long, making it somewhat difficult to present them in a web user interface.

Figure 3 shows instructions for assessing the relevance of a document to a given query using the Ad-hoc instructions as baseline. The scale is graded and the worker must justify the answer by entering a comment.

Fig. 3 Example of design for graded document relevance evaluation

5.3 Metadata and internationalization

In a crowdsourcing marketplace we compete with other requesters for workers. A clear title, description, and keywords allow potential workers to find experiments and preview the content before accepting tasks. It is important then to generate meaningful metadata so workers can search and identify a task in the platform. It is desirable to use a common set of keywords for all the experiments that use the same collection and then specific terms depending on a given run or data subset. Table 1 shows such examples for three different collections.

Table 1 Metadata for different types of experiments

When there is a need to perform relevance experiments in other languages, it is better to fine-tune the experiment in English first and then localize it to the target language. The first question is then: how do we know whether we have enough workers who know a given language? A way to answer that question is to design a simple experiment with some popular content, such as Wikipedia, with a short topic description in the target language.

In the particular case of MTurk, and compared to our previous experience running English relevance experiments, these tests are slower: a similar experiment that usually takes 24 hours to complete in English can take 5 days in another language (e.g., Spanish, French, German). One potential reason for this lag is that there may be fewer workers who are proficient in languages other than English. Another reason is the current limitation on payments from Amazon, which may affect potential workers (Alonso and Baeza-Yates 2011). Different platforms use different payment systems, which may influence both task completion speed and worker availability.

6 Quality control

Managing the quality of the work and overall worker performance is difficult. In several cases, the problem is that workers may not have the expertise required to perform the task.

A crowdsourcing platform should provide some level of spam detection and basic worker performance statistics. While these are useful at the beginning, they may not be sufficient to enforce good work.

In general, if a gold standard exists, it is possible to compare the performance of workers against that of experts on that set. For cases where no gold standard is available, worker aggregation and majority voting should work. We can also assess worker noise by using a tiered approach to quality.

When should we enforce quality? Quality control is an ongoing activity, so the answer is always: before we start the main tasks, by recruiting and qualifying the right workers; while workers are executing the task; and after the task is completed, by computing accuracy and correctness. Payment should be made after verifying work quality.

In this section we outline mechanisms that have proven useful for managing quality control. Some of the techniques are platform independent and can be used in different scenarios. In other cases, specific features are available as part of the service. That said, the list is not exhaustive, since much crowdsourcing research focuses on work quality.

6.1 Worker qualification

How do we qualify a worker? This recruiting step is usually well supported by most of the platforms.

A possible filter for selecting good workers is to use the approval rate. The approval rate is a metric provided by MTurk that measures the overall rating of each worker in the system. Amazon defines the metric as the percentage of assignments the worker has submitted that were subsequently approved by the requester, over all assignments the worker has submitted. However, using very high approval rates decreases the worker population available and may increase the time necessary to complete the evaluation. Recently Amazon introduced the notion of master worker. Masters are a group of workers who have demonstrated superior performance while completing thousands of HITs.

It is possible to control work quality by using qualification tests. A qualification test is a set of questions (like a HIT) that the worker must answer in order to qualify and thus work on the assignments. After seeing the preview, workers can choose to accept the task; if a qualification test is attached, they must pass it before being officially assigned work.

A qualification test is a much better quality filter but also involves more development cycles. In the case of relevance evaluation, it is somewhat difficult to test “relevance” directly. What we propose is to generate questions about the topics so that workers can become familiar with the content before performing the tasks, even if they search online for a particular answer. A suggestion is to use a qualification test with ten multiple-choice questions worth 10 points each and an 80% passing grade.
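
A sketch of how such a test could be scored on the requester side; the question identifiers and answer key below are hypothetical.

```python
def score_qualification(responses, answer_key, points_per_question=10, passing_score=80):
    """Score a ten-question multiple-choice qualification test with an 80% passing grade.

    responses:  dict question_id -> option chosen by the worker
    answer_key: dict question_id -> correct option
    """
    score = sum(points_per_question
                for question, correct in answer_key.items()
                if responses.get(question) == correct)
    return score, score >= passing_score

answer_key = {f"q{i}": "a" for i in range(1, 11)}   # hypothetical key
responses = dict(answer_key, q3="b", q7="c")        # worker misses two questions
print(score_qualification(responses, answer_key))   # -> (80, True)
```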

It is possible, however, for a worker to pass the test and then be lazy for the rest of the experiment. A way to detect this potential spammer is to see how frequently the worker disagrees with the majority. A drawback of using qualification tests is that workers may not feel like performing work that requires a test, and tasks that require a qualification test also take longer to complete. Finally, there is a hidden cost: the test has to be developed and maintained over time.

Another alternative is to ask workers to choose tasks for which they have expertise, interest and confidence. Preliminary results using this approach are reported in Law et al. (2011).

6.2 Work quality

Instead of using qualification tests, a requester may interleave assignments for which the correct answer is already known, making it easy to detect workers who select answers at random. This technique, already mentioned above and known as honey pots, is useful for checking whether workers are paying attention to the task. For example, when testing a topic one could include a document or web page that is completely unrelated to the question and expect the worker to answer accordingly. If a worker fails these checks, it is an indication that they are not following the instructions or are simply spamming. There is a clear advantage to using honey pots instead of qualification tests: tasks are completed faster, and we can identify poorly performing workers very quickly.
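
A minimal sketch of the honey-pot bookkeeping, assuming each answer is recorded as a (worker, item, label) tuple and that the accuracy threshold is a design choice rather than a fixed rule:

```python
def honey_pot_accuracy(answers, gold):
    """Per-worker accuracy on the items whose answers are known in advance.

    answers: iterable of (worker_id, item_id, label)
    gold:    dict item_id -> correct label (the honey pots)
    """
    correct, total = {}, {}
    for worker, item, label in answers:
        if item in gold:
            total[worker] = total.get(worker, 0) + 1
            correct[worker] = correct.get(worker, 0) + (label == gold[item])
    return {w: correct.get(w, 0) / total[w] for w in total}

def workers_to_review(answers, gold, threshold=0.7):
    """Workers below the threshold are candidates for exclusion from this task."""
    return [w for w, acc in honey_pot_accuracy(answers, gold).items() if acc < threshold]
```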

A similar approach is to use variations of the Captcha idea, for example asking a worker for the title of the web page or the number of images in the document being assessed (Kittur et al. 2008). In other words, make the task difficult only for workers who are interested in cheating. A variation called Kapcha, which forces a slowdown in responses, is presented in Kapelner and Chandler (2010).

Popular approaches among requesters rely on redundancy, such as an odd number of workers per item in conjunction with majority voting to identify correct answers. Sheng et al. (2008) introduced the repeated labeling strategy, which consists of increasing the number of labels gathered for those answers that are noisy. This strategy, called “get another label”, works well in practice. However, redundancy is not a perfect solution because it may increase the cost of crowdsourcing, making it comparable to hired editors. Ipeirotis et al. developed a technique based on expectation maximization that assigns a score to each worker, corresponding to the quality of the assigned labels (Ipeirotis et al. 2010).
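
The sketch below aggregates redundant labels by majority vote, with optional per-worker weights (for instance, accuracy on honey pots); it is a simple baseline, not a re-implementation of the expectation-maximization approach cited above, and the worker and item identifiers are made up.

```python
from collections import Counter

def aggregate_labels(votes_by_item, worker_weights=None):
    """Weighted majority vote per item; returns the winning label and its vote share."""
    results = {}
    for item, votes in votes_by_item.items():
        tally = Counter()
        for worker, label in votes:
            tally[label] += (worker_weights or {}).get(worker, 1.0)
        label, weight = tally.most_common(1)[0]
        results[item] = (label, weight / sum(tally.values()))
    return results

votes = {
    "doc-17": [("w1", "relevant"), ("w2", "relevant"), ("w3", "not relevant")],
    "doc-42": [("w1", "not relevant"), ("w2", "not relevant"), ("w3", "not relevant")],
}
print(aggregate_labels(votes))
```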

A weighted majority vote that leverages a group decision by multiple workers is introduced in Kern et al. (2010). A technique based on trusting high-quality workers and bypassing the voting procedure is presented in Law et al. (2011). The idea of “bypassing” is a very promising approach but requires work history; continuous evaluation can definitely benefit from this technique. That said, an occasional random honey pot does not hurt for maintaining trust when bypassing is used.

Another alternative is to use workers to check for correctness. Examples are the “find-fix-verify” pattern (Bernstein et al. 2010) and the workflow that checks the accuracy of translations from Urdu to English (Bloodgood and Callison-Burch 2010; Zaidan and Callison-Burch 2011). In practice, this pattern is very useful and can be easily adapted to different crowds and platforms. In our specific case, we have used MTurk for the find-fix part and the internal tool for verification.

Most platforms provide features for blocking workers from experiments and rejecting individual assignments. That said, it is easy to blame workers for incorrect answers when, in fact, the interface was confusing or the instructions were unclear. Rejecting work can create attrition problems and hurt the requester’s reputation (see the feedback analysis in Sect. 7.3). A possible solution for borderline workers (not obvious robots) is to pay the minimum and exclude them from future work on the specific task in question. Another incentive is to pay a bonus on top of the minimum to workers who perform really good work.

Poorly designed tasks may be at the root of bad results, rather than worker malice.

6.3 Agreement metrics

It is difficult to obtain agreement when assessing relevance. Relevance is a multi-dimensional concept, so disagreements among raters are to be expected. The following are the main statistics that researchers and practitioners have been using to describe inter-rater reliability; the survey by Artstein and Poesio (2008) covers each of them, and their variations, in more detail.

  • Percentage agreement. This method counts the cases that received the same rating from two judges and divides that number by the total number of cases rated by the two judges. An advantage is that it is easy to compute and intuitive. The main disadvantage is that it can yield inflated numbers if most values fall under a single category of the rating scale.

  • Cohen’s kappa (κ). This statistic estimates the degree of consensus between two judges by correcting for the agreement that would be expected if they were rating by chance alone.

  • Fleiss’ kappa. This is a generalization of Cohen’s kappa to n raters instead of just two.

  • Krippendorff’s alpha. The coefficient α is calculated by looking at the overall distribution of judgments, regardless of which judge produced them (Krippendorff 2004).

What does all of the above mean, and which measure should we use? In our experience, we suggest using two: percentage agreement as a quick way to discard a data set if agreement is very low, and kappa to look at agreement in more detail. Note that measuring agreement on each data set is part of the quality checkpoint presented in Fig. 1. Monitoring the chosen inter-rater agreement metric over time is also part of overall quality.
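
A minimal sketch of these two checks for a pair of assessors labeling the same items; the labels below are made up for illustration.

```python
from collections import Counter

def percentage_agreement(rater_a, rater_b):
    """Fraction of items on which two raters chose the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_o = percentage_agreement(rater_a, rater_b)
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in set(rater_a) | set(rater_b))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

a = ["relevant", "relevant", "not", "not", "relevant", "not", "relevant", "not"]
b = ["relevant", "not", "not", "not", "relevant", "not", "relevant", "relevant"]
print(percentage_agreement(a, b), round(cohens_kappa(a, b), 3))   # -> 0.75 0.5
```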

As many researchers point out, κ is a very conservative measure of agreement, arguably appropriate for biomedical and educational research but not for content analysis (Krippendorff 2004), and its meaning is difficult to interpret. Nevertheless, it is the most popular agreement measure and is supported by most modern statistical software. The practical recommendation is to use the inter-rater statistic that best fits the task and the designer’s goals. The agreement measures described above are not the only ones available in the literature, but they are common among behavioral researchers.

In some cases there are borderline items for which it is not clear whether there is a majority within the external crowd. A solution is to involve workers with more expertise, or internal people, as tie breakers.

6.4 Requester quality

If a requester is going to be very strict about worker qualification and overall performance, we should expect workers to be equally demanding of requesters. Unfortunately, there is little infrastructure available for this, except for a few early prototypes and websites that track bad requesters. We should treat our workers respectfully and professionally: pay on time, pay what we think the task is worth rather than the minimum, and provide clear instructions and content so that they can complete the task in the best way possible. While it is possible that some people may try to game the system, we believe that, in general, workers act in good faith.

7 Content aspects

In this section we examine content characteristics such as presentation, difficulty, and worker feedback on both the task and the content. Demographic studies mention entertainment or enjoyment as motivations for working on tasks, so it makes sense to design interesting tasks with high-quality content. Marshall and Shipman (2011) take a similar approach by constructing a realistic set of scenarios for conducting user studies using crowdsourcing.

7.1 Presentation and readability

One important factor is the effect of the user interface on the quality of relevance assessments. To study this, we compared two different interfaces: one helped users by highlighting the query terms, and the other simply showed the plain text. In this short experiment we did not use a qualification test. The data preparation consisted of taking the original document and producing two versions of the same content: plain and highlighted. The plain version contains no formatting and has the visual effect of a continuous line of text. The highlighted version contains the topic title (up to 3 terms) highlighted in black on a yellow background. The plain or highlighted version was presented to the worker at random to test whether presentation and readability affect relevance assessment.
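
A sketch of how the two conditions can be generated from the same source text; the styling and term matching below are simplifications of what was actually shown to workers.

```python
import html
import re

def highlighted_version(document_text, topic_title):
    """Wrap each topic term in a highlighted span; the plain version is just the escaped text."""
    marked = html.escape(document_text)
    for term in topic_title.split():
        marked = re.sub(
            re.escape(term),
            lambda m: f'<span style="background:yellow">{m.group(0)}</span>',
            marked,
            flags=re.IGNORECASE,
        )
    return marked

print(highlighted_version("Sugar exports from Cuba rose sharply last year.", "Cuba sugar exports"))
```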

Figure 4 shows the results of this experiment for a set of relevant documents. For each document we show the TREC vote (1 = relevant) and the number of votes for the highlighted and plain versions. As we can see, except for two cases, highlighted versions of a document were perceived to be more relevant than plain versions. The number of documents in this particular experiment is not large; however, the results indicate that the presence of highlighted keywords impacts assessment. This is similar to the finding that generalist workers may rely on good document presentation when assessing relevance (Kinney et al. 2008).

Fig. 4 Relevance votes on highlighted versus plain versions of the same documents

7.2 Topic difficulty

In some cases workers may find that a particular topic or data set requires more expertise to answer correctly. In other words, certain tasks are more difficult than others, and workers may therefore have problems providing answers. To test this, we prepared an experiment using a TREC data set that required workers to rate each topic on a scale of 1–5 (1 = easy, 5 = very difficult).

Figure 5 shows topics sorted by the average rating given by workers (5 workers per assignment). The x-axis shows the topic labels and the y-axis the difficulty scale. Workers found topics like airport security, inventions, and tourism easy. Other themes, like Greek philosophy and foreign minorities, usually require more background, so the increase in perceived difficulty is expected. This information can be used to weight the relevance scores according to how confident workers are with respect to a particular topic.
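
The ranking in Fig. 5 is simply an average of the per-worker ratings; a minimal sketch with made-up numbers:

```python
from statistics import mean

def topics_by_difficulty(ratings_by_topic):
    """Average the 1-5 difficulty ratings per topic and sort from easiest to hardest."""
    return sorted(((round(mean(r), 2), topic) for topic, r in ratings_by_topic.items()))

ratings = {                      # hypothetical ratings from 5 workers per topic
    "airport security": [1, 2, 1, 1, 2],
    "Greek philosophy": [4, 3, 5, 4, 4],
    "tourism": [2, 1, 2, 2, 1],
}
print(topics_by_difficulty(ratings))
```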

Fig. 5 Topics arranged by workers in increasing level of difficulty

7.3 Feedback analysis

One way to request feedback is to ask an optional open-ended question at the end of the task, often a justification for the answer provided. Our interest in exploiting this feature is that we can learn, at no extra cost, why a document is relevant or not. It is important to note that none of the TREC assessments contain any information about why a particular document was judged relevant or not.

Overall, in our experience, fewer than 30% of workers provided a justification across all experiments. The average comment length varies depending on the task and the incentive. If a bonus is offered for a comment, more workers may provide a justification. That said, requesters should not make comments mandatory, as workers may not feel like writing meaningful feedback. Our recommendation is to keep comments optional even if a bonus payment is offered.

User feedback fell into three main categories: justification, operational, and general communication.

Justification In the case of binary relevance evaluation, the worker has to answer relevant or not relevant, or skip the document and look at the next one. Looking at some of the feedback, we can observe that, in certain cases, a binary scale may not be suitable and a graded scale should be used instead. Examples that reflect this intuition:

  • Highly relevant: focuses on drug routes and mentions the Golden Triangle in particular.

  • About a creative endeavor but not specifically about creativity itself.

It is also interesting to look at the way workers agree on whether a document is relevant or not. Patterns like “about [topic]” or “this document discusses …” are clear indications of relevance, whereas “no mention of [topic]” or “about [other topic]” indicate the opposite.

For the topic “behavioral genetics”, the following are examples of how people justify that a document is not relevant.

  • This article deals with environmental stress factors in the workplace.

  • Nothing about genetics.

  • This document discusses workplace stress, not behavioral genetics.

For the topic “Cuba sugar exports”, the following are examples of how people justify that a document is relevant.

  • It’s about Cuba exporting sugar to China.

  • The document describes not only sugar exports but also other industries like steel. It should be better to present more information of sugar exports

  • This document is relevant because it provides information regarding Cuba’s sugar trade with China.

Operational This feedback involves workers mentioning issues with the live experiment, for example comments about browser issues, non-English content, a broken link, or a server being down. As anecdotal evidence, one of our experiments contained a broken link; a number of workers mentioned that the site was down, while one marked the document as “relevant”. The latter was obviously a spammer or robot, so we kept the broken link as a sanity check that workers were paying attention to the task. However, as a good practice, we do not suggest including broken links deliberately.

General communication Feedback in this category is usually about a positive (or negative) experience doing the experiment. Workers tend to work for highly reputable requesters, so negative concerns must be resolved promptly. Good customer service will help build a reputation over time.

7.4 Other considerations

When data sets are very large, it is possible, even with special filters, to show adult or inappropriate content by accident. A warning note in the instructions that such content may appear is also part of good design.

Another behavior to watch for is the exposure effect (“familiarity leads to liking”): repeated exposure is sufficient to enhance attitude toward a stimulus, for example when workers are asked to assess the same data set (queries or documents) over and over again.

Topic randomization is also useful for avoiding worker fatigue. Most relevance tasks involve a lot of reading, so changing the topics and making the effort to design tasks with interesting content can be very helpful. As more training-set construction becomes crowd-based, understanding human errors is also very important (Carterette and Soboroff 2011). A final comment on content is that all information presented in a crowdsourcing experiment is available on the web; therefore, experiments should not involve sensitive information.

8 Concluding remarks and outlook

Crowdsourcing offers flexibility for designing and implementing different types of information retrieval experiments. Full control over the experimental design and the quality of the results is an important advantage, and we believe crowdsourcing should be a core part of the system development process.

There is evidence from previous work and our own experience that crowdsourcing is effective for relevance-based experimentation. That being said, experiments have to be designed carefully to achieve good results. In this paper we present a development framework to help developers use crowdsourcing for different information retrieval experiments. We also discuss factors that influence the overall quality of a crowdsourcing task.

Quality control is an important part of the experiment. It should be applied across all aspects of experiment design, not just limited to choosing workers and vetting their work. For example, seemingly simple factors such as the quality of the instructions and the presentation of documents have a significant impact on an experiment’s results. The user interface should make workers’ tasks easier, not more difficult. As workers go through the experiments, a diversity of topics in a single run can help avoid stalling the experiment due to lack of interest. While it is always possible to detect bad workers, the requester may, at the same time, be acquiring a bad reputation among workers.

Adding a feedback loop by using an open-ended question has been very valuable. It makes it possible to detect potential errors in the data set and operational issues, and to learn how workers justify different types of answers. We also found the feedback useful for fine-tuning the experiment template to keep improving the user experience.

Besides the monetary incentive, workers are motivated to work if the task is fun, interesting, or educational. It makes sense to spend effort on providing high quality content and design experiments that are sleek.

In terms of information retrieval evaluation, crowdsourcing provides new opportunities to try new ways of collecting labels and running experiments. The TREC crowdsourcing track (Lease and Kazai 2011) is an excellent example of how much interest this topic has generated. The most important open research questions are: which tasks are suitable for crowdsourcing, and what is the best way to perform crowdsourcing experimentation? We have only just started exploring this area, and more exciting research work lies ahead.