A game-based framework for crowdsourced data labeling

  • Regular Paper
The VLDB Journal

Abstract

Data labeling, which assigns data to one of multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that substantially reduce the labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) and plays the role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) and plays the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject rules with low accuracy. This paper studies three challenges in CrowdGame. The first is to balance the trade-off between coverage and accuracy. We define the loss of a rule by considering the two factors. The second is rule accuracy estimation. We utilize Bayesian estimation to combine both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks that fulfill the game-based framework and minimize the loss. We introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.

Notes

  1. Note that the matching criterion is the same product model and the same manufacturer, without considering specifications like color and storage.

  2. https://code.google.com/p/word2vec/

  3. http://research.signalmedia.co/newsir16/signal-dataset.html

  4. http://wiki.dbpedia.org/

  5. See how \(\mathtt{Snorkel} \) uses crowdsourcing in Section 7.4.

Acknowledgements

This work was supported by NSF of China (61632016, 61925205, U1711261, 61832017, 61972401, 61932001), Huawei, TAL education, Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, Research Funds of Renmin University of China (18XNLG18, 18XNLG21), and the Fundamental Research Funds for the Central Universities.

Author information

Corresponding author

Correspondence to Ju Fan.

Appendices

Proofs

1.1 Proof of Theorem 1

We can prove NP-hardness of the problem by a reduction from the k maximum coverage (KMC) problem, which is known to be NP-hard.

Recall that an instance of the KMC problem (E,\(\mathcal {S}\),k) consists of a universe of elements E = {\(s_1,s_2,\cdots ,s_n\)}, a collection of subsets of the universe E, i.e., \(\mathcal {S}\) = {\(S_1,S_2,\cdots ,S_m\)} where any \(S_i \in \mathcal {S}\) satisfies \(S_i \subseteq E\), and a number k. The objective is to select k subsets from \(\mathcal {S}\), denoted by \(\mathcal {S}^\prime \), so that the number of covered elements \(\left| \bigcup _{S\in \mathcal {S}^\prime } S \right| \) is maximized.

An instance of our problem consists of a set of tuples \(\mathcal {E}\), a set of rules \(\mathcal {R}\), and a number b. The optimization objective is to select b rules from \(\mathcal {R}\) so that the expected rule selection criterion, according to Eq. 6, is maximized.

The reduction from KMC to our problem. We show next that for any instance (E,\(\mathcal {S}\),k) of KMC, we can create a corresponding instance of our problem based on (E,\(\mathcal {S}\),k) in polynomial time.

  • We translate the set E of elements into the set \(\mathcal {E}=\{e_1,e_2,\cdots ,e_n\}\) of tuples in our problem.

  • Given an element \(s_j\) in E, if \(s_j \in S_i\), we add the tuple \(e_j\) to the tuples covered by rule \(r_i\), whose accuracy and validation probability are both 1. We set the parameter \(\gamma \) to 0.5. The gain of objective \(\mathcal {J} ^{r_i}\) then evaluates to 1 if \(s_j \in S_i\), and 0 otherwise. Thus, each set \(S_i\) in the KMC problem corresponds to the rule \(r_i\) and the elements covered by \(S_i\) correspond to the tuples covered by \(r_i\).

  • We translate number k in KMC into b in our problem.

Equivalence of optimization objectives. We show the optimization objectives of the two problems are equivalent:

  • Since in our instance the probability that an individual rule r passes the validation is 1, the validated rule set \(\mathcal {R} ^\mathtt{\surd }\) is equivalent to the selected rule set \(\mathcal {R}^{(t)}_q\), and \(P(\mathcal {R} ^\mathtt{\surd })=P(\mathcal {R}^{(t)}_q)=1\).

  • Since in our instance the rule accuracy is 1 and the parameter \(\gamma \) is 0.5, based on Equation 4, we know that \(\mathcal {J} ^{\mathcal {R}^{(t)}_q}=\left| \mathcal {C}(\mathcal {R}^{(t)}_q) \right| \).

With \(\mathcal {R} ^\mathtt{\surd }=\emptyset \), the expected rule selection criterion therefore becomes \(\left| \mathcal {C}(\mathcal {R}^{(t)}_q) \right| \). Since our problem is to find the b best rules \(\mathcal {R}^{(t)}_q\) that maximize the expected criterion, it is equivalent to finding the b sets that maximize the number of covered elements, which completes the reduction.
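The reduction above can be sketched on a toy instance. The code below is only an illustration under stated assumptions: the per-tuple gain is taken to be the maximum accuracy among selected rules covering the tuple minus \(\varGamma = \frac{1-2\gamma }{1-\gamma }\) (the form that appears in the proof of Lemma 1), and the function names, rule names, and the example sets are hypothetical. Because every rule is validated with probability 1, the expectation over validation outcomes is trivial and is omitted.

```python
from itertools import combinations

def kmc_to_rule_instance(subsets, k, gamma=0.5):
    """Map a KMC instance to a rule selection instance (hypothetical encoding).

    Each subset S_i becomes a rule r_i covering the same elements, with
    accuracy 1 and validation probability 1; k becomes the budget b.
    """
    rules = {
        f"r{i + 1}": {"covers": set(s), "accuracy": 1.0, "p_valid": 1.0}
        for i, s in enumerate(subsets)
    }
    return rules, k, gamma

def rule_objective(selected, rules, gamma):
    """Assumed objective: sum over covered tuples of (max accuracy - Gamma)."""
    big_gamma = (1 - 2 * gamma) / (1 - gamma)  # equals 0 when gamma = 0.5
    best = {}
    for name in selected:
        for e in rules[name]["covers"]:
            best[e] = max(best.get(e, 0.0), rules[name]["accuracy"])
    return sum(acc - big_gamma for acc in best.values())

def best_rule_selection(rules, b, gamma):
    """Brute-force the best b rules (exponential; for tiny instances only)."""
    return max(rule_objective(c, rules, gamma)
               for c in combinations(sorted(rules), b))

def best_kmc(subsets, k):
    """Brute-force the maximum number of elements covered by k subsets."""
    return max(len(set().union(*c)) for c in combinations(subsets, k))

# Toy instance (made up for illustration).
subsets = [{1, 2}, {2, 3}, {3, 4, 5}, {1, 5}]
rules, b, gamma = kmc_to_rule_instance(subsets, k=2)
kmc_value = best_kmc(subsets, 2)
rule_value = best_rule_selection(rules, b, gamma)
```

On this instance both brute-force searches return the same optimum, which is what the equivalence of objectives predicts for accuracy 1 and \(\gamma = 0.5\).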

1.2 Proof of Lemma 1

Consider two rule sets \(\mathcal {R} _{1} \subseteq \mathcal {R} _{2}\); we first prove the monotonicity as follows. For simplicity, we use \(\varGamma \) to denote \(\frac{1-2\gamma }{1-\gamma }\) in this proof:

$$\begin{aligned}&\varDelta g (\mathcal {R} _{2}|\mathcal {J}) - \varDelta g ( \mathcal {R} _{1}| \mathcal {J}) = \sum _{\mathcal {R} ^\mathtt{\surd }_{2}}P(\mathcal {R} ^\mathtt{\surd }_{2})\sum _{e _{i}}{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }}\\&\quad - \sum _{\mathcal {R} ^\mathtt{\surd }_{1}}P(\mathcal {R} ^\mathtt{\surd }_{1})\sum _{e _{i}}{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }} \end{aligned}$$

Since \(\mathcal {R} _{1} \subseteq \mathcal {R} _{2}\), for simplicity, we introduce \(\mathcal {R} _3 = \mathcal {R} _2-\mathcal {R} _1\). Then, for any \(\mathcal {R} ^\mathtt{\surd }_{2}\), we can find a \(\mathcal {R} ^\mathtt{\surd }_{1}\) and \(\mathcal {R} ^\mathtt{\surd }_{3}\) such that \(\mathcal {R} ^\mathtt{\surd }_{2}=\mathcal {R} ^\mathtt{\surd }_{1} \cup \mathcal {R} ^\mathtt{\surd }_{3}\). Based on this, we have

$$\begin{aligned}&\varDelta g (\mathcal {R} _{2}|\mathcal {J}) - \varDelta g ( \mathcal {R} _{1}| \mathcal {J})=\sum _{\mathcal {R} ^\mathtt{\surd }_{1}}P(\mathcal {R} ^\mathtt{\surd }_{1})\big [\ \sum _{\mathcal {R} ^\mathtt{\surd }_{3}}P(\mathcal {R} ^\mathtt{\surd }_{3}) \\&\big ( \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1} \cup \mathcal {R} ^\mathtt{\surd }_{3})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} - \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} \big ) \big ]. \end{aligned}$$

It is easy to see that \(\sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1} \cup \mathcal {R} ^\mathtt{\surd }_{3})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} - \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} \ge 0\), which proves monotonicity.

We next prove that \(\varDelta g ( \mathcal {R} | \mathcal {J})\) is submodular. Given any rule \(r \), using the previous equation, we have

$$\begin{aligned}&\varDelta g (\mathcal {R} \cup \{r \}|\mathcal {J}) - \varDelta g ( \mathcal {R} | \mathcal {J}) = \sum _{\mathcal {R} ^\mathtt{\surd }}P(\mathcal {R} ^\mathtt{\surd }) P(r^\mathtt{\surd }) \\&\big ( \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }\cup \{r^\mathtt{\surd } \})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} - \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd })}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} \big ) \\&= -\varGamma + \sum _{\mathcal {R} ^\mathtt{\surd }}P(\mathcal {R} ^\mathtt{\surd })P(r^\mathtt{\surd }) \big ( \sum _{e _{i}\in \mathcal {C}(r^\mathtt{\surd })-\mathcal {C}(\mathcal {R} ^\mathtt{\surd })}{\hat{\lambda }} \\&+ \sum _{e _{i}\in \mathcal {C}(r^\mathtt{\surd }) \cap \mathcal {C}(\mathcal {R} ^\mathtt{\surd })}\max \{\hat{\lambda }-\max {\varLambda ^{\mathcal {R} ^{(i)}}},0\} \big ) \end{aligned} \tag{13}$$

From the equation, we can see that the above margin depends on the following two factors in each case, weighted by \(P(\mathcal {R} ^\mathtt{\surd })P(r^\mathtt{\surd })\):

  • Improvement on “additional” tuples covered by \(r^\mathtt{\surd } \), i.e., \(\sum _{e _{i}\in \mathcal {C}(r^\mathtt{\surd })-\mathcal {C}(\mathcal {R} ^\mathtt{\surd })}{\hat{\lambda }}\).

  • Improvement on the tuples already covered by \(\mathcal {R} ^\mathtt{\surd }\).

Now, let us consider two rule sets \(\mathcal {R} _{1} \subseteq \mathcal {R} _{2}\). It is not difficult to see that neither of the above factors for \(\mathcal {R} _{2}\) is greater than the corresponding factor for \(\mathcal {R} _{1}\). Thus, we have \(\varDelta g (\mathcal {R} _{1}\cup \{r \}|\mathcal {J})-\varDelta g (\mathcal {R} _{1}|\mathcal {J}) \ge \varDelta g (\mathcal {R} _{2}\cup \{r \}|\mathcal {J})-\varDelta g (\mathcal {R} _{2}|\mathcal {J})\), which proves the submodularity. Hence, we prove the lemma.
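As a sanity check of the lemma, monotonicity and submodularity can be verified by brute force on a small instance. The sketch below assumes a simplified form of the criterion read off the proof: each rule passes validation independently with a given probability, and a validated rule set contributes, for each covered tuple, the maximum accuracy among its covering rules minus \(\varGamma \). The rule names, coverage sets, accuracies, and probabilities are made up, and the check covers only this one instance with \(\hat{\lambda } _{j} \ge \varGamma \); it is not a proof.

```python
from itertools import combinations

def powerset(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for k in range(len(items) + 1):
        for combo in combinations(items, k):
            yield frozenset(combo)

def covered_gain(validated, coverage, acc, big_gamma):
    """Sum over covered tuples of (best accuracy among covering rules - Gamma)."""
    best = {}
    for r in validated:
        for e in coverage[r]:
            best[e] = max(best.get(e, 0.0), acc[r])
    return sum(a - big_gamma for a in best.values())

def expected_gain(rules, coverage, acc, p_valid, gamma):
    """Expectation of covered_gain over independent validation outcomes."""
    big_gamma = (1 - 2 * gamma) / (1 - gamma)
    total = 0.0
    for validated in powerset(rules):
        prob = 1.0
        for r in rules:
            prob *= p_valid[r] if r in validated else 1 - p_valid[r]
        total += prob * covered_gain(validated, coverage, acc, big_gamma)
    return total

def check_lemma(all_rules, coverage, acc, p_valid, gamma, eps=1e-12):
    """Return (monotone, submodular) for the assumed criterion."""
    subsets = list(powerset(all_rules))
    g = {s: expected_gain(s, coverage, acc, p_valid, gamma) for s in subsets}
    monotone = all(g[a] <= g[b] + eps
                   for a in subsets for b in subsets if a <= b)
    submodular = all(
        g[a | {r}] - g[a] >= g[b | {r}] - g[b] - eps
        for a in subsets for b in subsets if a <= b
        for r in all_rules if r not in b
    )
    return monotone, submodular

# Hypothetical instance: three rules over five tuples.
coverage = {"r1": {1, 2}, "r2": {2, 3}, "r3": {3, 4, 5}}
acc = {"r1": 0.9, "r2": 0.7, "r3": 0.8}
p_valid = {"r1": 0.6, "r2": 0.5, "r3": 0.9}
monotone, submodular = check_lemma(["r1", "r2", "r3"], coverage, acc, p_valid, gamma=0.5)
```

The enumeration is exponential in the number of rules, so it is only meant for checking intuition on toy inputs.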

1.3 Proof of Theorem 2

Fig. 16 Illustration of the proof of Theorem 2

To prove Theorem 2, let us consider a special case of the RuleRef task selection problem, as shown in Fig. 16. Each rule has the same accuracy \(\hat{\lambda } _{j}=\lambda \), and each tuple has the same refute probability \(P(e ^\mathtt{\times }_{i})=1.0\). Moreover, we consider the “strict” refuting strategy used in Example 2: one counterexample is enough to refute all rules covering the tuple. We also set the weight \(\gamma =0.5\). In this case, refuting a tuple, say \(e _{1}\), removes all the rules covering the tuple, say \(\{r _{1}, r _{2}, r _{3}\}\). However, the removed rules do not induce any impact as defined in Section 4.2, because the tuples covered by \(\{r _{1}, r _{2}, r _{3}\}\) are still covered by other un-refuted rules, and thus the maximum accuracy associated with these tuples is still \(\lambda \). Suppose instead that we refute \(e _{5}\); then we obtain an impact of \(\lambda \), as the maximum rule accuracy associated with \(e _{6}\) becomes 0. Based on these examples, it is not difficult to see that this special case of the RuleRef task selection problem is equivalent to the following maximum isolated node problem:

Definition 7

(Maximum Isolated Node Problem) Given a bipartite graph over a rule node set \(\mathcal {R} \) and a tuple node set \(\mathcal {E} \), consider the following removal conditions: (1) If a tuple node is removed, then all the rule nodes connected to the tuple node as well as the edges associated with the rule nodes are removed; (2) a tuple node is called “isolated node” iff there is no edge associated with the tuple node. The problem finds k tuple nodes \(\mathcal {E} ^{\prime } \subseteq \mathcal {E} \) such that the number of isolated nodes after the removal is maximized.

For example, in Fig. 16, after removing \(\{e _{2}, e _{3}\}\), there are no isolated tuple nodes. In contrast, after removing \(\{e _{1}, e _{2}\}\), \(e _{3}\) and \(e _{4}\) become isolated tuple nodes.

We can prove that the maximum isolated node problem is NP-hard by a reduction from the minimum vertex cover (MVC) problem, which is known to be NP-hard. Recall that an instance of the MVC problem consists of a graph \(G^\prime =(V,E)\) with vertex set V and edge set E. The problem aims to find the minimum vertex subset \(V^{\prime } \subseteq V\) such that every edge \(e \in E\) has at least one endpoint in \(V^{\prime }\).

Next, we show the reduction from the MVC problem to our maximum isolated node problem. Given any instance of the MVC problem \(G^\prime =(V,E)\), we create a tuple node set \(\mathcal {E} \), each node of which corresponds to a vertex in V, and a rule node set \(\mathcal {R} \), each node of which corresponds to an edge in E. Each rule node is connected to the two tuple nodes that correspond to the endpoints of its edge.

Then, suppose that our maximum isolated node problem is solved; given any number k, we can find a subset \(\mathcal {E} ^{\prime } \subseteq \mathcal {E} \) of k tuple nodes such that the number of isolated nodes is maximized. So we can vary k from 1 to \(|\mathcal {E} |\) to find the minimum k for which all nodes in \(\mathcal {E}-\mathcal {E} ^{\prime }\) are isolated. Given the above reduction, we can see that this actually solves the MVC problem, because isolating all remaining tuple nodes is equivalent to finding a vertex subset \(V^{\prime }\) that covers all edges in E in the MVC problem.

Thus, any algorithm that solves the maximum isolated node problem can also be used to solve the MVC problem. As the MVC problem is NP-hard, the maximum isolated node problem is NP-hard. Moreover, since the maximum isolated node problem is a special case of our RuleRef task selection problem formalized in Definition 5, we prove Theorem 2.
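The reduction can be made concrete with a brute-force sketch. It assumes the bipartite encoding described above (one tuple node per vertex, one rule node per edge, each rule node linked to its edge's two endpoints) and counts isolated nodes only among the tuple nodes that were not removed, as in the Fig. 16 example. The function names and the example path graph are hypothetical, and the loop starts at k = 0 rather than 1 so that edgeless graphs are also handled.

```python
from itertools import combinations

def isolated_after_removal(rule_to_tuples, tuples, removed):
    """Tuple nodes left without any edge after removing the given tuple nodes.

    Removing a tuple node also removes every rule node connected to it.
    """
    removed = set(removed)
    still_connected = set()
    for covered in rule_to_tuples.values():
        if not (set(covered) & removed):      # the rule node survives
            still_connected |= set(covered)
    return {t for t in tuples if t not in removed and t not in still_connected}

def max_isolated(rule_to_tuples, tuples, k):
    """Brute-force solver: remove k tuple nodes to maximize isolated nodes."""
    best = max(
        combinations(sorted(tuples), k),
        key=lambda c: len(isolated_after_removal(rule_to_tuples, tuples, c)),
    )
    return set(best), isolated_after_removal(rule_to_tuples, tuples, best)

def mvc_via_max_isolated(vertices, edges):
    """Solve minimum vertex cover by calling max_isolated for growing k."""
    rule_to_tuples = {i: set(edge) for i, edge in enumerate(edges)}
    for k in range(len(vertices) + 1):
        removed, isolated = max_isolated(rule_to_tuples, vertices, k)
        if len(isolated) == len(vertices) - k:  # every remaining node is isolated
            return removed
    return set(vertices)

# Hypothetical MVC instance: the path a - b - c - d.
vertices = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
cover = mvc_via_max_isolated(vertices, edges)
```

All remaining tuple nodes are isolated exactly when every rule node has been removed, that is, when every edge has an endpoint in the removed set, so the first k that succeeds yields a minimum vertex cover.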

Table 11 Examples of crowd-validated rules

Examples of Labeling Rules

We also provide some examples to better illustrate the intuition behind our method. Table 11 shows some high-quality rules validated by the crowd on four datasets. Take the rule \((\mathtt{Sony}, \mathtt{Toshiba})\) on the \(\mathtt{Abt} \)-\(\mathtt{Buy} \) dataset as an example: applying this rule is equivalent to annotating over 2000 samples. Selecting these high-quality rules forms the basis for CrowdGame. For EM tasks, such good rules usually contain brand names, product names, product functions, properties, abbreviations, and so on. For the spouse relation dataset, the good rules usually consist of words related to kinship.

Extension of Labeling Rule

We discuss a more general case in which some rules in the candidates \(\mathcal {R} ^\mathtt{C}\) annotate label \(L_1=-1\) (called \(L_1\) rules for simplicity), while others annotate \(L_2=1\) (called \(L_2\) rules). Consider our \(\mathtt{spouse} \) relation extraction example that annotates \(L_2=1\) if entities have a spouse relation or \(L_1=-1\) otherwise. In this case, a tuple, e.g., entity pair \((\mathtt{Michelle~Obama}, \mathtt{Barack~Obama})\), could be covered by conflicting rules (textual patterns), e.g., an \(L_2\) rule “\(\mathtt{married~with} \)” and an \(L_1\) rule “\(\mathtt{meets} \).”

CrowdGame devises a simple extension from Algorithm 1 by taking \(L_1\) and \(L_2\) rules independently. More specifically, let \(\mathcal {R} _\mathtt{q}^{L_1}\) (\(\mathcal {R} _\mathtt{q}^{L_2}\)) denote the set of \(L_1\) (\(L_2\)) rules selected by RuleGen for crowdsourcing. Recall that \(\mathcal {E} _\mathtt{q} \) is the set of tuples selected by RuleRef for crowdsourcing. First, we extend the overall minimax optimization objective, denoted by \(\tilde{\mathcal {J}}\), as a combination of objectives of \(L_1\) and \(L_2\) rules, i.e., \(\tilde{\mathcal {J}} = \mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_1},\mathcal {E} _\mathtt{q}} + \mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_2},\mathcal {E} _\mathtt{q}} \), where \(\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_1},\mathcal {E} _\mathtt{q}} \) (\(\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_2},\mathcal {E} _\mathtt{q}} \)) is defined in Eq. (5). Then, we run the iterative crowdsourcing framework in Algorithm 1. We present how RuleGen and RuleRef work in each iteration as follows:

  • RuleGen only slightly extends the computation of rule selection criterion \(\varDelta g (\mathcal {R} |\mathcal {J})\) as the summation of (1) the expected improvement of \(L_1\) rules \(\mathcal {R} ^{L_1}\) in \(\mathcal {R} \) over \(\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_1},\mathcal {E} _\mathtt{q}} \) and (2) the expected improvement of \(\mathcal {R} ^{L_2}\) over \(\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_2},\mathcal {E} _\mathtt{q}} \), where the expected improvement is computed using Eq. (6). Then, RuleGen uses the greedy strategy to find a rule set \(\mathcal {R} ^{*}\) that maximizes the criterion \(\varDelta g (\mathcal {R} |\mathcal {J})\).

  • RuleRef extends the notation of \(e ^\mathtt{\times }_{i}\) to \(e _{i}^{L_1}\) (or \(e _{i}^{L_2}\)), which, respectively, means tuple \(e _{i}\) is checked and annotated with \(L_1\) (or \(L_2\)). Then, given a checked tuple \(e _{i}^{L_1}\) (or \(e _{i}^{L_2}\)), RuleRef considers it to refute the \(L_2\) part (or the \(L_1\) part) of objective \(\tilde{\mathcal {J}}\) using Eq. (7). Based on this, given a tuple set \(\mathcal {E} \), we consider every possible case of \((\mathcal {E} ^{L_1}, \mathcal {E} ^{L_2})\) where \(\mathcal {E} ^{L_1} \cup \mathcal {E} ^{L_2} = \mathcal {E} \) and \(\mathcal {E} ^{L_1} \cap \mathcal {E} ^{L_2} = \emptyset \), and revise Eq. (8) to \(\varDelta f (\mathcal {E} |\mathcal {J}) = - \sum _{\mathcal {E} ^{L_1}, \mathcal {E} ^{L_2}}P(\mathcal {E} ^{L_1})P(\mathcal {E} ^{L_2}) \cdot (\mathcal {I} (\mathcal {E} ^{L_1})+\mathcal {I} (\mathcal {E} ^{L_2}))\). Then, RuleRef utilizes this criterion for selecting tuples.

Using the above method, CrowdGame obtains a rule set \(\mathcal {R} _\mathtt{q}\) returned by Algorithm 1. Then, let \(\mathcal {R} _\mathtt{q}^{i} \subseteq \mathcal {R} _\mathtt{q}\) denote the set of rules covering a tuple \(e _{i}\). CrowdGame labels \(e _{i}\) using the label of the rule in \(\mathcal {R} _\mathtt{q}^{i}\) with the maximum accuracy.
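The final labeling step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dictionary layout, the function name, and the accuracy values are assumptions, and tuples covered by no selected rule are simply left unlabeled here.

```python
def assign_labels(tuples, rules):
    """Label each tuple with the label of its most accurate covering rule.

    rules: list of dicts with keys "pattern", "label" (-1 for L1, 1 for L2),
    "accuracy" (estimated rule accuracy), and "covers" (set of tuples).
    """
    labels = {}
    for e in tuples:
        covering = [r for r in rules if e in r["covers"]]
        if covering:  # uncovered tuples stay unlabeled in this sketch
            best = max(covering, key=lambda r: r["accuracy"])
            labels[e] = best["label"]
    return labels

# Hypothetical rules; the accuracy values are made up for illustration.
pair = ("Michelle Obama", "Barack Obama")
other = ("Alice", "Bob")
rules = [
    {"pattern": "married with", "label": 1, "accuracy": 0.9, "covers": {pair}},
    {"pattern": "meets", "label": -1, "accuracy": 0.6, "covers": {pair}},
]
labels = assign_labels([pair, other], rules)
```

With these assumed accuracies, the conflicting rules on the example pair are resolved in favor of the \(L_2\) rule, since it has the higher accuracy.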

Cite this article

Yang, J., Fan, J., Wei, Z. et al. A game-based framework for crowdsourced data labeling. The VLDB Journal 29, 1311–1336 (2020). https://doi.org/10.1007/s00778-020-00613-w
