A game-based framework for crowdsourced data labeling

Yang, Jingru; Fan, Ju; Wei, Zhewei; Li, Guoliang; Liu, Tongyu; Du, Xiaoyong

doi:10.1007/s00778-020-00613-w

Regular Paper
Published: 19 May 2020

Volume 29, pages 1311–1336, (2020)
The VLDB Journal

Abstract

Data labeling, which assigns data to one of multiple classes, is indispensable for many applications, such as machine learning and data integration. However, existing labeling solutions either incur high cost on large datasets or produce noisy results. This paper introduces a cost-effective labeling approach and focuses on the labeling rule generation problem, which aims to generate high-quality rules that substantially reduce the labeling cost while preserving quality. To address the problem, we first generate candidate rules and then devise a game-based crowdsourcing approach, CrowdGame, to select high-quality rules by considering coverage and accuracy. CrowdGame employs two groups of crowd workers: one group answers rule validation tasks (whether a rule is valid) to play the role of rule generator, while the other group answers tuple checking tasks (whether the label of a data tuple is correct) to play the role of rule refuter. We let the two groups play a two-player game: the rule generator identifies high-quality rules with large coverage, while the rule refuter tries to refute its opponent by checking tuples that provide enough evidence to reject rules with low accuracy. This paper studies three challenges in CrowdGame. The first is to balance the trade-off between coverage and accuracy; we define the loss of a rule by considering the two factors. The second is rule accuracy estimation; we utilize Bayesian estimation to combine both rule validation and tuple checking tasks. The third is to select crowdsourcing tasks that fulfill the game-based framework while minimizing the loss; we introduce a minimax strategy and develop efficient task selection algorithms. We also develop a hybrid crowd-machine method for effective label assignment under budget-constrained crowdsourcing settings. We conduct experiments on entity matching and relation extraction, and the results show that our method outperforms state-of-the-art solutions.

Notes

Note that the matching criterion is the same product model and the same manufacturer, without considering specifications like color and storage.
https://code.google.com/p/word2vec/
http://research.signalmedia.co/newsir16/signal-dataset.html
http://wiki.dbpedia.org/
See how $\mathtt{Snorkel} $ uses crowdsourcing in Section 7.4

Acknowledgements

This work was supported by NSF of China (61632016, 61925205, U1711261, 61832017, 61972401, 61932001), Huawei, TAL education, Beijing Outstanding Young Scientist Program NO. BJJWZYJH012019100020098, Research Funds of Renmin University of China (18XNLG18, 18XNLG21), and the Fundamental Research Funds for the Central Universities.

Author information

Authors and Affiliations

Renmin University of China, Beijing, 100872, China
Jingru Yang, Ju Fan, Zhewei Wei, Tongyu Liu & Xiaoyong Du
Tsinghua University, Beijing, 100084, China
Guoliang Li

Corresponding author

Correspondence to Ju Fan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Proofs

1.1 Proof of Theorem 1

We prove NP-hardness of the problem by a reduction from the k-maximum coverage (KMC) problem, which is known to be NP-hard.

Recall that an instance of the KMC problem (E,$\mathcal {S}$,k) consists of a universe of elements E = {$s_1,s_2,\cdots ,s_n$}, a collection of subsets of the universe E, i.e., $\mathcal {S}$ = {$S_1,S_2,\cdots ,S_m$} where every $S_i \in \mathcal {S}$ satisfies $S_i \subseteq E$, and a number k. The objective is to select k subsets from $\mathcal {S}$, denoted by $\mathcal {S}^\prime $, so that the number of covered elements $\left| \bigcup _{S\in \mathcal {S}^\prime } S \right| $ is maximized.
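To make the KMC objective concrete, the following minimal Python sketch evaluates coverage and runs the standard greedy heuristic for maximum coverage (which is known to achieve a $1-1/e$ approximation). It is only an illustration of the problem recalled above, not part of the proof; the function names and the toy instance are hypothetical.

```python
def coverage(subsets, chosen):
    """Number of universe elements covered by the chosen subset indices."""
    return len(set().union(*(subsets[i] for i in chosen)))


def greedy_max_coverage(subsets, k):
    """Pick up to k subset indices, each time taking the largest marginal gain."""
    covered = set()
    chosen = []
    for _ in range(min(k, len(subsets))):
        best_idx, best_gain = -1, 0
        for idx, subset in enumerate(subsets):
            if idx in chosen:
                continue
            gain = len(subset - covered)  # elements this subset would newly cover
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx < 0:  # no remaining subset adds coverage
            break
        chosen.append(best_idx)
        covered |= subsets[best_idx]
    return chosen


# Toy instance (hypothetical): universe {1,...,6}, four subsets, k = 2.
S = [{1, 2, 3}, {3, 4}, {4, 5, 6}, {1, 6}]
picked = greedy_max_coverage(S, 2)
print(picked, coverage(S, picked))
```

The greedy choice is a heuristic; finding the exact optimum is what makes KMC NP-hard.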

An instance of our problem consists of a set of tuples $\mathcal {E}$, a set of rules $\mathcal {R}$, and a number b. The optimization objective is to select b rules from $\mathcal {R}$ so that the expected rule selection criterion, according to Eq. 6, is maximized.

The reduction from KMC to our problem. We show next that for any instance (E,$\mathcal {S}$,k) of KMC, we can create a corresponding instance of our problem based on (E,$\mathcal {S}$,k) in polynomial time.

We translate the set E of elements into the set $\mathcal {E}=\{e_1,e_2,\cdots ,e_n\}$ of tuples in our problem.
Given an element $s_j$ in E, if $s_j \in S_i$, we add the tuple $e_j$ to the rule $r_i$, whose accuracy and validation probability are both 1. We set the parameter $\gamma $ to 0.5. The gain of objective $\mathcal {J} ^{r_i}$ then evaluates to 1 if $s_j \in S_i$, and 0 otherwise. Thus, each set $S_i$ in the KMC problem corresponds to the rule $r_i$ and the elements covered by $S_i$ correspond to the tuples covered by $r_i$.
We translate number k in KMC into b in our problem.
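The three translation steps above can be sketched as a small Python function. This is a hedged illustration of the construction only: the `Rule` container, the field names, and the dictionary layout are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    covered: set            # tuples e_j covered by the rule
    accuracy: float = 1.0   # fixed to 1 by the reduction
    p_valid: float = 1.0    # probability of passing validation, fixed to 1


def kmc_to_rule_selection(universe, subsets, k):
    """Map a KMC instance (E, S, k) to a rule selection instance."""
    # Step 1: each element s_j becomes a tuple e_j.
    tuple_of = {s: f"e{j}" for j, s in enumerate(universe, start=1)}
    # Step 2: each subset S_i becomes a rule r_i covering the matching tuples.
    rules = [
        Rule(f"r{i}", {tuple_of[s] for s in subset})
        for i, subset in enumerate(subsets, start=1)
    ]
    # Step 3: the number k becomes the budget b; gamma is set to 0.5.
    return {"tuples": list(tuple_of.values()), "rules": rules, "b": k, "gamma": 0.5}


instance = kmc_to_rule_selection(["s1", "s2", "s3"], [{"s1", "s2"}, {"s3"}], 1)
print(instance["rules"][0])
```

The construction touches each element and each subset once, so it runs in polynomial time, as the reduction requires.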

Equivalence of optimization objectives. We show the optimization objectives of the two problems are equivalent:

Since in our instance the probability that an individual rule r passes the validation is 1, the validated rule set $\mathcal {R} ^\mathtt{\surd }$ is equivalent to the selected rule set $\mathcal {R}^{(t)}_q$, and $P(\mathcal {R} ^\mathtt{\surd })=P(\mathcal {R}^{(t)}_q)=1$.
Since in our instance the rule accuracy is 1 and the parameter $\gamma $ is 0.5, based on Equation 4, we know that $\mathcal {J} ^{\mathcal {R}^{(t)}_q}=\left| \mathcal {C}(\mathcal {R}^{(t)}_q) \right| $.

With $\mathcal {R} ^\mathtt{\surd }=\emptyset $, the expected rule selection criterion therefore becomes $\left| \mathcal {C}(\mathcal {R}^{(t)}_q) \right| $. Since our problem is to find the b best rules, $\mathcal {R}^{(t)}_q$, that maximize the expected criterion, this is equivalent to finding the b best sets that maximize the number of covered elements.

1.2 Proof of Lemma 1

Consider two rule sets $\mathcal {R} _{1} \subseteq \mathcal {R} _{2}$; we first prove the monotonicity as follows. For simplicity, we use $\varGamma $ to denote $\frac{1-2\gamma }{1-\gamma }$ in this proof:

$$\begin{aligned}&\varDelta g (\mathcal {R} _{2}|\mathcal {J}) - \varDelta g ( \mathcal {R} _{1}| \mathcal {J}) = \sum _{\mathcal {R} ^\mathtt{\surd }_{2}}P(\mathcal {R} ^\mathtt{\surd }_{2})\sum _{e _{i}}{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }}\\&\quad - \sum _{\mathcal {R} ^\mathtt{\surd }_{1}}P(\mathcal {R} ^\mathtt{\surd }_{1})\sum _{e _{i}}{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }} \end{aligned}$$

Since $\mathcal {R} _{1} \subseteq \mathcal {R} _{2}$, for simplicity, we introduce $\mathcal {R} _3 = \mathcal {R} _2-\mathcal {R} _1$. Then, for any $\mathcal {R} ^\mathtt{\surd }_{2}$, we can find a $\mathcal {R} ^\mathtt{\surd }_{1}$ and $\mathcal {R} ^\mathtt{\surd }_{3}$ such that $\mathcal {R} ^\mathtt{\surd }_{2}=\mathcal {R} ^\mathtt{\surd }_{1} \cup \mathcal {R} ^\mathtt{\surd }_{3}$. Based on this, we have

$$\begin{aligned}&\varDelta g (\mathcal {R} _{2}|\mathcal {J}) - \varDelta g ( \mathcal {R} _{1}| \mathcal {J})=\sum _{\mathcal {R} ^\mathtt{\surd }_{1}}P(\mathcal {R} ^\mathtt{\surd }_{1})\big [\ \sum _{\mathcal {R} ^\mathtt{\surd }_{3}}P(\mathcal {R} ^\mathtt{\surd }_{3}) \\&\big ( \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1} \cup \mathcal {R} ^\mathtt{\surd }_{3})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} - \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} \big ) \big ]. \end{aligned}$$

It is easy to see that $\sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1} \cup \mathcal {R} ^\mathtt{\surd }_{3})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} - \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }_{1})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} \ge 0$, which proves monotonicity.

We next prove that $\varDelta g ( \mathcal {R} | \mathcal {J})$ is submodular. Given any rule $r $, using the previous equation, we have

$$\begin{aligned}&\varDelta g (\mathcal {R} \cup \{r \}|\mathcal {J}) - \varDelta g ( \mathcal {R} | \mathcal {J}) = \sum _{\mathcal {R} ^\mathtt{\surd }}P(\mathcal {R} ^\mathtt{\surd }) P(r^\mathtt{\surd }) \\&\big ( \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd }\cup \{r^\mathtt{\surd } \})}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} - \sum _{e _{i}\in \mathcal {C}(\mathcal {R} ^\mathtt{\surd })}\{\max _{r _{j}}{\hat{\lambda } _{j}-\varGamma }\} \big ) \\&= -\varGamma + \sum _{\mathcal {R} ^\mathtt{\surd }}P(\mathcal {R} ^\mathtt{\surd })P(r^\mathtt{\surd }) \big ( \sum _{e _{i}\in \mathcal {C}(r^\mathtt{\surd })-\mathcal {C}(\mathcal {R} ^\mathtt{\surd })}{\hat{\lambda }} \\&+ \sum _{e _{i}\in \mathcal {C}(r^\mathtt{\surd }) \cap \mathcal {C}(\mathcal {R} ^\mathtt{\surd })}\max \{\hat{\lambda }-\max {\varLambda ^{\mathcal {R} ^{(i)}}},0\} \big ) \end{aligned} \tag{13}$$

From the equation, we can see that the above margin depends on the following two factors in each case corresponding to $P(\mathcal {R} ^\mathtt{\surd })P(r^\mathtt{\surd })$:

Improvement on “additional” tuples covered by $r^\mathtt{\surd } $, i.e., $\sum _{e _{i}\in \mathcal {C}(r^\mathtt{\surd })-\mathcal {C}(\mathcal {R} ^\mathtt{\surd })}{\hat{\lambda }}$.
Improvement on the tuples already covered by $\mathcal {R} ^\mathtt{\surd }$.

Now, let us consider two rule sets $\mathcal {R} _{1} \subseteq \mathcal {R} _{2}$. It is not difficult to see that both of the above factors for $\mathcal {R} _{2}$ are no greater than those for $\mathcal {R} _{1}$. Thus, we have $\varDelta g (\mathcal {R} _{1}\cup \{r \}|\mathcal {J})-\varDelta g (\mathcal {R} _{1}|\mathcal {J}) \ge \varDelta g (\mathcal {R} _{2}\cup \{r \}|\mathcal {J})-\varDelta g (\mathcal {R} _{2}|\mathcal {J})$, which proves the submodularity. Hence, we prove the lemma.
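As a sanity check of the lemma, the following Python sketch verifies monotonicity and submodularity by brute force on a small toy instance. It assumes a simplified criterion of the form used in this proof (an expectation, over which selected rules pass validation, of the sum over covered tuples of the maximum covering-rule accuracy minus $\varGamma$), starting from an empty current objective. The rule names, accuracies, and validation probabilities are hypothetical, and $\gamma = 0.5$ is chosen so that $\varGamma = 0$.

```python
from itertools import combinations

# Hypothetical toy instance: rule -> (covered tuples, estimated accuracy, P(valid)).
RULES = {
    "r1": ({"e1", "e2"}, 0.9, 0.8),
    "r2": ({"e2", "e3"}, 0.7, 0.6),
    "r3": ({"e3", "e4"}, 0.8, 0.5),
    "r4": ({"e1", "e4"}, 0.6, 0.9),
}
GAMMA = 0.5
BIG_GAMMA = (1 - 2 * GAMMA) / (1 - GAMMA)  # the proof's Gamma; 0 when gamma = 0.5


def score(validated):
    """Sum over covered tuples of (max accuracy of a covering rule - Gamma)."""
    best = {}
    for r in validated:
        covered, acc, _ = RULES[r]
        for e in covered:
            best[e] = max(best.get(e, 0.0), acc)
    return sum(acc - BIG_GAMMA for acc in best.values())


def expected_gain(selected):
    """Expectation of score over which of the selected rules pass validation."""
    rules = sorted(selected)
    total = 0.0
    for size in range(len(rules) + 1):
        for passed in combinations(rules, size):
            prob = 1.0
            for r in rules:
                p = RULES[r][2]
                prob *= p if r in passed else 1 - p
            total += prob * score(passed)
    return total


def subsets(items):
    """All subsets of items, as frozensets."""
    items = sorted(items)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def is_monotone_submodular(tol=1e-9):
    all_rules = frozenset(RULES)
    for small in subsets(all_rules):
        for extra in subsets(all_rules - small):
            large = small | extra
            if expected_gain(large) < expected_gain(small) - tol:
                return False  # monotonicity violated
            for r in all_rules - large:
                gain_small = expected_gain(small | {r}) - expected_gain(small)
                gain_large = expected_gain(large | {r}) - expected_gain(large)
                if gain_small < gain_large - tol:
                    return False  # diminishing returns violated
    return True


print(is_monotone_submodular())
```

A passing check on one instance is not a proof; it only illustrates the two properties the lemma establishes in general.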

1.3 Proof of Theorem 2

To prove Theorem 2, let us consider a special case of the RuleRef task selection problem, as shown in Fig. 16. Each rule has the same accuracy $\hat{\lambda } _{j}=\lambda $, and each tuple has the same refute probability $P(e ^\mathtt{\times }_{i})=1.0$. Moreover, we consider the “strict” refuting strategy used in Example 2, where one counterexample is enough to refute all rules covering the tuple, and we set the weight $\gamma =0.5$. In this case, refuting a tuple, say $e _{1}$, removes all the rules covering the tuple, say $\{r _{1}, r _{2}, r _{3}\}$. However, the removed rules do not induce any impact as defined in Section 4.2, because the tuples covered by $\{r _{1}, r _{2}, r _{3}\}$ are still covered by other un-refuted rules, and thus the maximum accuracy associated with these tuples is still $\lambda $. In contrast, suppose that we refute $e _{5}$; then we obtain an impact of $\lambda $, as the maximum rule accuracy associated with $e _{6}$ becomes 0. Based on these examples, it is not difficult to see that this special case of the RuleRef task selection problem is equivalent to the following maximum isolated node problem:

Definition 7

(Maximum Isolated Node Problem) Given a bipartite graph over a rule node set $\mathcal {R} $ and a tuple node set $\mathcal {E} $, consider the following removal conditions: (1) if a tuple node is removed, then all the rule nodes connected to the tuple node, as well as the edges associated with these rule nodes, are removed; (2) a tuple node is called an “isolated node” iff no edge is associated with it. The problem is to find k tuple nodes $\mathcal {E} ^{\prime } \subseteq \mathcal {E} $ to remove such that the number of isolated nodes after the removal is maximized.

For example, in Fig. 16, after removing $\{e _{2}, e _{3}\}$, there are no isolated tuple nodes. In contrast, after removing $\{e _{1}, e _{2}\}$, $e _{3}$ and $e _{4}$ become isolated tuple nodes.
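The removal conditions of Definition 7 can be simulated with a few lines of Python. The sketch below uses a hypothetical bipartite graph (it is not the graph of Fig. 16, which is not reproduced here), chosen so that it shows the same kind of behavior as the example: one removal set isolates two tuple nodes, another isolates none. The brute-force search is only for illustration.

```python
from itertools import combinations

# Hypothetical bipartite graph (not the paper's Fig. 16): rule -> tuples it covers.
COVERS = {
    "r1": {"e1", "e3"},
    "r2": {"e1", "e4"},
    "r3": {"e2", "e3"},
    "r4": {"e2", "e4"},
}
TUPLES = {"e1", "e2", "e3", "e4"}


def isolated_after_removal(removed):
    """Tuple nodes left without any edge once `removed` tuple nodes are deleted."""
    removed = set(removed)
    # Condition (1): a rule node disappears if it touches any removed tuple node.
    surviving_rules = [r for r, covered in COVERS.items() if not covered & removed]
    still_covered = set().union(*(COVERS[r] for r in surviving_rules))
    # Condition (2): a remaining tuple node with no surviving rule is isolated.
    return (TUPLES - removed) - still_covered


def best_removal(k):
    """Exhaustively search for k tuple nodes maximizing the isolated count."""
    return max(
        combinations(sorted(TUPLES), k),
        key=lambda c: len(isolated_after_removal(c)),
    )


print(isolated_after_removal({"e1", "e2"}))
print(isolated_after_removal({"e2", "e3"}))
print(best_removal(2))
```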

We can prove that the maximum isolated node problem is NP-hard by a reduction from the minimum vertex cover (MVC) problem, which is known to be NP-hard. Recall that an instance of the MVC problem consists of a graph $G^\prime =(V,E)$ with vertex set V and edge set E. The problem aims to find the minimum vertex subset $V^{\prime } \subseteq V$ such that every edge $e \in E$ has at least one endpoint in $V^{\prime }$.

Next, we show the reduction from the MVC problem to our maximum isolated node problem. Given any instance of the MVC problem $G^\prime =(V,E)$, we create a tuple node set $\mathcal {E} $, each of which corresponds to a vertex in V, and a rule node set $\mathcal {R} $, each of which corresponds to an edge in E. Each rule node is connected to the two tuple nodes that correspond to the endpoints of its edge.

Then, suppose that we can solve our maximum isolated node problem; that is, given any number k, we can find a subset $\mathcal {E} ^{\prime } \subseteq \mathcal {E} $ of k tuple nodes such that the number of isolated nodes is maximized. We can then vary k from 1 to $|\mathcal {E} |$ to find the minimum k for which all nodes in $\mathcal {E}-\mathcal {E} ^{\prime }$ are isolated. Given the above reduction, we can see that this actually solves the MVC problem, because isolating all remaining tuple nodes is equivalent to finding a vertex subset $V^{\prime }$ that covers all edges in E.
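The procedure above can be sketched in Python as follows. The sketch assumes that each rule node (edge) is linked to the two tuple nodes of its endpoints, and it uses a brute-force search as a stand-in for a hypothetical solver of the maximum isolated node problem; all function names are illustrative. Unlike the text, the loop starts at k = 0 so that graphs without edges are handled too.

```python
from itertools import combinations


def build_instance(edges):
    """One rule node per edge, linked to the tuple nodes of both endpoints."""
    return {f"r{i}": {u, v} for i, (u, v) in enumerate(edges)}


def isolated_count(covers, tuples, removed):
    """Number of remaining tuple nodes with no surviving rule node."""
    removed = set(removed)
    still_covered = set()
    for covered in covers.values():
        if not covered & removed:  # rule node survives the removal
            still_covered |= covered
    return len((set(tuples) - removed) - still_covered)


def max_isolated_oracle(covers, tuples, k):
    """Brute-force stand-in for a maximum isolated node solver."""
    return max(
        combinations(sorted(tuples), k),
        key=lambda c: isolated_count(covers, tuples, c),
    )


def min_vertex_cover(vertices, edges):
    """Smallest k whose optimal removal isolates every remaining tuple node."""
    covers = build_instance(edges)
    for k in range(len(vertices) + 1):
        removed = max_isolated_oracle(covers, vertices, k)
        if isolated_count(covers, vertices, removed) == len(vertices) - k:
            return set(removed)


path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
print(min_vertex_cover(["a", "b", "c", "d"], path_edges))
```

All remaining tuple nodes are isolated exactly when every rule node has been removed, that is, when every edge has an endpoint in the removed set.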

Thus, solving the maximum isolated node problem also solves the MVC problem. As the MVC problem is NP-hard, the maximum isolated node problem is NP-hard. Moreover, since the maximum isolated node problem is a special case of our RuleRef task selection problem formalized in Definition 5, we prove Theorem 2.

Table 11 Examples of crowd-validated rules

Examples of Labeling Rules

We also provide some examples to convey the intuition behind our method. Table 11 shows some high-quality rules validated by the crowd on four datasets. Take the rule $(\mathtt{Sony}, \mathtt{Toshiba})$ on the $\mathtt{Abt} $-$\mathtt{Buy} $ dataset as an example: applying this rule is equivalent to annotating over 2000 samples. Selecting such high-quality rules forms the basis of CrowdGame. For EM tasks, good rules usually contain brand names, product names, product functions, properties, abbreviations, and so on. For the spouse relation dataset, good rules usually consist of words related to kinship.

Extension of Labeling Rule

We discuss a more general case in which some rules in the candidates $\mathcal {R} ^\mathtt{C}$ annotate label $L_1=-1$ (called $L_1$ rules for simplicity), while others annotate $L_2=1$ (called $L_2$ rules). Consider our $\mathtt{spouse} $ relation extraction example, which annotates $L_2=1$ if the entities have a spouse relation and $L_1=-1$ otherwise. In this case, a tuple, e.g., the entity pair $(\mathtt{Michelle~Obama}, \mathtt{Barack~Obama})$, could be covered by conflicting rules (textual patterns), e.g., an $L_2$ rule “$\mathtt{married~with} $” and an $L_1$ rule “$\mathtt{meets} $.”

CrowdGame uses a simple extension of Algorithm 1 that treats $L_1$ and $L_2$ rules independently. More specifically, let $\mathcal {R} _\mathtt{q}^{L_1}$ ($\mathcal {R} _\mathtt{q}^{L_2}$) denote the set of $L_1$ ($L_2$) rules selected by RuleGen for crowdsourcing. Recall that $\mathcal {E} _\mathtt{q} $ is the set of tuples selected by RuleRef for crowdsourcing. First, we extend the overall minimax optimization objective, denoted by $\tilde{\mathcal {J}}$, as a combination of the objectives of $L_1$ and $L_2$ rules, i.e., $\tilde{\mathcal {J}} = \mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_1},\mathcal {E} _\mathtt{q}} + \mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_2},\mathcal {E} _\mathtt{q}} $, where $\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_1},\mathcal {E} _\mathtt{q}} $ ($\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_2},\mathcal {E} _\mathtt{q}} $) is defined in Eq. (5). Then, we run the iterative crowdsourcing framework in Algorithm 1. RuleGen and RuleRef work in each iteration as follows:

RuleGen only slightly extends the computation of the rule selection criterion $\varDelta g (\mathcal {R} |\mathcal {J})$ as the summation of (1) the expected improvement of the $L_1$ rules $\mathcal {R} ^{L_1}$ in $\mathcal {R} $ over $\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_1},\mathcal {E} _\mathtt{q}} $ and (2) the expected improvement of $\mathcal {R} ^{L_2}$ over $\mathcal {J}^{\mathcal {R} _\mathtt{q}^{L_2},\mathcal {E} _\mathtt{q}} $, where the expected improvement is computed using Eq. (6). Then, RuleGen uses the greedy strategy to find an optimal rule set $\mathcal {R} ^{*}$ that maximizes the criterion $\varDelta g (\mathcal {R} |\mathcal {J})$.
RuleRef extends the notation of $e ^\mathtt{\times }_{i}$ to $e _{i}^{L_1}$ (or $e _{i}^{L_2}$), which means that tuple $e _{i}$ is checked and annotated with $L_1$ (or $L_2$, respectively). Then, given a checked tuple $e _{i}^{L_1}$ (or $e _{i}^{L_2}$), RuleRef considers it to refute the $L_2$ part (or the $L_1$ part) of objective $\tilde{\mathcal {J}}$ using Eq. (7). Based on this, given a tuple set $\mathcal {E} $, we consider every possible case of $(\mathcal {E} ^{L_1}, \mathcal {E} ^{L_2})$ where $\mathcal {E} ^{L_1} \cup \mathcal {E} ^{L_2} = \mathcal {E} $ and $\mathcal {E} ^{L_1} \cap \mathcal {E} ^{L_2} = \emptyset $, and revise Eq. (8) to $\varDelta f (\mathcal {E} |\mathcal {J}) = - \sum _{\mathcal {E} ^{L_1}, \mathcal {E} ^{L_2}}P(\mathcal {E} ^{L_1})P(\mathcal {E} ^{L_2}) \cdot (\mathcal {I} (\mathcal {E} ^{L_1})+\mathcal {I} (\mathcal {E} ^{L_2}))$. Then, RuleRef utilizes this criterion for selecting tuples.

Using the above method, CrowdGame obtains a rule set $\mathcal {R} _\mathtt{q}$ returned by Algorithm 1. Then, let $\mathcal {R} _\mathtt{q}^{i} \subseteq \mathcal {R} _\mathtt{q}$ denote the set of rules covering a tuple $e _{i}$. CrowdGame labels $e _{i}$ using the label of the rule in $\mathcal {R} _\mathtt{q}^{i}$ with the maximum accuracy.
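The final labeling step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rules are modeled as textual patterns matched by substring containment, and the accuracy values and the example sentence are hypothetical, not results from the paper.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    pattern: str
    label: int        # -1 for an L1 rule, +1 for an L2 rule
    accuracy: float   # estimated accuracy (hypothetical values below)


def covers(rule: Rule, text: str) -> bool:
    """Assumption: a rule covers a tuple if its pattern occurs in the text."""
    return rule.pattern in text


def assign_label(text: str, rules: list) -> Optional[int]:
    """Label a tuple with the label of its most accurate covering rule."""
    covering = [r for r in rules if covers(r, text)]
    if not covering:
        return None  # no rule covers the tuple, so it stays unlabeled here
    return max(covering, key=lambda r: r.accuracy).label


rules = [Rule("married with", +1, 0.92), Rule("meets", -1, 0.55)]
sentence = "Michelle Obama meets reporters; she is married with Barack Obama"
print(assign_label(sentence, rules))
```

In this toy case both rules cover the tuple, and the conflict is resolved in favor of the $L_2$ rule because its estimated accuracy is higher.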

About this article

Cite this article

Yang, J., Fan, J., Wei, Z. et al. A game-based framework for crowdsourced data labeling. The VLDB Journal 29, 1311–1336 (2020). https://doi.org/10.1007/s00778-020-00613-w

Received: 16 September 2019
Revised: 30 January 2020
Accepted: 12 April 2020
Published: 19 May 2020
Issue Date: November 2020
DOI: https://doi.org/10.1007/s00778-020-00613-w
