
A maximum-likelihood and moment-matching density estimator for crowd-sourcing label prediction


Abstract

We deal with the parameter estimation problem for probability density models with latent variables. Traditionally, the expectation-maximization (EM) algorithm has been widely used for this problem, but it suffers from poor local maxima, and the quality of the estimator is sensitive to the initial model choice. Recently, an alternative density estimator has been proposed that is based on matching the model-averaged moments to the sample-averaged ones. This moment-matching estimator is typically used as the initial iterate for the EM algorithm for further refinement. However, there is no guarantee that the EM-refined estimator still yields moments close to the sample-averaged ones. Motivated by this issue, in this paper we propose a novel estimator that takes the merits of both worlds: we perform likelihood maximization, but the moment discrepancy score is used as a regularizer that prevents the model-averaged moments from straying away from those estimated from data. On several crowd-sourcing label prediction problems, we demonstrate that the proposed approach yields more accurate density estimates than the existing estimators.
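Schematically, and with the caveat that the precise objective and discrepancy measure are specified in the paper body rather than in this preview, the proposed estimator can be pictured as a moment-regularized maximum-likelihood problem of the form

$$ \hat{\theta} \;=\; \arg\max_{\theta} \; \frac{1}{N}\sum\limits_{i=1}^{N} \log P_{\theta}\big(x^{(i)}\big) \;-\; \lambda \left\| \mathbb{E}_{P_{\theta}}[\phi(x)] - \frac{1}{N}\sum\limits_{i=1}^{N} \phi\big(x^{(i)}\big) \right\|^{2}, $$

where \(\phi\) is the moment feature map, \(\lambda \ge 0\) trades off the two terms, and the norm stands in for whichever discrepancy score the paper adopts.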



Notes

  1. Here h is marginalized out, namely \(P_{\theta }(x) = {\sum }_{h} P_{\theta }(x,h)\).

  2. As a feature vector \(\phi(z)\), we take the one used by the MM estimator, namely \(\phi(z)\) composed of \(x_{1}\), \(x_{1} \otimes x_{2}\), \(x_{1} \otimes x_{3}\), and \(x_{1} \otimes x_{2} \otimes x_{j}\) for \(j=3,\dots,m\), where \(x_{j}\) is the K-dimensional one-hot vector for \(z_{j}\). This feature representation has been shown to identify the model parameters via inverse mapping [3].

  3. We report results up to m = 25 since having m larger than 25 resulted in almost perfect prediction for most estimators (e.g., less than 1% errors).

  4. http://www.daviddlewis.com/resources/testcollections/reuters21578/.

  5. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/.

References

  1. Anandkumar A, Foster DP, Hsu D, Kakade SM, Liu YK (2015) A spectral algorithm for latent Dirichlet allocation. Algorithmica 72(1):193–214


  2. Anandkumar A, Ge R, Hsu D, Kakade SM, Telgarsky M (2014) Tensor decompositions for learning latent variable models. J Mach Learn Res 15:2773–2832


  3. Anandkumar A, Hsu D, Kakade SM (2012) A method of moments for mixture models and hidden Markov models. In: 25th annual conference on learning theory

  4. Belkin M, Sinha K (2015) Polynomial learning of distribution families. SIAM J Comput 44(4):889–911


  5. Bishop C (2007) Pattern recognition and machine learning. Springer, Berlin


  6. Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. In: Proceedings of world wide web conference

  7. Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C 28(1):20–28

  8. Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the ACM symposium on Applied computing

  9. Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38


  10. Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition

  11. Deng ZH, Tang SW, Yang DQ, Li MZLY, Xie KQ (2004) A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications. Lect Notes Comput Sci 3007:588–597


  12. Diamond S, Boyd S (2016) CVXPY: a Python-embedded modeling language for convex optimization. J Mach Learn Res 17(83):1–5


  13. Ghosh A, Kale S, McAfee P (2011) Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In: Proceedings of the ACM conference on electronic commerce

  14. Hsu D, Kakade SM (2013) Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In: Proceedings of the 4th conference on innovations in theoretical computer science

  15. Liu Q, Peng J, Ihler AT (2012) Variational inference for crowdsourcing. In: Advances in neural information processing systems

  16. Lofberg J (2004) YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the IEEE international symposium on computer aided control systems design

  17. Moitra A, Valiant G (2010) Settling the polynomial learnability of mixtures of Gaussians. In: 51st annual IEEE symposium on foundations of computer science

  18. Raghunathan A, Frostig R, Duchi J, Liang P (2016) Estimation from indirect supervision with linear moments. In: International conference on machine learning

  19. Raykar VC, Yu S, Zhao LH, Valadez GH, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J Mach Learn Res 11:1297–1322


  20. Sarkar P, Siddiqi SM, Gordon GJ (2007) A latent space approach to dynamic embedding of co-occurrence data. In: Proceedings of the 11th international conference on artificial intelligence and statistics

  21. Snow R, O’Connor B, Jurafsky D, Ng AY (2008) Cheap and fast? But is it good?: Evaluating non-expert annotations for natural language tasks. In: Proceedings of the conference on empirical methods in natural language processing

  22. Sorensen DC (1982) Newton’s method with a model trust region modification. SIAM J Numer Anal 19(2):409–426


  23. Wang Y, Xie B, Song L (2016) Isotonic Hawkes processes. In: International conference on machine learning

  24. Yuan YX (2015) Recent advances in trust region algorithms. Math Program 151(1):249–281


  25. Zhang Y, Chen X, Zhou D, Jordan MI (2014) Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In: Advances in neural information processing systems

  26. Zhou D, Liu Q, Platt JC, Meek C (2014) Aggregating ordinal labels from crowds by minimax conditional entropy. In: International conference on machine learning

  27. Zhou D, Platt JC, Basu S, Mao Y (2012) Learning from the wisdom of crowds by minimax entropy. In: Advances in neural information processing systems


Acknowledgements

This work was supported by the National Research Foundation of Korea (NRF-2016R1A1A1A05921948).

Author information

Corresponding author

Correspondence to Minyoung Kim.

Ethics declarations

Conflict of interest

The authors have no conflict of interest.

Consent for Publication

This research does not involve human participants or animals. Consent to submit this manuscript has been received tacitly from the authors’ institution, Seoul National University of Science & Technology.

Appendix: Moment matching estimation for Dawid-Skene models

This appendix provides the detailed derivation of the moment-matching estimator, based on the theorems in [3, 25].

The assumption here is that the true model parameters satisfy: (i) \(w_{y} > 0\) for all \(y=1,\dots ,K\), and (ii) \(\text{rank}(\mu_{j}) = K\) (i.e., full rank) for every worker \(j\). As described in Section 2, the moments for the three types of features are: one-hot vectors, their pairwise products, and triple (tensor) products. For convenience of exposition, we fix the index set {1, 2, 3}, although one can replace it with any subset of cardinality three from \(\{1,\dots ,m\}\). That is, the moments considered are \(M_{1}:=\mathbb {E}[x_{1}]\), \(M_{12}:= \mathbb {E}[x_{1} \otimes x_{2}]\), and \(M_{123}:= \mathbb {E}[x_{1} \otimes x_{2} \otimes x_{3}]\), where all expectations are taken with respect to \(P_{\theta }(\cdot)\) (the subscript on the expectations is dropped for notational simplicity). They are a (K × 1) vector, a (K × K) matrix, and a (K × K × K) tensor, respectively. First, we have the following analytic formulas for the moments:

$$ M_{1} = \mathbb{E}_{P(y)}[\mathbb{E}[x_{1}|y]] = \mathbb{E}_{P(y)}[(\mu_{1})_{y}] = \sum\limits_{y=1}^{K} w_{y} \cdot (\mu_{1})_{y} = \mu_{1}^{\top} \cdot w. \tag{21} $$

$$ \begin{aligned} M_{12} &= \mathbb{E}_{P(y)}[\mathbb{E}[x_{1} \otimes x_{2}|y]] = \mathbb{E}_{P(y)}[\mathbb{E}[x_{1}|y] \otimes \mathbb{E}[x_{2}|y]] && \text{(22)}\\ &= \mathbb{E}_{P(y)}[(\mu_{1})_{y} \otimes (\mu_{2})_{y}] = \sum\limits_{y=1}^{K} w_{y} \cdot (\mu_{1})_{y} \otimes (\mu_{2})_{y} && \text{(23)}\\ &= \mu_{1} \cdot \text{diag}(w) \cdot \mu_{2}^{\top}, && \text{(24)} \end{aligned} $$

where diag(w) is the (K × K) diagonal matrix with the entries of w on its diagonal.

$$ \begin{aligned} M_{123} &= \mathbb{E}_{P(y)}[\mathbb{E}[x_{1} \otimes x_{2} \otimes x_{3}|y]] = \mathbb{E}_{P(y)}[\mathbb{E}[x_{1}|y] \otimes \mathbb{E}[x_{2}|y] \otimes \mathbb{E}[x_{3}|y]] && \text{(25)}\\ &= \mathbb{E}_{P(y)}[(\mu_{1})_{y} \otimes (\mu_{2})_{y} \otimes (\mu_{3})_{y}] && \text{(26)}\\ &= \sum\limits_{y=1}^{K} w_{y} \cdot (\mu_{1})_{y} \otimes (\mu_{2})_{y} \otimes (\mu_{3})_{y}. && \text{(27)} \end{aligned} $$

The second equalities in (22) and (25) use the conditional independence assumed in the model, i.e., \(P(z|y) = {\prod }_{j=1}^{m} P(z_{j}|y)\).
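As a concrete (unofficial) illustration, the empirical counterparts of these moments can be computed from N joint samples of the one-hot vectors \(x_1, x_2, x_3\) in a few lines of numpy. The array names X1, X2, X3 (each of shape N × K) are illustrative and not from the paper; "(x)" in the comments denotes the tensor product ⊗.

```python
import numpy as np

def empirical_moments(X1, X2, X3):
    """Sample estimates of M_1, M_12, M_123 from one-hot label arrays of shape (N, K)."""
    N = X1.shape[0]
    M1 = X1.mean(axis=0)                               # (K,)      ~ E[x_1]
    M12 = X1.T @ X2 / N                                # (K, K)    ~ E[x_1 (x) x_2]
    M123 = np.einsum('na,nb,nc->abc', X1, X2, X3) / N  # (K, K, K) ~ E[x_1 (x) x_2 (x) x_3]
    return M1, M12, M123
```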

For the (K × K × K) tensor \(M_{123}\), we often use the following notation for its (K × K) projection onto an arbitrary vector \(\eta \in \mathbb {R}^{K}\):

$$ \begin{aligned} M_{123}(\eta) &:= \mathbb{E}[(x_{1} \otimes x_{2}) \cdot (x_{3}^{\top} \eta)] && \text{(28)}\\ &= \mu_{1} \cdot (\text{diag}(w) \cdot \text{diag}(\mu_{3}^{\top} \eta)) \cdot \mu_{2}^{\top}. && \text{(29)} \end{aligned} $$

To find the inverse mapping (i.e., to determine μ and w from the observed sample moments), we first pick three (K × K) matrices \(U_{k}\) (for k = 1, 2, 3) such that \(U_{k}^{\top } \mu _{k}\) is invertible. This can be done by letting \(U_{k}\) be the matrix of left singular vectors of \(\mu _{k}\). Although the true \(\mu _{k}\) is not known at this point, one can use the left singular vectors of \(M_{12}\) instead, using the fact that the column spaces of \(\mu _{1}\) and \(M_{12}\) coincide by (24) and the non-singularity of \(\text {diag}(w) \cdot \mu _{2}^{\top }\).
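A minimal sketch of this step, under the assumption (ours, for illustration; the text only prescribes the choice of \(U_1\) explicitly) that \(U_2\) is taken from the right singular vectors of \(M_{12}\) and \(U_3\) from the right singular vectors of an additional pairwise moment \(M_{13} := \mathbb{E}[x_{1} \otimes x_{3}]\):

```python
def choose_projections(M12, M13):
    """Pick U_1, U_2, U_3 so that each U_k^T mu_k is (generically) invertible."""
    U1, _, Vt12 = np.linalg.svd(M12)   # columns of U1 span col(M_12) = col(mu_1), as in the text
    U2 = Vt12.T                        # assumption: row space of M_12 spans col(mu_2), by (24)
    _, _, Vt13 = np.linalg.svd(M13)
    U3 = Vt13.T                        # assumption: row space of M_13 spans col(mu_3), analogously
    return U1, U2, U3
```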

Now we have the following three lemmas.

Lemma 1

\(U_{1}^{\top } M_{12} U_{2}\) is invertible.

Proof

\(U_{1}^{\top } M_{12} U_{2} = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(w) \cdot (U_{2}^{\top } \mu _{2})^{\top }\), which is a product of non-singular factors. □

Lemma 2

Let \(B_{123}(\eta ) := (U_{1}^{\top } M_{123}(\eta ) U_{2}) \cdot (U_{1}^{\top }M_{12} U_{2})^{-1}\). Then \(B_{123}(\eta ) = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(\mu _{3}^{\top } \eta ) \cdot (U_{1}^{\top } \mu _{1})^{-1}\).

Proof

Using (24) and (29),

$$ \begin{aligned} (U_{1}^{\top} M_{123}(\eta) U_{2}) \cdot (U_{1}^{\top} M_{12} U_{2})^{-1} &= (U_{1}^{\top} \mu_{1}) \cdot (\text{diag}(w) \cdot \text{diag}(\mu_{3}^{\top} \eta)) \cdot (U_{2}^{\top} \mu_{2})^{\top} \cdot ((U_{1}^{\top} \mu_{1}) \cdot \text{diag}(w) \cdot (U_{2}^{\top} \mu_{2})^{\top})^{-1} && \text{(30)}\\ &= (U_{1}^{\top} \mu_{1}) \cdot \text{diag}(\mu_{3}^{\top} \eta) \cdot (U_{1}^{\top} \mu_{1})^{-1}. \qquad \square && \text{(31)} \end{aligned} $$

Lemma 3

\(\eta ^{\top } (\mu _{3})_{j}\) for \(j=1,\dots ,K\) are the eigenvalues of \(B_{123}(\eta )\).

Proof

It follows immediately from (31), which is exactly an eigen-decomposition (diagonalization) of \(B_{123}(\eta)\). □

Note that one can easily compute the sample estimate of \(B_{123}(\eta)\) for any vector η using the empirical moments \(M_{12}\) and \(M_{123}\). Lemma 3 implies that the eigenvalues of \(B_{123}(\eta)\) give partial information about the true model parameters \(\mu_{3}\). One simple recipe to retrieve \(\mu_{3}\) is as follows. Choose η to be the i-th column vector \((U_{3})_{i}\) of \(U_{3}\) (for \(i=1,\dots,K\)), and let \([\lambda_{i,1}, \dots, \lambda_{i,K}]^{\top}\) be the eigenvalues of \(B_{123}((U_{3})_{i})\). Let L be the (K × K) matrix whose (i, j) entry is \(L_{i,j} = \lambda_{i,j}\). From Lemma 3 it is straightforward that \(L = U_{3}^{\top} \mu_{3}\), so \(\mu_{3} = (U_{3})^{-\top} L\). The other \(\mu_{j}\)'s can be recovered in a similar manner.
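A hedged numpy sketch of this recipe, continuing the helpers above. To keep the eigenvalue ordering consistent across the K choices of η, the sketch estimates the (shared) eigenvectors once from \(B_{123}(\eta_0)\) for a random \(\eta_0\) and reuses them; this is one standard way to deal with the ordering issue discussed in the next paragraph, not necessarily the exact procedure of [3, 25].

```python
def recover_mu3(M12, M123, U1, U2, U3, seed=0):
    """Recover mu_3 (up to a common column permutation) from eigenvalues of B_123(eta)."""
    K = M12.shape[0]
    P = np.linalg.inv(U1.T @ M12 @ U2)   # (U_1^T M_12 U_2)^{-1}, invertible by Lemma 1

    def B(eta):                          # B_123(eta) from Lemma 2
        return U1.T @ project_tensor(M123, eta) @ U2 @ P

    # All B_123(eta) share the same eigenvectors (columns of U_1^T mu_1), so estimate them once.
    rng = np.random.default_rng(seed)
    _, V = np.linalg.eig(B(rng.standard_normal(K)))
    Vinv = np.linalg.inv(V)

    # Row i collects the consistently ordered eigenvalues of B_123((U_3)_i),
    # so that L approximates U_3^T mu_3 (Lemma 3).
    L = np.vstack([np.real(np.diag(Vinv @ B(U3[:, i]) @ V)) for i in range(K)])
    return np.linalg.solve(U3.T, L)      # mu_3 = (U_3)^{-T} L
```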

During this process, to be more rigorous, one also has to resolve the ordering of the eigenvalues. This can be handled by matching the eigenvectors, which are shared across the different recovery indices (details can be found in [3, 25]). Finally, the prior label multinomial parameter vector w can be identified using (21): with the empirical estimate \(M_{1}\), we have

$$ w = \mu_{1}^{\dagger} M_{1}, \tag{32} $$

where \(A^{\dagger}\) denotes the Moore-Penrose pseudo-inverse of \(A\).
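In numpy, with the empirical \(M_{1}\) and a recovered \(\mu_{1}\) (denoted mu1 below and obtained analogously to \(\mu_{3}\); an assumption of this snippet), (32) is a one-liner. In practice the result may need clipping and renormalization to lie on the probability simplex:

```python
# Continuing the sketches above: M1 from empirical_moments(...), mu1 recovered like mu3.
w_hat = np.linalg.pinv(mu1) @ M1                        # (32): prior label probabilities
w_hat = np.clip(w_hat, 0, None); w_hat /= w_hat.sum()   # optional projection back onto the simplex
```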

Cite this article

Kim, M. A maximum-likelihood and moment-matching density estimator for crowd-sourcing label prediction. Appl Intell 48, 381–389 (2018). https://doi.org/10.1007/s10489-017-0985-1
