Abstract
We deal with the parameter estimation problem for probability density models with latent variables. Traditionally, the expectation maximization (EM) algorithm has been widely used for this problem. However, it suffers from bad local maxima, and the quality of the estimator is sensitive to the initial model choice. Recently, an alternative density estimator has been proposed that is based on matching the model-averaged moments to the sample-averaged ones. This moment matching estimator is typically used as the initial iterate for the EM algorithm for further refinement. However, there is no guarantee that the EM-refined estimator still yields moments close enough to the sample-averaged ones. Motivated by this issue, in this paper we propose a novel estimator that combines the merits of both worlds: we perform likelihood maximization, but use the moment discrepancy score as a regularizer that prevents the model-averaged moments from straying away from those estimated from data. On several crowd-sourcing label prediction problems, we demonstrate that the proposed approach yields more accurate density estimates than the existing estimators.
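In symbols, the proposed estimator can be sketched as the regularized maximum-likelihood problem below (a schematic form only; the squared-norm discrepancy and the weight \(\lambda\) are illustrative assumptions, not necessarily the paper's exact choices):

```latex
\hat{\theta} \;=\; \arg\max_{\theta} \;
  \sum_{i=1}^{n} \log P_{\theta}\big(x^{(i)}\big)
  \;-\; \lambda \,
  \Big\| \frac{1}{n} \sum_{i=1}^{n} \phi\big(x^{(i)}\big)
         \;-\; \mathbb{E}_{P_{\theta}}[\phi(x)] \Big\|^{2}
```

The first term is the usual log-likelihood; the second penalizes the gap between the sample-averaged and model-averaged moments of the feature map \(\phi\).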
Notes
Here h is marginalized out, namely \(P_{\theta }(x) = {\sum }_{h} P_{\theta }(x,h)\).
As a feature vector ϕ(z), we take the one used by the MM estimator, namely ϕ(z) comprising \(x_{1}\), \(x_{1} \otimes x_{2}\), \(x_{1} \otimes x_{3}\), and \(x_{1} \otimes x_{2} \otimes x_{j}\) for \(j=3,\dots ,m\), where \(x_{j}\) is the K-dimensional one-hot vector for \(z_{j}\). This feature representation is shown to identify the model parameters via an inverse mapping [3].
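As a concrete illustration, this feature map can be sketched as follows (the function name `phi` and the flat concatenation into a single vector are our own choices, not the paper's code):

```python
import numpy as np

def phi(z, K):
    # Sketch of the MM feature vector described above: with one-hot
    # encodings x_j of the labels z_j, collect x_1, x_1 (x) x_2,
    # x_1 (x) x_3, and x_1 (x) x_2 (x) x_j for j = 3, ..., m.
    x = [np.eye(K)[zj] for zj in z]                       # x_j: K-dim one-hot
    triples = [np.einsum('a,b,c->abc', x[0], x[1], x[j])  # x_1 (x) x_2 (x) x_j
               for j in range(2, len(z))]
    return np.concatenate([x[0].ravel(),
                           np.outer(x[0], x[1]).ravel(),
                           np.outer(x[0], x[2]).ravel()]
                          + [t.ravel() for t in triples])
```

For m workers and K classes the resulting vector has dimension \(K + 2K^{2} + (m-2)K^{3}\).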
We report results up to m = 25, since values of m larger than 25 resulted in almost perfect prediction for most estimators (e.g., error rates below 1%).
References
Anandkumar A, Foster DP, Hsu D, Kakade SM, Liu YK (2015) A spectral algorithm for latent Dirichlet allocation. Algorithmica 72(1):193–214
Anandkumar A, Ge R, Hsu D, Kakade SM, Telgarsky M (2014) Tensor decompositions for learning latent variable models. J Mach Learn Res 15:2773–2832
Anandkumar A, Hsu D, Kakade SM (2012) A method of moments for mixture models and hidden Markov models. In: 25th annual conference on learning theory
Belkin M, Sinha K (2015) Polynomial learning of distribution families. SIAM J Comput 44(4):889–911
Bishop C (2007) Pattern recognition and machine learning. Springer, Berlin
Dalvi N, Dasgupta A, Kumar R, Rastogi V (2013) Aggregating crowdsourced binary ratings. In: Proceedings of world wide web conference
Dawid AP, Skene AM (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. J R Stat Soc Ser C 28(1):20–28
Debole F, Sebastiani F (2003) Supervised term weighting for automated text categorization. In: Proceedings of the ACM symposium on Applied computing
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B 39(1):1–38
Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L (2009) ImageNet: a large-scale hierarchical image database. In: IEEE conference on computer vision and pattern recognition
Deng ZH, Tang SW, Yang DQ, Zhang M, Li LY, Xie KQ (2004) A comparative study on feature weight in text categorization. Advanced Web Technologies and Applications. Lect Notes Comput Sci 3007:588–597
Diamond S, Boyd S (2016) CVXPY: a Python-embedded modeling language for convex optimization. J Mach Learn Res 17(83):1–5
Ghosh A, Kale S, McAfee P (2011) Who moderates the moderators? Crowdsourcing abuse detection in user-generated content. In: Proceedings of the ACM conference on electronic commerce
Hsu D, Kakade SM (2013) Learning mixtures of spherical Gaussians: moment methods and spectral decompositions. In: Proceedings of the 4th conference on innovations in theoretical computer science
Liu Q, Peng J, Ihler AT (2012) Variational inference for crowdsourcing. In: Advances in neural information processing systems
Lofberg J (2004) YALMIP: a toolbox for modeling and optimization in MATLAB. In: Proceedings of the IEEE international symposium on computer aided control systems design
Moitra A, Valiant G (2010) Settling the polynomial learnability of mixtures of Gaussians. In: 51st annual IEEE symposium on foundations of computer science
Raghunathan A, Frostig R, Duchi J, Liang P (2016) Estimation from indirect supervision with linear moments. In: International conference on machine learning
Raykar VC, Yu S, Zhao LH, Valadez GH, Florin C, Bogoni L, Moy L (2010) Learning from crowds. J Mach Learn Res 11:1297–1322
Sarkar P, Siddiqi SM, Gordon GJ (2007) A latent space approach to dynamic embedding of co-occurrence data. In: Proceedings of the 11th international conference on artificial intelligence and statistics
Snow R, O’Connor B, Jurafsky D, Ng AY (2008) Cheap and fast? But is it good?: Evaluating non-expert annotations for natural language tasks. In: Proceedings of the conference on empirical methods in natural language processing
Sorensen DC (1982) Newton’s method with a model trust region modification. SIAM J Numer Anal 19(2):409–426
Wang Y, Xie B, Song L (2016) Isotonic Hawkes processes. In: International conference on machine learning
Yuan YX (2015) Recent advances in trust region algorithms. Math Program 151(1):249–281
Zhang Y, Chen X, Zhou D, Jordan MI (2014) Spectral methods meet EM: a provably optimal algorithm for crowdsourcing. In: Advances in neural information processing systems
Zhou D, Liu Q, Platt JC, Meek C (2014) Aggregating ordinal labels from crowds by minimax conditional entropy. In: International conference on machine learning
Zhou D, Platt JC, Basu S, Mao Y (2012) Learning from the wisdom of crowds by minimax entropy. In: Advances in neural information processing systems
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF-2016R1A1A1A05921948).
Ethics declarations
Conflict of interest
The author has no conflict of interest.
Consent for Publication
This research does not involve human participants or animals. Consent to submit this manuscript has been received tacitly from the author’s institution, Seoul National University of Science & Technology.
Appendix: Moment matching estimation for Dawid-Skene models
This appendix provides the detailed derivation for the moment matching estimation based on the theorems in [3, 25].
The assumption here is that the true model parameters satisfy: i) \(w_{y} > 0\) for all \(y=1,\dots ,K\), and ii) each \(\mu _{j}\) is full rank, i.e., \(\text {rank}(\mu _{j}) = K\). As we described in Section 2, the moments are taken over three types of features: the one-hot vectors, their pairwise products, and their triple (tensor) products. For convenience of exposition, we fix the index set {1, 2, 3}, although one can replace it with any subset of cardinality three from \(\{1,\dots ,m\}\). That is, the moments are defined as: \(M_{1}:=\mathbb {E}[x_{1}]\), \(M_{12}:= \mathbb {E}[x_{1} \otimes x_{2}]\), and \(M_{123}:= \mathbb {E}[x_{1} \otimes x_{2} \otimes x_{3}]\), where all expectations are taken with respect to \(P_{\theta }(\cdot )\) (dropped hereafter for notational simplicity). They are a (K × 1) vector, a (K × K) matrix, and a (K × K × K) tensor, respectively. First we have the following analytic formulas for the moments:
where \(\text {diag}(w)\) is the (K × K) diagonal matrix with the entries of w on its diagonal.
In the second equalities in (22) and (25), we use the conditional independence assumed in the model (i.e., \(P(z|y) = {\prod }_{j=1}^{m} P(z_{j}|y)\)).
For the (K × K × K) tensor \(M_{123}\), we often use the following (K × K) matrix notation obtained by projecting along an arbitrary vector \(\eta \in \mathbb {R}^{K}\):
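For concreteness, under the model's conditional independence these moments and the projected matrix admit the following closed forms (a sketch reconstructed from the standard Dawid-Skene moment identities in [3], with \((\mu_{j})_{y} = \mathbb{E}[x_{j} \mid y]\) denoting the y-th column of \(\mu_{j}\)):

```latex
M_{1} = \sum_{y} w_{y}\, (\mu_{1})_{y} = \mu_{1} w, \qquad
M_{12} = \sum_{y} w_{y}\, (\mu_{1})_{y} \otimes (\mu_{2})_{y}
       = \mu_{1}\, \mathrm{diag}(w)\, \mu_{2}^{\top},

M_{123} = \sum_{y} w_{y}\, (\mu_{1})_{y} \otimes (\mu_{2})_{y} \otimes (\mu_{3})_{y}, \qquad
M_{123}(\eta) = \mu_{1}\, \mathrm{diag}(\mu_{3}^{\top}\eta)\, \mathrm{diag}(w)\, \mu_{2}^{\top}.
```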
To find the inverse mapping (i.e., to determine μ and w from the observed sample moments M), we first pick three arbitrary (K × K) matrices \(U_{k}\) (for k = 1, 2, 3) such that \(U_{k}^{\top } \mu _{k}\) is invertible. This can be done by letting \(U_{k}\) be the matrix of the left singular vectors of \(\mu _{k}\). Although we do not know the true \(\mu _{k}\) at this point, one can use the left singular vectors of \(M_{12}\) instead, using the fact that the column spaces of \(\mu _{1}\) and \(M_{12}\) coincide, which follows from (24) and the non-singularity of \(\text {diag}(w) \cdot \mu _{2}^{\top }\).
Now we have the following three lemmas.
Lemma 1
\(U_{1}^{\top } M_{12} U_{2}\) is invertible.
Proof
\(U_{1}^{\top } M_{12} U_{2} = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(w) \cdot (U_{2}^{\top } \mu _{2})^{\top }\), which is a product of non-singular factors. □
Lemma 2
Let \(B_{123}(\eta ) := (U_{1}^{\top } M_{123}(\eta ) U_{2}) \cdot (U_{1}^{\top }M_{12} U_{2})^{-1}\). Then \(B_{123}(\eta ) = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(\mu _{3}^{\top } \eta ) \cdot (U_{1}^{\top } \mu _{1})^{-1}\).
Proof
Since \(M_{123}(\eta ) = \mu _{1} \cdot \text {diag}(\mu _{3}^{\top } \eta ) \cdot \text {diag}(w) \cdot \mu _{2}^{\top }\), we have \(U_{1}^{\top } M_{123}(\eta ) U_{2} = (U_{1}^{\top } \mu _{1}) \cdot \text {diag}(\mu _{3}^{\top } \eta ) \cdot \text {diag}(w) \cdot (U_{2}^{\top } \mu _{2})^{\top }\). Multiplying on the right by \((U_{1}^{\top } M_{12} U_{2})^{-1} = ((U_{2}^{\top } \mu _{2})^{\top })^{-1} \cdot \text {diag}(w)^{-1} \cdot (U_{1}^{\top } \mu _{1})^{-1}\), which exists by Lemma 1, cancels all factors except \((U_{1}^{\top } \mu _{1}) \cdot \text {diag}(\mu _{3}^{\top } \eta ) \cdot (U_{1}^{\top } \mu _{1})^{-1}\). □
Lemma 3
\(\eta ^{\top } \cdot (\mu _{3})_{j}\) for \(j=1,\dots ,K\) are the eigenvalues of \(B_{123}(\eta )\).
Proof
This immediately follows from (31), which has the well-known diagonalization form. □
Note that one can easily compute the sample estimate of \(B_{123}(\eta )\) for any vector η using the empirical moments \(M_{12}\) and \(M_{123}\). Lemma 3 implies that the eigenvalues of \(B_{123}(\eta )\) give partial information about the true model parameter \(\mu _{3}\). One simple recipe to retrieve \(\mu _{3}\) is as follows. We choose η to be the i-th row vector \((U_{3})_{i}\) of \(U_{3}\) (for \(i=1,\dots ,K\)), and let \([\lambda _{i,1}, \dots , \lambda _{i,K}]^{\top }\) be the eigenvalues of \(B_{123}((U_{3})_{i})\). Let L be the (K × K) matrix whose (i, j) entry is \(L_{i,j} = \lambda _{i,j}\). From Lemma 3, it is straightforward that \(L = U_{3}^{\top } \mu _{3}\), and we get \(\mu _{3} = (U_{3})^{-\top } L\). The other \(\mu _{j}\)'s can be recovered in a similar manner.
During this process, to be more rigorous, one has to deal with the remaining issue of eigenvalue ordering. However, this can be handled easily by ordering/matching the eigenvectors that are shared among the different recovery indices (details can be found in [3, 25]). Finally, the prior label multinomial parameter vector w can be identified using (21). That is, with the empirical estimate \(M_{1}\), we have:
\(w = \mu _{1}^{\dagger } M_{1}\), where \(A^{\dagger }\) is the Moore-Penrose pseudo-inverse of A.
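This final identification step can be sketched as (an illustrative helper of our own naming, assuming \(\mu_{1}\) has already been recovered with full column rank):

```python
import numpy as np

def recover_w(mu1, M1):
    # Since M_1 = mu_1 w, the Moore-Penrose pseudo-inverse recovers the
    # prior label probabilities w when mu_1 has full column rank.
    return np.linalg.pinv(mu1) @ M1
```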
Cite this article
Kim, M. A maximum-likelihood and moment-matching density estimator for crowd-sourcing label prediction. Appl Intell 48, 381–389 (2018). https://doi.org/10.1007/s10489-017-0985-1