Abstract
In a multinomial model, the sample space is partitioned into disjoint cells, and the partition is usually held fixed while the cell counts are sampled. In this paper, we extend the multinomial model to the incomplete multinomial model by relaxing the fixed-partition assumption: the cells are allowed to vary, and counts collected from non-disjoint cells are modeled jointly for inference on the common underlying probability. The incomplete multinomial likelihood is parameterized by the complete-cell probabilities from the most refined partition. Its sufficient statistics include the variable-cell formation, observed as an indicator matrix, and all cell counts. With externally imposed structures on the cell formation process, it reduces to special models, including the Bradley–Terry model and the Plackett–Luce model. Because the conventional approach of solving for the zeros of the score functions is unfruitful, we develop a new approach that yields a simpler set of estimating equations for the maximum likelihood estimate (MLE). The approach seeks to maximize all multiplicative components of the likelihood simultaneously by fitting each component into an inequality, so that estimation amounts to solving the system of equality attainment conditions for these inequalities. The resulting MLE equations are simple and immediately suggest a fixed-point iteration, which we refer to as the weaver algorithm. The weaver algorithm is short and amenable to parallel implementation. We also derive the asymptotic covariance of the MLE, verify the main results with simulations, and compare the weaver algorithm with an MM/EM algorithm by fitting a Plackett–Luce model to a benchmark data set.
Acknowledgements
The authors are grateful to the two referees, Associate Editor, and Editor for their insightful comments that have significantly improved the article. Yin’s research was supported in part by a grant (17326316) from the Research Grants Council of Hong Kong.
Appendices
Appendix A: Proof of Lemma 1
Proof
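(Work with \(x_{i}/a_{i}\) and connect to the weighted AM–GM inequality, with its equality condition.) Rewrite the target inequality as
\[\prod _{i=1}^{n}\left( \frac{x_{i}}{a_{i}}\right) ^{a_{i}}\leqslant \left( \frac{\sum _{i=1}^{n}a_{i}\cdot \frac{x_{i}}{a_{i}}}{\sum _{i=1}^{n}a_{i}}\right) ^{\sum _{i=1}^{n}a_{i}}.\]
By substituting \(y_{i}\) for \(x_{i}/a_{i}\) and taking the \(\left( {\sum \limits _{i=1}^{n}{a_{i}}}\right) \)-th root on both sides, we have
\[\prod _{i=1}^{n}y_{i}^{a_{i}/\sum _{j=1}^{n}a_{j}}\leqslant \frac{\sum _{i=1}^{n}a_{i}y_{i}}{\sum _{i=1}^{n}a_{i}}.\]
After a further substitution of \(w_{i}=a_{i}/\sum _{i=1}^{n}a_{i}\), we arrive at
\[\prod _{i=1}^{n}y_{i}^{w_{i}}\leqslant \sum _{i=1}^{n}w_{i}y_{i},\]
which is the weighted AM–GM inequality. Each step above is an equivalence, so equality holds in the target inequality if and only if it holds in the weighted AM–GM inequality, that is, if and only if all \(y_{i}\) are equal. Hence all equalities hold jointly if and only if \(x_{i}/a_{i}=\tau \) for all i, for some common constant \(\tau \), which must be positive. \(\square \)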
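The inequality proved above can be spot-checked numerically. The sketch below assumes Lemma 1 has the form \(\prod _{i}(x_{i}/a_{i})^{a_{i}}\leqslant (\sum _{i}x_{i}/\sum _{i}a_{i})^{\sum _{i}a_{i}}\), which is the form implied by Example 5; the function name `lemma1_gap` and the random test ranges are illustrative choices, not part of the paper.

```python
import math
import random

def lemma1_gap(x, a):
    """Return log(RHS) - log(LHS) for the assumed form of Lemma 1:
    prod (x_i / a_i)^a_i <= (sum x_i / sum a_i)^(sum a_i)."""
    s_a = sum(a)
    log_lhs = sum(ai * math.log(xi / ai) for xi, ai in zip(x, a))
    log_rhs = s_a * math.log(sum(x) / s_a)
    return log_rhs - log_lhs

random.seed(0)
gaps = []
for _ in range(1000):
    n = random.randint(2, 6)
    x = [random.uniform(0.1, 10.0) for _ in range(n)]
    a = [random.uniform(0.1, 10.0) for _ in range(n)]
    gaps.append(lemma1_gap(x, a))

# The gap should be nonnegative for every draw (up to rounding error).
min_gap = min(gaps)

# Equality case: x_i / a_i equals a common positive constant tau.
a_eq = [3.0, 2.0, 7.0]
tau = 0.37
equal_gap = lemma1_gap([tau * ai for ai in a_eq], a_eq)
```

Working on the log scale avoids overflow when the exponents \(a_{i}\) are large; the gap vanishes only along the ray \(x_{i}=\tau a_{i}\).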
Appendix B: Examples and Corollaries of Lemma 1
Example 5
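\(\left( x_{1}+x_{2}\right) ^{5}\geqslant \frac{5^{5}}{3^{3}2^{2}}x_{1}^{3}x_{2}^{2}\). This inequality holds because, by Lemma 1 with \(a=(3,2)\),
\[\left( \frac{x_{1}}{3}\right) ^{3}\left( \frac{x_{2}}{2}\right) ^{2}\leqslant \left( \frac{x_{1}+x_{2}}{3+2}\right) ^{3+2},\]
where the equality is attained if and only if \((x_{1},x_{2})\) is collinear with (3, 2).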
Example 6
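\(\left( x_{1}+x_{2}\right) ^{7}x_{3}^{3}x_{4}^{5} \leqslant \frac{{3^{3}}{5^{5}}{7^{7}}}{{15}^{15}} \left( {x_{1}}+{x_{2}}+{x_{3}}+{x_{4}}\right) ^{15}\). This inequality holds because, by Lemma 1 applied to \((x_{1}+x_{2},\,x_{3},\,x_{4})\) with \(a=(7,3,5)\),
\[\left( \frac{x_{1}+x_{2}}{7}\right) ^{7}\left( \frac{x_{3}}{3}\right) ^{3}\left( \frac{x_{4}}{5}\right) ^{5}\leqslant \left( \frac{x_{1}+x_{2}+x_{3}+x_{4}}{7+3+5}\right) ^{7+3+5},\]
where the equality is attained if and only if \((x_{1}+x_{2},\,x_{3},\,x_{4})\) is collinear with (7, 3, 5). More importantly, together with the inequality in the previous example, the two equalities are jointly attained if and only if \((x_{1},\,x_{2},\,x_{3},\,x_{4})\) is collinear with (21, 14, 15, 25).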
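The joint attainment claimed in Examples 5 and 6 can be confirmed with exact integer arithmetic. The sketch below clears denominators in both inequalities so that only Python integers are involved; the helper names `example5_sides` and `example6_sides` are hypothetical.

```python
def example5_sides(x1, x2):
    """Example 5 with denominators cleared: lhs >= rhs is claimed."""
    lhs = 3**3 * 2**2 * (x1 + x2) ** 5
    rhs = 5**5 * x1**3 * x2**2
    return lhs, rhs

def example6_sides(x1, x2, x3, x4):
    """Example 6 with denominators cleared: lhs <= rhs is claimed."""
    lhs = 15**15 * (x1 + x2) ** 7 * x3**3 * x4**5
    rhs = 3**3 * 5**5 * 7**7 * (x1 + x2 + x3 + x4) ** 15
    return lhs, rhs

# At (21, 14, 15, 25) both inequalities should hold with equality.
joint = (21, 14, 15, 25)
l5, r5 = example5_sides(*joint[:2])
l6, r6 = example6_sides(*joint)
both_tight = (l5 == r5) and (l6 == r6)

# Shifting mass between x1 and x2 keeps x1 + x2 = 35, so Example 6 stays
# tight, but (22, 13) is no longer collinear with (3, 2), so Example 5 is strict.
l5b, r5b = example5_sides(22, 13)
l6b, r6b = example6_sides(22, 13, 15, 25)
```

The perturbed point illustrates why the two equality conditions must be solved as a system: each condition alone leaves some directions free.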
Corollary 1
If we require \(\sum _{i=1}^{n}{x_{i}}=\sum _{i=1}^{n}{a_{i}}=1\) in Lemma 1, then
\[\prod _{i=1}^{n}x_{i}^{a_{i}}\leqslant \prod _{i=1}^{n}a_{i}^{a_{i}},\]
and the equality is attained if and only if \(x_{i}=a_{i}\) for \(i=1,\ldots ,n\).
Corollary 2
Let \(\varvec{x}\in (0,+\infty )^{n}\) be a vector of n positive reals. Let \(\varvec{\delta }\in \{0,1\}^{n}\) be a vector of n bits. Let \(\varvec{\beta }\in [0,+\infty )^{n}\) be a nonzero vector of n nonnegative reals such that \(\beta _{j}=0\) if \(\delta _{j}=0\). Let \(b=\sum _{i=1}^{n}{\beta _{i}}>0\). Define \(0^{0}=1\). Then
\[\prod _{i=1}^{n}x_{i}^{\beta _{i}}\leqslant \frac{\prod _{i=1}^{n}\beta _{i}^{\beta _{i}}}{b^{b}}\left( \sum _{i=1}^{n}\delta _{i}x_{i}\right) ^{b},\]
where the equality is attained if and only if there exists a positive k such that \(x_{i}/\beta _{i}=k\) for every i with \(\delta _{i}=1\).
Example 7
Let \(n=5\), \(\varvec{\delta }=(1,0,1,0,1)^{\intercal }\), \(\varvec{\beta }=(3,0,4,0,6)^{\intercal }\), \(b=3+0+4+0+6=13\). Then \(\forall \varvec{x}\in (0,+\infty )^{n}\), we have
\[x_{1}^{3}x_{3}^{4}x_{5}^{6}\leqslant \frac{3^{3}4^{4}6^{6}}{13^{13}}\left( x_{1}+x_{3}+x_{5}\right) ^{13},\]
which attains the equality if and only if \(x_{1}:x_{3}:x_{5}=3:4:6\).
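Example 7 can be checked on the log scale. The sketch below assumes that Corollary 2 reads \(\prod _{i}x_{i}^{\beta _{i}}\leqslant \left( \prod _{i}\beta _{i}^{\beta _{i}}/b^{b}\right) \left( \sum _{i}\delta _{i}x_{i}\right) ^{b}\), which is an inference from the data of Example 7 rather than a quotation; the function name and the test points are hypothetical.

```python
import math

def corollary2_log_sides(x, delta, beta):
    """Logs of both sides of the assumed Corollary 2 inequality:
    prod x_i^beta_i <= (prod beta_i^beta_i / b^b) * (sum delta_i x_i)^b."""
    b = sum(beta)
    # Entries with beta_i == 0 contribute x_i^0 = 1 and 0^0 = 1, so skip them.
    log_lhs = sum(bi * math.log(xi) for xi, bi in zip(x, beta) if bi > 0)
    log_const = sum(bi * math.log(bi) for bi in beta if bi > 0) - b * math.log(b)
    masked_sum = sum(di * xi for di, xi in zip(delta, x))
    log_rhs = log_const + b * math.log(masked_sum)
    return log_lhs, log_rhs

delta = [1, 0, 1, 0, 1]
beta = [3, 0, 4, 0, 6]

# x1 : x3 : x5 = 3 : 4 : 6, while x2 and x4 are arbitrary positive values.
lhs_eq, rhs_eq = corollary2_log_sides([1.5, 9.0, 2.0, 0.2, 3.0], delta, beta)

# Changing x5 breaks the ratio, so the inequality should become strict.
lhs_ne, rhs_ne = corollary2_log_sides([1.5, 9.0, 2.0, 0.2, 4.0], delta, beta)
```

The entries masked out by \(\varvec{\delta }\) do not affect either side, which is why \(x_{2}\) and \(x_{4}\) can take any positive value at the equality point.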
Corollary 3
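If we rescale each \(x_{i}\) by an independent positive constant \(c_{i}\), then we have a seemingly more general but in fact equivalent formulation of Lemma 1,
\[\prod _{i=1}^{n}\left( \frac{c_{i}x_{i}}{a_{i}}\right) ^{a_{i}}\leqslant \left( \frac{\sum _{i=1}^{n}c_{i}x_{i}}{\sum _{i=1}^{n}a_{i}}\right) ^{\sum _{i=1}^{n}a_{i}},\]
which attains the equality if and only if there exists some positive constant k such that \({{c_{i}}{x_{i}}}/{a_{i}}=k\) for all i.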
Example 8
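Let \(n=3\), \(a=(1,2,3)\), \(c=(4,5,6)\). Then we have
\[\left( 4x_{1}\right) ^{1}\left( \frac{5x_{2}}{2}\right) ^{2}\left( \frac{6x_{3}}{3}\right) ^{3}\leqslant \left( \frac{4x_{1}+5x_{2}+6x_{3}}{1+2+3}\right) ^{1+2+3}.\]
Therefore,
\[x_{1}x_{2}^{2}x_{3}^{3}\leqslant \frac{1}{200}\left( \frac{4x_{1}+5x_{2}+6x_{3}}{6}\right) ^{6},\]
which attains equality if and only if \(4{x_{1}}={5{x_{2}}}/2={6{x_{3}}}/3\), or \({x_{1}}:{x_{2}}:{x_{3}}=5:8:10\).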
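The ratio \(5:8:10\) in Example 8 can be verified exactly with rational arithmetic. The sketch below assumes the rescaled inequality \(\prod _{i}(c_{i}x_{i}/a_{i})^{a_{i}}\leqslant (\sum _{i}c_{i}x_{i}/\sum _{i}a_{i})^{\sum _{i}a_{i}}\) implied by the equality condition of Corollary 3; the function name `example8_sides` is hypothetical and the inputs are restricted to positive integers.

```python
from fractions import Fraction

a = (1, 2, 3)
c = (4, 5, 6)

def example8_sides(x):
    """Exact (lhs, rhs) of prod (c_i x_i / a_i)^a_i <= (sum c_i x_i / sum a_i)^(sum a_i)
    for a tuple x of positive integers."""
    lhs = Fraction(1)
    for ci, xi, ai in zip(c, x, a):
        lhs *= Fraction(ci * xi, ai) ** ai
    total = sum(ci * xi for ci, xi in zip(c, x))
    rhs = Fraction(total, sum(a)) ** sum(a)
    return lhs, rhs

# At x = (5, 8, 10) each scaled ratio c_i x_i / a_i equals 20.
lhs_eq, rhs_eq = example8_sides((5, 8, 10))

# A point off the ray gives a strict inequality.
lhs_ne, rhs_ne = example8_sides((1, 1, 1))

# Constant obtained when c_i / a_i is factored out of the left-hand side.
scale = Fraction(4, 1) ** 1 * Fraction(5, 2) ** 2 * Fraction(6, 3) ** 3
```

Using `Fraction` keeps the equality test exact, so no floating-point tolerance is needed.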
Corollary 4
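Generalizing Corollary 3 to a linear transform \(\varvec{U}\) on vector \(\varvec{x}\),
\[\prod _{i}\left( \frac{(\varvec{U}\varvec{x})_{i}}{a_{i}}\right) ^{a_{i}}\leqslant \left( \frac{\sum _{i}(\varvec{U}\varvec{x})_{i}}{\sum _{i}a_{i}}\right) ^{\sum _{i}a_{i}},\]
which attains the equality if and only if
\[\varvec{U}\varvec{x}=k\varvec{a},\]
where k is a constant and can be solved explicitly under an extra constraint such as an affine constraint on \(\varvec{x}\).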
Example 9
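Letting \(x_{1}=2y_{1}+y_{2}\) and \(x_{2}=y_{1}+2y_{2}\) in the first case of Example 5, we have
\[\left( 2y_{1}+y_{2}\right) ^{3}\left( y_{1}+2y_{2}\right) ^{2}\leqslant \frac{3^{3}2^{2}}{5^{5}}\left( 3y_{1}+3y_{2}\right) ^{5},\]
which attains equality if and only if \((2y_{1}+y_{2})/3=(y_{1}+2y_{2})/2\), that is, \(y_{1}=4y_{2}\). By requiring the constraint \(y_{1}+y_{2}=1\) on the solution, it follows
\[y_{1}=\frac{4}{5},\qquad y_{2}=\frac{1}{5},\]
and the unique maximum of \({\left( {2{y_{1}}+{y_{2}}}\right) ^{3}} {\left( {{y_{1}}+2{y_{2}}}\right) ^{2}}\) attained is \({{2^{2}}{3^{8}}}/{5^{5}}\approx 8.398\).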
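The constrained maximum in Example 9 can be cross-checked by brute force. The sketch below evaluates the objective on a grid over \(y_{1}\in (0,1)\) with \(y_{2}=1-y_{1}\) and compares the grid maximum with the closed form \(2^{2}3^{8}/5^{5}\); the grid resolution is an arbitrary illustrative choice.

```python
def objective(y1):
    """(2*y1 + y2)^3 * (y1 + 2*y2)^2 under the constraint y1 + y2 = 1."""
    y2 = 1.0 - y1
    return (2 * y1 + y2) ** 3 * (y1 + 2 * y2) ** 2

# Interior grid points of (0, 1) with step 0.001.
grid = [k / 1000 for k in range(1, 1000)]

best_y1 = max(grid, key=objective)      # grid maximizer
best_val = objective(best_y1)           # grid maximum
closed_form = 2**2 * 3**8 / 5**5        # value from the equality condition
```

The equality condition gives the maximizer directly, whereas the grid search needs hundreds of function evaluations to locate the same point.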
We recursively apply the inequality to the objective, as this inequality transforms the maximization problem into a set of equality attainment conditions, which becomes a system of simple equations.
Appendix C: Proof of the ascent property and the linear rate of convergence of the weaver algorithm when s is sufficiently large
We instead maximize the log-likelihood with a Lagrange multiplier term to incorporate the equality constraint,
where the Lagrange multiplier is the known constant
not adding an extra unknown.
The derivative of \(\ell (\varvec{p})\) with respect to \(p_{i}\) at iteration k is given by
Combining the weaver steps 1 and 2, \(p_{i}^{(k)}\) is updated according to
We seek to establish the positivity of the quantity
where
It is now clear that the condition for the last quantity to be positive is \(v^{\left( k\right) }>0\). Under this condition, every step of the iteration increases \(\ell (\varvec{p})\). Since \(\ell (\varvec{p})\) is bounded from above, the iteration converges.
Next, we show the rate of convergence is linear. We denote the ith component of the solution as \(p_{i}^{(*)}\) and use the simpler symbol g to denote the derivative function \( g\left( p_{i}\right) \equiv \frac{\partial \ell (\varvec{p})}{\partial p_{i}}, \) hence \(g\left( p_{i}^{\left( *\right) }\right) =0\). We assume \(\ell (\varvec{p})\) is locally concave at \(\varvec{p}^{\left( *\right) }\) and assume g to be Lipschitz continuous, viz. there exists a positive constant L such that, for all pairs of \(\left( p,q\right) \) in the domain, \(\left| g\left( p\right) -g\left( q\right) \right| \le L\left| p-q\right| \). Then, we have
and further,
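If \(p_{i}^{\left( k\right) }<p_{i}^{\left( *\right) }\), then \(g\left( {p_{i}^{\left( k\right) }}\right) >0\) and \(\frac{{g\left( {p_{i}^{\left( k\right) }}\right) }}{{p_{i}^{\left( k\right) }-p_{i}^{\left( *\right) }}}<0\). Therefore,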
If \(p_{i}^{\left( k\right) }>p_{i}^{\left( *\right) }\), then \(g\left( {p_{i}^{\left( k\right) }}\right) <0\). Therefore, \(\frac{{g\left( {p_{i}^{\left( k\right) }}\right) }}{{p_{i}^{\left( k\right) }-p_{i}^{\left( *\right) }}}<0\) and
In both cases, the numerator is smaller than the denominator, hence \(\left| \frac{{p_{i}^{\left( {k+1}\right) } -p_{i}^{\left( *\right) }}}{{p_{i}^{\left( k\right) } -p_{i}^{\left( *\right) }}}\right| <1\) and the rate of convergence is linear.
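The two properties established above, monotone ascent and a linear rate, can be observed empirically on a small fixed-point iteration. The weaver update itself is not reproduced in this appendix, so the sketch below swaps in a different algorithm, the classical MM update for a Bradley–Terry model, purely as an illustration; the win counts are hypothetical, and the names `loglik` and `mm_step` are illustrative.

```python
import math

# Hypothetical paired-comparison data: wins[i][j] = times item i beat item j.
wins = [[0, 3, 5],
        [2, 0, 4],
        [1, 3, 0]]
n = len(wins)

def loglik(p):
    """Bradley-Terry log-likelihood at strength vector p."""
    return sum(wins[i][j] * (math.log(p[i]) - math.log(p[i] + p[j]))
               for i in range(n) for j in range(n) if i != j)

def mm_step(p):
    """One classical MM update for Bradley-Terry (not the weaver update)."""
    new = []
    for i in range(n):
        total_wins = sum(wins[i])
        denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                    for j in range(n) if j != i)
        new.append(total_wins / denom)
    s = sum(new)
    return [v / s for v in new]       # renormalize so the strengths sum to 1

p = [1.0 / n] * n
history = [p]
for _ in range(200):
    p = mm_step(p)
    history.append(p)

logliks = [loglik(q) for q in history]
p_star = history[-1]                  # proxy for the fixed point

# Sup-norm errors of the early iterates and their successive ratios.
errors = [max(abs(qi - si) for qi, si in zip(q, p_star)) for q in history[:8]]
ratios = [errors[k + 1] / errors[k]
          for k in range(len(errors) - 1) if errors[k] > 0]
```

A nondecreasing `logliks` sequence corresponds to the ascent property, and error ratios that stay below one correspond to a linear rate of convergence.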
Appendix D: Ranking results of the car racing data
See Table 4.
Cite this article
Dong, F., Yin, G. Maximum likelihood estimation for incomplete multinomial data via the weaver algorithm. Stat Comput 28, 1095–1117 (2018). https://doi.org/10.1007/s11222-017-9782-2
DOI: https://doi.org/10.1007/s11222-017-9782-2
Keywords
- Bradley–Terry model
- Contingency table
- Count data
- Density estimation
- Incomplete multinomial model
- Plackett–Luce model
- Random partition
- Ranking
- Weaver algorithm