An Extension of Iterative Scaling for Decision and Data Aggregation in Ensemble Classification

The Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology

Abstract

Improved iterative scaling (IIS) is an algorithm for learning maximum entropy (ME) joint and conditional probability models, consistent with specified constraints, that has found great utility in natural language processing and related applications. In most IIS work on classification, discrete-valued “feature functions” are considered, depending on the data observations and class label, with constraints measured as frequency counts taken over hard (0–1) training set instances. Here, we consider the case where the training (and test) sets consist of instances of probability mass functions on the features, rather than hard feature values. IIS extends in a natural way to this case. This has applications (1) to ME classification on mixed discrete-continuous feature spaces and (2) to ME aggregation of soft classifier decisions in ensemble classification. Moreover, we combine these methods, yielding a method with proven learning convergence that jointly performs (soft) decision-level and feature-level fusion in making ensemble decisions. We demonstrate favorable comparisons against standard AdaBoost.M1, input-dependent boosting, and other supervised combining methods, on data sets from the UC Irvine Machine Learning Repository.
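To make the distinction concrete, the short numpy sketch below (illustrative only; the variable names and toy numbers are ours, not the paper's) computes the same joint constraint statistic from hard 0–1 feature instances and from soft pmf instances; the hard case is recovered exactly when each pmf is degenerate (one-hot).

```python
import numpy as np

# Minimal sketch: one discrete feature F_i with |A_i| = 3 values,
# N_c = 2 classes, T = 4 training instances.
labels = np.array([0, 1, 0, 1])                       # hard class labels c^(t)

# Hard feature instances: each row is a one-hot (degenerate) pmf on A_i.
hard = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [1, 0, 0],
                 [0, 0, 1]], dtype=float)

# Soft feature instances: each row is a pmf P[F_i = j | t] over A_i.
soft = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.6, 0.3, 0.1],
                 [0.2, 0.2, 0.6]])

def joint_stats(pmf_rows, labels, n_classes):
    """Empirical statistic P_g[C=c, F_i=j] = (1/T) * sum over instances with
    label c of P[F_i=j | t]; with one-hot rows this is an ordinary frequency
    count, with soft rows it is the pmf-based generalization."""
    T = len(labels)
    return np.stack([pmf_rows[labels == c].sum(axis=0) / T
                     for c in range(n_classes)])

print(joint_stats(hard, labels, 2))   # standard hard frequency counts
print(joint_stats(soft, labels, 2))   # the soft-instance generalization
```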


Notes

  1. Smoothed estimates are also used [6]. However, these are still based on “hard” frequency counts.

  2. We still assume there are hard instances for the class label. However, it is also possible to consider probabilistic (soft) class labels.

  3. Discrete and mixed discrete-continuous feature spaces can also be handled. Restriction here to purely continuous features is simply for clarity, without loss of generality.

  4. Here, as one example, we are considering the case of a (single) mixture model for each vector \(\underline{A}_i\).

  5. Higher-order constraints, which encode dependencies between base classifiers, are also possible. However, they would entail greater complexity and would require a larger training set to accurately measure the constraints.

  6. This search was done with \(N_e\) fixed at 10. The selected number of hidden units was then used for all ensemble sizes.

References

  1. Y. H. Abdel-Haleem, S. Renals, and N. D. Lawrence, “Acoustic Space Dimensionality Selection and Combination Using the Maximum Entropy Principle,” IEEE ICASSP, 2004.

  2. E. Alpaydin, “Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms,” Neural Comput., vol. 11, no. 8, 1999, pp. 1885–1892.


  3. A. Berger, “The Improved Iterative Scaling Algorithm: A Gentle Introduction,” Tutorial (available from http://www.cs.cmu.edu/~aberger/maxent.html).

  4. A. L. Berger, S. Della Pietra, and V. J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” Comput. Linguist., vol. 22, no. 1, 1996, pp. 39–71.


  5. S. Boyd and L. Vandenberghe, “Convex Optimization,” Cambridge University Press, March 2004.

  6. S. F. Chen and R. Rosenfeld, “A Survey of Smoothing Techniques for ME Models,” IEEE Trans. Speech Audio Process., vol. 8, 2000, pp. 37–50.


  7. M. Collins, R. Schapire, and Y. Singer, “Logistic Regression, AdaBoost and Bregman Distances,” Proc. of the 13th Annual Conf. on Comput. Learn. Theory, 2000, pp. 158–169.

  8. J. N. Darroch and D. Ratcliff, “Generalized Iterative Scaling for Log-linear Models,” Ann. Math. Stat., vol. 43, 1972, pp. 1470–1480.


  9. S. Della Pietra, V. Della Pietra, and J. Lafferty, “Inducing Features of Random Fields,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, 1996, pp. 380–393.


  10. Y. Freund and R. Schapire, “Experiments with a New Boosting Algorithm,” Proc. ICML, 1996, pp. 148–156.

  11. N. Friedman, M. Goldszmidt, and T. J. Lee, “Bayesian Network Classification with Continuous Attributes: Getting the Best of Both Discretization and Parametric Fitting,” Proc. ICML, 1998, pp. 179–187.

  12. E. T. Jaynes, “Papers on Probability, Statistics and Statistical Physics,” Reidel, Dordrecht, 1982.


  13. R. Jin, Y. Liu, L. Si, J. Carbonell, and A. Hauptmann, “A New Boosting Algorithm Using Input-dependent Regularizer,” Proc. ICML, 2003.

  14. H. Kang, K. Kim, and J. Kim, “Optimal Approximation of Discrete Probability Distribution with kth-order Dependency and its Application to Combining Multiple Classifiers,” Pattern Recogn. Lett., vol. 18, no. 6, 1997, pp. 515–523.

  15. J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On Combining Classifiers,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 20, no. 3, 1998, pp. 226–239.


  16. R. Kohavi and M. Sahami, “Error-based and Entropy-based Discretization of Continuous Features,” in Proc. of the 2nd International Conference KDD, 1996, pp. 114–119.

  17. R. Lau and M. Sahami, “Adaptive Language Modelling Using the Maximum Entropy Approach,” in Proc. of the ARPA Human Lang. Tech. Workshop, 1993, pp. 110–113.

  18. G. Lebanon and J. Lafferty, “Boosting and Maximum Likelihood for Exponential Models,” NIPS, vol. 15, 2001.

  19. R. Malouf, “A Comparison of Algorithms for Maximum Entropy Parameter Estimation,” in Proc. of the Sixth Conf. on Natural Language Learning, 2002, pp. 49–55.

  20. J. Jeon and R. Manmatha, “Using Maximum Entropy for Automatic Image Annotation,” Image and Video Retrieval: Third Intl. Conf., CIVR, 2004.

  21. R. Meir, R. El-Yaniv, and S. Ben-David, “Localized Boosting,” in Proc. Conf. on Comput. Learning Theory, 2000, pp. 190–199.

  22. D. J. Miller and L. Yan, “Approximate Maximum Entropy Joint Feature Inference Consistent with Arbitrary Lower-order Probability Constraints: Application to Statistical Classification,” Neural Comput., vol. 12, no. 9, 2000, pp. 2175–2207.

  23. D. J. Miller and S. Pal, “Transductive Methods for the Distributed Ensemble Classification Problem,” Neural Comput., in press.

  24. D. J. Miller and L. Yan, “An Approximate Maximum Entropy Method for Classification and more General Inference: Relation to other Maxent Methods and to Naive Bayes,” CISS, 2000.

  25. S. J. Phillips, M. Dudik, and R. E. Schapire, “A Maximum Entropy Approach to Species Distribution Modeling,” ICML, 2004.

  26. A. Schwaighofer, “SVM Toolbox for Matlab,” Available from http://ida.first.fraunhofer.de/~anton/software.html.

  27. G. Schwarz, “Estimating the Dimension of a Model,” Ann. Statist., vol. 6, no. 2, 1978, pp. 461–464.


  28. P. Smyth, “Clustering Using Monte Carlo Cross-validation,” KDD, 1996, pp. 126–133.

  29. T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, “A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms,” Mach. Learn., 2000, pp. 203–229.

  30. N. Ueda and R. Nakano, “Combining Discriminant-based Classifiers Using the Minimum Classification Error Discriminant,” IEEE W. on NNSP, 1997, pp. 365–374.

  31. S. Wang, D. Schuurmans, and Y. Zhao, “The Latent Maximum Entropy Principle,” IEEE Trans. on Inf. Theory, 2002. (Submitted).

  32. L. Yan and D. J. Miller, “Critic-driven Ensemble Classification via a Learning Method Akin to Boosting,” in Intell. Eng. Sys. Through ANN 1, 2001, pp. 27–32.

  33. L. Yan and D. J. Miller, “General Statistical Inference for Discrete and Mixed Spaces by an Approximate Application of the Maximum Entropy Principle,” IEEE Trans. on NN, vol. 11, no. 3, 2000, pp. 558–573.



Author information

Correspondence to David J. Miller.

Appendices

Appendix A

For an arbitrary parameter vector \(\underline{\gamma}\), the conditional log-likelihood is

$$L(\underline{\gamma}) = \sum_{t=1}^{T}\ln\left[\frac{\exp\left(\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P[F_i=j\,|\,t]\,\gamma\!\left(C=c^{(t)},F_i=j\right)\right)}{\sum_{k=1}^{N_c}\exp\left(\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P[F_i=j\,|\,t]\,\gamma\!\left(C=k,F_i=j\right)\right)}\right]$$
(22)
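As a concrete reading of Eq. (22), the following numpy sketch (our notation, not the authors' code) evaluates the conditional log-likelihood for soft feature instances. It assumes, for simplicity, that every feature shares a common alphabet size J, with `P_feat[t, i, j]` playing the role of \(P[F_i=j|t]\), `gamma[k, i, j]` of \(\gamma(C=k,F_i=j)\), and `labels[t]` of \(c^{(t)}\).

```python
import numpy as np

def conditional_log_likelihood(gamma, P_feat, labels):
    """Evaluate L(gamma) as in Eq. (22) for soft feature instances.

    gamma : (N_c, N_d, J) parameters gamma(C=k, F_i=j)
    P_feat: (T, N_d, J)  soft instances P[F_i=j | t] (each row over j sums to 1)
    labels: (T,)         hard class labels c^(t) in {0, ..., N_c - 1}
    """
    # s[t, k] = sum_i sum_j P[F_i=j|t] * gamma(C=k, F_i=j)
    s = np.einsum('tij,kij->tk', P_feat, gamma)
    # numerically stable log of the class-normalization term (denominator of Eq. (22))
    m = s.max(axis=1, keepdims=True)
    log_norm = m + np.log(np.exp(s - m).sum(axis=1, keepdims=True))
    log_post = s - log_norm                      # log P[C=k | P_f^(t)]
    return log_post[np.arange(len(labels)), labels].sum()
```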

For a change in the parameter vector \(\underline{\Delta\gamma}\), the change in log-likelihood is \(L(\underline{\gamma}+\underline{\Delta\gamma})-L(\underline{\gamma})\). Using the identity \(-\ln(\alpha)\geq 1-\alpha\), we obtain the lower bound

$$\begin{aligned} L(\underline{\gamma}+\underline{\Delta\gamma}) - L(\underline{\gamma}) \geq{}& \sum_{t=1}^{T}\left[\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P[F_i=j\,|\,t]\,\Delta\gamma\!\left(C=c^{(t)},F_i=j\right) + 1\right] \\ &- \sum_{t=1}^{T}\sum_{k=1}^{N_c} P\!\left[C=k\,\big|\,P_{\underline{f}^{(t)}}\right] e^{\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P[F_i=j|t]\,\Delta\gamma\left(C=k,F_i=j\right)} \;\hat{=}\; B\!\left(\underline{\Delta\gamma}\,\big|\,\underline{\gamma}\right) \end{aligned}$$

With \(\underline{\Delta\gamma}\) chosen so that \(B(\underline{\Delta\gamma}|\underline{\gamma}) \geq 0\), the log-likelihood does not decrease (and strictly increases when the bound is positive). The obvious approach, then, is to maximize \(B(\underline{\Delta\gamma}|\underline{\gamma})\). However, \(B(\underline{\Delta\gamma}|\underline{\gamma})\) depends on the individual components \(\left\{\Delta\gamma(C=c,{F}_{i}=j)\right\}\) in a coupled fashion, which would necessitate a complicated joint optimization. Thus, we seek an auxiliary function that decouples this dependence. We first rewrite

$$\begin{aligned} B\!\left(\underline{\Delta\gamma}\,\big|\,\underline{\gamma}\right) ={}& \sum_{t=1}^{T}\left[\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P[F_i=j\,|\,t]\,\Delta\gamma\!\left(C=c^{(t)},F_i=j\right) + 1\right] \\ &- \sum_{t=1}^{T}\sum_{k=1}^{N_c} P\!\left[C=k\,\big|\,P_{\underline{f}^{(t)}}\right] e^{\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} \frac{P[F_i=j|t]}{N_d}\left[N_d\,\Delta\gamma\left(C=k,F_i=j\right)\right]} \end{aligned}$$
(23)

Then, we note that \(\sum\nolimits_{i=1}^{N_{d}}\sum\nolimits_{j\in{\cal A}_{i}}\frac{P[{F}_{i}=j|t]}{N_{d}}=1\), i.e., \(\left\{\frac{1}{N_d}P[{F}_{i}=j|t], \hspace{0.05in} i=1,...,N_d,\hspace{0.05in} j\in{\cal A}_{i}\right\} \) is an instance of the joint pmf \(P[I,F_I]\) associated with first selecting a feature \(i \in \left\{1,...,N_d\right\}\) and then a feature value \(f_i \in {\cal A}_{i}\). Applying Jensen’s inequality \(e^{\sum_{x} p(x)q(x)} \leq \sum_{x} p(x) e^{q(x)}\) to the right-hand side of Eq. (23), we have

$$\begin{aligned} B\!\left(\underline{\Delta\gamma}\,\big|\,\underline{\gamma}\right) \geq{}& T + \sum_{t=1}^{T}\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P[F_i=j\,|\,t]\,\Delta\gamma\!\left(C=c^{(t)},F_i=j\right) \\ &- \frac{1}{N_d}\sum_{t=1}^{T}\sum_{k=1}^{N_c}\sum_{i=1}^{N_d}\sum_{j\in\mathcal{A}_i} P\!\left[C=k\,\big|\,P_{\underline{f}^{(t)}}\right] P[F_i=j\,|\,t]\, e^{N_d\,\Delta\gamma\left(C=k,F_i=j\right)} \;\hat{=}\; A\!\left(\underline{\Delta\gamma}\,\big|\,\underline{\gamma}\right). \end{aligned}$$
(24)

Thus \(L(\underline{\gamma}+\underline{\Delta\gamma})-L(\underline{\gamma}) \geq B(\underline{\Delta\gamma}|\underline{\gamma}) \geq A(\underline{\Delta\gamma}|\underline{\gamma})\), i.e., we have a new, looser lower bound. Let \(\underline{\Delta \gamma^{\ast}} = \arg\max_{\underline{\Delta \gamma}} A(\cdot)\). Since it is easy to verify that \(A(\underline{0}|\underline{\gamma})=0\), it must be true that \(A(\underline{\Delta \gamma^{\ast}}|\underline{\gamma}) \geq 0\), i.e., property A1 in Section 2.3 is satisfied by this function. Moreover, \(A(\underline{\Delta\gamma}|\underline{\gamma})\) decouples additively into individual terms, each depending on a single \(\Delta\gamma(C=k,{F}_{n}=q)\). Differentiating \(A(\underline{\Delta\gamma}|\underline{\gamma})\) with respect to \(\Delta\gamma(C=k,{F}_{n}=q)\) and equating to zero gives the choice of \(\underline{\Delta\gamma}\) that maximizes \(A(\underline{\Delta\gamma}|\underline{\gamma})\), i.e.,

$$\begin{aligned} \Delta\gamma^{*}\!\left(C=k,F_n=q\right) &= \frac{1}{N_d}\ln\left[\frac{\sum_{t=1:\,c^{(t)}=k}^{T} P[F_n=q\,|\,t]}{\sum_{t=1}^{T} P[F_n=q\,|\,t]\,P\!\left[C=k\,\big|\,P_{\underline{f}^{(t)}}\right]}\right] \\ &= \frac{1}{N_d}\ln\frac{P_g\!\left[C=k,F_n=q\right]}{P_m\!\left[C=k,F_n=q\right]}, \qquad \forall\, k,n,q. \end{aligned}$$
(25)
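The update of Eq. (25) admits a compact implementation. The sketch below (array layout as in the earlier log-likelihood sketch; the names are ours) forms the soft "generalized counts" in the numerator and the model-matched statistics in the denominator, then applies the \(\frac{1}{N_d}\ln(\cdot)\) update; the small `eps` guard against empty cells is our addition, not part of the derivation.

```python
import numpy as np

def iis_update(gamma, P_feat, labels, n_classes, eps=1e-12):
    """One soft-instance IIS step: return gamma + Delta_gamma* per Eq. (25)."""
    T, N_d, J = P_feat.shape
    # current model posteriors P[C=k | P_f^(t)], as in the Eq. (22) sketch
    s = np.einsum('tij,kij->tk', P_feat, gamma)
    s -= s.max(axis=1, keepdims=True)
    post = np.exp(s)
    post /= post.sum(axis=1, keepdims=True)                        # (T, N_c)
    # P_g[k, n, q]: soft count of (C=k, F_n=q) over instances labeled k
    P_g = np.stack([P_feat[labels == k].sum(axis=0) for k in range(n_classes)])
    # P_m[k, n, q]: sum_t P[F_n=q|t] * P[C=k | P_f^(t)]
    P_m = np.einsum('tij,tk->kij', P_feat, post)
    # Delta_gamma* = (1/N_d) * ln(P_g / P_m), applied componentwise
    return gamma + np.log((P_g + eps) / (P_m + eps)) / N_d
```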

Appendix B

Theorem 1

Consider the function \(A(\underline{\Delta\gamma}|\underline{\gamma})\) defined in Eq. (24). Let \(\underline{\Delta \gamma^{\ast}} = \arg\max_{\underline{\Delta \gamma}} A(\underline{\Delta \gamma}|\underline{\gamma})\). Then, \(A(\underline{\Delta\gamma^{\ast}}|\underline{\gamma})=0\) iff \(\underline{\Delta \gamma^{\ast}} = \underline{0}\). When this occurs, the constraints (4) are all met.

Proof

Setting \(\frac{\partial A(\underline{\Delta\gamma}|\underline{\gamma})}{\partial\Delta\gamma(F_{i}=j,\,C=c)}=0\) gives the solution in Eq. (25). Moreover, this is a maximum, since \(\frac{\partial^{2} A(\underline{\Delta\gamma}|\underline{\gamma})}{\partial\Delta\gamma^{2}(F_{i}=j,\,C=c)} < 0 \;\; \forall\, i,j,c\). Thus, \(\underline{\Delta \gamma^{\ast}} = \left\{ \Delta\gamma^{\ast}(F_{i}=j,\,C=c): \hspace{0.05in} \forall i,\hspace{0.05in} \forall j \in {\cal A}_i, \hspace{0.05in} \forall c \in C \right\}.\)

Plugging the solution for \(\underline{\Delta \gamma^{\ast}}\) back into \(A(\underline{\Delta\gamma}|\underline{\gamma})\) in Eq. (24) and simplifying gives

$$\begin{aligned} A\!\left(\underline{\Delta\gamma^{*}}\,\big|\,\underline{\gamma}\right) ={}& T + \frac{1}{N_d}\sum_{c}\sum_{t=1:\,c^{(t)}=c}^{T}\sum_{i}\sum_{j} P[F_i=j\,|\,t]\,\ln\left[\frac{\sum_{t'=1:\,c^{(t')}=c}^{T} P[F_i=j\,|\,t']}{\sum_{t'=1}^{T} P[F_i=j\,|\,t']\,P\!\left[C=c\,\big|\,P_{\underline{f}^{(t')}}\right]}\right] \\ &- \frac{1}{N_d}\sum_{i}\sum_{j}\sum_{c=1}^{N_c}\sum_{t=1:\,c^{(t)}=c}^{T} P[F_i=j\,|\,t] \end{aligned}$$
(26)

Noting that the third term equals T, we have

$$\begin{aligned} A\!\left(\underline{\Delta\gamma^{*}}\,\big|\,\underline{\gamma}\right) &= \frac{1}{N_d}\sum_{c}\sum_{t=1:\,c^{(t)}=c}^{T}\sum_{i}\sum_{j} P[F_i=j\,|\,t]\,\ln\left[\frac{\sum_{t'=1:\,c^{(t')}=c}^{T} P[F_i=j\,|\,t']}{\sum_{t'=1}^{T} P[F_i=j\,|\,t']\,P\!\left[C=c\,\big|\,P_{\underline{f}^{(t')}}\right]}\right] \\ &= -\frac{1}{N_d}\sum_{c}\sum_{t=1:\,c^{(t)}=c}^{T}\sum_{i}\sum_{j} P[F_i=j\,|\,t]\,\ln\left[\frac{\frac{1}{T}\sum_{t'=1}^{T} P[F_i=j\,|\,t']\,P\!\left[C=c\,\big|\,P_{\underline{f}^{(t')}}\right]}{\frac{1}{T}\sum_{t'=1:\,c^{(t')}=c}^{T} P[F_i=j\,|\,t']}\right] \end{aligned}$$

Now, since \(E(-\ln(\cdot)) \ge -\ln(E(\cdot))\) by Jensen’s inequality, we have

$$A\!\left(\underline{\Delta\gamma^{*}}\,\big|\,\underline{\gamma}\right) \geq -\frac{T}{N_d}\sum_{i}\ln\left[\sum_{j,c}\frac{1}{T}\sum_{t=1:\,c^{(t)}=c}^{T} P[F_i=j\,|\,t]\left[\frac{\frac{1}{T}\sum_{t'=1}^{T} P[F_i=j\,|\,t']\,P\!\left[C=c\,\big|\,P_{\underline{f}^{(t')}}\right]}{\frac{1}{T}\sum_{t'=1:\,c^{(t')}=c}^{T} P[F_i=j\,|\,t']}\right]\right]$$
$$\begin{aligned} &= -\frac{T}{N_d}\sum_{i}\ln\left[\sum_{j}\sum_{c}\frac{1}{T}\sum_{t'=1}^{T} P[F_i=j\,|\,t']\,P\!\left[C=c\,\big|\,P_{\underline{f}^{(t')}}\right]\right] \\ &= -\frac{T}{N_d}\sum_{i}\ln(1) = 0, \end{aligned}$$

i.e., \(A(\underline{\Delta \gamma^{\ast}}|\underline{\gamma}) \ge 0\). Finally, by strict convexity of \(-\ln\left(\cdot\right)\), equality is achieved iff the argument of the \(\ln\left(\cdot\right)\) equals 1. But this occurs iff

$$\sum_{t=1}^{T} P[F_i=j\,|\,t]\,P\!\left[C=c\,\big|\,P_{\underline{f}^{(t)}}\right] = \sum_{t=1:\,c^{(t)}=c}^{T} P[F_i=j\,|\,t] \qquad \forall\, i,j,c,$$

i.e., by Eq. (25), iff \(\underline{\Delta \gamma^{\ast}} = \underline{0}\). Clearly, when this occurs, the constraints are met.▪
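Putting Appendices A and B together: each application of Eq. (25) cannot decrease the conditional log-likelihood, and the iteration may be stopped once \(\underline{\Delta\gamma^{\ast}}\) is numerically zero, at which point Theorem 1 says the constraints (4) are met. A hypothetical driver loop (our sketch, reusing the `iis_update` function from the earlier code block; tolerance and iteration cap are arbitrary) could look as follows.

```python
import numpy as np

def train_soft_iis(P_feat, labels, n_classes, tol=1e-6, max_iter=1000):
    """Iterate the Eq. (25) update until Delta_gamma* ~ 0 (constraints met)."""
    T, N_d, J = P_feat.shape
    gamma = np.zeros((n_classes, N_d, J))            # start from the uniform model
    for _ in range(max_iter):
        new_gamma = iis_update(gamma, P_feat, labels, n_classes)  # earlier sketch
        if np.max(np.abs(new_gamma - gamma)) < tol:  # Delta_gamma* numerically zero
            break
        gamma = new_gamma
    return gamma
```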

About this article

Pal, S., Miller, D.J. An Extension of Iterative Scaling for Decision and Data Aggregation in Ensemble Classification. J VLSI Sign Process Syst Sign Im 48, 21–37 (2007). https://doi.org/10.1007/s11265-006-0009-6