Abstract
Improved iterative scaling (IIS) is an algorithm for learning maximum entropy (ME) joint and conditional probability models, consistent with specified constraints, that has found great utility in natural language processing and related applications. Most IIS work on classification considers discrete-valued “feature functions” of the data observations and class label, with constraints measured via frequency counts taken over hard (0–1) training set instances. Here, we consider the case where the training (and test) sets consist of instances of probability mass functions over the features, rather than hard feature values. IIS extends naturally to this case. This has applications (1) to ME classification on mixed discrete-continuous feature spaces and (2) to ME aggregation of soft classifier decisions in ensemble classification. Moreover, we combine these methods, yielding a method with proven learning convergence that jointly performs (soft) decision-level and feature-level fusion in making ensemble decisions. We demonstrate favorable comparisons against standard AdaBoost.M1, input-dependent boosting, and other supervised combining methods on data sets from the UC Irvine Machine Learning Repository.
Notes
Smoothed estimates are also used [6]. However, these are still based on “hard” frequency counts.
We still assume there are hard instances for the class label. However, it is also possible to consider probabilistic (soft) class labels.
Discrete and mixed discrete-continuous feature spaces can also be handled. Restriction here to purely continuous features is simply for clarity, without loss of generality.
Here, as one example, we are considering the case of a (single) mixture model for each vector \(\underline{A}_i\).
Higher-order constraints, which encode dependencies between base classifiers, are also possible. However, they would entail greater complexity and a larger training set to accurately measure the constraints.
This search was done with \(N_e\) fixed at 10. The selected number of hidden units was then used for all ensemble sizes.
References
Y. H. Abdel-Haleem, S. Renals, and N. D. Lawrence. “Acoustic Space Dimensionality Selection and Combination Using the Maximum Entropy Principle,” IEEE ICASSP, 2004.
E. Alpaydin, “Combined 5 × 2 cv F Test for Comparing Supervised Classification Learning Algorithms,” Neural Comput., vol. 11, no. 8, 1999, pp. 1885–1892.
A. Berger, The Improved Iterative Scaling Algorithm: A Gentle Introduction, tutorial. (Available from http://www.cs.cmu.edu/~aberger/maxent.html).
A. L. Berger, S. Della Pietra, and V. J. Della Pietra, “A Maximum Entropy Approach to Natural Language Processing,” Comput. Linguist., vol. 22, no. 1, 1996, pp. 39–71.
S. Boyd and L. Vandenberghe, “Convex Optimization,” Cambridge University Press, 2004.
S. F. Chen and R. Rosenfeld, “A Survey of Smoothing Techniques for ME Models,” IEEE Trans. Speech Audio Process., vol. 8, 2000, pp. 37–50.
M. Collins, R. Schapire, and Y. Singer, “Logistic Regression, AdaBoost and Bregman Distances,” Proc. of the 13th Annual Conf. on Comput. Learn. Theory, 2000, pp. 158–169.
J. N. Darroch and D. Ratcliff, “Generalized Iterative Scaling for Log-linear Models,” Ann. Math. Stat., vol. 43, 1972, pp. 1470–1480.
S. Della Pietra, V. Della Pietra, and J. Lafferty, “Inducing Features of Random Fields,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, 1996, pp. 380–393.
Y. Freund and R. Schapire, “Experiments with a New Boosting Algorithm,” Proc. ICML, 1996, pp. 148–156.
N. Friedman, M. Goldszmidt, and T. J. Lee, “Bayesian Network Classification with Continuous Attributes: Getting the Best of Both Discretization and Parametric Fitting,” Proc. ICML, 1998, pp. 179–187.
E. T. Jaynes, “Papers on Probability, Statistics and Statistical Physics,” Reidel, Dordrecht, 1982.
R. Jin, Y. Liu, L. Si, J. Carbonell, and A. Hauptmann, “A New Boosting Algorithm Using Input-dependent Regularizer,” Proc. ICML, 2003.
H. Kang, K. Kim, and J. Kim, “Optimal Approximation of Discrete Probability Distribution with kth-order Dependency and its Application to Combining Multiple Classifiers,” Pattern Recogn. Lett., vol. 18, no. 6, 1997, pp. 515–523.
J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On Combining Classifiers,” IEEE Trans. Pattern. Anal. Mach. Intell., vol. 20, no. 3, 1998, pp. 226–239.
R. Kohavi and M. Sahami, “Error-based and Entropy-based Discretization of Continuous Features,” in Proc. of the 2nd International Conference KDD, 1996, pp. 114–119.
R. Lau and M. Sahami, “Adaptive Language Modelling Using the Maximum Entropy Approach,” in Proc. of the ARPA Human Lang. Tech. Workshop, 1993, pp. 110–113.
G. Lebanon and J. Lafferty. “Boosting and Maximum Likelihood for Exponential Models,” NIPS, vol. 15, 2001.
R. Malouf, “A Comparison of Algorithms for Maximum Entropy Parameter Estimation,” in Proc. of the Sixth Conf. on Natural Language Learning, 2002, pp. 49–55.
J. Jeon and R. Manmatha, “Using Maximum Entropy for Automatic Image Annotation,” Image and Video Retrieval: Third Intl. Conf., CIVR, 2004.
R. Meir, R. El-Yaniv, and S. Ben-David, “Localized Boosting,” in Proc. Conf. on Comput. Learning Theory, 2000, pp. 190–199.
D. J. Miller and L. Yan, “Approximate Maximum Entropy Joint Feature Inference Consistent with Arbitrary Lower-order Probability Constraints: Application to Statistical Classification,” Neural Comput., vol. 12, no. 9, 2000, pp. 2175–2207.
D. J. Miller and S. Pal, “Transductive Methods for the Distributed Ensemble Classification Problem,” Neural Comput (in press).
D. J. Miller and L. Yan, “An Approximate Maximum Entropy Method for Classification and more General Inference: Relation to other Maxent Methods and to Naive Bayes,” CISS, 2000.
S. J. Phillips, M. Dudik, and R. E. Schapire, “A Maximum Entropy Approach to Species Distribution Modeling,” ICML, 2004.
A. Schwaighofer, “SVM Toolbox for Matlab,” Available from http://ida.first.fraunhofer.de/~anton/software.html.
G. Schwarz, “Estimating the Dimension of a Model,” Ann. Statist., vol. 6, no. 2, 1978, pp. 461–464.
P. Smyth, “Clustering Using Monte Carlo Cross-validation,” KDD, 1996, pp. 126–133.
T.-S. Lim, W.-Y. Loh, and Y.-S. Shih, “A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms,” Mach. Learn., vol. 40, 2000, pp. 203–229.
N. Ueda and R. Nakano, “Combining Discriminant-based Classifiers Using the Minimum Classification Error Discriminant,” IEEE Workshop on Neural Networks for Signal Processing (NNSP), 1997, pp. 365–374.
S. Wang, D. Schuurmans, and Y. Zhao, “The Latent Maximum Entropy Principle,” IEEE Trans. on Inf. Theory, 2002. (Submitted).
L. Yan and D. J. Miller, “Critic-driven Ensemble Classification via a Learning Method Akin to Boosting,” in Intell. Eng. Sys. Through ANN 1, 2001, pp. 27–32.
L. Yan and D. J. Miller, “General Statistical Inference for Discrete and Mixed Spaces by an Approximate Application of the Maximum Entropy Principle,” IEEE Trans. on NN, vol. 11, no. 3, 2000, pp. 558–573.
Appendices
Appendix A
For an arbitrary parameter vector \(\underline{\gamma}\), the conditional log-likelihood is
For a change in the parameter vector \(\underline{\Delta\gamma}\), the change in log-likelihood is \(L(\underline{\gamma}+\underline{\Delta\gamma})-L(\underline{\gamma})\). Using the identity \(-\ln(\alpha)\geq 1-\alpha\), we obtain the lower bound
With \(\underline{\Delta\gamma}\) chosen so that \(B(\underline{\Delta\gamma}|\underline{\gamma}) > 0\), there is improvement in the log-likelihood. The obvious approach, then, is to maximize \(B(\underline{\Delta\gamma}|\underline{\gamma})\). However, \(B(\underline{\Delta\gamma}|\underline{\gamma})\) has a coupled dependence on the individual components \(\left\{\Delta\gamma(C=c,{F}_{i}=j)\right\}\), which would necessitate a complicated joint optimization. Thus, we seek an auxiliary function that decouples this dependence. We first rewrite
Then, we note that \(\sum\nolimits_{i=1}^{N_{d}}\sum\nolimits_{j\in{\cal A}_{i}}\frac{P[{F}_{i}=j|t]}{N_{d}}=1\) i.e., \(\left\{\frac{1}{N_d}P[{F}_{i}=j|t], \hspace{0.05in} i=1,...,N_d,\hspace{0.05in} j\in{\cal A}_{i}\right\} \) is an instance of the joint pmf \(P[I,F_I]\), associated with first selecting a feature \(i \in \left\{1,...,N_d\right\}\) and then a feature value \(f_i \in {\cal A}_{i}\). Applying Jensen’s inequality \(e^{\sum_{x} p(x)q(x)} \leq \sum_{x} p(x) e^{q(x)}\) to the right hand side of Eq. (23), we have
Thus \(L(\underline{\gamma}+\underline{\Delta\gamma})-L(\underline{\gamma}) \geq B(\underline{\Delta\gamma}|\underline{\gamma}) \geq A(\underline{\Delta\gamma}|\underline{\gamma})\), i.e., we have a new, not as tight, lower bound. Let \(\underline{\Delta \gamma^{\ast}} = \arg\max_{\underline{\Delta \gamma}} A(\cdot)\). Since it is easy to verify that \(A(\underline{0}|\underline{\gamma})=0\), it must be true that \(A(\underline{\Delta \gamma^{\ast}}|\underline{\gamma}) \geq 0\), i.e., property A1 in Section 2.3 is satisfied by this function. Moreover, \(A(\underline{\Delta\gamma}|\underline{\gamma})\) can be additively decoupled into individual terms each depending on a single \(\Delta\gamma(C=k,{F}_{n}=q)\). Differentiating \(A(\underline{\Delta\gamma}|\underline{\gamma})\) with respect to \(\Delta\gamma(C=k,{F}_{n}=q)\) and equating to 0 we get the choice of \(\underline{\Delta\gamma}\) to maximize \(A(\underline{\Delta\gamma}|\underline{\gamma})\), i.e.,
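Because each soft instance contributes total feature mass \(N_d\) (each of the \(N_d\) feature pmfs sums to 1, as used in the construction of \(P[I,F_I]\) above), the decoupled maximization admits a closed-form, generalized-iterative-scaling-style step. The following toy sketch illustrates this kind of update and its monotone likelihood improvement; the data, sizes, and all variable names are hypothetical, not taken from the paper, and the model assumed is the conditional ME form \(P[C=c|\underline{x}] \propto \exp(\sum_{i,j}\gamma(c,i,j)\,x_{ij})\):

```python
import numpy as np

# Toy setup (hypothetical): T soft instances, Nd features, each feature a
# pmf over A values, C classes with hard labels.
rng = np.random.default_rng(0)
T, Nd, A, C = 40, 3, 4, 2

x = rng.dirichlet(np.ones(A), size=(T, Nd))   # x[t, i, :] is a pmf (sums to 1)
y = np.arange(T) % C                          # hard class labels

gamma = np.zeros((C, Nd, A))                  # one parameter per (c, F_i = j)

def posteriors(gamma):
    # P[C=c | x_t] proportional to exp( sum_{i,j} gamma[c,i,j] * x[t,i,j] )
    s = np.einsum('cij,tij->tc', gamma, x)
    s -= s.max(axis=1, keepdims=True)         # numerical stability
    p = np.exp(s)
    return p / p.sum(axis=1, keepdims=True)

def loglik(gamma):
    return float(np.log(posteriors(gamma)[np.arange(T), y]).sum())

# Observed (soft) constraint values for each (c, i, j): class-conditional
# averages of the soft feature masses.
obs = np.stack([x[y == c].sum(axis=0) for c in range(C)]) / T

ll = [loglik(gamma)]
for _ in range(50):
    p = posteriors(gamma)
    exp_ = np.einsum('tc,tij->cij', p, x) / T    # model expectations
    gamma = gamma + np.log(obs / exp_) / Nd      # closed-form scaling step
    ll.append(loglik(gamma))
# ll is non-decreasing: each scaling step cannot decrease the log-likelihood
```

Because the total soft feature mass per instance is the constant \(N_d\), the \(1/N_d\) step size makes this the classical Darroch–Ratcliff update, to which IIS reduces in this constant-mass case.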
Appendix B
Theorem 1
Consider the function \(A(\underline{\Delta\gamma}|\underline{\gamma})\) defined in Eq. (24). Let \(\underline{\Delta \gamma^{\ast}} = \arg\max_{\underline{\Delta \gamma}} A(\underline{\Delta \gamma}|\underline{\gamma})\). Then, \(A(\underline{\Delta\gamma^{\ast}}|\underline{\gamma})=0\) iff \(\underline{\Delta \gamma^{\ast}} = \underline{0}\). When this occurs, the constraints (4) are all met.
Proof
Setting \( \dfrac{\partial A(\underline{\Delta\gamma}|\underline{\gamma})}{\partial\Delta\gamma(F_{i}=j,C=c)}=0\) gives the solution in Eq. (25). Moreover, this is a maximum, since \(\dfrac{\partial^{2} A(\underline{\Delta\gamma}|\underline{\gamma})}{\partial\Delta\gamma^{2}(F_{i}=j,C=c)} < 0 \;\;\forall\, i,j,c\). Thus, \(\underline{\Delta \gamma^{\ast}} = \left\{ \Delta\gamma^{\ast}(F_{i}=j,C=c) \hspace{0.05in} \forall i,\hspace{0.05in} \forall j \in {\cal A}_i, \hspace{0.05in} \forall c \in C \right\}.\)
Plugging the solution for \(\underline{\Delta \gamma^{\ast}}\) back into \(A(\underline{\Delta\gamma}|\underline{\gamma})\) in Eq. (24) and simplifying gives
Noting that the third term equals T, we have
Now, since \(E(-\ln(\cdot)) \ge -\ln(E(\cdot))\) by Jensen’s inequality, we have
i.e., \(A(\underline{\Delta \gamma^{\ast}}|\underline{\gamma}) \ge 0\). Finally, by strict convexity of \(-\ln\left(\cdot\right)\), equality is achieved iff the argument of the \(\ln\left(\cdot\right)\) is 1. But this occurs iff
i.e., by Eq. (25), iff \(\underline{\Delta \gamma^{\ast}} = \underline{0}\). Clearly, when this occurs, the constraints are met.▪
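The Jensen step used in the proof, \(E(-\ln X)\ge -\ln(E(X))\), holds exactly for any empirical distribution over a positive sample, with equality only when the sample is constant; a throwaway numerical spot-check (names hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.1, 5.0, size=1000)   # a positive random sample of X
lhs = np.mean(-np.log(x))              # E[-ln X] under the empirical pmf
rhs = -np.log(np.mean(x))              # -ln E[X]
assert lhs >= rhs                      # Jensen: -ln is convex

# Strict convexity: equality requires X constant.
c = np.full(10, 2.0)
assert abs(np.mean(-np.log(c)) - (-np.log(np.mean(c)))) < 1e-12
```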
Pal, S., Miller, D.J. An Extension of Iterative Scaling for Decision and Data Aggregation in Ensemble Classification. J VLSI Sign Process Syst Sign Im 48, 21–37 (2007). https://doi.org/10.1007/s11265-006-0009-6