Large margin principle in hyperrectangle learning
Introduction
Besides pure classification accuracy, human interpretability is becoming increasingly important in many real world applications. For example, the US federal law “Equal Credit Opportunity Act” requires financial institutions to provide substantive reasons when credit applications are rejected [1]. Since low scores or bare classification results are clearly not sufficient reasons, institutions have to rely on interpretable models instead, for example rule-based ones, which provide more detailed explanations. Beyond legal requirements, interpretability often plays a major role in the user acceptance of machine learning models. In the medical domain, for example, most doctors are not willing to blindly prescribe treatments solely based on the diagnoses of black box models. Enabling users to extract the learned concepts for validation and knowledge acquisition is another crucial advantage of comprehensible models.
In general, interpretability is a very subjective concept. It is hard both to formalize and to measure. Nevertheless, the hypothesis language is one of the main criteria. While, for example, decision rules are a very natural way of expressing knowledge, arbitrary hyperplanes in multidimensional feature spaces, polytopes or mixtures of Gaussian probability distributions are hard to grasp. In this paper, we focus on axis-parallel hyperrectangle models in numerical domains, as they can be directly transformed into equivalent decision rules. Decision trees are one well-known example of this model type.
Despite their popularity in many application areas, interpretable models like decision trees are often discarded in favor of black box models due to their weaker prediction accuracy. Support vector machines in particular have become very popular over the last two decades. One of their key features and a crucial performance factor is the large margin principle. More precisely, it has been proven that the margin of a linear classifier is inversely related to its statistical risk. The large margin principle is not limited to linear classifiers, however, but has also proven relevant to other machine learning approaches such as boosting and clustering.
In this paper, we introduce a new way to combine the interpretability of hyperrectangle models with the beneficial effects of the large margin principle. As a foundation, we describe the basic ideas behind our approach and provide a formalized margin criterion for hyperrectangle models in Section 3. Subsequently, we present the new meta learning approach LMRL and describe the applied algorithms in Section 4. Finally, we provide experimental results in Section 5.
Section snippets
Supervised learning and the large margin principle
“The concept of Large Margins has recently been identified as a unifying principle for analyzing many different approaches to the problem of learning to classify data from examples, including Boosting [2], Mathematical Programming, Neural Networks and Support Vector Machines. The fact that it is the margin or confidence level of a classification (i.e., a scale parameter) rather than the raw training error that matters has become a key tool in recent years when dealing with classifiers” [3].
Hyperrectangle models as supervised clustering
In this paper, we especially consider supervised learning restricted to hyperrectangles as the hypothesis language. Based on a given training set, the goal is to infer classification models with both high prediction accuracy and good interpretability. Definition 3.1 (Hyperrectangle Models). Classification models in a numerical feature space that are defined by a set of axis-parallel hyperrectangles $R = \{r_1, \ldots, r_m\}$. The boundaries of each hyperrectangle $r_i$ are represented by lower bounds $l_{i,d}$ and upper bounds $u_{i,d}$. Thereby, $u_{i,d}$ is the upper bound of $r_i$ in dimension $d$.
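To make the definition concrete, the following Python sketch models such a classifier and its direct translation into equivalent decision rules. It is a minimal illustration under our own naming (Hyperrectangle, to_rule, predict), not the paper's implementation.

```python
# Minimal sketch of an axis-parallel hyperrectangle model (illustrative
# names, not the paper's code): per-dimension lower/upper bounds define
# the rectangle, and each rectangle translates directly into one rule.
from dataclasses import dataclass
import numpy as np

@dataclass
class Hyperrectangle:
    lower: np.ndarray   # l_{i,d}: lower bound per dimension
    upper: np.ndarray   # u_{i,d}: upper bound per dimension
    label: int          # class predicted inside the rectangle

    def contains(self, x: np.ndarray) -> bool:
        # x lies in the rectangle iff l_{i,d} <= x_d <= u_{i,d} for all d
        return bool(np.all((self.lower <= x) & (x <= self.upper)))

    def to_rule(self, names) -> str:
        # Direct translation into an equivalent decision rule
        conds = [f"{lo:g} <= {n} <= {hi:g}"
                 for n, lo, hi in zip(names, self.lower, self.upper)]
        return "IF " + " AND ".join(conds) + f" THEN class = {self.label}"

def predict(rectangles, x, default=-1):
    # Classify by membership; uncovered regions get a default label.
    for r in rectangles:
        if r.contains(x):
            return r.label
    return default

r = Hyperrectangle(np.array([0.2, 0.1]), np.array([0.6, 0.9]), label=1)
print(r.to_rule(["x1", "x2"]))  # IF 0.2 <= x1 <= 0.6 AND 0.1 <= x2 <= 0.9 THEN class = 1
print(predict([r], np.array([0.4, 0.5])))  # 1
```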
The basic approach
Large Margin Rectangle Learning (LMRL) is a new meta learning approach that aims to incorporate the large margin principle into the generation of hyperrectangle models. It differs significantly from existing divide-and-conquer and separate-and-conquer approaches, where decision boundaries are created directly from the given training examples. In contrast, LMRL is a two-step approach.
In the first step, called supervised clustering, we aim to find a suitable representation of the given example set.
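Since the remainder of this section is abridged here, the following is only a structural sketch of the two-step idea, with assumed stand-ins: per-class k-means for the supervised clustering step and a simple half-gap expansion for the margin step. It is not the paper's LMSC or LearnRight algorithm.

```python
# Structural sketch of a two-step large-margin rectangle learner (stand-ins,
# NOT the paper's algorithms): step 1 yields class-pure clusters summarized
# as bounding boxes; step 2 grows each box halfway toward the nearest
# foreign-class box, placing the decision boundary mid-way between classes.
import numpy as np
from sklearn.cluster import KMeans

def supervised_clustering(X, y, k_per_class=2):
    # Step 1 (stand-in): cluster each class separately so every cluster is
    # class-pure; summarize each cluster as (lower, upper, label).
    boxes = []
    for label in np.unique(y):
        Xc = X[y == label]
        km = KMeans(n_clusters=min(k_per_class, len(Xc)), n_init=10).fit(Xc)
        for c in np.unique(km.labels_):
            pts = Xc[km.labels_ == c]
            boxes.append((pts.min(axis=0), pts.max(axis=0), label))
    return boxes

def enlarge_with_margin(boxes):
    # Step 2 (sketch): grow every box by half of its smallest L-infinity
    # distance to a box of another class, so the boundary ends up mid-way
    # between classes instead of hugging the training examples.
    grown = []
    for lo, hi, label in boxes:
        gaps = [max(float(np.max(np.maximum(lo2 - hi, lo - hi2))), 0.0)
                for lo2, hi2, lab2 in boxes if lab2 != label]
        delta = 0.5 * min(gaps) if gaps else 0.0
        grown.append((lo - delta, hi + delta, label))
    return grown

X = np.array([[0.1, 0.1], [0.2, 0.3], [0.8, 0.9], [0.9, 0.7]])
y = np.array([0, 0, 1, 1])
model = enlarge_with_margin(supervised_clustering(X, y, k_per_class=1))
```

In this toy example the two class boxes end at x1 = 0.5, mid-way between the closest examples of the two classes, rather than at the examples themselves.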
Experimental settings
In our experiments, we studied LMSC and LearnRight separately and both in combination with LMRL. Moreover, we compared them to the decision tree learner C4.5, the rule learner RIPPER, a k-nearest neighbor classifier, a naive Bayes classifier and an SVM. For this purpose, we primarily used 10 numerical, normalized benchmark data sets.
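As a rough illustration of this experimental setup, the sketch below uses scikit-learn stand-ins: DecisionTreeClassifier in place of C4.5 (scikit-learn ships no RIPPER implementation, so that learner is omitted), and the iris data as a placeholder for the paper's 10 benchmark sets, evaluated by 10-fold cross-validated accuracy on [0, 1]-normalized features.

```python
# Hedged sketch of the comparison setup with scikit-learn stand-ins;
# the paper's actual data sets and learner implementations differ.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # placeholder benchmark set
classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "k-NN (k=5)":    KNeighborsClassifier(n_neighbors=5),
    "naive Bayes":   GaussianNB(),
    "SVM (RBF)":     SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    # Normalize features to [0, 1] inside the CV loop, matching the
    # normalized data sets mentioned above; report 10-fold accuracy.
    pipe = make_pipeline(MinMaxScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=10)
    print(f"{name:13s} {scores.mean():.3f} +/- {scores.std():.3f}")
```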
Related work
One of the first explicit hyperrectangle-based learning approaches was proposed in [14] as the Nested Generalized Exemplar (NGE) framework. The authors generated hyperrectangles directly by iteratively aggregating examples from the given training set. Following the nearest neighbor principle, unseen examples were then classified according to their nearest rectangle. LearnRight [7], which we used as a supervised clustering method, is an extension of this basic approach.
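For illustration, here is a simplified, unweighted sketch of such nearest-rectangle classification (the NGE framework additionally uses weighting schemes, which are omitted): the distance of a point to an axis-parallel rectangle is zero inside it and otherwise the distance to the closest face.

```python
# Simplified nearest-rectangle classification in the spirit of NGE
# (unweighted sketch, not the original algorithm).
import numpy as np

def rect_distance(x, lower, upper):
    # Clipping x into the box yields the closest point on the rectangle;
    # the norm of the residual is the distance (zero if x is inside).
    return float(np.linalg.norm(x - np.clip(x, lower, upper)))

def nge_predict(rectangles, x):
    # rectangles: list of (lower, upper, label); the nearest one wins.
    return min(rectangles, key=lambda r: rect_distance(x, r[0], r[1]))[2]

rects = [(np.array([0.0, 0.0]), np.array([0.4, 0.4]), 0),
         (np.array([0.6, 0.6]), np.array([1.0, 1.0]), 1)]
print(nge_predict(rects, np.array([0.2, 0.9])))  # distance 0.4 to box 1 vs 0.5 to box 0 -> 1
```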
Conclusion
The purpose of our work has been to improve the accuracy of interpretable hyperrectangle models. To this end, we identified the large margin principle, successfully applied in other areas of machine learning, as a promising approach. Based on a new formal margin definition for hyperrectangle models, which naturally supports multiclass problems, we introduced a corresponding novel meta learning approach. Large Margin Rectangle Learning aims to optimize the global configuration margin using a two-step approach.
References (21)
- et al., Using rule extraction to improve the comprehensibility of predictive models, Social Science Research Network (2006)
- R.E. Schapire, Y. Freund, P. Bartlett, W.S. Lee, Boosting the margin: a new explanation for the effectiveness of voting methods, Annals of Statistics (1998)
- A.J. Smola, P.L. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press (2000)
- V.N. Vapnik, Statistical Learning Theory, Wiley (1998)
- J. Sinkkonen, S. Kaski, J. Nikkilä, Discriminative clustering: optimal contingency tables by learning metrics, in: ...
- Supervised Clustering: Algorithms and Applications (2005)
- B.J. Gao, Hyper-rectangle-based discriminative data generalization and applications in data mining, Ph.D. Thesis, Simon...
- D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley (1989)
- C.F. Eick, N. Zeidat, Z. Zhao, Supervised clustering – algorithms and benefits, in: 16th IEEE International Conference...
- K. Bennett, J. Blue, Optimal Decision Trees, Rensselaer Polytechnic Institute Math Report, ...
Matthias Kirmse received his B.S. and Diploma degrees in computer science from the Dresden University of Technology in 2007 and 2008, respectively. He is currently a Ph.D. candidate at the Artificial Intelligence Institute of Dresden University of Technology. His research interests include data mining, theoretical and applied machine learning, as well as fault detection and diagnosis.
Uwe Petersohn studied Computer Science at the Technical University of Dresden, where he received his doctorate (Promotion) in 1975 and his Habilitation in 1981 on Discrete Optimization. He was a Scientific Assistant in the Computer Science Department (1974–1981), then a member of the research staff and head of a group at the Research Centre of the Robotron Company, Dresden (1981–1986). Since 1986 he has been a Lecturer and Assistant Professor at the TUD Department of Computer Science and Head of the Applied Knowledge Representation and Reasoning group. His areas of interest are knowledge representation and reasoning, problem solving, reasoning with uncertain knowledge, case-based reasoning, complex decisions, discrete optimization, hybrid knowledge models, machine learning and the design of applications.