We introduce a mathematical programming approach to building rule lists, which are a type of interpretable, nonlinear, and logical machine learning classifier involving IF-THEN rules. Unlike traditional decision tree algorithms like CART and C5.0, this method does not use greedy splitting and pruning. Instead, it aims to fully optimize a combination of accuracy and sparsity, obeying user-defined constraints. This method is useful for producing non-black-box predictive models, and has the benefit of a clear user-defined tradeoff between training accuracy and sparsity. The flexible framework of mathematical programming allows users to create customized models with a provable guarantee of optimality. The software reviewed as part of this submission was given the DOI (Digital Object Identifier) https://doi.org/10.5281/zenodo.1344142.

The recall or sensitivity of a classifier is the true positive rate, the precision is the fraction of predicted positives that are true positives, and the specificity is the true negative rate:
$$ \begin{aligned} \text{recall} = \text{sensitivity} &= \frac{\sum_i 1_{(y_i=f(x_i)) \,\&\, (y_i=1)}}{\sum_i 1_{y_i=1}}, \\ \text{precision} &= \frac{\sum_i 1_{(y_i=f(x_i)) \,\&\, (y_i=1)}}{\sum_i 1_{f(x_i)=1}}, \\ \text{specificity} &= \frac{\sum_i 1_{(y_i=f(x_i)) \,\&\, (y_i=-1)}}{\sum_i 1_{y_i=-1}}. \end{aligned} $$
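As a minimal sketch, the three quantities above can be computed directly from label vectors. The function name `confusion_rates` and the toy label vectors are hypothetical and are not part of the ORL package; labels are assumed to lie in {+1, -1}, as in the formulas.

```python
def confusion_rates(y_true, y_pred):
    """Return (recall, precision, specificity) for labels in {+1, -1}."""
    pairs = list(zip(y_true, y_pred))
    # Numerators: correctly classified positives and negatives.
    true_pos = sum(1 for y, f in pairs if y == f and y == 1)
    true_neg = sum(1 for y, f in pairs if y == f and y == -1)
    # Denominators: actual positives, predicted positives, actual negatives.
    n_pos = sum(1 for y in y_true if y == 1)
    n_pred_pos = sum(1 for f in y_pred if f == 1)
    n_neg = sum(1 for y in y_true if y == -1)
    # A ratio is undefined (NaN) when its denominator is zero.
    nan = float("nan")
    recall = true_pos / n_pos if n_pos else nan
    precision = true_pos / n_pred_pos if n_pred_pos else nan
    specificity = true_neg / n_neg if n_neg else nan
    return recall, precision, specificity


# Toy labels, invented for illustration only.
y_true = [1, 1, 1, -1, -1, -1, -1, 1]
y_pred = [1, 1, -1, -1, 1, 1, -1, 1]
recall, precision, specificity = confusion_rates(y_true, y_pred)
print(recall, precision, specificity)
```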
We gratefully acknowledge funding from the MIT Big Data Initiative, and the National Science Foundation under grant IIS-1053407. Thanks to Daniel Bienstock and anonymous reviewers for encouragement and for helping us to improve the readability of the manuscript.
Appendix A: Additional accuracy comparison experiments
Table 10 shows more detail about the experiments: it contains the numerical accuracy values for all algorithms and all pairwise hypothesis tests. Because several of the algorithms in Table 10 have a large number of tunable parameters, it is clearly possible to tune them to provide better performance; for instance, in our method there are tuning parameters that govern the number and characteristics of rules in each class, along with tuning parameters for regularization. We chose a single parameter setting for our method for experimental comparisons to other methods, to avoid the possibility that the method performs well due to its flexibility. Further, as Table 11 shows, for SVM with Gaussian kernels, no single setting of the SVM parameter values is best for all datasets. This table also shows the range of values one obtains when using SVM with various parameter settings. Note in particular that SVM never has perfect test accuracy on the Tic Tac Toe dataset, for any of the parameter settings we tried.
Appendix B: CART and C5.0 have difficulty with the Tic Tac Toe dataset
Figures 5 and 7 show decision trees for CART and C5.0, which are not particularly interpretable. Even as we varied C5.0 and CART’s parameters across their full ranges, they were not able to detect the pattern, as shown in Figs. 6 and 8.
Appendix C: ORL Tic Tac Toe models for other folds
The ORL models for other folds are shown in Tables 12 and 13. ORL provides correct models on all folds.
Appendix D: Additional Haberman experiments
In Table 14 we show the effect of varying \(C_1\) on the classification accuracy for one fold of the Haberman experiment, with C fixed at 1/(number of rules) and a 2 h maximum time limit for the solver (here, CPLEX). As long as \(C_1\) is small enough, the accuracy is not affected.
Appendix E: Violent crime F-scores and G-means
Table 15 shows numerical values for the training and test F-scores and G-means. The test values are also displayed in Fig. 3.
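For reference, a sketch of the two measures, assuming the common definitions in terms of the precision, recall (sensitivity), and specificity given earlier (the weighting in the F-score is an assumption):

$$ F\text{-score} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}, \qquad G\text{-mean} = \sqrt{\text{sensitivity} \cdot \text{specificity}} . $$

As a purely illustrative calculation with made-up values (not taken from Table 15), precision 0.6, recall 0.75, and specificity 0.5 give an F-score of \(0.9/1.35 \approx 0.667\) and a G-mean of \(\sqrt{0.375} \approx 0.612\).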
Appendix F: README for ORL package
This package contains the data and code for running ORL experiments, associated with the paper “Learning Customized and Optimized Lists of Rules with Mathematical Programming” by Cynthia Rudin and Şeyda Ertekin.
In the GitHub repository https://github.com/SeydaErtekin/ORL, the code for the first phase of ORL (rule generation) is under the Rule_Generation directory, and the code for the second phase (ranking of the discovered rules) is under the Rule_Ranking directory. We provide two of the datasets that we used in our experiments, namely Haberman’s Survival and TicTacToe, under the Datasets directory.
In the package, we provide two shell scripts for running experiments with the Haberman and TicTacToe datasets. The first script, run_haberman.sh, uses Haberman’s sample train/test split under Datasets/processed/ and runs the code for generating and ranking rules in sequence, then displays the ranked rules. With the default settings, the script generates the ranked rules shown in Table 5 in the paper. For TicTacToe, we use the toy ruleset under Rule_Generation/rules, so run_tictactoe.sh only runs the rule ranking and displaying routines. This ruleset and the corresponding results form the basis of our discussion in Sect. 3.1. Note that both scripts require Matlab and AMPL with the Gurobi solver to be installed on the local machine.
An overview of the order of execution and the dependencies of the code is given in the diagram below.

In this package, we also provide a sample train/test split for both datasets, as well as the rules (under the Rule_Generation/rules directory), the data input for rule ranking, and the ranked rules (under the Rule_Ranking/rules directory). The script print_ranked_rules.m can be used to view the ordered rule lists for these splits. For the Haberman’s Survival dataset, the set of rules includes all rules discovered with a particular setting of the input parameters. For the TicTacToe dataset, we provide the toy ruleset (discussed in Sect. 3.1 of the paper), which is a trimmed version of all discovered rules. This toy ruleset includes eight rules for the 1 class, three rules for the 0 class, and two default rules (one for each class). The input data for TicTacToe used for ranking (Rule_Ranking/rules/tictactoe_binary_train12_rank_input.dat) only initializes the parameters required for ranking; it does not need to precompute the values of the variables because the number of rules is small and the optimization completes within a few seconds.
Directory structure
Datasets: .csv files of the original datasets. To generate new train/test splits for the datasets, you can use the script generate_rulegen_input.m, which produces up to three train/test splits by dividing the dataset into three equal-sized chunks. Files for each split are suffixed with 12, 13, or 23, indicating which chunks were used for training. For example, files with suffix 12 use the first and second chunks as the training set and the third chunk as the test set.
Note that, because the examples are randomly shuffled, any newly generated train/test splits will differ from the ones we provide and may therefore yield different results. To use the splits for which we reported results in the paper, use the files under Datasets/processed.
Datasets/processed: Directory that contains the train/test sets (files with .txt extension) and the training sets in AMPL data format (with .dat extension). The former files are used for performance evaluation, whereas the latter are used in rule generation.
Rule_Generation: Contains the generate_rulegen_input.m script for generating the files under Datasets/processed, and the AMPL code that implements the rule generation routines. GenerateRules.sa is the main implementation of the rule generation routine, and AddRule.sa is a helper script (called from GenerateRules.sa) that writes each discovered rule to the output file and adds it to the list of constraints so that the same rule is not discovered again in subsequent iterations. The objective and constraints for rule generation are specified in the model file RuleGen.mod.
Rule_Generation/rules: Contains the files for the discovered rules for both classes in the datasets. We provide representative rules for both datasets in this directory. Files with “one” and “zero” suffixes include rules for the one and zero classes, respectively. The file with the “all” suffix is the aggregate of both files and the default rules for both classes.
Rule_Ranking: Contains the Matlab script generate_rulerank_input.m for aggregating the rules for both classes under Rule_Generation/rules. The aggregate rules are written to Rule_Generation/rules (with “all” suffix and .txt extension), and an AMPL-formatted version is written under the rules subdirectory. The Rule_Ranking directory also includes the AMPL code RankRules.sa, which implements the rule ranking routine, and the model file RankObj.mod.
Rule_Ranking/rules: Contains the data input used for rule ranking as well as the ranking output (the \(\pi \) vector of rule heights). This directory contains the ranked rules for both datasets, obtained for different C and \(C_1\) settings. Running print_ranked_rules.m (one level up, in the Rule_Ranking directory) prints the ranked rules for the specified dataset/experiment in human-readable form. Similarly, print_accuracy.m computes the accuracy on the train or test set (selected within the code) for the specified dataset/experiment.
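The chunking scheme described above can be sketched as follows. This is an illustrative Python sketch, not a port of generate_rulegen_input.m: the function name `make_splits`, the `seed` parameter, and the choice to let the last chunk absorb any remainder are all assumptions.

```python
import random


def make_splits(examples, seed=0):
    """Shuffle, cut into three chunks, and build the 12, 13, and 23 splits.

    Returns a dict mapping suffix -> (train, test), where the suffix names
    the two chunks used for training and the remaining chunk is the test set.
    """
    rng = random.Random(seed)  # hypothetical seed, for reproducibility
    shuffled = list(examples)
    rng.shuffle(shuffled)

    n = len(shuffled)
    # Assumption: chunks 1 and 2 have n // 3 examples; chunk 3 takes the rest.
    bounds = [0, n // 3, 2 * (n // 3), n]
    chunks = {k: shuffled[bounds[k - 1]:bounds[k]] for k in (1, 2, 3)}

    splits = {}
    for a, b in ((1, 2), (1, 3), (2, 3)):
        held_out = ({1, 2, 3} - {a, b}).pop()
        splits[f"{a}{b}"] = (chunks[a] + chunks[b], chunks[held_out])
    return splits


splits = make_splits(range(9))
for suffix, (train, test) in splits.items():
    print(suffix, len(train), len(test))
```

A different seed gives a different shuffle, which is why freshly generated splits generally differ from the ones shipped under Datasets/processed.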
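To illustrate how a ranked rule list produces a prediction, here is a minimal Python sketch. It assumes that an example is classified by the highest-ranked rule (largest height in \(\pi \)) whose conditions it satisfies, and that a default rule with no conditions catches everything else. The data layout, feature names, and function names are hypothetical; they do not reflect the file formats or the Matlab scripts in the package.

```python
def rule_applies(conditions, x):
    """A rule applies when every (feature, value) condition matches x."""
    return all(x.get(feature) == value for feature, value in conditions.items())


def predict(rules, x):
    """Return the label of the highest rule in the list that applies to x."""
    for rule in sorted(rules, key=lambda r: r["height"], reverse=True):
        if rule_applies(rule["conditions"], x):
            return rule["label"]
    raise ValueError("no rule applies; the list is expected to end in a default rule")


# Hypothetical toy list: two rules for class 1 and a default rule for class 0.
rules = [
    {"height": 3, "label": 1,
     "conditions": {"top_left": "x", "top_mid": "x", "top_right": "x"}},
    {"height": 2, "label": 1,
     "conditions": {"top_left": "x", "center": "x", "bottom_right": "x"}},
    {"height": 1, "label": 0, "conditions": {}},  # default rule
]

board = {"top_left": "x", "center": "x", "bottom_right": "x", "top_mid": "o"}
print(predict(rules, board))
```

Because prediction depends only on the first applicable rule, rules ranked below a default rule never fire, so a default rule effectively truncates the list.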
Cite this article
Rudin, C., Ertekin, Ş. Learning customized and optimized lists of rules with mathematical programming. Math. Prog. Comp. 10, 659–702 (2018). https://doi.org/10.1007/s12532-018-0143-8
Keywords
- Mixed-integer programming
- Decision trees
- Decision lists
- Sparsity
- Interpretable modeling
- Associative classification
Mathematics Subject Classification
- 68T05: Computer science, Artificial intelligence, Learning and adaptive systems