Towards better estimation of statistical significance when comparing evolutionary algorithms

ABSTRACT
The use of well-established statistical testing procedures to compare the performance of evolutionary algorithms often yields pessimistic results: obtaining conclusions with the necessary precision requires increasing the number of independent samples, and hence the computation time.

We aim to improve this situation by developing statistical tests that are well suited to the questions that typically arise when benchmarking evolutionary algorithms. Our first step, presented in this paper, is a procedure that determines whether the performance distributions of two given algorithms are identical on each of the benchmarks. Our experimental study shows that this procedure can detect very small differences in algorithm performance while requiring computational budgets that are an order of magnitude smaller (e.g., 15x) than those of existing approaches.
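To make the baseline concrete, the following is a minimal sketch of the conventional per-benchmark testing approach that the paper aims to improve upon: a two-sided Mann-Whitney U (rank-sum) test applied independently on each benchmark, with a Bonferroni correction for testing several benchmarks at once. The benchmark names and the synthetic "runtime" samples below are illustrative assumptions, not data from the paper.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# Hypothetical runtime samples of two algorithms on three benchmarks
# (synthetic data for illustration only).
benchmarks = {
    "OneMax":      (rng.normal(100, 10, 50), rng.normal(100, 10, 50)),  # identical
    "LeadingOnes": (rng.normal(200, 20, 50), rng.normal(230, 20, 50)),  # clear gap
    "BinVal":      (rng.normal(150, 15, 50), rng.normal(153, 15, 50)),  # tiny gap
}

alpha = 0.05

# Two-sided rank-sum test per benchmark.
raw = {name: mannwhitneyu(a, b, alternative="two-sided").pvalue
       for name, (a, b) in benchmarks.items()}

# Bonferroni correction over the family of benchmarks.
significant = {name: p * len(raw) < alpha for name, p in raw.items()}

for name in benchmarks:
    print(f"{name}: p={raw[name]:.4g} significant={significant[name]}")
```

Note how the tiny-gap benchmark tends to slip below the significance threshold at this sample size: the per-benchmark test plus a conservative correction is exactly the "pessimistic" behavior described above, which forces practitioners to draw many more independent samples.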