Abstract
In the design of mobile systems, hardware/software (HW/SW) co-design offers important advantages by enabling specialized hardware for performance and power optimizations. Dynamic binary translation (DBT) is a key component of co-designed systems. During translation, a dynamic optimizer in the DBT system applies various software optimizations to improve the quality of the translated code. Under dynamic optimization, optimization time is an exposed run-time overhead, and useful analyses are often restricted by their high costs. A dynamic optimizer therefore has to make smart decisions with limited analysis information, which complicates the design of optimization decision models and often causes human-made heuristics to fail. In mobile systems, this problem is even more challenging because of strict constraints on computing capability and memory size.
To overcome this challenge, we investigate the opportunity to build practical optimization decision models for DBT using machine learning techniques. As a first step, loop unrolling is chosen as the representative optimization. We base our approach on an industrial-strength DBT infrastructure and evaluate it with 17,116 unrollable loops collected from 200 benchmarks and real-life programs across various domains. Starting from all available features that are potentially important to the loop unrolling decision, we identify the best classification algorithm for our infrastructure, considering both prediction accuracy and cost. A greedy feature selection algorithm is then applied to this classifier to identify its significant features and cut down the feature space. Keeping only the significant features, the best affordable classifier, which satisfies the budgets allocated to the decision process, achieves 74.5% prediction accuracy for the optimal unroll factor and an average 20.9% reduction in dynamic instruction count during steady-state execution of the translated code. For comparison, the best baseline heuristic achieves 46.0% prediction accuracy with an average 13.6% instruction count reduction. Given that the infrastructure is already highly optimized and the ideal upper bound for instruction reduction is observed at 23.8%, we believe this result is noteworthy.
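The approach above casts unroll-factor selection as classification over loop features, then prunes the feature space with greedy forward selection. A minimal self-contained sketch of that workflow is shown below; the feature names, the synthetic labeling rule, the candidate unroll factors, and the 1-nearest-neighbour classifier are all illustrative assumptions standing in for the paper's dataset and its chosen classification algorithm.

```python
# Sketch of classifier-driven unroll-factor prediction with greedy forward
# feature selection. All data here is synthetic; in the paper, labels come
# from measured dynamic instruction counts, not a hidden rule.
import random

random.seed(0)

FEATURES = ["trip_count", "body_insts", "live_regs", "mem_ops", "branches"]
UNROLL_FACTORS = [1, 2, 4, 8]  # hypothetical candidate factors

def make_loop():
    # Synthetic loop descriptor: a feature vector plus a "best" unroll
    # factor derived from a simple hidden rule (stand-in for profiling).
    f = {name: random.randint(1, 32) for name in FEATURES}
    best = 8 if f["body_insts"] < 8 else (4 if f["live_regs"] < 16 else 1)
    return f, best

data = [make_loop() for _ in range(500)]
train, test = data[:400], data[400:]

def accuracy(feature_subset, train, test):
    # 1-nearest-neighbour classification restricted to the chosen features.
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in feature_subset)
    hits = 0
    for f, label in test:
        nearest = min(train, key=lambda t: dist(t[0], f))
        hits += nearest[1] == label
    return hits / len(test)

# Greedy forward selection: repeatedly add the single feature that most
# improves held-out accuracy, stopping when no remaining feature helps.
selected, best_acc = [], 0.0
while True:
    candidates = [f for f in FEATURES if f not in selected]
    if not candidates:
        break
    acc, feat = max((accuracy(selected + [f], train, test), f)
                    for f in candidates)
    if acc <= best_acc:
        break
    selected.append(feat)
    best_acc = acc

print("selected features:", selected)
print("held-out accuracy: %.2f" % best_acc)
```

The greedy loop mirrors the paper's motivation: each retained feature has a measurable cost at translation time, so selection stops as soon as an extra feature no longer pays for itself in prediction accuracy.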
Index Terms
- Multi-objective Exploration for Practical Optimization Decisions in Binary Translation