Abstract
Scientists frequently implement data analyses in high-level programming languages such as Python, Perl, Lua, and R. Many of these languages are inefficient because of the overhead of dynamic typing and interpretation. In this paper, we report the potential performance improvement of domain-specific interpreter specialization for data analysis workloads and evaluate how the characteristics of those workloads affect the specialization, both positively and negatively. Assisted by compilers, we specialize the Lua and CPython interpreters at the source level, using the script being interpreted and the data types observed during interpretation as invariants, for five common tasks from real data analysis workloads. In experiments, we measure a 9.0–39.6% performance improvement for Lua and an 11.0–17.2% performance improvement for CPython on benchmarks that perform data loading, histogram computation, data filtering, data transformation, and dataset shuffle. This specialization does not include misspeculation checks of data types at possible type conversion code, which may be necessary for other workloads. We report the details of our evaluation and present a semi-automatic method for specializing the interpreters.
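As a rough illustration of the idea in the abstract, the sketch below specializes a toy stack interpreter with respect to a fixed script and an assumed operand type. It is not the paper's method and does not touch Lua or CPython: the opcodes, the `interpret` and `specialize` functions, and the `SCRIPT` example are all hypothetical names introduced here. Like the specialization described above, the specialized function carries no misspeculation check, so it is only meaningful while the input really is a float.

```python
# Illustrative sketch only: a toy stack interpreter, not Lua or CPython.
from typing import Callable, List, Tuple, Union

Value = Union[int, float, str]
Instr = Tuple[str, Value]  # (opcode, operand); operand is ignored by ADD/MUL


def interpret(script: List[Instr], x: Value) -> Value:
    """Generic interpreter: dispatches on opcode and checks types at every step."""
    stack: List[Value] = []
    for op, arg in script:
        if op == "LOAD_X":
            stack.append(x)
        elif op == "PUSH":
            stack.append(arg)
        elif op in ("ADD", "MUL"):
            b, a = stack.pop(), stack.pop()
            # Dynamic type check, as a dynamically typed language performs.
            if not (isinstance(a, (int, float)) and isinstance(b, (int, float))):
                raise TypeError("numeric operands required")
            stack.append(a + b if op == "ADD" else a * b)
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack.pop()


def specialize(script: List[Instr]) -> Callable[[float], float]:
    """Treat the script and the float type of x as invariants.

    Opcode dispatch and type checks are resolved once, here, leaving only
    straight-line arithmetic. No misspeculation check is emitted.
    """
    stack: List[str] = []  # symbolic stack of Python expression strings
    for op, arg in script:
        if op == "LOAD_X":
            stack.append("x")
        elif op == "PUSH":
            stack.append(repr(float(arg)))
        elif op in ("ADD", "MUL"):
            b, a = stack.pop(), stack.pop()
            stack.append(f"({a} {'+' if op == 'ADD' else '*'} {b})")
        else:
            raise ValueError(f"unknown opcode {op}")
    # The source string is built only from the fixed script above.
    return eval(f"lambda x: {stack.pop()}")


# Hypothetical data transformation script: y = x * 2.0 + 1.0
SCRIPT: List[Instr] = [
    ("LOAD_X", 0), ("PUSH", 2.0), ("MUL", 0), ("PUSH", 1.0), ("ADD", 0),
]
transform = specialize(SCRIPT)
```

Calling `transform(v)` for a float `v` should give the same value as `interpret(SCRIPT, v)` while skipping the per-instruction dispatch and type checks. Whether that saving is noticeable in a real interpreter depends on the workload, which is the question the paper evaluates.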
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
He, W., Strout, M.M. (2021). Potential of Interpreter Specialization for Data Analysis. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_14
DOI: https://doi.org/10.1007/978-3-030-90539-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science, Computer Science (R0)