Potential of Interpreter Specialization for Data Analysis

He, Wei; Strout, Michelle Mills

doi:10.1007/978-3-030-90539-2_14

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12761)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Scientists frequently implement data analyses in high-level programming languages such as Python, Perl, Lua, and R. Many of these languages are inefficient because of the overhead of dynamic typing and interpretation. In this paper, we report the potential performance improvement of domain-specific interpreter specialization for data analysis workloads and evaluate how the characteristics of those workloads affect the specialization, both positively and negatively. Assisted by compilers, we specialize the Lua and CPython interpreters at the source level, using the script being interpreted and the data types observed during interpretation as invariants, for five common tasks from real data analysis workloads. In experiments, we measure a 9.0–39.6% performance improvement for Lua and an 11.0–17.2% performance improvement for CPython on benchmarks that perform data loading, histogram computation, data filtering, data transformation, and dataset shuffling. This specialization does not include misspeculation checks of data types at possible type-conversion code, which may be necessary for other workloads. We report the details of our evaluation and present a semi-automatic method for specializing the interpreters.
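The idea of treating the script and the data types as invariants can be illustrated with a toy example. The sketch below is not the authors' method and does not use the Lua or CPython sources: the bytecode format, the opcode names, and the functions `interpret` and `interpret_specialized` are all hypothetical, and the specialized version is written by hand rather than produced with compiler assistance. It only shows what kind of work (instruction decoding and per-operation type checks) becomes removable once the script and operand types are assumed fixed.

```python
# Toy illustration only. All names are hypothetical and are not taken from
# the paper or from the Lua/CPython implementations.

# A "script" in a made-up stack bytecode that computes x * x + 1.
SCRIPT = [("LOAD", "x"), ("LOAD", "x"), ("MUL",), ("PUSH", 1), ("ADD",)]


def generic_add(a, b):
    # Dynamic typing: operand types are inspected on every call.
    if isinstance(a, str) and isinstance(b, str):
        return a + b
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a + b
    raise TypeError("unsupported operand types for ADD")


def generic_mul(a, b):
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return a * b
    raise TypeError("unsupported operand types for MUL")


def interpret(script, env):
    """Generic interpreter: decodes every instruction and checks every type."""
    stack = []
    for instr in script:
        op = instr[0]
        if op == "LOAD":
            stack.append(env[instr[1]])
        elif op == "PUSH":
            stack.append(instr[1])
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(generic_add(a, b))
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(generic_mul(a, b))
        else:
            raise ValueError(f"unknown opcode {op}")
    return stack.pop()


def interpret_specialized(env):
    """Version specialized for SCRIPT, assuming x is always an int.

    The dispatch loop and the type checks are gone because the script and
    the operand type are treated as invariants. There is no misspeculation
    check, so inputs of another type are outside what this version handles.
    """
    x = env["x"]
    return x * x + 1


values = [3, 10, -4]
generic = [interpret(SCRIPT, {"x": v}) for v in values]
specialized = [interpret_specialized({"x": v}) for v in values]
print(generic, specialized)
```

Both versions return the same results for integer inputs, but only the generic one would reject or correctly handle an unexpected operand type, which mirrors the caveat about missing misspeculation checks.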

Author information

Authors and Affiliations

University of Arizona, Tucson, AZ, 85721, USA
Wei He & Michelle Mills Strout

Corresponding author

Correspondence to Wei He.

Editor information

Editors and Affiliations

University of Tennessee at Knoxville, Knoxville, TN, USA
Heike Jagode
Karlsruhe Institute of Technology, Karlsruhe, Baden-Württemberg, Germany
Hartwig Anzt
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
Hatem Ltaief
University of Tennessee System, Knoxville, TN, USA
Piotr Luszczek

About this paper

Cite this paper

He, W., Strout, M.M. (2021). Potential of Interpreter Specialization for Data Analysis. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_14

DOI: https://doi.org/10.1007/978-3-030-90539-2_14
Published: 13 November 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90538-5
Online ISBN: 978-3-030-90539-2
eBook Packages: Computer Science, Computer Science (R0)
