Potential of Interpreter Specialization for Data Analysis

  • Conference paper
High Performance Computing (ISC High Performance 2021)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 12761))

Abstract

Scientists frequently implement data analyses in high-level programming languages such as Python, Perl, Lua, and R. Many of these languages are inefficient due to the overhead of being dynamically typed and interpreted. In this paper, we report the potential performance improvement of domain-specific interpreter specialization for data analysis workloads and evaluate how the characteristics of data analysis workloads affect the specialization, both positively and negatively. Assisted by compilers, we specialize the Lua and CPython interpreters at source level, using the script being interpreted and the data types observed during interpretation as invariants, for five common tasks from real data analysis workloads. Through experiments, we measure 9.0–39.6% performance improvement for Lua and 11.0–17.2% performance improvement for CPython on benchmarks that perform data loading, histogram computation, data filtering, data transformation, and dataset shuffle. This specialization does not include misspeculation checks of data types at code that may perform type conversion, which may be necessary for other workloads. We report the details of our evaluation and present a semi-automatic method for specializing the interpreters.
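
As a rough illustration of the idea in the abstract, the sketch below specializes a toy interpreter with respect to a fixed script and an assumed operand type. It is not the authors' method, which operates on the Lua and CPython interpreter sources with compiler assistance. The stack machine, the `interpret` and `specialize` functions, and the `SCRIPT` program are all hypothetical names invented for this example.

```python
from typing import Callable, List, Optional, Tuple

# A toy instruction: (opcode, optional immediate operand). Hypothetical format.
Instr = Tuple[str, Optional[float]]


def generic_binop(op: str, a, b):
    """Dynamically typed arithmetic: checks and converts types on every call."""
    if isinstance(a, str):
        a = float(a)
    if isinstance(b, str):
        b = float(b)
    return a + b if op == "add" else a * b


def interpret(script: List[Instr], x):
    """Generic interpreter: decodes each instruction and dispatches at run time."""
    stack = [x]
    for op, arg in script:
        if op == "push":
            stack.append(arg)
        elif op in ("add", "mul"):
            b, a = stack.pop(), stack.pop()
            stack.append(generic_binop(op, a, b))
        else:
            raise ValueError(f"unknown opcode: {op}")
    return stack.pop()


def specialize(script: List[Instr]) -> Callable[[float], float]:
    """Treat the script as an invariant and assume a numeric input.

    The dispatch loop is executed once, symbolically, to build a single
    Python expression. No misspeculation checks are emitted, so a
    non-numeric input is not handled.
    """
    stack = ["x"]
    for op, arg in script:
        if op == "push":
            stack.append(repr(float(arg)))
        elif op in ("add", "mul"):
            b, a = stack.pop(), stack.pop()
            symbol = "+" if op == "add" else "*"
            stack.append(f"({a} {symbol} {b})")
        else:
            raise ValueError(f"unknown opcode: {op}")
    return eval(f"lambda x: {stack.pop()}")


# Hypothetical script computing x * 2 + 1.
SCRIPT: List[Instr] = [("push", 2), ("mul", None), ("push", 1), ("add", None)]

fast = specialize(SCRIPT)
print(interpret(SCRIPT, 3.0), fast(3.0))
```

Both calls agree on numeric input, but only the generic interpreter accepts the string `"3"`. The specialized function raises a `TypeError` there, which is the kind of case the missing misspeculation checks would have to cover.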

Corresponding author

Correspondence to Wei He.

Copyright information

© 2021 Springer Nature Switzerland AG

Cite this paper

He, W., Strout, M.M. (2021). Potential of Interpreter Specialization for Data Analysis. In: Jagode, H., Anzt, H., Ltaief, H., Luszczek, P. (eds) High Performance Computing. ISC High Performance 2021. Lecture Notes in Computer Science, vol 12761. Springer, Cham. https://doi.org/10.1007/978-3-030-90539-2_14

  • DOI: https://doi.org/10.1007/978-3-030-90539-2_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-90538-5

  • Online ISBN: 978-3-030-90539-2

  • eBook Packages: Computer Science, Computer Science (R0)
