research-article

PyAnalyzer: An Effective and Practical Approach for Dependency Extraction from Python Code

Authors:

Dinghong Zhong,

Ting Liu

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

Article No.: 112, Pages 1 - 12

https://doi.org/10.1145/3597503.3640325

Published: 12 April 2024

Abstract

Dependency extraction based on static analysis lays the groundwork for a wide range of applications. However, dynamic language features in Python make code behavior obscure and nondeterministic, which makes it very hard for static analyses to resolve symbol-level dependencies. Although many techniques and tools are available, they still lack the capability to handle object changes, first-class citizens, varying call sites, and library dependencies. To address this fundamental difficulty for dynamic languages, this work proposes PyAnalyzer, an effective and practical method for dependency extraction. PyAnalyzer uniformly models functions, classes, and modules as first-class heap objects and propagates the dynamic changes of these objects and class inheritance. This approach better simulates dynamic features such as duck typing, object changes, and first-class citizens, yielding high recall without compromising precision. Moreover, PyAnalyzer leverages optional type annotations as a shortcut to express varying call sites and to resolve library dependencies on demand. We collected two micro-benchmarks (278 small programs), two macro-benchmarks (59 real-world applications), and 191 real-world projects (10 MSLOC) for comprehensive comparisons with 7 advanced techniques (i.e., Understand, Sourcetrail, Depends, ENRE19, PySonar2, PyCG, and Type4Py). The results demonstrate that PyAnalyzer achieves high recall and hence improves F₁ by 24.7% on average, while running at least 1.4x faster without an obvious compromise in memory efficiency. Our work will benefit diverse client applications.
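
The dynamic features named in the abstract can be made concrete with a small Python sketch. It is illustrative only and is not taken from the paper or from PyAnalyzer; every name in it (`Duck`, `Robot`, `loud`, `call_speak`, `apply`, `annotated`) is hypothetical. It shows why the target of a call such as `obj.speak()` is hard to resolve statically, and how an optional type annotation narrows the candidates.

```python
from typing import Callable


class Duck:
    def speak(self) -> str:
        return "quack"


class Robot:
    def speak(self) -> str:
        return "beep"


def loud(self) -> str:
    # Replacement method installed on Duck at runtime (see below).
    return "QUACK"


def call_speak(obj) -> str:
    # Duck typing: the callee depends on the runtime type of obj,
    # so a static analysis sees several candidate targets here.
    return obj.speak()


def apply(fn: Callable[[], str]) -> str:
    # First-class citizen: fn is an ordinary value, and the function
    # it refers to is only known by tracking how that value flows.
    return fn()


def annotated(d: Duck) -> str:
    # The optional annotation tells an analysis that d is a Duck,
    # which narrows the candidate targets of d.speak().
    return d.speak()


results = [call_speak(Duck()), call_speak(Robot())]

# Object change: the class object is mutated at runtime, so later
# calls to Duck().speak() dispatch to loud instead of Duck.speak.
Duck.speak = loud

results.append(call_speak(Duck()))
results.append(apply(Robot().speak))
results.append(annotated(Duck()))
print(results)
```

After the assignment `Duck.speak = loud`, the same call site in `call_speak` reaches a different function. An analysis that ignores this change would report a dependency on the original method only.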



Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering

May 2024

2942 pages

ISBN: 9798400702174

DOI: 10.1145/3597503

Co-chairs: Ana Paiva, Rui Abreu

Program Co-chairs: Abhik Roychoudhury, Margaret Storey

Copyright © 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

In-Cooperation

Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States


Funding Sources

NSFC

Conference

ICSE '24

Sponsor:

SIGSOFT

ICSE '24: IEEE/ACM 46th International Conference on Software Engineering

April 14 - 20, 2024

Lisbon, Portugal

Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

