DOI: 10.1145/3597503.3640325
research-article

PyAnalyzer: An Effective and Practical Approach for Dependency Extraction from Python Code

Published: 12 April 2024

Abstract

Dependency extraction based on static analysis lays the groundwork for a wide range of applications. However, dynamic language features in Python make code behavior obscure and nondeterministic, which makes it very challenging for static analyses to resolve symbol-level dependencies. Although many techniques and tools are available, they still lack sufficient capabilities to handle object changes, first-class citizens, varying call sites, and library dependencies. To address this fundamental difficulty for dynamic languages, this work proposes PyAnalyzer, an effective and practical method for dependency extraction. PyAnalyzer uniformly models functions, classes, and modules as first-class heap objects and propagates the dynamic changes of these objects along with class inheritance. This approach better simulates dynamic features such as duck typing, object changes, and first-class citizens, yielding high recall without compromising precision. Moreover, PyAnalyzer leverages optional type annotations as a shortcut to express varying call sites and to resolve library dependencies on demand. We collected two micro-benchmarks (278 small programs), two macro-benchmarks (59 real-world applications), and 191 real-world projects (10 MSLOC) for comprehensive comparisons with 7 advanced techniques (i.e., Understand, Sourcetrail, Depends, ENRE19, PySonar2, PyCG, and Type4Py). The results demonstrate that PyAnalyzer achieves high recall and hence improves F1 by 24.7% on average, while running at least 1.4x faster without an obvious compromise in memory efficiency. Our work will benefit diverse client applications.
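
To make the dynamic features named in the abstract concrete, the following minimal Python sketch shows duck typing, first-class functions and classes, and a run-time object change in one small program. It is an illustration only: every name in it (Duck, Robot, shout, announce, apply, make) is hypothetical, and it is not taken from PyAnalyzer or its benchmarks. It only suggests why resolving the target of a call such as thing.speak() may require tracking which objects flow where, rather than reading the syntax alone.

```python
# Illustrative only: a tiny program exhibiting the dynamic features named in
# the abstract. All names are hypothetical, not part of PyAnalyzer.


class Duck:
    def speak(self):
        return "quack"


class Robot:
    def speak(self):
        return "beep"


def shout(self):
    # A plain function that is later installed as a method on Duck.
    return "QUACK"


def announce(thing):
    # Duck typing: `thing` can be any object with a `speak` attribute, so the
    # call target is not fixed by a declared type.
    return thing.speak()


def apply(func, arg):
    # First-class functions: the callee arrives as an ordinary value.
    return func(arg)


def make(cls):
    # First-class classes: the class to instantiate is also a value.
    return cls()


results = [announce(Duck()), announce(Robot())]

# Object change: rebinding a class attribute at run time redirects every
# later `Duck().speak()` call to `shout`.
Duck.speak = shout
results.append(apply(announce, make(Duck)))

print(results)  # ['quack', 'beep', 'QUACK']
```

In this sketch, the single call site thing.speak() depends on Duck.speak, Robot.speak, and, after the reassignment, shout. An extractor that records only one of these would lose recall, which is the kind of gap the abstract describes.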

Cited By

  • (2024) Cross-Language Dependencies: An Empirical Study of Kotlin-Java. Proceedings of the 18th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 189-199. DOI: 10.1145/3674805.3686680. Online publication date: 24-Oct-2024
  • (2024) ERD-CQC: Enhanced Rule and Dependency Code Quality Check for Java. Proceedings of the 15th Asia-Pacific Symposium on Internetware, 377-386. DOI: 10.1145/3671016.3674820. Online publication date: 24-Jul-2024

Published In

ICSE '24: Proceedings of the IEEE/ACM 46th International Conference on Software Engineering
May 2024
2942 pages
ISBN: 9798400702174
DOI: 10.1145/3597503

In-Cooperation

  • Faculty of Engineering of University of Porto

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dependency extraction
  2. Python
  3. dynamic features

Qualifiers

  • Research-article

Funding Sources

  • NSFC

Conference

ICSE '24
Acceptance Rates

Overall Acceptance Rate 276 of 1,856 submissions, 15%

Article Metrics

  • Downloads (last 12 months): 322
  • Downloads (last 6 weeks): 26

Download counts reflect activity up to 12 Feb 2025.
