Abstract
Statically analyzing dynamically typed code is a challenging endeavor: even seemingly trivial tasks such as determining the targets of procedure calls are non-trivial without knowing the types of objects at compile time. To address this challenge, gradual typing is increasingly added to dynamically typed languages, a prominent example being TypeScript, which introduces static typing to JavaScript. Gradual typing improves the developer’s ability to verify program behavior, contributing to robust, secure, and debuggable programs. In practice, however, users only sparsely annotate types directly, and conventional type inference faces performance challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate improved overall accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Further limiting their real-world usefulness, they rarely integrate with user-facing applications.
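As a concrete illustration of this benefit (a minimal sketch of ours, not an excerpt from the paper), the TypeScript fragment below contrasts an effectively untyped parameter with an annotated one: only the annotated version lets the compiler resolve the call target and reject a misspelled method.

```typescript
// Untyped variant: `logger` is treated as `any`, as plain JavaScript
// values would be, so a misspelled method only fails at runtime.
function logUntyped(logger: any) {
  logger.wran("request failed"); // typo goes unnoticed by the compiler
}

// Annotated variant: the call target is statically known, and `tsc`
// rejects the misspelled call at compile time.
interface Logger {
  warn(msg: string): void;
}

function logTyped(logger: Logger) {
  logger.warn("request failed"); // resolves to Logger.warn
}
```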
We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program’s code property graph. In a comparison against recent neural type inference systems, our model outperforms the current state of the art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into the open-source static analysis tool Joern, and demonstrate that the analysis benefits from the additional type information. Since our model allows for fast inference even on commodity CPUs, making our system available through Joern renders it highly accessible and facilitates security research.
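To make the idea of a usage slice more tangible, the following TypeScript sketch shows a hypothetical unannotated variable, the kind of usage information a slice over the code property graph could record for it, and the annotation a model could propose. All identifiers and the slice layout here are illustrative assumptions, not the exact JoernTI slice format or API.

```typescript
// Minimal stub so the sketch type-checks; the name is illustrative.
declare function createClient(cfg: object): any;

// Hypothetical input program: `client` carries no annotation, so a
// static analyzer cannot resolve the targets of `put` and `get`.
const client = createClient({ region: "eu-west-1" });
client.put({ TableName: "users", Item: { id: 1 } });
client.get({ TableName: "users", Key: { id: 1 } });

// Rough shape of the usage information a slice over the code property
// graph could expose for `client` (field names are made up here and do
// not reflect the exact JoernTI schema):
const usageSlice = {
  targetObj: "client",
  definedBy: "createClient",
  invokedCalls: ["put", "get"],   // methods invoked on the object
  argToCalls: [] as string[],     // calls receiving it as an argument
};

// A model such as CodeTIDAL5 consumes this usage context and proposes an
// annotation (e.g. a DynamoDB-client-style type), which the analysis
// re-integrates so that downstream dataflow queries can resolve the
// call targets.
console.log(usageSlice);
```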
L. Seidel and S.D. Baker Effendi contributed equally to this work.
Acknowledgements
The authors gratefully acknowledge funding from the European Union’s Horizon 2020 research and innovation programme under project TESTABLE (grant agreement No. 101019206), from the German Federal Ministry of Education and Research (BMBF) under the grant BIFOLD (BIFOLD23B), from the National Research Foundation (NRF), and from the Stellenbosch University Postgraduate Scholarship Programme (PSP). We would also like to thank Kevin Jesse for his help with the MT4TS dataset and models, and the anonymous reviewers for their feedback on our work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Seidel, L., Baker Effendi, S.D., Pinho, X., Rieck, K., van der Merwe, B., Yamaguchi, F. (2024). Learning Type Inference for Enhanced Dataflow Analysis. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51481-4
Online ISBN: 978-3-031-51482-1
eBook Packages: Computer Science, Computer Science (R0)