Abstract
Statically analyzing dynamically typed code is a challenging endeavor: even seemingly trivial tasks such as determining the targets of procedure calls are non-trivial without knowing the types of objects at compile time. To address this challenge, gradual typing is increasingly added to dynamically typed languages, a prominent example being TypeScript, which introduces static typing to JavaScript. Gradual typing improves the developer’s ability to verify program behavior, contributing to robust, secure, and debuggable programs. In practice, however, users only sparsely annotate types directly, and conventional type inference faces performance challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate improved overall accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Further limiting their real-world usefulness, they rarely integrate with user-facing applications.
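As a concrete illustration of this benefit (a minimal sketch of ours, not an excerpt from the paper), the TypeScript fragment below contrasts an effectively untyped parameter with an annotated one: only the annotated version lets the compiler resolve the call target and reject a misspelled method.

```typescript
// Untyped variant: `logger` is treated as `any`, as plain JavaScript
// values would be, so a misspelled method only fails at runtime.
function logUntyped(logger: any) {
  logger.wran("request failed"); // typo goes unnoticed by the compiler
}

// Annotated variant: the call target is statically known, and `tsc`
// rejects the misspelled call at compile time.
interface Logger {
  warn(msg: string): void;
}

function logTyped(logger: Logger) {
  logger.warn("request failed"); // resolves to Logger.warn
}
```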
We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program’s code property graph. In a comparison against recent neural type inference systems, our model outperforms the current state of the art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into the open-source static analysis tool Joern, and demonstrate that the analysis benefits from the additional type information. Since our model allows for fast inference even on commodity CPUs, making our system available through Joern renders it highly accessible and facilitates security research.
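To make the idea of a usage slice more tangible, the following TypeScript sketch shows a hypothetical unannotated variable, the kind of usage information a slice over the code property graph could record for it, and the annotation a model could propose. All identifiers and the slice layout here are illustrative assumptions, not the exact JoernTI slice format or API.

```typescript
// Minimal stub so the sketch type-checks; the name is illustrative.
declare function createClient(cfg: object): any;

// Hypothetical input program: `client` carries no annotation, so a
// static analyzer cannot resolve the targets of `put` and `get`.
const client = createClient({ region: "eu-west-1" });
client.put({ TableName: "users", Item: { id: 1 } });
client.get({ TableName: "users", Key: { id: 1 } });

// Rough shape of the usage information a slice over the code property
// graph could expose for `client` (field names are made up here and do
// not reflect the exact JoernTI schema):
const usageSlice = {
  targetObj: "client",
  definedBy: "createClient",
  invokedCalls: ["put", "get"],   // methods invoked on the object
  argToCalls: [] as string[],     // calls receiving it as an argument
};

// A model such as CodeTIDAL5 consumes this usage context and proposes an
// annotation (e.g. a DynamoDB-client-style type), which the analysis
// re-integrates so that downstream dataflow queries can resolve the
// call targets.
console.log(usageSlice);
```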
L. Seidel and S.D. Baker Effendi contributed equally to this work.
Acknowledgements
The authors gratefully acknowledge funding from the European Union’s Horizon 2020 research and innovation programme under project TESTABLE (grant agreement No. 101019206), from the German Federal Ministry of Education and Research (BMBF) under the grant BIFOLD (BIFOLD23B), from the National Research Foundation (NRF), and from the Stellenbosch University Postgraduate Scholarship Programme (PSP). We would also like to thank Kevin Jesse for his help with the MT4TS dataset and models, and the anonymous reviewers for their feedback on our work.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Seidel, L., Baker Effendi, S.D., Pinho, X., Rieck, K., van der Merwe, B., Yamaguchi, F. (2024). Learning Type Inference for Enhanced Dataflow Analysis. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-51481-4
Online ISBN: 978-3-031-51482-1
eBook Packages: Computer Science, Computer Science (R0)