Learning Type Inference for Enhanced Dataflow Analysis

  • Conference paper
Computer Security – ESORICS 2023 (ESORICS 2023)

Abstract

Statically analyzing dynamically-typed code is a challenging endeavor: even seemingly simple tasks, such as determining the targets of procedure calls, are non-trivial without knowing the types of objects at compile time. To address this challenge, gradual typing is increasingly added to dynamically-typed languages, a prominent example being TypeScript, which introduces static typing to JavaScript. Gradual typing improves developers' ability to verify program behavior, contributing to robust, secure, and debuggable programs. In practice, however, users only sparsely annotate types directly. At the same time, conventional type inference faces performance challenges as program size grows. Statistical techniques based on machine learning offer faster inference, but although recent approaches demonstrate improved overall accuracy, they still perform significantly worse on user-defined types than on the most common built-in types. Further limiting their real-world usefulness, they rarely integrate with user-facing applications.
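
To make the call-target problem concrete, consider a minimal TypeScript sketch (our own illustration; the class and function names are hypothetical and not taken from the paper):

    // Hypothetical illustration: the receiver's static type determines
    // which method a call dispatches to.
    class Dog { speak(): string { return "woof"; } }
    class Cat { speak(): string { return "meow"; } }

    // Without an annotation, `animal` is implicitly `any`, so a static
    // analyzer cannot resolve which `speak` implementation is called.
    function makeNoise(animal) {
      return animal.speak();
    }

    // With a (predicted) annotation, the possible call targets are explicit.
    function makeNoiseTyped(animal: Dog | Cat): string {
      return animal.speak(); // resolves to Dog.speak or Cat.speak
    }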

We propose CodeTIDAL5, a Transformer-based model trained to reliably predict type annotations. For effective result retrieval and re-integration, we extract usage slices from a program’s code property graph. Compared against recent neural type inference systems, our model outperforms the current state of the art by 7.85% on the ManyTypes4TypeScript benchmark, achieving 71.27% accuracy overall. Furthermore, we present JoernTI, an integration of our approach into Joern, an open-source static analysis tool, and demonstrate that the analysis benefits from the additional type information. Since our model enables fast inference even on commodity CPUs, making our system available through Joern renders it highly accessible and facilitates security research.
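
As a hypothetical illustration of why recovered types benefit dataflow analysis (our own sketch; the names and the taint scenario are assumptions, not taken from the paper), consider how resolving a dynamically dispatched call connects a data flow to a security-relevant sink:

    import { execSync } from "child_process";

    class Logger   { write(s: string): void { console.log(s); } }
    class Executor { write(s: string): void { execSync(s); } } // command-execution sink

    // Without a type on `sink`, `sink.write` could be either method. If a
    // model infers `sink: Executor`, the flow of `data` into `execSync`
    // becomes visible to a dataflow analysis.
    function process(sink, data) {
      sink.write(data);
    }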

L. Seidel and S. D. Baker Effendi contributed equally to this work.

Acknowledgements

The authors gratefully acknowledge funding from the European Union’s Horizon 2020 research and innovation programme under project TESTABLE, grant agreement No. 101019206, from the German Federal Ministry of Education and Research (BMBF) under grant BIFOLD (BIFOLD23B), from the National Research Foundation (NRF), and from the Stellenbosch University Postgraduate Scholarship Programme (PSP). We would also like to thank Kevin Jesse for his help with the MT4TS dataset and models, and the anonymous reviewers for their feedback on our work.

Author information

Correspondence to Sedick David Baker Effendi.

Appendix

Table 4. The GitHub subdirectories of each manually reviewed open-source web application or library. Notable technologies include React, Express, Chroma, MongoDB, Meteor, AWS, and Postgres.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this paper

Seidel, L., Baker Effendi, S.D., Pinho, X., Rieck, K., van der Merwe, B., Yamaguchi, F. (2024). Learning Type Inference for Enhanced Dataflow Analysis. In: Tsudik, G., Conti, M., Liang, K., Smaragdakis, G. (eds) Computer Security – ESORICS 2023. ESORICS 2023. Lecture Notes in Computer Science, vol 14347. Springer, Cham. https://doi.org/10.1007/978-3-031-51482-1_10

  • DOI: https://doi.org/10.1007/978-3-031-51482-1_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-51481-4

  • Online ISBN: 978-3-031-51482-1

  • eBook Packages: Computer Science, Computer Science (R0)
