DOI: 10.1145/3425898.3426962
research-article

VarSem: declarative expression and automated inference of variable usage semantics

Published: 16 November 2020

ABSTRACT

Programmers declare variables to serve specific implementation purposes that we refer to as variable usage semantics (VUS). Understanding VUS is required for various software engineering tasks, including program comprehension, code audits, and vulnerability detection. To help programmers understand VUS, we present a new program analysis that infers a variable's usage semantics from its textual and context information (e.g., symbolic name, type, scope, information flow). To support this analysis, we introduce VarSem, a domain-specific language, in which a variable's semantic category is expressed as a set of declarative rules. VarSem's execution determines which program variables belong to a given semantic category. VarSem translates high-level declarative rules into low-level program analysis techniques, including natural language processing and data flow, and provides a highly extensible architecture for specifying new rules and analysis techniques. We evaluate VarSem with eight real-world systems to identify their personally identifiable information variables. The evaluation results show that VarSem infers variable semantics with satisfying accuracy/precision and passable recall, thus potentially benefiting both software and security engineers.
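
The idea of expressing a semantic category as a set of declarative rules can be illustrated with a small sketch. This is not VarSem's syntax or implementation, neither of which appears in the abstract. Every name below (`Variable`, `name_mentions`, `has_type`, `all_of`, `infer`) is hypothetical, and plain keyword matching on identifier words stands in for the natural language processing the abstract mentions. Data flow rules are omitted.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Variable:
    """Hypothetical record of the textual and context facts about a variable."""
    name: str
    type_name: str
    scope: str


# A rule is a predicate over a variable; a category is a combination of rules.
Rule = Callable[[Variable], bool]


def split_identifier(name: str) -> List[str]:
    """Split snake_case and camelCase identifiers into lowercase words."""
    words: List[str] = []
    for part in re.split(r"_+", name):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def name_mentions(*keywords: str) -> Rule:
    """Rule: some word of the symbolic name is one of the keywords."""
    wanted = set(keywords)
    return lambda v: bool(wanted & set(split_identifier(v.name)))


def has_type(*type_names: str) -> Rule:
    """Rule: the declared type is one of the given type names."""
    allowed = set(type_names)
    return lambda v: v.type_name in allowed


def all_of(*rules: Rule) -> Rule:
    """Combine rules conjunctively into one category."""
    return lambda v: all(rule(v) for rule in rules)


def infer(category: Rule, variables: Iterable[Variable]) -> List[str]:
    """Return the names of the variables that belong to the category."""
    return [v.name for v in variables if category(v)]


# Assumed example category: textual variables whose names suggest a password.
password_like = all_of(
    name_mentions("password", "passwd", "pwd"),
    has_type("char *", "char[]"),
)

variables = [
    Variable("userPassword", "char *", "local"),
    Variable("pwd_len", "int", "local"),
    Variable("buffer", "char *", "global"),
]

print(infer(password_like, variables))  # prints ['userPassword']
```

Here `pwd_len` is rejected by the type rule and `buffer` by the name rule, which shows why combining textual and context information can be more selective than either alone.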

Published in

GPCE 2020: Proceedings of the 19th ACM SIGPLAN International Conference on Generative Programming: Concepts and Experiences
November 2020
136 pages
ISBN: 9781450381741
DOI: 10.1145/3425898
General Chair: Martin Erwig
Program Chair: Jeff Gray

Copyright © 2020 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 56 of 180 submissions, 31%