ABSTRACT
Programmers declare variables to serve specific implementation purposes, which we refer to as variable usage semantics (VUS). Understanding VUS is required for various software engineering tasks, including program comprehension, code audits, and vulnerability detection. To help programmers understand VUS, we present a new program analysis that infers a variable's usage semantics from its textual and contextual information (e.g., symbolic name, type, scope, information flow). To support this analysis, we introduce VarSem, a domain-specific language in which a variable's semantic category is expressed as a set of declarative rules. Executing a VarSem program determines which program variables belong to a given semantic category. VarSem translates high-level declarative rules into low-level program analysis techniques, including natural language processing and data flow analysis, and provides a highly extensible architecture for specifying new rules and analysis techniques. We evaluate VarSem on eight real-world systems by identifying their personally identifiable information variables. The evaluation results show that VarSem infers variable semantics with satisfactory accuracy and precision and passable recall, thus potentially benefiting both software and security engineers.
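The idea of expressing a semantic category as a set of declarative rules can be illustrated with a minimal Python sketch. This is not VarSem's syntax or implementation: the names `Variable`, `name_contains`, `has_type`, `all_of`, `infer`, and the `PASSWORD` category are hypothetical, and a simple keyword match on the identifier stands in for the natural language processing and data flow analyses mentioned above.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Variable:
    """Hypothetical record of a variable's textual and context information."""
    name: str
    type: str
    scope: str


# A rule is a predicate over a variable; a category is a combination of rules.
Rule = Callable[[Variable], bool]


def name_contains(*keywords: str) -> Rule:
    """Textual rule: some word of the identifier equals one of the keywords."""
    wanted = {k.lower() for k in keywords}

    def rule(var: Variable) -> bool:
        # Split snake_case and camelCase identifiers into words.
        words = re.findall(r"[A-Za-z][a-z]*", var.name)
        return any(w.lower() in wanted for w in words)

    return rule


def has_type(*types: str) -> Rule:
    """Context rule: the declared type is one of the given types."""
    def rule(var: Variable) -> bool:
        return var.type in types

    return rule


def all_of(*rules: Rule) -> Rule:
    """Combine rules conjunctively: every rule must hold."""
    def rule(var: Variable) -> bool:
        return all(r(var) for r in rules)

    return rule


def infer(category: Rule, variables: List[Variable]) -> List[str]:
    """Return the names of the variables that belong to the category."""
    return [v.name for v in variables if category(v)]


# Assumed example category: textual password-like variables.
PASSWORD = all_of(
    name_contains("password", "passwd", "pwd"),
    has_type("char *", "char[]"),
)

variables = [
    Variable("userPassword", "char *", "local"),
    Variable("passwd_len", "int", "local"),
    Variable("buffer", "char *", "global"),
]

print(infer(PASSWORD, variables))  # prints ['userPassword']
```

Here `passwd_len` matches the name rule but fails the type rule, and `buffer` matches the type rule but fails the name rule, so only `userPassword` is reported. Keeping each rule as an independent predicate is one way to allow new rules to be added without changing the inference step.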
Index Terms
- VarSem: declarative expression and automated inference of variable usage semantics