Abstract
Identifying library functions in program binaries is important to many security applications, such as threat analysis, digital forensics, software infringement, and malware detection. Today’s program binaries normally contain a significant number of third-party library functions taken from standard libraries or free open-source software packages. The ability to automatically identify such library functions not only enhances the quality and the efficiency of threat analysis and reverse engineering tasks, but also improves their accuracy by avoiding false correlations between irrelevant code bases. Existing methods either lack efficiency or are not robust enough to identify different versions of the same library function that result from the use of different compilers, different compilation settings, or obfuscation techniques. To address these limitations, we present a scalable and robust system called BinShape to identify standard library functions in binaries. The key idea of BinShape is twofold. First, we derive a robust signature for each library function based on heterogeneous features covering CFGs, instruction-level characteristics, statistical characteristics, and function-call graphs. Second, we design a novel data structure to store such signatures and facilitate efficient matching against a target function. We evaluate BinShape on a diverse set of C/C++ binaries, compiled with GCC and Visual Studio compilers on x86-x64 CPU architectures, at optimization levels O0 to O3. Our experiments show that BinShape is able to identify library functions in real binaries both efficiently and accurately, with an average accuracy of 89% and taking about 0.14 s to identify one function out of three million candidates. We also show that BinShape remains robust when the code is subjected to different compilers, slight modification, or some obfuscation techniques.
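The two-part idea above (a per-function signature built from several kinds of features, then a store that is searched for the closest signature) can be illustrated with a minimal Python sketch. This is not BinShape's implementation: the four features, their values, the Euclidean distance, the threshold, and the names `Signature`, `distance`, and `best_match` are all assumptions made for illustration, and the linear scan stands in for the paper's dedicated data structure, which the abstract does not describe.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class Signature:
    """Hypothetical signature: a function name plus a small numeric feature vector."""
    name: str
    # Assumed feature order: (basic blocks, CFG edges, instructions, outgoing calls)
    features: tuple


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(target, repository, threshold):
    """Return the name of the closest stored signature, or None if none is close enough.

    A linear scan is used here only for clarity; it is a stand-in for an
    indexed structure that would avoid comparing against every candidate.
    """
    best = min(repository, key=lambda s: distance(s.features, target), default=None)
    if best is None or distance(best.features, target) > threshold:
        return None
    return best.name


# Invented feature values, for illustration only.
REPOSITORY = [
    Signature("memcpy", (4.0, 5.0, 22.0, 0.0)),
    Signature("strlen", (3.0, 3.0, 12.0, 0.0)),
    Signature("qsort", (19.0, 27.0, 140.0, 3.0)),
]

# A target that differs slightly from a stored function (for example, one extra instruction).
match_close = best_match((3.0, 3.0, 13.0, 0.0), REPOSITORY, threshold=2.0)
# A target that resembles nothing in the repository.
match_far = best_match((60.0, 90.0, 500.0, 12.0), REPOSITORY, threshold=2.0)
```

In this sketch the threshold controls the trade-off between tolerating small variations in a function and rejecting unrelated functions.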
Acknowledgment
We would like to thank our shepherd, Dr. Cristiano Giuffrida, and the anonymous reviewers for their valuable comments. This research is the result of a fruitful collaboration between the Security Research Center (SRC) of Concordia University, Defence Research and Development Canada (DRDC) and Google under a National Defence/NSERC Research Program.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Shirani, P., Wang, L., Debbabi, M. (2017). BinShape: Scalable and Robust Binary Library Function Identification Using Function Shape. In: Polychronakis, M., Meier, M. (eds) Detection of Intrusions and Malware, and Vulnerability Assessment. DIMVA 2017. Lecture Notes in Computer Science, vol 10327. Springer, Cham. https://doi.org/10.1007/978-3-319-60876-1_14
DOI: https://doi.org/10.1007/978-3-319-60876-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-60875-4
Online ISBN: 978-3-319-60876-1
eBook Packages: Computer Science, Computer Science (R0)