Abstract
File format vulnerabilities have drawn growing attention in recent years, and the effectiveness of fuzz testing depends heavily on knowledge of the target format. In this paper, we present systematic algorithms and methods for automatically reverse engineering input file formats. The methodology employs dynamic taint analysis to reveal implicit relations between the input file and the procedures of the binary that processes it; these relations are used to measure correlations among data bytes, to segment the format, and to infer data types. We have implemented a prototype, and general tests on 10 well-documented binary formats yielded an average successful identification rate of over 85%, while revealing more detailed structural information than coarse-grained format analysis. In addition, we discuss a practical pseudo-fuzzing evaluation method that reflects the real-world demands of security analysis; the evaluation results demonstrate the practical effectiveness of our system.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-015-1713-6/MediaObjects/500_2015_1713_Fig1_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-015-1713-6/MediaObjects/500_2015_1713_Fig2_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-015-1713-6/MediaObjects/500_2015_1713_Fig3_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-015-1713-6/MediaObjects/500_2015_1713_Fig4_HTML.gif)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs00500-015-1713-6/MediaObjects/500_2015_1713_Fig5_HTML.gif)
Notes
Since \(SimTT_{i,j}=SimTT_{j,i}\), only the lower triangular matrix is plotted.
Acknowledgments
This research did not involve human participants or animals. Sources of funding include the National Natural Science Foundation of China (Nos. 61170268, 61100047, and 61272493). There are no potential conflicts of interest.
Additional information
Communicated by V. Loia.
Cite this article
Cui, B., Wang, F., Hao, Y. et al. A taint based approach for automatic reverse engineering of gray-box file formats. Soft Comput 20, 3563–3578 (2016). https://doi.org/10.1007/s00500-015-1713-6
DOI: https://doi.org/10.1007/s00500-015-1713-6