A taint based approach for automatic reverse engineering of gray-box file formats

  • Methodologies and Application
Soft Computing

Abstract

File format vulnerabilities have been highlighted in recent years, and the performance of fuzzing tests relies heavily on knowledge of the target formats. In this paper, we present systematic algorithms and methods to automatically reverse engineer input file formats. The methodology employs dynamic taint analysis to reveal implicit relational information between the input file and binary procedures, which is used to measure correlations among data bytes, segment the format, and infer data types. We have implemented a prototype, and general tests on 10 well-published binary formats yielded an average successful identification rate of over 85%, while revealing more detailed structural information than coarse-grained format analysis. In addition, a practical pseudo-fuzzing evaluation method is discussed in accordance with real-world demands of security analysis, and the evaluation results demonstrate the practical effectiveness of our system.
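
As a rough illustration of how taint information could drive format segmentation, the sketch below groups adjacent file offsets whose bytes are read by similar sets of procedures. The Jaccard measure, the threshold value, the names `jaccard` and `segment_by_taint`, and the procedure ids in the sample trace are all assumptions made for illustration; this is not the algorithm from the paper.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def segment_by_taint(trace: Dict[int, Set[int]], size: int,
                     threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Group consecutive byte offsets into candidate fields.

    `trace` maps a file offset to the set of (hypothetical) procedure ids
    that read the byte at that offset; offsets missing from `trace` are
    treated as never read. Returns half-open (start, end) offset ranges.
    """
    if size <= 0:
        return []
    segments: List[Tuple[int, int]] = []
    start = 0
    for off in range(1, size):
        prev = trace.get(off - 1, set())
        cur = trace.get(off, set())
        # A drop in similarity between neighbours marks a field boundary.
        if jaccard(prev, cur) < threshold:
            segments.append((start, off))
            start = off
    segments.append((start, size))
    return segments


# Made-up trace: bytes 0-3 are read by procedure 1, bytes 4-6 by
# procedures 2 and 3, and byte 7 is never read.
trace = {0: {1}, 1: {1}, 2: {1}, 3: {1}, 4: {2, 3}, 5: {2, 3}, 6: {2, 3}}
print(segment_by_taint(trace, 8))
```

Comparing only neighbouring offsets keeps the sketch linear in the file size. A full pairwise similarity matrix over all offsets would be symmetric, so only one triangle of it would need to be computed.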

Notes

  1. Since \(SimTT_{i,j}=SimTT_{j,i}\), only the lower triangular matrix is plotted.

  2. http://www.sweetscape.com/010editor/templates.html.

Acknowledgments

The research did not involve human participants or animals. The sources of funding include the National Natural Science Foundation of China (Nos. 61170268, 61100047, and 61272493). There are no potential conflicts of interest.

Author information

Corresponding author

Correspondence to Baojiang Cui.

Additional information

Communicated by V. Loia.

About this article

Cite this article

Cui, B., Wang, F., Hao, Y. et al. A taint based approach for automatic reverse engineering of gray-box file formats. Soft Comput 20, 3563–3578 (2016). https://doi.org/10.1007/s00500-015-1713-6

  • DOI: https://doi.org/10.1007/s00500-015-1713-6
