
Automatic Discovery and Synthesis of Checksum Algorithms from Binary Data Samples

Published: 09 November 2020

Abstract

Reverse engineering unknown binary message formats is an important part of security research. Error-detecting codes such as checksums and cyclic redundancy checks (CRCs) are commonly added to messages as a guard against corrupt or untrusted input. Before an analyst can craft input for software that uses checksums, they must discover the algorithm that computes a valid checksum. To address this need, we have developed a program-synthesis-based approach for detecting and reverse engineering checksum algorithms automatically.
Our approach takes a small set of binary messages as input and automatically returns a Python implementation of the checksum algorithm if one can be found. It first searches over the message space to identify the location of the checksum and then uses program synthesis to identify the operations performed on the message to compute the checksum. We return runnable code to the user that both calculates a checksum from a message and validates a message according to the checksum algorithm. We also generate unit tests, allowing the user to confirm that the synthesized checksum algorithm is correct with respect to the input messages.
We created the Tufts Checksum Corpus, which comprises 12 checksum inference questions collected from posts on reverse engineering question-and-answer sites and 2 instances of common Internet protocol checksums.
Our approach successfully synthesized the underlying checksum algorithms for 12 out of 14 cases in our test suite.
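
The two-stage idea in the abstract (locate the checksum, then find the operations that produce it) can be illustrated with a much simpler stand-in. The sketch below is not the paper's synthesis procedure: it assumes a one-byte checksum at a fixed offset from the end of each message and enumerates a small, hypothetical set of candidate functions (`sum8`, `xor8`, `neg_sum8`) instead of synthesizing a program. All names in it are illustrative assumptions.

```python
from functools import reduce
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical candidate set: three common one-byte checksums.
# A real synthesizer would search a far larger space of programs.
CANDIDATES: Dict[str, Callable[[bytes], int]] = {
    "sum8": lambda data: sum(data) & 0xFF,
    "xor8": lambda data: reduce(lambda a, b: a ^ b, data, 0),
    "neg_sum8": lambda data: (-sum(data)) & 0xFF,
}


def split(message: bytes, offset: int) -> Tuple[bytes, int]:
    """Split a message into (payload, stored checksum byte).

    `offset` is negative and counted from the end, so -1 is the last byte.
    """
    pos = len(message) + offset
    return message[:pos] + message[pos + 1:], message[pos]


def find_checksum(messages: List[bytes]) -> Optional[Tuple[int, str]]:
    """Return the first (offset, candidate name) consistent with all messages."""
    if not messages:
        return None
    shortest = min(len(m) for m in messages)
    # Stage 1: try each possible checksum location, last byte first.
    for offset in range(-1, -shortest - 1, -1):
        # Stage 2: try each candidate function at that location.
        for name, func in CANDIDATES.items():
            pairs = [split(m, offset) for m in messages]
            if all(func(payload) == stored for payload, stored in pairs):
                return offset, name
    return None


def append_checksum(payload: bytes, name: str) -> bytes:
    """Compute the named checksum and append it as a trailing byte."""
    return payload + bytes([CANDIDATES[name](payload)])


def validate(message: bytes, offset: int, name: str) -> bool:
    """Check that the byte stored at `offset` matches the named checksum."""
    payload, stored = split(message, offset)
    return CANDIDATES[name](payload) == stored
```

Calling `find_checksum` on a handful of messages returns an offset and a candidate name, which can then be passed to `validate` for each input message, mirroring the role of the generated unit tests. With very few or very short messages, more than one candidate may fit the data, so a result from this sketch is only as trustworthy as the samples it was checked against.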

        Published In

        PLAS'20: Proceedings of the 15th Workshop on Programming Languages and Analysis for Security
        November 2020
        46 pages
        ISBN:9781450380928
        DOI:10.1145/3411506

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. binary data
        2. checksums
        3. fuzzing
        4. protocols
        5. reverse engineering
        6. synthesis

        Qualifiers

        • Research-article

        Funding Sources

        • Defense Advanced Research Projects Agency (DARPA)
        • Air Force Research Laboratory (AFRL)

        Conference

        CCS '20
        Acceptance Rates

        Overall Acceptance Rate 43 of 77 submissions, 56%
