Extracting Formats of Service Messages with Varying Payloads

Published: 01 February 2022

Abstract

Having precise specifications of service APIs is essential for many software engineering activities. Unfortunately, available service documentation is often inadequate or imprecise and hence cannot be fully relied upon. Generating service documentation manually is a tedious and error-prone task, especially as services change. There is therefore a need for automated support in generating service documentation. In this work, we present a novel approach to inferring the API of a service by analyzing recorded messages sent to and received from that service. Our approach includes a novel two-level clustering technique for grouping messages, a step that many existing message-format inference approaches fail to perform precisely when the payload information of the available messages varies significantly. We have evaluated our approach on message traces from four different real-world services. The experimental results show that our approach is more effective than existing techniques in extracting correct message formats from recorded messages.

        Published In

        ACM Transactions on Internet Technology  Volume 22, Issue 3
        August 2022
        631 pages
        ISSN:1533-5399
        EISSN:1557-6051
        DOI:10.1145/3498359
        Editor: Ling Liu

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 01 February 2022
        Accepted: 01 November 2021
        Revised: 01 September 2021
        Received: 01 April 2021
        Published in TOIT Volume 22, Issue 3

        Author Tags

        1. Service API
        2. format extraction
        3. positional keyword
        4. payload variation

        Qualifiers

        • Research-article
        • Refereed

        Funding Sources

        • Australian Research Council Linkage Project
