Extracting Formats of Service Messages with Varying Payloads

Authors: Md Arafat Hossain, Jean-Guy Schneider, Jiaojiao Jiang, Muhammad Ashad Kabir, Steve Versteeg

ACM Transactions on Internet Technology (TOIT), Volume 22, Issue 3

Article No.: 71, Pages 1 - 31

https://doi.org/10.1145/3503159

Published: 01 February 2022

Abstract

Having precise specifications of service APIs is essential for many software engineering activities. Unfortunately, available documentation of services is often inadequate or imprecise and hence cannot be fully relied upon. Generating service documentation manually is a tedious and error-prone task, especially when services change. Therefore, there is a need for automated support in generating service documentation. In this work, we present a novel approach to infer the API of a service by analyzing recorded messages sent to and received from this service. Our approach includes a novel two-level clustering technique to cluster messages, a step that many existing approaches to inferring message formats fail to perform precisely when the payload information of the available messages varies significantly. We have evaluated our approach on message traces from four different real-world services. The experimental results show that our approach is more effective than existing techniques in extracting correct message formats from recorded messages.

Index Terms

Extracting Formats of Service Messages with Varying Payloads
1. Networks
  1. Network protocols
  2. Network services
2. Security and privacy
  1. Network security

Published In

ACM Transactions on Internet Technology Volume 22, Issue 3

August 2022

631 pages

ISSN:1533-5399

EISSN:1557-6051

DOI:10.1145/3498359

Editor:
Ling Liu
Georgia Institute of Technology, USA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 February 2022

Accepted: 01 November 2021

Revised: 01 September 2021

Received: 01 April 2021

Published in TOIT Volume 22, Issue 3

Qualifiers

Research-article
Refereed

Funding Sources

Australian Research Council Linkage Project
