Chapter Six - Using clickstream data to enhance reverse engineering of Web applications

https://doi.org/10.1016/bs.adcom.2019.07.006

Abstract

Due to advances in Web technologies, existing Web applications are rewritten or replaced by new ones. As a result of either ad hoc or agile development, many of them lack proper technical documentation. Nevertheless, the domain knowledge built into these applications is valuable. Reverse engineering of existing Web applications is therefore becoming an important issue; it is an activity aimed at detecting software components and their interrelationships to provide multiple views of software systems at a higher level of abstraction.

Apart from static reverse engineering based on examining the system's source code, analyzing the dynamic aspect of Web applications often proves worthwhile. HTTP server log files are one important source of data on which the dynamic analysis of a Web application can be based. User sessions, which result from clickstream analysis and session reconstruction in particular, can be used as the basis for the first automatic step of reverse engineering, employed in order to gain quick insight into a Web application's source code.

It is shown how clickstream data can be used to reveal not only the intensity of connections between a Web application's individual source code artifacts but also the overall structure of the Web application. The extracted structure, based either on code artifact names or on their usage, is presented visually as an application transition graph (ATG), with code artifacts belonging to the same application module grouped together. Because session reconstruction is an inherently probabilistic process and thus in general produces noisy data, clustering the code artifacts becomes a challenging task. It is shown that multidimensional scaling and even a simple graph drawing approach yield a better representation of the ATG than hierarchical clustering.

The method was tested against the results obtained by an expert (the author of the Web application used as a test case). The method can also be used for verifying the structure obtained by manual reverse engineering of the application's source code.

Introduction

With the advance of Web technologies, many existing Web applications are rewritten or replaced by new ones, which means that they are totally reengineered. In many cases, the source code is available, but reliable and comprehensive documentation is missing.

There are various reasons why the documentation might be missing:

  1. Some Web applications were written in a hurry without using any proper software development methodology which would ensure proper documentation.

  2. The existing documentation might be insufficient as the application was produced using an agile software development methodology [1], which gives priority to working software over comprehensive documentation [2].

  3. Over an application's lifetime the source code is modified either during maintenance or when new functionality is added, and quite often these modifications are insufficiently documented or not documented at all.

In such cases, reverse engineering must all too often be performed to extract the business logic from the application's source code.

Reverse engineering is a software reengineering activity aimed at detecting software components and their interrelationships to provide multiple views of software systems at a higher level of abstraction [3]. It can be considered a post-comprehension method, in contrast to simulation, which is a pre-comprehension method [4]. The other two major reengineering activities are comprehension and enhancement of software systems. The former is about finding out what the software components detected during reverse engineering actually do, and the latter is about modifying and improving software systems.

Reverse engineering of a Web application seeks to understand its structure and functionality, and in the end it should produce a (visual) representation of the application's structure. A visual representation is preferred, as it has been shown that graph-based comprehension tools offer a more efficient basis for comprehension than, for example, relational database-based tools [5].

In many cases the number of components, i.e., individual Web pages and/or the behind-the-scenes code that generates Web pages, might be relatively large. Therefore, simply displaying connections between components is not enough. One should, if at all possible, find which components form a certain part of the Web application, e.g., modules implementing individual functionalities, so that later comprehension is easier.

The structure of the article is as follows. After the description of reverse engineering topics specific to Web applications in Section 2, Section 3 gives an overview of different approaches to reverse engineering. As this article aims at extracting the structure of a Web application using clickstreams, Section 4 introduces application transition graphs (ATGs), while Section 5 describes preprocessing of clickstreams and user sessions. The metrics used for clustering and visualization of ATGs are defined in Section 6, and Section 7 presents methods used for clustering and final visualization of ATGs.

Section snippets

On reverse engineering of Web applications

When the World Wide Web appeared in 1989/90, it consisted of a number of interconnected Web pages. Web applications appeared soon afterwards, and their development and deployment have only accelerated over time. Regarding the way a Web application interacts with a user on one side and with the server on the other side, Web applications can be divided into two main groups:

  1. Traditional Web applications

    As shown in Fig. 1 (left), traditional Web applications consist of (i) a Web browser on the client side and of

Related work

Due to the technological importance of reverse engineering, it is no surprise that it is a well-investigated field in software engineering and that many tools for reverse engineering have been produced so far. The list (by no means exhaustive) includes Moose [12], GUPRO [3], Columbus [13], CodeCrawler [14], SolidFX [15], and SQuAVisiT [16]. However, these tools are made for reverse engineering of traditional, i.e., non-Web applications written in languages like C, C++, Java, and Cobol (although

Application transition graph of a Web application

In reverse engineering, the structure of a Web application should be uncovered and its (visual) representation should be produced. To represent the structure of a Web application, the atomic section model (ASM) was selected in this study [61]. Initially, the ASM was defined for testing Web applications, but it has been shown that it is useful in reverse engineering as well [11].

The ASM consists of two parts. At the higher level of abstraction, it consists of the ATG which describes how
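To make the idea of a transition graph concrete, the following is a minimal sketch that builds a weighted directed graph from ordered page visits. It is an illustration only: the function name `build_atg`, the invented page names, and the choice of counting consecutive page pairs as edge weights are assumptions made for this example, not the ASM definition from [61].

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def build_atg(sessions: List[List[str]]) -> Tuple[Set[str], Dict[Edge, int]]:
    """Return (nodes, weighted edges) derived from ordered page visits.

    Each session is a list of page names in visiting order. The edge
    (a, b) is counted once every time page b directly follows page a.
    """
    nodes: Set[str] = set()
    edges: Counter = Counter()
    for session in sessions:
        nodes.update(session)
        # Pair every page with its immediate successor in the session.
        for src, dst in zip(session, session[1:]):
            edges[(src, dst)] += 1
    return nodes, dict(edges)


# Hypothetical sessions; the page names are invented for illustration.
example_sessions = [
    ["login.php", "courses.php", "exam_apply.php"],
    ["login.php", "courses.php", "grades.php"],
    ["login.php", "grades.php"],
]
atg_nodes, atg_edges = build_atg(example_sessions)
```

In this toy input the edge from `login.php` to `courses.php` gets weight 2, which is the kind of "intensity of connection" that a usage-based view of the application can expose.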

From raw clickstream data to user sessions

If an ATG of a Web application is automatically generated from the application's source code, i.e., using the static approach, the result obtained, such as the one shown in Fig. 2, is likely to be of little use. To improve the presentation of the ATG, the dynamic component of the information that can be gathered about the Web application must be taken into account. Web server access log files should therefore be considered as well.

The HTTP server usually serves multiple users at the same time and records
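One common heuristic for turning interleaved log records into user sessions can be sketched as follows. The sketch assumes log lines in the widely used "combined" access log format, identifies a client by IP address plus user agent, and starts a new session after 30 minutes of inactivity. All of these choices, as well as the names `parse_line` and `reconstruct_sessions` and the sample log lines, are assumptions for illustration and are not taken from the chapter.

```python
import re
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

# Assumed input: "combined" access log format lines.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"

Client = Tuple[str, str]


def parse_line(line: str) -> Optional[Tuple[Client, datetime, str]]:
    """Extract (client, timestamp, requested path) or None if unparsable."""
    m = LOG_PATTERN.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), TIME_FORMAT)
    client = (m.group("ip"), m.group("agent") or "")
    return client, when, m.group("path")


def reconstruct_sessions(
    lines: List[str], timeout: timedelta = timedelta(minutes=30)
) -> List[List[str]]:
    """Group requests per client and split them on long inactivity gaps."""
    by_client: Dict[Client, List[Tuple[datetime, str]]] = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip malformed records
        client, when, path = parsed
        by_client.setdefault(client, []).append((when, path))

    sessions: List[List[str]] = []
    for requests in by_client.values():
        requests.sort(key=lambda r: r[0])
        current: List[str] = []
        last_time: Optional[datetime] = None
        for when, path in requests:
            if last_time is not None and when - last_time > timeout:
                sessions.append(current)  # gap too long: close the session
                current = []
            current.append(path)
            last_time = when
        if current:
            sessions.append(current)
    return sessions


# Invented sample records using documentation-reserved IP addresses.
sample_log = [
    '192.0.2.1 - - [10/Oct/2019:13:55:36 +0200] "GET /login.php HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '192.0.2.7 - - [10/Oct/2019:13:56:02 +0200] "GET /login.php HTTP/1.1" 200 2326 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2019:13:57:10 +0200] "GET /courses.php HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
    '192.0.2.1 - - [10/Oct/2019:15:20:00 +0200] "GET /grades.php HTTP/1.1" 200 1200 "-" "Mozilla/5.0"',
]
sessions = reconstruct_sessions(sample_log)
```

Because several users can share an IP address and a single user can pause for longer than the timeout, any such heuristic only approximates the true sessions, which is why the reconstructed data should be treated as noisy.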

Metrics for clustering the ATG of a Web application

To cluster Web application pages and thus produce a relevant visualization of the application's ATG, the distance between these pages must be defined. Among many possible definitions of distance, two are used:

  1. the one based on page names and

  2. the other based on application usage data.

The first definition relies on the names found in the source code and thus exposes the static information about the Web application, whereas the second definition uses Web server access log files and exposes
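The contrast between a static, name-based distance and a dynamic, usage-based distance can be illustrated with the sketch below. The concrete formulas are assumptions chosen for simplicity and are not the metrics defined in the chapter: the name-based distance uses the string similarity ratio from Python's `difflib`, and the usage-based distance is the reciprocal of one plus the number of observed transitions between two pages in either direction. The page names and transition counts are invented.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def name_distance(a: str, b: str) -> float:
    """Static distance: 0.0 for identical names, larger for dissimilar ones."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def usage_distance(
    a: str, b: str, transitions: Dict[Tuple[str, str], int]
) -> float:
    """Dynamic distance: pages visited one after another more often are closer."""
    if a == b:
        return 0.0
    # Treat transitions in both directions as evidence of relatedness.
    count = transitions.get((a, b), 0) + transitions.get((b, a), 0)
    return 1.0 / (1.0 + count)


# Hypothetical transition counts, e.g. taken from reconstructed sessions.
observed = {
    ("login.php", "courses.php"): 9,
    ("courses.php", "grades.php"): 3,
}

d_used_together = usage_distance("login.php", "courses.php", observed)
d_never_together = usage_distance("login.php", "grades.php", observed)
d_similar_names = name_distance("exam_apply.php", "exam_cancel.php")
d_different_names = name_distance("exam_apply.php", "grades.php")
```

With these assumed definitions, pages that share a naming prefix end up close under the first distance even if users never visit them together, whereas the second distance depends only on what the logs record.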

Clustering and visualizing a Web application's ATG

Out of many different clustering and visualizing methods, the following three were used to cluster and visualize the Web application's ATG in this study:

  • hierarchical clustering,

  • the graph drawing algorithm proposed by Kamada and Kawai [83], and

  • multidimensional scaling (MDS) [84].

In these methods, the distances as defined earlier were used.

Hierarchical clustering was used first because it combines objects at lower levels into clusters on higher levels. Since what was sought was the multilevel
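The bottom-up merging that hierarchical clustering performs can be sketched in a few lines. The version below uses average linkage, which is an assumption, since the excerpt does not say which linkage criterion was used. The function name `agglomerate`, the page names, and the toy distance table are likewise invented for illustration.

```python
from itertools import combinations
from typing import Callable, FrozenSet, List, Tuple

Cluster = FrozenSet[str]
Merge = Tuple[Cluster, Cluster, float]


def agglomerate(
    items: List[str], distance: Callable[[str, str], float]
) -> List[Merge]:
    """Average-linkage agglomerative clustering; returns the merge history."""

    def linkage(c1: Cluster, c2: Cluster) -> float:
        # Mean pairwise distance between members of the two clusters.
        total = sum(distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    clusters: List[Cluster] = [frozenset([item]) for item in items]
    merges: List[Merge] = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters and merge them.
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((c1, c2, linkage(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return merges


# Toy distances: two tight pairs, everything else far apart.
toy_table = {
    frozenset(["exam_apply.php", "exam_cancel.php"]): 0.1,
    frozenset(["grades.php", "grades_export.php"]): 0.2,
}


def toy_distance(a: str, b: str) -> float:
    return toy_table.get(frozenset([a, b]), 0.9)


pages = ["exam_apply.php", "grades.php", "exam_cancel.php", "grades_export.php"]
history = agglomerate(pages, toy_distance)
```

The merge history is effectively a dendrogram: early merges at small distances suggest candidate modules, while late merges at large distances join unrelated parts of the application. With noisy distances the early merges are the ones most easily disturbed.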

Conclusion

Reverse engineering of old Web applications with inadequate or nonexistent documentation can be challenging. In this text we presented a reverse engineering approach for acquiring a model of a Web application from Web log files. We have shown that Web log files can be a useful source for the reverse engineering process, and that letting reverse engineers build the final application model incrementally may significantly enhance their understanding. In addition, as seen in our

Marko Poženel is a Teaching Assistant at the Faculty of Computer and Information Science at the University of Ljubljana, Ljubljana, Slovenia. His teaching and research interests include agile software development methods, empirical software engineering as well as Web data mining and user behavior analysis. He received his Ph.D. in Computer Science from the University of Ljubljana in 2010.

References (84)

  • R. Pérez-Castillo et al., A family of case studies on business process mining using MARBLE, J. Syst. Softw. (2012)
  • F. Trias et al., Migrating traditional web applications to CMS-based web applications, Electron. Notes Theor. Comput. Sci. (2015)
  • A.S. Bozkir et al., Layout-based computation of web page similarity ranks, Int. J. Hum. Comput. Stud. (2018)
  • J.M. Conejero et al., Re-engineering legacy web applications into RIAs by aligning modernization requirements, patterns and RIA features, J. Syst. Softw. (2013)
  • A. Shatnawi et al., Reverse engineering reusable software components from object-oriented APIs, J. Syst. Softw. (2017)
  • M. Levene et al., Why is the snowflake schema a good data warehouse design?, Inf. Syst. (2003)
  • M. Munk et al., Data preprocessing evaluation for web log mining: reconstruction of activities of a web visitor, Procedia Comput. Sci. (2010)
  • N.R. Carvalho et al., From source code identifiers to natural language terms, J. Syst. Softw. (2015)
  • T. Kamada et al., An algorithm for drawing general undirected graphs, Inf. Process. Lett. (1989)
  • R.C. Martin, Agile Software Development, Principles, Patterns, and Practices (2002)
  • K. Beck, M. Beedle, A. Van Bennekum, A. Cockburn, W. Cunningham, M. Fowler, J. Grenning, J. Highsmith, A. Hunt, R....
  • J. Ebert et al., GUPRO—generic understanding of programs (an overview), Electron. Notes Theor. Comput. Sci. (2002)
  • B. Nikolic et al., A survey and evaluation of simulators suitable for teaching courses in computer architecture and organization, IEEE Trans. Educ. (2009)
  • C. Lange et al., Comparing graph-based program comprehension tools to relational database-based tools
  • I.-H. Ting et al., A pattern restore method for restoring missing patterns in server side clickstream data
  • L.D. Paulson, Building rich web applications with Ajax, Computer (2005)
  • C.T. Lopes et al., Higher education web information system usage analysis with a data webhouse, ICCSA (4) (2006)
  • I. Rožanc et al., Using reverse engineering to construct the platform independent model of a web application for student information systems, Comput. Sci. Inform. Syst. (2013)
  • S. Ducasse et al., MOOSE: an extensible language-independent environment for reengineering object-oriented systems
  • R. Ferenc et al., Columbus—reverse engineering tool and schema for C++
  • M. Lanza et al., Polymetric views—a lightweight visual approach to reverse engineering, IEEE Trans. Softw. Eng. (2003)
  • A. Telea et al., An interactive reverse engineering environment for large-scale C++ code
  • M. van den Brand et al., SQuAVisiT: a flexible tool for visual software analytics
  • J. Martin et al., Web site maintenance with software-engineering tools
  • F. Ricca et al., Understanding and restructuring web sites with ReWeb, IEEE MultiMedia (2001)
  • M. Monroy et al., An approach to recovery and analysis of architectural behavioral views
  • P. Tramontana et al., Reverse engineering techniques: from web applications to rich internet applications
  • C. Riva et al., Combining static and dynamic views for architecture reconstruction
  • A. Bergmayr et al., fREX: fUML-based reverse engineering of executable behavior for software dynamic analysis
  • M.L. Bernardi et al., Model driven evolution of web applications
  • L.A.P. Rabelo et al., An approach to business process recovery from source code
  • K. Garcés et al., White-box modernization of legacy applications: the Oracle forms case study, Comput. Stand. Interfaces (2017)

Boštjan Slivnik is an Assistant Professor of Computer Science at the University of Ljubljana, Faculty of Computer and Information Science, where he received the M.Sc. and Ph.D. degrees in Computer Science in 1996 and 2003, respectively. His research interests include compilers and programming languages with the special focus on parsing algorithms and formal languages, scheduling and distributed algorithms, and software engineering. He has been a member of the ACM since 1996.
