Chapter Six - Using clickstream data to enhance reverse engineering of Web applications
Introduction
With the advance of Web technologies, many existing Web applications are rewritten or replaced by new ones, which means that they are totally reengineered. In many cases, the source code is available, but reliable and comprehensive documentation is missing.
There are various reasons why the documentation might be missing:
1. Some Web applications were written in a hurry without using any proper software development methodology which would ensure proper documentation.
2. The existing documentation might be insufficient as the application was produced using agile software development methodology [1] which gives priority to working software over comprehensive documentation [2].
3. In an application's lifetime the source code was modified either during maintenance or when new functionalities were added, but quite often many modifications are insufficiently documented or not documented at all.
It happens all too often that in such cases, reverse engineering must be performed to extract business logic from the application's source code.
Reverse engineering is a software reengineering activity aimed at detecting software components and their interrelationships to provide multiple views of software systems at a higher level of abstraction [3]. It could be considered a post-comprehension method if compared to simulation, which is a pre-comprehension method [4]. The other two major reengineering activities are comprehension and enhancement of software systems. The former is about finding out what the software components detected during reverse engineering actually do, and the latter is about modifying and improving software systems.
Reverse engineering of a Web application seeks to understand its structure and functionality, and in the end it should produce a (visual) representation of the application's structure. A visual representation is preferred as it has been shown that graph-based comprehension tools offer a more efficient basis for comprehension than, for example, relational database-based tools [5].
In many cases the number of components, i.e., individual Web pages and/or behind the scene codes for generating Web pages, might be relatively large. Therefore, simply displaying connections between components is not enough. One should, if at all possible, find which components form a certain part of the Web application, e.g., modules implementing individual functionalities, so that later comprehension is easier.
The structure of the article is as follows. After the description of reverse engineering topics specific to Web applications in Section 2, Section 3 gives an overview of different approaches to reverse engineering. As this article aims at extracting the structure of a Web application using clickstreams, Section 4 introduces application transition graphs (ATG), while Section 5 describes preprocessing of clickstreams and user sessions. The metrics used for clustering and visualization of ATGs are defined in Section 6, and Section 7 presents methods used for clustering and final visualization of ATGs.
Section snippets
On reverse engineering of Web applications
When the World Wide Web appeared in 1989/90, it consisted of a number of interconnected Web pages. Web applications appeared soon after, and their development and deployment have only accelerated over time. Regarding the way a Web application interacts with a user on one side and with the server on the other side, Web applications can be divided into two main groups:
1. Traditional Web applications
As shown in Fig. 1 (left), traditional Web applications consist of (i) a Web browser on the client side and of
Related work
Due to the technological importance of reverse engineering, it is no surprise that it is a well-investigated field in software engineering and that many tools for reverse engineering have been produced so far. The list (by no means exhaustive) includes Moose [12], GUPRO [3], Columbus [13], CodeCrawler [14], SolidFX [15], and SQuAVisiT [16]. However, these tools are made for reverse engineering of traditional, i.e., non-Web applications written in languages like C, C++, Java, and Cobol (although
Application transition graph of a Web application
In reverse engineering, the structure of a Web application should be uncovered and its (visual) representation should be produced. To represent the structure of a Web application, the atomic section model (ASM) was selected in this study [61]. Initially, the ASM was defined for testing Web applications, but it has been shown that it is useful in reverse engineering as well [11].
The ASM consists of two parts. At the higher level of abstraction, it consists of the ATG which describes how
From raw clickstream data to user sessions
If an ATG of a Web application is automatically generated from the application's source code, i.e., using the static approach, the result, such as the one shown in Fig. 2, is likely to be of no use. To improve the presentation of the ATG, the dynamic information that can be gathered about the Web application must be taken into account. Web server access log files should therefore be considered as well.
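The step from raw access log records to user sessions can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the text: the log is in the Common Log Format, a user is identified by the client host alone, and a session ends after 30 minutes of inactivity. The names `parse_line`, `build_sessions`, and `SESSION_TIMEOUT` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident user [time] "METHOD path PROTO" status size
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumed inactivity threshold; the source does not state a value here.
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_line(line):
    """Return (host, timestamp, path), or None if the line does not match."""
    m = LOG_LINE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return m.group("host"), ts, m.group("path")


def build_sessions(lines, timeout=SESSION_TIMEOUT):
    """Group requests by host and split them where the gap exceeds the timeout."""
    by_host = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            host, ts, path = rec
            by_host[host].append((ts, path))

    sessions = []
    for hits in by_host.values():
        hits.sort(key=lambda h: h[0])          # order each user's requests by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, path) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:         # long pause: close the session
                sessions.append(current)
                current = []
            current.append(path)
        sessions.append(current)
    return sessions
```

Each resulting session is an ordered list of requested paths. Identifying users by host alone is a simplification, since several users may share one address.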
The HTTP server usually serves multiple users at the same time and records
Metrics for clustering the ATG of a Web application
To cluster Web application pages and thus relevantly visualize the application's ATG, the distance between these pages must obviously be defined. Among many possible definitions of distance, two definitions are used:
1. the one based on page names and
2. the other based on application usage data.
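To make the two kinds of distance concrete, the following sketch shows one possible form of each. Both formulas are assumptions made for illustration and are not the definitions used in the study: the name-based distance is taken as one minus the string similarity ratio from Python's `difflib`, and the usage-based distance decreases with the number of transitions observed between two pages in user sessions. The function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def name_distance(a, b):
    """Assumed name-based distance: 1 minus the difflib similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def transition_counts(sessions):
    """Count undirected transitions between consecutive pages in each session."""
    counts = Counter()
    for session in sessions:
        for src, dst in zip(session, session[1:]):
            if src != dst:                      # ignore page reloads
                counts[frozenset((src, dst))] += 1
    return counts


def usage_distance(a, b, counts):
    """Assumed usage-based distance: shrinks as observed transitions grow."""
    if a == b:
        return 0.0
    return 1.0 / (1.0 + counts[frozenset((a, b))])


# Toy sessions; the page names are invented for the example.
sessions = [["login.php", "exam_list.php", "exam_edit.php"],
            ["login.php", "exam_list.php"]]
counts = transition_counts(sessions)
print(name_distance("exam_list.php", "exam_edit.php"))
print(usage_distance("login.php", "exam_list.php", counts))
```

Under these assumptions, pages with similar names or pages that users frequently visit one after another end up close together, which is the property a clustering step needs.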
The first definition relies on the names found in the source code and thus exposes the static information about the Web application, whereas the second definition uses Web server access log files and exposes
Clustering and visualizing Web application's ATG
Out of many different clustering and visualizing methods, the following three were used to cluster and visualize the Web application's ATG in this study:
- hierarchical clustering,
- the graph drawing algorithm proposed by Kamada and Kawai [83], and
- multidimensional scaling (MDS) [84].
In these methods, the distances as defined earlier were used.
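How hierarchical clustering builds higher-level clusters from a pairwise distance can be shown with a small agglomerative sketch. The linkage criterion (average linkage) is an assumption, as is the toy distance table; neither is taken from the study, and `hierarchical_clustering` is a hypothetical name.

```python
def hierarchical_clustering(items, distance):
    """Agglomerative clustering with average linkage (linkage is an assumption).

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    lowest level first; clusters are tuples of items.
    """
    clusters = [(item,) for item in items]
    merges = []

    def linkage(c1, c2):
        # Average of all pairwise distances between the two clusters.
        total = sum(distance(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Toy distance table; page names and values are invented for the example.
TABLE = {
    frozenset(("login", "logout")): 0.1,
    frozenset(("exam_list", "exam_edit")): 0.2,
}


def toy_distance(a, b):
    return TABLE.get(frozenset((a, b)), 0.9)


for left, right, d in hierarchical_clustering(
        ["login", "logout", "exam_list", "exam_edit"], toy_distance):
    print(left, "+", right, "at", d)
```

The merge history forms a dendrogram: cutting it at a chosen distance yields groups of pages that can be drawn as modules of the application. This quadratic search is adequate only for small page sets.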
Hierarchical clustering was used first because it combines objects at lower levels into clusters on higher levels. Since what was sought was the multilevel
Conclusion
Reverse engineering of old Web applications with inadequate or nonexistent documentation can be challenging. In this text we presented a reverse engineering approach for acquiring a model of a Web application from Web log files. We have shown that Web log files can be a useful source for the reverse engineering process, and that letting reverse engineers build the final application model incrementally may significantly enhance their understanding. In addition, as seen in our
Marko Poženel is a Teaching Assistant at the Faculty of Computer and Information Science at the University of Ljubljana, Ljubljana, Slovenia. His teaching and research interests include agile software development methods, empirical software engineering as well as Web data mining and user behavior analysis. He received his Ph.D. in Computer Science from the University of Ljubljana in 2010.
References (84)
- et al. A systematic approach to generation of new ideas for PhD research in computing.
- et al. Characterizing user navigation and interactions in online social networks. Inf. Sci. (2012).
- et al. Rigi—an environment for software reverse engineering, exploration, visualization, and redocumentation. Sci. Comput. Program. (2010).
- et al. Systematic literature reviews in software engineering—a tertiary study. Inf. Softw. Technol. (2010).
- et al. Migration of information systems in the Italian industry: a state of the practice survey. Inf. Softw. Technol. (2011).
- et al. Automated extraction of GUI models for testing.
- Chapter 1—Reverse-engineering software behavior.
- et al. MoDisco: a model driven reverse engineering framework. Inf. Softw. Technol. (2014).
- et al. Knowledge discovery metamodel-ISO/IEC 19506: a standard to modernize legacy systems. Comput. Stand. Interfaces (2011).
- et al. Business process archeology using MARBLE. Inf. Softw. Technol. (2011).
- A family of case studies on business process mining using MARBLE. J. Syst. Softw.
- Migrating traditional web applications to CMS-based web applications. Electron. Notes Theor. Comput. Sci.
- Layout-based computation of web page similarity ranks. Int. J. Hum. Comput. Stud.
- Re-engineering legacy web applications into RIAs by aligning modernization requirements, patterns and RIA features. J. Syst. Softw.
- Reverse engineering reusable software components from object-oriented APIs. J. Syst. Softw.
- Why is the snowflake schema a good data warehouse design? Inf. Syst.
- Data preprocessing evaluation for web log mining: reconstruction of activities of a web visitor. Procedia Comput. Sci.
- From source code identifiers to natural language terms. J. Syst. Softw.
- An algorithm for drawing general undirected graphs. Inf. Process. Lett.
- Agile Software Development, Principles, Patterns, and Practices.
- GUPRO—generic understanding of programs (an overview). Electron. Notes Theor. Comput. Sci.
- A survey and evaluation of simulators suitable for teaching courses in computer architecture and organization. IEEE Trans. Educ.
- Comparing graph-based program comprehension tools to relational database-based tools.
- A pattern restore method for restoring missing patterns in server side clickstream data.
- Building rich web applications with Ajax. Computer.
- Higher education web information system usage analysis with a data webhouse. ICCSA (4).
- Using reverse engineering to construct the platform independent model of a web application for student information systems. Comput. Sci. Inform. Syst.
- MOOSE: an extensible language-independent environment for reengineering object-oriented systems.
- Columbus—reverse engineering tool and schema for C++.
- Polymetric views—a lightweight visual approach to reverse engineering. IEEE Trans. Softw. Eng.
- An interactive reverse engineering environment for large-scale C++ code.
- SQuAVisiT: a flexible tool for visual software analytics.
- Web site maintenance with software-engineering tools.
- Understanding and restructuring web sites with ReWeb. IEEE MultiMedia.
- An approach to recovery and analysis of architectural behavioral views.
- Reverse engineering techniques: from web applications to rich internet applications.
- Combining static and dynamic views for architecture reconstruction.
- fREX: fUML-based reverse engineering of executable behavior for software dynamic analysis.
- Model driven evolution of web applications.
- An approach to business process recovery from source code.
- White-box modernization of legacy applications: the Oracle forms case study. Comput. Stand. Interfaces.
Boštjan Slivnik is an Assistant Professor of Computer Science at the University of Ljubljana, Faculty of Computer and Information Science, where he received the M.Sc. and Ph.D. degrees in Computer Science in 1996 and 2003, respectively. His research interests include compilers and programming languages with the special focus on parsing algorithms and formal languages, scheduling and distributed algorithms, and software engineering. He has been a member of the ACM since 1996.