1 Introduction

Web applications are among the most challenging software systems to test [27]. On the one hand, developing web applications is becoming easier thanks to recent frameworks (e.g., AngularJS, see footnote 1), which hide the complexity behind an expressive and readable web programming environment and allow even novice programmers to quickly develop highly interactive and complex applications. On the other hand, this comes at a price: the programmers’ inexperience with error-prone languages like JavaScript and the combination of new technologies may introduce new kinds of faults, which have unpredictable effects and are hard to detect [18].

End-to-end (E2E) test automation is commonly adopted in such contexts, often justified by continuous integration and test-driven approaches. Test scripts simulate typical end-users’ interactions by delivering mouse clicks and keystrokes to the browser at a pace that would likely be infeasible manually. The GUI responses are recorded and validated through assertions to check the web application for functional correctness.

A disadvantage of test automation is the poor maintainability of the test code throughout the development process. In fact, test scripts are often highly customised and coupled with the technical details of the underlying web pages, which makes them quite difficult to read and maintain when features are added or altered in the web application under test. Web testers try to prevent these issues by using the Page Object design pattern, which provides a simplified interface towards the web application. All the technicalities the test scripts refer to, such as low-level operations or web element locators (e.g., an XPath to select an input field [14]), are moved to the page objects. The test code is thus separated from the implementation details, because test scripts interface with page object methods, rather than directly with web page elements.

Building page objects for web applications is an activity which is performed manually [25]. Our prototype tool Apogen [24] is the first solution able to provide a considerable degree of automation, hence reducing the effort for the creation of page objects. However, the initial version of Apogen [25] suffered from two major limitations: (1) in the presence of highly dynamic web pages, it creates a huge number of page objects that should be conceptually regarded as a single page object; (2) it does not support the creation of getter methods in any way. A getter method retrieves textual portions of a web page that can be used to verify the behaviour of the web application (e.g., with assertions) through the results displayed to the user.

In this paper, we overcome such limitations with the following novel contributions, implemented in the new version of the tool Apogen:

  • the automatic detection of cloned and semantically similar web pages, based on clustering, to be associated with the same page object;

  • the Cluster Visual Editor (CVE), a web-based interactive cluster visualiser and editor, allowing the tester to inspect and modify the clustering results;

  • the automatic creation of page object getter methods, capable of detecting and reporting Document Object Model (DOM) differences observed between web pages within the same cluster.

We have applied Apogen to six web applications and we have studied how different clustering algorithms, working on different syntactic features (e.g., DOM), are able to group similar web pages that should be conceptually mapped onto a single page object. Our results indicate that: (1) hierarchical clustering provides clusters of web pages close to those manually produced by a human, (2) 75 % of the code generated by Apogen can be used as-is by the tester, reducing the manual effort for page object creation and (3) 84 % of the automatically generated getter methods correspond to methods the tester needs when creating test case assertions.

The paper is organised as follows: Sect. 2 provides some background on the Page Object design pattern, our original tool Apogen, and the challenges in the automatic creation of page objects for web applications. Section 3 describes the clustering-aided version of Apogen and the features we evaluated in our experiment. Section 4 presents the quantitative and qualitative results of the experiment we conducted to evaluate the effectiveness of our approach. Section 5 describes the related work, while conclusions and future work are drawn in Sect. 6.

Fig. 1. AngularJS Phonecat web application (left), with its abstract representation in terms of Web Elements and Functionalities (Navigations, Actions, Getters) (center), and associated Java page objects (right)

2 Background

In web development, E2E functional testing is a widely adopted practice thanks to the increased popularity of powerful test automation tools, such as Selenium (footnote 2). Automated tests created with these tools operate by instructing a browser to click or type on page elements. Whereas the biggest advantage is an accurate simulation of the user’s behaviour, one of the major drawbacks is that such tests tend to be fragile and highly coupled with the web pages. To prevent this, testers use the Page Object design pattern.

A page object is a class that abstracts a web page, hiding the technical details of how the test code interacts with the underlying web page behind a more readable and business-focused facade. This brings two main advantages: (i) tests are more readable, and (ii) the page access logic is centralised in one place, making test suite maintenance easier [9, 10].

Let us consider the running example in Fig. 1, based on the AngularJS Phonecat web application (footnote 3), one of the experimental objects considered in this work. The top-left part shows the home page, displaying a list of phones (we limited the figure to two), whereas the bottom-left part shows the web pages obtained after clicking on the links in the home page. The central part shows the web page abstractions that the page objects should provide, i.e., the web elements test cases may interact with and the functionalities the two pages offer (action, navigation, and getter functionalities). The right part shows the page object representations of such pages, written in Java. We can notice how the two web pages displaying phone details for two phones (bottom-center part) have exactly the same abstract representation: only their textual content varies. Correspondingly, only one Java page object can represent both of them (class PhonePage, bottom-right).

In Fig. 2 we can see an example of how page objects improve the readability of the test cases by encapsulating functionalities. The test steps shown on the left (without page objects) are directly coupled with the web page internals, while the steps on the right (with page objects) map directly to human-readable steps of a use case scenario (e.g., open the index page, go to the first phone page, etc.).

Fig. 2. Two test cases created to test the Add Owner functionality of PetClinic. On the left is the test code without the adoption of page objects, whereas on the right is the same test case, using the automatic page objects generated by Apogen
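To make the contrast concrete, the following is a minimal sketch of a test written against page objects. It is written in Python rather than the Java produced by Apogen, and it does not use Selenium WebDriver: `FakeBrowser`, `HomePage`, `PhonePage`, the locators and the phone name are all hypothetical stand-ins introduced only for illustration.

```python
class FakeBrowser:
    """Hypothetical stand-in for a WebDriver: maps locators to text, records clicks."""

    def __init__(self, pages):
        self.pages = pages      # page name -> {locator: text}
        self.current = "home"
        self.clicks = []

    def click(self, locator):
        self.clicks.append(locator)
        # In this toy model, the text behind a link is the name of the target page.
        self.current = self.pages[self.current][locator]

    def text_of(self, locator):
        return self.pages[self.current][locator]


class PhonePage:
    """Page object for a phone detail page: locators live here, not in the tests."""

    NAME = "//h1"  # assumed locator

    def __init__(self, browser):
        self.browser = browser

    def get_name(self):  # getter method
        return self.browser.text_of(self.NAME)


class HomePage:
    """Page object for the home page."""

    FIRST_PHONE_LINK = "//ul/li[1]/a"  # assumed locator

    def __init__(self, browser):
        self.browser = browser

    def go_to_first_phone(self):  # navigational method
        self.browser.click(self.FIRST_PHONE_LINK)
        return PhonePage(self.browser)


# The test reads as a sequence of use case steps and never mentions a locator.
browser = FakeBrowser({
    "home": {HomePage.FIRST_PHONE_LINK: "phone-a"},
    "phone-a": {PhonePage.NAME: "Phone A"},
})
phone_page = HomePage(browser).go_to_first_phone()
print(phone_page.get_name())
```

If a locator changes, only the corresponding page object constant needs to be updated, while the test steps stay untouched.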

2.1 First Steps in the Automatic Generation of Page Objects

Our tool Apogen is the first effective prototype able to automatically generate a set of page objects for a web application. Apogen consists of three main modules: a Crawler, a Static Analyser, and a Code Generator. The input is a web application, together with the data (e.g., login credentials) required to navigate it. The output is a set of Java files, representing a code abstraction of the web application, organised using the Page Object and Page Factory (footnote 4) design patterns, as supported by the Selenium WebDriver framework.

The Crawler generates a state-based model (graph) of the web application, in which nodes are dynamic states of the web pages and edges are event-based transitions between nodes. In particular, we use Crawljax [17], a state-of-the-art open source Java tool for automatically crawling highly dynamic web applications. The Static Analyser of Apogen takes the Crawler outputs and, for each dynamic state, builds an abstract, object-based representation of the state. The graph and the DOMs are parsed to collect the necessary information for building comprehensive and readable classes. The class name is generated from the URL, whereas the web elements stimulated by the Crawler are inserted as WebElement instances in the page object class. For each of them, a meaningful variable name and a locator (XPath or CSS) are generated. As for the methods, each transition in the graph is turned into a navigational method between pages, and every data-submitting form is parsed to acquire information about its web elements and the associated functionality. The output of the Static Analyser is an abstract representation of the web pages and their interactions. In the last step, the Code Generator transforms such model into a set of Java page objects, tailored for the Selenium WebDriver framework.

2.2 Major Limitations

While experimenting with the initial version of Apogen, we noticed two major issues that limit its applicability [25]. The first issue depends on the default state abstraction of the Crawler, which is affected by minor UI changes. Indeed, Crawljax was designed to perform an extensive exploration and when it visits the same page filled with different input data, it often creates different dynamic states, even though the page is conceptually the same. We refer to these duplicate pages as “clones” (e.g., the two phone detail pages of the running example of Fig. 1 bottom-left). As a direct consequence, when crawling a non-trivial application, the size of the extracted model is often huge, with Apogen generating a large number of page objects, some of which are conceptually clones of each other. The second issue is the lack of assistance in the automatic creation of getter methods, necessary when defining test case assertions.

3 Clustering-Aided Page Object Generation

In order to address the limitations discussed in Sect. 2.2, we applied clustering as a post-processing technique (after the crawling phase) with three aims: (1) grouping pages related to the same functionality, e.g., all the pages concerning user authentication; (2) grouping clone pages, i.e., different versions of the same page, only differing by minor, dynamic details, such as the textual content (see, for instance, those in Fig. 1 bottom-left); (3) exploiting the differences between clones to retrieve information useful for getter methods.

We extended Apogen with an additional module, the Clusterer, which runs a clustering algorithm over the Crawler output. We opted for three popular clustering algorithms from the literature: K-means++ [1], Hierarchical Agglomerative [7], and K-medoids [6]. For the first two we used the implementations available from the popular Java machine learning library Weka [30], whereas K-medoids was not available, thus we implemented it from scratch. The Clusterer is able to automatically calculate different kinds of syntactic feature matrixes from the web pages (e.g., tag frequency), that are then used by the clustering algorithms to compute the similarities.

Since there is no perfect clustering technique working for all web applications, the result might be somewhat imprecise and might need to be manually refined. To this aim, Apogen supports the tester with the Cluster Visual Editor (CVE), an interactive cluster visualisation and editing facility, allowing testers to inspect and modify the clustering results, as shown in Fig. 3.

Fig. 3. Cluster Visual Editor (CVE), a web-based tool developed using the D3 library

3.1 Feature Extraction and Matrix Creation

Clustering algorithms rely on the concept of similarity between web pages. A number of works have studied the factors affecting web page similarity [2, 23, 26], observing that structural features are related to semantic properties of the data and provide meaningful means of comparison between web pages. The Clusterer considers the following features: Tag Frequency, Word Frequency, URL and Document Object Model (DOM).

Tag Frequency (TF) measures the frequency at which tags occur in a web page. The general intuition is that such frequency provides an indication of the general layout and structure, and may be effective for detecting structurally similar web pages. TF for a web application W is calculated as follows: (1) extract a Tag List TL of the tags from all the pages in W; (2) for each page \(p \in W\) and for each tag \(t \in \) TL, calculate TF(t, p), as the normalised frequency of occurrence of tag t in page p (after min-max normalisation); and (3) create the output matrix TL \(\times W\) of the normalised TF values.
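The three steps above can be sketched as follows. This is an illustrative sketch, not Apogen's implementation: the function name `tag_frequency_matrix` is hypothetical, the matrix is stored with one row per page (the transpose of TL \(\times W\)), and the min-max normalisation is assumed to be applied per tag across all pages, which the text does not specify.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts the opening tags found in one HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def tag_frequency_matrix(pages):
    """Return (tag_list, matrix) with one row per page and one column per tag."""
    per_page = []
    for html in pages:
        parser = TagCounter()
        parser.feed(html)
        per_page.append(parser.counts)

    tag_list = sorted(set().union(*per_page))                      # step (1)
    raw = [[counts[t] for t in tag_list] for counts in per_page]   # step (2), raw counts

    # Assumption: min-max normalisation per tag (column), across all pages.
    matrix = [row[:] for row in raw]
    for j in range(len(tag_list)):
        column = [row[j] for row in raw]
        lo, hi = min(column), max(column)
        for i in range(len(raw)):
            matrix[i][j] = 0.0 if hi == lo else (raw[i][j] - lo) / (hi - lo)
    return tag_list, matrix                                        # step (3)


pages = [
    "<html><body><div><p>a</p><p>b</p></div></body></html>",
    "<html><body><div></div><div></div></body></html>",
]
tags, tf = tag_frequency_matrix(pages)
print(tags)
print(tf)
```

Tags that occur equally often in every page (here `html` and `body`) are mapped to 0 and therefore do not contribute to the distance between pages.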

Word Frequency. The textual content of a page captures information that may be salient for such web page. We assume that two web pages sharing similar textual content have some degree of topical relatedness and thus should be grouped together. The Clusterer can calculate the word frequency in two ways, considering: (1) only words within the tag body (WF1); (2) only words within the tags title, h1–h6, table, li–ul–ol (WF2). With the former we take into account the full main content of the page, whereas with the latter we follow the intuition that these tags may contain a succinct representation of the page semantic content [26]. WF1 and WF2 for a web application W are calculated as follows: (1) extract the Word List WL including the words from all pages in W; (2) remove stop-words (footnote 5) from WL; (3) for each page \(p \in W\) and for each word \(w \in \) WL, calculate WF1(w, p) and WF2(w, p), as the normalised frequency of occurrence of word w found in the page p within tag body or within tags title, h1–h6, table, li–ul–ol, respectively (after min-max normalisation); and (4) create two output matrixes WL \(\times W\), in our study associated respectively with WF1 and WF2.

URL (Uniform Resource Locator) may also be a good indicator of similarity between web pages [23]. Two pages sharing a part of the URL are likely to be semantically close. Although this is not always true (e.g., Ajax single page applications), there are works showing the effectiveness of URLs for structural clustering [2]. Parameters are stripped before computing the Levenshtein distance [15], to reduce their potentially disruptive effects. Given W as the set of web pages of the web application, the output is a matrix \(W\times W\) (later indicated as URL-Lev) of values ranging in [0..1], where an entry equal to 0 indicates two totally dissimilar URLs, while 1 indicates a perfectly matching pair of URLs.
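A possible way to obtain such a matrix is sketched below. The text does not state how the Levenshtein distance is mapped to [0..1], so the normalisation used here (one minus the distance divided by the length of the longer URL) is an assumption, as is the choice to strip only the query string; the function names are hypothetical.

```python
from urllib.parse import urlsplit


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        previous = current
    return previous[-1]


def strip_parameters(url):
    """Drop the query string, keeping scheme, host, path and fragment."""
    return urlsplit(url)._replace(query="").geturl()


def url_similarity(u1, u2):
    """Assumed normalisation: 1 - distance / max length, so 1 means identical."""
    a, b = strip_parameters(u1), strip_parameters(u2)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def url_lev_matrix(urls):
    """W x W matrix of pairwise URL similarities."""
    return [[url_similarity(u, v) for v in urls] for u in urls]


urls = [
    "http://example.org/phones?id=1",
    "http://example.org/phones?id=2",
    "http://example.org/login",
]
for row in url_lev_matrix(urls):
    print(row)
```

With this normalisation, the first two URLs become identical once their parameters are removed and therefore receive similarity 1.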

DOM (Document Object Model) is a dynamic hierarchical structure representing the user interface elements of an HTML page. We assume that two web pages sharing similarities between their DOMs are likely to represent pages having analogous functionalities and that they should be grouped in the same cluster. The DOM can be treated either (1) as a tree-like structure, or (2) as a string. Given W, the set of web pages of the web application, two distance matrixes \(W\times W\) can be calculated, in the first case using the Robust Tree Edit Distance (RTED) algorithm [19], whereas in the second case using the Levenshtein distance between the string representations of the DOMs (after word/text removal, to preserve only the structure). In our study, we refer to these two matrixes as DOM-RTED and DOM-Lev, respectively.
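The word/text removal step used for DOM-Lev can be illustrated with a small sketch. The helper `dom_structure` is hypothetical, and dropping attributes in addition to text is an assumption made here to keep the example short; any string edit distance can then be applied to the resulting strings.

```python
from html.parser import HTMLParser


class StructureOnly(HTMLParser):
    """Re-serialises a document keeping tags but dropping text and attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.parts.append(f"</{tag}>")


def dom_structure(html):
    """Return the structural skeleton of an HTML document as a string."""
    parser = StructureOnly()
    parser.feed(html)
    return "".join(parser.parts)


phone_a = "<html><body><h1 class='title'>Phone A</h1><p>Fast</p></body></html>"
phone_b = "<html><body><h1 class='title'>Phone B</h1><p>Light</p></body></html>"
print(dom_structure(phone_a))
print(dom_structure(phone_a) == dom_structure(phone_b))
```

Two clone pages that differ only in their textual content yield the same skeleton, so their string distance is zero.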

Summary. To wrap up, the Clusterer extracts raw features (TF, WF1, WF2) from the web pages, or features representing distance measures (URL-Lev, DOM-RTED, DOM-Lev), to be given in input to the clustering algorithms. It is important to highlight that K-means++ needs to compute the mean feature vector (centroid) from the set of feature vectors in the same cluster. This is not possible in the case of URL-Lev, DOM-RTED, DOM-Lev, since feature vectors represent distance measures.
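The difference between the two kinds of input can be illustrated with a small sketch (the numeric values are made up). A centroid is an average of feature vectors, so it needs raw features; a medoid, by contrast, is an actual cluster member chosen using pairwise distances only, which is why algorithms based on it can also consume a distance matrix.

```python
def centroid(vectors):
    """Mean feature vector; requires raw feature vectors (e.g., TF rows)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def medoid(members, distance):
    """Member with the smallest total distance to all members of the cluster.

    Only pairwise distances are needed, so a W x W distance matrix suffices.
    """
    return min(members, key=lambda i: sum(distance[i][j] for j in members))


# Raw features: averaging the rows is meaningful.
tf_rows = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
print(centroid(tf_rows))

# Distance features: there is nothing to average, but a medoid can be selected.
dom_distance = [
    [0, 2, 9],
    [2, 0, 3],
    [9, 3, 0],
]
print(medoid([0, 1, 2], dom_distance))
```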

3.2 Potential Getter Methods Detection

In the Static Analyser of Apogen we have integrated a differencing engine, based on XMLUnit (footnote 6), that takes into account the results of clustering and supports the automatic creation of getter methods based on the DOM differences between web pages within the same cluster (e.g., clones). We believe that such intra-cluster differences point to dynamic portions of web pages, on top of which a tester might be interested in creating an assertion. For instance, in Fig. 1, getter methods are created for the phone details fields that vary across web pages in the same cluster. In order to minimise the number of false positives (i.e., irrelevant differences), the differencing engine ignores letter case, white spaces, attribute value order and white spaces between values, retaining only the differences in the textual node elements which were modified or added.
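The idea can be illustrated with a simplified sketch that does not use XMLUnit: it compares the text nodes of two structurally equivalent pages, ignoring case and white space, and reports the tag paths whose text was modified (added nodes are not handled). The class and function names, as well as the path notation, are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags without a closing tag


class TextNodeCollector(HTMLParser):
    """Collects (tag path, normalised text) pairs in document order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split()).lower()  # ignore white space and case
        if text:
            self.nodes.append(("/".join(self.stack), text))


def text_nodes(html):
    collector = TextNodeCollector()
    collector.feed(html)
    return collector.nodes


def candidate_getters(page_a, page_b):
    """Paths whose text differs between two structurally equivalent clones."""
    nodes_a, nodes_b = text_nodes(page_a), text_nodes(page_b)
    if [path for path, _ in nodes_a] != [path for path, _ in nodes_b]:
        return []  # not clones under this simplified structural check
    return [path for (path, ta), (_, tb) in zip(nodes_a, nodes_b) if ta != tb]


clone_a = "<html><body><h1>Phone A</h1><p>Fast  and light</p></body></html>"
clone_b = "<html><body><h1>Phone B</h1><p>fast and light</p></body></html>"
print(candidate_getters(clone_a, clone_b))
```

Only the heading is reported: the paragraph differs solely in case and spacing, which the comparison deliberately ignores.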

Fig. 4. Pictorial view of Apogen page object merging strategy, applied to PetClinic web pages

3.3 From Web Page Clusters to Page Objects

We use hard clustering, i.e., each web page is a member of exactly one cluster, because we want to map each cluster into a page object and each web page must be represented by a unique page object. Let us consider Fig. 4, showing a cluster of web pages \(C = \{state35, state39, state44\}\) from the PetClinic web application, one of the experimental objects considered in our study. State35 and state39 contain the same navigation web element (e.g., a link that can be clicked) and two different textual elements, while state44 contains the same navigation web element and an action (e.g., a text field that can be filled in).

Without considering the results of clustering, Apogen would generate three page objects \(PO_1, PO_2, PO_3\) for state35, state39 and state44. The same navigation method navigation1 is replicated three times, once in each page object; no getter methods are available for text1 and text2, and the third page object has one method, to perform action1. It would be quite difficult for the web tester to decide when to use \(PO_1, PO_2\) or \(PO_3\). Moreover, manual corrections and adjustments to the automatically generated page objects would have to be repeated three times.

By using clustering, instead, Apogen generates a single page object, corresponding to the entire cluster. The navigation method navigation1 appears only once in such page object. The action method action1 is also included. As for getter methods, only textual elements belonging to structural clones and differing across such clones are turned into getters. In our example, state35 is a clone of state39 (i.e., their DOMs are structurally equivalent) and text1 differs from text2. Hence, a getter method to retrieve the value of the dynamically changing textual element, namely getter1, is generated. The result is a merged page object \(PO_{1-2-3} = \{navigation1, action1, getter1\}\), exposing all functionalities of state35, state39 and state44 relevant for web test creation.
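The merging strategy can be sketched on the example above. The data model is hypothetical: each state is a dictionary with a structural signature (`dom`), its navigation and action names, and its textual elements; the key `getter1` stands for the element that the getter of the same name would read.

```python
def merge_cluster(states):
    """Merge the states of one cluster into a single page object description."""
    navigations, actions, getters = set(), set(), set()
    for state in states:
        navigations |= set(state["navigations"])
        actions |= set(state["actions"])

    # Getters come only from text that differs between structural clones.
    for i, first in enumerate(states):
        for second in states[i + 1:]:
            if first["dom"] != second["dom"]:
                continue  # not clones: their texts are not compared
            for element, text in first["texts"].items():
                if second["texts"].get(element, text) != text:
                    getters.add(element)
    return {"navigations": navigations, "actions": actions, "getters": getters}


state35 = {"dom": "A", "navigations": ["navigation1"], "actions": [],
           "texts": {"getter1": "text1"}}
state39 = {"dom": "A", "navigations": ["navigation1"], "actions": [],
           "texts": {"getter1": "text2"}}
state44 = {"dom": "B", "navigations": ["navigation1"], "actions": ["action1"],
           "texts": {}}

print(merge_cluster([state35, state39, state44]))
```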

4 Empirical Evaluation

We present the empirical study conducted to evaluate the effectiveness of clustering in grouping similar web pages conceptually associated with the same page object, and the quality of the page objects generated by Apogen. We follow the guidelines by Wohlin et al. [31] on designing and reporting empirical studies in software engineering. Our tool and demo videos are available at: http://sepl.dibris.unige.it/APOGEN.php.

Table 1. Experimental objects

4.1 Experimental Objects

We selected six real-world web applications covering different application domains, whose properties are shown in Table 1. PetClinic is a veterinary clinic information system which allows veterinarians to manage data about pets and their owners. It has been developed using the Java Spring Framework and makes use of technologies such as JavaBeans, an MVC presentation layer and Hibernate. AddressBook is a PHP/MySQL-based address and phone book, contact manager, and organiser. PPMA is a web-based password manager. Claroline is a collaborative learning environment which allows teachers or education institutions to administer courses online. The software provides group management, forums, document repositories and a calendar. Phonecat is a web-based phone catalog using the AngularJS framework. FluxBB is a fast and light PHP forum application.

4.2 Research Questions

We conducted our empirical study to address the following research questions:

RQ1 (effectiveness): What clustering algorithm provides the best result and how do different algorithms compare with each other?

RQ2 (reduction): What is the maximum reduction achievable in the number of generated Page Objects when using clustering with Apogen?

RQ3 (quality): How successful is the clustering-aided Apogen in generating high quality Page Objects, i.e., Page Objects similar to those a developer would write?

4.3 Metrics

A human expert has manually defined the Gold Standard for clusters and page objects, i.e., the ideal grouping of web pages into clusters (Clusters Gold Standard, C-GS) and the ideal page object classes associated with the clusters (Page Objects Gold Standard, PO-GS).

Both Gold Standards require the intervention of a human for their construction. To limit any bias or subjectivity, we asked an external third party (hereafter referred to as EXP) to define the Gold Standards. EXP is a programmer with substantial industrial experience in developing and testing web applications using page objects, and was not involved in the development of Apogen.

The metric we used to answer RQ1 is the Partition Edit Distance (PED), which in our case measures the minimum number of web pages that must be moved between clusters to make two web page partitions (i.e., the output of clustering and C-GS) the same. We chose PED because it provides a direct measure of the tester’s manual actions necessary to produce the target clustering (i.e., C-GS) starting from the output produced by any of the considered clustering algorithms. In fact, a high value of PED means that many web pages need to be reassigned, whereas a low value of PED means that the clusters are close to C-GS (with few moves required). In the following, we introduce the concepts behind PED and how to calculate it. Let us assume that we have a set of six web pages W = \(\{p_1, p_2, p_3, p_4, p_5, p_6\}\) and that we want k = 4 separate clusters (\(gs_0\), \(gs_1\), \(gs_2\), \(gs_3\)). Suppose we have the following C-GS:

$$ gs_0\rightarrow \{p_1\} \qquad gs_1\rightarrow \{p_3, p_4\} \qquad gs_2\rightarrow \{p_2\} \qquad gs_3\rightarrow \{p_5, p_6\} $$

whereas a hypothetical clustering algorithm C gives the following partitions:

$$ c_0\rightarrow \{p_1, p_2\} \qquad c_1\rightarrow \{p_3\} \qquad c_2\rightarrow \{p_4, p_6\} \qquad c_3\rightarrow \{p_5\} $$

We first compare each cluster \(c_i\) from C with each cluster \(gs_j\) from C-GS using the Jaccard similarity coefficient:

$$ J(c_i, gs_j) = \frac{\left| c_i \cap gs_j \right| }{\left| c_i \cup gs_j \right| } $$

where 0 indicates no element in common; 1 total agreement. For instance, the Jaccard similarity between \(c_0\) and \(gs_0\) is \(J(c_0, gs_0) = \left| \{p_1\} \right| / \left| \{p_1,p_2\} \right| = 0.5\).

The Jaccard similarity matrix for all possible pairs \(\langle c_i, gs_j\rangle \) is shown in Table 2. Given the similarity matrix between two partitions, PED can be obtained by solving the following linear assignment problem:

Given two partitions C and C-GS, find the partial bijection between the elements of C and C-GS (i.e., partial, unique assignment of elements from C to elements of C-GS) that maximises the total similarity between paired elements.

In our example, a linear assignment algorithm (we used the Hungarian Method [8]) would produce the following best pairs BP (highlighted in bold in Table 2):

$$ BP = \{\langle c_0, gs_0\rangle , \langle c_1, gs_1\rangle , \langle c_2, gs_2\rangle , \langle c_3, gs_3\rangle \} $$

Table 2. Jaccard similarity matrix. The best pairs that a linear assignment algorithm would produce are highlighted in bold

Given BP, the asymmetric set difference cardinality between each pair gives us the number of pages that must be moved to unify the two partitions. Formally, PED is computed as follows:

$$ PED(C, \textit{C-GS}) = \sum \limits _{\langle c_i, gs_j\rangle \in BP} \left| {c_i} \setminus {gs_j}\right| + |\text {unassigned}(C, BP)| $$

If there are unassigned clusters in C, due to the size of C being different from that of C-GS, the total number of pages contained in such unassigned clusters is also added in the formula given above. Although the asymmetric set difference operator (\(\setminus \)) has been used in the formula to compute PED, it can be easily shown that PED is symmetric: \(PED(C, \textit{C-GS}) = PED(\textit{C-GS}, C)\). In our example: \(PED(C, \textit{C-GS})=\left| {c_0} \setminus {gs_0}\right| + \left| {c_1}\setminus {gs_1}\right| + \left| {c_2} \setminus {gs_2}\right| +\left| {c_3} \setminus {gs_3}\right| = 1 + 0 + 2 + 0 = 3.\) Thus, a tester would need to move three web pages between the clusters of C to obtain C-GS: \(p_2\) from \(c_0\) to \(c_2\), \(p_4\) from \(c_2\) to \(c_1\) and \(p_6\) from \(c_2\) to \(c_3\). This gives a rough estimate of the effort required for the manual correction of the clustering output.
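The computation can be reproduced on the running example with the following sketch. For brevity it finds the best pairing by brute force over all permutations rather than with the Hungarian Method, which is only practical for small k, and it assumes that the two partitions have the same number of clusters (so the unassigned term is zero). The function names are hypothetical.

```python
from itertools import permutations


def jaccard(a, b):
    """Jaccard similarity between two non-empty sets."""
    return len(a & b) / len(a | b)


def partition_edit_distance(clusters, gold):
    """PED between two partitions with the same number of clusters."""
    k = len(clusters)
    if k != len(gold):
        raise ValueError("this sketch assumes partitions with the same number of clusters")
    # Brute-force stand-in for the Hungarian Method: maximise total similarity.
    best = max(
        permutations(range(k)),
        key=lambda perm: sum(jaccard(clusters[i], gold[perm[i]]) for i in range(k)),
    )
    # Pages of each cluster that do not belong to its paired gold cluster.
    return sum(len(clusters[i] - gold[best[i]]) for i in range(k))


gold = [{"p1"}, {"p3", "p4"}, {"p2"}, {"p5", "p6"}]
clusters = [{"p1", "p2"}, {"p3"}, {"p4", "p6"}, {"p5"}]
print(partition_edit_distance(clusters, gold))  # prints 3
```

Swapping the two arguments also yields 3, in line with the symmetry of PED.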

To answer RQ2, we counted the number of generated page objects first disabling and then enabling the clustering in Apogen.

To answer RQ3, for each page object of PO-GS, we manually inspected all methods: (i) classifying the kind of functionality as navigational, action or getter; (ii) determining whether the method has a semantically equivalent counterpart in the automatic page objects (we tag such methods as Equivalent); (iii) determining whether the method has a counterpart in the automatic page objects that needs minor modifications (we tag such methods as To Modify); (iv) determining any missing methods (we tag such methods as Missing). Further, we are interested in determining if Apogen leads to the generation of extra methods, e.g., methods not contained in PO-GS. The number of Equivalent, To Modify, Missing and Extra methods allows us to estimate the possibility to use the code produced by Apogen as-is, and the effort needed to manually correct the methods to be modified, or to be added/deleted.

4.4 Experimental Procedure

To answer RQ1, we proceeded as follows:

  (i) We ran the Crawler over each web application to infer its model. We fed the Crawler with the data necessary to explore each application, such as login credentials. EXP manually inspected the crawling outcomes and created a C-GS for each web application.

  (ii) Clustering algorithms need the specification of the number of clusters k as input. Such a value can be either provided manually or can be obtained by automated methods, such as the Silhouette method [22]. We compared the optimal number of clusters \(k_{opt}\), available from the C-GS, with the number produced by the Silhouette method, and the two are very close to each other in all experimental objects (with median difference 3, maximum difference 5 and minimum 0). Hence, we ran Apogen on each web application with each (algorithm, feature) pair searching for exactly \(k_{opt}\) clusters. We compared the clusters obtained from Apogen with C-GS.

  (iii) We calculated PED for all (algorithm, feature) pairs, in order to assess: (1) what is the best (algorithm, feature) pair, and (2) how far the best algorithm is from the C-GS.
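As an illustration of how the number of clusters can be estimated automatically in step (ii), the following sketch computes the mean silhouette width of a clustering from a pairwise distance matrix and picks the k with the highest score. The candidate labelings are hypothetical inputs (in practice they would come from running a clustering algorithm once per candidate k), at least two clusters are assumed, and the toy distances are made up.

```python
def silhouette_score(distance, labels):
    """Mean silhouette width of a clustering, given pairwise distances."""
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, []).append(index)

    def mean_distance(i, members):
        return sum(distance[i][j] for j in members) / len(members)

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_distance(i, own)  # cohesion: mean distance within own cluster
        b = min(mean_distance(i, members)  # separation: nearest other cluster
                for other, members in clusters.items() if other != label)
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)


def best_k(distance, candidate_labelings):
    """Pick the k whose labeling has the highest silhouette score."""
    return max(candidate_labelings,
               key=lambda k: silhouette_score(distance, candidate_labelings[k]))


# Four toy pages placed on a line at positions 0, 1, 10 and 11.
positions = [0, 1, 10, 11]
distance = [[abs(p - q) for q in positions] for p in positions]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
print(best_k(distance, candidates))
```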

To answer RQ2, we ran Apogen on each web application twice, both enabling and disabling the Clusterer, and we counted the number of generated page objects.

To answer RQ3, we proceeded as follows. For each web application:

  (i) EXP manually created PO-GS from the optimal clusters in C-GS;

  (ii) we inspected and compared PO-GS with the page objects automatically generated by Apogen. In detail, for each page object, we manually classified all methods as navigational, action or getter, and as Equivalent, To Modify, Missing, or Extra.

Table 3. Comparison between automatic clusters and gold standard (PED)

4.5 Results

Table 3 reports the values of PED for the admissible combinations of algorithm (Hierarchical, K-means++, K-medoids) and feature (TF, WF1, WF2, DOM-RTED, DOM-Lev, URL-Lev).

Globally, the best algorithm is Hierarchical, which occupies the first, second and fourth positions of the rank. It scores 18 (DOM-RTED), 24 (URL-Lev), and 28 (DOM-Lev). K-means++ has variable performance: it is ranked third with a value of 26 (TF) but also fifth with a value of 32 (WF2) and sixth with a value of 33 (WF1). K-medoids settles in the worst positions of the rank, independently of the input data matrix. Its values range between 42 and 49.

RQ1 (effectiveness): considering all the applications, Hierarchical clustering proved to be the best choice, being in the first, second and fourth position of the PED rank and showing little oscillation in its performance across different web applications. In our experiment, the effort to align its clusters with C-GS consists on average of two to four page moves per application. K-means++ also proved to be a good choice when used with the data matrix representing the tag frequencies. Indeed, its performance is aligned with that of the Hierarchical algorithm on WA5 (PetClinic). To summarise:

figure a

RQ2 (reduction). Table 4 shows data about the reduction in the number of generated page objects when using clustering in Apogen. The first column shows the experimental objects, whereas in the second column are the number of page objects generated by Apogen without considering clustering, which is equal to the number of dynamic states retrieved by the Crawler. The third column displays the number of clusters defined in the C-GS, which is equal to the number of page objects produced by Apogen, since \(k_{opt}\) was provided as input to the clustering algorithm (the value of k obtained from Silhouette would be only slightly different).

Table 4. Reduction of generated page objects when using clustering

To summarise, in our experiment:

figure b

Beyond the mere quantitative data, the substantial reduction achieved by clustering gives an idea of the reduction in page object maintenance that is expected to occur during software and testware evolution. Empirical studies that assess human costs associated with test maintenance are required, however, to substantiate our belief.

RQ3 (quality): Table 5 shows the number of methods (navigational, action or getter) that we tagged as Equivalent (Eq), To Modify (TM), Missing (Mis) and Extra (Extra) w.r.t. PO-GS. The first column shows the experimental objects. The second, third and fourth macro-columns show the cardinality of navigational, action and getter methods generated by Apogen (macro-columns are split into Eq, TM, Mis and Extra). The fifth macro-column shows the number of methods contained in PO-GS (i.e., the key functionalities a web tester would put as methods in the page objects). The sixth macro-column reports the sum over all kinds of methods.

Based on the data, we can notice that on average about 75 % of the methods are equivalent, 7 % are to modify, and 18 % are missing. Looking at the results by type, most navigational methods are directly usable as produced by Apogen (about 80 %), none is to be modified and 16 % are missing. As for the actions, roughly 51 % are equivalent, 28 % need to be manually modified and 21 % are missing. For the getter methods, which are generated on top of intra-cluster differences (see Sect. 3), 84 % are equivalent, none is to be modified and 16 % are missing. Concerning the methods tagged as extra, i.e., methods that are not explicitly present in the PO-GS, we have a total of 53 methods, all falling in the getter category.

Table 5. Comparison between automatic and manual page objects

4.6 Qualitative Analysis

For space reasons, we focus the qualitative analysis on the main page of the FluxBB web application (Fig. 5 top). This example is representative because the page object automatically generated by Apogen for this page (Fig. 5 bottom) includes all the cases (Equivalent, ToModify, Missing, Extra) described in Sect. 4.3.

The navigational method goToUserlist() is an example of Equivalent method (a). In fact, it replicates exactly what the tester would do while performing a navigation from the current page toward the user list page: click on the menu item and change the state by instantiating a new proper page object (Userlist) and by passing it the WebDriver instance.

The action method qjump(), instead, is an example of a ToModify method (b). First, the name retrieved from the form attributes is not very expressive (the label “Jump To” would have been a better choice for the name, in this case). Second, the return value with the target page object is missing, because static analysis misses the next dynamic state. The returned page object should be a TestForum page object, whereas if an incorrect parameter is passed as args0, the page object should manage the error.

The second getter method is an example of an Extra method (c), because it refers to a web element within the page representing the same information targeted by the first getter (see the two span “Pages: 1” fields in Fig. 5). In this case, the tester may keep only one of the two getters (e.g., the first one) and delete the second. Alternatively, the tester may want to check for inconsistencies between the two values, so having two separate methods might be regarded as useful. We decided to leave this choice to the tester, since we believe that the generation of extra getter methods does not significantly harm the readability of the page objects. Conversely, the pages in the cluster exposed no differences in fields whose content may nonetheless vary, for instance the Replies field (d). Thus, we marked this field as a Missing getter.

Fig. 5. The main page of the FluxBB web application (top), and a portion of the page object generated by Apogen (bottom)
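
To make the three kinds of methods concrete, the following is a minimal Java sketch of what a page object for the main page might look like after the tester's refinement of qjump(). It is not the code generated by Apogen: the Browser interface is a hypothetical stand-in for Selenium's WebDriver (so that the sketch runs without a browser), and all locators, the jumpTo name, and the UserlistPage and TestForumPage classes are assumptions introduced for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a browser driver such as Selenium's WebDriver. */
interface Browser {
    void click(String locator);
    void select(String locator, String value);
    String text(String locator);
}

/** Page object for the main page; every locator below is invented for illustration. */
class IndexPage {
    private final Browser browser;

    IndexPage(Browser browser) {
        this.browser = browser;
    }

    /** Navigational method: click the menu item and return the target page object. */
    UserlistPage goToUserlist() {
        browser.click("id=navuserlist");
        return new UserlistPage(browser);
    }

    /** Action method after manual refinement: expressive name, explicit return type. */
    TestForumPage jumpTo(String forum) {
        browser.select("name=id", forum);
        browser.click("css=#qjump input[type=submit]");
        return new TestForumPage(browser);
    }

    /** Getter method: exposes displayed text so that a test can assert on it. */
    String getPages() {
        return browser.text("css=span.pages-label");
    }
}

/** Target page objects, left empty because only their type matters here. */
class UserlistPage {
    final Browser browser;
    UserlistPage(Browser browser) { this.browser = browser; }
}

class TestForumPage {
    final Browser browser;
    TestForumPage(Browser browser) { this.browser = browser; }
}

/** In-memory fake that records interactions, so the sketch runs without a browser. */
class FakeBrowser implements Browser {
    final List<String> log = new ArrayList<>();
    final Map<String, String> texts = new HashMap<>();

    public void click(String locator) { log.add("click " + locator); }
    public void select(String locator, String value) { log.add("select " + locator + " = " + value); }
    public String text(String locator) { return texts.getOrDefault(locator, ""); }
}
```

A test script would then chain calls such as new IndexPage(browser).goToUserlist() and assert on getPages(), without ever mentioning a locator. Handling an incorrect forum argument in jumpTo is deliberately left out of the sketch.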

4.7 Discussion

Hierarchical agglomerative clustering offered stable performance across all web applications, possibly because of the single linkage (min) method, which at each step merges the pair of clusters whose closest members are most similar, hence leading to aggregation choices that we think are close to those made by a human when defining the Gold Standard. Content-based features (WF1 and WF2) seem to capture a significant amount of information related to the semantic content and sometimes improve the effectiveness of clustering, though they are not the best choice according to our study. The features calculated over the DOM (RTED and Lev) work best with Hierarchical clustering, while they perform quite poorly with K-medoids. Thus, we can conclude that structural features perform best, in particular DOM-RTED, TF and DOM-Lev, especially when coupled with Hierarchical clustering.
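
The single linkage strategy mentioned above can be illustrated with a minimal Java sketch. It is a naive textbook formulation written for this discussion, not the implementation used in Apogen: the class name SingleLinkage, the precomputed distance matrix (e.g., a normalised DOM distance between pages) and the stopping criterion on a given k are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SingleLinkage {

    /** Single linkage: the distance between two clusters is that of their closest members. */
    static double linkage(List<Integer> a, List<Integer> b, double[][] dist) {
        double best = Double.POSITIVE_INFINITY;
        for (int i : a) {
            for (int j : b) {
                best = Math.min(best, dist[i][j]);
            }
        }
        return best;
    }

    /** Merges the two closest clusters repeatedly until only k clusters remain. */
    static List<List<Integer>> cluster(double[][] dist, int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        // Start with one singleton cluster per page (pages are identified by index).
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < dist.length; i++) {
            List<Integer> singleton = new ArrayList<>();
            singleton.add(i);
            clusters.add(singleton);
        }
        while (clusters.size() > k) {
            int bestA = 0;
            int bestB = 1;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = linkage(clusters.get(a), clusters.get(b), dist);
                    if (d < bestDist) {
                        bestDist = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            // bestB > bestA, so removing bestB does not shift the index of bestA.
            clusters.get(bestA).addAll(clusters.remove(bestB));
        }
        return clusters;
    }
}
```

For instance, with four pages where pages 0 and 1 are at distance 0.1, pages 2 and 3 at distance 0.2, and every other pair at distance 0.9, calling cluster(dist, 2) groups the pages as {0, 1} and {2, 3}, i.e., two page objects instead of four.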

Concerning the results for RQ3, we can notice that there were no methods to be modified in the navigational and getter categories. This is mainly due to the code transformation phase, in which the mapping is 1-1 for these kinds of methods (see Sect. 2.1). On the other hand, 28 % of action methods needed a manual refinement, usually to add some complex interaction pattern (e.g., a mandatory click on a checkbox before triggering a form submission). These patterns cannot yet be captured by the current version of Apogen, which is not able to automatically add the missing statements. Although this is an interesting challenge for future work, it represents a minor issue, since the majority of the actions are correctly generated and ready for use by developers.

For similar reasons (static analysis of the DOM and 1-1 model to code transformation), we have a complete absence of Extra action and navigational methods. Concerning the getters, instead, there are 53 extra getter methods. This result is not surprising, since identifying the getters that are potentially relevant for the construction of assertions is a challenging problem. We implemented a heuristic, which suffers from false positives. On the other hand, the use of clustering and intra-cluster differencing captured most of the dynamic sections of the web pages, producing a high proportion of the getter methods in the gold standard (84 %). As noted before, the generation of additional getter methods is not expected to significantly hinder the activity of the tester. It should also be noted that Phonecat and PetClinic have no extra getters and that there are on average 9 extra getters per web application over all the page objects, an amount which we think is acceptable for web testers.
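
The idea behind intra-cluster differencing can be sketched as follows. This is a simplified illustration under stated assumptions, not Apogen's heuristic: each page of a cluster is assumed to be already reduced to a map from element locator to displayed text, and a locator becomes a getter candidate when it occurs in every page of the cluster with at least two different texts. The class name GetterCandidates and the map-based page representation are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class GetterCandidates {

    /**
     * Returns the locators whose text varies across the pages of one cluster.
     * Each page is a map from locator to the text displayed at that locator.
     */
    static Set<String> find(List<Map<String, String>> clusterPages) {
        Set<String> candidates = new TreeSet<>();
        // A single-page cluster exposes no differences, hence no candidates.
        if (clusterPages.size() < 2) {
            return candidates;
        }
        for (String locator : clusterPages.get(0).keySet()) {
            Set<String> observedTexts = new HashSet<>();
            boolean presentEverywhere = true;
            for (Map<String, String> page : clusterPages) {
                if (!page.containsKey(locator)) {
                    presentEverywhere = false;
                    break;
                }
                observedTexts.add(page.get(locator));
            }
            // Static text (one distinct value) is not worth a getter.
            if (presentEverywhere && observedTexts.size() > 1) {
                candidates.add(locator);
            }
        }
        return candidates;
    }
}
```

The sketch also shows where the two kinds of error discussed above come from. A field that happens to show the same text in all crawled pages of a cluster is never reported (a Missing getter), while two elements that display the same varying information are both reported (an Extra getter).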

4.8 Threats to Validity

Concerning the external validity and the generalisation of results, we selected real-size web applications spanning different domains, which makes the context realistic, even though studies with other applications are necessary to corroborate our findings.

Concerning the internal validity, a possible issue is represented by the manually created gold standards, both for the clusters and the page objects. It should be noted that we must necessarily rely on a manual gold standard for evaluating the output of Apogen, because no automated method can provide us with the ideal clusters and page objects. We minimised this threat by having the gold standards produced by a third subject independent from us and from Apogen.

Concerning the construct validity, for the evaluation metrics used to answer RQ3 we did not adopt Precision-Recall measures, because they rely on a boolean classification of the output (either correct or incorrect), while page object methods labelled as To Modify or Extra cannot appropriately be deemed either incorrect or correct. We preferred to present the data as they are, split into four categories (Equivalent, To Modify, Missing, Extra), and to discuss them in terms of usability, benefits and expected manual actions required for the refinement of the automatically generated page objects.

5 Related Work

The automated creation of page objects for E2E web testing is a completely new research field, so, to the best of our knowledge, there are no strictly related previous works. However, there are related works that deal with applications of clustering techniques to support web testing and engineering [3–5, 16, 20, 21, 28].

State Objects. Van Deursen [29] describes a state-based generalisation of page objects. From a testing viewpoint, moving a page object to the state level makes the design of test scenarios easier. Besides the mere terminological difference, the work by van Deursen describes a series of guidelines and good practices (e.g., let each state correspond to a state object) that we share and tried to incorporate in the development of Apogen, since our ultimate goal is the automatic generation of meaningful page/state objects.

Clustering. Crescenzi et al. [5] present an algorithm to cluster web pages, exploiting the structural similarities of the DOMs. In this paper, we studied several structural similarity measures beyond the DOM, with the aim of supporting the clustering of web pages from a testing perspective. Tonella et al. [28] provide two methods for web clustering evaluation, the gold standard and a task-oriented approach, together with guidelines and examples for their implementation. In our paper, we compared the results of web page clustering against a gold standard, in order to ensure its meaningfulness from the web testing viewpoint. In another work, Ricca et al. [20] utilise keyword-based clustering to improve the comprehension of web applications. In our paper, we did not limit ourselves to content-based metrics. Indeed, structural properties (e.g., DOM or TF) proved to be more effective.

Crawling/Differencing. Choudhary et al. [4] present a dynamic technique based on differential testing to automatically detect cross-browser issues (XBI) and assist developers in their diagnosis. The approach operates on single web pages and focuses on visual analysis, whereas we perform intra-cluster DOM-differencing. Mesbah et al. [16] analyse an entire web application, using dynamic crawling, also for the detection of XBIs. Similarly, we adopt crawling and web page differencing, but our approach is constrained to finding textual differences between intra-cluster web pages, on top of which a tester can build meaningful assertions. Choudhary et al. [3] combined and extended the two above-mentioned approaches for XBI detection in the tool CrossCheck. Even though that work shares some methods with ours, such as the reverse engineering of a web application model with a crawler and DOM differencing between web pages, we use clustering, which is an unsupervised machine learning technique, instead of a classifier, and we target a completely different goal, automated page object construction.

6 Conclusions and Future Work

We presented a novel approach, based on web page clustering, to automatically generate page objects for web testing. The tool Apogen, which implements the approach, has been applied to six existing web applications. Experimental results indicate that our clustering approach is effective at grouping semantically related web pages. Furthermore, the page objects obtained from the output of clustering are very similar to the page objects that a developer would create manually. Indeed, 75 % of the code generated by Apogen can be used as-is by a tester, reducing the manual effort for page object creation. Moreover, a large part (84 %) of the page object methods created to support assertion definition corresponds to meaningful and useful behavioural abstractions.

As future work, we plan to improve the heuristics used to create the getter methods, which cannot be applied to single-page clusters. We will investigate a complementary approach, based on input data generation, capable of exposing the variable part of multiple as well as single pages in each cluster. We will also study visual mechanisms, based on image processing, to retrieve dynamic page portions [3] and produce visual page objects [11]. Finally, we plan to improve the maintainability of the page objects by enhancing Apogen with robust web element localisation techniques [12–14].