
1 Introduction

When we are asked to fulfill a particular task on a Web page, we can make use of all of our human experience. If we are asked to shop for a product on a Web site, for instance, we can easily fulfill the task if we have shopped on the Web before: We add “products” to a “cart” and finally “checkout” and “proceed to payment”. (Actually, if we have ever shopped outside the Web, recognizing these concepts will help, too.) However, if we want a robot to go shopping on a complex Web application, devoid of such experience, the robot may easily get lost in the multitude of possible actions. Exhaustively exploring (“crawling”) a Web shop with millions of products (and links) can take days to weeks; and while a robot may well cover a multitude of products along the way, the chance of hitting the one path that actually completes the purchase is vanishingly small. If we want to explore relevant functionality (for instance, to generate nontrivial tests), we need to be smarter than that.

In this paper, we speed up crawling of Web sites dramatically by reusing existing test interactions from other, semantically related Web sites. Assume we want to test the eBay Web shop and we are given tests for the Amazon Web shop. In the Amazon test (and on the Amazon page), we use natural language processing to identify the central features a crawler should later on search for—namely, “products”, “add to cart”, “checkout”, and “payment”. This path to successful completion of the task is what the crawler can use as step-by-step guidance when exploring the eBay page: Rather than searching across all possible links for a way to enter payment information, it would first search for actions semantically related to “products”, then for “add to cart” actions, and thus follow breadcrumb by breadcrumb until it finally reaches the eBay “payment” page. We can thus effectively translate the interaction path of Amazon into a semantically similar path on eBay.

Such an automatic translation has a number of uses. Any kind of random testing of Web applications immediately profits if we already have a set of known interactions from another Web application in the same domain: Rather than having to spend days to follow all possible links, a crawler can follow given “standard” paths in minutes. Reusing existing paths also makes it possible to continue exploration from a deep state—say, apply random test generation on a payment page—where the chances of finding bugs may be much higher as more context is available. Finally, functionality and robustness tests written for some Web page can be translated again and again for other applications—all a developer would have to do is to add a new check for whether the test was successful.

After introducing technical background and related work (Sect. 2), the remainder of this paper is organized along its contributions:

  1.

    We introduce novel techniques for identifying and mapping natural language features across two Web applications (Sect. 3). The resulting mapping is the base for guiding crawling in the target application and thus rediscovering the paths from the source application. To the best of our knowledge, ours is the first technique ever to create and leverage such mappings across different applications, and thus to transfer tests from one to another.

  2.

    We evaluate our technique on twelve industry-sized Web applications from three domains (Sect. 4). We show that our technique can map up to 90% of features from the source application accurately to features in the target application. The precision enables practical and accurate directed test generation even on complex interactive applications—in sharp contrast to random exploration, which hardly ever reaches deep functionality.

After we discuss threats to validity in Sect. 5, we close with conclusions and future work (Sect. 6).

2 Background

Automatically testing the functionality of applications, e.g. by crawling, is the target of many research projects. While humans use semantic concepts to interact with applications, modern automated random crawlers [3, 13, 16, 18, 19] are unable to leverage this information. Generating valid input, i.e. user actions, for complex applications and differentiating the resulting output states is crucial to any crawling technique or automated testing solution. Random crawlers use randomly generated (or manually configured) inputs to test an application exhaustively. When applied to real industrial applications with huge state spaces, they often face the state explosion problem. Lin et al. [8, 9] recently presented a semantic crawling technique that is trained on a large set of manually labeled states (DOM-trees), which allows them to re-identify complex form fields and even similar states within an application. Cross-browser testing [3] and cross-platform testing approaches [17] use existing application models (also in the form of test cases) to gain knowledge of the application under test (AUT) on one platform under the assumption that the same observations apply on another test platform. So far, these approaches are limited to executions on the same application.

By transferring domain-specific knowledge across applications, a crawler can identify testable features and their desired output. Current solutions cannot transfer knowledge across applications without specifying additional executions for each AUT. Lin et al.’s [9] semantic crawling technique could be adapted for that purpose, but their model has to be trained for each new domain with respective sets of labeled states—thus requiring human expertise to create this mapping. Our approach makes it possible to generate such a mapping automatically, without manual effort.

Thummalapenta et al. [20, 21] mine “how-to” instructions to extract features and translate them into instructions usable by a crawler. In their work, instructions are interpreted syntactically and are therefore limited to the specific application and version they were mined for. However, the general idea could complement our technique by mining semantic concepts instead, which allow for a test transfer across applications.

Modern applications use graphical user interfaces to abstract from complex technical events. The natural language content of UIs helps users achieve their goal by exposing the functionality, i.e. the features of the application, e.g. by presenting a descriptive label next to input fields or by filling fields with expressive default data. Humans can easily grasp the underlying semantic meaning of such descriptions. Even with only a vague idea of a semantic concept, a human can transfer domain-specific knowledge from one application to another by matching semantically similar concepts (such as selecting a proper payment method and providing valid input).

Semantic similarity itself has been intensively researched in the fields of human-machine interaction (HMI) and document classification. It has been shown that one can find semantic similarities between words or strings [5] by training a word vector model (word2vec for short) on a large set of documents, i.e. large text corpora. The key idea of these models is to express words as vectors in a vector space. Based on the training data, words expressing similar concepts are mapped to nearby points. These word vectors capture meaningful semantic regularities, i.e. within the given documents one can observe constant vector offsets between pairs of words sharing a particular relationship. The cosine similarity (\(\cos (\theta )\)) of two given word vectors a and b lies in the interval \(\left[ -1,1\right] \), where values converging to \(-1\) express high dissimilarity and values close to 1 express semantic equivalence.
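
As an illustration of this measure, the following minimal sketch (plain NumPy; the toy vectors are ours, real word2vec vectors would have 300 dimensions) computes the cosine similarity of two word vectors:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity of two word vectors; result lies in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors standing in for real word embeddings.
v_checkout = np.array([0.8, 0.1, 0.3])
v_payment  = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(v_checkout, v_payment))  # close to 1: similar concepts
```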

Fig. 1. Generating tests from tests of other apps: After identifying features (as UI elements and textual labels) from an existing application, we match these features semantically while exploring a new application.

To explore functionality within Web pages, we leverage the Selenium [4] framework. Selenium is an automation engine based on JavaScript. It serves as an interface between crawling and testing engines and a large set of test browsers, translating commands into simulated user actions executed on the browser under test. Selenium is controlled by HTTP requests and can execute arbitrary user actions (e.g. Click, Insert Text, Hover) or plain JavaScript code. Selenium encapsulates these commands into JavaScript (executed in the browser under test) to directly manipulate the DOM elements (UI elements) or retrieve information. DOM elements can be identified by their element properties, such as the XPath or CSS properties, attributes, or textual content. Crawlers can use JavaScript or Selenium to mimic real user actions and can also capture DOM modifications to model state changes internally as a finite state machine (FSM). The nodes of the FSM contain the individual DOM-trees as presented in the client browser, while edges reflect the user actions.
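
For illustration, a minimal sketch using the Selenium Python bindings (the URL and XPath are hypothetical) of how a crawler issues user actions and retrieves DOM information:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                      # browser under test
driver.get("https://shop.example.com")           # hypothetical Web shop

# Identify a DOM element by its properties (here: an XPath) and act on it.
button = driver.find_element(By.XPATH, "//button[contains(., 'Add to cart')]")
button.click()                                   # simulated user action

# Retrieve information for state abstraction (nodes of the FSM).
dom_snapshot = driver.page_source                # DOM as rendered in the browser
title = driver.execute_script("return document.title")  # arbitrary JavaScript

driver.quit()
```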

3 Methodology

The core of our presented technique consists of two steps as depicted in Fig. 1. In the feature identification phase (Sect. 3.1), we identify essential features given an existing Selenium test suite. A feature is thereby defined as any interactive UI element, e.g. input fields, buttons, and other elements with attached event handlers. To identify features, we analyze which UI elements a given test suite executes commands on and group them together with their describing labels using a visual clustering algorithm [1]. After extracting the textual content of these elements, we can match these features across applications (Sect. 3.2) using semantic text similarity as a metric. The matched features then guide crawling.

3.1 Identifying Features

In order to identify features, we learn an application model by executing a Selenium test suite of application A. From this execution, we abstract the DOM content presented in the client browser into application states. The test execution is modeled as a finite state automaton. The nodes of the graph represent the state of the application shown in the browser interface. The edges represent the actions that are executed to change from one state to another.

To interact with elements on the page, Selenium tests contain search statements for elements displayed in the DOM. Commands can then be executed on the identified Web element (e.g. click, sendKeys, hover). By recording the executed test commands, we learn which elements have been interacted with, i.e. the edges of the graph. We derive the set of tested features as the list of UI elements on which Selenium actions have been executed. Taking a payment use case as an example: the UI elements used to register a new debit card plus the buttons Done or Cancel are triggered and constitute the features for this test case.
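
The following sketch illustrates this bookkeeping (all names are ours, not part of any framework): it executes a recorded test command, stores the interacted element as a feature, and abstracts the resulting DOM as a new state.

```python
from dataclasses import dataclass, field

@dataclass
class Feature:
    xpath: str            # how the test located the element
    command: str          # e.g. "click" or "sendKeys"
    label_text: str = ""  # describing labels, filled in later (see below)

@dataclass
class AppModel:
    states: list = field(default_factory=list)    # DOM snapshots (FSM nodes)
    features: list = field(default_factory=list)  # interacted elements (FSM edges)

def record_action(model, driver, xpath, command, value=None):
    """Execute one test command and record the interacted element as a feature."""
    element = driver.find_element("xpath", xpath)
    if command == "click":
        element.click()
    elif command == "sendKeys":
        element.send_keys(value)
    model.features.append(Feature(xpath=xpath, command=command))
    model.states.append(driver.page_source)       # abstract the resulting state
```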

Depending on the nature of the extracted UI elements, the extracted list may not contain any analyzable natural language text content. Although certain DOM elements (e.g. of type INPUT) store the displayed text in their attributes (e.g. the value or placeholder field), even more information is stored in the surrounding elements, i.e. the element context. We extract the text content of the feature (interacted elements) plus the surrounding labels by grouping elements according to their visual alignment [1]. This information is used in the next phase to identify matches in the target application.
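
One way to approximate this grouping with Selenium (a strong simplification of the visual clustering in [1]; the distance threshold is our own choice) is to attach the label whose rendered bounding box lies closest to the interacted element:

```python
def element_center(el):
    """Center point of an element's rendered bounding box."""
    loc, size = el.location, el.size
    return (loc["x"] + size["width"] / 2, loc["y"] + size["height"] / 2)

def describing_text(driver, feature_el, max_distance=150):
    """Collect the feature's own text plus the closest visually aligned label."""
    texts = [feature_el.text,
             feature_el.get_attribute("value") or "",
             feature_el.get_attribute("placeholder") or ""]
    fx, fy = element_center(feature_el)
    best_label, best_dist = None, max_distance
    for label in driver.find_elements("tag name", "label"):
        lx, ly = element_center(label)
        dist = ((fx - lx) ** 2 + (fy - ly) ** 2) ** 0.5
        if label.text and dist < best_dist:
            best_label, best_dist = label.text, dist
    if best_label:
        texts.append(best_label)
    return " ".join(t for t in texts if t)
```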

3.2 Mapping Features in the Target Application

In order to identify the features learned in the first phase in a target application B, we analyze the DOM of each state in the target application, e.g. while crawling B. Using semantic text similarity as a metric, we check the target for potential labels describing the same semantic concepts. We identify the semantically closest entity to each label by computing the word-wise cosine similarity and use these matches to guide exploration.

Algorithm 1. Pre-processing of the extracted labels.

Instead of training a machine learning algorithm on a large set of pre-labeled documents, we use the googleNewsVector model [14] off-the-shelf, henceforth referred to as the word2vec model. Models trained on a large set of unspecific documents are superior to models trained on very specific, but small data sets [14]. Additionally, the non-domain-specific model has the advantage of being easily applicable to any arbitrary application domain, while specialized models need to be retrained for each new domain.
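
For instance, using the gensim library, the pre-trained vectors can be queried off-the-shelf (the file name is a placeholder for the downloaded model; the example words are ours):

```python
from gensim.models import KeyedVectors

# Load the pre-trained Google News vectors (300 dimensions), no retraining needed.
model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

print(model.similarity("checkout", "payment"))  # related shopping concepts
print(model.similarity("checkout", "banana"))   # noticeably lower similarity
```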

Our methodology follows the assumptions that (i) the inflectional form introduced by the grammar of a language and (ii) even the order of words are of minor importance for a human to understand the meaning of a given label (bag-of-words assumption). Before the labels are matched, they have to be pre-processed in order to sanitize the given input. Algorithm 1 presents all pre-processing steps. In the first step, all illegal characters are replaced by a whitespace character to get rid of invalid or incomplete HTML tags (e.g. ‘<’, ‘/’) or special symbols (e.g. special Unicode characters). The strings of each label are tokenized into single words. Due to grammatical language rules, the processed labels may use different forms of a word, e.g. proceed, proceeds, or proceeding. As mentioned before, our technique ignores the order of words. It is thus reasonable to also reduce each given word (and its derivationally related forms) to a common base by morphological stemming [15] and lemmatization [6]. As a consequence, words like am, are, or is are reduced to their base form be. In the last pre-processing step, we filter out the most common words of the language (stopword removal [11]) and remove unknown or invalid words that are not in the corpus.
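
A possible implementation of these pre-processing steps (a sketch based on NLTK, not the original implementation; `vocabulary` stands for the word2vec model's vocabulary):

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Requires: nltk.download("stopwords"), nltk.download("wordnet")
STOPWORDS = set(stopwords.words("english"))
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()

def preprocess(label, vocabulary):
    """Sanitize a UI label into a bag of normalized, in-vocabulary words."""
    # 1. Replace illegal characters (HTML remnants, special symbols) by whitespace.
    cleaned = re.sub(r"[^A-Za-z]+", " ", label)
    # 2. Tokenize the string into single words.
    tokens = cleaned.lower().split()
    # 3. Reduce inflectional and derivational forms to a common base.
    tokens = [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]
    # 4. Remove stopwords and words unknown to the word2vec corpus.
    return [t for t in tokens if t not in STOPWORDS and t in vocabulary]
```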

Fig. 2. Sample matrix for calculating the semantic similarity of two strings using word-pair-wise cosine similarity. The result vector v is composed of the best matching word pairs (bold values).

The pre-processed labels can now be matched against the UI elements in the target application, whose labels have been pre-processed in the same way. We use the word vectors of the word2vec model to compute the cosine similarity between all possible word pairs. Figure 2 shows the sample output for two given labels. The cosine similarity of each word pair is expressed in an \(n \times m\) matrix, where n and m are the respective lengths of the strings. We determine the result vector by selecting the best matching word pairs (ignoring the order of words) under the condition that every word is matched at most once. Although the order of words is neglected, the two strings may express a different number of concepts; by excluding multi-matches, the given method reflects this. For our example, the procedure does not allow the words “username”, “email”, and “address” to be matched to the same word “username” in the second string, even if the resulting normalized sum of the vector would be higher. The entries of the result vector v are summed up and normalized by the dimension of v. The computed value \(\equiv _\textit{sem}\) lies in the interval \(\left[ -1,1\right] \). Again, values close to \(-1\) indicate highly dissimilar concepts, while values close to 1 indicate a semantic match.
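
A sketch of this matching step; the greedy selection of best word pairs is our simplification, and `model` is a word vector model offering `similarity()` as in the gensim example above:

```python
import numpy as np

def semantic_similarity(words_a, words_b, model):
    """Word-pair-wise cosine similarity, matching every word at most once."""
    if not words_a or not words_b:
        return -1.0
    # n x m matrix of cosine similarities between all word pairs.
    sim = np.array([[model.similarity(a, b) for b in words_b] for a in words_a])
    pairs = sorted(((sim[i, j], i, j)
                    for i in range(len(words_a))
                    for j in range(len(words_b))), reverse=True)
    used_a, used_b, v = set(), set(), []
    for value, i, j in pairs:                    # greedily pick best pairs,
        if i not in used_a and j not in used_b:  # excluding multi-matches
            used_a.add(i); used_b.add(j); v.append(value)
    # Sum up the result vector and normalize by its dimension.
    return float(sum(v) / len(v))
```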

In summary, for each feature discovered in application A in the previous phase, we find the closest semantic match by checking the textual content of each DOM element in B. To guide exploration, we then group the matched labels together with the closest visually aligned interactive element. The algorithm returns a list of potential matches together with the calculated matching index \(\equiv _\textit{sem}\), which serves as a certainty factor. The more describing labels we can find in a single DOM, the higher the chance that the correct feature is identified. This list, sorted in descending order, guides the crawler to discover the features learned in the previous feature identification phase faster.
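
Building on the `semantic_similarity` sketch above, the candidate matches in a target DOM could be ranked as follows (`candidates` maps each interactive target element to its pre-processed label words; all names are ours):

```python
def rank_candidates(feature_words, candidates, model, threshold=0.0):
    """Return target elements sorted by semantic similarity to a source feature."""
    ranked = []
    for element, label_words in candidates.items():
        score = semantic_similarity(feature_words, label_words, model)
        if score >= threshold:            # drop dissimilar concepts
            ranked.append((score, element))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return ranked                         # best matches first guide the crawler
```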

4 Evaluation

To evaluate the performance of our semantic feature mapping, we selected 12 industry-sized real-world applications from 3 different domains (see Table 1), representing the most widely used applications according to the Alexa index [2]. While executing the tests, the states presented in the client browser contain 155,858 DOM elements, 32,905 of which are interactive and 37,252 of which carry natural language content. For each of these applications, a custom Selenium test suite has been developed.

As already described, the matching method does not require two application models as input; it analyzes one (source) application model A to identify features and takes the DOM states of the target model B to identify the same features. The test setup thus runs, for instance, the test suite for Amazon, extracts the tested features together with their describing labels, and identifies the features in eBay, for which we provide the DOM data. To verify the correctness of the matching, we manually labeled each of the 551 features and tested whether the prediction is correct.

Table 1. Evaluation subjects for matching features across applications. #Features denotes the number of features identified when executing the test cases. #NLC Elements denotes the total number of UI elements per application with natural language content that were analyzed for matching the given features.
Fig. 3. Results of feature matching. The x-axis shows the threshold applied on \(\equiv _{\textit{sem}}\), the y-axis displays the precision values for P@1 (blue dash-dot), P@3 (green dashed), P@10 (grey dotted), and Recall (orange solid). (Color figure online)

Each application is tested against all other applications within the same domain. Figure 3 presents the results of our evaluation averaged by the domain. Analyzing this data, we address the following two research questions (RQ):

RQ1. Can we identify features using semantic UI element similarity?

This is the core contribution of our technique: Mapping features from the source application, as exercised in the source tests, to features of the target application. We evaluate the accuracy of the mapping; the better the accuracy, the more effective the guidance for subsequent exploration and test generation.

RQ2. Can automated feature identification be used to improve crawling?

What difference does it make to a crawler if guidance exists? We conservatively evaluate the worst case speedup of guided crawling vs. non-guided crawling.

4.1 Mapping Accuracy

We start with RQ1, evaluating the accuracy of the mapping of features from the source application to the target application. In the area of information retrieval, the precision at k [10] denotes how many good results are among the top k predicted ones. P@1, for instance, indicates the precision for a perfect match (the top element is the correct feature), while P@10 indicates that the correct result was in the top 10. The semantic similarity calculation returns, for every given feature, a list of candidate elements together with the semantic similarity index \(\equiv _{\textit{sem}}\). A \(\equiv _{\textit{sem}}\) value of 0 marks the point where only semantically similar concepts are taken into consideration. Figure 3a shows that even when merely excluding dissimilar features (i.e. threshold 0), semantic matching identifies features with a precision of 69% (accumulated over all domains), since the list is ordered and the best matches are on top. Increasing the threshold, i.e. raising the minimum semantic similarity required to consider features as matching, not surprisingly increases the precision up to a value of 1, which means that the given features have identical describing labels. However, the recall decreases disproportionately to the increase in precision, so a low threshold of 0.15 produces the best ratio.
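
For reference, a minimal sketch of the metric (`ranked_lists` maps each feature to its ranked candidate elements, `truth` to the manually labeled correct element; both names are ours):

```python
def precision_at_k(ranked_lists, truth, k):
    """Fraction of features whose correct element is among the top-k candidates."""
    hits = sum(1 for feature, ranked in ranked_lists.items()
               if truth[feature] in [element for _, element in ranked[:k]])
    return hits / len(ranked_lists)
```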

Overall, we correctly identify 75% of the 551 features in the test data with a precision of 83%. When checking the matching success of the individual domains, the result for the Knowledge Base domain (Figure 3d) is noteworthy compared to the other results. The features here can be matched with an average precision and recall of more than 90%. This means that the applications not only share the majority of features, but that these features are represented in the same semantic context. The first three sample pages use the same underlying content management system (the MediaWiki framework); the tested features are often encapsulated within the same describing labels. The precision values are even higher than in the other domains. But even features such as filling a shopping cart or providing payment information are identifiable by semantic feature matching: Figure 3b shows that direct matching has a precision of 77% at the same recall rate.


4.2 Crawling Speedup

We continue with RQ2, assessing the extent to which automated feature identification improves crawling and subsequent test generation. Random crawling techniques often face two problems: (i) the number of possible actions in a given state is large, and (ii) generating valid actions to interact with the elements is difficult. With the presented technique, we address both problems. To evaluate the impact of semantic feature identification on crawling, we consider the simplest scenario for generating tests. Even without considering interdependencies between features or how to reach a feature (which leads to state explosion), the chance of randomly interacting with the elements representing a feature is negligible.

Our models are as conservative as possible, modeling and reporting worst-case scenarios for our technique. Real-world crawlers operating on real-world Web applications may perform differently, sporting heuristics to focus on specific UI elements first and leveraging similarity between Web pages to perform better. All these optimizations are orthogonal to our approach and could be combined with it; we kept them out of our evaluation for the sake of generality.

Guiding the crawler to the correct feature does not require a perfect matching; instead, it profits from a ranked list of recommendations from which to derive a crawling strategy. For this reason, we evaluated (Fig. 3) the probability that the correct feature is within the top 3 (P@3) or top 10 (P@10) results. The P@10 value for eCommerce shows, for instance, that at a recall rate of 88%, semantic matching can identify features with a precision of 92%. In the domain of search engines and mail clients, the matching is capable of identifying 77% of all features with a precision of 97%. Choosing different thresholds for the prediction impacts these values tremendously. Nevertheless, the values do not directly tell us the impact on crawler effectiveness.

To this end, we evaluate how effective guided crawling is compared to non-guided crawling. Applied to our test set, none of the available random UI crawlers [3, 12, 13, 18] is able to capture the complex usage scenarios in a reasonable time frame (i.e. days), even with pre-configured input values. Consequently, we evaluate how our technique reduces the search space by smartly ordering which UI elements should be explored first. This conservative analysis does not capture the effect of dependent actions and thus serves as a lower bound for the speedup calculation.

We start with a simple crawler that has no guidance (i.e., not using our technique). Assume that the application has a number \(\left| \textit{UI }\right| \) of interactive user interface elements. A non-guided crawler has to test each UI element individually to find all features:

$$\begin{aligned} \textit{number of UI actions without guidance} = |\textit{UI }| \end{aligned}$$
(1)

Let us now look at a guided crawler that aims to test explicit features first instead of exploring randomly. Our guidance provides a recall r across the UI elements in the target that are matched in the source. Within the set of \(r \times |\textit{UI }|\) matched features, we have a set of k top matches, and the probability p@k that the correct feature is within these top matches. The number of correct matches to explore can thus be estimated as \(p@k \times k\), whereas the number of incorrect matches would be the converse \((1 - p@k) \times (|\textit{UI }| - k)\). (With perfect guidance, \(p@k = 1.0\) would hold, and the crawler would only have to examine k matches.)

If we do not find the target within the matched features, we have to search within the unmatched features, going through an average number of \((1-r) \times \left| \textit{UI }\right| \) elements. The total number of UI interactions for testing all features thus is:

$$\begin{aligned} \textit{number of UI actions with guidance} = p@k \times k + (1 - p@k) \times \left( |\textit{UI }| - k\right) + (1-r) \times |\textit{UI }| \end{aligned}$$
(2)

With the number of interactions for both unguided and guided exploration, we can now compute the average speedup as

$$\begin{aligned} \textit{speedup} = \frac{\textit{number of UI actions without guidance}}{\textit{number of UI actions with guidance}} \end{aligned}$$
(3)
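
Under this cost model (with Eq. (2) as reconstructed above), the worst-case speedup can be estimated as in the following sketch; the plugged-in numbers are purely illustrative and not taken from the evaluation data:

```python
def guided_actions(ui, r, p_at_k, k):
    """Expected number of UI actions with guidance (Eq. (2) as given above)."""
    return p_at_k * k + (1 - p_at_k) * (ui - k) + (1 - r) * ui

def speedup(ui, r, p_at_k, k):
    """Eq. (3): unguided actions (= |UI|, Eq. (1)) divided by guided actions."""
    return ui / guided_actions(ui, r, p_at_k, k)

# Purely illustrative numbers:
print(round(speedup(ui=2000, r=0.88, p_at_k=0.92, k=10), 2))
```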

Figure 4 shows the speedup results, as determined from our earlier evaluation data. The average speedup over all domains is 740%. Testing knowledge bases, which already showed good performance in the earlier evaluation step, can even be improved by a factor of 11.75. Figure 4 also shows that a high threshold slows down the exploration process—at least if backtracking and re-executing an action is not penalized in terms of execution time. A high recall is thus more important than a high precision. P@3 and P@10 outperform the direct match by orders of magnitude and are preferred when it comes to guiding test generation.

Fig. 4. Worst case speed-up for guided crawling using semantic feature identification compared to pure random crawling. The orange (dotted) line shows the average speed-up using P@1, the blue (solid) line P@3, and the grey (dash-dot) line P@10 for every test domain; the x-axis shows the threshold applied on \(\equiv _{\textit{sem}}\). (Color figure online)

While a conservative speedup of seven is already impressive, one should note that this speedup applies to the exploration of one single state only—that is, it provides guidance towards the UI elements that are directly reachable from this state. We assume the crawler was already able to somehow reach the state with the features under test. If the functionality we search for lies deep within the target application, semantic guidance speeds up every exploration step along this path.


Given the complexity of business processes and modern Web applications, we do not interpret these cumulated speedups merely as an improvement over existing random crawlers; rather, we see our guidance as enabling their exploration and testing in the first place—in particular when it comes to covering deep functionality. This opens the door for writing a test case once for the source application, and reusing it again and again to test the same functionality in target applications, even if it is hidden deep in the system.

5 Threats to Validity

The most noteworthy threat concerns feature identification. Although the experiments were run on 12 real-world applications from different domains, all featuring distinct functionality, the scope of this analysis is far from complete. There are domains, for instance online office tools, which offer special functionality that is not represented in the given set of domains. Certain features are not represented with natural language content (using icons or images instead) and therefore cannot be analyzed by this approach. Possible solutions might be to analyze the alternative text of images, often provided for the sake of accessibility, or to train a neural network on a dataset of labeled images [7].

Another threat to the validity of this study is feature selection. Our evaluation is based on a manually written set of Selenium test cases, which introduces analysis bias, as not all features of the applications are tested due to the huge number of potential interactions per state. Additionally, there is no way to measure how many features exist in the AUT and thus to calculate the test coverage of the real-world applications. All applications are treated as a black box, and the testing framework can only observe changes in the client browser. To reduce bias in the presented dataset, we captured 551 typical features within the applications based on the most common use cases.

The presented work does not aim at complete testing, but rather guides test generation to enable the exploration of certain usage scenarios (i.e. functionality deep in the exploration path) and to test them faster. Accordingly, the quality of the generated crawling strategy is bounded by the quality of the original test suite and by whether the tested features indeed exist in the target application. Since applications from the same domain typically share functionality “similar” to that of the AUT, the presented transfer technique can effectively achieve speedups over random testing.

6 Conclusions and Future Work

Modern interactive applications offer so many interaction opportunities that automated exploration is practically impossible without some guidance towards relevant functionality. We therefore propose a method that reuses existing tests from other applications to effectively guide exploration towards semantically similar functionality. This method is highly effective: Rather than spending hours or days exploring millions of redundant paths, our guidance allows for discovering deep functionality in only a few steps. In the long run, this means that one needs to write test cases for only one representative application in a particular domain (“select a product”, “choose a seat”, “book a flight”, etc.) and automatically reuse and adapt these test cases for any other application in the domain, leveraging and reusing the domain experience in these test cases.

Despite these successes, there is still lots of work to do. Our future work will focus on the following topics:

Reusing Oracles.

Our matching approach is able to match actions that lead to specific functionality; however, we cannot check whether a particular action was successful or not. Although this problem is shared by all test generators (which had therefore better be called execution generators), we have the opportunity to reuse not only the actions, but also the oracles in existing test cases—that is, the code that checks whether a test is successful or not. The challenge here is to identify, match, and adapt appropriate criteria for success or failure.

Additional Features.

Our matching approach exploits the semantic similarity of textual labels that are either part of UI elements or in their immediate vicinity. We plan to exploit further sources, such as default and alternate texts, internal identifiers, or icon names and shapes to obtain additional semantic information. Additional sources can also be used to identify elements that likely yield mostly equivalent functionality, such as the several product links on shopping sites.

Multiple Sources.

So far, our approach leverages one test as a source to guide test generation. Human experience, though, is formed from many examples, and experience with multiple user interfaces helps when interacting with the next one. We are looking into techniques that allow further abstraction over multiple examples to guide test generation.

Pruning Generated Tests.

Guiding a crawler through an application by reusing existing tests allows us to speed up testing tremendously or even to cover features that random crawlers could previously not test at all. Still, the generated traces are far from optimal. After the initial test suite has been covered, we want to prune the generated paths by exploring alternative routes and reducing the path distance.

To facilitate replication and comparison, all the data referred to in this paper is available for download. The package comprises the exploration graphs for all evaluated applications, the mappings as determined by our approach, as well as the ground truth we evaluated against. For details, see: https://www.st.cs.uni-saarland.de/projects/tdt/.