
1 Introduction

For the design of many technical systems, it can be very valuable to know where people are directing their visual attention. Especially for operators in dynamic and safety-critical systems, such as car drivers or aircraft pilots, the way in which visual attention is distributed is essential for safety. Investigations into how changes to the system (e.g., new assistance systems or cockpit designs) affect attention distribution are therefore commonplace. Knowledge about the attention distribution can be helpful even in non-safety-critical areas, e.g., in the ergonomic design of websites or the optimal placement of advertising.

The examination of attention distributions is usually done with eye trackers, which is the typical way to obtain reliable results. However, eye tracking studies have the disadvantage that they are complex, time-consuming, and not always easy to perform. Particularly in safety-critical areas, new systems are difficult to test in the real environment and therefore have to be studied in secure test fields or simulation environments. Predicting the distribution of attention is an alternative solution. There is a large body of visual attention models in the literature that focus on very different aspects of attention [1]. Among them is the SEEV (Salience, Effort, Expectancy, Value) model [2], which is relatively easy to use. It has been used, among other things, to predict the distribution of attention among motorists [3, 4], pilots [2], seafarers [5], and scrub nurses [6]. Even though using the SEEV model is easy, it can still be difficult to make accurate predictions with it. To apply the SEEV model, information sources (or Areas of Interest) must be defined. For each information source, parameters are determined that describe the influence of the factors Salience, Effort, Expectancy, and Value. It has been shown that the accuracy of the predictions depends strongly on the subjective evaluations of the modeler, which are subject to very high variability and thus also to a high expected error [7].

However, according to the Diversity Prediction Theorem formulated by Page [8], a fairly accurate prediction can be made even with large individual errors, provided a larger number of individual assessments are aggregated. The prerequisite is that the individual assessments are independent of each other and normally distributed around the actual value. In such a case, the aggregated prediction of a group of individuals is significantly better than that of a single individual, and the influence of misjudgements by a single analyst is minimized. This is often referred to as the “Wisdom of the Crowd” effect.

Fig. 1. Modelling process for the multi-model approach. Multiple models are independently created and aggregated. The aggregated model is analyzed to predict attention distribution.

Such an effect has also been observed in the prediction of human attention distribution [7]. In a multi-model approach, several people were asked to individually model the attention distribution of a driving situation using the SEEV model. The model forecasts were then aggregated. This process is depicted in Fig. 1. It greatly increased the prediction accuracy compared to the average individual prediction [7]. For the aggregation, it is necessary to group the information sources defined by the individual modelers into classes. Figure 2 shows an exemplary information source class: the white rectangles are information sources defined by different people that all belong to the same information source class “road ahead”. Grouping these information sources into classes is a manual process that is problematic for several reasons. One problem is that it is very time-consuming and does not scale well. All other aspects of the process depicted in Fig. 1 can either be performed in parallel or are not affected by the number of modelers. In the multi-model approach, several modelers can model at the same time (see Fig. 1), which considerably reduces the time required for modeling; the number of modelers therefore does not affect the time needed to create the models. The effort for the analysis of the aggregated model is likewise independent of the number of modelers. The cost of aggregating the models, however, increases with the number of modelers, because of the manual grouping of information sources. The effort of all other steps is independent of the number of modelers or can be automated.

Fig. 2. Information source class “road ahead” marked by different modelers for the same situation

Another problem is that the grouping is a manual process and is thus subject to individual errors of the human classifier. However, individual mistakes in modeling are exactly what the multi-model approach is meant to avoid. One solution is to have the classification done by several people and to indicate the quality of the results using a concordance measure such as Fleiss’ Kappa [9]. A disadvantage of this solution, however, is that the effort increases even more, because multiple people have to carry out the same manual classification process.

The problems mentioned above might be one of the reasons why such a reliable and traceable model-based attention prediction process is rarely adopted for user interface evaluation. The objective of this work is to make the multi-model approach more easily applicable by reducing the effort of manually grouping information sources. To this end, we present a semi-automated procedure for clustering information sources into classes of information sources. We tested a geometric clustering approach that is based on the region covered by an information source, and a semantic clustering approach that is based on the textual labels assigned to the information sources by the modelers.

First, we outline general approaches to cluster analysis and highlight the specifics of clustering based on textual descriptions. Then we introduce our clustering algorithm. We evaluate the algorithm using manually clustered reference data in Sect. 3. Finally, we discuss what a reasonable practical application of the algorithm in a software tool chain can look like.

2 Clustering of Information Sources

General clustering algorithms for the automatic grouping of objects are well known [10]. A central aspect of a clustering algorithm is a measure of the similarity of objects. This measure is defined over the attributes of the objects to be grouped and is called the distance function. The objects are grouped so that the distances within a group are as small as possible and those between groups are as large as possible. In hierarchical methods, the objects are either hierarchically subdivided into smaller and smaller groups until the desired number of clusters is reached, or they are merged, starting from single-element groups, until the desired number of clusters is reached. In non-hierarchical methods, an initial grouping is optimized until there is no further improvement [11].
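As an illustration only (the paper does not prescribe an implementation), a hierarchical agglomerative clustering over a precomputed pairwise distance matrix could be set up with SciPy roughly as follows; the helper name cluster_sources, the data representation, and the choice of average linkage are our assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_sources(sources, dist_fn, n_clusters):
    """Group information sources using an arbitrary pairwise distance function
    (hypothetical helper; average linkage is one possible choice)."""
    n = len(sources)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = dist_fn(sources[i], sources[j])
    # Merge single-element groups until n_clusters clusters remain.
    z = linkage(squareform(dist), method="average")
    return fcluster(z, t=n_clusters, criterion="maxclust")
```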

To define a suitable distance function for information sources, we are pursuing two approaches that use different attributes of information sources:

  • Geometry The location and form of information sources that are defined by different modelers but are meant to define the same source should be similar.

  • Semantics In the modeling of attention distribution and also in eye tracking studies, information sources are typically named or provided with identifiers describing what information can be extracted from the source or what the nature of the information source is. The labels of information sources that were defined by different modelers but are meant to define the same source should be semantically similar.

2.1 Geometry-Based Distance Function

In principle, the shape of an information source can be arbitrarily complex. Therefore, different morphometric measures can be used for the distance function. In many cases, however, information sources can be described sufficiently accurately by very simple shapes. In the studies used for evaluation in this work (Table 1), information sources were only drawn as rectangles described by the attributes x-position, y-position, height (h), and width (w). Since these attributes are cardinally scaled and have the same unit, the Euclidean distance is a good candidate for a distance function of two information sources A and B [11]:

$$\begin{aligned} d_g(A,B) = \sqrt{(x_A-x_B)^2 + (y_A-y_B)^2 + (w_A-w_B)^2 + (h_A-h_B)^2} \end{aligned}$$
(1)

This approach is quite simple and works well for many information source classes. However, as Fig. 2 demonstrates, the areas of information sources of the same class can sometimes differ strongly.
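A minimal sketch of Eq. (1), assuming each information source is stored as a dictionary with the keys x, y, w, and h (this representation is our assumption, not prescribed by the studies):

```python
from math import sqrt


def geometric_distance(a, b):
    """Euclidean distance of two rectangular information sources, Eq. (1)."""
    return sqrt((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2
                + (a["w"] - b["w"]) ** 2 + (a["h"] - b["h"]) ** 2)


# Two hypothetical "road ahead" rectangles marked by different modelers:
print(geometric_distance({"x": 100, "y": 80, "w": 300, "h": 120},
                         {"x": 110, "y": 90, "w": 280, "h": 130}))
```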

2.2 Semantic-Based Distance Function

To measure the semantic distance of the labels of information sources, we used the lexical-semantic nets WordNet [12] for English labels and GermaNet [13, 14] for German labels. GermaNet was built following the example of WordNet and is compatible with it. These nets provide information about the semantic relatedness of single words (nouns, verbs, and adjectives) [15]. We used them to calculate a distance between the information source labels. In most cases, the labels consist of multiple words. Therefore, we use a multi-word expression distance measure proposed by Huang and Sheng [16].

WordNet is a graph-like structure. Its nodes are sets of synonymous words called synsets; e.g., the words “foyer” and “lobby” both denote a large entrance or reception room and are therefore part of the same synset. The single words contained in the synsets are called lexical units. Furthermore, each synset has a textual description of its meaning. The edges of the WordNet graph are constructed by ten different semantic relations that connect either two complete synsets (conceptual relations) or two specific lexical units (lexical relations). The four lexical relations of WordNet and GermaNet are synonymy, antonymy, pertainymy, and participle. Pertainymy connects adjectives with the nouns from which they were derived, and the participle relation does the same for adjectives and verbs. Besides these, there are the six conceptual relations hyperonymy, hyponymy, meronymy, holonymy, causation, and association. Hyperonymy describes an “is-a” relation between two concepts, e.g., a carrot is a vegetable, and hyponymy describes the opposite relation. Accordingly, meronymy and holonymy are also opposites: meronymy describes a part-whole relation, e.g., a branch is part of a tree, while holonymy, inversely, means for example that a tree consists of branches. The causation relation connects verbs with adjectives that express the result of the verb, e.g., the words “(to) close” and “closed” are related because a door is closed after you close it.
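For English labels, this structure can be explored with NLTK's WordNet interface; the snippet below only illustrates the concepts described above (GermaNet is accessed through its own APIs and is not bundled with NLTK) and assumes the WordNet corpus has been downloaded via nltk.download("wordnet").

```python
from nltk.corpus import wordnet as wn

# Synsets containing the lexical unit "lobby", with their member words
# (lexical units) and textual descriptions.
for synset in wn.synsets("lobby", pos=wn.NOUN):
    print(synset.name(), synset.lemma_names(), "-", synset.definition())

# Conceptual relations are edges between synsets, e.g. hyperonymy ("is-a"):
print(wn.synsets("carrot", pos=wn.NOUN)[0].hypernyms())
```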

For WordNet, there are nine measures of semantic relatedness between two synsets; GermaNet implements all of these except the vector measure. They can be divided into six similarity measures and two relatedness measures, whereby the similarity measures use only the hyponymy relation and the relatedness measures make use of all relations. The six similarity measures are called res, lin, jcn, lch, wup, and path. The path measure, for example, defines the similarity of two synsets via the length of the shortest path along the hyponymy edges between them. Another example is the measure wup, which uses the least common subsumer (LCS) to calculate the similarity of two synsets. The LCS is the most specific synset that is an ancestor of both synsets, whereby the specificity is measured using the information content of the common ancestor nodes. The measure wup defines the semantic similarity of two synsets A and B as follows [17]:

$$\begin{aligned} s_\mathrm {wup}(A, B) = \frac{2 \cdot \mathrm {depth}(\mathrm {LCS}(A, B))}{\mathrm {dist}_\mathrm {LCS}(A, B) + \mathrm {dist}_\mathrm {LCS}(B, A) + 2 \cdot \mathrm {depth}(\mathrm {LCS}(A, B))} \end{aligned}$$
(2)

Here, \(\mathrm {LCS}(A, B)\) denotes the LCS of A and B, \(\mathrm {depth}()\) is the distance from the root of the hierarchy to a given node, and \(\mathrm {dist}_\mathrm {LCS}(X, Y)\) is the distance from a node X to the LCS of X and Y. Because the similarity measures use only the hyponymy relation, they can only be used to measure the distance between synsets of the same word category and not between synsets of different word categories, e.g., a noun synset and a verb synset. The measures res, lin, and jcn also use the LCS to compute the similarity of synsets, and lch is a path-based measure like the path measure mentioned above.
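The wup measure of Eq. (2) is available in NLTK as Synset.wup_similarity(); the following sketch uses hypothetical example words and again assumes the English WordNet corpus is installed.

```python
from nltk.corpus import wordnet as wn

mirror = wn.synsets("mirror", pos=wn.NOUN)[0]
speedometer = wn.synsets("speedometer", pos=wn.NOUN)[0]

# Wu-Palmer similarity (Eq. (2)): relates the depth of the least common
# subsumer (LCS) to the distances of both synsets from that LCS.
print(mirror.wup_similarity(speedometer))
# The LCS itself can be inspected as well:
print(mirror.lowest_common_hypernyms(speedometer))
```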

Besides these, hso and lesk are the two relatedness measures implemented for WordNet and GermaNet. The hso measure uses all relations to search for a path between two synsets whose length does not exceed a certain threshold and that changes its direction as rarely as possible. The second relatedness measure, lesk, analyses the textual descriptions of the synsets to be compared, searches for overlaps, and calculates the relatedness based on the number of these overlaps.

To calculate semantic similarities between multi-word expressions, we use a measure proposed by Huang and Sheng [16] that builds on the similarity measure wup. It calculates the pairwise similarity between all words of two expressions using the wup measure. If the similarity of two words cannot be calculated, i.e., the words do not belong to the same word category, an edit distance such as the Levenshtein distance is used instead. Let \(e_1\) and \(e_2\) be two multi-word expressions. For each word w in \(e_1\), the measure by Huang and Sheng searches for the highest similarity of w with a word from \(e_2\). All similarities found in this way are summed up, and the sum is called CostSub. It is also counted how often the best found distance is zero (Skip). After all words of \(e_1\) have been compared to \(e_2\), the same process is repeated vice versa. The similarity between \(e_1\) and \(e_2\) is then calculated as follows, using also the total number of words in both expressions (Total) and two weight parameters \(W_\mathrm {Skip}\) and \(W_\mathrm {Sub}\) [16]:

$$\begin{aligned} d_s(e_1,e_2) = 1 - W_\mathrm {Skip} * \frac{\mathrm {Skip}(e_1,e_2)}{\mathrm {Total}(e_1,e_2)} - W_\mathrm {Sub} * \frac{\mathrm {CostSub}(e_1,e_2)}{\mathrm {Total}(e_1,e_2) - \mathrm {Skip}(e_1,e_2)} \end{aligned}$$
(3)

We chose \(W_\mathrm {Skip}=1\) and \(W_\mathrm {Sub}=2.5\) as proposed by Huang and Sheng [16]. The range of the calculated values is \(0 \le d_s \le 1\), where a value of zero means that the expressions \(e_1\) and \(e_2\) have no similarity and a value of one means that the expressions are identical. This measure allows us to compare two information sources A and B based on their labels: \(d_s(\mathrm {label}(A), \mathrm {label}(B))\).
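The core pairwise comparison step of this measure can be sketched as follows; the aggregation into Eq. (3) with the weights above is omitted here. Using difflib as a stand-in for a Levenshtein-style fallback and restricting the lookup to noun synsets are our simplifications, not part of Huang and Sheng's formulation.

```python
from difflib import SequenceMatcher

from nltk.corpus import wordnet as wn


def word_similarity(w1, w2):
    """wup similarity of two words; falls back to an edit-based ratio when no
    comparable synsets are found (simplified: nouns only, first synset)."""
    s1 = wn.synsets(w1, pos=wn.NOUN)
    s2 = wn.synsets(w2, pos=wn.NOUN)
    if s1 and s2:
        return s1[0].wup_similarity(s2[0]) or 0.0
    return SequenceMatcher(None, w1, w2).ratio()


def best_matches(e1, e2):
    """For each word of expression e1, the highest similarity to any word of
    e2 -- the quantities from which Skip and CostSub are accumulated."""
    words2 = e2.lower().split()
    return [max(word_similarity(w, t) for t in words2)
            for w in e1.lower().split()]


print(best_matches("road ahead", "street in front"))
```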

2.3 Combined Distance Measure

As a third approach, we combined the geometric distance measure \(d_g\) and the semantic distance measure \(d_s\). The combined distance measure \(d_c\) is calculated as the root mean square of \(d_g\) and \(d_s\).
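A minimal sketch of this root-mean-square combination (the function name is our own):

```python
from math import sqrt


def combined_distance(d_g, d_s):
    """Root mean square of the geometric and semantic distances (Sect. 2.3)."""
    return sqrt((d_g ** 2 + d_s ** 2) / 2.0)
```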

Table 1. Overview of the evaluation data sets.

3 Evaluation

We investigated whether the geometry-based or the semantic-based measure yields better results, using data from previous studies with manually clustered information sources.

3.1 Data Sets

To evaluate the proposed clustering methods, seven manually grouped data sets were used as test data (see Table 1). The data come from studies with 6 to 40 modelers who modeled the distribution of attention in 3 to 5 different situations. These were all automotive studies focused on analysing the attention distribution for different HMIs or in different driving situations, such as overtaking, parking, and approaching traffic lights. On average, about 5 to 10 information sources were defined per situation. In each study, these were grouped by 3 raters into 13 to 37 classes of information sources. The inter-rater agreement was measured using Fleiss’ \(\kappa \) and was always between 0.78 and 0.9. Participants in most studies were German and therefore labeled the information sources in German, except for dataset 6, which involved English participants. For the semantic clustering of dataset 6 we used the English WordNet library [12]; for the other data sets we used the German counterpart GermaNet [13, 14].

3.2 Results

Our main research interest was to analyze whether a geometry-based or a semantic-based distance function yields better results. For the data sets listed in Table 1, we evaluated three different distance measures: (1) a geometric distance measure based on the information source shapes, (2) a semantic measure based on the information source labels, and (3) a combined measure. Furthermore, we tested each of them with two different clustering algorithms: (1) a hierarchical clustering algorithm and (2) the k-medoids algorithm.

We evaluated the clustering quality against the manually clustered reference data using the adjusted Rand index (ARI) [21] as the measure of clustering quality. An ARI of 0 corresponds to a clustering one would obtain by chance, and an ARI of 1 to a clustering equivalent to the reference data.
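The ARI is available, for example, in scikit-learn; the label vectors below are purely hypothetical and only illustrate how an automatic clustering is compared with a manual reference.

```python
from sklearn.metrics import adjusted_rand_score

# Manual IS classes (reference) and cluster ids produced by an algorithm
# for the same four information sources (hypothetical data).
manual_classes = ["road ahead", "road ahead", "mirror", "speedometer"]
automatic_clusters = [1, 1, 2, 2]

# ARI is invariant to label permutations: 0 ~ chance level, 1 = identical.
print(adjusted_rand_score(manual_classes, automatic_clusters))
```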

Fig. 3. Mean and standard deviation of the adjusted Rand index (ARI) over all datasets, clustering algorithms, and distance measures.

Fig. 4. Exemplary ROC curves for the hierarchical clustering algorithm with the geometric distance measure for all seven datasets.

Fig. 5. Datasets.

It is in the nature of the data that the number of clusters is not known in advance, because there is no restriction on what information is marked by participants and, especially, at what level of detail. Therefore, we first analyzed how the number of clusters created by the algorithms for the different distance measures and datasets affected the clustering quality. Figure 3 shows the mean ARI and standard deviation over all datasets, clustering algorithms, and distance measures. The x-axis plots the number of clusters divided by the number of IS classes used in the manual classification. At a cluster/IS-class ratio of 1, the clustering algorithms generated as many clusters as IS classes were defined in the manual clustering. The graph shows that the clustering quality tends to be highest at a ratio of 1. This can also be seen in Fig. 4, which shows the receiver operating characteristic for different cluster numbers (1–100) for all data sets, exemplarily for the hierarchical algorithm with the geometric distance function. At a cluster/IS-class ratio of 1 (marked by the black triangles), the sensitivity is relatively high and the specificity is high. However, it can also be seen that the clustering quality is far from perfect.

Finally, we turned to our main research question and compared the clustering quality obtained with the different distance measures. We used the geometric distance function, the semantic distance function, and the combined distance function described in Sect. 2 to separately cluster all seven datasets. The target number of clusters was always set to the number of IS classes defined by the raters during manual clustering of the respective dataset (cluster/IS-class ratio = 1). The results are shown in Fig. 5.

It can be seen that the geometric distance measure outperforms the semantic distance measure. Combining both measures as described in Sect. 2.3 does not result in higher ARI scores than using the geometric distance measure alone. The choice of clustering algorithm does not change this general finding. The k-medoids algorithm performed better for the semantic and the combined distance measures, but never better than either algorithm relying on the geometric distance measure. However, the geometric distance function does not seem to be equally well suited to all data sets, since the standard deviation of the ARI is considerably higher for the geometric distance function than for the other distance functions. It is lower, though, for the hierarchical clustering algorithm than for the k-medoids algorithm when using the geometric distance function. For future applications, it therefore makes sense to use the hierarchical clustering algorithm with the geometric distance function.

4 Tool-Support for Semi-automatic Clustering

The analysis of the clustering quality in the previous section showed that the clustering quality is far from perfect. From Fig. 4 it can be seen that especially the sensitivity is low, meaning that the clustering algorithms often fail to recognize that two information sources should be in the same cluster according to the manual classification. For the application of predictive attention modeling, such a high number of erroneous classifications is not acceptable. We therefore use the automatic clustering algorithms only as a first step to create initial clusters. We developed a small software application that visualizes the clusters and allows the user to inspect and reorganize them. A screenshot of the application is shown in Fig. 6. The tool takes as input the list of information sources and the target number of clusters. It runs the automatic clustering method and displays the resulting clusters as proposals on the right side of the application. The clusters can then be reviewed by inspecting the regions that were marked as well as the labels that were provided by the modelers for each information source within a cluster. The user can then choose between three options:

  • accept the cluster by moving it to the left side of the application, which shows the list of IS classes and is initially empty.

  • merge the cluster with an already existing IS class. This is necessary if the algorithm has created a cluster that the user believes belongs to an already existing IS class, which happened quite often for our data sets. The resulting high number of false negatives is also the reason why the sensitivity of the algorithms is low (see Fig. 4). Merging clusters is simply done via drag-and-drop.

  • split the cluster. If the user believes that the cluster contains elements from more than one IS class (false positives), the cluster needs to be divided. The user can do this by rerunning the clustering algorithm, but only on the elements of the current cluster (see the sketch after this list). In the user interface, the user can specify the number of target clusters (typically 2). The original cluster disappears and the new clusters appear in the proposal list on the right side.
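The split operation can be realized by reapplying the clustering step to the members of the selected cluster; cluster_sources and geometric_distance refer to the hypothetical helpers sketched in Sect. 2, not to the tool's actual implementation.

```python
def split_cluster(cluster_members, n_subclusters=2):
    """Rerun the clustering restricted to one cluster's information sources,
    with a user-chosen target count (typically 2)."""
    return cluster_sources(cluster_members, geometric_distance, n_subclusters)
```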

This semi-automatic process is far more efficient than the manual clustering process. The most time-consuming operation is splitting clusters, because it requires selecting a target number of clusters and afterwards reviewing the new clusters again and accepting or merging them. Because of this, we suggest initially selecting a target number of clusters that is larger than the expected number of IS classes. The resulting larger number of clusters requires more merging than splitting operations.

Fig. 6. Clustering support tool.

5 Discussion

One result of our study is that clustering based on our semantic distance measure is inferior to clustering based on the geometric distance measure. However, we think that the semantic distance measure has the highest potential for improvement.

For example, WordNet’s antonymy relation could be used to identify similar information sources that definitely belong to different IS classes, like left side mirror and right side mirror.

Another approach could be to use WordNet’s meronymy relation to identify different levels of abstraction in the models. One person might create a detailed model by marking the speedometer, the revolution counter, and the fuel gauge. Another person might simply mark the entire dashboard.

The result of this work is a semi-automatic approach for clustering information sources for the analysis of attention distribution models. It turns out that the clustering quality is not sufficient for a fully automatic approach. We therefore created and presented a software tool for manually revising the automatically created clusters. This approach greatly reduces the clustering effort.