Introduction

The authors' research aims to aid the provision of relevant information to engineers working in complex, high-technology environments, where the volume of technical information is large and growing rapidly. The work reported in this paper concerns the use of data captured implicitly from a computer user's working session to identify contextual relationships between documents.

In this section a literature review of the primary areas of related research is presented. In Section 2 the primary drivers for the research are discussed along with some additional background; in Section 3, the method used to capture the data used in the implicit evaluation is presented; and, in Section 4, the naïve Bayesian classifier used to evaluate document relationships is discussed. In the remainder of the paper, the evaluation methodology, results and analysis are presented. The paper concludes with a discussion of the expected applications and limitations of the overall approach.

Implicit indicators and relevance feedback

A body of research has investigated the use of "implicit indicators" (Claypool et al., 2001) to aid the inference of a computer user's current objectives. A typical approach is to use event data, relating to interactions that can be captured in the background of a user's working environment, as implicit evidence for some aspect of their current state, such as their intentions, focus, or interest in a document.

Various indicators have been evaluated for their usefulness in the detection of contextual features associated with user behaviour and motivations. Horvitz et al. (1998) investigated using event data relating to menu selections and dialog usage for modelling users' needs and goals in software applications. This work involved the evaluation of patterns in event data using Bayesian networks to ascertain where the user might be struggling to achieve some aspect of software functionality, in order that related support information could be automatically provided.

Implicit indicators extracted from aspects of electronic information browsing behaviour have also been investigated for improving the effectiveness of search processes. In “implicit relevance feedback” approaches, the initial objective is to identify user interest in a particular document or topic. White et al. (2006) adopt an approach utilising data relating to a searcher's selection of one information object over others within the search space to accumulate evidence for relevant information in the result set. Once sample material of apparent interest has been identified during search, it is then possible to use content similarity measures to refine the result set to include more closely related content.

In the relevance feedback approach above, the intention has been to assist in refining the context of a search by extending the original query, utilising the feedback gained from implicit ratings. In our approach, rather than serving as a mechanism for refining the search query, implicit indicators are used to identify and annotate where contextual relationships exist between documents in an information retrieval system in order that they can be stored for improving the effectiveness of future information retrieval scenarios.

Sources of implicit evidence

In addition to the examples cited above, a number of other sources of evidence have been evaluated, including document viewing time, scrolling activity and webpage bookmarks (Hill et al., 1992; Morita and Shinoda, 1996; Lieberman, 1995; Seo and Zhang, 2000). Oard and Kim (1998) identify a useful categorisation of implicit evidence which covers three general areas of observable behaviour. The first two categories of information "examination" (e.g., viewing time) and information "retention" (e.g., bookmarking) refer to behaviours which indicate a user's interest or preference for information. The third category of "reference" relates to activities which indicate some form of link between two objects (e.g., a citation). In this work, a variety of temporal indicators which principally relate to the "examination" and "reference" categories are used to identify document interest and relationships. These are discussed in more detail later.

Viewing time is worthy of further discussion as it is used in this research and has been given significant attention in the literature. In particular, studies on this attribute have highlighted considerable variations in its effectiveness. In studies carried out by Morita and Shinoda (1996) and Konstan et al. (1997), viewing time was shown to be reasonably successful in the identification of interesting or useful news items. However, further studies by Kelly and Belkin (2001, 2004) showed that the time a user spends viewing a document was not significantly related to the user's subsequent relevance judgement. The authors of the latter studies report that issues such as topic familiarity and task type confound the relationship between display time and relevance in complex ways. Furthermore, they suggest that the way display time is determined is likely to strongly influence its utility, and that simple, easily collected "proxy-side" data may well give misleading results. In our approach, viewing time is collected "client-side" along with other evidence for a more complete specification of user actions. It is hoped that, in using this combination, more accurate assessments of document interest will be achievable.

Recommender systems

The theoretical approach is related, in some ways, to collaborative filtering approaches used in recommender systems. In its most common formulation, the recommendation problem is reduced to the problem of estimating ratings for items that have not been seen by a user (Resnik et al., 1994; Shardanand et al., 1995; Hill et al., 1995). Recommender systems typically rely on content filtering, collaborative filtering or a hybrid approach employing both. Content-based approaches select the right information for the right people by comparing representations of content contained within unseen documents to representations of content that the user is interested in (Herlocker et al., 2002). Collaborative filtering (CF) works by estimating the likelihood of a document being of interest to a certain type of user based on existing ratings from other users. This approach is based on the heuristic that people who agreed in the past will probably agree again (Resnik et al., 1994).

The Tapestry system (Goldberg et al., 1992) is cited as one of the earliest implementations of collaborative filtering. This system, which was developed for the purpose of electronic mail filtering, required that each user specify like-minded users manually. GroupLens (Konstan et al., 1997; Resnik et al., 1994) and Ringo (Shardanand et al., 1995) were among the first systems to use CF to automate this prediction. Since these early developments, collaborative filtering has been extensively researched in academia and successfully implemented in commercial applications for recommending web pages (Joachims et al., 1997; Lieberman et al., 1995), news items (Konstan et al., 1997) and music (Shardanand et al., 1995), to name a few.

In the collaborative filtering approaches above, the implicit (or explicit) user interest evaluations of documents or other items are used to identify users with similar preferences. Our approach differs from this in that the relationships assigned in the implicit evaluation are designed to encapsulate the value of a document in the context of its usage (principally defined by other documents) that might relate to a specific information need or activity, rather than its value to the user. Put another way, the approach investigated updates document indices for the purpose of aiding future task-oriented retrieval scenarios, and not for the purpose of personalization (i.e., responding to users' preferences for information).

Herlocker and Konstan (2001) have also investigated a similar "task-focused" approach. They suggest that a weakness in some traditional CF-based recommender systems is that they are based solely on historical ratings data, and that these approaches assume that a user's interest is independent of the task at hand. In their approach they utilise existing recommender system technology to develop a system which requires a manually input task specification to provide the task-focused recommendations. The task specification indicates what kind of task the user has in mind and consists of a list of example items, which is referred to as a task profile. Associated items are identified using a conventional user-item ratings matrix, where the correlation between items is computed.

This research follows along the same general line of enquiry but aims to make further contributions in this area. Firstly, a different method for making associations between items is presented, which is not based on independent user interest ratings, but an observation-based assessment of where items have historically been used in the same tasks—the machinery for making these observations is the area of primary concern in this paper. An additional contribution is made through the elicitation of possible working domains and circumstances where task-focused approaches to recommendation may be appropriate.

Link analysis

The work presented also has parallels with link analysis methods investigated in web information retrieval. A primary contribution in this area was made by Brin and Page's (1998) PageRank algorithm. PageRank provides an extended, web-based application of the citation analysis originally developed by Geller (1978), where citations or "backlinks" to a web page are used to give an approximation of its importance or quality. Kleinberg (1999) furthered research into the structure of hyperlinks with the development of the HITS algorithm. In that research, an algorithmic formulation of the notion of authority is evaluated, based on the relationship between a set of relevant authoritative pages and the set of "hub pages" that join them together in the link structure.

In addition to establishing the authority or prominence of information, link analysis approaches have also been developed for the purpose of identifying similar pages (Jeh et al., 2002), or pages biased to a specific topic (Haveliwala et al., 2002).

The authors feel that, conceptually, there are similarities in that the cited methods and the approach presented in this article both utilize evidence of "links" between separate items of information. However, a primary difference lies in the sources of evidence for these links. Specifically, the links in the cited examples take the form of explicitly defined evidence, typically embodied in hyperlinks. In contrast, the approach investigated in this research relies mainly on implicit evidence of links, based on how the user interacts with electronic information.

Related clustering approaches

In related clustering approaches, records of user transactions with Internet search engines are used to identify clusters of queries and URLs. In Beeferman and Berger's (2000) agglomerative clustering approach, records consisting of a user's search query and the subsequently selected URL are processed to identify clusters. Essentially, the algorithm is used to identify (i) different search queries that lead to the same information being accessed and (ii) the same search queries that result in different information being accessed. This work shares a feature in common with the work presented here in that it is not reliant on the content of the information retrieved. Instead, the approach utilizes evidence of user actions to make inferences about related items of information. Wen et al. (2001) have also carried out similar research, but in addition combine document and query contents to determine similarity.

Again, a primary difference with the cited research is that different sources of evidence are used to identify relationships between information. Specifically, the evidence does not relate to queries that result in certain information being retrieved, but temporal characteristics of document use that suggest relationships between information. A further difference is that the evidence is not solely related to search activity. In our research, we incorporate and evaluate evidence for different types of task involving the manipulation and creation of information as well as retrieval tasks.

Motivations

In this section some general background is given on the motivations and objectives of this research. Initially, some coverage is given on how the authors view the issue of context, particularly in relation to problems associated with responding to context in design information retrieval. The general objectives of the research and the expected applications of the approach are then presented.

“Context” in design information retrieval

The authors address the issue of "context" in design information retrieval through a broader consideration of aspects that extend beyond a searcher's information query. Specifically, the more general perspective of a "knowledge worker" (Drucker, 1959) who is involved in a variety of information handling tasks is considered in this research, as opposed to the rather more specific single task of a "searcher" retrieving information.

It is proposed that some important parts of a knowledge worker's "context" relate to their intentions and motivations, and ultimately the outcomes of a given activity. Finding ways in which information retrieval systems can interpret and respond to these aspects is the primary subject of this research. Clearly, a complete specification of these aspects is not easily achievable without some form of dialog between the knowledge worker and an IR system about their activity. However, it is suggested that a partial specification can be achieved using implicit evidence drawn from the worker's surrounding environment, their interactions with information, and other events relating to user actions. The aspect under investigation in this work relates to the implicit capture of valuable, reusable records of the context in which knowledge workers interact with information from multiple sources.

In addition, it is intended that the approach presented in this article demonstrates how an aspect of context-sensitive information retrieval may be supported. The main argument discussed here is that one way in which information retrieval systems can respond to a searcher's current context is to evaluate the way in which information currently in use has been used in previous circumstances. These previous circumstances can be defined in terms of metrics that indicate where separate information items have been used or have played a role in a common purpose or goal.

General research objectives

This research has been carried out in the design engineering domain, where document management and information retrieval systems are an essential requirement due to the information intensive nature of design tasks. In this domain, the routine tasks of designers and knowledge workers can involve access to electronic documents from a variety of sources in personal archives and on networked repositories. In addition, new information from external sources such as the Internet is referred to and used in the development of concepts and designs. For example, the Internet can provide an invaluable resource for the discovery of component-supplier data, engineering-standards data, and competitor information. In summary, information use in such environments is often characterised by the large number of documents accessed, consumed, created and manipulated. The effective sharing and dissemination of this information is of primary importance where design information is referred to and updated many times by designers during the life of a product.

An aspect of particular importance in design tasks is that the rationale for design decisions is captured to minimize the loss of expert knowledge and understanding. Although much of this rationale is supported by information contained within the formal records produced (e.g., design drawings, reports, etc.), research has shown that significant amounts of the valuable informal and unstructured information referred to are not recorded or associated with the resulting records of the design process (Lowe, 2002).

This issue is most prominent in the latter stages of design. In these tasks, designers draw on numerous information sources, which inform decisions and provide general support for their activities. For example, the design of a component or mechanical assembly may involve the following activities in a given working session:

  • Assessment of literature for identification of the correct design procedure or protocol to be followed. Relevant information may be obtained from the engineering corporation's information repositories, or from external sources such as regional or international standards organizations.

  • Evaluation of supplier data for the selection and incorporation of standard components into a new or modified design.

  • Reference to historical records of previous designs to inform decisions relating to a current design.

  • Reference to patent databases, design guides, product specifications, competitor information etc.

  • The creation or manipulation of product design data contained within geometric models, reports, specifications etc.

The focus of this work is to target computer-based activities involving the mix of information handling activities described above and to establish whether meaningful relationships between documents can be captured during the course of normal working routines. We have defined two types of document relationships which we are looking to extract from activities. Firstly, a bidirectional Common Utility Dependency, where two documents have been useful or of interest in the same task carried out by the computer user; for example, where documents x and y have both been useful in satisfying an information need in a search. Secondly, a unidirectional Reference Dependency, where one document has been a reference source in the creation or editing of another.

In both circumstances, the objective is to establish whether the relationships can be stored and indexed within information retrieval systems to improve their ability to respond to future information retrieval tasks. It is envisaged that such a system might have potential applications in two types of retrieval process to be investigated later in this work. In conventional manual retrieval it may provide a mechanism for the retrieval of a ranked list of contextually related documents, given an instance of a document that is of interest in the current search. This type of application could possibly be used in combination with conventional content-based retrieval algorithms. Alternatively, it may provide a mechanism for dynamic retrieval scenarios where a combination of the document in use and previously indexed relationships can provide an automated system with all the information needed to provide the user with additional related documents without any need for a search query.

The greatest difficulty associated with the extraction of such relationships is to make accurate assessments of document interest or utility without placing additional burden on the computer user's activities. This has been the subject of a considerable amount of investigation over recent years—as discussed in the review of related literature. In our research we employ a combination of temporal traits related to document use, such as the time a document is in view, and other metrics associated with the applications used for browsing and editing documents. These attributes, which are captured implicitly in the background of users' working environments, are introduced in greater detail in Section 3.

Scope and expected applications of the approach

A key feature of the approach is that it is intended to help provide evidence for how a given item of information has evolved as a result of the actions of workers in using, adding or refining it to meet their objectives. Accordingly, it is thought that the approach is most appropriate in domains where a core set of information is the subject of usage, revision and refinement over time. In addition, working environments where a number of persons are involved in this process may be particularly appropriate, so that a new user of an item of information can better understand its history and the way in which it had been used by others.

Engineering design is an obvious candidate domain because the characteristics of design information use fit closely with the criteria stated above. For example, in the aerospace sector, where large complex products are involved, the design lifecycles are long and the associated design information base is developed by a large number of people.

The authors suggest that these information use characteristics may also be present in a number of other domains and industries, particularly where: (i) intellectual processing of information takes place, and (ii) a core set of information is developed and worked on over an extended duration. For example, working environments in software development, legal services, areas of academic research, and banking may also provide appropriate domains. However, it is acknowledged that the suitability of this approach is likely to be limited to these or similar areas. Investigation into possible applications in other domains is the subject of further research.

Capture of document interaction data

Figure 1 illustrates the two main stages of the process being evaluated in this paper. It is intended that this process would typically be carried out off-line, at the end of a computer user's working session. In this section, the first stage relating to the capture of interaction data is discussed.

Fig. 1

Flow diagram of the process for assigning relationships between documents used in activities

Some of the data is used for an initial grouping of documents by the task in which they had been used—this aspect is discussed in Section 3.1. The remainder of the data extracted is used for the implicit evaluation of document relationships undertaken using naïve Bayesian classification. In 3.2 some coverage is given on the reasoning for the chosen sources of implicit evidence. The data attributes forming this evidence are then introduced in Section 3.3.

Task switching data

In order to identify documents in use that are serving a common goal or purpose, a mechanism is required that can distinguish which documents are used in a particular task, of which there may be several being carried out concurrently in a working session. Furthermore, in a single working session there may be a number of tasks involving the searching or editing of documents that are not related in any way. This behaviour of moving from one focus to another during computer use is usually referred to as task switching in the literature (Czerwinski et al., 2004).

An approach utilizing implicit evidence of task switching is not presented in this article, although it may be a worthwhile area for further research. The approach we have adopted for the initial grouping of documents by the task in which they had been used requires that the user enter a small amount of task-related information in interactive onscreen dialogs. This simple, accurate, but slightly cumbersome approach has been adopted primarily for the purpose of validating the overall process. Greater detail on the specific attributes of the dialogs is presented in the evaluation in Section 5.

An alternative strategy for gathering task switching data may be to develop an approach that does require the explicit selection of the current task, but in a way that does not greatly impinge on the user's activities. One notable example of a user interface design feature accommodating task switching, developed as a result of a study by Smith et al. (2003), is the GroupBar. Their prototype has been designed to add convenience for the user by responding to their explicit selection of task. In particular, the system responds to a user's task switch by organizing, and bringing into focus, all the windows or documents that are currently being used for that task.

Choice of implicit evidence

An overview of implicit indicators being investigated in related research was presented in Section 1.1. In this work, a number of temporal indicators were chosen as the primary source of evidence to fully investigate aspects not explored to a great extent in previous studies. Specifically:

  • This study investigates the use of a certain type of temporal indicator for examining how successfully it predicts possible relationships between documents. In previous studies the focus has primarily been restricted to evaluating document interest.

  • Where previous studies have restricted analysis to a specific activity such as web browsing or searching, the aim of this study was to evaluate the usefulness of temporal indicators in a broader range of tasks and general activities such as document editing as well as information retrieval.

  • This study also incorporates additional evidence, for example, the number of times the same information is brought into view to investigate the importance or prominence of an item of information in a given activity.

Document utility and use case data

Table 1 shows the data attributes captured during a computer user's working session. The document referral descriptors consist of a URL, or file path in the case of locally accessed files, and a document title extracted from metadata. The purpose of capturing these data attributes is simply to provide direct links to documents specified in a relationship.

Table 1 Document data attributes captured in session

The temporal metrics captured serve two functions. The primary function is to provide an implicit measure of interest or usefulness in the current task. This metric is used in combination with others, including the frequency with which a document is brought into view and the time spanning the first and last view within the session. It is thought that the latter attributes may help to capture the significance of documents being worked on or manipulated, particularly where a number of documents are being used interchangeably.

A secondary function of the temporal metrics is to enable the calculation of an additional parameter, temporal proximity, which refers to the time span between the use or viewing of the two documents under consideration. It is expected that this parameter may provide the classifier with an additional mechanism for determining the likelihood of one document being related to another in the context of the current task. The temporal proximity parameter is formally introduced in the next section.

The use-case attributes give an indication of the circumstances in which an electronic document containing information is used. Specifically, the two attributes used in the evaluation identify whether an electronic document was created, modified or just referred to, and also whether it was accessed from a local or remote file store.

The reason that they are included in the assessment is that they are aspects which can provide evidence to support which type of dependency is appropriate between two documents. For example, if a remote web page is referred to and then a locally stored document is modified a “reference dependency” is more likely to be appropriate (as the reference to information may have been made to support the user's actions in modifying the document). If two documents are referred to in succession then a “common utility dependency” may be more appropriate (as both documents may have been valuable in a search carried out to satisfy an information need).

The application type attribute is represented by a value that is assigned depending on the software application that is being used with the document. In this work, the software prototype (to be discussed later) has been configured so that the value returned is either “Browse” or “Edit” to distinguish between these high level use-cases. The document origin attribute simply specifies whether the document in use originated from a local or remote file store.
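To make these attributes concrete, a minimal Python sketch of how a per-document interaction record might be represented is given below. The field names are illustrative assumptions made for this article and do not correspond to the prototype's internal data structures.

  from dataclasses import dataclass

  @dataclass
  class DocumentInteraction:
      """Hypothetical per-document record of the attributes described above."""
      url: str              # URL, or file path for locally accessed files
      title: str            # document title extracted from metadata
      view_time: float      # total time in view during the session (seconds)
      view_span: float      # time spanning the first and last view (seconds)
      view_frequency: int   # number of times the document is brought into view
      app_type: str         # "Browse" or "Edit" (application type use-case attribute)
      origin: str           # "local" or "remote" (document origin use-case attribute)
      first_view: float     # session timestamp of the first view (seconds)
      last_view: float      # session timestamp of the last view (seconds)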

Naïve Bayesian classification of document relationships

A naïve Bayesian classification method (Coppin, 2004) is used to evaluate document relationships from the data attributes described above. Essentially, the classifier is provided with examples of pre-classified document interaction data for pairs of documents used in the same task. The classifications relate to the possible types of relationship to be identified. A list of these classifications is shown in Table 2. For a given document pair being evaluated, the classifier will find the closest classification match with the examples provided in the training set. In effect, the classifier learns the data attribute configurations that are likely to indicate a given relationship from examples in the training set and then applies this knowledge to predict unknown document relationships.

Table 2 List of possible classifications assigned to document pairs

Details on the choice of probabilistic algorithm are given in Section 4.1. The technical aspects of the implementation are then provided, including data discretization using the fixed k-interval method in 4.2, the population of the training set in 4.3, and naïve Bayesian classification of dependencies using the m-estimate for incomplete data in 4.4. The section concludes with some discussion of possible ways in which the relationships established could be stored or indexed within information retrieval systems.

Choice of probabilistic reasoning algorithm

A number of classification approaches have been evaluated in the Information Retrieval community for the purpose of text classification (TC). Popular TC methods that have been evaluated include naive Bayesian classification (Lewis, 1998), decision trees (Dumais et al., 1998), example-based approaches such as k-NN (Yang, 1999), neural networks, and support vector machines (SVM) (Joachims, 1998), to name a few. Sebastiani (2002) provides a review of these and other studies. The main findings show that SVM and example-based approaches are particularly strong in this domain. Batch linear and naive Bayesian approaches appear to be weaker, and classifiers with no learning component seem to perform worst of all.

In this work (where classification is not based on document text, but on attributes relating to document use) naive Bayesian classification is adopted. The beneficial aspects of this approach include its simplicity to implement and that there is no requirement for the specification or learning of dependencies between attributes (as there typically are when using a Bayesian networking approach). An aspect of further research will be to investigate the other cited classification algorithms in an effort to find the most effective and suitable approach for this particular application context.

Formation and discretization of data

A requirement of using the chosen algorithm is that the data attributes be discretized before being evaluated and classified. Before discussing the discretization method it is necessary to show how individual data attributes are organised before being processed. Essentially, every document pair from a subset of documents relating to the same task (as described in Section 3.1) is evaluated for a possible relationship. For each document pair a corresponding dataset is formed. Table 3 shows example data from two datasets, each relating to a document pair.

Table 3 Example document pair datasets to be evaluated in continuous and discrete form

The attributes relating to the individual documents are a subset of those discussed in Section 3.2. Additionally, the temporal proximity attribute, which is unique to the document pair under evaluation, is calculated as follows: given two documents x and y, accessed in succession, the assigned value is the time in seconds between the last recorded time that x had been in view and the first recorded time that y had been in view. In the case where y is accessed between the first and last viewing time of x, or vice versa, the attribute is assigned a special discrete value of 0. Finally, in the case where two documents are both used in the same task, but before and after a period of time spent on another task, the temporal proximity value remains the absolute time difference between the usage periods. This last scenario could possibly be handled in an improved way by subtracting the time the user had spent working on the other task; the authors hope to investigate this possibility in future research.
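A minimal sketch of the temporal proximity calculation described above follows, assuming the hypothetical per-document record introduced earlier; the viewing periods are treated as intervals, and the special value 0 is returned where they overlap.

  def temporal_proximity(x, y):
      """Temporal proximity (seconds) between two documents used in the same task.

      Returns the special value 0 when one document's viewing period overlaps
      the other's; otherwise returns the gap between the last recorded view of
      the earlier document and the first recorded view of the later one.
      """
      if x.first_view <= y.last_view and y.first_view <= x.last_view:
          return 0.0  # overlapping viewing periods
      if x.last_view <= y.first_view:
          return y.first_view - x.last_view
      return x.first_view - y.last_view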

The data for v, s, f, and p need to be discretized in a way to ensure that the intervals or bins for the various states of these attributes contain adequate training data. More specifically, as reported by Yang et al. (2001), there is a trade-off between two conflicting objectives. On one hand it is preferable that there be as many intervals as possible to increase the representation power of the attribute (the more intervals used the greater number of distinct values the classifier can distinguish between). On the other hand, it should be ensured that there are enough training instances in each interval, so that there is enough information to accurately estimate the probabilities required by Bayes' theorem.

The data is discretized using the fixed k-interval method (Dougherty et al., 1995). In this method, the continuous attribute is divided into a fixed number of intervals (k) or bins. The interval ranges are determined using observed instances of the continuous data attribute contained within the training data. Given n observed instances, each interval contains n/k adjacent values. In this work k is set to 4 (a decision primarily based on the amount of available training data), so each interval represents a quartile of observed values in the training set. A new attribute value to be evaluated is assigned a discrete value from 1 to 4 according to its position within the prescribed interval ranges. If the value lies outside the range of observed values, it is assigned 1 or 4 depending on whether it is lower or higher, respectively, than the existing range of known values.
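The fixed k-interval discretization can be sketched as follows, with k = 4 as used in this work. This is an illustrative reconstruction only; the exact handling of interval boundaries in the prototype is not specified in the text and is assumed here.

  def interval_boundaries(observed, k=4):
      """Upper boundaries dividing the n observed training values into k equal-frequency bins."""
      values = sorted(observed)
      n = len(values)
      # Last value of each of the first k-1 bins; the k-th bin is open-ended.
      return [values[max(0, (i * n) // k - 1)] for i in range(1, k)]

  def discretize(value, boundaries):
      """Map a continuous value to a discrete bin 1..k; out-of-range values clamp to 1 or k."""
      for i, bound in enumerate(boundaries, start=1):
          if value <= bound:
              return i
      return len(boundaries) + 1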

Training set

As stated previously, the training set consists of a number of pre-classified examples of document pair data. These examples are gathered using prototype software developed as a consequence of the research. Specifically, the software enables a computer user to enter data on their intentions and the value of documents they are using in addition to the capture of data required for the implicit evaluation. From a combination of the user-provided data and the data captured automatically, training examples complete with a document relationship classification can be used to populate the training set. The exact details of this process are covered in the evaluation in Section 5.

Classification of relationships

Once the data attributes for a document pair have been discretized, a classification (\(c_i\)) can be assigned. Following the standard method, the posterior probability for each classification is evaluated for a document pair under consideration, as shown in Formula 1.

$$ P(c_i \mid v_x, s_x, f_x, t_x, o_x, v_y, s_y, f_y, t_y, o_y, p_{xy}) $$
(1)

Where \(c_i\) is the \(i\)th classification from the set of \(|c|\) classifications described above.

The classification whose posterior probability is highest (the maximum a posteriori) is chosen as the correct classification for the document pair. In order to evaluate the above posterior probability, Bayes' theorem is used, as shown in Formula 2.

$$ \frac{P(v_x, s_x, f_x, t_x, o_x, v_y, s_y, f_y, t_y, o_y, p_{xy} \mid c_i)\cdot P(c_i)}{P(v_x, s_x, f_x, t_x, o_x, v_y, s_y, f_y, t_y, o_y, p_{xy})} $$
(2)

As \(P(v_x, s_x, f_x, t_x, o_x, v_y, s_y, f_y, t_y, o_y, p_{xy})\) is constant, and the only purpose is to find the highest probability from the set of classifications, this term can be eliminated from the equation, as shown in Formula 3.

$$ P(v_x, s_x, f_x, t_x, o_x, v_y, s_y, f_y, t_y, o_y, p_{xy} \mid c_i)\cdot P(c_i) $$
(3)

As naïve Bayesian classification assumes that all data attributes are conditionally independent given the classification, the equation can be rewritten as shown in Formula 4.

$$ P(c_i )\cdot \prod\limits_{j = 1}^z {P(a_j \left| {c_i } \right)} $$
(4)

Where \(a_1, \ldots, a_z\) refer to the \(z\) items in the dataset \((v_x, \ldots, p_{xy})\).

Using data obtained from the training set to evaluate the probabilities, the equation can be written in the form shown in Formula 5.

$$ \frac{{n_{ci} }}{{n_t }}\cdot \prod\limits_{j = 1}^z {\frac{{n_{aj} }}{{n_{ci} }}} $$
(5)

Where \( n_t \) is the total number of examples in the training set, \( n_{ci} \) is the number of examples that satisfy \( C = c_i \), and \( n_{aj} \) is the number of examples that satisfy \( (C = c_i ) \wedge (D = d_j )\) (i.e., examples of classification \(c_i\) in which the attribute under consideration takes the same discrete value \(d_j\) as in the pair being evaluated). In some cases, in the calculation of \( \frac{n_{aj}}{n_{ci}} \), there may be insufficient data contained within the training set to make an adequate evaluation; for example, where there are no training examples matching a given classification and attribute value. The m-estimate method, which introduces an equivalent sample size parameter to ensure that a value of 0 is never returned, is used to avoid this problem. Finally, the equation is altered to accommodate this method as shown in Formula 6.

$$ \frac{{n_{ci} }}{{n_t }}\cdot \prod\limits_{j = 1}^z {\frac{{n_{aj} + mp}}{{n_{ci} + m}}} $$
(6)

Where p is an estimate of the probability for the current classification and m is a constant referred to as the equivalent sample size. In our work, p is assigned a value of 1/4 for each classification, and m is assigned a value of 5.
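A minimal sketch of the classification step defined by Formulas 5 and 6 might look as follows, assuming the training set is available as a list of (attribute tuple, classification) pairs of already-discretized values. The function names are our own, and the parameter values p = 1/4 and m = 5 follow the text above.

  from collections import Counter, defaultdict

  def train_counts(training_set):
      """Count class frequencies and per-class attribute-value frequencies."""
      class_counts = Counter()
      attr_counts = defaultdict(Counter)  # (class, attribute index) -> Counter of values
      for attributes, c in training_set:
          class_counts[c] += 1
          for j, value in enumerate(attributes):
              attr_counts[(c, j)][value] += 1
      return class_counts, attr_counts

  def classify(attributes, class_counts, attr_counts, p=0.25, m=5):
      """Return (most probable classification, score) using the m-estimate (Formula 6)."""
      n_t = sum(class_counts.values())
      best_class, best_score = None, -1.0
      for c, n_ci in class_counts.items():
          score = n_ci / n_t                        # prior P(c_i)
          for j, value in enumerate(attributes):
              n_aj = attr_counts[(c, j)][value]     # examples with C = c_i and a_j = value
              score *= (n_aj + m * p) / (n_ci + m)  # m-estimate of P(a_j | c_i)
          if score > best_score:
              best_class, best_score = c, score
      return best_class, best_score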

The posterior probabilities for each classification returned using this equation are then compared and the most probable relationship is assigned to the document pair. The process is then repeated for all document pairs within the group of documents under consideration, as illustrated in the matrix shown in Fig. 2. This completes the process for the document group under consideration. At this point, the useful relationships assigned by the algorithm can optionally be reviewed by the computer user before being indexed/stored. Any alterations to the relationships reviewed by the user can then be appended to the existing training set to improve the accuracy of future predictions.

Fig. 2

Matrix of document pair classifications
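Building on the previous sketch, the pairwise evaluation illustrated in Fig. 2 might be expressed as follows; make_attributes is a hypothetical helper standing in for the discretization and temporal proximity steps described earlier, and any directionality of a relationship is assumed to be encoded in the classification labels themselves.

  from itertools import combinations

  def classify_group(documents, class_counts, attr_counts, make_attributes):
      """Assign a relationship classification to every document pair in a task group."""
      matrix = {}
      for x, y in combinations(documents, 2):
          attributes = make_attributes(x, y)  # discretized (v_x, ..., o_y, p_xy) tuple
          matrix[(x.url, y.url)] = classify(attributes, class_counts, attr_counts)
      return matrix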

Indexing document relationships in information retrieval systems

Where useful relationships (i.e., the document pairs assigned \(c_1\), \(c_2\) or \(c_3\)) have been identified, the next objective is to index them in an information retrieval system so that they can be used by future users in their searches for contextually related information. In our research, the following relationship data is stored for future use: document title, document URL, dependency type, and the dependency strength (obtained from the normalized posterior probability of the assigned classification). The ways in which this data might be usefully employed by information retrieval systems are discussed in Section 7.2.
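Purely as an illustration, a stored relationship record of the kind described above might take the following form; the field names are assumptions, and a real deployment would depend on the indexing scheme of the host information retrieval system.

  from dataclasses import dataclass

  @dataclass
  class DocumentRelationship:
      """Hypothetical stored form of a useful relationship between two documents."""
      source_title: str
      source_url: str
      target_title: str
      target_url: str
      dependency_type: str  # e.g., common utility or reference dependency
      strength: float       # normalized posterior probability of the assigned classification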

Evaluation

Although an evaluation of the usefulness and usability of the relationships assigned is the ultimate goal of this research, an objective assessment is difficult without a large-scale deployment of the approach, in which the relationships within a given document set are allowed to accumulate to a level where knowledge workers can make use of relationships implicitly assigned by others. The identification of a suitable domain where the approach can be fully tested, and the implementation of a pilot deployment, is currently being undertaken.

In this evaluation we validate an intermediate but equally important aspect of the approach—how accurately the algorithm can assign relationships using the implicit evidence with which it is provided. In order to perform this evaluation, a software prototype system has been developed, enabling the capture of document interaction and task switching data. It also provides a means for the automated processing of data, so that the relationships can be assigned using the algorithm described above.

The prototype was also designed to collect additional data provided by the computer user, relating to document interest and motivations. The data collected directly from the user was manually processed in the first instance to assess document relationships from the user's perspective. These manually assigned relationships serve two functions in the evaluation, namely to:

  1. Populate the training set used by the algorithm with pre-classified document interaction data

  2. Verify the accuracy of relationships established using the algorithm

A number of controlled experiments were carried out using the software. Half of the data collected was used to populate the training set with pre-classified examples. The remaining document interaction data was processed by the algorithm (using the established training set), and the relationships assigned by the algorithm were then compared with the corresponding user-provided data to evaluate the accuracy of the classifications.

Overall, the aims of the evaluation were to answer the following research questions:

  1. Could the algorithm differentiate between useful and irrelevant documents related to a user's search tasks?

  2. Did the document relationships assigned by the algorithm correspond to the relationships evident from the data provided by the user?

  3. Did the strength of the document relationships assigned by the algorithm correspond to the combined utility of the documents specified in the relationships, evident from the data provided by the user?

In Section 5.1, details relating to the design and use of the software prototype are presented; in 5.2, details on the participants and the general procedure followed in collecting data are presented; and in 5.3, the method used to manually assess document relationships from the user-provided data is discussed.

Software prototype

As stated previously, the software prototype enables the capture of data for both the implicit evaluation and user-based evaluation of document relationships. In both circumstances, the data is captured or collected from the user when pre-determined events are triggered by the operating system or the software applications being used. Figure 3 illustrates the events and data involved.

Fig. 3

Illustration of prototype data capture in response to operating system or application triggered events

Data capture for implicit evaluation

The data captured for the implicit evaluation is governed by three components of the prototype:

  • The first component captures data relating to the applications open or in focus.

  • The second captures data related to page navigation events in a proprietary web browser.

  • The third captures data on the access, creation or manipulation of local files.

The data attributes captured automatically for each component include a time stamp reference and a number of event descriptors. A different set of data attributes is captured depending on the nature of the event. For example, when a new application is initialised, the application title and name are captured. Similarly, the web page title and URL are captured in response to page navigation events occurring in the web browser, and the name and location of local files are captured in response to local file changes (the user is able to define the root folders to be monitored by this component). The data is written to log files which can then be read by the naïve Bayesian classifier algorithm to perform dependency classification.
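For illustration, a single event record in such a log might resemble the sketch below. The prototype's actual log layout is not specified in this paper, so the structure shown is an assumption.

  import json
  import time

  def log_event(log_file, event_type, **descriptors):
      """Append a time-stamped event record to the session log (hypothetical format)."""
      record = {"timestamp": time.time(), "event": event_type, **descriptors}
      log_file.write(json.dumps(record) + "\n")

  # Example: a page navigation event in the monitored browser.
  # log_event(f, "page_navigation", title="Bearing selection guide",
  #           url="http://example.com/bearings")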

Data collection for user-based evaluation

A significant amount of consideration was given to the design of a suitable method for collecting user-provided data for the manual evaluation. Unlike the implicit data capture, the prompting and capture of user data needed to be restricted in order to minimise the impact of the resulting intervention on the computer user's activity—but also provide enough information so that useful documents and their relationships could be identified in the manual assessment. In order to minimise interruption to the user's task, the required information was collected using a number of passive onscreen data entry dialogs. The onscreen dialogs were designed so that after a specified period of time, if no response was made by the user, it was assumed that no data entry was necessary and the dialog disappeared until the next event occurred. The dialogs were also designed to appear on a small portion of the screen outside of the main viewing area to minimise disruption.

Again, three components of the prototype govern the collection of user data:

An application usage dialog collects information on the purpose of the activity being carried out within the current application. When an application is initialised the user is required to provide a definition of the task purpose, which can be selected from a number of categories. Once the initial definition of the application purpose has been made, no further user data is required unless the user decides to update this information, as the periods when the application is active or inactive can be monitored automatically. A browser navigation dialog collects information on searches carried out over the Internet. When a browser is initialised the user is typically required to define their information need by providing details on the general purpose of their search, and to select categories relating to the vendor or type of information. In subsequent navigation the user can optionally rate the pages of interest and signify the page(s) that satisfy the original information need. Finally, a file usage dialog collects information on motivations for the access, creation and editing of local documents. Examples of the on-screen dialogs can be seen in Figs. 4(a) and 5(a), where user-provided data is collected in response to application usage and browser navigation events respectively.

Fig. 4

Screenshots of (a) example dialog for collecting user data on software application use and, (b) prototype component for manual evaluation of document relationships

Fig. 5

Screen shots of (a) example dialog for collecting user data on document utility, (b) prototype component for training or evaluating (using algorithm) document relationships

Experiment participants and tasks

It was considered important that the approach be evaluated in settings representative of the users' everyday tasks. For this reason the activities captured for analysis were undertaken as part of the normal working routines of the participants involved in the study. Initially, each participant agreed to have the prototype software installed on their machine over a period of one to four weeks. Following the installation of the software, they were provided with some preliminary training on how to start and stop data capture, and also how to complete the onscreen forms requesting user data. They were then instructed to initiate the data capture functionality in any future "working sessions" where activities were likely to consist of predominantly information-intensive computer-based work. (There was no formal restriction on the definition of a working session, although one typically ranged from half a day to one day's work and consisted of a variety of sub-tasks.) The participants were then requested to return the text-based log files for analysis when sufficient data, over a number of sessions, had been collected.

A total of five participants (labelled A to E) were involved in the study, from different backgrounds in industry and academia. Participant A's activities involved a range of engineering tasks including information search, document editing and computer-aided design. Participants B and C, from the knowledge management department at the aerospace firm Airbus UK, undertook activities that primarily involved corporate research into information handling and related issues. Finally, participants D and E were involved in academic research, consisting primarily of information search activities.

Manual assessment of document relationships from user provided data

Having collected data from a series of routine activities in sessions carried out by the participants, the data was separated into two sets, one was used for training the algorithm, and the other was used for evaluating classification accuracy. Each participant was assigned a separate training set which was based on half of the total data collected from their activities. In order to achieve a more closely representative training set, a proportion of data was collected from each working session, rather than assigning the first half of sessions to training and the second half to evaluation.
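The per-session split described above might be sketched as follows; this is an illustrative reconstruction, with sessions assumed to be a list of per-session lists of document-pair records for one participant.

  import random

  def split_per_session(sessions, train_fraction=0.5, seed=0):
      """Draw a proportion of records from each session for training, the rest for evaluation."""
      rng = random.Random(seed)
      training, evaluation = [], []
      for records in sessions:
          shuffled = records[:]
          rng.shuffle(shuffled)
          cut = int(len(shuffled) * train_fraction)
          training.extend(shuffled[:cut])
          evaluation.extend(shuffled[cut:])
      return training, evaluation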

The training set was compiled by entering relationship data into the existing data records using the software prototype component shown in Fig. 5(b). This was carried out at a later date when all user sessions had been completed. Each session, typically consisting of between 10 and 40 documents, was evaluated and analysed in turn for possible relationships. The relationships assigned were based on a combination of evidence from the user-provided data and additional information gleaned from the interaction chart shown in Fig. 4(b). Specifically, from this data it was possible to distinguish between individual tasks and also the corresponding documents of interest or utility. Dependencies were manually assigned to document pairs where the interest or utility of both documents was high and where both documents related to the same task. A parameter relating to the relationship type (as defined earlier) and one relating to the relationship strength (based on the combined utility of both documents) were included as part of the dependency specification.

It might have been more accurate if the user had performed this compilation at the end of each session; this alternative strategy may be employed in further research to address this weaker aspect of the evaluation. However, some measures were taken to ensure that the context in which documents had been used was recorded accurately for later analysis. In addition to the on-screen prompts for task specification and document relevance or interest, the user was asked to record any additional notes about their activity using pen and paper. These notes typically included motivations or reasoning for actions not covered by the onscreen dialogs.

Having compiled the training set, the same manual mark-up process was repeated on the remainder of data, which was also then processed using the algorithm. Finally, the manually assigned relationships were compared with those assigned by the algorithm to complete the evaluation. The results of the comparison, for each participant involved in the experiments, are discussed in the next section.

Results and analysis

In the following sections the research questions outlined previously are evaluated.

Question 1: Differentiation between useful and redundant documents

The objective of this analysis was to establish how well the algorithm could differentiate between useful and redundant documents in the various tasks carried out by the participants. In Table 4 the number of documents assigned by the algorithm as having an associated useful relationship is shown in comparison with the number of actual (from manual assessment) useful documents. The percentages of the totals are also presented in pie chart form in Fig. 6.

Perhaps the most striking observation is the high percentage of non-useful documents, a feature of all the participants' tasks. On inspection of the associated records, these non-useful documents often relate to the web pages that provide the links to the relevant pages, for example, search results pages, contents pages or intranet homepages.

Overall, the results indicate that although the identification of useful documents is by no means perfect, the algorithm performs reasonably well, particularly when considered in proportion to the total number of documents accessed. For example, of the 127 documents used by participant A, 32% were of actual interest or utility and the algorithm correctly assigned 21% of these. It should be noted that there is also considerable scope for improvement, both through the compilation of larger training sets, and possibly through the introduction of additional implicit indicators to be incorporated in the evaluation. For example, the amount of scrolling or other activity on a document has been shown to be a reasonably effective implicit indicator of document interest (Claypool et al., 2001), but it was not included in our evaluation.

When these results are compared with those of other studies (e.g., Kelly and Belkin, 2001), where viewing time alone is used in the assessment of interest, they fare relatively well. This may be a result of the combinatory approach to implicit evidence taken in this research. However, a consequence of taking this approach is that it has been difficult to evaluate which aspects of evidence were most effective, as they were all simply inputs to the classification process and were not considered independently. An interesting area for further research may be to assign different configurations of weightings to each aspect of evidence so that a better understanding of their individual impact can be gained.

Table 4 Summary of correct and incorrect classifications assessed using the algorithm
Fig. 6

Pie charts illustrating the total percentages in the manual and algorithm assessment

Question 2: Identification of useful relationships

Table 5 shows how successfully the algorithm identified relationships between useful documents. In this analysis, the results are likely to be indicative of (i) the accuracy of the temporal proximity attribute for predicting the documents involved in a given relationship based on the relative viewing times of documents, and (ii) the accuracy of the use-case attributes for predicting the relationship type. Where a relationship is shown as being correctly assigned in the table, both the type of the relationship and the documents involved are in line with the relationship assigned in the manual processing of documents. It was found during the manual mark-up that the relationship type (either common utility dependency or reference dependency) was nearly always governed by specific configurations of use-case attributes for the documents involved, and so the algorithm was effective in distinguishing between them.

Table 5 Summary of correct and incorrect dependencies assessed using the algorithm

The overall assessment in Fig. 7 shows that 42% of dependencies were correct from the total number assigned plus those incorrectly rejected. Alternatively, the number of correctly assigned dependencies can be viewed as 67% of all relationships assigned in the manual evaluation, which seems a reasonable level of performance. However, the number of incorrectly classified dependencies and the variability in accuracy for different participants are areas for concern. One possibility for reducing the level of incorrect classifications may be to set a threshold based on relationship strength to filter out lower-probability relationships. The variability in accuracy for different participants is thought to be related to the differing nature of the participants' tasks. This aspect is discussed further in Section 6.4.

Fig. 7

Presentation of correct and incorrect dependency classifications for individual participants and overall

Question 3: Accuracy of relationship strength weighting

In a final evaluation of the data, a non-parametric test of the correlation of dependency strength weighting between the automatically and manually processed datasets was conducted. The test evaluated how closely the dependency strength weighting returned by the algorithm correlated with the dependency strength weighting assigned in the manual evaluation, for correctly assigned dependencies.

The data generated as a result of this analysis was in an appropriate form for hypothesis testing to investigate the validity of the results given the sample sizes used. In this case, the null hypothesis, \(H_0\), is that there is no correlation between the datasets:

$$H_0: \rho = 0$$

where ρ is the correlation coefficient.

The alternative hypothesis put forward is that a positive correlation exists, due to the combined effectiveness of the data attributes used by the algorithm in enabling strength weighting predictions in line with the user evaluation:

$$H_1: \rho > 0$$

A significance level α of 0.05, corresponding to the probability of wrongly rejecting \(H_0\), was used in the experiment. Table 6 shows the sample size n for each participant, the test statistic obtained using Spearman's rank order correlation, \(\rho_{\mathrm{calc}}\), and the critical value, \(\rho_{\mathrm{crit}}\), calculated from the significance level and the degrees of freedom (n − 2). Where \(\rho_{\mathrm{calc}} > \rho_{\mathrm{crit}}\), \(H_0\) can be rejected in favour of \(H_1\).

Table 6 Spearman's rank order correlation of strength weighting for actual vs. algorithm assigned relationships
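For illustration, the sketch below runs Spearman's rank order correlation using SciPy on invented strength weightings. It compares a one-sided p-value against α rather than using the critical-value approach reported in Table 6, but it tests the same hypotheses.

```python
# Illustrative sketch of the Spearman rank correlation test, using invented
# strength weightings (not the study's data). SciPy's spearmanr returns a
# two-sided p-value; for the one-sided alternative H1: rho > 0 it is halved
# here when the observed correlation is positive.

from scipy.stats import spearmanr

manual    = [0.9, 0.7, 0.4, 0.8, 0.3, 0.6]  # manually assigned strengths
algorithm = [0.8, 0.6, 0.5, 0.9, 0.2, 0.5]  # algorithm-assigned strengths

rho, p_two_sided = spearmanr(manual, algorithm)
p_one_sided = p_two_sided / 2 if rho > 0 else 1 - p_two_sided / 2

alpha = 0.05
print(f"rho = {rho:.3f}, one-sided p = {p_one_sided:.3f}")
print("reject H0" if p_one_sided < alpha else "fail to reject H0")
```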

The data for participants A, C and E suggest that there are grounds for rejecting the null hypothesis, although greater sample sizes are clearly required for a more confirmatory study. In the case of participant A, where the evidence for a positive correlation of strength weighting is strongest, the results indicate that this attribute could be usefully employed to filter out less useful or uncertain relationships. The results for participant B show very little evidence to suggest that a positive correlation exists in this case. This finding is thought to be related to the specific nature of B's tasks, which are discussed further in the next section.

Summary of main findings

One prominent feature of the results was that the classifier correctly identified document interest and document dependencies more successfully for participant A than for the other participants. This is particularly interesting as A's tasks were generally of a slightly different nature to the others' tasks. Specifically, A was involved with activities which in many cases align closely with the engineering tasks outlined in Section 2. A large proportion of time was spent either modifying design data in associated documents or making references to information from external or corporate sources. In many cases search and retrieval was characterised by a specific information or data requirement, in order that a given task could proceed to the next stage. This is in contrast to the tasks of some of the other participants, where there were often no distinguishable stages to a given task.

The results for participant B were the poorest overall, having the largest percentage of incorrectly classified documents and dependencies, and the lowest correlation of relationship strength weighting. On inspection, the data files for B's activities showed a greater number of shorter, disconnected activities, typically where general administration tasks were carried out, such as web-based email checks or booking office equipment using the corporate intranet. In these activities, particularly where there was a larger amount of task switching, document interest classifications were often incorrect and the associated dependencies meaningless or of little future value.

However, there were also a number of instances where valuable dependencies were identified in participant B's tasks. For example, a proportion of B's time was spent carrying out web design and programming. In some cases, internet searches were carried out for specific programming functions relating to the web applications being created. The algorithm correctly identified the useful information sources that were referred to as inputs for these tasks.

Overall, it may be concluded from these observations that the classification algorithm responds more effectively to longer tasks involving a variety of documents which are more likely to be driven by specific objectives and information needs.

Discussion

In this section, taking into consideration the results from the evaluation, the original issues relating to context-sensitive retrieval are revisited and some of the likely applications and limitations of the approach are discussed.

Implications for responding to task context in design information retrieval

At the start of the paper some coverage was given of how the authors view the issue of context, particularly in relation to the problems associated with responding to context in information retrieval. In the review of related research and existing systems, the authors highlighted some general aspects of a knowledge worker's context which may be modelled or represented to support context-sensitive information retrieval.

Systems responding to a knowledge worker's preferences for information have been the subject of extensive research in both collaborative and content-based recommender systems. A feature of these systems is that they typically acquire information about a user's long-term information needs (O'Riordan et al., 2002), which are static or slow to change.

Systems responding to focus or intent in a knowledge worker's current tasks generally have a more difficult job, as information needs continually change as the task focus changes. In this paper, the authors have argued that current task focus can be more easily interpreted and responded to with a record of how the information currently in use has been used in previous circumstances. This allows systems to be supported with a network of potentially associated documents for retrieval tasks related to a given activity. Accordingly, the approach presented illustrates one way in which such records may be created.

Favourable features of the approach include the fact that the evidence used in the evaluation is largely implicit and so less intrusive on the user's activities. In addition, the dependencies specified as a result of the evaluation are in a machine-readable format, so their utilisation in information retrieval systems is readily achievable. However, the results show that this particular approach may only be beneficial in certain application contexts, which depend on aspects of a computer user's interaction style and their working domain.

The approach has been evaluated with applications in design engineering information management in mind. In this domain, information inputs and outputs for a given activity (which can be captured using the approach) are particularly important for tracing design rationale and capturing detailed records of the design process. In particular, more complete records of task inputs and outputs provide an additional step towards managing information about the argumentation leading to decisions. Some researchers (e.g., Ullman, 2002) believe this to be an important factor for the future of design information management.

Possible additional benefits of improved responsiveness to task- or context-centred retrieval may include: (i) a reduction in the amount of duplicated work in cases where information related to the same or similar tasks would otherwise not be found, and (ii) reduced retrieval times where less searching is required to satisfy an information need. The latter is particularly important as the time spent searching for information is a significant barrier to productivity in engineering practice—a point which has been noted in various academic studies (e.g., Lowe, 2002; Hales, 1991).

Overview of expected applications

Essentially, the approach investigated enables a task-centred similarity matrix to be constructed from the documents used in a given document set, based on the context of document usage. More specifically, the similarity matrix defines the contextual relationships between documents based on their strength and classification, calculated using the algorithm.

In an example of a very simple application, the similarity matrix, stored within an information retrieval system, can allow a searcher to identify documents that are likely to have been used in the same or similar situations. For example, given a document of interest in a user's current search, a query can be executed to retrieve links to other documents defined in relationships with the existing document. Where a number of documents are returned as a result of the query, they can be presented as a ranked list, in order of relationship strength; a sketch of such a lookup is given below. An important aspect to emphasise is that the aim of the approach is to identify documents that share a common utility for a given purpose. It is this context-centred aspect of the approach which differentiates it from other user-centred approaches in conventional collaborative filtering systems.
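A minimal sketch of such a lookup is shown below, assuming the similarity matrix is held as an in-memory dictionary of weighted edges; the storage format and document names are illustrative only. An information retrieval system would typically hold these relationships in its index or a database.

```python
# Hypothetical sketch of a ranked lookup against the similarity matrix.

from collections import defaultdict

# similarity_matrix[doc] -> list of (related_doc, relationship_type, strength)
similarity_matrix = defaultdict(list)

def add_relationship(doc_a, doc_b, rel_type, strength):
    """Relationships are undirected, so the edge is stored in both directions."""
    similarity_matrix[doc_a].append((doc_b, rel_type, strength))
    similarity_matrix[doc_b].append((doc_a, rel_type, strength))

def related_documents(doc):
    """Return documents related to `doc`, ranked by relationship strength."""
    return sorted(similarity_matrix[doc], key=lambda r: r[2], reverse=True)

add_relationship("pump_spec.doc", "stress_calcs.xls", "common_utility", 0.8)
add_relationship("pump_spec.doc", "materials_handbook.pdf", "reference", 0.5)

print(related_documents("pump_spec.doc"))
# [('stress_calcs.xls', 'common_utility', 0.8),
#  ('materials_handbook.pdf', 'reference', 0.5)]
```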

In the simple example above, the results returned are based on the instances of dependencies specified in the matrix. There also exist a number of possibilities for the application of document clustering methods, some of which are used in existing information retrieval systems (e.g., graph-theoretic methods). By clustering documents within the similarity matrix, it may be possible to exploit indirect dependencies, where document clusters define associated documents rather than just the individual relationships; a sketch of one simple graph-theoretic option is given below. The benefits of applying appropriate document clustering methods to the similarity matrix are an area for investigation in future research.
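As a simple illustration of a graph-theoretic option, the sketch below groups documents into connected components of the similarity graph. The edges are invented, and a practical system might well prefer thresholded edges or more sophisticated community-detection methods.

```python
# Hypothetical sketch: treat the similarity matrix as an undirected graph and
# group documents into connected components (indirect dependencies).

from collections import defaultdict, deque

edges = [
    ("pump_spec.doc", "stress_calcs.xls", 0.8),
    ("pump_spec.doc", "materials_handbook.pdf", 0.5),
    ("meeting_notes.doc", "project_plan.mpp", 0.6),
]

graph = defaultdict(set)
for doc_a, doc_b, strength in edges:
    graph[doc_a].add(doc_b)
    graph[doc_b].add(doc_a)

def clusters(graph):
    """Return document clusters as connected components (breadth-first search)."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        while queue:
            doc = queue.popleft()
            if doc in component:
                continue
            component.add(doc)
            queue.extend(graph[doc] - component)
        seen |= component
        result.append(component)
    return result

print(clusters(graph))
# [{'pump_spec.doc', 'stress_calcs.xls', 'materials_handbook.pdf'},
#  {'meeting_notes.doc', 'project_plan.mpp'}]
```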

Long term issues relating to persistent references to information

A possibly problematic area associated with the approach discussed in this article is that of maintaining persistent references to the documents specified in the relationships—a more general problem also for document use on the Internet, where URLs referenced by hyperlinks and bookmarks become obsolete. The “referential integrity” problem on the Internet has been an active subject of investigation in recent years (Spinellis, 2003; Pitkow, 1999; Ashman, 2000). A number of URN (Uniform Resource Name) protocols are available for digital information, intended to serve as persistent, location-independent identifiers. However, these typically require cooperation from the content creator and, as they are not universally adopted, it cannot be guaranteed that a given document on the web will be provided with a persistent URN.

Spinellis (2005) provides a promising alternative approach for creating persistent URNs without the active co-operation of content creators. The method involves having a search engine calculate, for a given URL, an augmented, persistent version containing the URL and a combination of words that uniquely identify the document. If the default access method (via the URL) fails, retrieval of the document is attempted using a search engine and the word combination. A URN created in the same or a similar manner may provide an effective method for storing document references in the dependencies assigned using the algorithm; a sketch of the fallback retrieval logic is given below.
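A minimal sketch of this fallback logic is shown below. The augmented reference format and the search_engine_lookup function are hypothetical stand-ins rather than Spinellis's implementation; a real system would plug in whichever search API is available.

```python
# Illustrative sketch of the fallback retrieval idea, assuming an augmented
# reference that pairs a URL with a distinguishing word combination.

import requests

def search_engine_lookup(word_combination):
    """Hypothetical: query a search engine and return the best-matching URL."""
    raise NotImplementedError("plug in an actual search API here")

def resolve(reference):
    """Try the stored URL first; fall back to the unique word combination."""
    url, words = reference["url"], reference["words"]
    try:
        response = requests.get(url, timeout=10)
        if response.ok:
            return url
    except requests.RequestException:
        pass
    return search_engine_lookup(words)

reference = {
    "url": "http://example.com/design/pump_spec.doc",      # illustrative only
    "words": "centrifugal impeller clearance tolerance",   # illustrative only
}
# resolve(reference) would try the URL and fall back to the word search.
```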

Overall conclusions

There are a wide variety of recommender systems for evaluating the utility or interest of documents in respect of users' preferences for information (Goldberg et al., 1992; Resnik et al., 1994; Wang et al., 2006). However, in the engineering domain (and possibly other technical domains) there is a specific requirement for being able to evaluate and respond to users' information requirements in the context of their current activities.

The essence of the research is to establish whether it is possible to automatically create a context-driven document association system. For the approach presented here, four temporal metrics and two document use-case metrics have been evaluated for their usefulness in achieving this goal. A software test rig was created, enabling the assessment of associated documents using the algorithm in addition to a manual, user-based assessment. The experiments carried out using the test rig enabled the evaluation of the metrics for identifying documents of interest or utility within tasks, and their corresponding contextual relationships.

The results show that, in many cases, the manual assessment of the relationships assigned between documents is in line with the algorithm's assessment. The evaluation has also indicated some general limitations regarding appropriate application contexts, which depend on specific working domains and interaction styles.

It should also be noted, however, that the approach is mainly based on aspects of user interaction which are captured implicitly and so do not add any additional burden to the user's activities. As such, the established relationships can be considered a low-cost resource with the potential to improve the ability of information retrieval systems to respond to queries relating to the context in which documents are used—an aspect of search functionality not supported in conventional content-based retrieval methods.