1 Introduction

Much of the work in document retrieval has focused on the goal of developing systems that retrieve relevant documents. In contrast, the goal of a structured document retrieval (SDR) system is to retrieve relevant parts of documents. We refer to document parts as sub-document results. SDR is particularly advantageous when dealing with long documents and those covering a wide variety of topics.

SDR systems exploit the structure of a document in two ways. First, referred to as a structural hint [40], sub-documents are ranked based on whether their encoding helps users locate relevant information. Second, referred to as a structural constraint [42], a user may direct the system to search for sub-documents with a desired encoding, using a query language such as NEXI [43] or XQueryFT [8].

Several types of sub-document results exist in SDR, each of them “modelling” how users locate relevant information. We illustrate these through examples taken from our earlier work [5]. Let a collection contain the extract of the book (formatted in XML) shown in Fig. 1a. The document structure of the book is, in this case, a tree, which is shown in Fig. 1b; the tags have been abbreviated as follows: book ( bk ), front matter ( fm ), body ( bd ), description ( d ), name ( n ), meta ( m ), and chapter ( c ). The line numbers of elements shown in Fig. 1a correspond to the node ID of each corresponding node in Fig. 1b. Consider the query “ship captain in Moby Dick”. The query matches terms in different structural parts of the book extract; specifically, node 4 (match on “Moby Dick”), node 15 (on “ship”) and node 16 (on “captain”) in Fig. 1b. For a document retrieval task, a system returns root nodes to model the user accessing the whole book. For a focused retrieval task [25], as illustrated in Fig. 2a, an SDR system may return nodes (encoded as elements or text passages) at separate ranks, which provides the user with focused information but at the cost of having to examine results from the same book at multiple rank positions. For a tree retrieval task [5], as illustrated in Fig. 2b, an SDR system returns subtrees at separate ranks (the first rank corresponds to a subtree taken from the book extract), which provides the user with single results that can direct the user to one or more relevant parts of a book.

Fig. 1 a Extract from a book in XML markup, b Tree structure of book with relevant nodes highlighted

Fig. 2 Different SDR approaches; a element/passage, b tree

The evaluation of the effectiveness of a classical information retrieval (IR) system (such as a document retrieval system) is derived from the number of hits (relevant documents retrieved) and misses (relevant documents not retrieved) in the system output. In contrast, the output of an SDR system consists of hits (relevant sub-documents retrieved), misses (relevant sub-documents not retrieved), and near-misses. Near-misses are retrieved sub-documents that may not contain relevant information, but from which relevant information can be accessed via navigation, e.g. by browsing, scrolling in the user interface, or following links. Therefore, SDR evaluation must take into account not only the relevance of sub-documents, but also the fact that users may navigate within documents to locate relevant information. The latter is usually not considered in classical (document) IR evaluation.

We refer to user navigation as the effort a user spends to locate relevant information from search results. We illustrate how user navigation can cause redundancy. Consider the ranked list in Fig. 2a from the book extract in Fig. 1a (nodes 4 and 15 at ranks 1 and 2, respectively). The system first returns node 4. Upon seeing node 4, the user might navigate to other nodes in the document. If the user saw node 15 by navigating to it from node 4 then he or she would experience what we refer to as redundancy when accessing node 15 directly at rank 2. SDR evaluation must account for how navigation can cause users to see relevant information more than once (redundantly).

Much of the existing work in SDR evaluation has been done in the context of the Initiative for the Evaluation of XML retrieval (INEX) Footnote 1, a collaborative and international effort dedicated to the development of effective XML or focused retrieval systems. Since 2002, INEX has investigated a wide range of SDR search tasks. This has resulted in task-specific evaluation approaches for element retrieval [24, 28, 34, 35], passage retrieval [31] and tree retrieval [5].

It is widely known that the evaluation of the range of SDR tasks has challenged INEX since its beginnings [41]. For instance, in our earlier work [5], we showed how most of the current approaches cannot evaluate tree retrieval because they are not able to represent how users satisfy their information need with tree-structured results. Analogous limitations have also been observed when customizing measures to evaluate specific search tasks [32]. This situation has resulted in SDR evaluation measures (and performance results) that cannot be compared with each other or across search tasks. There are three main reasons for this: (1) current SDR measures consider and calculate relevance, user navigation and redundancy in different ways, (2) they rely on task-specific assumptions about how the user information need is satisfied, and (3) they depend on the relevance assessment methodology.

The main contribution of this work is to address the above limitations by proposing a single framework, called the Extended Structural Relevance (ESR) framework, that allows evaluation across SDR search tasks. ESR is related to our earlier work [5], where we show that tree retrieval is sufficient to capture all existing SDR approaches based on hits in the output. ESR extends our earlier work by considering not only hits, but also, near-misses and misses. More significantly, ESR revisits the relationship between relevance, user navigation and redundancy posited in our earlier work [5] to allow the development of measures that share the same set of parameters when evaluating SDR system performance. A substantial benefit is that it then becomes possible to compare the performance differences between SDR systems, where various models of user navigation and relevance are involved. The flexibility to support a wide variety of measures in a single framework is an important advancement in SDR for investigating future search tasks, where navigation has to be accounted for.

The outline of this paper is as follows. Section 2 reviews current approaches to SDR evaluation. Section 3 reviews tree retrieval and provides the notation for this work. Section 4 presents our ESR parameters for relevance, user navigation and redundancy. Section 5 presents the main contribution of this work, the Extended Structural Relevance (ESR) framework. Section 6 presents how to represent existing SDR evaluation measures in ESR. Section 7 presents experimental results comparing our ESR proposals to existing SDR measures. Finally, Sect. 8 concludes with remarks and future work.

2 Related work

The first SDR systems investigated in INEX were element retrieval systems. Their aim was to return relevant XML elements from a collection of XML documents as answers to a given query. The first measures used at INEX to evaluate element retrieval effectiveness consisted of adaptations of classical IR measures, where the notion of a document was replaced by the notion of an element. These early SDR measures considered the relevance of elements (as simple hits and misses), but ignored user navigation and redundancy. Later approaches to SDR evaluation proposed measures that capture user navigation and redundancy, as well as applying to other SDR tasks. We introduce some of the approaches next, while the actual measures are presented in Sect. 6.

Extended cumulated gain (XCG) [24] is a family of cumulated gain (CG) [20] measures for evaluating element retrieval. XCG is motivated by the observation that the effect of redundancy on the relevance of results is akin to wasted user effort because the same information, seen more than once, is not relevant to the user [27]. Effectiveness in XCG is defined by comparing the user gain in relevant information from a system to the gain obtained by spending the same effort in an ideal system. Ideal elements provide the best results for the user to see relevant information with the least effort. An ideal system ranks elements such that a user maximizes their information gain within a minimum number of ranks and experiences a minimum amount of redundancy. Kazai [23] noted two significant problems with respect to ideality. First, assessing ideal elements and an ideal ranking is a two-fold optimization, which is costly. Second, ideality introduces instability [11], stemming from the chosen assessment methodology determining what constitutes an ideal element.

Precision-Recall with User Modelling (PRUM) [35] is an extension of PRecall [37] where navigation to ideal elements is stochastic. PRUM measures precision based on the number of ranks in the output where the user obtains relevant information from ideal elements. Like XCG, PRUM requires knowing the ideal elements but, unlike XCG, it does not require an explicit ranking of ideal elements. The main contribution of PRUM is that it proposes a probabilistic model for user navigation that can be validated through studies of user behaviours. Its main drawback is that, like XCG, it is prone to instability, as it too relies on the adopted methodology for choosing the ideal elements.

A related measure, which solves some of the problems of PRUM by substantially reducing its complexity, and which also allows for graded relevance, is the measure of Expected Precision-Recall with User Modelling (EPRUM) [33]. For a given recall, it defines precision as a comparison between the minimum rank that achieves the given recall in an ideal system versus the minimum possible rank that achieves the given recall in the actual system. This approach, although simpler to calculate than PRUM, does not address instability because of its reliance, like PRUM, on ideality.

Highlighting XML evaluation (HiXEval) proposed in Pehcevski and Thom [31], and further finalized in Kamps et al. [22], was developed to evaluate the performance of systems that retrieve (or can be modelled as retrieving) passages, where a passage is a block of text, delineated or not with XML tags (when delineated, the passage is an XML element). We refer to this search task as passage retrieval. HiXEval measures are adaptations of classical IR precision and recall. Unlike XCG and PRUM, HiXEval does not rely on ideality. The main limitation of HiXEval is that it assumes that user navigation does not extend beyond the boundaries of retrieved passages and considers redundancy as only occurring between adjacent retrieved text passages that overlap each other. Overlap of text is a special case of redundancy, which limits HiXEval in the investigation of the overall effect of redundancy when measuring system performance.

Structural Relevance (SR), proposed in our earlier work [5], evaluates tree retrieval. SR is a measure of the user expected gain in relevant information given that users may experience redundancy. SR does not rely on ideality. The main contribution of SR is that it proposes an integrated probabilistic model for expressing relevance, user navigation and redundancy. The key drawback of SR is that it is limited to the measurement of precision based on hits in the output.

User effort and redundancy have been investigated outside INEX. Salton et al. [39] investigated effort in passage retrieval in full-text search. Studies in web retrieval demonstrate how performance is improved by ranking results based on predictions of user navigation within web pages (either modelled from user clicks [17] or based on tracking navigation e.g. [1]). Other work includes Keskustalo et al. [29] who propose a relevance feedback mechanism based on simulating how users prefer to spend effort reading documents and providing feedback to the system to refine search results. In SDR, users see relevant information redundantly because of information fragmentation, i.e. documents are fragmented into sub-documents [30]. Redundancy has also been considered in search result diversification e.g. [14], which stems from the problem of information duplication, i.e. the same information appears in more than one document [9]. The aim is to rank documents to minimize the amount of redundant information contained in them [2]. Whereas our work focuses on the issue of redundancy in IR evaluation, research on diversification is concerned with the ranking of documents.

In the next section, we recall how tree retrieval can be used to model a range of search tasks (including element, passage, and document retrieval), and we introduce some notation. In Sect. 4, we describe how our proposed ESR framework extends SR to account for near-misses and misses.

3 Tree retrieval task

In our earlier work [5], tree retrieval is defined as the task of “returning trees that provide the user with access to the document nodes in the collection that are relevant to the user’s information need”. For a given query, the system outputs a ranked list of trees. The user seeks relevant information in a retrieved tree by looking at the content contained in its nodes. While doing so, the user may navigate from the nodes in the tree to other parts of the document. At any point, the user may choose to return to the system output to access the next lower-ranked tree. This process continues until the user either satisfies his or her information need or exhausts the set of trees retrieved by the system.

Tree retrieval is a general task that can model many SDR search tasks. Figure 2a (shown in Sect. 1) illustrates an example of how trees can model document retrieval (by retrieving the root node of documents), element retrieval (by retrieving a node from the document), and passage retrieval (by retrieving either single nodes or trees of sibling nodes connected by their lowest-common ancestor node). Footnote 2

The evaluation of tree retrieval tasks rests on the following three requirements originally posited in our earlier work [5]:

  (i) the relevance of retrieved trees in the output is not independent across trees and depends on whether users tolerate redundancy,

  (ii) the purpose of the system is to retrieve trees that afford a user access to relevant information by directly visiting a node in the tree or through navigating from a visited node into the rest of the document, and

  (iii) the same relevant information may be expressed in trees of varying structure.

To illustrate how these requirements affect the evaluation of SDR systems, consider the trees in the ranked list shown in Fig. 3 as the output from an SDR system for the query “ship captain in Moby Dick” submitted by a user seeking literary references. First, the nodes in the trees in ranks 1 and 2 appear in the tree in rank 3. The user who sees all three trees will see each retrieved node at least twice (i.e., redundantly). This illustrates Requirement (i), in that, the relevance of the tree in rank 3 must account for all of its nodes having been retrieved earlier in the trees at ranks 1 and 2 and thus seen by the user. Second, from the tree in rank 1, the user may navigate to the nodes that appear in the later trees in both ranks 2 and 3. This illustrates Requirement (ii), in that, the relevance of these later trees will be affected by the user navigating from the nodes in the earlier tree. Third, the trees in ranks 2 and 3 would be relevant as literary references because they contain the same relevant chapters. This illustrates Requirement (iii), in that, the evaluation must account for relevant information being retrieved in trees of varying structure.

Fig. 3 Trees retrieved from the same document

The requirements above present significant challenges for using current approaches to evaluate tree retrieval systems (discussed at length in our earlier work [5]). Classical approaches to evaluation assume that results are relevant independently of each other, which invalidates Requirement (i). In the context of SDR evaluation, HiXEval does not consider user navigation beyond retrieved passages so it invalidates Requirement (ii). Measures based on ideality (as proposed for PRUM and XCG) do not meet Requirement (iii), of encoding the same information in different trees. This is because it is not practical to determine all possible and equivalent ideal trees. In contrast, SR meets Requirements (i), (ii) and (iii) by using node-level assessments of relevance and user navigation to infer relevance, and by capturing user navigation and redundancy in tree-structured outputs without relying on ideality. But, SR is limited to measuring precision because it does not consider near-misses.

We now define the notation used in this paper. We denote the output of the tree retrieval task as a ranked list \(R=t_1,t_2, \ldots, t_k\) of k distinct subtrees \(t_i\) from a collection C of trees. We denote the sublist of R up to rank i as \(R_i\). The collection C is a forest of trees where each tree represents a document. A tree \(T = (T_V, T_E)\) is a connected, directed, acyclic graph where \(T_V\) is a set of nodes and \(T_E\) is a set of edges between pairs of nodes from \(T_V\).

Two subtrees from a collection are distinct if one contains a node not found in the other. Subtrees represent the sub-documents retrieved from the collection. A subtree \(t = (t_v, t_e)\) of tree T in collection C satisfies \(t_v \subset T_V\) and \((e_1,e_2) \in t_e\) if there is a path from node \(e_1\) to \(e_2\) in T. Moreover, when we refer to the tree t as a set, it refers to its set of nodes \(t_v\). A subtree is a tree, and we use the terms interchangeably, unless stated otherwise.

The simplest tree is a single node called a singleton. We model element retrieval as systems that retrieve singletons. A singleton is a subtree t with a single node \(t_v = \{e\}\) and no edges \(t_e = \emptyset\). We refer interchangeably to subtrees with a single node as either singletons or nodes. A ranked list of nodes \(R=e_1,e_2,\ldots,e_k\) is considered to be the same as a ranked list of singletons \(R=t_1,t_2,\ldots,t_k\) where \(t_i = \{e_i\}\). We differentiate between nodes (singletons) and trees using e and t, respectively. Specific to XML, we refer to nodes as elements. XML elements are nodes in the document tree of an XML document.
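To make this notation concrete, the short sketch below (our own illustration, not code from any cited system) represents subtrees as node-ID sets with optional edges, which is all that the probability calculations in later sections require; the node names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Subtree:
    nodes: set                                 # t_v: node identifiers
    edges: set = field(default_factory=set)    # t_e: (ancestor, descendant) pairs

def is_singleton(t: Subtree) -> bool:
    """A singleton is a subtree with a single node and no edges."""
    return len(t.nodes) == 1 and not t.edges

def distinct(t1: Subtree, t2: Subtree) -> bool:
    """Two subtrees are distinct if one contains a node not found in the other."""
    return bool(t1.nodes - t2.nodes) or bool(t2.nodes - t1.nodes)

# A ranked list R = t1, t2 of two subtrees taken from one hypothetical document.
t1 = Subtree(nodes={"e3"})                     # a singleton (element retrieval)
t2 = Subtree(nodes={"e2", "e3", "e4"}, edges={("e2", "e3"), ("e2", "e4")})
R = [t1, t2]
print(is_singleton(t1), distinct(t1, t2))      # True True
```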

4 Relevance, user navigation and redundancy

Extended Structural Relevance (ESR) is a framework for calculating the user expected gain by conditioning the relevance of information on the probability that the information is both seen and not redundant to the user. Our framework encapsulates, as parameters, relevance, user navigation and redundancy, which we formalise in Sects. 4.1, 4.2 and 4.3, respectively.

4.1 Relevance

In IR evaluation, the relevance of information is a judgment made by a human assessor on whether the subject matter of the information is meaningful to a given information need. In classical retrieval, the relevance of information objects (e.g. documents) is assumed to be independent of each other. This is often not the case in SDR because users may navigate between sub-documents, and some information may be seen redundantly [44].

As posited in Piwowarski et al. [35], a user gains relevant information in SDR when it is seen by either retrieval, navigation, or a combination of both. However, a user may consider the information contained in a sub-document, albeit relevant, not useful (i.e. a sub-optimal gain), because it is either redundant or its encoding format does not provide an ideal context [24]. In this paper, we refer to the gain from seeing a sub-document as a relevance value. In classical IR, because documents are assumed independent, relevance and relevance value coincide.

How to assess the relevance of sub-documents is an active area of SDR research [36]. Kazai [23] showed that the assessment methodology, based on ideality, introduces instability into the measures (discussed in Sect. 2). The author suggested that instability can be avoided by: (a) assessing the relevance of information independently of redundancy in the output, (b) assessing relevance without considering how a user may navigate to information, and (c) evaluating system effectiveness based on the effect of user navigation and redundancy on the user gain in relevant information. Suggestions (a) and (b) remove the need to assess ideality. Suggestion (c) implies that good SDR measures evaluate how users spend effort to achieve gain. In this work, we address suggestions (a) and (b) by assuming independence between relevance and user navigation (Assumption 1 below). We address suggestion (c) by using expected gains and losses.

We recall from Sect. 1 that users gain relevant information from hits and near-misses. Near-misses are defined in Kazai and Lalmas [24] as retrieved sub-documents that may or may not be relevant, but from which the user can navigate to see unretrieved, relevant information. In this work, we reverse this definition. We consider a near-miss to be a relevant sub-document that has not been retrieved and that can be accessed by the user via navigation from retrieved sub-documents. A hit is a relevant tree in the output. Finally, a miss is a relevant tree that is not seen by the user. These three cases define the basis of gain in ESR and are summarized in Table 1.

Table 1 Gain in SDR

In ESR, we consider user navigation as a stochastic process. This is based on the observations in Hammer-Aebi et al. [19] where users see nodes by navigating via a graphical user interface from given nodes. Therefore, systems are evaluated in ESR based on expected relevance values, which are calculated by conditioning relevance value on the different cases (hits, misses, and near-misses) in which relevant information is possibly both seen by the user and redundant to the user.

The expected relevance value gain is E[rel(a) | a is seen ∧ not redundant] for both hits and near-misses. Relevant sub-documents that are not seen by the user are called misses, and we refer to the expected relevance value of a miss as a loss. For a miss, the expected relevance value loss is E[rel(a) | a is not seen].Footnote 3 To calculate relevance value in ESR, we make the following assumptions:

Assumption 1

(Structural Relevance Assumptions) Relevance is independent of how a user navigates to locate relevant information. Relevance is dependent on whether the user sees relevant information redundantly, i.e., more than once.

These assumptions are required to justify conditioning relevance value on the probability of whether information is seen and redundant. In essence, we assume that relevance depends on the outcome of the user spending effort (i.e. redundancy) and not on the amount of effort spent (i.e. ranks consulted and user navigation). Using the above assumptions, we obtain E[rel(a) | a is seen ∧ not redundant] = rel(a) × p(a is seen ∧ not redundant). We next use this to develop expected relevance values across hits, misses, and near-misses to show how gain is calculated in ESR.

Consider a hit \(a_{\rm hit}\) at a given rank m in a ranked list R. At rank m, the user sees the tree in the output. The tree will not be redundant to the user if it has not been navigated to from higher-ranked trees. Let \(1 - p(a_{\rm hit};R_{m-1})\) denote the probability that the user does not navigate to it from the higher-ranked trees \(R_{m-1}\). The expected relevance value gain from a hit is thus \(E(a_{\rm hit}) = rel(a_{\rm hit}) \times (1 - p(a_{\rm hit};R_{m-1}))\).

Next, consider a miss \(a_{\rm miss}\) in a ranked list R. For it to be a miss, the user would not see it. So, the user would not navigate to it from the trees in the output R. Let \(1 - p(a_{\rm miss};R)\) denote the probability that the user does not navigate to see tree \(a_{\rm miss}\). Thus, the expected relevance value loss from a miss is \(E(a_{\rm miss}) = rel(a_{\rm miss}) \times (1 - p(a_{\rm miss};R))\).

Finally, consider the near-miss \(a_{\rm nm}\) in a ranked list R. Let \(p(a_{\rm nm};R)\) denote the probability that the user navigates from the trees in the output R to see tree \(a_{\rm nm}\). The expected relevance value gain from a near-miss is thus \(E(a_{\rm nm}) = rel(a_{\rm nm}) \times p(a_{\rm nm};R)\).

This completes our calculation of expected relevance values, namely:

  1. \(E(a_{\rm hit}) = rel(a_{\rm hit}) \times (1 - p(a_{\rm hit};R_{m-1}))\), the expected relevance gain from a hit at rank m seen in the output,

  2. \(E(a_{\rm miss}) = rel(a_{\rm miss}) \times (1 - p(a_{\rm miss};R))\), the expected relevance loss from a miss not seen in the collection,

  3. \(E(a_{\rm nm}) = rel(a_{\rm nm}) \times p(a_{\rm nm};R)\), the expected relevance gain from a near-miss seen in the collection.
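As a minimal illustration of the three expectations above (ours, with hypothetical relevance and navigation values), the following sketch computes the expected gain or loss for a single tree in each case.

```python
def expected_gain_hit(rel_a, p_nav_from_higher_ranks):
    """E(a_hit) = rel(a) x (1 - p(a; R_{m-1})): gain from a hit at rank m."""
    return rel_a * (1.0 - p_nav_from_higher_ranks)

def expected_loss_miss(rel_a, p_nav_from_output):
    """E(a_miss) = rel(a) x (1 - p(a; R)): loss from a relevant tree never seen."""
    return rel_a * (1.0 - p_nav_from_output)

def expected_gain_near_miss(rel_a, p_nav_from_output):
    """E(a_nm) = rel(a) x p(a; R): gain from a relevant tree reached by navigation."""
    return rel_a * p_nav_from_output

# Hypothetical values: a relevant tree (rel = 1) with navigation probability 0.16.
print(expected_gain_hit(1.0, 0.16))        # 0.84
print(expected_loss_miss(1.0, 0.16))       # 0.84
print(expected_gain_near_miss(1.0, 0.16))  # 0.16
```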

These expectations are crucial to our ESR framework. We will revisit these in Sect. 4.3 where we describe how to calculate redundancy, i.e. p(a;R). In Sect. 5, these expectations form the basis of ESR. In Sect. 6, we then propose several SDR measures formulated using ESR. Finally, in Sect. 7, we test our proposed measures by evaluating a range of SDR tasks.

4.2 User navigation

A user may navigate from the retrieved results to seek (further) relevant information. Our interpretation of navigation is largely based on the work of Ali et al. [7], who define navigation as the effort spent by a user in seeking relevant information. More precisely, given a retrieved tree, a user may choose to seek from any node in that tree, via navigation, relevant information contained in nodes outside of that retrieved tree.

To formally capture this, we introduce the user navigation graph: a weighted graph over a partition of the nodes in the collection. The user navigation graph allows modelling different navigational strategies in tree retrieval (e.g. navigation via document structure, contextual markup, semantic linking) at different granularities, as well as leading to faster computation [6, 7]. The weight between two nodes reflects the effort associated with the user navigating between them. Weights can be derived through the analysis of clicks, time spent, common routes, or retinal focus, and form the basis for calculating probabilities of user navigation in ESR.

Our methodology for measuring effort is inspired by the study in Hammer-Aebi et al. [19], where for a given information need, the user is tasked with finding, judging and marking the relevant parts of a retrieved document. The study begins by presenting the user with a document where retrieved information has been highlighted. The user’s attention is directed to an initial highlight, referred to as the entry point. The user then navigates within the document using whatever means provided by the graphical user interface (such as scrollbars or hyperlinks in a table of contents). The user navigation is recorded as steps between nodes along a route starting from the entry point. The effort spent to make each step is measured. Examples of measured effort include cumulated gain [20], tolerance to irrelevance [44], expected search length [13], or time taken to read documents [15].

Let us now demonstrate user navigation along routes. Consider an XML document encoding an article ( a ) with sections ( sec and ss1 , respectively) and paragraphs ( p ). Given a user information need, the tree shown in Fig. 4a shows an article where nodes \(e_3\) and \(e_4\) are relevant. For our example, let us assume that the user navigates solely by clicking on hyperlinks such that a node is visited if and only if the user clicks on a link to the node. The node identifiers are shown beside each respective node, and their character lengths are shown in parentheses. So, for instance, node \(e_3\) is 30 characters long. Figure 5 shows three examples of routes. Route 1 describes a user who entered the document via node \(e_3\), then stepped to node \(e_1\), then \(e_2\), then \(e_4\). Route 1 is composed of three steps: \(e_3 \rightarrow e_1\), \(e_1 \rightarrow e_2\), and \(e_2 \rightarrow e_4\). As a possible measure of effort, let the number of times a step is observed indicate the ease with which users navigate it. Based on the routes shown in Fig. 5, step \(e_3 \rightarrow e_1\) requires less effort than step \(e_3 \rightarrow e_2\) because \(e_3 \rightarrow e_1\) occurs twice whereas step \(e_3 \rightarrow e_2\) occurs only once.

Fig. 4 a Tree structure of an article, b User navigation graph based on partition

Fig. 5 Examples of routes navigated

Next, to determine the probabilities needed to calculate ESR, we partition the nodes in the collection into a user navigation graph where directed edges are defined based on the routes users navigate. The weights on these directed edges are inversely proportional to the effort that users spend. To calculate ESR, we calculate probabilities for navigating steps based on the user navigation graph weighted by some function of effort. A higher probability for a step between two nodes corresponds to a lower effort for the user to take the step. Let \(\tilde{p}(e_i;e_j)\) denote the probability of navigating to node \(e_i\) from node \(e_j\), i.e., taking step \(e_j \rightarrow e_i\). Thus, if \(\tilde{p}(e_i;e_j) > \tilde{p}(e_a;e_b)\) then we can conclude that the user spends less effort to take step \(e_j \rightarrow e_i\) than step \(e_b \rightarrow e_a\).

The most obvious partition to choose is the collection itself where the nodes in the user navigation graph correspond, one-to-one, with the nodes in the collection. We refer to this case as the elementary user navigation graph. Let our user navigation graph be the elementary case, which, in this case, is the tree shown in Fig. 4a with bi-directional edges. Let us consider the steps in the observed routes (such as the examples shown in Fig. 5) as edges in the user navigation graph.

Let \(w(e_i;e_j)\) be the weight on the directed edge from node \(e_j\) to node \(e_i\) in the user navigation graph. For this example, let the weight \(w(e_i;e_j)\) be the number of occurrences of step \(e_j \rightarrow e_i\) in Fig. 5, and let us assume that the effort spent is independent across routes and steps. Thus, for instance, the weight \(w(e_1;e_3)\) is 2 and \(w(e_2;e_3)\) is 1 given the routes shown in Fig. 5.Footnote 4 Table 2 summarizes our example weighting matrix for the elementary user navigation graph in Fig. 4a given the observed routes in Fig. 5.

Table 2 Elementary user navigation weights and navigation probabilities (in parentheses)

We now calculate the ESR navigation probabilities. We calculate the probability of the user navigating the step \(e_j \rightarrow e_i\) using \(\tilde{p}(e_i;e_j) = w(e_i;e_j)/W(e_i)\), where \(W(e_i) = \sum_{j=1}^{N} w(e_i;e_j)\) denotes the total weight of the directed edges in the user navigation graph leading to \(e_i\). The total probability of navigating to node \(e_i\) from all other nodes in the collection is \(\sum_{i \neq j}{\tilde{p}(e_i;e_j)}=1\) if and only if it is possible for the user to navigate to \(e_i\); otherwise, \(\tilde{p}(e_i;e_j)\) is 0. For instance, the probability \(\tilde{p}(e_3;e_1)\) is 2/3 = 0.66 because the total weight on the edges leading to \(e_3\) is \(w(e_3;e_1) + w(e_3;e_2) = 2 + 1 = 3\) and \(w(e_3;e_1) = 2\). The values in parentheses in Table 2 summarize the ESR navigation probabilities modelled as elementary user navigation using the routes shown in Fig. 5.
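The weight matrix and probabilities above can be computed mechanically from observed routes. The sketch below is our own illustration: Route 1 is taken from the text, while the remaining routes are hypothetical stand-ins chosen only so that the quoted step counts (e.g. two occurrences of \(e_1 \rightarrow e_3\)) are reproduced; they are not the routes of Fig. 5.

```python
from collections import defaultdict

def step_weights(routes):
    """w(e_i; e_j): number of times step e_j -> e_i is observed across all routes."""
    w = defaultdict(int)
    for route in routes:
        for e_j, e_i in zip(route, route[1:]):
            w[(e_i, e_j)] += 1
    return w

def nav_prob(w, e_i, e_j):
    """p~(e_i; e_j) = w(e_i; e_j) / W(e_i), where W(e_i) sums the edges leading to e_i."""
    W_i = sum(weight for (to, _frm), weight in w.items() if to == e_i)
    return w.get((e_i, e_j), 0) / W_i if W_i else 0.0

routes = [
    ["e3", "e1", "e2", "e4"],   # Route 1 (from the text)
    ["e1", "e3"],               # hypothetical
    ["e2", "e3", "e1"],         # hypothetical
    ["e1", "e3", "e2", "e4"],   # hypothetical
]
w = step_weights(routes)
print(round(nav_prob(w, "e3", "e1"), 2))    # 0.67, i.e. the 2/3 computed above
```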

In practice, the weights in elementary user navigation graphs cannot be determined (e.g. via human studies) because the graphs can be very large. Indeed, if N is the number of nodes in the collection, for each node \(e_i\) there will be N − 1 probabilities \(\tilde{p}(e_i;e_j)\) needed to define navigation. Therefore, N × (N − 1) weights are needed to calculate \(\tilde{p}(e_i;e_j)\) for all nodes in the collection, which is impractical to assess in user studies for large N. We therefore consider a simplified model for navigation where the user navigates from one node subset to another, where the set of subsets forms a partition of the nodes in the collection. For instance, the nodes in the tree shown in Fig. 4a can be grouped by their XML tags: article node \(e_1\); section nodes \(e_2\), \(e_3\), and \(e_6\); and paragraph nodes \(e_4\) and \(e_5\). Figure 4b shows a graph based on this partitioning scheme where S1 contains article nodes; S2 contains section nodes; and S3 contains paragraph nodes. We proposed this approach originally in Ali et al. [6] using XML tags to partition nodes, and later validated our approach in Ali et al. [7].

Using this partition, we weight the directed edges between partitions. For instance, in Table 2, the outgoing steps from the nodes in S2 are: \(e_3 \rightarrow e_1\) twice, \(e_3 \rightarrow e_2\) once, and \(e_2 \rightarrow e_4\) twice. This corresponds to the following weights on the edges from S2: w(S1;S2) = 2, w(S2;S2) = 1, and w(S3;S2) = 2. Table 3 shows the resulting weighting matrix. We approximate the probability of navigation as \(\tilde{p}(e_i;e_j) \approx \tilde{p}(S_i;S_j) = w(S_i;S_j)/W(S_i)\) where \(S_i\) denotes the partition of node \(e_i\) and \(S_j\) denotes the partition of node \(e_j\), respectively. For instance, \(\tilde{p}(e_3;e_1)\approx \tilde{p}(S2;S1)=w(e_3;e_1)/(w(e_2;e_4) + w(e_3;e_1)+w(e_3;e_2))= 2/(2+2+1)=0.4\). Table 3 summarizes (in parentheses) the ESR navigation probabilities for this model.

Table 3 Summary model weights and navigation probabilities (in parentheses) where S1, S2, and S3 are the summary nodes shown in Fig. 4b
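Under the same assumptions, the summary (partition) model only needs a node-to-partition mapping. The sketch below is again our own illustration; the weight matrix lists only the incoming edges to S2 quoted in the text, since Table 2 is not reproduced here.

```python
def partition_nav_prob(w, part, e_i, e_j):
    """p~(e_i; e_j) ~ p~(S_i; S_j) = w(S_i; S_j) / W(S_i), with S_x = part[e_x]."""
    S_i, S_j = part[e_i], part[e_j]
    num = sum(wt for (to, frm), wt in w.items()
              if part[to] == S_i and part[frm] == S_j)
    W_Si = sum(wt for (to, _frm), wt in w.items() if part[to] == S_i)
    return num / W_Si if W_Si else 0.0

# Partition of Fig. 4a by tag: S1 = article, S2 = sections, S3 = paragraphs.
part = {"e1": "S1", "e2": "S2", "e3": "S2", "e6": "S2", "e4": "S3", "e5": "S3"}
w = {("e2", "e4"): 2, ("e3", "e1"): 2, ("e3", "e2"): 1}   # incoming edges to S2 only
print(partition_nav_prob(w, part, "e3", "e1"))            # 0.4, as in the text
```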

Partitioning schemes, such as the one described above, can be used to model different navigational strategies and lead to less costly computation. A further computational reduction is presented in our earlier work [5] where user navigation can be calculated over an infinite number of steps, i.e., using steady-state probabilities.

For instance, in Ali et al. [7], we use the partitioning scheme developed in Consens et al. [12] and analysis from the eye-tracking study in Hammer-Aebi et al. [19] to model navigation in the INEX Wikipedia collection as a graph consisting of four partitions (namely, article , section , ss1 , other ) where each partition is weighted inversely to the tag depth of the nodes included in the partition. We refer to this as a depth-weighted summary model of navigation and use this model in our experiments (Sect. 7).

4.3 Redundancy

Redundancy occurs when a user sees the same relevant information more times than they tolerate [44]. Previously, in Sect. 4.1, we defined the user expected gain (and loss) of relevance value without clarifying how to calculate redundancy. In ESR, redundant information is considered not relevant to the user, and we consider information redundant if it is seen more than once.

Below, we restate the expected gains in Sect. 4.1 from hits (1) and near-misses (3), and the expected loss from misses (2).

$$ E(a_{\rm hit})=(1 - p(a_{\rm hit};R_{m-1})) \cdot rel(a_{\rm hit}) $$
(1)
$$ E(a_{\rm miss})=(1-p(a_{\rm miss};R)) \cdot rel(a_{\rm miss}) $$
(2)
$$ E(a_{\rm nm})=p(a_{\rm nm};R) \cdot rel(a_{\rm nm}) $$
(3)

where R is a ranked list output of k trees; \(R_i\) is a sublist of R up to rank i such that if i > k then \(R_i = R\) and if i ≤ 0 then \(R_0 = \emptyset\); \(a_{\rm hit}\) is a hit at rank m; \(a_{\rm nm}\) is a near-miss; \(a_{\rm miss}\) is a miss; rel(a) is the relevance value of tree a; and p(a;R) is the probability that the user will see tree a once by navigating from the trees in output R.

Now, we explain the calculation of p(a;R) using the probability \(\tilde{p}(t_i;t_j)\) that a user navigates from the nodes of tree \(t_j\) to the nodes in tree \(t_i\). The user sees a tree by navigating to all of its nodes. The probability of seeing a tree by navigating to its nodes from a given tree is presented in our earlier work [5]. We state the probability here, as follows,

$$ \tilde{p}(t_i;t_j) = {\frac{ \sum_{e_j \in t_j/t_i}{ \sum_{e_i \in t_i/t_j}{\tilde{p}(e_i;e_j)}}}{|t_i| \cdot |t_j|}} $$
(4)

where \(t_i\) is a tree from the collection, \(t_j\) is a different tree from the collection, x/y denotes the set of nodes in tree x not in tree y, \(e_i\) and \(e_j\) are nodes, |t| is the number of nodes in tree t, and the probability \(\tilde{p}(e_i;e_j)\) is the probability that a user will navigate to node \(e_i\) given that he or she navigates from node \(e_j\) (as shown in Sect. 4.2).

We next explain the navigation between trees shown in (4). Consider a user navigating from the nodes in tree \(t_j\) to the nodes in tree \(t_i\). Assume that each visit to a node by the user is independent. From a visit to node f in subtree \(t_j\), the expected number of distinct nodes from subtree \(t_i\) that the user would see is \(E[t_i;f]=\sum_{e \in t_i}{\tilde{p}(e;f)}\). For each node in \(t_j\), this expected number of distinct nodes has a maximum value of \(|t_i|\). We refer to \(t_j\) as the previous subtree, and \(t_i\) as the current subtree. The number of nodes seen in the current subtree from the previous subtree is \(\sum_{f \in t_j}{E[t_i;f]}\). The maximum number of nodes seen is \(|t_i| \cdot |t_j|\). The proportion of the nodes in the current subtree that were seen from the previous subtree is \(\tilde{p}(t_i;t_j)=\sum_{f \in t_j}{E[t_i;f]}/(|t_j| \cdot |t_i|)\). This is the probability that the nodes in the current subtree have been seen from the previous subtree. Substituting the expected number of distinct nodes for \(E[t_i;f]\), the probability becomes \(\tilde{p}(t_i;t_j)=(\sum_{f \in t_j}{\sum_{e \in t_i}{\tilde{p}(e;f)} })/(|t_i|\cdot|t_j|)\) and, thus, we obtain (4).

Finally, in Sect. 4.1 above, we defined redundancy in ESR as the conditioning probabilities for expected gains (losses) from hits and near-misses (misses). We recall that p(a;R) denotes the probability P(a is seen once;R) that tree a is seen once by navigating from the trees in the output R. Assume that the navigation from each tree in the output is independent. We can calculate redundancy p(a;R) using (4), as follows

$$ p(a;R)=1-\prod_{j=1}^{k}{(1-\tilde{p}(a;t_j))} $$
(5)

where a is a tree in the collection, \(R=t_1,t_2,\ldots,t_k\) is an output of k trees, and \(\tilde{p}(a;t_j)\) is shown in (4).
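Equations (4) and (5) translate directly into code. The sketch below is our own illustration; node-level probabilities are supplied as a lookup table, and the single entry used here is the \(\tilde{p}(e_3;e_1) = 0.16\) value quoted later in the worked example of Sect. 5.

```python
def tree_nav_prob(t_i, t_j, p_node):
    """Eq. (4): probability of seeing tree t_i by navigating from tree t_j.
    t_i and t_j are sets of node ids; p_node(e_i, e_j) returns p~(e_i; e_j)."""
    only_i = t_i - t_j      # t_i / t_j
    only_j = t_j - t_i      # t_j / t_i
    total = sum(p_node(e_i, e_j) for e_j in only_j for e_i in only_i)
    return total / (len(t_i) * len(t_j))

def redundancy(a, R, p_node):
    """Eq. (5): p(a; R), probability that tree a is seen by navigating from output R."""
    prod = 1.0
    for t_j in R:
        prod *= 1.0 - tree_nav_prob(a, t_j, p_node)
    return 1.0 - prod

# Hypothetical node-level navigation table (only one entry, taken from Sect. 5).
p_table = {("e3", "e1"): 0.16}
p_node = lambda e_i, e_j: p_table.get((e_i, e_j), 0.0)
print(redundancy({"e3"}, [{"e1"}], p_node))   # 0.16
```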

4.4 Example toy system and models

In this section, we present a toy collection, a navigation model, a set of relevance judgments, and three example system outputs that we will use in later Sects. 5 and 6.5 to demonstrate ESR.

For a query, consider the retrieval of elements from the article shown in Fig. 4a. Let the assessed elements be \(e_3\) and \(e_4\), i.e. \(A = \{e_3, e_4\}\). When relevance is binary, let \(rel(e_3) = rel(e_4) = 1\). When we use the number of highlighted characters to measure relevance (relevance by length), let \(rel(e_3) = 30\) and \(rel(e_4) = 20\). The relevance values for both binary relevance and relevance by length of \(e_3\) and \(e_4\) are shown in Table 4. User navigation, the probability of the user navigating to a node from a given node,Footnote 5 is shown in Table 5.

Table 4 Relevance value of assessments rel(e), for tree e in A
Table 5 User navigation \(\tilde{p}(e_i;e_j)\)

Table 6 shows the outputs for three systems for the query. System 1 (R1) retrieves a near-miss in rank 1 and hits in ranks 2 and 3. System 2 (R2) retrieves three near-misses. System 3 (R3) retrieves a near-miss in rank 2 and hits in ranks 1 and 3. We expect System 2 to have the worst performance because it does not retrieve any hits, and System 3 to have the best performance because it retrieves a hit in rank 1, whereas System 1 does not retrieve a hit until rank 2. We expect System 1 to be the second best performing system.

Table 6 Three example system outputs

This completes the presentation of relevance, user navigation and redundancy in ESR. In the next section, we present the overall ESR framework where these expected gains and losses are calculated over the relevant trees in the collection in order to compute the user total relevance value gain (loss) for a given output composed of hits, misses, and near-misses, respectively.

5 Extended structural relevance framework

Our proposed Extended Structural Relevance framework (ESR) provides the means to formulate measures based on the user expected gain (or loss) in relevance value given redundancy. ESR is motivated by the collection partitioning scheme presented in Bollman [10] (which, in turn, is largely motivated by the much earlier work in Robertson [38]). Bollman [10] shows that a family of document retrieval evaluation measures (such as precision, recall, and fallout) can be derived from the number of hits and misses in the output and the collection. The ESR framework represents a similar family of parameters for SDR measures based on partitioning sub-documents in the output and collection into hits, misses and near-misses.

We begin by partitioning the relevant trees in the collection into hits and misses in the output. For a given information need, consider the relevant trees in the assessments A that are hits in the output R. The hits are obtained from the intersection R ∩ A (6). The misses are obtained from the set difference A/R (7). The hits A ∩ R and misses A/R define a partition because \((A/R) \cap (A \cap R) = \emptyset\) and \((A/R) \cup (A \cap R) = A\).Footnote 6

$$ \hbox{Hits}=R \cap A $$
(6)
$$ \hbox{Misses}=A/R $$
(7)

If we assume that relevance value judgments are independent, as stated in Assumption 1, then the total expected relevance value for hits, misses, and near-misses can be obtained by summing the expectations across the trees in the appropriate set of judgments [(6) and (7), respectively]. The total expected relevance value gain from hits E[Hits, R, A] is obtained by summing the expected relevance value of the trees in the set Hits (6) using the expected relevance value gain for a hit in (1) (in Sect. 4.3). Similarly, the total expected relevance value loss from misses E[Misses, R, A] is obtained by summing the expected relevance value of the trees in the set Misses (7) using the expected relevance value loss for a miss in (2) (in Sect. 4.3). Finally, the total expected relevance value gain from near-misses E[Near-misses, R, A] is obtained by summing the expected relevance value of the trees in the set Misses (7) using the expected relevance value gain for a near-miss in (3) (in Sect. 4.3). Thus, the total expected relevance values can be stated, as follows,

$$ E[\hbox{Hits},R,A]=\sum_{a \in R \cap A}{rel(a) \cdot (1 - p(a;R_{m-1}))} $$
(8)
$$ E[\hbox{Misses},R,A]=\sum_{a \in A/R}{ rel(a) \cdot (1 - p(a;R))} $$
(9)
$$ E[\hbox{Near-misses},R,A]=\sum_{a \in A/R}{rel(a) \cdot p(a;R)} $$
(10)

where \(R=t_1,t_2,\ldots,t_k\) is a ranked list of k trees, \(A=a_1,a_2,\ldots,a_n\) is a set of n assessments, m is the rank of a in R, rel(a) is the relevance value of tree a, and p(a;R) is the probability that the user will see a once when consulting R, i.e., redundancy (5).

The total expected relevance value in the collection is the sum of the total expected relevance value for hits, misses, and near-misses. We refer to this as the recall-base. It is stated, as follows,

$$ E[\hbox{Recall-base},R,A] = E[\hbox{Hits},R,A] + E[\hbox{Misses},R,A] + E[\hbox{Near-misses},R,A] $$
(11)

The total expected relevance value of the recall-base can be defined in this way because Hits and Misses are disjoint; and the expected loss and gain, respectively, of misses and near-misses are complementary. The total expected relevance value of the recall-base represents the maximum relevance value gain that a user could experience given an output and a set of all relevant trees.
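A compact way to read (8)-(11) is as a single pass over the assessments, as in the sketch below (our own illustration; the relevance function rel and the redundancy probability p_seen are assumed to be supplied, e.g. by the computation of Sect. 4.3).

```python
def esr_totals(R, A, rel, p_seen):
    """Eqs. (8)-(11): total expected relevance values for output R and assessments A.
    R is a ranked list of trees, A a list of assessed relevant trees, rel(a) a relevance
    value, and p_seen(a, prefix) the redundancy probability p(a; prefix)."""
    hits = [a for a in A if a in R]
    misses = [a for a in A if a not in R]     # also the candidate near-misses
    e_hits = sum(rel(a) * (1.0 - p_seen(a, R[:R.index(a)])) for a in hits)    # (8)
    e_misses = sum(rel(a) * (1.0 - p_seen(a, R)) for a in misses)             # (9)
    e_near = sum(rel(a) * p_seen(a, R) for a in misses)                       # (10)
    recall_base = e_hits + e_misses + e_near                                  # (11)
    return e_hits, e_misses, e_near, recall_base
```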

In our earlier work [5], we evaluated system performance based on inferred relevance value. This basis limits evaluation of tree retrieval effectiveness to precision, because gain is limited to the user seeing relevant nodes in the output. ESR goes beyond this, by calculating the user expected gain (or loss) in relevance value based on the user seeing relevant nodes from the output as hits, misses and near-misses, and combining these to define recall.

We complete this section with an example calculation of our ESR expected relevance value gain from hits, misses and near-misses, based on the toy collection in Fig. 4a; the navigation model in Table 5; the set of relevance judgments for both binary relevance and relevance by length of nodes e 3 and e 4 shown in Table 4; and, the three example outputs for System 1 (R1), System 2 (R2), and System 3 (R3), all given in Sect. 4.4.

We begin by illustrating how ESR parameters are calculated for System 1. We recall that ESR relies on three main parameters expressing, respectively, how the user gains relevant information either by the system retrieving the information directly (hits with E[Hits, R, A] in (8)), by the user locating the relevant information via navigation (near-misses with E[Near-misses, R, A] in (10)), or not at all (misses with E[Misses, R, A] in (9)).

We determine the ESR parameters of expected relevance value of hits, misses and near-misses at rank cut-off k = 1 for System 1 in Table 6 (recall that \(R1 = e_1, e_3, e_4\)). At rank 1, System 1 outputs element \(e_1\). This is not a hit because \(A \cap R1_1 = \emptyset\) (6). The misses are \(A/R1_1 = \{e_3, e_4\}\) (7). The probability that relevant element \(e_3\) can be navigated to from retrieved element \(e_1\) is \(p(e_3;R1_1) = 1 - ( 1 - \tilde{p}(e_3;e_1) ) = \tilde{p}(e_3;e_1) = 0.16\) (5). Similarly, the probability that \(e_4\) can be navigated to is \(p(e_4;R1_1) = 1 - (1 - \tilde{p}(e_4;e_1)) = 0.11\). Assume binary relevance. The expected relevance value of the near-miss \(e_3\) is \(rel(e_3) \times p(e_3;R1_1) = 0.16\) (10). The expected relevance value of the miss \(e_3\) is \(rel(e_3) \times (1 - p(e_3;R1_1)) = 0.84\) (9). The expected relevance value of the near-miss \(e_4\) is \(rel(e_4) \times p(e_4;R1_1) = 0.11\) (10) and of the miss \(e_4\) is \(rel(e_4) \times (1 - p(e_4;R1_1)) = 0.89\) (9). The recall-base is 0.16 + 0.84 + 0.11 + 0.89 = 2 (11). The expected relevance value gains, defined by binary relevance, are shown in Row 3 of Table 7. The values in Row 3 in parentheses show the expected relevance value gains if we define the relevance value as relevance by length.
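The numbers above can be checked with a few lines of arithmetic; the sketch below (ours) uses only the two Table 5 probabilities quoted in the text and binary relevance.

```python
# Worked check of Row 3 of Table 7 (System 1 at rank cut-off k = 1, binary relevance).
p_e3, p_e4 = 0.16, 0.11          # p~(e3; e1) and p~(e4; e1): navigation from retrieved e1
rel_e3 = rel_e4 = 1.0            # binary relevance

near_e3, miss_e3 = rel_e3 * p_e3, rel_e3 * (1 - p_e3)   # (10) and (9) for e3
near_e4, miss_e4 = rel_e4 * p_e4, rel_e4 * (1 - p_e4)   # (10) and (9) for e4
hits = 0.0                                              # A and R1_1 do not intersect, so (8) is 0
recall_base = hits + near_e3 + miss_e3 + near_e4 + miss_e4      # (11)

print(near_e3, miss_e3, near_e4, miss_e4, round(recall_base, 2))
# 0.16 0.84 0.11 0.89 2.0
```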

Table 7 Hits, Near-misses and misses using binary relevance (relevance by length)

The expected relevance value of hits, misses, and near-misses across rank cut-offs for Systems 1, 2 and 3 are shown for binary relevance and relevance by length (in parentheses) in Table 7. These can be used to understand whether a system retrieves relevant information directly or whether the user must navigate to find it. For instance, at k = 3 (Rows 12 to 15), we note that both Systems 1 and 3 retrieve relevant information directly, whereas System 2 returns only near-misses, and hence the user will need to spend some effort to locate relevant information.

The expected relevance value of the recall-base across rank cut-offs is shown in Table 8. At a given rank cut-off, the size of the recall-base changes inversely to the redundancy in the output up to the given rank, i.e., more redundancy in the output will reduce the size of the recall-base (and the expected relevance value from hits and near-misses). In our example, we note that by using System 3 the user will experience the least redundancy (Row 4 in Table 8) with the greatest gain (from Row 15 in Table 7). Additionally, we note that users of System 2 experience the least overall gain (from Row 14 in Table 7). This corresponds to our earlier assertion that System 2 would have the worst performance and System 3 would have the best.

Table 8 Recall-base using binary relevance (relevance by length)

To summarize, the ESR framework is comprised of four related expected values; namely expected relevance value gain from the user seeing hits in the output (8), expected relevance value loss from (unseen) misses in the collection (9), expected relevance value gain from the user seeing near-misses in the collection (10), and the sum of these three expectations (11), i.e. the recall-base. Next, we show how this framework is used to measure performance for several task-specific approaches in SDR.

6 ESR evaluation measures

In this section, we formulate SDR evaluation measures for SR, HiXEval, XCG, and PRUM (introduced in Sect. 2) within our ESR framework. We consider each by first describing the original measures and then expressing them in ESR. Note that each represents a family of measures, and we formulate only a selection in each. The selected measures are expressed in terms of the expectations defined in the previous section.

6.1 Structural relevance

Structural relevance (SR) [5] is a measure of the user expected gain in relevant information given that the information may be redundant. SR is calculated by summing the expected inferred relevance value gain for the trees in the output:

$$ SR(R)=\sum_{i=1}^{k}{rel(t_i) \cdot (1-p(t_i;R_{i-1}))} $$
(12)

where \(rel(t_i)\) is the inferred relevance value of subtree \(t_i\), and \(p(t_i;R_{i-1})\) is the probability that the nodes in subtree \(t_i\) are seen more than once by the user. In our earlier work [5], SR in Precision (SRP), which is SRP = SR(R)/k, was proposed to measure precision of tree retrieval systems, and to rank systems using mean average precision across rank cut-offs.

To represent SRP in ESR, we replace the expected inferred relevance value gain SR(R) with the expected relevance value gain from hits in ESR, i.e., SR(R) (12) with E[Hits, R, A] (8):

$$ \hbox{ESRP}(R,A) = E[\hbox{Hits},R,A] / k. $$
(13)

Note that if the inferred relevance value is equal to the assessed relevance value then SR(R) and E[Hits, R, A] are equivalent. For instance, in element retrieval, systems retrieve singletons and the inferred relevance in SR is exactly the judged relevance value in ESR, thus explaining the above equivalence.

A key limitation of the inferred relevance value \(rel(t_i)\), as originally proposed in SR, is that it does not allow recall to be calculated. This is because \(rel(t_i)\) can only represent gain from hits or near-misses. Misses (and the corresponding losses) cannot be accounted for, and thus recall cannot be defined. By formulating SR in our framework, as shown next, this limitation is overcome.

Indeed, we recall that the user gains relevant information from hits (8) and near-misses (10). The recall-base in (11) represents the user's maximum possible gain. We obtain a measure of recall by dividing the sum of the gain from hits and near-misses by the recall-base. We refer to our recall measure as Structural Relevance in Recall (ESRR):

$$ \hbox{ESRR}(R,A)={\frac{(E[\hbox{Hits},R,A] + E[\hbox{Near-misses},R,A])}{E[\hbox{Recall-base},R,A]}} $$
(14)

In the case where a user cannot navigate (\(\tilde{p}(t_i;t_j)=0\) for all trees in the collection) and assuming binary relevance (rel(a) = 1 for relevant trees), it can be shown that ESRP and ESRR reduce to classical precision (r/k) and recall (r/N), respectively.
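Both measures are simple ratios over the totals of Sect. 5. The sketch below (our own illustration) evaluates them with the System 1, k = 1 values computed there (hits 0, near-misses 0.27, recall-base 2.0).

```python
def esrp(e_hits, k):
    """Eq. (13): ESRP = E[Hits, R, A] / k, precision at rank cut-off k."""
    return e_hits / k

def esrr(e_hits, e_near, recall_base):
    """Eq. (14): ESRR = (E[Hits] + E[Near-misses]) / E[Recall-base]."""
    return (e_hits + e_near) / recall_base if recall_base else 0.0

print(esrp(0.0, 1))            # 0.0
print(esrr(0.0, 0.27, 2.0))    # 0.135
```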

6.2 Highlighting XML retrieval evaluation

Highlighting XML evaluation (HiXEval) proposed in Pehcevski and Thom [31], and further finalized in Kamps et al. [22], was developed to evaluate the performance of systems that retrieve (or can be modelled as retrieving) passages, where a passage is a block of text, delineated or not with XML tags.

HiXEval exploits the relevance assessment methodology used at INEX since 2005 [36], where human judges highlight the relevant passages in retrieved (pooled) documents. With this methodology, for a given information need, the relevant parts in documents are those that have been highlighted by the human judges [36]. HiXEval measures precision and recall based on the amount of relevant information retrieved; the amount of relevant information in the collection; and the overlap of the relevant text in retrieved passages. The “amount of information” is measured using the character length of passages.

In HiXEval, for a given information need, the total relevance value of the information contained across all documents in the collection is given by the number of highlighted characters in the whole collection. Let \(T_{rel}\) denote the number of characters in the relevant (highlighted) text in the collection. The relevance value of a retrieved passage in HiXEval is the character length of the relevant text in the passage. If the relevant text overlaps with another retrieved passage, then the overlapped text is relevant to the user with probability \(\alpha \in [0, 1]\), where α refers to the user tolerance to overlap. HiXEval assumes that user navigation does not extend beyond the boundaries of retrieved passages. Thus, HiXEval considers redundancy as only occurring between adjacent retrieved text passages overlapping each other.

For a retrieved passage e, the user gain in relevant information is given by rsize(e), which is defined as follows. Let size(e) denote the size of the retrieved passage. Let rel(e) denote the size of the relevant text in the passage. Let rov(e) denote the number of characters in the relevant text that is overlapped with a higher-ranked passage in the output. The gain is stated as follows:

$$ rsize(e) = rel(e) - (1 - \alpha) \times rov(e) $$
(15)

Based on the above, numerous measures can be obtained for measuring precision and recall in passage retrieval. In this section, we consider two HiXEval measures, namely interpolated precision (iP) and interpolated recall (iR)Footnote 7. Interpolated precision is the user gain in relevant information divided by the number of characters retrieved (16). Interpolated recall is the user gain in relevant information divided by the total relevance value in the collection (17).

$$ iP@r={\frac{\sum_{i=1}^{r}{rsize(e_i)}}{\sum_{i=1}^{r}{size(e_i)}}} $$
(16)
$$ iR@r={\frac{\sum_{i=1}^{r}{rsize(e_i)}}{T_{rel}}} $$
(17)

where \(R=e_1,e_2,\ldots,e_k\) is a ranked list of k passages and \(r \in [1,k]\) is a rank. Mean average precision across either rank cut-offs or recall points is used to rank systems.
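A direct reading of (15)-(17) is sketched below (our own illustration, with hypothetical passage sizes and a tolerance of α = 0.5); a passage is represented by its rsize and size in characters.

```python
def rsize(rel_chars, rov_chars, alpha):
    """Eq. (15): relevant characters, discounting relevant text already seen (overlap)."""
    return rel_chars - (1.0 - alpha) * rov_chars

def iP(passages, r):
    """Eq. (16): interpolated precision at rank r; passages = [(rsize_i, size_i), ...]."""
    gained = sum(rs for rs, _ in passages[:r])
    retrieved = sum(sz for _, sz in passages[:r])
    return gained / retrieved if retrieved else 0.0

def iR(passages, r, T_rel):
    """Eq. (17): interpolated recall at rank r; T_rel = relevant characters in collection."""
    return sum(rs for rs, _ in passages[:r]) / T_rel

# Hypothetical run: two passages of 100 and 50 characters; the second overlaps
# 10 already-seen relevant characters, with tolerance alpha = 0.5.
passages = [(rsize(30, 0, 0.5), 100), (rsize(20, 10, 0.5), 50)]
print(iP(passages, 2), iR(passages, 2, T_rel=50))   # 0.3 0.9
```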

We now formulate HiXEval, i.e. iP and iR, in the ESR framework. First, we define the relevance value rel(a) as the number of characters in the relevant text in the nodes of the tree a. Second, we note that overlap in tree retrieval is a specific case of redundancy where trees in the output share nodes in common, and that this can be accounted for in ESR using an appropriate user navigation model. Third, let \(T_{rel}\) be the number of characters in the relevant text in the collection and size(t) denote the number of characters in the nodes of tree t.

We replace the user gain in relevant information rsize() (15) with the sum of the gain from hits (8). We limit gain to hits because HiXEval (and XCG) measures limit consideration of user navigation to within retrieved elements. This is fully accounted for in ESR with hits. Indeed, as stated in Sect. 4.1, near-misses in HiXEval (and XCG) are defined differently than in ESRFootnote 8. We obtain the following ESR measures for interpolated precision (SRiP) and recall (SRiR), stated without derivation,

$$ \hbox{SRiP}(R,A)={\frac{E[\hbox{Hits},R,A]}{\sum_{i=1}^{k}{size(t_i)}}} $$
(18)
$$ \hbox{SRiR}(R,A)={\frac{E[\hbox{Hits},R,A] }{T_{rel}}} $$
(19)

The key differences between iP/iR and SRiP/SRiR are that the latter are based on tree retrieval and consider a broader notion of redundancy than overlap (which is a special case of redundancy). The SRiP/SRiR measures above can be applied to any search task that can be modelled using tree retrieval. This demonstrates an important advantage when using ESR in that an evaluation approach like HiXEval can be applied to tasks that go beyond the search paradigm, here passage retrieval, for which it was originally proposed.

6.3 Extended cumulated gain

Extended cumulated gain (XCG) [24] is a family of measures that evaluate the user gain in relevant information from an actual system compared to the gain possible from an ideal system (see Sect. 2 for details on ideality). One of the XCG measures is the normalized extended cumulated gain (NXCG), which we formulate now within our ESR framework.

NXCG is the ratio of the user cumulated gain in relevant information from an actual system compared to the cumulated gain from an ideal system. The cumulated gain xCG[k] (20) is the user gain after consulting k ranks from the actual system. The ideal cumulated gain xCI[k] (shown in (21)) is the user gain after consulting k ranks from the ideal system. NXCG is defined as their ratio (22).

$$ \hbox{xCG}[k]=\sum_{i=1}^{k}{\hbox{xG}[i]} $$
(20)
$$ \hbox{xCI}[k]=\sum_{i=1}^{k}{\hbox{xI}[i]} $$
(21)
$$ \hbox{NXCG}[k]={\frac{\hbox{xCG}[k]}{\hbox{xCI}[k]}} $$
(22)

where xG[i] is the gain from the ith element in the actual system output \(R=e_1,e_2,\ldots,e_k\), and xI[i] is the gain from the ith element in the ideal system output \(I={\rm ideal}_1,{\rm ideal}_2, \ldots,{\rm ideal}_n\). XCG has been developed for measuring element retrieval systems,Footnote 9 and thus \(e_i\) is an XML element in the output, and \({\rm ideal}_i\) is an element in the set of assessed ideal elements. Averaged NXCG at a given rank cut-off is used to rank systems.

Relevance in XCG is considered as follows. At each rank consulted, the user gains relevant information depending on whether the consulted element contains relevant text and whether its text overlaps with other retrieved elements. There are numerous ways in XCG to calculate the user gain in relevant information, depending on how the relevance of elements has been determined (which has changed over the years at INEX [36]). For illustrative purposes, we use the same approach described in Sect. 6.2, which is based on the number of highlighted characters.

Let size(e) denote the number of characters in a retrieved element. Let rsize(e) denote the number of characters in the text of a retrieved element that are relevant to the user. The calculation of rsize(e) is shown in Sect. 6.2, (15). The actual gain in (20) is then \(xG[i] = rsize(e_i)/size(e_i)\). The ideal gain in (21) is then \(xI[i] = rsize({\rm ideal}_i)/size({\rm ideal}_i)\).
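
To illustrate the original NXCG definitions (20)-(22), the following minimal sketch cumulates per-rank gains for an actual and an ideal run; the gain values used are hypothetical rsize/size ratios.

```python
# A minimal sketch of xCG (20), xCI (21) and NXCG (22). The per-rank gains xG[i]
# and ideal gains xI[i] are assumed given (hypothetical rsize/size ratios here).

def nxcg(xg, xi, k):
    """Normalized extended cumulated gain at rank cut-off k."""
    xcg = sum(xg[:k])  # cumulated gain of the actual run, (20)
    xci = sum(xi[:k])  # cumulated gain of the ideal run, (21)
    return xcg / xci   # (22)

xg = [0.3, 0.0, 0.5]    # actual gains per rank, rsize(e_i)/size(e_i)
xi = [1.0, 0.5, 0.5]    # ideal gains per rank, rsize(ideal_i)/size(ideal_i)
print(nxcg(xg, xi, 2))  # (0.3 + 0.0) / (1.0 + 0.5) = 0.2
```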

As for HiXEval, we limit gain in XCG to hits. We now formulate NXCG using ESR. For this, we first consider xCG and xCI within our ESR framework. We start with xCG (20). The user's total expected gain in relevance value is the sum of the gain from hits (8). We refer to this as the cumulated gain CG, shown below in (23).

We now discuss xCI (21). In this work, we propose an alternative that mitigates the instability caused by ideality, as reported in Kazai et al. [26]: we replace the ideal cumulated gain with a desired cumulated gain, i.e. the cumulated gain that a user expects in return for spending a given effort. Similar approaches to measuring effort-gain relationships can be found in expected search length [13] and PRecall [37]. The desired cumulated gain is calculated as follows. Let m denote the desired effort (in number of ranks) spent to satisfy the user information need. Let l denote the desired recall to satisfy the user information need. Let i denote a rank cut-off. The total relevance value needed to satisfy the user information need is the recall-base (11) times the desired recall l. Dividing this by the desired effort m gives the desired gain per rank, and multiplying by the current rank cut-off i gives the desired cumulated gain, shown in (24).

Thus, we can now express NXCG within ESR. We divide the user gain from hits CG[i] by the desired cumulated gain CD[i]. We call this measure normalized extended cumulated gain in ESR (NSRCG), shown in (25).

$$ \hbox{CG}[i]=E[\hbox{Hits},R_i,A] $$
(23)
$$ \hbox{CD}[i]=i \times l \times E[\hbox{Recall-base},R_i,A]/m $$
(24)
$$ \hbox{NSRCG}[i]={\frac{\hbox{CG}[i]}{\hbox{CD}[i]}} $$
(25)

where \(i \in [1,k]\) is the number of ranks consulted, \(l \in (0,1]\) is the desired recall, and m is the desired effort.

NSRCG does not require an ideality assumption, which allows it (and other measures in the XCG family) to be applied to any SDR search task that can be modelled as tree retrieval.
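
A minimal sketch of the NSRCG calculation (23)-(25) is given below. The expectations E[Hits, R_i, A] and E[Recall-base, R_i, A] are assumed to come from the ESR framework; the numbers used happen to match the worked example given later in Sect. 6.5 (System 3, relevance by length).

```python
# A minimal sketch of CG (23), CD (24) and NSRCG (25). The ESR expectations are
# assumed given; l is the desired recall and m the desired effort (in ranks).

def nsrcg(expected_hits, expected_recall_base, i, l, m):
    cg = expected_hits                        # CG[i], (23)
    cd = i * l * expected_recall_base / m     # CD[i], (24)
    return cg / cd                            # NSRCG[i], (25)

# E[Hits] = 30 characters, recall-base = 50, desired recall l = 100%, desired effort m = 2 ranks.
print(nsrcg(30.0, 50.0, i=2, l=1.0, m=2))  # CD[2] = 50, so NSRCG[2] = 30/50 = 0.6
```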

6.4 Precision-recall with user modeling (PRUM)

Precision-Recall with User Modelling (PRUM) [35] is a measure of whether a user sees a desired number of ideal elements. It is the ratio of the expected number of rank positions where the user gains relevant information by seeing ideal elements to the expected number of rank positions that the user consults to satisfy their information need. Given an information need, a desired recall, and an output, it is defined as follows,

$$ \hbox{PRUM}={\frac{E[\#\hbox{ of rank positions where the user sees ideal elements}]}{E[\#\hbox{ of rank positions consulted}]}} $$
(26)

where average PRUM at the user desired recall-level is used to rank systems.

To see elements, a user either consults the system output or navigates from a retrieved element. PRUM is calculated by enumerating all possible scenarios of consultations and navigations that result in the user seeing the desired number of ideal elements. Let i denote the desired number of ideal elements. Each scenario includes the number of ranks consulted, C(i), and the number of ranks where the user gains relevant information by seeing ideal elements, CL(i). The probability P(S) of each scenario S occurring can be calculated from the navigations taken (and not taken). The expected values are obtained by weighting C(i) and CL(i) by P(S), such that PRUM = E[CL(i)]/E[C(i)].

We illustrate with an example. Consider calculating PRUM for a system that outputs R = \(e_3, e_1, e_4\) for the document shown in Fig. 4a. The calculation of PRUM is as follows, and is summarized in Table 9. Let the ideal elements be \(e_3\) and \(e_4\), which are output at rank positions 1 and 3, respectively. Let the desired recall level be two ideal elements (i = 2). Assume that the only possible navigation is from element \(e_1\) to element \(e_4\). Given this navigation model, there are two possible scenarios:

  1. (A) the user sees \(e_3\) by consulting the ranked list and sees \(e_4\) by navigating from \(e_1\), so the user consults the ranked list two times (C(2) = 2) and there are two ranks (1 and 2) that lead to seeing unique, ideal elements (CL(2) = 2);

  2. (B) the user sees \(e_3\) and \(e_4\) by consulting the ranked list and does not navigate to \(e_4\) from \(e_1\), so the user consults the ranked list three times (C(2) = 3) and there are two ranks (1 and 3) that lead to seeing unique, ideal elements (CL(2) = 2).

Table 9 Example of PRUM \((P(e_1 \rightsquigarrow e_4)=0.2)\)

The probability of each scenario occurring is determined as follows. Let \(P(f \rightsquigarrow e)\) denote user navigation, i.e. the probability (Footnote 10) that the user has seen element \(e\) given that the user has seen element \(f\). Assume the user navigation is \(P(e_1 \rightsquigarrow e_4) = 0.2\). In scenario A, the user navigates to \(e_4\) from \(e_1\), i.e., \(P(A) = P(e_1 \rightsquigarrow e_4)\). In scenario B, the user does not navigate to \(e_4\) from \(e_1\), i.e., \(P(B) = 1 - P(e_1 \rightsquigarrow e_4)\). Table 9 summarizes scenarios A and B, where the row P(S) shows the probability of each scenario occurring, the column EXP shows the expected values of CL(2) and C(2), and the column PRUM shows that the PRUM precision for this example is 0.714.
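
The example in Table 9 can be reproduced with the short sketch below; this is a minimal illustration of the scenario enumeration, not the original PRUM implementation.

```python
# A minimal sketch reproducing the PRUM example of Table 9. Each scenario is a
# tuple (probability, C(2), CL(2)); the navigation probability P(e1 ~> e4) is 0.2.
p_nav = 0.2
scenarios = [
    (p_nav,     2, 2),   # A: e4 reached by navigating from e1, so only ranks 1-2 are consulted
    (1 - p_nav, 3, 2),   # B: no navigation, e4 reached by consulting rank 3
]
e_cl = sum(p * cl for p, _, cl in scenarios)   # E[CL(2)] = 2.0
e_c  = sum(p * c  for p, c, _ in scenarios)    # E[C(2)]  = 0.2*2 + 0.8*3 = 2.8
print(e_cl / e_c)                              # PRUM ~ 0.714
```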

Next, we use ESR to formulate PRUM. For this, we need to express CL and C within ESR. The user's desire is to see non-redundant, relevant information at each rank position consulted. Note that PRUM, as defined in Piwowarski et al. [35], does not consider graded assessments; we therefore assume binary relevance values. User navigation in ESR and PRUM is similar (they differ in their probabilistic interpretation, but both, in general, consider navigation between pairs of nodes). Thus, unlike HiXEval and XCG, PRUM accounts for near-misses as defined in ESR, and the user gain in PRUM is the sum of hits (8) and near-misses (10). The desired number of consultations of the output is equal to this gain because each relevant tree contributes up to 1 to the gain. The desired number of ranks is therefore stated as:

$$ \hbox{CL}(i) = E[\hbox{Hits},R_i,A] + E[\hbox{Near-misses},R_i,A] $$
(27)

where \(i \in [1,k]\) is a rank cut-off, and rel(a) = 1 for relevant trees \(a \in A\), and rel(a) = 0 otherwise.

The number of ranks that the user consults to satisfy a given information need is obtained by calculating the rank cut-off for a given recall level. Let r be the user's desired recall level and let m be the minimum rank cut-off at which this recall level is achieved. This cut-off is calculated using ESRR(R, A) (14) by evaluating ESRR across rank cut-offs \(m \in [1..k]\) until ESRR(\(R_m\), A) is greater than or equal to the desired recall r:

$$ C = m, \hbox{ where } \hbox{ESRR}(R_m,A) \geq r $$
(28)

Precision in PRUM using ESR (SRPRUM) is the ratio between the desired number of ranks to achieve a given recall-level CL(C) and the rank cut-off C where a given recall-level is achieved, which is

$$ \hbox{SRPRUM} = \hbox{CL}(C)/C $$
(29)

where CL(C) (27) is the desired number of consultations of the output to achieve recall r, and C (28) is the actual number of consultations to achieve recall r.

ESR does not rely on ideality, and neither does SRPRUM. Similarly, assuming binary relevance of judged trees, SRPRUM can be applied to any (SDR and beyond) search task that can be modelled as tree retrieval.
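
The SRPRUM computation (27)-(29) can be sketched as follows. The per-cut-off values ESRR(R_i, A) and CL(i) are assumed to come from the ESR expectations; the values for cut-offs 1 and 2 match the illustrative example in Sect. 6.5, while the value at cut-off 3 is hypothetical.

```python
# A minimal sketch of SRPRUM (27)-(29): find the smallest cut-off C whose ESRR
# reaches the desired recall, then divide CL(C) by C.

def srprum(esrr, cl, desired_recall):
    """SRPRUM = CL(C)/C, where C is the smallest cut-off with ESRR >= desired recall."""
    for c in range(1, len(esrr) + 1):
        if esrr[c - 1] >= desired_recall:   # (28)
            return cl[c - 1] / c            # (29)
    return None                             # the desired recall is never reached

esrr = [0.500, 0.555, 1.000]    # ESRR at cut-offs 1, 2, 3 (cut-off 3 hypothetical)
cl   = [1.00, 1.11, 2.00]       # E[Hits] + E[Near-misses] at cut-offs 1, 2, 3
print(srprum(esrr, cl, 0.555))  # C = 2, CL(2) = 1.11, SRPRUM = 0.555
```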

6.5 Calculating SR, HiXEval, XCG and PRUM using ESR

Let us now continue our illustrative example of ESR evaluation. In Sect. 4.4, we introduced a toy collection (Fig. 4a), a navigation model for the collection (Table 5), relevance judgments for both binary relevance and relevance by length for nodes \(e_3\) and \(e_4\) (Table 4), and three example outputs R1, R2, and R3 (Table 6), where System 3 (R3) is the best system, System 1 (R1) is the second-best system, and System 2 (R2) is the worst system. In Sect. 5, we calculated the ESR expected relevance value gain from hits, misses and near-misses across the rank positions of each example system output (summarized in Table 8). We now demonstrate how these expectations are used to calculate the ESR measures proposed in this section.

Table 10 Summary of SDR measures

Let us begin our demonstration by calculating SR, HiXEval, XCG and PRUM for System 3 (R3 = \(e_3, e_1, e_4\)) at rank cut-off k = 2. We first calculate structural relevance in precision (ESRP) and structural relevance in recall (ESRR). Assume binary relevance. The expected relevance values for \(R3_2\) are found in Row 10 of Table 7. The sum of the hits in Row 10 is E[Hits, \(R3_2\), A] = 1 + 0 = 1. The sum of the near-misses in Row 10 is E[Near-misses, \(R3_2\), A] = 0 + 0.11 = 0.11. The sum of the misses in Row 10 is E[Misses, \(R3_2\), A] = 0 + 0.89 = 0.89. From Row 5 of Table 8, the recall-base is E[Recall-base, \(R3_2\), A] = 2. Precision is ESRP(R3, A)@2 = E[Hits, \(R3_2\), A]/2 = 0.5 (13) and recall is ESRR(R3, A)@2 = (1 + 0.11)/2 = 0.555 (14). Table 11 tabulates ESRP and ESRR across rank cut-offs for all systems.

Table 11 Structural relevance in precision (ESRP) and recall (ESRR)

Next, we calculate interpolated precision (SRiP) and recall (SRiR) in HiXEval. Assume relevance by length. The expected relevance values for \(R3_2\) are found in Row 10 of Table 7. The sum of the hits is E[Hits, \(R3_2\), A] = 30 + 0 = 30. The recall-base is E[Recall-base, \(R3_2\), A] = 50, from Row 5 of Table 8. Interpolated precision is SRiP@2 = 30/130 = 0.231 (18). Interpolated recall is SRiR@2 = 30/50 = 0.6 (19). Table 12 tabulates SRiP and SRiR across rank cut-offs for all systems.

Table 12 Precision and recall for HiXEval using ESR

Next, we use normalized cumulated gain in ESR (NSRCG) to calculate XCG. Again, assume relevance by length. The expected relevance values for \(R3_2\) are found in Row 10 of Table 7 and the recall-base in Row 5 of Table 8. Let the desired recall be l = 100% and the desired effort be m = 2 ranks. The desired cumulated gain is CD[2] = 2 × 100% × 50/2 = 50 (24). The user's expected gain is CG[2] = 30 (23). Normalized cumulated gain is NSRCG[2] = CG[2]/CD[2] = 30/50 = 0.6 (25). Table 13 tabulates CD, CG, and NSRCG across rank cut-offs for all systems.

Table 13 XCG using ESR

Finally, we calculate PRUM using precision-recall with user modelling in ESR (SRPRUM). Assume binary relevance values. The expected relevance values for \(R3_2\) are found in Row 10 of Table 7 and the recall-base in Row 5 of Table 8. Let the required number of ranks to achieve the desired recall be C = 2 (28), which corresponds to a desired recall of l ≥ 0.555 using ESRR in Table 11. The expected number of rank positions where the user gains relevant information at C = 2 is CL(2) = 1 + 0.11 = 1.11 (27). Precision-recall with user modelling in ESR is SRPRUM = 1.11/2 = 0.555 (29). Table 14 tabulates SRPRUM at a desired recall of l = 100% for all systems.

Table 14 PRUM using ESR
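
For completeness, the calculations above for System 3 at cut-off k = 2 can be reproduced with the following sketch. The expectations are those quoted in this section; the split of the 130 retrieved characters into two per-tree sizes is hypothetical.

```python
# A minimal sketch reproducing the worked example for System 3 (R3) at cut-off k = 2.
k = 2
hits_bin, near_bin, recall_base_bin = 1.0, 0.11, 2.0   # binary relevance
hits_len, recall_base_len = 30.0, 50.0                 # relevance by length
tree_sizes = [100, 30]                                 # hypothetical per-tree sizes (sum = 130)

esrp   = hits_bin / k                                  # (13) -> 0.5
esrr   = (hits_bin + near_bin) / recall_base_bin       # (14) -> 0.555
srip   = hits_len / sum(tree_sizes)                    # (18) -> 0.231
srir   = hits_len / recall_base_len                    # (19) -> 0.6
nsrcg  = hits_len / (k * 1.0 * recall_base_len / 2)    # (25) with l = 100%, m = 2 -> 0.6
srprum = (hits_bin + near_bin) / 2                     # (29) with C = 2 -> 0.555
print(esrp, esrr, srip, srir, nsrcg, srprum)
```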

Tables 11, 12, 13, and 14 show our results for ESRP/ESRR, SRiP/SRiR, NSRCG, and SRPRUM, respectively, for all systems. We observe the following. System 2 does not retrieve any hits. At rank cut-off 3 (the maximum recall for all systems), the system ranking for SR, HiXEval and PRUM is \(R3 \succ R1 \succ R2\), where \(\succ\) denotes that the left-hand system performs better than the right-hand system. Using XCG, Systems 1 and 3 are tied and the ranking is \((R3, R1)\succ R2\). As expected from Sect. 4.4, both Systems 1 and 3 outperform System 2 for all measures. Similarly, in terms of precision, System 3 outperforms System 1 and is the best system; this also holds under XCG. Overall, we obtain the ranking of systems predicted in Sect. 4.4.

6.6 Discussion

In this section, we have demonstrated how current SDR measures can be formulated and calculated in ESR. Table 10 summarizes the original measures (as proposed at INEX) and the corresponding ESR measures. When formulated within ESR, the resulting measures are not necessarily exact equivalents of the originals. We have, however, shown in our previous work on SR [57], upon which ESR is based, that the probabilistic approach presented here for measuring precision is a reliable performance measure with respect to both XCG and HiXEval. The goal of this work is to provide a framework in which new measures for SDR evaluation can be developed. Our intention in this section was to show how current SDR measures could have been expressed directly within our ESR framework. Our future work will be to further refine our proposals where needed (e.g. accounting for near-misses and overlap in INEX measures), and then to fully validate the ESR framework across these and additional measures and search tasks.

The benefit of ESR is that we now have SDR measures that are inherently comparable because they rely on the same set of parameters, namely E[Hits, R, A], E[Misses, R, A], E[Near-misses, R, A], and E[Recall-base, R, A] (8, 9, 10, and 11, respectively). In addition, ESR provides a convenient way for measures to share common models of relevance value and user navigation; it provides a means to apply evaluation approaches developed for different paradigms to tree retrieval; and it allows system performance to be compared both in general terms of how users gain relevant information (via hits, misses and near-misses) and in terms of how a system fulfils a specific search task (via task-specific measures). We believe that the flexibility to support such varied measures in a single framework is an important advancement for the development and evaluation of complex search tasks, many of which are still to come.

7 System rankings using ESR measures

In this section, we compare our ESR measures to their originals by evaluating three SDR tasks, namely Focused (2006, 2007), Best In Context (2006), and Relevant In Context (2007), all carried out at INEX. To calculate user navigation probabilities, we use the depth-weighted summary model of navigation described in Sect. 4.2 (originally proposed and validated in Ali et al. [7]). For each ESR measure tested, we compared the system rankings from ESR to the official INEX results using Kendall’s Tau, a common way to compare rankings of systems in information retrieval evaluation. Kendall’s Tau (τ) indicates whether two rankings, generated in our case by two evaluation measures (an ESR measure and its original INEX counterpart), are positively (τ > 0) or negatively (τ < 0) ordered. The p value is the probability of observing the computed τ under the null hypothesis that the rankings are uncorrelated; if the p value is < 0.05, we consider the two measures correlated in terms of how they order systems. Thus, in comparing system rankings for a current measure versus a reference measure, the rankings will be positively correlated, negatively correlated, or not correlated. A positive correlation implies that our ESR measure is an appropriate representation of its original INEX counterpart.
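
The ranking comparison can be sketched as follows; this is a minimal illustration using SciPy's Kendall's Tau, with hypothetical per-system scores rather than the actual INEX runs.

```python
# A minimal sketch of the ranking comparison: Kendall's Tau between the system
# ordering induced by an ESR measure and by its INEX counterpart (scores hypothetical).
from scipy.stats import kendalltau

esr_scores  = {"sysA": 0.42, "sysB": 0.10, "sysC": 0.55}   # e.g. an ESR score per run
inex_scores = {"sysA": 0.40, "sysB": 0.15, "sysC": 0.60}   # e.g. the official INEX score per run
systems = sorted(esr_scores)
tau, p_value = kendalltau([esr_scores[s] for s in systems],
                          [inex_scores[s] for s in systems])
print(tau, p_value)   # tau > 0 with p < 0.05 indicates the measures rank systems consistently
```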

To calculate recall points for the ESR measures SRiP (18) and SRiP2 ((30), introduced below), we selected SRiR2 (31). To calculate recall points for the ESR measure ESRP (13), we selected ESRR (14).

We noted in Sect. 5 that HiXEval and XCG, like SRP, do not consider near-misses. For comparative purposes, we propose ESR formulations of HiXEval and XCG where near-misses are included. Interpolated precision and recall in HiXEval including near-misses (SRiP2 and SRiR2, respectively) are:

$$ \hbox{SRiP}2(R,A)={\frac{E[\hbox{Hits},R,A]+E[\hbox{Near-misses},R,A]} {\sum_{i=1}^{k}{size(t_i)}}} $$
(30)
$$ \hbox{SRiR}2(R,A)={\frac{E[\hbox{Hits},R,A]+E[\hbox{Near-misses},R,A]} {T_{rel}}} $$
(31)

Similarly, normalized extended cumulated gain (NSRCG2) is:

$$ \hbox{CG}[i]=E[\hbox{Hits},R_i,A]+E[\hbox{Near-misses},R_i,A] $$
(32)
$$ \hbox{CD}[i]=i \times l \times E[\hbox{Recall-base},R_i,A]/m $$
(33)
$$ \hbox{NSRCG}2[i]={\frac{\hbox{CG}[i]}{\hbox{CD}[i]}} $$
(34)

where \(i \in [1,k]\) is the number of ranks consulted, \(l \in (0,1]\) is the desired recall, and m is the desired effort. By including near-misses, these updated ESR formulations of XCG and HiXEval capture user navigation, which was not the case with our initial formulations of nXCG and iP/iR (i.e. NSRCG and SRiP/SRiR). In this work, recall points for ESR HiXEval measures are calculated using SRiR2 (31). Finally, mean-average precision is calculated over 101 recall points.

7.1 Focused task (2006): nXCG and iP

In the Focused Task, a system is tasked with retrieving non-overlapping, focused document parts. The reported INEX measures for this task in 2006 are MAiP (mean-averaged iP (16)) at rank cut-off k = 1000, and nXCG (22) at rank cut-off k = 10. For nXCG, the reported INEX measures are further sub-divided into nXCG ON (the user does not tolerate overlapped text) and nXCG OFF (the user tolerates overlapped text). As in Sect. 6.2, overlap ON means α = 1 and overlap OFF means α = 0. MAiP is reported with overlap ON.

We evaluated 43 runs across 107 topics using the ESR measures NSRCG (25) at rank cut-off k = 10, NSRCG2 (34) at rank cut-off k = 10, MASRiP (18) at rank cut-off k = 1000, and MASRiP2 (30) at rank cut-off k = 1000. The 43 systems included the top-30 officially best systems (as determined using nXCG ON) and 13 randomly selected systems.

Tables 15 and 16 show the system ranking comparison results between the original (INEX) and ESR measures using Kendall’s Tau and p value (in parentheses). Table 15 shows that the system rankings from NSRCG are negatively ordered (τ < 0) with nXCG OFF (where the user tolerates overlap). With overlap ON, the system rankings via NSRCG are not correlated (p value > 0.05). The system rankings via NSRCG2 are positively ordered (τ > 0) with nXCG OFF but not correlated (p value > 0.05). With overlap ON, however, NSRCG2 is positively ordered (τ > 0) with nXCG and correlated (p value < 0.05). Table 16 shows that the system rankings via MASRiP and MASRiP2 are positively ordered and correlated (τ > 0, p value < 0.05) with those via MAiP.

Table 15 INEX 2006 Focused Task, XCG, 107 topics, 43 Systems
Table 16 INEX 2006 Focused Task, MAIP, 107 topics, 43 Systems

Thus, in the focused task, neither NSRCG nor NSRCG2 is an appropriate representation of nXCG OFF; with respect to nXCG ON, NSRCG2 is an appropriate representation whereas NSRCG is not. In addition, both MASRiP and MASRiP2 are appropriate representations of iP with overlap ON. Based on the negative ordering of the ESR measures against nXCG with overlap OFF, we theorize that ESR, as defined thus far, does not capture users who tolerate seeing overlapped information. This is likely because, in this work, we have limited user navigation to navigation between nodes and not within the same node.

7.2 Focused task (2007): iP

This task is the same as the 2006 task above. However, the official measure for this task is iP (16) at recall point 0.01, calculated using iR (17) with overlap ON, i.e. α = 1. We evaluated 77 runs across 102 topics using the ESR measures MASRiP (18) at rank cut-off k = 1000; SRiP (18) at recall point 0.01; MASRiP2 (30) at rank cut-off k = 1000; and SRiP2 (30) at recall point 0.01.

Table 17 shows the system ranking comparison results between the original (INEX) and ESR measures. MASRiP, MASRiP2, SRiP, and SRiP2 are all positively ordered and correlated (τ > 0, p value < 0.05) with iP. The mean-averaged ESR measures (MASRiP and MASRiP2) have higher rank correlation (higher τ) than their corresponding rank cut-off measures (SRiP and SRiP2, respectively). These results agree with our findings in Sect. 7.1 that SRiP and SRiP2 are appropriate representations of iP for the focused task.

Table 17 INEX 2007 focused task, 102 topics, 77 systems

7.3 Best in context task (2006): EPRUM

In this search task, a system is asked to retrieve the single, most focused, relevant part of a document. The official INEX measure for this task is EPRUM (Footnote 11) [33] (introduced in Sect. 2), which is a simplified version of PRUM. Navigation in EPRUM uses a proximity measure based on a scalar parameter A, representing the distance, in the document, that a user will navigate from a given entry point to locate relevant information. A = 0.1 refers to a user willing to navigate only to information very close to the entry point, whereas A = 100 refers to a user willing to navigate to information much further away from the entry point. The reported INEX measures include A = 0.1, 1, 10, 100.

We evaluated 64 runs across 107 topics using the ESR measures SRPRUM (29) and ESRP ((13), mean-averaged across recall points). We used ESRR (14) for calculating recall points. Table 18 shows the system ranking comparison results between the original (INEX) and ESR measures. The system rankings from SRPRUM are positively ordered and correlated (τ > 0, p value < 0.05) with EPRUM for all values of A. The system rankings from ESRP are positively ordered (τ > 0) but not correlated (p value > 0.05) with EPRUM for all values of A. We conclude that SRPRUM is an appropriate representation of EPRUM for the best in context task.

Table 18 INEX 2006 best in context task, 107 topics, 64 systems

7.4 Relevant in context task (2007): MAgP

In this search task, the system is tasked with retrieving focused answers grouped per document. The official measure for this task is MAgP, i.e. generalized precision mean-averaged across recall points, where recall is measured using generalized recall (gR) [21]. In HiXEval, generalized measures have been proposed to account for near-misses. These measures extend the interpolated measures by accounting for user gain at the document level. For instance, if a system retrieves a passage with x relevant characters from a document containing y relevant characters, then, depending on how navigation is modelled, the user is modelled as seeing between x and y relevant characters from that document. This is akin to how, in this work, navigation \(p(t_i;t_j)\) (in (4)) is only non-zero between sub-documents in the same document. In this way, navigation in the generalized measures goes beyond retrieved text passages and is more akin to near-misses as defined in ESR. This approach addresses the navigational limitations mentioned in Sect. 6.2 in the case where documents in the collection can be considered as single passages (such as Wikipedia articles). It remains to be seen, however, whether this approach can address cases where documents cannot be considered as single passages, such as online books or semantically linked data.

We evaluated 77 runs across 102 topics using the ESR measures MASRiP (18) at rank cut-off k = 1000 and MASRiP2 (30) at rank cut-off k = 1000. Table 19 shows the system ranking comparison results between the original (INEX) and ESR measures. Both MASRiP and MASRiP2 are positively ordered and correlated (τ > 0, p value < 0.05) with MAgP. We conclude that MASRiP and MASRiP2 are appropriate representations of MAgP for the relevant in context task.

Table 19 INEX 2007 relevant in context task, 102 topics, 77 systems

7.5 Summary

In this section, we compared our ESR formulations of XCG, (E)PRUM and HiXEval against their counterpart INEX measures. We summarize our results in Table 20.

Table 20 Final results (Yes: τ > 0.25, No: otherwise)

For XCG, (E)PRUM, and HiXEval, appropriate ESR representations exist for all tasks except the focused task where the user tolerates overlap (i.e., overlap OFF). In INEX, overlap is evaluated using a factor on the gain that represents the user’s tolerance for seeing relevant, retrieved information more than once [44]. As we have defined ESR, gain is not explicitly penalized for overlapped results; instead, we rely on the notion of redundancy to account for it. We theorize that the navigation model used in this study does not adequately address overlap as considered at INEX. A better approach would be to sub-divide the ESR partition for retrieved relevant text in the collection, i.e., E[Hits, R, A] in (8), into overlapped and non-overlapped relevant text. This would allow us to introduce an overlap parameter α akin to the one used in HiXEval and XCG. We leave this for future work.

Our results nevertheless demonstrate the advantage of ESR’s common basis for performance, i.e. hits, misses and near-misses derived from a model of relevance, navigation and redundancy. ESR allows us to generalize the calculation of performance across approaches (e.g., HiXEval, XCG, and PRUM). Contrasting measures across tasks allows us to isolate problems in evaluation (e.g., overlap), and such problems can be addressed in ESR by refining the common basis.

8 Conclusions and future work

In this paper, we proposed a general framework, called Extended Structural Relevance (ESR), in which to express evaluation measures for SDR. This paper follows from our previous work [5] on evaluating tree retrieval, of which many of the current search tasks in SDR are special cases. In that work, we identified three main pillars for evaluating the performance of SDR systems, namely relevance, navigation and redundancy. ESR incorporates relevance, navigation and redundancy into a single probabilistic framework, and thus allows us to calculate the user’s expected gain in relevant information, accounting for hits, misses and near-misses. We use these expectations as parameters defining a basis from which to formulate evaluation measures for SDR.

Our aim was to overcome a main drawback that arose from the development of task-specific measures in SDR, i.e., current SDR measures of performance cannot easily be compared with respect to each other and across search tasks. Our experimental results validated that task-specific measures at INEX, namely SR, PRUM, HiXEval and XCG, can be formulated and calculated using ESR. Two outstanding methodological issues to be addressed in future work are how to further refine the assessment of the relevance of sub-documents [36] and how to represent task-specific issues, such as tolerance to overlap [44], in ESR.

ESR is the first framework of its kind in the literature. ESR measures are comparable with respect to each other because they share a common basis for defining the way they consider relevance, user navigation and redundancy. The framework provides insights into how measures relate and differ, which is not easily achieved with current SDR measures.

We believe that relevance, user navigation and redundancy are also of concern to search tasks outside of SDR. For instance, within the context of semantic web search systems (i.e. searching collections of RDF documents [18]), we are investigating how ESR can be applied to evaluate systems that search collections that do not contain structured documents but, instead, structured information (e.g., semantic associations and ontologies), where navigation also plays an important role. Comparing the relative effectiveness of semantic web search systems using classical precision and recall is a well-known challenge [3, 4, 16]. Our belief is that our ESR framework can serve as a basis to define measures for evaluating search tasks across SDR, semantic web search of RDF collections, and many other areas of information access. Finally, the ESReval package, written in Java, implements all of the measures presented in this work and is available upon request from the authors.