1 Introduction

Computer users encounter content with a hierarchical structure on a daily basis. For instance, computer files are grouped into folders and subfolders, programming classes are grouped into libraries and packages, and text in a digital book is grouped into chapters and sections. We can consider such corpora as belonging to the same content class: content that is restricted to some finite domain and arranged as a tree. Because the arrangement is performed by humans, there is little information redundancy and the organization is often meaningful and coherent, and thus it may be worth using when answering search queries.

On the web, search queries serve three different intents: to discover new information, to perform some web-mediated transaction, or to navigate to the location of a known site (Broder 2002). This contrasts with the above content class, where queries are mostly navigational. For instance, a laptop user may search their hard drive for a specific file; a programmer may search an API for a specific class or function; a lawyer may search the US Code for a specific legal clause; a database researcher may search the DBLP for a specific paper or venue; and a reader may search a digital book for a specific section or page. In each case a traditional content-only search could return numerous matches, whereas the user is interested in only a specific few.

Because of the user’s presumed foreknowledge and familiarity with the content domain, their queries constitute known-item search: the user may anticipate whether a query result exists, may have expectations of what the desired result must look like, or may have seen the result before and wants to locate it again. The goal of Parameterized Filesystem HITS (PFH) is to determine which matches in the tree the user is more likely trying to reach. We do this by favoring matching nodes that are structurally closer to other matching nodes while penalizing isolated matches. In this article we implement this idea and investigate its effectiveness.

On the web, navigational queries are addressed by algorithms such as HITS (Kleinberg 1999) and PageRank (Page et al. 1998), which analyze the web hyperlink graph to determine which pages exhibit higher degrees of popularity, or authority. Authoritativeness allows, for example, the EBay site to rank first for the query “ebay” despite millions of sites matching that keyword. The presence of hyperlinks enables such graph analyses on the web. In contrast, the corpora we are interested in tend to lack explicit links. In some cases links exist—such as a symlink in a filesystem or a cross-reference in a book—but these features are not as prevalent as web hyperlinks and their use (or non-use) is domain-specific and varies from corpus to corpus. In order to address the problem of known-item search in hierarchical content in general, such domain-specific linkage should not be relied on. Therefore we are interested in analyzing only structural features related to the tree-like organization.

Prior research in hierarchical content has focused on Desktop Search (DS)—an area related to Personal Information Management—where users use computer software to search for files in their personal computers. There are many commercial examples of such software, including applications from Google, Microsoft and Yahoo!. However, the commercial DS tools perform pure content analysis and ignore the filesystem organization. Research works in the area have also ignored the structure and relied on other features to enhance search results (Sect. 3).

However, it is well known that folder structures play an important role in how people organize their files. Based on the premise that users organize their files in a meaningful way and typically place related files close together, we drew an analogy in (Penev et al. 2006) that folders play the role of hubs and files of authorities, with similar interpretation as Kleinberg’s HITS. The reinforcing relationship still holds: good folders should contain good files and good files should be in good folders. The same idea generalizes to technical documents and books, and perhaps to other content hierarchies. For instance, in a book the authorities could be paragraphs and the hubs could be the chapters, sections, subsections (and so on) that arrange the paragraphs into a tree structure.

The motivation of this work, which stretches beyond DS, is to improve known-item search in human-arranged content hierarchies in general. Our solution is to adapt a web connectivity analysis paradigm from HITS in order to discover more useful and query-relevant nodes in the content tree. We use this idea to formulate PFH, an algorithm to rearrange the results of an existing content-only retrieval engine by including some analysis of structure.

The rearrangement is performed by combining the content-only component score with PFH’s structural score component to produce a new score. An α parameter is used to guide the combination by varying the weight of the two components, multiplying one by α and the other by (1 − α). This allows us to produce many different rearrangements, from pure content (α = 1) down to pure structure (α = 0) and steps in between. The effects of α are further described in Sect. 4.2, but the reader can interpret it as a slider: when it is 0, PFH ignores content and uses only structure scores, and vice versa when it is 1. A value of \(\alpha={\frac{1}{2}}\) represents an equal weighting of both components.
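The slider behaviour of α can be illustrated with a minimal sketch. The function name `combine` and the sample scores are hypothetical and serve only to show the linear weighting described above; the actual structural score is defined later in Sect. 4.

```python
def combine(content, structure, alpha):
    """Blend a content score and a structure score with weight alpha.

    alpha = 1 keeps only content, alpha = 0 keeps only structure.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * content + (1.0 - alpha) * structure


# Hypothetical scores for a single node, blended at three slider positions.
for a in (1.0, 0.5, 0.0):
    print(a, combine(0.4, 0.8, a))
```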

The experiments show that PFH with α = 0.8 significantly boosts known-item search accuracy over the baseline while keeping other IR metrics steady. The optimal α value may differ among separate domains, but we empirically show that there exist useful values of α that are not 1, or in other words, where at least some structure is taken into account. In practice, an optimal α should be determinable for any structured corpus given a set of representative queries and relevance judgments.

Similar to how web search engines answer navigational queries, a key idea in PFH is to favor query-matching nodes that are closer to other query-matching nodes. Consider the simple tree in Fig. 1.

Fig. 1 A simple filesystem

Suppose the leaf nodes \(f_1, f_2, f_3, f_4\) are files and the other nodes are folders, and suppose that, for some query, a search tool retrieves the files with content scores of 0.25, 0.26, 0.33 and 0.34, respectively. How should we rank them? The obvious order is by score, with \(f_4\) first. But we notice that \(f_4\) is somewhat isolated while the other nodes are close together, possibly indicating that the neighborhood formed by the folders a, c and d is more relevant to the query overall. When the difference in content score between \(f_4\) and \(f_3\) is so small, we can justify placing \(f_3\) ahead because we expect the structural arrangement to reveal additional context that is external to the content. For example, suppose an unknown file with an unknown binary format is in a folder where the majority of other files relate to the user’s recent conference in Germany; despite the file being unreadable, we can guess that it is also relevant to the German conference simply because it is very close to many other files on that subject.

In Fig. 1, it would take only a small consideration of structure to rank \(f_3\) ahead of \(f_4\). As α decreases, we start placing less emphasis on content and more on structure, which means that \(f_1\) and \(f_2\) may also outrank \(f_4\) at some point. Of course, the user may have been seeking \(f_4\) all along, so the MRR metric can assess at which α levels PFH is making useful rearrangements.

The remainder of this article is organized as follows. First, some background on hierarchical content and connectivity analysis is given in Sect. 2. In Sect. 3 we summarize related work. PFH is described and constructed in Sect. 4, and then evaluated in Sect. 5 where we aim to find both useful and ‘safe’ α values that either neutrally or positively affect several key IR metrics. Section 6 concludes.

2 Background

2.1 Hierarchical content and desktop search

A filesystem is perhaps the most commonly searched content hierarchy. Modern operating systems have built-in DS functionalities, but there are also many commercial DS tools that maintain their own search indices. If a user is after a specific file on her computer, she can type a query into a DS tool and view a list of matches. Most tools will allow her to sort the results by metadata, such as size, date and name. If she is unsure or has forgotten, she may prefer to rank the results by relevance. Both ‘by date’ and ‘by relevance’ are popular sort criteria for DS (Dumais et al. 2003). We are only interested in rank-by-relevance because if the user remembers her target’s metadata then she does not need the results to be deeply analyzed in the first place.

While PFH may be used for Desktop Search, there are additional adjustments that must be performed to cater to some idiosyncrasies of the desktop environment. For instance, one adjustment would be that certain node types, such as an email.mbox file, are not suitable as search results and will require special handling during parsing and indexing. Another adjustment is that “dump folders”, where users often temporarily store unrelated files, also require special handling since they represent haphazard and meaningless organization (Footnote 1).

Although this article talks extensively about filesystems and Desktop Search, PFH avoids using any desktop-specific features and is intended for hierarchical corpora in general. The common feature of hierarchically-arranged content is that the nodes form a tree. Being a tree, it is also a graph.

2.2 Connectivity analysis

Connectivity Analysis measures how well-connected a node is in a graph. The web, a connected graph of hyperlinked pages, is regularly subjected to such analysis using algorithms typically based on the core theoretic principles of PageRank and HITS (Chakrabarti et al. 1998, 1999; Crimmins 2000; Kumar et al. 1999; Lempel and Moran 2001; Marendy 2001).

Both algorithms have a similar aim, to find the top authorities, but we consider HITS’s principles more suited to adaptation for hierarchical content. The reasons for this are threefold. Firstly, PageRank models both the web’s structure and a user’s browsing behavior with the ‘random surfer model’. Unfortunately, behavior is difficult to specify for hierarchical corpora. For instance, does a user browse a book in the same way as they browse a filesystem, or do they browse technical documents on different subjects in a different way? HITS models only structure and is thus easier to adapt and parameterize. Secondly, HITS’s concepts of authorities and hubs have elegant mappings to the concepts of content nodes and structural nodes, making it closely resemble the tree structure we are dealing with. Thirdly, PageRank computes only authority scores while HITS computes both authorities and hubs. The latter gives us an additional piece of information that we can use for other tasks, such as categorizing authorities into their most appropriate hubs (Sect. 5.4). Future work may investigate the relative merits of using PageRank instead of HITS, but for this initial work we investigate only HITS:

HITS: calculates hubs/auths on the web.

Inputs: webpages V and hyperlinks E.

Outputs: hub scores \(\mathbf{H}\), authority scores \(\mathbf{A}\).

1. \(\forall v \in V\), initialize \(H_0[v] := A_0[v] := 1\).

2. for iteration k until convergence:

3.  \(\forall v \in V\), set \(H_k[v] := \sum_{o:(v\to o) \in E} A_{k-1}[o]\)

4.  \(\forall v \in V\), set \(A_k[v] := \sum_{i:(i\to v) \in E} H_{k-1}[i]\)

5.  normalize\(_{L_1}(\mathbf{H}_k)\) and normalize\(_{L_1}(\mathbf{A}_k)\).

6. end for

7. return \(\mathbf{H}\) and \(\mathbf{A}\)

The HITS algorithm posits that a web page serves two purposes: to be a hub or to be an authority. An authority is a page perceived to contain relevant content for some query or topic, while a hub is a page linking to authorities. Hubs and authorities are in a mutually-reinforcing relationship in which a good hub should link to good authorities and a good authority should be linked to by good hubs. A good hub itself may also be a good authority, and vice versa.

HITS proceeds as follows. First, a small root set of pages is retrieved from a search engine for the query. This query-focused subgraph may be highly disconnected, so a base set is formed by adding a quota of pages ±1 hop away. This step allows relevant pages that do not include the keywords to still be ranked (since they are part of this neighborhood) and the newly added pages provide extra edges to induce an adjacency matrix M on the base set digraph. The nodes are then assigned a non-zero hub and authority score and HITS proceeds to iteratively “shake” the graph to distribute weight along the edges until equilibrium is reached. Under the Perron-Frobenius theorem, the iteration converges to the dominant eigenvectors of \(M^T M\) and \(MM^T\), returning them as hub and authority scores.

In practice, the number of iterations is fixed to a small constant since the top scores stabilize very quickly (Bharat and Henzinger 1998; Kleinberg 1999). Combined with content analysis, HITS produces better results than matching only content in linked document spaces. Examples of such spaces include the body of academic research, where edges are citations, and the World Wide Web.
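The iteration listed in the HITS pseudocode can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names, the dictionary-based graph representation and the toy graph are assumptions, and the fixed iteration count follows the remark above that a small constant suffices in practice.

```python
def l1_normalize(scores):
    """Scale scores so that they sum to 1 (left unchanged if all are zero)."""
    total = sum(scores.values())
    return {v: s / total for v, s in scores.items()} if total else dict(scores)


def hits(nodes, edges, iterations=20):
    """Run HITS on a directed graph whose edges are (source, target) pairs."""
    hub = {v: 1.0 for v in nodes}
    auth = {v: 1.0 for v in nodes}
    for _ in range(iterations):
        new_hub = {v: 0.0 for v in nodes}
        new_auth = {v: 0.0 for v in nodes}
        for src, dst in edges:
            new_hub[src] += auth[dst]  # a hub sums the authorities it links to
            new_auth[dst] += hub[src]  # an authority sums the hubs linking to it
        hub = l1_normalize(new_hub)
        auth = l1_normalize(new_auth)
    return hub, auth


# Toy graph: h1 links to both a1 and a2, h2 links only to a1.
hub, auth = hits(["h1", "h2", "a1", "a2"],
                 [("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
print(sorted(auth, key=auth.get, reverse=True)[:2])
```

Both updates read the scores from the previous iteration, matching steps 3 and 4 of the listing.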

3 Related work

This article bridges three areas of IR research: file organization, intelligent Desktop Search applications and web connectivity analysis.

3.1 File organization

Window managers use the conventional desktop metaphor and users are expected to arrange their files using folders and subfolders. There is no reason that this metaphor should be optimal, but it is the de facto standard. Despite this being such a common activity, multiple works have noted a lack of research on how users arrange and search files (Henderson 2005; Ravasio et al. 2004; Whittaker et al. 2000). Older studies have generally conducted small-scale user interviews but have become outdated due to many changes in modern window managers and personal repository sizes. Nevertheless, several studies support the intuition that latent judgment and meaning are encoded in the hierarchy. (Jones et al. 2005) concluded that folders and subfolders “represent an emergent understanding of the associated information items and their various relationships”. (Henderson 2005) found that users created folders to group files predominantly by genre, task and topic. In (Boardman and Sasse 2004), users had “extensive folder structures” and filed items into folders immediately upon creation. Folders were used to group related files based on projects, document types and long term roles. (Ravasio et al. 2004) reported that “a fair amount of effort was invested … in creating elaborate file system structures and in labelling them adequately so as to support the unveiling of the content’s meaning and relation”. They also noted that new subfolders were created “where the user felt that it was important to keep an overview” and for “documents on the same subject.” Therefore we know that a file-folder content hierarchy has meaning and plays an important role in organization. This is also true of some other types of human-arranged hierarchical content.

3.2 Intelligent DS tools

The DS literature is split into works that observe user behavior and works that attempt to enhance the results using some non-trivial method. A well-known early study in behavior was by (Dumais et al. 2003), who observed how 200 Microsoft employees searched their desktops; a few of their observations are cited elsewhere within this article.

A more recent study by (Cohen et al. 2008) recorded a log of DS interactions of 19 users. They then performed machine learning on the large log data to determine a ranking method that minimized ranking errors. Their resulting method relied on a combination of content match, path match and many other metadata features, such as a file’s modified date, accessed date, created date, file type and file size, as well as some personalization based on previous clickthroughs on similar queries. They noted that the most important features for ranking were the three types of dates (modified, accessed, created) and file name. This suggests that their test subjects often knew the metadata associated with the file they wanted. Cohen et al. additionally trialled a non-learning method based on optimizing selectivity, but still relied on personalization from the query log for this method. Therefore it is unclear how their approach works in more general settings, or for other content hierarchies where many of the metadata features are not applicable (Footnote 2).

We do not compare PFH to the results in (Cohen et al. 2008) for two reasons. First, PFH is not intended solely for DS and thus cannot use any of their domain-specific features. Second, their results are difficult to compare against because their queries and datasets are not shareable due to privacy reasons. Nevertheless, PFH already has a suitable baseline for comparison in its underlying engine.

Another related DS tool is Beagle++ (Footnote 3), which used a schema to introduce semantic relationships between files. The relationships were assigned weights and evaluated using ObjectRank (Balmin et al. 2004). The goal was to use file associations, such as an email to its attachment, to improve Recall by retrieving files that a content-only retrieval would miss. However, the approach is limited in that it cannot help with many searches: not all queries are for email attachments and some queries would not match any schematic relationship. In such cases the MRR of Beagle++ may be worse than that of a generic content-only search because, in an attempt to improve Recall, it may add search noise to the results.

PFH differs from Beagle++ in two ways. First, PFH does not affect Recall and does not risk introducing new noise to the results. Effectively, all of its results should be query-relevant, at least according to the underlying retrieval engine. Second, PFH is only a rearrangement algorithm and must work in tandem with a separate retrieval engine. This means that it can be used for building search systems over other types of hierarchical corpora, whereas Beagle++ is specific to desktops. For example, in our experiments we use the file tree of a popular technical document, but there are no existing features in our corpus that would help Beagle++ produce a superior ranking.

Another intelligent DS tool was Connections (Soules and Ganger 2005), which traced filesystem calls to maintain a file-file relationship graph. Traditional content-only results were enhanced by adding related files from this graph. Files were considered related if they used each other for I/O or were opened within a window of time. Injecting extra files greatly improved Recall measurements, but its authors noted only a marginal improvement in Precision. The strength of their approach lay in deriving relationships from temporal context, but its key disadvantages are the time required and cold-starting. The assumption made by Connections and other monitoring systems such as TaskTracer (Dragunov et al. 2005) is that users work on a single ‘task’ at a time. Exploiting this idea requires monitoring the user and system activity (for as long as half a year in their experiments) to build the relationship model. Since the initial cold-start results would be no better than those of a content-only tool, and given that the monitoring approaches carry more overhead, a user is likely to use such tools only if they are satisfied with a promise that results should improve eventually. PFH has several advantages over monitoring applications: it works immediately without monitoring, it is light and can be applied on small devices such as mobiles, and it can be used where monitoring is impractical or impossible, such as for anonymized search in a public corpus, a one-off search not linked to any ‘task’, or searching infrequently-accessed corpora.

We note that PFH does not need to replace monitoring or learning approaches. Rather, it can complement them or be used on top of them. For instance, it can be used during the initial monitoring phase where the models are still taking shape, and it can be used thereafter for searches that do not produce a high enough confidence match under their probabilistic relationship models.

A trial experiment in (Soules and Ganger 2005) used folders to derive contextual relationships, enhancing the regular set of matches by adding other files from the corresponding folders. Their trial performed poorly and they concluded that folders cannot be used effectively. However, a likely reason was that they used an overly-optimistic approach: when the query specified a file extension, all files in a folder with that extension were assigned the combined weight of any content matches and, when no extension was asked, all files in a folder were assigned the score of the highest scoring file among them. This recall-centric scoring approach demands flawless organizational effort from the user because it adds many new files to the result and assigns them high scores even if they do not match the query. This approach is similar to running PFH with α = 0, relying purely on structure, because that is when PFH gives the same score to files in the same folder. Our experiments agree that such a low α is a poor choice, but higher α values perform much better.

PFH differs from Connections’ folder experiment in three ways: it does not increase recall and does not add noise to the results; it maintains intra-folder relative order whereas they gave files equal scores and blurred the line between genuinely relevant and irrelevant matches; and finally, our notion of physical context stretches to nodes “in the neighborhood” rather than their restriction to “in the current folder”. In Sect. 4 we define an appropriate proximity function for such a distance calculation.

3.3 Connectivity analysis

The aforementioned notion of distance is related to web connectivity analysis, where the web’s hyperlink graph is analyzed by graph-theory algorithms such as HITS to determine hubs, authorities, communities and topical hotspots. Since the web has so many links, connectivity analysis algorithms use some notion of distance in their adjacency matrices. (Miller et al. 2001) suggested using arbitrary path length to avoid counter-intuitive results in weakly-connected graphs. Although computationally infeasible for large graphs, it can be done for smaller query-focused subgraphs such as those used by HITS and PFH. Accordingly, PFH treats the tree hierarchy as a strongly connected graph and considers each node to have a path to every other node. This fits well with the concept of “in the neighborhood” because it allows nodes to influence each other even if not directly connected.

Another related work in connectivity was (Guo et al. 2003), which exploited structure in XML documents in a similar way to PFH—by adapting a connectivity algorithm and defining a proximity metric. While they relied on domain-specific XML properties such as explicit x-ref hyperlinks, they also opted for a PageRank-like analysis. In Sect. 2.2 we give reasons why HITS is more appropriate in our case.

Other considerations used in our approach relate to the scoring and weight distribution. Although we use Lucene and its simple vector space model implementation, (Li et al. 2002) argued that there is little difference between several popular similarity measures for this type of search task. (Dean and Henzinger 1999) and (Borodin et al. 2001) have both suggested averaging hub weights by outlink cardinality, and PFH performs a similar calculation. Finally, the closest work to PFH within the connectivity analysis literature is the HITS algorithm itself. Because of its relevance, it is discussed in Sect. 2.

4 Approach

This section details the derivation of PFH based on the principles of HITS, given in Sect. 2.2. In hierarchical corpora, we consider the nodes with searchable content as authorities (e.g., files) and the structural nodes that link them together as the hubs (e.g., folders). Like HITS, PFH is an iterative algorithm that ‘shakes’ a small graph of linked nodes to distribute weight until equilibrium. A parameter α, with 0 ≤ α ≤ 1, is introduced to linearly vary how weight is distributed.

Note that PFH does not perform its own content analysis but, rather, inherits the top-k results from an underlying retrieval tool. Any tool can be used provided that it reports its scores. The popular commercial DS tools hide their scores from users, so we use Apache Lucene (Footnote 4) as the underlying engine.

4.1 PFH algorithm

PFH begins similarly to HITS, obtaining a root set of query-specific files F from its underlying retrieval engine. We set F to Lucene’s top-250 results. Lucene’s content score for a file f is labeled \(C_f\).

PFH’s goal is to rearrange the ranking of F. The output is a set of new authority scores \(\mathbf{A}\) for F, and a set of hub scores \(\mathbf{H}\) for the nodes that provide the implicit structural links, i.e. the folders. The \(\mathbf{H}\) folder scores are used during the weight distribution and are not needed afterward, but they can still be output since they have peripheral uses such as file categorization (Sect. 5.4).

On the web, HITS expands its root set by considering webpages ±1 hop away. This allows it to find pages that are still relevant but do not precisely match the query. Do we need something similar, i.e. should we expand F? We argue that we do not, for three reasons. First, adding extra files increases the size of the result such that it is no longer a ‘rearrangement’, making it harder to compare against the Lucene baseline. Second, in known-item search situations users are expected to formulate queries that at least partially match their target. Third, the underlying engine may retrieve results that do not partially match the query provided it has the means and justification to do so. Thus, we do not expand F.

Since hubs and authorities are in a mutually reinforcing relationship, a set of folders or “directories”, D, is needed to provide the hub scores. In a filesystem tree, D is simply the set of folders that form a spanning tree on F. The directories provide implicit relationships such as ‘containment’ and ‘sibling’, so F ∪ D corresponds to the base set in HITS. The graph induced by F ∪ D can be considered bidirectional, with link paths between all nodes; some links would be direct while others would involve a distance calculation.

In agreement with (Marchiori 1997; Miller et al. 2001), the influence of one node on another should be decayed by their distance. The length of a link path \(\delta_{d_1,d_2}\) between two folders \(d_1\) and \(d_2\) (Footnote 5) can be deduced from their paths by first finding the deepest ancestor-or-self node λ shared by both and then, using |n| to refer to n’s depth in the tree, setting:

$$ \delta_{d_1,d_2}=|d_1|-|\lambda| + |d_2|-|\lambda| $$

In Fig. 1’s example, \(\delta_{e,f_4} = 0\), \(\delta_{f_4,c} = 4\) and \(\delta_{c,d} = 2\). For a decay function, we considered reciprocal factorials (Miller et al. 2001), parameter exponents (Marchiori 1997) and the Inverse-Square Law from physics. In practice all three gave similar results, so we chose the simplest: \(1/(1 + \delta)^2\).
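The distance and decay definitions can be checked with a short sketch. Two assumptions are made for illustration: folders are represented as tuples of names from the root, and the tree has the root holding a and b, a holding c and d, and b holding e, a layout chosen because it reproduces the three distances quoted for Fig. 1. A file is treated as sitting at the depth of its containing folder, which is what a distance of 0 between e and \(f_4\) suggests.

```python
def delta(path_a, path_b):
    """Link-path length between two folders given as tuples of names from the root."""
    shared = 0  # depth of the deepest ancestor-or-self node common to both
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        shared += 1
    # |d1| - |lambda| + |d2| - |lambda|, with depth measured as path length
    return (len(path_a) - shared) + (len(path_b) - shared)


def decay(distance):
    """Inverse-square style decay 1 / (1 + delta)^2."""
    return 1.0 / (1.0 + distance) ** 2


# Assumed layout: root -> {a, b}, a -> {c, d}, b -> {e}; f4 lives in e.
c, d, e = ("a", "c"), ("a", "d"), ("b", "e")
print(delta(e, e), delta(e, c), delta(c, d))  # distances for (e, f4), (f4, c), (c, d)
print(decay(0), decay(2), decay(4))
```

Under this decay, a folder two steps away contributes one ninth of its score and a folder four steps away one twenty-fifth, so influence fades quickly outside the immediate neighborhood.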

So far we have described a set of files F and their content scores \(\mathbf{C}\), a set of folders D that structurally link all \(f \in F\) together, and a decay function to fade influence with node distance. We can now formulate the structure and content components of PFH’s weight distribution. Under PFH’s parameterized model, ‘content’ sums or variables will be multiplied by α and ‘structure’ sums or variables will be multiplied by (1 − α).

First we want a ‘content’ component for a hub, or in our case a directory d. We set d’s score by summing the authority scores \(A[f_i]\) for the files \(f_i\) directly inside d. As in HITS, \(\mathbf{A}_0[f_i]\) is initialized to 1. Formally, at iteration k:

$$ d_{\rm content} = \sum_{f \in F:\delta_{d,f}=0}{A_{k-1}[f]} $$

We can quickly upgrade the above definition to avoid having a hub with many weak authorities surpass a hub with few but strong ones by factoring in a \({\frac{nr}{nf}}\) ratio, where nr is the number of matching files in d and nf is the total files in d. Intuitively we would expect a folder with a higher ratio to be more relevant, as a whole, to the query. Additionally, a log factor may be added to favor folders with more files when ratios are similar. Hence:

$$ d_{\rm content} = {\frac{nr \log{(1+nr)}}{1+nf}}\sum_{f \in F:\delta_{d,f}=0}{A_{k-1}[f]} $$
(1)
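A small numeric sketch of Eq. (1) shows how the ratio factor separates a dense folder from a sparse one. The function name `hub_content` and the sample values are hypothetical, and the natural logarithm is an assumption since the base is not stated; the base only changes the result by a constant factor, which a later rescaling of the hub components would absorb.

```python
import math


def hub_content(auth_scores, total_files):
    """Eq. (1) for one folder.

    auth_scores: previous-iteration authority scores of the matching files
                 directly inside the folder (so nr = len(auth_scores)).
    total_files: nf, the total number of files directly inside the folder.
    """
    nr = len(auth_scores)
    weight = nr * math.log(1 + nr) / (1 + total_files)
    return weight * sum(auth_scores)


# Two folders with the same two matching files, but different sizes.
dense = hub_content([1.0, 1.0], total_files=2)    # every file matches
sparse = hub_content([1.0, 1.0], total_files=20)  # matches are a small minority
print(dense > sparse)
```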

Since this is a ‘content’ component, it will later be multiplied by α. Next, we seek a ‘structure’ component for the hub d. According to HITS, this is a sum over the nodes that d points to. Directories point to both files and other directories, so each d has an influence over all nodes in F ∪ D via some folder path, with appropriate distance decay. Thus \(d_{\rm structure}\) sums over F ∪ D. As the sum has both a content feature and a structure feature, we should multiply the sum over F by α and the sum over D by (1 − α). However, recalling the initial \(d_{\rm content}\) equation, we see that a sum over all files may be replaced by a sum over D since each d already aggregates its own files. Thus we end up with two sums over D, of which one is multiplied by α and the other by (1 − α). This cancels out α, leaving:

$$ d_{\rm structure} = \sum_{d' \in D}{\frac{H_{k-1}[d']} {(1+\delta_{d,d'})^2}} $$
(2)

Consequently, a hub’s score takes the form α(1) + (2). Note that if any directory has subdirectories but no files, scores can still propagate past it because it will have a non-zero structure component.

Next, we want a ‘content’ component for an authority, f. This score is already reported by the underlying retrieval engine and will be later multiplied by α:

$$ f_{\rm content} = C_f $$
(3)

Finally, we want a ‘structure’ component for an authority. This is an aggregation of objects that link to f. Since files are only pointed to by directories, it is a sum over D and will later be multiplied by (1 − α).

$$ f_{\rm structure} = \sum_{d \in D}{\frac{H_{k-1}[d]} {(1+\delta_{f,d})^2}} $$
(4)

Consequently, an authority’s score takes the form α(3) + (1 − α)(4). We can now piece together the individual components. The PFH algorithm, shown next, has three inputs: F, D and \(\mathbf{C}\).

PFH: calculates hubs/auths in a filesystem.

Inputs: directories D, files F and their content analysis scores \(\mathbf{C}\).

Outputs: hub scores \(\mathbf{H}\) for D, authority scores \(\mathbf{A}\) for F.

1. initialize \(\mathbf{H}_0\) and \(\mathbf{A}_0\) to all 1’s.

2. for iteration k from 1 to K:

3.  for \(d \in D\):

4.   let \(dc[d] := d_{\rm content}\), as in (1)

5.   let \(ds[d] := d_{\rm structure}\), as in (2)

6.  normalize\(_{L_\infty}(dc)\) and normalize\(_{L_\infty}(ds)\)

7.  for \(d \in D\):

8.   set \(H_k[d] := \alpha\, dc[d] + ds[d]\)

9.  for \(f \in F\):

10.   let \(fc[f] := f_{\rm content}\), as in (3)

11.   let \(fs[f] := f_{\rm structure}\), as in (4)

12.  normalize\(_{L_\infty}(fs)\)

13.  for \(f \in F\):

14.   set \(A_k[f] := \alpha\, fc[f] + (1 - \alpha)\, fs[f]\)

15.  normalize\(_{L_1}(\mathbf{H}_k)\) and normalize\(_{L_1}(\mathbf{A}_k)\).

16. end for

17. return \(\mathbf{H}\) and \(\mathbf{A}\).

The first thing to notice is the parallelism between PFH and HITS from Sect. 2.2: both perform two alternating sums, using the old hub/authority values to update the new authority/hub values, then normalizing and repeating the loop until equilibrium—or at least until the vectors stabilize after K iterations. We use K = 20 in the experiments, more than sufficient for stability. Besides redefining the summations, two notable differences with HITS are that PFH partitions the base set into disjoint sets F and D, and that PFH partitions the sums into ‘content’ and ‘structure’ components so that they can be parameterized with α.

Another note concerns the extra normalizations. The \(L_1\) norms are already part of HITS and are needed for convergence, but the intermediate \(L_\infty\) norms are needed by PFH because the different scores are on different scales: the hub and authority vectors each sum to 1, whereas Lucene’s relevance scores (line 10) are each in [0,1] and sum to something large. To make \(\alpha = {\frac{1}{2}}\) represent the “equal split” between structure and content, all scores must be normalized to the same scale (lines 6 and 12).
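The loop and its normalizations can be sketched as follows. This is a skeleton under stated assumptions rather than the original implementation: the names `pfh`, `normalize_linf` and `normalize_l1` are illustrative, and the four components of Eqs. (1) to (4) are passed in as caller-supplied functions because their definitions lie outside this listing. The content scores C are assumed to enter through `f_content`. Both updates read the vectors of iteration k − 1, as Eq. (4) does for hubs, and line 8 is reproduced as printed in the listing.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

Scores = Dict[Hashable, float]
# component(node, H_prev, A_prev) -> float; stands in for one of Eqs. (1)-(4)
Component = Callable[[Hashable, Scores, Scores], float]


def normalize_linf(v: Scores) -> Scores:
    """Scale so the largest absolute value is 1 (all-zero input is returned as is)."""
    m = max((abs(x) for x in v.values()), default=0.0)
    return {k: x / m for k, x in v.items()} if m else dict(v)


def normalize_l1(v: Scores) -> Scores:
    """Scale so the absolute values sum to 1 (all-zero input is returned as is)."""
    s = sum(abs(x) for x in v.values())
    return {k: x / s for k, x in v.items()} if s else dict(v)


def pfh(
    D: Sequence[Hashable],
    F: Sequence[Hashable],
    alpha: float,
    K: int,
    d_content: Component,
    d_structure: Component,
    f_content: Component,
    f_structure: Component,
) -> Tuple[Scores, Scores]:
    H = {d: 1.0 for d in D}  # line 1
    A = {f: 1.0 for f in F}
    for _ in range(K):  # line 2
        # lines 3-6: hub components, each rescaled to a maximum of 1
        dc = normalize_linf({d: d_content(d, H, A) for d in D})
        ds = normalize_linf({d: d_structure(d, H, A) for d in D})
        # lines 9-12: authority components; only the structure part is rescaled
        fc = {f: f_content(f, H, A) for f in F}
        fs = normalize_linf({f: f_structure(f, H, A) for f in F})
        # line 8, as printed in the listing
        H_new = {d: alpha * dc[d] + ds[d] for d in D}
        # line 14
        A_new = {f: alpha * fc[f] + (1 - alpha) * fs[f] for f in F}
        # line 15
        H, A = normalize_l1(H_new), normalize_l1(A_new)
    return H, A
```

Keeping the components as parameters makes the two scales visible: `fc` is left untouched because the content scores are already in [0,1], while `fs` is rescaled so that α weighs comparable quantities.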

4.2 Some observations

4.2.1 Boundary cases

When α = 1, hub scores are ignored, \(\user2{A}\) inherits \(\user2{C}\) and PFH returns Lucene’s results unchanged. When α = 0, \(\user2{C}\) is ignored and \(\user2{A}\) depends solely on structural influences from hubs; all matched files from a folder receive the same score because content no longer differentiates between them. The top hubs will be those that are overall closest to the epicenters of neighborhoods of query-matching files in F; namely, a folder benefits if it has more files to represent it.

4.2.2 Simple filesystem example

Recall Fig. 1. When α = 1, PFH reports the authorities in Lucene’s order: \(f_4\), \(f_3\), \(f_2\), \(f_1\). The higher content score of \(f_4\) promotes e as the second best hub after c. The root and b receive the lowest hub scores. It takes a small bias toward structure to rank \(f_3\) above \(f_4\), but a much larger bias (α = 0.55) to push \(f_4\) into third place.

4.2.3 Complexity

The experiments fix K = 20 and limit |F| to 250 (Footnote 6), so PFH runs in O(|D| × max(|D|, |F|)) time. The only variable is |D|, the number of folders forming a spanning tree over F. Since D depends on F, it does not necessarily grow with the corpus. Unless F is very sparsely distributed or the query produces very few matches (e.g. Fig. 1), typically |D| < |F|.

Since PFH only rearranges an existing ranking, we do not measure its indexing or retrieval performance because they are part of Lucene.

To speed up each iteration, a |D| × |F ∪ D| matrix of δ values can be built and used to look up the δ-distance between pairs of nodes. We computed this naively in \(O(n^2)\), but it can be optimized with dynamic programming. Since the cumulative time for building this matrix and then running PFH with K = 20 iterations averaged only 0.06 s, we did not optimize it.
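A naive construction of such a lookup table might look like the sketch below. It assumes, purely for illustration, that the δ-distance is the number of edges on the tree path between two nodes and that the hierarchy is given as a child-to-parent map; the actual definition of δ is the one given earlier in the paper and may differ. The names `ancestors`, `hop_distance` and `delta_table` are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def ancestors(node: Hashable, parent: Dict[Hashable, Hashable]) -> List[Hashable]:
    """Return [node, parent(node), ..., root]; the root has no entry in `parent`."""
    chain = [node]
    while parent.get(chain[-1]) is not None:
        chain.append(parent[chain[-1]])
    return chain


def hop_distance(a: Hashable, b: Hashable, parent: Dict[Hashable, Hashable]) -> int:
    """Edges on the path from a to b through their lowest common ancestor."""
    steps_from_a = {n: i for i, n in enumerate(ancestors(a, parent))}
    for j, n in enumerate(ancestors(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")


def delta_table(
    dirs: Iterable[Hashable],
    nodes: Iterable[Hashable],
    parent: Dict[Hashable, Hashable],
) -> Dict[Tuple[Hashable, Hashable], int]:
    """Precompute the assumed distance for every (directory, node) pair."""
    nodes = list(nodes)
    return {(d, n): hop_distance(d, n, parent) for d in dirs for n in nodes}
```

Each pair is computed independently by walking up to the root, so the cost is quadratic in the number of nodes times the tree depth; sharing partial paths between pairs is the kind of dynamic-programming optimization the text alludes to.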

4.2.4 Other observations

If the tree hierarchy is completely flattened, all files will be in the same hub and receive the same hub influence, thus \(C_f\) will be the sole discriminator and Lucene’s results will be unchanged. Conversely, if the F ∪ D root set is very sparsely distributed and all files are distant from each other, hub influences will be weakened due to large distance decays. Although they are later normalized, they will reflect the quality of the hub’s own files and incorporate little about the neighborhood. In this case, folders with many high-scoring files will become the top hubs, which agrees with intuition. If neither case applies and the retrieved results are neither sparse nor flat, then the search problem is a typical DS query. An experiment on such a case is performed in Sect. 5.

Alternatively, if files are poorly organized (e.g., random), there will not be any meaningful organization to exploit. It is difficult to predict PFH’s behavior in such atypical circumstances because a human-crafted hierarchy is never completely meaningless. Nevertheless, an experiment on such a case is also performed in Sect. 5 to find ‘safe’ values of α for which PFH will perform no worse than the baseline when organization is meaningless.

5 Evaluation

Three experiments are performed: queries on an organized corpus, queries on a meaninglessly organized corpus, and file categorization. In all cases, the top-250 results from Lucene v1.9, with stemming enabled, were used as PFH’s input and set F. For experimental measurements requiring human relevance judgments, the top-50 results from PFH and from five other retrieval systems were pooled together and judged in a blind test.

5.1 Competition and configuration

We compare PFH’s ranking to five DS tools: Lucene (our baseline) and four popular (Noda and Helwig 2005) commercial tools by Copernic, Google, Microsoft and Yahoo!. The latter are labelled as CDS, GDS, MDS and YDS.

To obtain results from these tools, we selected the rank-by-relevance option for GDS; for MDS we chose rank-by-relevance over ‘Document’ results; CDS and YDS do not have a relevance ranking and their default lexical ordering was used; Lucene matches disjunctively and ignores stopwords by default, so we left its default settings alone. Each of these tools was configured to index only the corpus being experimented on.

The implementations of the four commercial tools are black boxes, but it is evident that they all use conjunctive matching and index stopwords. It is also evident that at least CDS and MDS stem, although weakly, and that all tools except Lucene match the file name and path. To avoid a situation where a black-box tool may be using the path or file timestamps to influence its ranking (which Lucene does not), we renamed all files and folders to meaningless numbers and reset their timestamps to the same value. This forced each tool to match and retrieve files from its index based on their content rather than metadata.

5.2 Exp A: a structured corpus

This experiment aims to evaluate PFH under Mean Reciprocal Rank. Traditional metrics of Precision and Mean Average Precision are also given.

Unfortunately there is no standard corpus for DS (Chernov et al. 2007) and thus Precision metrics are not simple to provide. Previous works on DS, in Sect. 3, have run user evaluations using data that is not shareable and experiments that are impossible to reproduce precisely. In order to show reproducible and transparent results, we opted for a well-known file hierarchy: the JDK 1.5.0. This corpus is a 10000-node technical documentation tree of Java source code files. It has a varied structure and is both accessible and familiar to many computer scientists.

The files are documented with developer comments that serve as their semantic content. We ran Javadoc over the corpus to strip away the programming source code and generate the API-format web pages for each file. We then stripped away the web markup—including any templates, such as headers and footers—to effectively leave behind only a clean, plain-text version of each class. The resulting corpus had 9,268 files, 674 folders, on average 14 files per folder (median 5, max 321) and 930 words per file (median 460, max 38000), and average depth of 4.5 (median 4, max 9).

Although structured as a package tree, from the perspective of keyword queries there is much disorder in the hierarchy with many classes or packages that deal with related topics and yet are not placed together. Often it is a case of the implementation and interface being in separate locations, or classes dealing with similar topics. For example, many files match the query “connect remote server” because there are many packages dealing with establishing network connections. If we were to combine the top-50 results from each of the 6 DS tool rankings for this query, we would end up with 242 unique files from 45 separate packages. This suggests that the DS tools cannot come to an agreement regarding which files are most relevant. As another example, the first query in Table 1 has four targets that lie in completely separate parts of the tree such that any pair of them share only the corpus root folder in common.

Table 1 Test queries, showing |R| (ratio of corpus retrieved by Lucene), \(\bar{\delta}\) (average pairwise δ of top-250 results), and the targets

Another property of the corpus is that many queries match a huge number of files. For example, the last query in Table 1 matches 94% of the corpus disjunctively (and 7% conjunctively) because its keywords are common, yet only two target files are rewarded. This behavior is typical of navigational search and therefore the structure is reasonably representative of DS-style navigational queries.

5.2.1 Queries

Shown in Table 1, a total of 24 queries were sourced from online Java FAQs to exemplify real-world user questions about a programming task. The queries were aligned to typical search practices by being condensed and expressed using 2–3 keywords (Dumais et al. 2003; Jansen et al. 1998; Spink et al. 2001). The query targets were classes mentioned in the FAQ answers. The queries are considered navigational because they seek very specific results.

5.2.2 Results

For Precision-based metrics, for each query we pooled together the top-50 results from the six different ranking methods: PFH, Lucene and the four commercial tools. Relevance was judged blindly by two researchers who had taught Java-oriented courses at university level. On average, 203 files needed to be inspected per query. To keep vague queries with many tangential matches from inflating the relevant set, judges were asked to scrutinize results more strictly in order to control the maximal number of files deemed relevant per query.

At the end of the process there was an average of 15 ‘relevant’ Java classes per query. These results could then be retrospectively compared against the original rankings of the six tools. The resulting P@3, P@10 and MAP graphs are shown in Figs. 2, 3, 4. While we are mostly interested in MRR, these three precision graphs can help place a tighter bound on the useful α range. The MRR graph, shown in Fig. 5, is plotted using the reciprocal ranks of the Table 1 targets and therefore does not use the relevance judgements. The four graphs are described below.

Fig. 2 MAP for Exp A’s Table 1 queries. Also shows Exp B’s randomized bad-case scenario

Fig. 3 P@10 for Exp A’s Table 1 queries. Also shows Exp B’s randomized bad-case scenario

Fig. 4 P@3 for Exp A. PFH had good precision for α > 0.6 even under Exp B’s randomized bad-case scenario

Fig. 5 MRR for Exp A. Shows a significant improvement in ranking the Table 1 targets higher

Figure 2 shows Mean Average Precision. MAP is not considered a comprehensive indicator for known-item search because it rewards potentially-wanted rather than actually-wanted items, but it is nevertheless a stable metric across query set size (Buckley and Voorhees 2000) and it is popular in IR. In Fig. 2, PFH was within ±2.5% of the baseline when α > 0.69. The difference was not statistically significant for most α (Wilcoxon p ≫ 0.3 when α > 0.5).

Figure 3 shows P@10. PFH reached the baseline at α = 0.54 and the difference was not statistically significant when α > 0.4.

Figure 4 shows P@3, a cutoff motivated by the observation of Joachims et al. (2005) that users look at the top-3 search results. PFH compared favorably for most α values, with the difference being significant (p < 0.05) in the middle. Comparing P@3 and P@10, it seems that PFH makes beneficial swaps at the top ranks but leaves the overall quality of the top-10 intact.

Figure 5 shows MRR, a useful metric for known-item search. PFH performed well when α > 0.17, and the difference was significant for most α values (p < 0.05 when 0.27 < α < 0.97). The maximum was at 0.30–0.47, a strong bias towards structure that reflects a meaningful organization. However, this α-range had suboptimal P@n and MAP. Consolidating all four metrics reveals that α = 0.75 was the best-performing value: compared to the baseline, it raised MRR and did not worsen the Precision-based metrics.
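For reference, the four metrics discussed above can be sketched with their textbook definitions. The function names are illustrative, and two choices are assumptions rather than details taken from the experiments: the reciprocal rank uses the best-ranked target when a query has several, and average precision divides by the total number of relevant items.

```python
from typing import Hashable, Iterable, Sequence, Set


def precision_at(ranking: Sequence[Hashable], relevant: Set[Hashable], n: int) -> float:
    """P@n: fraction of the top-n results that were judged relevant."""
    return sum(1 for x in ranking[:n] if x in relevant) / n


def average_precision(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Mean of the precision values at each rank where a relevant item appears."""
    hits, total = 0, 0.0
    for i, x in enumerate(ranking, start=1):
        if x in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking: Sequence[Hashable], targets: Set[Hashable]) -> float:
    """1 / rank of the first target found, or 0 if no target is retrieved."""
    for i, x in enumerate(ranking, start=1):
        if x in targets:
            return 1.0 / i
    return 0.0


def mean(values: Iterable[float]) -> float:
    """Average over queries; gives MAP or MRR from per-query values."""
    vals = list(values)
    return sum(vals) / len(vals) if vals else 0.0
```

The reciprocal rank depends only on where the first target lands, which is why it can move sharply with α while P@10 and MAP, which average over many judged items, stay close to the baseline.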

5.3 Exp B: poor file organization

Exp A suggested that a 25% bias towards structure was good for a structured corpus. In this experiment, we are interested in how such an α value performs when file organization is poor. This experiment is a smoke test to verify that PFH with α = 0.75 performs sanely when structure is meaningless, but it can also reveal a ‘safe’ range of α values for which PFH performs no worse than the Lucene baseline when the hierarchical structure is meaningless.

We can simulate poor organization and create a meaningless structure simply by randomizing the arrangement of files in the corpus from the previous experiment. Reusing the corpus would have the benefit of maintaining the same numerical structure properties—such as depth, breadth and distribution of nodes in the hierarchy—and will therefore let us compare the two results. Note that the baseline is unaffected by any structural modifications since it uses only content to rank its results.

To randomize the corpus arrangement, we chose random pairs of files and swapped their content. Enough swaps were made for each file’s content to have jumped four times. Using the original Exp A queries on this randomized arrangement, PFH’s precision and MRR results were then plotted against α. The results are shown in Figs. 2, 3, 4 and 5 as the line PFH-random. By construction, PFH-random still converges to the baseline accuracy at α = 1, when structure is ignored.
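The swapping procedure can be sketched as below. The function name `randomize_contents`, the path-to-content dictionary and the seed parameter are illustrative assumptions; the sketch also reads "jumped four times" as an average, so it performs roughly 2|F| swaps because each swap moves two contents.

```python
import random
from typing import Dict, Optional


def randomize_contents(
    contents: Dict[str, str],
    jumps_per_file: int = 4,
    seed: Optional[int] = None,
) -> Dict[str, str]:
    """Return a copy in which random pairs of paths have exchanged contents.

    The set of paths, and therefore the depth, breadth and folder density of
    the hierarchy, is left untouched; only the content stored at each path moves.
    """
    rng = random.Random(seed)
    shuffled = dict(contents)
    paths = list(shuffled)
    if len(paths) < 2:
        return shuffled
    # Each swap moves two contents, so this gives `jumps_per_file` moves on average.
    n_swaps = (len(paths) * jumps_per_file + 1) // 2
    for _ in range(n_swaps):
        a, b = rng.sample(paths, 2)
        shuffled[a], shuffled[b] = shuffled[b], shuffled[a]
    return shuffled
```

Because only contents move, a content-only ranker scores exactly the same documents as before, which is why the baseline is unaffected.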

The P@3 and P@10 graphs show that α = 0.75 is still good for Precision, even under meaningless organization. The MAP graph suggests that a tighter bound of 0.83 is safer. Not unexpectedly, MRR worsened in this experiment. By randomly swapping files while keeping folder density intact, dense folders acquire a larger proportion of query-matching files. If the relatively few query targets do not end up close to these concentrated neighborhoods, their ranking may significantly worsen as α decreases and thus affect the MRR reading. With so few targets rewarded, this perhaps explains why only MRR was worse at the α = 0.75 range (although still competitive against other DS tools).

Overall, this experiment shows that PFH performed acceptably for α > 0.83, even when the hierarchical arrangement was meaningless. Since such disorganized arrangements are unlikely to be seen in practice, it can be argued that 0.75–0.83 is a useful α range since it maintains a higher MRR under regular conditions (Exp A) and has no negative effect on Precision under poor conditions (Exp B).

5.4 Exp C: hub categorization

A possible negative side-effect of improving the quality of Desktop Search is that users may become lazy and pay less attention to how they organize their desktops since they know they can retrieve files quickly. A similar behavior already occurs online, where instead of bookmarking sites we often type a keyword query directly into the URL bar to redirect the browser to the first result from a search engine. While this behavior is not problematic online, it is undesirable for Desktop Search and PFH because Exp B shows that PFH loses its advantage when files are randomly organized. It is therefore in the users’ interest not to corrupt their established organizational practices.

One approach to help users maintain a modicum of diligence is to suggest suitable save locations whenever they create or download new files. On modern Windows systems, the Save dialog contains icons to key locations—such as ‘Desktop’ and ‘My Documents’—so that the user can jump there without traversing the hierarchy. An idea explored in the literature has been to add icons to other suitable folders. FolderPredictor (Bao et al. 2006), a component of TaskTracer (Dragunov et al. 2005), modified the Windows registry to change the folders based on the user’s past history of file accesses for the current user task. Unfortunately, FolderPredictor has similar disadvantages as TaskTracer (Sect. 3), such as having to identify task switches. In any case, we can use PFH for a similar prediction. Suppose we can represent a new file by a query. Then, the top hubs returned by PFH for this query may be the most suitable folders where it should be saved. These hubs can be represented by folder icons in the Save dialog to help the user, either as a cognitive aid or as a shortcut to jump somewhere closer to the desired save location. Essentially, we would be categorizing authorities into their appropriate hubs. This idea may also extend beyond the desktop, for example adding a new web page to the Open Directory or adding a new paragraph to a thesis.

PFH already computes the necessary information during regular execution, so there is little extra work to do—the hub scores were simply discarded before. To conduct an experiment on this idea and compare against the other DS tools, we require each tool to return a ranked list of folders instead of files. A simple transformation is to consider a file as its folder. Ignoring repeated sightings produces a ranked list of only folders. In the following experiment, this transformation was used for Lucene, MDS and GDS (CDS and YDS yielded poor lexically-sorted results and are omitted).
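The file-to-folder transformation amounts to an order-preserving de-duplication of parent folders. A minimal sketch, assuming results are given as slash-separated paths (the function name `folders_from_files` is illustrative):

```python
import posixpath
from typing import Iterable, List


def folders_from_files(ranked_files: Iterable[str]) -> List[str]:
    """Map each ranked file to its folder and keep only first sightings."""
    seen = set()
    folders = []
    for path in ranked_files:
        folder = posixpath.dirname(path)
        if folder not in seen:
            seen.add(folder)
            folders.append(folder)
    return folders
```

A folder therefore inherits the rank of its best-ranked file, and folders without any retrieved file never appear in the output.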

Obtaining a folder ranking from PFH can be done in at least three simple ways. One way is to use the above file-folder transformation, a second way is to use the \(\user2{H}\) hub vector scores directly, and a third way could be a hybrid approach that combines the previous two ways. The hybrid approach we tested was to set a folder d’s score to \(\alpha\user2{A}[f] + (1 - \alpha)\user2{H}[d],\) where \(\user2{A}[f]\) was the authority score of the highest ranked file from d (0 if none). Sorting by this combined score gives a ranking of only folders. Because the transformation of the results of the other DS tools necessarily outputs only non-empty folders, we elided empty hubs from PFH’s hub results.
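The hybrid scoring can be sketched as follows. The function name and the `folder_of` mapping from files to their folders are illustrative assumptions, as is treating a missing hub score as 0. Only folders containing at least one retrieved file are scored, which mirrors the elision of empty hubs described above.

```python
from typing import Dict, Hashable, List


def hybrid_folder_ranking(
    A: Dict[Hashable, float],
    H: Dict[Hashable, float],
    folder_of: Dict[Hashable, Hashable],
    alpha: float,
) -> List[Hashable]:
    """Rank folders by alpha * (best authority score in d) + (1 - alpha) * H[d]."""
    best: Dict[Hashable, float] = {}
    for f, score in A.items():
        d = folder_of[f]
        best[d] = max(best.get(d, 0.0), score)
    scored = {d: alpha * a + (1 - alpha) * H.get(d, 0.0) for d, a in best.items()}
    return sorted(scored, key=lambda d: scored[d], reverse=True)
```

At α = 1 this reduces to the simple file-to-folder transformation, and at α = 0 it reduces to the hub vector restricted to non-empty folders.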

The experiment used the settings from Exp A. The target folder(s) were considered to be the folder(s) of the query’s original target file(s). The result, in Fig. 6, shows that PFH scored a higher MRR under all three approaches, suggesting that it ranked the correct folder (or ‘picked the correct category’) with higher accuracy. The simple transformation approach peaked at α = 0.47, likely due to PFH’s high P@n and MRR at that point. The hub vector approach was relatively steady for most values of α and consistently above the baseline. The hybrid approach, peaking at 0.66–0.92, was best in the 0.75–0.83 region suggested by the previous experiments.

Fig. 6 MRR for Exp C’s categorization. The PFH approaches predicted the correct folders for files more often than the naive approaches, and the hybrid approach was the best performer in the α ≈ 0.8 range

6 Conclusion

Search intent in hierarchical corpora is generally of a navigational nature and constitutes known-item search, seeking a specific result. Based on the premise that there is meaning and reason behind the way humans choose to organize content in a tree-like hierarchical arrangement, we present PFH, a lightweight rearrangement algorithm that re-ranks existing content-only search results of retrieval systems that do not analyze structure. We do this by combining content scores with PFH’s own structural analyses scores to favor tree nodes that are structurally closer to other relevant nodes. We also introduce an α parameter to logically separate the ‘structure’ and ‘content’ components in order to allow us to vary their emphasis and observe changes in search accuracy.

Experiments are provided on search in a real-world structured corpus, search in a meaninglessly arranged corpus, and categorization of new hierarchy nodes. Our results show that a 20%-structure vs. 80%-content ratio outperforms a content-only ranking. At the α = 0.8 level, PFH significantly improves Mean Reciprocal Rank while keeping other IR metrics steady, indicating that it makes beneficial rearrangements for known-item searching.

Overall, we show that a certain consideration of structure is useful for search over hierarchical corpora. The approach provides this facility in a flexible manner because it is time and space efficient, and can be used on top of the results of various existing retrieval systems. The approach can also be used to build new retrieval systems over hierarchical content such as Desktop Search, technical documentation or digital books.