Leveraging personal metadata for Desktop search: The Beagle++ system

https://doi.org/10.1016/j.websem.2009.12.001

Abstract

Search on PCs has become less efficient than searching the Web due to the increasing amount of stored data. In this paper we present an innovative Desktop search solution, which relies on extracted metadata, context information as well as additional background information for improving Desktop search results. We also present a practical application of this approach—the extensible Beagle++ toolbox. To prove the validity of our approach, we conducted a series of experiments. By comparing our results against the ones of a regular Desktop search solution – Beagle – we show an improved quality in search and overall performance.

Introduction

The capacity of our hard-disk drives has increased tremendously over the past decade, and so has the number of files we usually store on our computers. With a few hundred gigabytes at hand, it is quite common to have over 100,000 indexable items on the Desktop. It is no wonder that sometimes we cannot find a document anymore, even when we know we saved it somewhere. Ironically, in some of these cases the document we are looking for can be found faster on the World Wide Web than on our personal computer. In view of these trends, resource organisation in personal repositories has received more and more attention in recent years. Thus, several research and development projects have started to explore personal information management, including Stuff I’ve Seen [17], Haystack [38], and Gnowsis [43]. The personal information management challenge is to make all resources on one’s Desktop easily accessible and manageable. In this context, Desktop search is the obvious solution for finding such stored information.

In order to offer better results, current Desktop search engines have to improve on the classic method of retrieval based on TFxIDF measures and use additional information about the searchable resources. Currently, only a few of the commercial Desktop search engines collect basic metadata, such as titles, authors, comments, etc., usually already contained in the indexed files. However, since very few people spend time annotating their documents, this functionality provides only a limited improvement over regular text-based search. Studies have shown that people associate things with certain contexts [45]; more specifically, everything happens within a context, and a person will not think of a thing on its own, but within this very context. For example, a person will not only consider a document, but also the email that it was sent with and the person who sent it, i.e., the context of the document. For this reason, this kind of information should be utilised during search. So far, however, neither has this information been collected, nor have there been attempts to use it.
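For reference, the classic TFxIDF baseline mentioned above can be sketched in a few lines. This is only a minimal illustration of the general technique, assuming whitespace tokenisation and an unsmoothed logarithmic IDF; the function name and the sample documents are hypothetical and are not taken from Beagle or Beagle++.

```python
import math
from collections import Counter


def tfidf_scores(query, documents):
    """Score each document against the query with a basic TF x IDF sum."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    scores = []
    for tokens in tokenised:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term occurs in no document, so it contributes nothing
            idf = math.log(n_docs / df[term])
            score += tf[term] * idf
        scores.append(score)
    return scores


# Hypothetical sample data, for illustration only.
docs = [
    "budget report for the beagle project",
    "holiday photos from the beach",
    "project meeting notes",
]
scores = tfidf_scores("beagle project", docs)
```

A score of this kind sees only the text of each file, which is why sender, thread or folder context cannot influence the result.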

In this paper we propose to exploit the implicit semantic information residing at the Desktop level in order to enhance Desktop Search. We therefore propose the automatic generation of metadata taking into account the context of Desktop resources:

  • Email context clearly generates useful information. For example, one email might contain a question describing the object one is looking for, and another email in the same thread might include the answer to that question in the form of an attached document.

  • Email attachments lose all contextual information as soon as they are stored on the PC, even though emails usually include additional information about their attachments, such as sender, subject or comments. It would be helpful to find an attachment not only based on its content, but also based on its associated context from within the email.

  • Folder hierarchies may contain valuable context information, because we might have spent considerable time to build sophisticated structuring hierarchies for the documents we store.

  • Browser caches include all information about the user’s browsing behaviour. This is useful both for finding relevant results, and for providing additional context for them.

  • Downloaded publications also lose all their “links” once stored on our machines. Yet it would be very useful if a search application returned not only one specific scientific paper, but also all the referenced and referring papers which we downloaded on that occasion.
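To make the idea of context metadata concrete, the following sketch stores a few context facts as subject-predicate-object triples and uses them to find an attachment by the sender or subject of its email. The predicate names, identifiers and helper functions are hypothetical and chosen only for illustration; they are not the Beagle++ ontology or API.

```python
from collections import defaultdict

# Hypothetical predicate names and identifiers, for illustration only.
triples = [
    ("file:/home/u/report.pdf", "attached_to", "email:42"),
    ("email:42", "sender", "alice@example.org"),
    ("email:42", "subject", "Quarterly budget figures"),
    ("file:/home/u/notes.txt", "in_folder", "folder:/home/u"),
]


def build_index(triples):
    """Index objects by (subject, predicate) for direct lookups."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(s, p)].append(o)
    return index


def files_with_email_context(triples, predicate, keyword):
    """Return files whose originating email has `predicate` containing `keyword`."""
    index = build_index(triples)
    hits = []
    for s, p, o in triples:
        if p != "attached_to":
            continue
        # Follow the link from the stored file back to its email.
        values = index.get((o, predicate), [])
        if any(keyword.lower() in v.lower() for v in values):
            hits.append(s)
    return hits


by_sender = files_with_email_context(triples, "sender", "alice")
by_subject = files_with_email_context(triples, "subject", "budget")
no_match = files_with_email_context(triples, "sender", "bob")
```

In this toy example the PDF is found through the sender and the subject of the email, although neither string occurs in the file itself.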

The additional metadata generated would be useless without a proper mechanism of querying and results ranking. Web search has become very efficient due to the powerful link-based ranking solutions such as PageRank [35]. The recent arrival of Desktop search applications, which index all data on the PC, promises to increase search efficiency on the Desktop. However, Desktop search engines are now comparable to first generation Web search engines, which provided full-text indexing, but only relied on textual IR (Information Retrieval) algorithms for searching and ranking.

We propose a centralised approach for querying, which combines the full-text and metadata search, and adds a modified ObjectRank [3] mechanism for improved ranking of the retrieved results. In summary, this paper makes the following main contributions:

  • present how to enhance and contextualise Desktop search based on semantic metadata collected from different activities performed on a PC;

  • show how this metadata can be used by a search application together with a full-text index for improving search results and ranking quality;

  • construct Beagle++, an easily extensible application that puts all these enhancements together, and demonstrate its benefits through a set of user-conducted experiments.
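The combination of full-text scores with a link-based authority score can be sketched as follows. This is a strongly simplified illustration in the spirit of ObjectRank-style authority transfer, not the modified ObjectRank used in Beagle++: the edge types, transfer weights, damping factor and the linear blend with the text score are all assumptions made for the example.

```python
def authority_scores(nodes, edges, transfer, damping=0.85, iterations=50):
    """Power iteration in which each edge type passes on a fixed share of authority.

    edges: list of (source, target, edge_type)
    transfer: edge_type -> share of the source's authority sent along that type,
              split evenly among the source's outgoing edges of that type.
    """
    out_count = {}
    for src, _, etype in edges:
        out_count[(src, etype)] = out_count.get((src, etype), 0) + 1

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, dst, etype in edges:
            share = transfer[etype] / out_count[(src, etype)]
            nxt[dst] += damping * share * rank[src]
        rank = nxt
    return rank


def combined_ranking(text_scores, rank, weight=0.5):
    """Blend a full-text score with an authority score and sort descending.

    The two scores are not normalised to a common scale here; a real system
    would have to do that before blending.
    """
    blended = {
        n: (1 - weight) * text_scores.get(n, 0.0) + weight * rank[n]
        for n in rank
    }
    return sorted(blended, key=blended.get, reverse=True)


# Hypothetical data: two downloaded papers cite a third one.
nodes = ["paperA", "paperB", "paperC"]
edges = [("paperB", "paperA", "cites"), ("paperC", "paperA", "cites")]
rank = authority_scores(nodes, edges, transfer={"cites": 0.7})
order = combined_ranking({"paperA": 0.2, "paperB": 0.2, "paperC": 0.0}, rank)
```

Here paperA and paperB have the same text score, but paperA is ranked first because it receives authority from the two papers that cite it.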

Through our extensions, we show that Beagle++ is not simply a Semantic Desktop (such as Haystack [38], IRIS [11], Gnowsis [43]), a Personal Information Management application (such as SEMEX [16] or SIS [17]), or a new data storage paradigm (such as Lifestreams [20] or TagFS [7], [21]). Instead, Beagle++ relies on a combination of all these aspects to provide a Desktop search engine that works on the “classic” Desktop metaphor and exploits the semantics contained in the Desktop data items. It thus serves as an example of the Semantic Desktop paradigm, demonstrating its benefits and potential to ordinary users. Beagle++ is available for download as sources, binaries and a virtual machine.

The components making up Beagle++ contribute to the NEPOMUK project. The goal of NEPOMUK is to create the Social Semantic Desktop which allows management of Desktop resources as well as sharing and exchange of data between Desktops [13], [22]. NEPOMUK provides an infrastructure for including various components in the Social Semantic Desktop application. All our components were also embedded in this framework, and thus also integrated with other components such as Gnowsis [43].

The rest of the paper is organised as follows: In the next section we present a short overview of the Beagle++ architecture and continue with details about each new component we add to the system. We discuss exploited contexts and resulting metadata in Section 3, together with the way we store and index this additional information. Section 4 then introduces additional modalities for enriching the already existing metadata by some novel techniques: Entity Identification, Attachment-File Linker and ObjectRank. Section 5 presents how we combine metadata and full-text search, how we rank search results, and how we present them to the user via a visual interface. To test the efficiency and effectiveness of our solution, we conducted several experiments, described in Section 6. We present related work in Section 7 and finally conclude and present some future work in the last section.

Section snippets

Enhancing the Beagle Desktop Search Architecture to Support Metadata—An overview

As the basis for our Beagle++ environment we use the open source Gnome Desktop search engine Beagle for Linux, which we extend with semantic indexing, searching and ranking capabilities. We chose to build upon Gnome Beagle in order to reuse existing work on developing and establishing a Desktop indexing and searching platform, such that we could primarily focus on developing the semantic part of our Semantic Desktop Search engine.

Fig. 1 illustrates the overall

Metadata Generation and Storage

As already presented in Section 2, an important functionality of our Beagle++ Desktop tool is the creation and storage of metadata. Since these metadata are used and processed by an extensible set of components, they need to be compliant with a common well-defined ontology, such that every component which generates and consumes metadata can rely on their format and semantics. In the following, we will describe the way we reference Desktop objects, the metadata format, and discuss the underlying

Metadata Enrichment

As already mentioned in Section 2, once the metadata are stored in the RDF repository, we propose to further apply several methods for enriching them. In this section we describe these methods in detail, and more precisely three different modules encapsulating them: the Entity Identification which looks for similar items in the RDF repository and joins them, thus making it possible to find entities with different representations; the Attachment-File Linker which preserves the links between emails and

Metadata Search

After populating and indexing the metadata store we provide the user with the search functionality as well as with the possibility to visualise retrieved Desktop items together with their metadata. In this section we will describe how Beagle++ performs the search, how the final ranking of the results is computed, and how it displays the retrieved resources.

Experiments

In order to evaluate the performance of our Beagle++ system, the natural baseline we considered was Beagle, since Beagle++ is an extension of Beagle. The first category of experiments aimed to assess the quality of the results provided by Beagle++. It involved human judges who rated the results that our system provided for personalised queries. The second type of experiments considered the performance in terms of time to index collections of data, the amount of extra data (metadata)

Related work

Desktop search applications are not new to the industry, only the high interest in this area is new: applications have been available since 1998 (e.g., Enfish Personal), usually under a commercial license. As the amount of searchable Desktop data has reached very high volumes and will most probably continue to grow in the future, the major search engines have recently given more focus to this area than academia. Thus, several free Desktop search distributions have

Lessons Learned

Developing, evaluating, and using Beagle++ allowed us to realise several issues one needs to deal with in order to create an effective Semantic Desktop search engine. In this section we present and discuss the most important issues.

Metadata Extraction. Extracting metadata that are not explicitly stored in Desktop resources requires non-trivial algorithms, or the usage of external information sources (cf. Metadata Filter for Scientific Publication in Section 3.2.2). For the former, we had to

Conclusions

In this paper we presented the Beagle++ Desktop search tool and the underlying architectural design details for implementing a semantically enhanced Desktop search application. Our current implementation builds upon a snapshot of the standard Beagle implementation and we provided details about all new components we added to the system: the Metadata Filters, the RDF Storage and Indexing Module, the Metadata Enrichment Components – Entity Identification, ObjectRank and Attachment-File Linker – as

Acknowledgements

This work was supported by the NEPOMUK project funded by the European Commission under the 6th Framework Programme (IST Contract No. 027705). We would also like to thank many colleagues within L3S for their important contributions.

References (51)

  • H. Alani

    TGVizTab: an ontology visualisation extension for Protégé

  • B. Aleman-Meza et al.

    Context-aware semantic association ranking

  • A. Balmin et al.

    ObjectRank: authority-based keyword search in Databases

  • T. Berners-Lee
  • T. Berners-Lee
  • I. Bhattacharya et al.

    Deduplication and group detection using links

  • S. Bloehdorn et al.

    TagFS—tag semantics for Hierarchical File Systems

  • J. Broekstra et al.

    Sesame: a generic architecture for storing and querying RDF and RDF Schema

  • I. Brunkhorst et al.

    The Beagle++ Toolbox: towards an extendable desktop search architecture

  • C. Chemudugunta et al.

    Modeling documents by combining semantic concepts with unsupervised statistical learning

  • A. Cheyer et al.

    IRIS: Integrate. Relate. Infer. Share

  • P.A. Chirita et al.

    Activity based metadata for semantic desktop search

  • S. Decker, M. Frank, The Social Semantic Desktop, Tech. Rep., DERI 2004-05-02,...
  • L. Ding et al.

    Swoogle: a search and metadata engine for the semantic web

  • X. Dong et al.

    Reference reconciliation in complex information spaces

  • X. Dong et al.

    SEMEX: toward on-the-fly personal information integration

  • S. Dumais et al.

    Stuff I’ve Seen: a system for personal information retrieval and re-use

  • P. Eklund et al.

    OntoRama: browsing an RDF ontology using a hyperbolic-like browser

  • B. Fallenstein

    Fentwine: a navigational RDF browser and editor

  • E. Freeman et al.

    Lifestreams: organizing your electronic life

  • O. Görlitz et al.
  • T. Groza et al.

    The NEPOMUK Project—on the way to the Social Semantic Desktop

  • R. Guha et al.

    Semantic search

  • A. Harth et al.
  • E. Ioannou et al.

    Probabilistic entity linkage for heterogeneous information spaces

1 This work was performed while the author was employed by L3S Research Center.