Elsevier

Information Systems

Volume 36, Issue 4, June 2011, Pages 708-720
Metric information filtering

https://doi.org/10.1016/j.is.2010.09.007

Abstract

The traditional problem of similarity search requires finding, within a set of points, those that are closest to a query point $q$, according to a distance function $d$. In this paper we introduce the novel problem of metric information filtering (MIF): in this scenario, each point $x_i$ comes with its own distance function $d_i$ and the task is to efficiently determine those points that are close enough, according to $d_i$, to a query point $q$. MIF can be seen as an extension both of the similarity search problem and of approaches currently used in content-based information filtering, since in MIF user profiles (points) and new items (queries) are compared using arbitrary, personalized metrics. We introduce the basic concepts of MIF and provide alternative resolution strategies aiming to reduce processing costs. Our experimental results show that the proposed solutions are indeed effective in reducing evaluation costs.

Research Highlights

► Basic principles of information filtering in metric spaces (MIF).
► Pivot-based resolution strategies for Lipschitz-equivalent metrics.
► Experimental assessment of the efficiency of the proposed techniques.

Introduction

The era of globalization has produced an ever-growing amount of new information, part of it provided by end-users themselves (thanks to the availability of new channels such as blogs, RSS, digicams, and so on), making it hard for people to process such information effectively. Basically, the information production rate is now much higher than our consumption rate [1]. This phenomenon, termed information overload, harms the decision-making process, because users are not able to assess the validity of all such information. Although ad hoc solutions (e.g., spam filters) exist for some application fields, solving the problem for a broader range of domains is still an active subject of research.

The goal of information filtering (IF) systems is to avoid flooding users with all available information by delivering to them only the (small fraction of) relevant data. Ultimately, this would increase the semantic signal-to-noise ratio. IF systems have proved useful in many domains, including newsfeeds (apml.areyoupayingattention.com), music (rateyourmusic.com), movies (www.everyonesacritic.net), and websites (givealink.org). In order to “learn” a user's interests, each IF system stores a user profile [2] that takes into account, say, which items the user has visited/rated/bought. Such a profile can then be exploited using two alternative filtering paradigms [3]:

Collaborative filtering: The assumption behind this approach is that users who agreed in the past will also agree in the future [4]. For example, if users A and B bought the same item X and user A also bought item Y, then user B is likely to be interested in buying Y. Systems following this approach, which has been popularized by Amazon.com, usually store a user/item matrix to infer rating/buying patterns for each user.

Content-based filtering: Systems adopting this approach filter data according to the correlation/similarity between the user profile and the content of each item [5]. In the most general case, the user profile also includes a personalized similarity criterion and a threshold, so that only items that are similar enough to the profile are forwarded to the user.

Note that content-based filtering is usually easier to implement than collaborative filtering [3] and also has the advantage of allowing filtering of new data/items that, lacking ratings, could not be easily dealt with by collaborative filtering techniques (this is called the “cold start” problem). On the other hand, efficient solutions for content-based filtering are hard to obtain when the number of user profiles is large: in such cases, comparing the new item with all user profiles is not a viable solution.

Content-based filtering has been traditionally based on either the Boolean or the vector space model (see Section 2 for more details). However, for complex objects and/or similarity criteria it is known that more flexible modelling techniques are needed [6]. To this end, in this paper we develop a new IF approach based on the principles of the metric space model. Indeed, techniques developed in the context of metric spaces for the problem of similarity search [7], [8], [9], [10] have been successfully used in a variety of problems that commonly arise in application domains as diverse as content-based retrieval of multimedia objects, genomic and biomolecular databases, as well as string/text databases.

State-of-the-art techniques for searching in metric spaces, however, lack a fundamental feature for IF, that is, the possibility of accommodating user preferences in the specification of the distance function that determines how much two objects can be considered to be “similar” to each other. The metric information filtering (MIF) problem we tackle in this article requires, therefore, an “enlarged” metric scenario, allowing “personalized views” of the space (Section 3). We show how to extend metric space searching techniques to deal with the peculiar features of MIF, taking into account both correctness (Section 4) and efficiency (Section 5), and we experimentally validate the presented techniques (Section 6). We also consider the case, very common in commercial systems, when multiple items are to be filtered at the same time, and show how performance can be further improved (Section 7).

Section snippets

Background

In this section we provide the necessary background on content-based IF and on search methods in metric spaces.

Metric information filtering

The metric information filtering (MIF) problem we consider in this paper can be precisely defined as follows:

Problem 1

Given a finite set of objects $X \subseteq U$ and a point $q \in U$, where each point $x_i \in X$ is associated with a personalized distance metric $d_i$ and a personalized radius $r_i$, determine the subset of $X$ consisting of those $x_i$ such that $d_i(x_i, q) \le r_i$ holds.
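As a baseline reading of Problem 1, the following minimal sketch answers a MIF query by linear scan, evaluating every personalized distance once. The names `Profile`, `weighted_euclidean`, and `mif_scan` are hypothetical and introduced only for illustration; the weighted Euclidean metric is borrowed from the experimental setup as one possible choice of personalized distance, not as a requirement of the problem.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Point = Sequence[float]
Metric = Callable[[Point, Point], float]


@dataclass
class Profile:
    """Hypothetical container for one user profile (x_i, d_i, r_i)."""
    x: Point    # profile point x_i
    d: Metric   # personalized metric d_i
    r: float    # personalized radius r_i


def weighted_euclidean(weights: Sequence[float]) -> Metric:
    """Build a weighted Euclidean distance (assumed example of a personalized metric)."""
    def dist(a: Point, b: Point) -> float:
        return math.sqrt(sum(w * (u - v) ** 2 for w, u, v in zip(weights, a, b)))
    return dist


def mif_scan(profiles: List[Profile], q: Point) -> List[int]:
    """Return the indices i for which d_i(x_i, q) <= r_i, by exhaustive scan."""
    return [i for i, p in enumerate(profiles) if p.d(p.x, q) <= p.r]


if __name__ == "__main__":
    profiles = [
        Profile((0.0, 0.0, 0.0), weighted_euclidean((1, 1, 1)), 0.5),
        Profile((0.0, 0.0, 0.0), weighted_euclidean((4, 4, 4)), 0.5),
        Profile((1.0, 1.0, 1.0), weighted_euclidean((1, 1, 1)), 2.0),
    ]
    # The first two profiles share the same point but disagree on the new item
    # because their metrics differ.
    print(mif_scan(profiles, (0.3, 0.0, 0.0)))
```

The scan costs one distance evaluation per profile, which is exactly the cost that pruning strategies try to avoid when the number of profiles is large.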

In IF terms, the triple $(x_i, d_i, r_i)$ represents the profile of the $i$th user, whereas $q$ is a new item. Note that the case where each user is associated with

Computing lower bounds

Based on the assumption that at least one of the scaling factors $s_{i,j}$ and $s_{j,i}$ exists, in this section we explore the basic alternatives that can be used to generalize pivot-based methods so as to deal with the MIF problem.
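To make the idea of pivot-based lower bounds with a scaling constant concrete, here is a minimal sketch under simplifying assumptions that are not taken from the paper: a single pivot, a shared Euclidean base metric $d_0$, and weighted Euclidean personalized metrics, for which $d_i(a,b) \ge c_i \, d_0(a,b)$ holds with $c_i = \sqrt{\min_k w_{i,k}}$. Combined with the triangle inequality on $d_0$, this gives $d_i(x_i, q) \ge c_i \, |d_0(x_i, p) - d_0(q, p)|$, so a profile can be discarded without evaluating $d_i$ whenever that bound exceeds $r_i$. The constant $c_i$ and all function names are hypothetical; the paper's own inequalities are stated in terms of $s_{i,j}$ and $s_{j,i}$, whose definitions are not reproduced in this excerpt.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
# Hypothetical profile layout: (point x_i, non-negative weights w_i, radius r_i).
Profile = Tuple[Point, Sequence[float], float]


def euclidean(a: Point, b: Point) -> float:
    """Shared base metric d_0 (an assumption of this sketch)."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def weighted_euclidean(w: Sequence[float], a: Point, b: Point) -> float:
    """Personalized metric d_i with weights w."""
    return math.sqrt(sum(wk * (u - v) ** 2 for wk, u, v in zip(w, a, b)))


def build_index(profiles: List[Profile], pivot: Point) -> List[Tuple[float, float]]:
    """Precompute, per profile, d_0(x_i, pivot) and c_i = sqrt(min_k w_ik)."""
    return [(euclidean(x, pivot), math.sqrt(min(w))) for x, w, _ in profiles]


def mif_pivot_filter(profiles: List[Profile], index: List[Tuple[float, float]],
                     pivot: Point, q: Point) -> Tuple[List[int], int]:
    """Return (matching indices, number of personalized distances evaluated)."""
    d_qp = euclidean(q, pivot)  # computed once per incoming item
    matches, evaluated = [], 0
    for i, ((x, w, r), (d_xp, c)) in enumerate(zip(profiles, index)):
        # Lower bound on d_i(x_i, q): if it already exceeds r_i, prune safely.
        if c * abs(d_xp - d_qp) > r:
            continue
        evaluated += 1
        if weighted_euclidean(w, x, q) <= r:
            matches.append(i)
    return matches, evaluated


if __name__ == "__main__":
    pivot = (0.0, 0.0, 0.0)
    profiles = [
        ((0.1, 0.1, 0.1), (1, 1, 1), 0.2),
        ((0.9, 0.9, 0.9), (1, 2, 4), 0.2),
        ((0.2, 0.1, 0.1), (4, 1, 1), 0.1),
    ]
    index = build_index(profiles, pivot)
    print(mif_pivot_filter(profiles, index, pivot, (0.1, 0.1, 0.2)))
```

Because the test only ever discards profiles whose lower bound exceeds the radius, the result is the same as an exhaustive scan; the saving is in the number of personalized distances that actually get evaluated.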

Beyond correctness: the symmetric scaling factor

In order to characterize not only the correctness but also the performance of MIF, it is important to properly understand how the use of a specific lower-bounding inequality (among those introduced in the previous section) can impact search costs. To this end, in this section we provide basic results aiming to shed more light on the role that scaling factors play from a performance point of view. We start with the following preliminary observations:

  • 1.

    Just considering the scaling factor sj,i (or si

Experimental evaluation

We implemented the three pruning strategies proposed in Section 4 and tested them on synthetic datasets. The use of synthetic, rather than real, datasets is motivated by the fact that in our experiments we need not only a set of points, but also a corresponding distance for each point. We therefore generated three different 3D datasets in the $[0,1]^3$ cube, using a different weighted Euclidean distance for each point; points (and items) and distance weights were produced as follows:

  • uni

    Points

Batch arrivals

A common case in information filtering is that several items arrive together into the system (batch arrivals), and each of them should be forwarded only to those users whose profile matches it. A naïve way of processing batch arrivals is to perform filtering of items in a completely independent way. As demonstrated in [21] for the case of Boolean IF, this approach is likely to waste a lot of system resources, since it does not recognize that items might be grouped based on their similarity. For

Conclusions

In this paper we have introduced the novel problem of metric information filtering (MIF). MIF can be seen both as a generalization of the traditional metric space similarity search problem (in that in MIF every point carries its own personalized distance to measure dissimilarity with queries) and as an extension of traditional modelling approaches used in content-based information filtering (in that in MIF user profiles and new items are compared using arbitrary metrics).

In order to speed

References (24)

  • B. Bustos et al. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognition Letters (2003)
  • J. Jacoby. Perspectives on information overload. The Journal of Consumer Research (1984)
  • U. Çetintemel et al. Self-adaptive user profiles for large-scale data delivery
  • U. Hanani et al. Information filtering: overview of issues, research and systems. User Modeling and User-Adapted Interaction (2001)
  • D. Goldberg et al. Using collaborative filtering to weave an information tapestry. Communications of the ACM (1992)
  • T.W. Yan et al. The SIFT information dissemination system. ACM Transactions on Database Systems (1999)
  • R.A. Baeza-Yates et al. Modern Information Retrieval (1999)
  • E. Chávez et al. Proximity searching in metric spaces. ACM Computing Surveys (2001)
  • P. Zezula et al. (2006)
  • E. Chávez, G. Navarro (Eds.), Proceedings of the 1st International Workshop on Similarity Search and Applications...
  • T. Skopal, P. Zezula (Eds.), Proceedings of the 2nd International Workshop on Similarity Search and Applications (SISAP...
  • G. Salton. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer (1989)
Or: How to Win the (Metric) Space War with Information Overload.