
Information Systems

Volume 36, Issue 1, March 2011, Pages 62-78

Generalizing prefix filtering to improve set similarity joins

https://doi.org/10.1016/j.is.2010.07.003

Abstract

Identification of all pairs of objects in a dataset whose similarity is not less than a specified threshold is of major importance for the management, search, and analysis of data. Set similarity joins are commonly used to implement this operation; they scale to large datasets and are versatile enough to represent a variety of similarity notions. At a high level of abstraction, most methods proposed so far comprise two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the candidates and returns the correct answer. Previous work has primarily focused on reducing the number of candidates, making candidate generation the major effort toward better pruning results. Here, we propose the opposite approach. We drastically decrease the computational cost of candidate generation by dynamically reducing the number of indexed objects, at the expense of increasing the workload of the verification phase. Our experimental findings show that this trade-off is advantageous: we consistently achieve substantial speed-ups compared to previously known algorithms.

Introduction

Similarity joins pair objects from a dataset whose similarity is not less than a specified threshold; the notion of similarity is mathematically approximated by a similarity function defined on the collection of relevant features representing two objects. This is a core operation for many important application areas including data cleaning [2], [3], text data support in relational databases [4], [5], collaborative filtering [6], Web indexing [7], [8], social networks [6], and information extraction [9].

Several issues make the realization of similarity joins challenging. First, the objects to be matched are often sparsely represented in very high dimensions—text data are a prominent example. It is well known that indexing techniques based on data-space partitioning are often outperformed by simple sequential scans at high dimensionality [10]. Moreover, many domains involve very large datasets, therefore scalability is a prime requirement. Finally, the concept of similarity is intrinsically application-dependent. Thus, a general purpose similarity join realization has to support a variety of similarity functions [3].

Recently, set similarity joins have gained popularity as a means to tackle the issues mentioned above [2], [3], [8], [11], [12], [13]. The main idea behind this special class of similarity joins is to view operands as sets of features and employ a set similarity function to assess their similarity. An important property is that predicates containing set similarity functions can be expressed using the set overlap abstraction [3], [11]. Several popular measures belong to the general class of set similarity functions, including Jaccard, Dice, Hamming, and Cosine. Moreover, even when not representing a similarity function in their own right, set overlap constraints can still be used as an effective filter for metric distances such as the string edit distance [5], [14].
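To illustrate the set overlap abstraction, the following sketch (an illustration based on the standard algebraic equivalence, not code from the paper) converts a Jaccard threshold into an equivalent minimum overlap between two sets of known sizes:

```python
import math

def min_overlap_jaccard(len_x: int, len_y: int, t: float) -> int:
    """Minimum overlap |x ∩ y| required so that JS(x, y) >= t.

    Derivation: JS = i / (len_x + len_y - i) >= t
            <=>  i >= t / (1 + t) * (len_x + len_y).
    """
    return math.ceil(t / (1 + t) * (len_x + len_y))

# Two sets of sizes 13 and 12 must share at least 11 elements
# to reach Jaccard similarity 0.7:
print(min_overlap_jaccard(13, 12, 0.7))  # 11
```

A predicate on Jaccard similarity can thus be checked by counting common features, which is the basis for inverted-index-based candidate generation.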

As a concrete example, consider the data cleaning domain. A fundamental data cleaning activity is the identification of so-called “fuzzy duplicates”, i.e., multiple, non-identical representations of a real-world entity. Fuzzy duplicates often appear in a dataset owing to data entry errors such as typos and misspellings. In such cases, fuzzy duplicates exhibit slight textual deviations and can be identified by applying a (self) similarity join over the dataset. A widely used notion of string similarity is based on the concept of q-grams. Informally, a q-gram is a substring of size q, obtained by “sliding” a window of size q over the characters of a given string. We can view q-grams as features representing a string. Employing similarity joins based on multidimensional data structures is problematic due to the high dimensionality of the underlying space: up to |Σ|^q dimensions, where Σ is the alphabet from which the strings are built (see further discussion in Section 8). In this context, set similarity joins have been the method of choice to realize similarity matching based on q-grams [2], [3]. Besides efficiency, the corresponding set similarity functions have been shown to provide competitive quality results compared to other (more complex) similarity functions [15].

Example 1

Let s1 = Kaiserslautern and s2 = Kaisersautern be strings; their respective sets of 2-grams are

q(s1) = {Ka, ai, is, se, er, rs, sl, la, au, ut, te, er, rn},
q(s2) = {Ka, ai, is, se, er, rs, sa, au, ut, te, er, rn}.

Consider the Jaccard similarity (JS), which is defined as

JS(x1, x2) = |x1 ∩ x2| / |x1 ∪ x2|,

where x1 and x2 are set operands. Applying Jaccard to the q-gram sets of s1 and s2, we obtain

JS(q(s1), q(s2)) = 11 / (13 + 12 − 11) ≈ 0.785.
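Example 1 can be reproduced with a short sketch. Note that repeated grams ("er" occurs twice in each string) must be distinguished, e.g., by an occurrence number, so that the bag of q-grams behaves as a set of sizes 13 and 12 as in the example; this numbering trick is a common convention and an assumption here, not a detail stated in the text above.

```python
from collections import Counter

def qgrams(s: str, q: int = 2) -> set:
    """Extract q-grams, numbering repeated grams (er -> (er,1), (er,2))
    so the resulting bag of grams can be treated as a set."""
    seen = Counter()
    grams = set()
    for i in range(len(s) - q + 1):
        g = s[i:i + q]
        seen[g] += 1
        grams.add((g, seen[g]))
    return grams

def jaccard(x1: set, x2: set) -> float:
    inter = len(x1 & x2)
    return inter / (len(x1) + len(x2) - inter)

g1 = qgrams("Kaiserslautern")          # 13 grams
g2 = qgrams("Kaisersautern")           # 12 grams
print(len(g1), len(g2), len(g1 & g2))  # 13 12 11
print(jaccard(g1, g2))                 # 11/14 ≈ 0.785
```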

Most set similarity join algorithms are composed of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the generated candidates and returns the correct answer. Recently, Xiao et al. [13] improved the previous state-of-the-art similarity join algorithm due to Bayardo et al. [12] by pushing the overlap constraint checking into the candidate generation phase. To reduce the number of candidates even further, the authors proposed the suffix filtering technique, where a relatively expensive operation is carried out before qualifying a pair as a candidate. For that purpose, the overlap constraint is converted into an equivalent Hamming distance and subsets are verified in a coordinated way using a divide-and-conquer algorithm. As a result, the number of candidates is substantially reduced, often to the same order of magnitude as the result set size.
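The two-phase structure can be sketched as a simplified AllPairs-style self-join; this is a minimal illustration of plain prefix filtering under stated assumptions, not the algorithm of [12] or [13]. The key fact is that if two sets are t-similar under Jaccard, they must share at least one token among the first |x| − ⌈t·|x|⌉ + 1 tokens of each set under a fixed global token order, so only those prefix tokens need to be probed and indexed:

```python
import math
from collections import defaultdict

def prefix_join(sets, t):
    """Simplified prefix-filtering self-join under Jaccard threshold t.
    Each set is a list of distinct tokens sorted by a fixed global order
    (typically increasing token frequency)."""
    index = defaultdict(list)  # token -> ids of sets whose prefix contains it
    results = []
    for i, x in enumerate(sets):
        # probing prefix: t-similar partners must share a token in here
        prefix_len = len(x) - math.ceil(t * len(x)) + 1
        candidates = set()
        for token in x[:prefix_len]:
            candidates.update(index[token])  # candidate generation
            index[token].append(i)           # index the prefix token
        for j in candidates:                 # verification phase
            y = set(sets[j])
            inter = len(y & set(x))
            if inter / (len(x) + len(y) - inter) >= t:
                results.append((j, i))
    return results

records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
print(prefix_join(records, 0.5))  # [(0, 1)]
```

Both [12] and [13] refine this basic scheme with tighter prefixes and additional filters; the sketch only conveys the division of labor between the two phases.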

In this paper, we propose a new index-based algorithm for set similarity joins. Our work builds upon the previous work of [12], [13]; however, we follow an approach opposite to that of [13]. Our focus is on decreasing the computational cost of candidate generation instead of reducing the number of candidates. To this end, we introduce the concept of min-prefix, a generalization of the prefix filtering concept [3], [11] applied to indexed sets. Min-prefix allows us to dynamically keep the length of the inverted lists to a minimum, which drastically decreases candidate generation time. We address the increase in the workload of the verification phase, a side-effect of our approach, by interrupting the computation of candidate pairs that will not meet the overlap constraint as early as possible. We also improve the overlap score accumulation by avoiding the overhead of dedicated data structures. Furthermore, we consider disk-based and parallel versions of the algorithm. Finally, we conduct a thorough experimental evaluation using synthetic and real datasets. Our results demonstrate that our algorithm consistently outperforms previously known ones for unweighted and weighted sets, and they reveal important trends of set similarity join algorithms in general.

The rest of this paper is organized as follows. Section 2 defines our terminology and reviews important optimization techniques for set similarity joins. In Section 3, we introduce the min-prefix concept and show how it can be exploited to improve the runtime of set similarity joins. In Section 4, we present further optimizations in the candidate generation and verification phase. Section 5 considers disk-based and parallel versions of mpjoin and Section 6 describes the version for weighted sets. Experimental results are presented in Section 7. We discuss related work in Section 8, before we wrap up with the conclusions in Section 9.

Section snippets

Preliminaries

In this section, we first provide background material on set similarity join concepts and techniques. Then, we describe the baseline algorithm for set similarity joins that we use in this work.

Generalizing prefix filtering

In this section, we first empirically show that the number of generated candidates can be highly misleading as a measure of runtime efficiency. Motivated by this observation, we introduce the min-prefix concept and propose a new algorithm that focuses on minimizing the computational cost of candidate generation.

Further optimizations

In this section, we discuss the verification phase and propose a modification to mpjoin concerning the optimization of overlap score accumulation.

Practical aspects

In this section, we address two important practical aspects around our min-prefix approach, namely: a disk-based external version of mpjoin to work with limited memory and data splitting for parallel execution.

The weighted case

We now consider the weighted version of the set similarity join problem. In this version, sets are drawn from a universe of features Uw, where each feature f is associated with a weight w(f). Weights are used to quantify the importance of features. In many domains, features show non-uniformity regarding some semantic properties, such as discriminating power, and therefore the definition of an appropriate weighting scheme is instrumental in obtaining reasonable results. For instance, the widely
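Under the weighted model described above, the weighted Jaccard similarity replaces set cardinalities by sums of feature weights. The following sketch illustrates this; the features and weights are hypothetical examples, not data from the paper:

```python
def weighted_jaccard(x: set, y: set, w: dict) -> float:
    """Weighted Jaccard: sum of weights over the intersection divided
    by the sum over the union; w maps each feature f to its weight w(f)."""
    inter = sum(w[f] for f in x & y)
    union = sum(w[f] for f in x | y)
    return inter / union

# hypothetical weights: more discriminating features get higher weight
w = {"the": 0.1, "quick": 1.0, "fox": 1.5, "dog": 1.2}
x = {"the", "quick", "fox"}
y = {"the", "quick", "dog"}
print(weighted_jaccard(x, y, w))  # (0.1 + 1.0) / (0.1 + 1.0 + 1.5 + 1.2)
```

The unweighted case is recovered by assigning every feature the weight 1.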

Experiments

The main goal of our experiments is to measure the runtime performance of our algorithms, mpjoin and w-mpjoin, and compare them against previous, state-of-the-art set similarity join algorithms. We also aim at identifying the most important characteristics of the input data driving the performance of the set similarity joins algorithms under study. To this end, we conduct our study under several different data distributions and configuration parameters using real and synthetic datasets.

Related work

There is a vast body of literature on performing similarity joins in vector spaces; in this context, a similarity join is a variant of the more general approach known as spatial join. See [22] for a recent survey. Indexing techniques for vector spaces are well-suited for implementing similarity joins in application domains where the objects can be described by low-dimension feature vectors and the notion of similarity can be expressed by a distance function of the Minkowski family, such as the

Conclusion

In this paper, we proposed a new index-based algorithm for set similarity joins. Following a completely different approach from previous work, we focused on reducing the computational cost of candidate generation as opposed to lowering the number of candidates. To this end, we introduced the concept of min-prefix, a generalization of the prefix filtering concept, which allows us to dynamically and safely minimize the length of the inverted lists; hence, a larger number of

References (39)

  • L.A. Ribeiro et al.

    Efficient set similarity joins using min-prefixes

  • A. Arasu et al.

    Efficient exact set-similarity joins

  • S. Chaudhuri et al.

    A primitive operator for similarity joins in data cleaning

  • W.W. Cohen

    Integration of heterogeneous databases without common domains using queries based on textual similarity

  • L. Gravano et al.

    Approximate string joins in a database (almost) for free

  • E. Spertus et al.

    Evaluating similarity measures: a large-scale study in the orkut social network

  • A.Z. Broder

    On the resemblance and containment of documents

  • M. Theobald et al.

    Spotsigs: robust and efficient near duplicate detection in large web collections

  • K. Chakrabarti et al.

    An efficient filter for approximate membership checking

  • R. Weber et al.

    A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces

  • S. Sarawagi et al.

    Efficient set joins on similarity predicates

  • R.J. Bayardo et al.

    Scaling up all pairs similarity search

  • C. Xiao et al.

    Efficient similarity joins for near duplicate detection

  • C. Xiao et al.

    Ed-join: an efficient algorithm for similarity joins with edit distance constraints

    Proceedings of the VLDB Endowment (PVLDB)

    (2008)
  • A. Chandel et al.

    Benchmarking declarative approximate selection predicates

  • C. Li et al.

    Efficient merging and filtering algorithms for approximate string searches

  • C. Xiao et al.

    Top-k set similarity joins

  • S.E. Robertson et al.

    Relevance weighting of search terms

    Journal of the American Society for Information Science

    (1976)
  • R.M. Karp et al.

    Efficient randomized pattern-matching algorithms

    IBM Journal of Research and Development

    (1987)

    This paper is a significantly extended and revised version of [1].

1. Work partially supported by CAPES/Brazil; grant BEX1129/04-0.
