Generalizing prefix filtering to improve set similarity joins☆
Introduction
A similarity join returns all pairs of objects from a dataset whose similarity is no less than a specified threshold; the notion of similarity is mathematically approximated by a similarity function defined on the collection of relevant features representing two objects. This is a core operation for many important application areas including data cleaning [2], [3], text data support in relational databases [4], [5], collaborative filtering [6], Web indexing [7], [8], social networks [6], and information extraction [9].
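To fix ideas, the operation can be stated in a few lines of Python. This is an illustrative brute-force sketch only (the record values, the use of Jaccard, and the O(n²) loop are our choices for illustration, not part of the paper's algorithms):

```python
from itertools import combinations

def jaccard(x, y):
    """Jaccard similarity of two feature sets."""
    return len(x & y) / len(x | y)

def sim_join(records, sim, threshold):
    """Naive O(n^2) self similarity join: return all pairs of records
    whose similarity under `sim` is no less than `threshold`."""
    return [(r, s) for r, s in combinations(records, 2)
            if sim(r, s) >= threshold]

records = [frozenset("abcd"), frozenset("abce"), frozenset("xyz")]
pairs = sim_join(records, jaccard, 0.5)   # only the first two records match
```

The algorithms discussed in this paper exist precisely to avoid this quadratic comparison loop.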
Several issues make the realization of similarity joins challenging. First, the objects to be matched are often sparsely represented in very high dimensions—text data are a prominent example. It is well known that indexing techniques based on data-space partitioning are often outperformed by simple sequential scans at high dimensionality [10]. Moreover, many domains involve very large datasets, therefore scalability is a prime requirement. Finally, the concept of similarity is intrinsically application-dependent. Thus, a general purpose similarity join realization has to support a variety of similarity functions [3].
Recently, set similarity joins have gained popularity as a means to tackle the issues mentioned above [2], [3], [8], [11], [12], [13]. The main idea behind this special class of similarity joins is to view operands as sets of features and employ a set similarity function to assess their similarity. An important property is that predicates containing set similarity functions can be expressed using the set overlap abstraction [3], [11]. Several popular measures belong to the general class of set similarity functions, including Jaccard, Dice, Hamming, and Cosine. Moreover, even when not representing a similarity function on its own, set overlap constraints can still be used as an effective filter for metric distances such as the string edit distance [5], [14].
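The reduction to set overlap can be made concrete for Jaccard: from JS(x, y) = O / (|x| + |y| − O), where O = |x ∩ y|, it follows that JS(x, y) ≥ t iff O ≥ t/(1+t) · (|x| + |y|). A small sketch (the function name and sample sets are ours, for illustration):

```python
import math

def jaccard(x, y):
    return len(x & y) / len(x | y)

def min_overlap_jaccard(t, lx, ly):
    """Minimum set overlap equivalent to a Jaccard threshold t for sets
    of sizes lx and ly: JS >= t  iff  |x ∩ y| >= t/(1+t) * (lx + ly)."""
    return math.ceil(t / (1 + t) * (lx + ly))

x, y = set("abcdef"), set("abcdeg")
t = 0.7
o = min_overlap_jaccard(t, len(x), len(y))
# the overlap predicate and the Jaccard predicate agree:
assert (len(x & y) >= o) == (jaccard(x, y) >= t)
```

Analogous overlap bounds exist for Dice, Cosine, and Hamming, which is what lets one join framework serve all of these measures.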
As a concrete example, consider the data cleaning domain. A fundamental data cleaning activity is the identification of so-called “fuzzy duplicates”, i.e., multiple and non-identical representations of a real-world entity. Fuzzy duplicates often appear in a dataset owing to data entry errors like typos and misspellings. In such cases, fuzzy duplicates exhibit slight textual deviations and can be identified by applying a (self) similarity join over the dataset. A widely used notion of string similarity is based on the concept of q-grams. Informally, a q-gram is a substring of size q, obtained by “sliding” a window of size q over the characters of a given string. We can view q-grams as features representing a string. Employing similarity joins based on multidimensional data structures is problematic due to the high dimensionality of the underlying space: up to |Σ|^q dimensions, where Σ is the alphabet from which strings are built (see further discussion in Section 8). In this context, set similarity joins have been the method of choice to realize similarity matching based on q-grams [2], [3]. Besides efficiency, the corresponding set similarity functions have been shown to provide competitive quality results compared to other (more complex) similarity functions [15].

Example 1. Let s1 = Kaiserslautern and s2 = Kaisersautern be strings; their respective sets of 2-grams are x1 = {Ka, ai, is, se, er, rs, sl, la, au, ut, te, rn} and x2 = {Ka, ai, is, se, er, rs, sa, au, ut, te, rn}. Consider the Jaccard similarity (JS), which is defined as JS(x1, x2) = |x1 ∩ x2| / |x1 ∪ x2|, where x1 and x2 are set operands. Applying Jaccard on the 2-gram sets of s1 and s2, we obtain JS(x1, x2) = 10/13 ≈ 0.77.
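The 2-gram computation in this example can be checked mechanically (an illustrative snippet; the function name is ours):

```python
def qgrams(s, q=2):
    """Distinct q-grams obtained by sliding a window of size q over s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

g1 = qgrams("Kaiserslautern")  # 12 distinct 2-grams ('er' occurs twice)
g2 = qgrams("Kaisersautern")   # 11 distinct 2-grams
js = len(g1 & g2) / len(g1 | g2)   # 10 / 13, roughly 0.77
```

Note that the set view discards multiplicities; some q-gram schemes instead keep duplicates (multisets) or pad the string boundaries, which would change the counts slightly.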
Most set similarity join algorithms are composed of two main phases: candidate generation, which produces a set of candidate pairs, and verification, which applies the actual similarity measure to the generated candidates and returns the correct answer. Recently, Xiao et al. [13] improved the previous state-of-the-art similarity join algorithm due to Bayardo et al. [12] by pushing the overlap constraint checking into the candidate generation phase. To reduce the number of candidates even further, the authors proposed the suffix filtering technique, where a relatively expensive operation is carried out before qualifying a pair as a candidate. For that purpose, the overlap constraint is converted into an equivalent Hamming distance and subsets are verified in a coordinated way using a divide-and-conquer algorithm. As a result, the number of candidates is substantially reduced, often to the same order of magnitude as the result set size.
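The filter-and-verify structure with prefix filtering can be sketched as follows. This is a minimal AllPairs-style illustration [12] under a Jaccard threshold, assuming the tokens of each set are pre-sorted by one global ordering (e.g., rarest first); it is not the paper's mpjoin algorithm, and the probing-prefix length formula |x| − ⌈t·|x|⌉ + 1 is the standard one for self-joins:

```python
import math
from collections import defaultdict

def prefix_filter_join(sets, t):
    """Filter-and-verify self-join under Jaccard threshold t.
    Candidate generation: probe/build an inverted index on token prefixes.
    Verification: compute the exact Jaccard similarity on each candidate."""
    index = defaultdict(list)   # token -> ids of sets indexed under it
    results = []
    for i, x in enumerate(sets):
        # any match of x must share at least ceil(t*|x|) tokens with x,
        # so a shared token must occur within the first |x|-ceil(t*|x|)+1
        prefix_len = len(x) - math.ceil(t * len(x)) + 1
        candidates = set()
        for token in x[:prefix_len]:
            candidates.update(index[token])
            index[token].append(i)
        for j in candidates:    # verification phase
            y = sets[j]
            inter = len(set(x) & set(y))
            if inter / (len(x) + len(y) - inter) >= t:
                results.append((j, i))
    return results

sets = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
matches = prefix_filter_join(sets, 0.5)
```

Suffix filtering [13] and min-prefix both refine this scheme, but from opposite directions: the former spends more work per candidate to generate fewer of them, the latter makes candidate generation itself cheaper.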
In this paper, we propose a new index-based algorithm for set similarity joins. Our work builds upon the previous work of [12], [13]; however, we follow an approach opposite to that of [13]. Our focus is on decreasing the computational cost of candidate generation rather than reducing the number of candidates. To this end, we introduce the concept of min-prefix, a generalization of the prefix filtering concept [3], [11] applied to indexed sets. Min-prefix allows us to dynamically keep the inverted lists at their minimum length, which drastically decreases candidate generation time. We address the increase in the workload of the verification phase, a side-effect of our approach, by interrupting the computation of candidate pairs that will not meet the overlap constraint as early as possible. We also improve the overlap score accumulation by avoiding the overhead of dedicated data structures. Furthermore, we consider disk-based and parallel versions of the algorithm. Finally, we conduct a thorough experimental evaluation using synthetic and real datasets. Our results demonstrate that our algorithm consistently outperforms previously known ones for unweighted and weighted sets, and reveal important trends of set similarity join algorithms in general.
The rest of this paper is organized as follows. Section 2 defines our terminology and reviews important optimization techniques for set similarity joins. In Section 3, we introduce the min-prefix concept and show how it can be exploited to improve the runtime of set similarity joins. In Section 4, we present further optimizations in the candidate generation and verification phase. Section 5 considers disk-based and parallel versions of mpjoin and Section 6 describes the version for weighted sets. Experimental results are presented in Section 7. We discuss related work in Section 8, before we wrap up with the conclusions in Section 9.
Section snippets
Preliminaries
In this section, we first provide background material on set similarity join concepts and techniques. Then, we describe the baseline algorithm for set similarity joins that we use in this work.
Generalizing prefix filtering
In this section, we first empirically show that the number of generated candidates can be highly misleading as a measure of runtime efficiency. Motivated by this observation, we introduce the min-prefix concept and propose a new algorithm that focuses on minimizing the computational cost of candidate generation.
Further optimizations
In this section, we discuss the verification phase and propose a modification to mpjoin concerning the optimization of overlap score accumulation.
Practical aspects
In this section, we address two important practical aspects around our min-prefix approach, namely: a disk-based external version of mpjoin to work with limited memory and data splitting for parallel execution.
The weighted case
We now consider the weighted version of the set similarity join problem. In this version, sets are drawn from a universe of features Uw, where each feature f is associated with a weight w(f). Weights are used to quantify the importance of features. In many domains, features show non-uniformity regarding some semantic properties, such as discriminating power, and therefore the definition of an appropriate weighting scheme is instrumental in obtaining reasonable results. For instance, the widely
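A weighted set similarity function generalizes its unweighted counterpart by summing feature weights instead of counting features. The sketch below shows weighted Jaccard; the weight values and feature names are hypothetical, chosen only to illustrate how rare, discriminative features dominate the score:

```python
def weighted_jaccard(x, y, w):
    """Weighted Jaccard: total weight of the intersection over total
    weight of the union. `w` maps each feature to its weight, e.g. an
    idf-style score (hypothetical values here, for illustration)."""
    inter = sum(w[f] for f in x & y)
    union = sum(w[f] for f in x | y)
    return inter / union

w = {"the": 0.1, "data": 1.0, "cleaning": 2.0, "quality": 2.0}
x = {"the", "data", "cleaning"}
y = {"the", "data", "quality"}
wjs = weighted_jaccard(x, y, w)   # (0.1 + 1.0) / (0.1 + 1.0 + 2.0 + 2.0)
```

With uniform weights this reduces to plain Jaccard (here 2/4 = 0.5), whereas the weighted score is much lower because the two sets agree only on low-weight features.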
Experiments
The main goal of our experiments is to measure the runtime performance of our algorithms, mpjoin and w-mpjoin, and compare them against previous, state-of-the-art set similarity join algorithms. We also aim at identifying the most important characteristics of the input data driving the performance of the set similarity join algorithms under study. To this end, we conduct our study under several different data distributions and configuration parameters using real and synthetic datasets.
Related work
There is a vast body of literature on performing similarity joins in vector spaces; in this context, a similarity join is a variant of the more general approach known as spatial join. See [22] for a recent survey. Indexing techniques for vector spaces are well-suited for implementing similarity joins in application domains where the objects can be described by low-dimensional feature vectors and the notion of similarity can be expressed by a distance function of the Minkowski family, such as the
Conclusion
In this paper, we proposed a new index-based algorithm for set similarity joins. Following a completely different approach compared to previous work, we focused on reducing the computational cost of candidate generation as opposed to lowering the number of candidates. To this end, we introduced the concept of min-prefix, a generalization of the prefix filtering concept, which allows us to dynamically and safely minimize the length of the inverted lists; hence, a larger number of
References (39)
- Efficient set similarity joins using min-prefixes
- Efficient exact set-similarity joins
- A primitive operator for similarity joins in data cleaning
- Integration of heterogeneous databases without common domains using queries based on textual similarity
- Approximate string joins in a database (almost) for free
- Evaluating similarity measures: a large-scale study in the Orkut social network
- On the resemblance and containment of documents
- SpotSigs: robust and efficient near duplicate detection in large web collections
- An efficient filter for approximate membership checking
- A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces
- Efficient set joins on similarity predicates
- Scaling up all pairs similarity search
- Efficient similarity joins for near duplicate detection
- Ed-Join: an efficient algorithm for similarity joins with edit distance constraints, Proceedings of the VLDB Endowment (PVLDB)
- Benchmarking declarative approximate selection predicates
- Efficient merging and filtering algorithms for approximate string searches
- Top-k set similarity joins
- Relevance weighting of search terms, Journal of the American Society for Information Science
- Efficient randomized pattern-matching algorithms, IBM Journal of Research and Development
☆ Work partially supported by CAPES/Brazil; grant BEX1129/04-0.