Information Sciences

Volume 257, 1 February 2014, Pages 32-53

iKernel: Exact indexing for support vector machines

https://doi.org/10.1016/j.ins.2013.09.017

Abstract

SVM (Support Vector Machine) is a well-established machine learning methodology popularly used for learning classification, regression, and ranking functions. In particular, SVM for rank learning has been applied to various applications, including search engines and relevance feedback systems. A ranking function F learned by SVM becomes the query in such search engines: a relevance function F is learned from the user's feedback, which expresses the user's search intention, and the top-k results are found by evaluating the entire database with F. This paper proposes an exact indexing solution for SVM function queries, that is, finding the top-k results without evaluating the entire database. Indexing for SVM faces new challenges: an index must be built on the kernel space (the SVM feature space), where (1) data points are invisible and (2) the distance function changes with queries. Consequently, neither existing top-k query processing algorithms nor existing metric-based or reference-based indexing methods are applicable. We first propose key geometric properties of the kernel space – ranking instability and ordering stability – which are crucial for building indices in the kernel space. Based on them, we develop an index structure, iKernel, and its processing algorithms. We then present clustering techniques in the kernel space to enhance the pruning effectiveness of the index. In our experiments, iKernel is highly effective overall, producing an evaluation ratio of 1–5% on large data sets.

Introduction

SVM (Support Vector Machine) is a well-established machine learning methodology popularly used for learning classification, regression, and ranking functions [22], [4], [11], [16], [15]. In particular, SVM for rank learning has been applied to various applications, including search engines and relevance feedback systems [5], [20], [18], [24], [26], [27], [25], [28]. For example, in content-based image retrieval (CBIR) systems, the user provides feedback on whether the resulting images are relevant, from which SVM learns a ranking function F; the function is then evaluated over the database to find the top-k relevant images. The query in this case is the ranking function F learned by SVM.

Researchers in the SVM community have focused on improving the accuracy or learning efficiency of SVM and have developed numerous algorithms for it, but they have paid relatively little attention to processing an SVM function to efficiently find top-k results. Processing, or testing, an SVM function on a large data set is crucial in many applications and takes a non-trivial amount of time. For example, in the retrieval systems above, learning a function F (i.e., formulating the query) is done almost instantly since the training data is typically small, while processing the query to find the top-k results requires evaluating the entire database with F.

This paper proposes exact indexing solutions for SVM function queries. Indexing for SVM faces new challenges: an index must be built on the kernel space (the SVM feature space), where (1) data points are invisible and (2) the distance function changes with queries. Because of this, existing metric-based or reference-based indexing methods [8], [2], [13] are not directly applicable. Existing top-k processing algorithms are also not applicable, as they often make restrictive assumptions about the query, such as linearity or monotonicity of the ranking function [9], [10], [3], [12], whereas the queries in our case are machine-learned ranking functions. Kernel indexing methods have also been proposed, but they find only approximate results [20], [19]. (We detail related work in Section 2.)

Ranking functions learned by SVMs are kernel ranking functions, which have a simple structure but are highly expressive in representing the user's hidden ranking. More specifically, a ranking function F(z) returns a ranking score for an instance (or vector) z such that a higher ranking score indicates higher preference or relevance. A kernel ranking function F is structured as

$$F(z) = \sum_i \alpha_i K(x_i, z) \qquad (1)$$

where the $x_i$ are instances (the support vectors in SVMs), the $\alpha_i$ are their coefficients, and the kernel K returns a similarity score in (0, 1] between $x_i$ and z. For example, for the RBF kernel, the most popularly used nonlinear kernel, $K(x_i, z) = \exp(-s\|x_i - z\|^2)$. (Background on kernel ranking functions is detailed in Section 3 (Kernel ranking function) and Section 4 (Properties of kernel ranking functions).)
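
For concreteness, here is a minimal Python sketch of evaluating such a function with the RBF kernel of Eq. (1); the names (support_vectors, alphas, s) are illustrative, not from the paper:

```python
import numpy as np

def rbf_kernel(x, z, s):
    """RBF kernel K(x, z) = exp(-s * ||x - z||^2); values lie in (0, 1]."""
    return np.exp(-s * np.sum((x - z) ** 2))

def rank_score(z, support_vectors, alphas, s):
    """Kernel ranking function F(z) = sum_i alpha_i * K(x_i, z).
    Each evaluation scans all support vectors, which is why processing
    time is dominated by kernel evaluations on large databases."""
    return sum(a * rbf_kernel(x, z, s) for a, x in zip(alphas, support_vectors))

def naive_top_k(database, support_vectors, alphas, s, k):
    """The baseline that iKernel avoids: evaluate F over the entire database."""
    scores = [rank_score(z, support_vectors, alphas, s) for z in database]
    return np.argsort(scores)[::-1][:k]
```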

Our specific goal is to quickly find the top-k instances z with the highest F(z) among all instances. Note that evaluating one nonlinear kernel ranking function F(z) typically takes non-negligible time, as it involves scanning through the support vectors. Thus, the processing time is usually dominated by the evaluation time, as we also demonstrate with experimental results in Section 7.

Toward this goal, we first observe common properties of nonlinear kernel ranking functions. Specifically, for every nonlinear kernel ranking function, the kernel K can be represented as a dot product of two vectors in a "feature space", so that a complex nonlinear ranking function in the data space becomes a linear function in the feature space. Prior literature [17] noted that data instances in the feature space are scattered on the surface of a hypersphere, and that the ranking function is represented by the normal vector to a hyperplane crossing the center of the hypersphere. The top results are the instances farthest from the hyperplane, or equivalently, nearest to the normal vector. Fig. 1 illustrates an example of data instances (points) and a ranking function (the normal to a hyperplane) in a three-dimensional feature space. On this spherical surface, the top-k processing problem translates into finding the k nearest neighbors of a query point q (i.e., the normal vector). (Details are discussed in Section 4.)
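
To make this translation concrete, the cosine of the angle between an instance and the query point can be computed from kernel values alone. The sketch below (function names are ours, and it presumes the unit-hypersphere property of Section 4) uses $w = \sum_i \alpha_i \phi(x_i)$, so that $\cos\theta(q, \phi(z)) = F(z)/\|w\|$ and ranking by F(z) is exactly ranking by angular proximity to q:

```python
import numpy as np

def normal_vector_norm(support_vectors, alphas, kernel):
    """||w||^2 = sum_i sum_j alpha_i alpha_j K(x_i, x_j), computed purely
    from kernel values -- the feature map phi is never materialized."""
    K = np.array([[kernel(xi, xj) for xj in support_vectors]
                  for xi in support_vectors])
    a = np.asarray(alphas)
    return np.sqrt(a @ K @ a)

def cos_angle_to_query(z, support_vectors, alphas, kernel, w_norm):
    """cos(theta(q, phi(z))) = F(z) / ||w||, since ||phi(z)|| = 1;
    a larger F(z) therefore means a smaller angle to the query point q."""
    F_z = sum(a * kernel(x, z) for a, x in zip(alphas, support_vectors))
    return F_z / w_norm
```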

In this feature space, however, data instances are defined only with respect to pairwise distances, i.e., angular distances between instances; their absolute feature values are "undefined" and thus cannot be visualized. (Fig. 1 shows a "hypothetical visualization" for illustration purposes only.) Thus, the well-known nearest neighbor algorithms built on indices over absolute values, such as the R-tree, X-tree, TV-tree, and SR-tree, are not applicable. Instead, we need to identify nearest neighbors in a space where distances are defined only relatively, for pairs.

Reference-based algorithms [8], [2], [13] have been studied for metric spaces where the pairwise distance function is fixed. However, our target problem poses another key challenge: since the kernel parameters change across queries, the pairwise distance function changes too! While more recent work has tackled this challenge [17], [18], the existing algorithms find only approximate top-k results. To the best of our knowledge, no existing method returns exact top-k results for a query of kernel ranking functions.

To build an exact index structure in the kernel space, where the feature values are hidden and the distance function changes with queries, we first propose key geometric properties of the kernel space – ranking instability and ordering stability – which are crucial for building indices in this space. These properties prove that although the distance function changes with queries in the kernel space, the ordering of distances does not change.
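
As a quick numerical illustration of ordering stability (our own sketch, not the paper's proof): with the RBF kernel, the feature-space angular distance $\arccos K(x, y)$ is a monotone transform of the data-space distance $\|x - y\|$ for every parameter s > 0, so the near-to-far ordering around any reference point is identical for all s:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))   # data instances
r = rng.normal(size=8)          # a reference instance

def angular_dist(x, y, s):
    """Feature-space angle recovered from the kernel value:
    theta = arccos K(x, y), valid because ||phi(.)|| = 1."""
    return np.arccos(np.exp(-s * np.sum((x - y) ** 2)))

for s in (0.1, 1.0, 10.0):
    order = np.argsort([angular_dist(r, x, s) for x in X])
    print(order[:5])  # identical ordering for every s (ordering stability)
```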

Based on these properties, we propose a novel index structure, iKernel, which is a set of "reference-based rings". The ring structures are essential for indexing in the kernel space, as they keep the orderings of instances rather than the actual distances. In other words, iKernel builds an index using only the ordering information, which is invariant to the kernel parameters and thus unchanged across queries.
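
As a rough illustration of the idea, and not the exact construction of Section 5.2, an index can store, for each reference point, only the rank order of instances by kernel value to that reference and partition the order into rings; all names below are hypothetical:

```python
import numpy as np

def build_rings(X, references, kernel, ring_size):
    """Illustrative sketch: for each reference r, sort all instances by
    K(r, x) and store only the resulting rank order, partitioned into
    contiguous 'rings'. By ordering stability these ranks are valid for
    every kernel parameter, so the index is never rebuilt per query."""
    index = {}
    for r_id, r in enumerate(references):
        sims = np.array([kernel(r, x) for x in X])
        order = np.argsort(-sims)  # most similar to r first
        index[r_id] = [order[i:i + ring_size]
                       for i in range(0, len(order), ring_size)]
    return index
```

At query time, once the kernel parameter is fixed by the query, actual distances at the ring boundaries can be computed on the fly to derive pruning bounds.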

We then propose an efficient processing algorithm that uses the index to identify the exact top-k results for a query of kernel ranking functions. We formally discuss both the correctness and the optimality of the proposed algorithm. Finally, we propose a density-based clustering technique in the kernel space to enhance the pruning effectiveness of the index. In our experiments, iKernel is highly effective overall, producing an evaluation ratio of 1–5% on large data sets, and its index is substantially smaller and faster to construct than the existing approximate solution. Maintenance for inserting and deleting instances is also inexpensive. The implementation and source code are available at http://hwanjoyu.org/ikernel.

This paper is organized as follows. We first discuss related work (Section 2). Section 3 explains fundamentals of kernel ranking functions. Section 4 presents the key geometric properties of kernel ranking functions that are crucial in building and proving the correctness of our methods. Section 5 details our proposed methods. Section 6 presents a way to utilize existing metric-based indexing algorithms such as M-tree and MVP-tree for indexing SVM. Section 7 reports experimental evaluations. Section 8 concludes our study.

Section snippets

Top-k processing algorithms

Most existing top-k query processing algorithms [9], [10], [3], [12] generally assume that the ranking function F is (1) defined over absolute attribute values and (2) monotonic in those values. Building on these assumptions, existing algorithms achieve efficiency by exploiting generic similarity index structures, such as B-trees on attributes, to selectively access the high-scoring sub-region. More recent efforts [29], [23] have focused on relaxing the monotonicity assumption to include...
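
For context, the following is a minimal sketch of the sorted-access/threshold pattern that monotone top-k algorithms in the style of [9] rely on; it is illustrative only, and its stopping rule is precisely what fails when F is a non-monotone kernel ranking function:

```python
import heapq

def threshold_top_k(sorted_lists, lookup, F, k):
    """Fagin-style threshold algorithm (illustrative sketch).
    sorted_lists[i]: object ids sorted descending by attribute i.
    lookup(obj): the full attribute vector of obj (random access).
    F: a MONOTONE aggregation function over attribute values."""
    best, seen = [], set()          # best: min-heap of top-k scores
    for depth in range(len(sorted_lists[0])):
        for lst in sorted_lists:    # one round of sorted access
            obj = lst[depth]
            if obj not in seen:
                seen.add(obj)
                heapq.heappush(best, F(lookup(obj)))
                if len(best) > k:
                    heapq.heappop(best)
        # Threshold = F of the values at the current frontier; by
        # monotonicity, no unseen object can score above it.
        t = F([lookup(lst[depth])[i] for i, lst in enumerate(sorted_lists)])
        if len(best) == k and best[0] >= t:
            break
    return sorted(best, reverse=True)
```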

Kernel ranking function

This section presents preliminary background on the kernel ranking function. Suppose there is a set of data {x}, vectors in a "data space." The kernel ranking function in Eq. (1) is composed of support vectors x and their coefficients α. The kernel K, a kind of similarity function, returns the dot product of two vectors in some "feature space," i.e.,

$$K(x_1, x_2) = \phi(x_1) \cdot \phi(x_2)$$

where φ is an implicit mapping that projects the data-space instances $x_1$ and $x_2$ into the feature space.
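
As a concrete, standard example (not specific to this paper), the homogeneous quadratic kernel $K(x_1, x_2) = (x_1 \cdot x_2)^2$ has the explicit feature map $\phi(x) = (x_i x_j)_{i,j}$, which can be checked numerically:

```python
import numpy as np

def quad_kernel(x1, x2):
    """Homogeneous quadratic kernel K(x1, x2) = (x1 . x2)^2."""
    return np.dot(x1, x2) ** 2

def phi(x):
    """Explicit feature map: phi(x) = (x_i * x_j) over all index pairs."""
    return np.outer(x, x).ravel()

x1, x2 = np.array([1.0, 2.0, 3.0]), np.array([0.5, -1.0, 2.0])
assert np.isclose(quad_kernel(x1, x2), np.dot(phi(x1), phi(x2)))
# For the RBF kernel, phi is infinite-dimensional and only implicit,
# which is why instances in the feature space are "invisible".
```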

Properties of kernel ranking functions

This section presents several important properties of kernel ranking functions and proposes lemmas that are crucial in building and proving the correctness of our iKernel.

  • 1.

    The instances projected into the feature space, although they cannot be seen, lie on the surface of a unit hypersphere, and the angle between any two instances is bounded by π/2 [19]. This is because the cosine similarity of any two instances in the feature space, which the kernel returns, is between 0 and 1, i.e., $0 < \phi(x_1) \cdot \phi(x_2) = K(x_1, x_2) \leq 1$.
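
A one-line derivation of the unit-hypersphere claim above: every kernel considered here satisfies K(x, x) = 1 (e.g., the RBF kernel gives $K(x, x) = \exp(0) = 1$), so

```latex
\|\phi(x)\|^2 = \phi(x)\cdot\phi(x) = K(x,x) = 1
\;\Longrightarrow\; \|\phi(x)\| = 1, \qquad
\cos\theta(x_1,x_2) = \phi(x_1)\cdot\phi(x_2) = K(x_1,x_2) \in (0,1]
\;\Longrightarrow\; \theta(x_1,x_2) \in [0, \pi/2).
```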

iKernel

This section first defines the problem and gives an overview of our approach (Section 5.1), and then discusses in detail (1) how to construct the index (Section 5.2), (2) how to process the top-k query of nonlinear kernel ranking functions (Section 5.3), and (3) how to cluster the data to improve the pruning ratio (Section 5.4). We then discuss the time complexity of indexing and processing, as well as the insertion and deletion operations (Section 5.5).

Metric-based indices for SVM indexing

In this section, we present a method for utilizing two metric-based indexing algorithms – the M-tree and the MVP-tree – for indexing SVM. Note that, since the distance function changes with queries, metric-based indexing methods such as the M-tree or the MVP-tree are not directly applicable, because they store distances among instances in their indices [2], [8]. However, using the ordering stability of the angular distance in the feature space, the structures of those indices can be built such that they...

Experimental results

This section evaluates the effectiveness of our methods on synthetic and real-world data sets. To fully show the usefulness of our methods, we measured the performance of iKernel in various situations, such as changing kernel parameters and inserting (deleting) instances into (from) the index. We also compared our methods with two other indexing methods: one is the existing approximate indexing method, KDX [19], and the other is a metric-based index, the M-tree [8], which is modified to...

Conclusions

This paper proposes an indexing method and a top-k processing algorithm for the ranking functions learned by SVMs, i.e., nonlinear kernel ranking functions. The key challenges in developing indexes for kernel ranking functions are that the data instances are defined only with respect to the "parameterized" pairwise distances in the feature space, so their absolute feature values are invisible, and that the real pairwise distances are determined only at query time. Our proposed method, iKernel,...

References (29)

  • F. Aurenhammer, Voronoi diagrams – a survey of a fundamental geometric data structure, ACM Computing Surveys (1991)
  • T. Bozkaya et al., Indexing large metric spaces for similarity search queries, ACM Transactions on Database Systems (1999)
  • N. Bruno, L. Gravano, A. Marian, Evaluating top-k queries over web-accessible databases, in: Proc. Int. Conf. Data...
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
  • E. Chang, S. Tong, Support vector machine active learning for image retrieval, in: ACM Int. Conf. Multimedia (MM'01),...
  • N. Chawla et al., SMOTE: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research (2002)
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000)
  • P. Ciaccia, M. Patella, P. Zezula, M-tree: an efficient access method for similarity search in metric spaces, in: Proc....
  • R. Fagin, A. Lotem, M. Naor, Optimal aggregation algorithms for middleware, in: Proc. ACM SIGACT-SIGMOD-SIGART Symp....
  • U. Guentzer, W. Balke, W. Kiessling, Optimizing multi-feature queries in image databases, in: Proc. Int. Conf. Very...
  • U. He, C. Wu, Separating theorem of samples in Banach space for support vector machine learning, International Journal...
  • S.-W. Hwang, K.C.-C. Chang, Optimizing access cost for top-k queries over web sources: a unified cost-based approach,...
  • H. Jagadish et al., iDistance: an adaptive B+-tree based indexing method for nearest neighbor search, ACM Transactions on Database Systems (2005)
  • D. Keim, Tutorial on high-dimensional index structures: database support for next decade's applications, in: Proc. Int....

A preliminary version of this paper, "Exact Indexing for Support Vector Machines", appeared in Proc. ACM SIGMOD 2011. This submission substantially extends the conference publication and contains new, major technical contributions.

This work was supported by the Mid-career Researcher Program through an NRF Grant funded by the MEST (No. NRF-2011-0016029).
