Elsevier

Information Systems

Volume 32, Issue 4, June 2007, Pages 560-574

Enabling soft queries for data retrieval

https://doi.org/10.1016/j.is.2006.02.001

Abstract

Data retrieval — finding relevant data from large databases — has become a serious problem as myriad databases have been brought online on the Web. For instance, querying the for-sale houses in Chicago from realtor.com returns thousands of matching houses. Similarly, querying “digital camera” on froogle.com returns hundreds of thousands of results. Data retrieval is essentially an online ranking problem, i.e., ranking data results according to the user's preference effectively and efficiently. This paper proposes a new rank query framework for effectively incorporating “user-friendly” rank-query formulation into “database (DB)-friendly” rank-query processing, in order to enable “soft” queries on databases. Our framework assumes, as the “back-end,” the score-based ranking model for expressive and efficient query processing. On top of the score-based model, as the “front-end,” we adopt an SVM-ranking mechanism for providing intuitive and exploratory query formulation. In essence, our framework enables users to formulate queries simply by ordering some sample objects, while learning the “DB-friendly” ranking function F from the partial orders. Such learned functions can then be processed and optimized by existing database systems. We demonstrate the efficiency and effectiveness of our framework using real-life user queries and datasets: our results show that the system effectively learns quantitative ranking functions from users' qualitative feedback with efficient online processing.

Introduction

As we move toward a digital world, information abounds everywhere—retrieving desired data thus becomes a ubiquitous challenge. In particular, with the widespread adoption of the Internet, myriad databases have been brought online, providing massive data through searchable query interfaces. (The July 2000 survey of [1] claims that there were 500 billion hidden “pages,” or data objects, in 10^5 online sources.) While our databases provide well-maintained, high-quality structured data, at this sheer scale users face the hurdle of searching and retrieving it.

This data retrieval problem — that of finding relevant data from large databases — has thus become a clear challenge. (By “retrieval,” we intend to stress the relevance-based matching, even for structured “data” — much like text retrieval for finding relevant documents.) To illustrate, Fig. 1 shows several example scenarios. Consider user Amy, who is looking for a house in Chicago. She searches realtor.com with a few constraints on city, price, beds, and baths, which returns 3581 matching houses. Similarly, when Amy searches froogle.com for “digital camera”, she is again overwhelmed by a total of 746,000 matches. She will have to sift through and sort out all these matches. Alternatively, Amy may realize that she must “narrow” her query — but at this other extreme, equally undesirable, she may get no hits at all. She will likely “oscillate” manually between these two extremes before eventually managing to complete her data retrieval task, if at all.

Relational databases offer little support for such retrieval tasks. Traditional Boolean-based query models like SQL are based on “hard” criteria (e.g., price < $100,000), while users often employ “soft” criteria for their specific senses of “relevance” or “preference.” Unlike flat Boolean results, these fuzzy criteria naturally call for ranking, to indicate how well the results match. Such ranking is essential for data retrieval: by ordering answers according to their matching “scores,” on one hand there will not be too many matches, since ranking focuses users on the best matches; on the other hand, neither will there be no hits, since ranking returns even partial matches. While such ranking has been the norm for “text” retrieval [2] (e.g., search engines like Google), it is critically missing in relational database systems for supporting similar “data” retrieval.
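
To make the contrast concrete, the following sketch compares a hard Boolean filter with a soft ranked query over a toy set of houses. The attributes and the linear scoring function F are illustrative assumptions, not taken from the paper.

```python
# Hypothetical toy table of for-sale houses.
houses = [
    {"id": 1, "price": 95_000,  "beds": 2},
    {"id": 2, "price": 105_000, "beds": 3},
    {"id": 3, "price": 250_000, "beds": 4},
]

# Hard criterion: price < 100,000 -- near-misses vanish entirely.
hard = [h["id"] for h in houses if h["price"] < 100_000]

# Soft criterion: score every house and rank; near-misses still
# appear, just lower in the list. F is a simple linear function
# with assumed weights (cheaper and more beds are preferred).
def score(h):
    return -0.00001 * h["price"] + 0.5 * h["beds"]

soft = [h["id"] for h in sorted(houses, key=score, reverse=True)]

print(hard)  # [1] -- only the exact Boolean match
print(soft)  # [2, 1, 3] -- every house, best matches first
```

House 2 misses the hard cutoff by $5000 and disappears from the Boolean result, yet ranks first under the soft query.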

To enable such soft queries for data retrieval, we observe two major barriers. First, user-friendliness: the data retrieval system should be “user-friendly,” so that ordinary users can easily express their preferences. Note that, unlike traditional data management with mostly “canned transactions” written by application developers, a data retrieval system must accommodate ordinary users, who cannot express their implicit preferences by formulating a query or function. Second, DB-friendliness: the system should be “DB-friendly,” i.e., compatible with existing relational DBMS, so that it can be executed and optimized by any DBMS. Note that data retrieval, with many interesting scenarios online, must essentially achieve responsive processing.

While there has been existing work on supporting ranking in both the database and machine learning communities (discussed in Section 6), due to their different interests, no effort has been made to enable soft queries for data retrieval. On one hand, the database community has studied rank query processing [3], [4], [5], [6]. However, these works clearly lack support for intuitively formulating the ranking in the first place, to accommodate everyday users (as Section 2 will discuss). On the other hand, the machine learning community has focused on learning or formulating rankings from examples [7], [8]. However, such ranking functions are hardly amenable to relational DBMS for efficient processing.

This paper develops “bridging” techniques between databases and machine learning, to provide systematic solutions for data retrieval. We propose a new framework such that: (1) to achieve user-friendliness, it allows users to qualitatively and intuitively express their preferences by ordering some sample objects; (2) to achieve DB-friendliness, it learns a quantitative global ranking function that is amenable to existing relational DBMS. In summary, our framework seamlessly integrates the front-end machine learner with a back-end processing engine that evaluates the learned functions.
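
The core of the front-end can be sketched as follows: the user's partial order over sample objects becomes classification examples on pairwise difference vectors, from which a linear ranking function F(x) = w·x is learned. The paper uses SVM for this step; the sketch below substitutes a simple perceptron on difference vectors for brevity, and all feature vectors are hypothetical.

```python
# Learn a linear ranking function from relative orderings by
# classifying pairwise difference vectors (a perceptron stands in
# for the SVM solver; this is a simplifying assumption).
def learn_ranking(pairs, dim, epochs=100, lr=0.1):
    """pairs: list of (preferred, other) feature vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # Misranked pair: nudge w toward the difference vector.
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# The user orders three sample objects: a > b > c (partial-order feedback).
a, b, c = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]
w = learn_ranking([(a, b), (b, c)], dim=2)

# The learned F scores any object; sorting by F reproduces the order.
F = lambda x: sum(wi * xi for wi, xi in zip(w, x))
ranked = sorted([("a", a), ("b", b), ("c", c)],
                key=lambda t: F(t[1]), reverse=True)
print([name for name, _ in ranked])  # ['a', 'b', 'c']
```

Because the learned F is a plain linear expression over attributes, it is exactly the kind of score-based ranking a relational back-end can evaluate.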

The new contributions of this paper are summarized as follows.

  • We develop the duality of the ranking and classification views in Section 3.1, in order to connect the “user-friendly” query formulation (i.e., learning a ranking from relative orderings) with the “DB-friendly” query processing (i.e., processing a ranking from absolute scores).

  • We provide an intuitive interpretation of the SVM ranking solution [8], by using the duality and presenting Corollaries 1 and 2 and Remark 1 in Section 3.2.2.

  • We develop a top sampling method that (1) provides an “exploratory” interface to users; (2) further enhances SVM performance for ranking; and (3) is efficiently expressed in SQL and thus facilitates integration with an RDBMS. We experimentally show that the top sampling method is efficient and reduces the amount of user feedback needed to achieve high accuracy.
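
As an illustration of this DB-friendliness, a learned linear F is just an arithmetic expression over columns, so retrieving the top-k objects can be phrased as ORDER BY … LIMIT in plain SQL. The table, columns, weights, and the use of SQLite below are assumptions for the sketch, not the paper's actual setup.

```python
# Push a learned linear ranking function into SQL so the DBMS
# evaluates it and returns the top-k sample (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER, price REAL, beds REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?, ?)",
                 [(1, 95000, 2), (2, 105000, 3), (3, 250000, 4)])

w_price, w_beds = -0.00001, 0.5   # assumed weights of a learned F
k = 2                             # sample the top-k for user feedback

# F is an arithmetic expression, so ORDER BY ... LIMIT k suffices.
top = conn.execute(
    "SELECT id FROM houses "
    "ORDER BY (? * price + ? * beds) DESC LIMIT ?",
    (w_price, w_beds, k)).fetchall()
print([row[0] for row in top])  # [2, 1]
```

Because the query is ordinary SQL, the DBMS's existing optimizer and indexes apply to it unchanged.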

We motivate and describe the architecture of our framework (Section 2) and present the component techniques (Section 3). We demonstrate the efficiency and effectiveness of our framework using real-life queries and data sets (Section 4). We discuss valuable lessons we learned from our user study and further challenges to build the data retrieval-integrated relational system (Section 5). We discuss related work in Section 6.

Section snippets

Overview: Bridging rank formulation and processing

This section motivates and introduces our approach—our goal is to seamlessly integrate user-friendly rank formulation with DB-friendly rank processing. As Section 1 explained, such a “mix” is critical for enabling soft queries for data retrieval.

The RankFP framework: enabling rank formulation and processing online

In this section, we present the techniques for realizing the RankFP framework (Fig. 2). First, Section 3.1 develops how we “connect” the score-based ranking view, which is effective for the processing back-end, with the classification view, which is effective for the learning front-end. Second, Section 3.2 investigates SVM as the learning machine (Step 3a in Fig. 3). Finally, Section 3.3 develops techniques to enable rank formulation and processing to be “online”, e.g., selective sampling for

Experimental evaluation

This section reports our extensive experiments for studying the usability (or “user-friendliness”) and efficiency (or “DB-friendliness”) of our RankFP framework. First, for usability, we used Kendall's τ measure [9], [10], [11], which is widely used to measure the similarity of two orderings, i.e., the ideal ordering R* and the ordering RF generated by our system. Second, for efficiency, we measured absolute response time. Our experiments were conducted on a Pentium 4 2 GHz PC with 1 GB RAM.
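
For reference, Kendall's τ compares two orderings by counting concordant versus discordant pairs; the toy implementation below (which ignores tie handling for brevity) illustrates the measure on hypothetical orderings.

```python
# Kendall's tau over two orderings of the same items: +1 for
# identical orderings, -1 for fully reversed ones.
from itertools import combinations

def kendall_tau(r1, r2):
    """r1, r2: lists ranking the same items, best first."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    conc = disc = 0
    for x, y in combinations(r1, 2):
        # The pair is concordant if both orderings agree on x vs y.
        if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / (conc + disc)

print(kendall_tau(["a", "b", "c"], ["a", "b", "c"]))  # 1.0
print(kendall_tau(["a", "b", "c"], ["c", "b", "a"]))  # -1.0
```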

Discussion

Integration with relational database systems: Recently, there have been research efforts in processing rank queries in a relational context. References [5], [19] have proposed rank processing as a layer “on top of” relational databases—exploiting histograms [5] and materialized views [19], respectively, for efficient rank processing. However, these works cannot be adopted as a processing back-end for our framework: first, they rely on the assumption that the ranking function is a k-nearest

Related work

As our framework consists of rank learning and rank processing, this section discusses the state of the art in each topic area.

Rank learning: For the “user-friendly” rank formulation, we adopt a machine learning approach, in particular SVM [12], to learn a quantitative ranking function from qualitative feedback. SVM has proven highly effective in classification [12], [13], [14]. Ref. [8] developed an SVM ordinal regression method, and Ref. [10] applied it to optimizing search engines. We

Conclusion

This paper proposes a new data retrieval framework which incorporates a user-friendly rank formulation into DB-friendly query processing. In particular, SVM techniques are adopted as the front-end to build an intuitive rank formulation which is compatible and integrated with the back-end score-based query processing. The experiments on a real-estate data set show promising results: the data retrieval system effectively learns quantitative ranking functions from the users' qualitative feedback.

References (29)

  • BrightPlanet.com, The deep web: Surfacing hidden value, accessible at http://brightplanet.com/technology/deepweb.asp...
  • G. Salton, Automatic Text Processing, Addison-Wesley, Reading, MA,...
  • M.J. Carey, D. Kossmann, On saying “enough already!” in SQL, in: Proceedings of the ACM SIGMOD International Conference...
  • K.C.-C. Chang, S.-W. Hwang, Minimal probing: supporting expensive predicates for top-k queries, SIGMOD...
  • S. Chaudhuri, L. Gravano, Evaluating top-k selection queries, in: Proceedings of the International Conference on Very...
  • R. Fagin, A. Lotem, M. Naor, Optimal aggregation algorithms for middleware, in: Proceedings of ACM SIGACT-SIGMOD-SIGART...
  • W.W. Cohen, R.E. Schapire, Y. Singer, Learning to order things, in: Proceedings of Advances in Neural Information...
  • M. Kendall, Rank Correlation Methods, Hafner,...
  • T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the ACM SIGKDD International...
  • A.M. Mood et al., Introduction to the Theory of Statistics (1974)
  • V.N. Vapnik, Statistical Learning Theory (1998)
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000)