Elsevier

Information Systems

Volume 32, Issue 4, June 2007, Pages 560-574

Enabling soft queries for data retrieval

https://doi.org/10.1016/j.is.2006.02.001

Abstract

Data retrieval — finding relevant data from large databases — has become a serious problem as myriad databases have been brought online on the Web. For instance, querying the for-sale houses in Chicago from realtor.com returns thousands of matching houses. Similarly, querying “digital camera” on froogle.com returns hundreds of thousands of results. Data retrieval is essentially an online ranking problem, i.e., ranking data results according to the user's preference effectively and efficiently. This paper proposes a new rank query framework for effectively incorporating “user-friendly” rank-query formulation into “database (DB)-friendly” rank-query processing, in order to enable “soft” queries on databases. Our framework assumes, as the “back-end,” the score-based ranking model for expressive and efficient query processing. On top of the score-based model, as the “front-end,” we adopt an SVM-ranking mechanism for providing intuitive and exploratory query formulation. In essence, our framework enables users to formulate queries simply by ordering some sample objects, while learning the “DB-friendly” ranking function F from the partial orders. Such learned functions can then be processed and optimized by existing database systems. We demonstrate the efficiency and effectiveness of our framework using real-life user queries and datasets: our results show that the system effectively learns quantitative ranking functions from users' qualitative feedback with efficient online processing.

Introduction

As we move toward a digital world, information abounds everywhere—retrieving desired data thus becomes a ubiquitous challenge. In particular, with the widespread adoption of the Internet, myriad databases have been brought online, providing massive data through searchable query interfaces. (The July 2000 survey of [1] claims that there were 500 billion hidden “pages,” or data objects, in 10^5 online sources.) While our databases provide well-maintained, high-quality structured data, at this sheer scale users face the hurdle of searching and retrieving it.

This data retrieval problem — that of finding relevant data from large databases — has thus become a clear challenge. (By “retrieval,” we intend to stress the relevance-based matching, even for structured “data” — much like text retrieval for finding relevant documents.) To illustrate, Fig. 1 shows several example scenarios. Consider user Amy, who is looking for a house in Chicago. She searches realtor.com with a few constraints on city, price, beds, and baths, which returns 3581 matching houses. Similarly, when Amy searches froogle.com for “digital camera”, she is again overwhelmed by a total of 746,000 matches. She will have to sift through and sort out all these matches. Alternatively, Amy may realize that she must “narrow” her query — but at this other extreme, equally undesirable, she may get no hits at all. She will likely “oscillate” manually between these two extremes before eventually managing to complete her data retrieval task, if at all.

Relational databases offer little support for such retrieval tasks. Traditional Boolean-based query models like SQL are based on “hard” criteria (e.g., price < $100,000), while users often employ “soft” criteria for their specific senses of “relevance” or “preference.” Unlike flat Boolean results, these fuzzy criteria naturally call for ranking, to indicate how well the results match. Such ranking is essential for data retrieval: by ordering answers according to their matching “scores,” on one hand there will not be too many matches, since ranking focuses users on the best matches; on the other hand, neither will there be no hits, since ranking returns even partial matches. While such ranking has been the norm for “text” retrieval [2] (e.g., search engines like Google), it is critically missing in relational database systems for supporting similar “data” retrieval.
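
To make the contrast concrete, the following sketch compares a hard Boolean filter with a soft ranked query over a toy set of houses. The attributes and the linear scoring function F are illustrative assumptions, not taken from the paper.

```python
# Hypothetical toy table of for-sale houses.
houses = [
    {"id": 1, "price": 95_000,  "beds": 2},
    {"id": 2, "price": 105_000, "beds": 3},
    {"id": 3, "price": 250_000, "beds": 4},
]

# Hard criterion: price < 100,000 -- near-misses vanish entirely.
hard = [h["id"] for h in houses if h["price"] < 100_000]

# Soft criterion: score every house and rank; near-misses still
# appear, just lower in the list. F is a simple linear function
# with assumed weights (cheaper and more beds are preferred).
def score(h):
    return -0.00001 * h["price"] + 0.5 * h["beds"]

soft = [h["id"] for h in sorted(houses, key=score, reverse=True)]

print(hard)  # [1] -- only the exact Boolean match
print(soft)  # [2, 1, 3] -- every house, best matches first
```

House 2 misses the hard cutoff by $5000 and disappears from the Boolean result, yet ranks first under the soft query.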

To enable such soft queries for data retrieval, we observe two major barriers. First, user-friendliness: the data retrieval system should be “user-friendly,” so that ordinary users can easily express their preferences. Note that, unlike traditional data management with mostly “canned transactions” written by application developers, a data retrieval system must accommodate ordinary users, who cannot express their implicit preferences by formulating a query or function. Second, DB-friendliness: the system should be “DB-friendly,” i.e., compatible with existing relational DBMS, so that it can be executed and optimized by any DBMS. Note that data retrieval, with many interesting scenarios online, must essentially achieve responsive processing.

While there has been existing work on supporting ranking in both the database and machine learning communities (discussed in Section 6), due to their different interests, no effort has been made to enable soft queries for data retrieval. On one hand, the database community has studied rank query processing [3], [4], [5], [6]. However, these works clearly lack support for intuitively formulating the ranking in the first place, to accommodate everyday users (as Section 2 will discuss). On the other hand, the machine learning community has focused on learning or formulating rankings from examples [7], [8]. However, such ranking functions are hardly amenable to relational DBMS for efficient processing.

This paper develops “bridging” techniques between databases and machine learning, to provide systematic solutions for data retrieval. We propose a new framework such that: (1) to achieve user-friendliness, it allows users to qualitatively and intuitively express their preferences by ordering some sample objects; (2) to achieve DB-friendliness, it learns a quantitative global ranking function that is amenable to existing relational DBMS. In summary, our framework seamlessly integrates the front-end machine learner with a back-end processing engine that evaluates the learned functions.
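
The core of the front-end can be sketched as follows: the user's partial order over sample objects becomes classification examples on pairwise difference vectors, from which a linear ranking function F(x) = w·x is learned. The paper uses SVM for this step; the sketch below substitutes a simple perceptron on difference vectors for brevity, and all feature vectors are hypothetical.

```python
# Learn a linear ranking function from relative orderings by
# classifying pairwise difference vectors (a perceptron stands in
# for the SVM solver; this is a simplifying assumption).
def learn_ranking(pairs, dim, epochs=100, lr=0.1):
    """pairs: list of (preferred, other) feature vectors."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            # Misranked pair: nudge w toward the difference vector.
            if sum(wi * di for wi, di in zip(w, diff)) <= 0:
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

# The user orders three sample objects: a > b > c (partial-order feedback).
a, b, c = [1.0, 0.0], [0.5, 0.5], [0.0, 1.0]
w = learn_ranking([(a, b), (b, c)], dim=2)

# The learned F scores any object; sorting by F reproduces the order.
F = lambda x: sum(wi * xi for wi, xi in zip(w, x))
ranked = sorted([("a", a), ("b", b), ("c", c)],
                key=lambda t: F(t[1]), reverse=True)
print([name for name, _ in ranked])  # ['a', 'b', 'c']
```

Because the learned F is a plain linear expression over attributes, it is exactly the kind of score-based ranking a relational back-end can evaluate.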

The new contributions of this paper are summarized as follows.

  • We develop the duality of the ranking and classification views in Section 3.1, in order to connect the “user-friendly” query formulation (i.e., learning a ranking from relative orderings) with the “DB-friendly” query processing (i.e., processing a ranking from absolute scores).

  • We provide an intuitive interpretation of the SVM ranking solution [8], by using the duality and presenting Corollaries 1 and 2 and Remark 1 in Section 3.2.2.

  • We develop a top sampling method that (1) provides an “exploratory” interface to users; (2) further enhances SVM performance for ranking; and (3) is efficiently expressed in SQL and thus facilitates integration with an RDBMS. We experimentally show that the top sampling method is efficient and reduces the amount of user feedback needed to achieve high accuracy.
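
As an illustration of this DB-friendliness, a learned linear F is just an arithmetic expression over columns, so retrieving the top-k objects can be phrased as ORDER BY … LIMIT in plain SQL. The table, columns, weights, and the use of SQLite below are assumptions for the sketch, not the paper's actual setup.

```python
# Push a learned linear ranking function into SQL so the DBMS
# evaluates it and returns the top-k sample (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE houses (id INTEGER, price REAL, beds REAL)")
conn.executemany("INSERT INTO houses VALUES (?, ?, ?)",
                 [(1, 95000, 2), (2, 105000, 3), (3, 250000, 4)])

w_price, w_beds = -0.00001, 0.5   # assumed weights of a learned F
k = 2                             # sample the top-k for user feedback

# F is an arithmetic expression, so ORDER BY ... LIMIT k suffices.
top = conn.execute(
    "SELECT id FROM houses "
    "ORDER BY (? * price + ? * beds) DESC LIMIT ?",
    (w_price, w_beds, k)).fetchall()
print([row[0] for row in top])  # [2, 1]
```

Because the query is ordinary SQL, the DBMS's existing optimizer and indexes apply to it unchanged.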

We motivate and describe the architecture of our framework (Section 2) and present the component techniques (Section 3). We demonstrate the efficiency and effectiveness of our framework using real-life queries and data sets (Section 4). We discuss valuable lessons we learned from our user study and further challenges to build the data retrieval-integrated relational system (Section 5). We discuss related work in Section 6.

Section snippets

Overview: Bridging rank formulation and processing

This section motivates and introduces our approach—our goal is to seamlessly integrate user-friendly rank formulation with DB-friendly rank processing. As Section 1 explained, such a “mix” is critical for enabling soft queries for data retrieval.

The RankFP framework: enabling rank formulation and processing online

In this section, we present the techniques for realizing the RankFP framework (Fig. 2). First, Section 3.1 develops how we “connect” the score-based ranking view, which is effective for the processing back-end, with the classification view, which is effective for the learning front-end. Second, Section 3.2 investigates SVM as the learning machine (Step 3a in Fig. 3). Finally, Section 3.3 develops techniques to enable rank formulation and processing to be “online”, e.g., selective sampling for

Experimental evaluation

This section reports our extensive experiments for studying the usability (or “user-friendliness”) and efficiency (or “DB-friendliness”) of our RankFP framework. First, for usability, we used Kendall's τ measure [9], [10], [11], which is widely used to measure the similarity of two orderings, i.e., the ideal ordering R* and the ordering RF generated by our system. Second, for efficiency, we measured absolute response time. Our experiments were conducted on a Pentium 4 2 GHz PC with 1 GB RAM.
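
For reference, Kendall's τ compares two orderings by counting concordant versus discordant pairs; the toy implementation below (which ignores tie handling for brevity) illustrates the measure on hypothetical orderings.

```python
# Kendall's tau over two orderings of the same items: +1 for
# identical orderings, -1 for fully reversed ones.
from itertools import combinations

def kendall_tau(r1, r2):
    """r1, r2: lists ranking the same items, best first."""
    pos1 = {x: i for i, x in enumerate(r1)}
    pos2 = {x: i for i, x in enumerate(r2)}
    conc = disc = 0
    for x, y in combinations(r1, 2):
        # The pair is concordant if both orderings agree on x vs y.
        if (pos1[x] - pos1[y]) * (pos2[x] - pos2[y]) > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / (conc + disc)

print(kendall_tau(["a", "b", "c"], ["a", "b", "c"]))  # 1.0
print(kendall_tau(["a", "b", "c"], ["c", "b", "a"]))  # -1.0
```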

Discussion

Integration with relational database systems: Recently, there have been research efforts in processing rank queries in a relational context. References [5], [19] have proposed rank processing as a layer “on top of” relational databases—exploiting histograms [5] and materialized views [19], respectively, for efficient rank processing. However, these works cannot be adopted as a processing back-end for our framework: first, they rely on the assumption that the ranking function is a k-nearest

Related work

As our framework consists of rank learning and rank processing, this section discusses the state of the art in each topic area.

Rank learning: For the “user-friendly” rank formulation, we adopt a machine learning approach, in particular SVM [12], to learn a quantitative ranking function from qualitative feedback. SVM has proven highly effective in classification [12], [13], [14]. Ref. [8] developed an SVM ordinal regression method, and Ref. [10] applied it to optimizing search engines. We

Conclusion

This paper proposes a new data retrieval framework which incorporates a user-friendly rank formulation into DB-friendly query processing. In particular, SVM techniques are adopted as the front-end to build an intuitive rank formulation which is compatible and integrated with the back-end score-based query processing. The experiments on a real-estate data set show promising results: the data retrieval system effectively learns quantitative ranking functions from the users' qualitative feedback.

References (29)

  • BrightPlanet.com, The deep web: Surfacing hidden value, accessible at http://brightplanet.com/technology/deepweb.asp...
  • G. Salton, Automatic Text Processing, Addison-Wesley, Reading, MA,...
  • M.J. Carey, D. Kossmann, On saying “enough already!” in SQL, in: Proceedings of the ACM SIGMOD International Conference...
  • K.C.-C. Chang, S.-W. Hwang, Minimal probing: supporting expensive predicates for top-k queries, SIGMOD...
  • S. Chaudhuri, L. Gravano, Evaluating top-k selection queries, in: Proceedings of the International Conference on Very...
  • R. Fagin, A. Lotem, M. Naor, Optimal aggregation algorithms for middleware, in: Proceedings of ACM SIGACT-SIGMOD-SIGART...
  • W.W. Cohen, R.E. Schapire, Y. Singer, Learning to order things, in: Proceedings of Advances in Neural Information...
  • M. Kendall, Rank Correlation Methods, Hafner,...
  • T. Joachims, Optimizing search engines using clickthrough data, in: Proceedings of the ACM SIGKDD International...
  • A.M. Mood et al., Introduction to the Theory of Statistics (1974)
  • V.N. Vapnik, Statistical Learning Theory (1998)
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Data Mining and Knowledge Discovery (1998)
  • N. Cristianini et al., An Introduction to Support Vector Machines and Other Kernel-based Learning Methods (2000)