Elsevier

Information Sciences

Volume 453, July 2018, Pages 198-215

On pricing approximate queries

https://doi.org/10.1016/j.ins.2018.04.036

Abstract

Nowadays, data is bought and sold in data markets, where prices are usually determined by vendors. In a data market, aggregate queries over big data sets are often expensive in terms of computing time, machine resources and data prices. As a result, consumers sometimes cannot obtain query results within their preferred deadlines or budgets. Answering aggregate queries approximately, with reasonable accuracy, is a key technique for addressing this problem. However, there is no mature theory for pricing approximate aggregate queries.

In this paper, we first propose a novel theoretical framework to support pricing approximate aggregate queries. Using a sampling technique to obtain error-bounded approximate answers to data queries, we provide a transform function that converts the original pricing function into one that supports approximate aggregate queries. We further adopt a statistical method to estimate consumers’ payments. The proposed transform function preserves the arbitrage-free property. We implement a prototype system, and experiments comparing our framework with two benchmark pricing methods show that our pricing method is more suitable for pricing approximate aggregate queries.

Introduction

The value of information for data analysis is at the core of business intelligence [9]. In an Internet enterprise, extracting and analyzing click streams or search logs helps the company’s mission-critical businesses improve service quality, find future revenue growth opportunities, monitor trends, and detect root causes of live-site events in a timely fashion [44]. For example, the stock market can be predicted by analyzing social media data [10], and so can a presidential election, using social media data from Twitter [16].

Data markets, where data owners can sell and consumers can buy data, are emerging on the Internet, such as Azure Marketplace [5] and BDEX [8]. Some data markets even provide analytical services, for instance, Qlik Analytics Platform [36] and infochimps [25]. In all these data markets, data prices are static and non-negotiable, and consumers get access to the best data accuracy that providers can achieve. Meanwhile, the commercialization of data is attracting growing research attention, for example on data pricing [19] and cloud-based query optimization [46]. In data markets, the most important rule for data pricing is that prices be arbitrage-free, which means consumers must pay for all the information they obtain, whether directly asked for or deduced from previous queries [27]. Therefore, the price of a query is non-decreasing in the number of records that contribute to the answer. However, aggregate queries often involve a large number of records, which makes them expensive.

Although aggregate queries over big data are often processed in a distributed system that scales up to thousands of nodes for performance improvement, aggregate queries are still expensive to compute [44]. The basic reason is that aggregate query processing may consume the entire data set, which is computationally expensive even for large-scale systems. As a result, aggregate queries may exhaust computation resources and take a long time to compute accurate answers. In certain scenarios, the answer to a query is valuable to consumers only for a short duration. For example, a query result used to predict the outcome of a presidential election is useless once the election is over. Existing data markets and research focus on pricing exact queries while completely ignoring consumers’ timeliness requirements.

To address this problem, we apply the approximate aggregate query technique [21], [22]. Aggregate queries are evaluated on a small, randomly sampled data set and return error-bounded approximate answers. Just as other user-friendly queries, such as personalized music retrieval [14], [15], can adapt to user-specific information, an approximate aggregate query can honor the error bounds requested by consumers, that is, data ranges that contain the real value with high probability. The error-bounded results are often sufficient to meet consumers’ requirements under explicit deadlines. Since approximate query answers are not exact, adjusting the price according to the uncertainty is crucial for pricing approximate queries. Generally, the larger the error bound, the lower the price. For consumers for whom approximate answers suffice, the discounted price may encourage them to purchase the service. For instance, a video manufacturer wants to know the ratings of different video categories to guide its next shooting investment plan; the manufacturer does not need to know the exact score for each category. Supporting approximate queries in the data market will attract potential consumers through cost savings. Selling approximate aggregate queries provides a different business model that reduces the computational and monetary cost for data consumers while increasing the revenue for data sellers (as they can execute and sell more approximate data queries on the same computational resources).
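The idea of an error-bounded answer computed on a random sample can be illustrated with a minimal sketch. It is not the estimator used in the paper: it assumes an AVG query, simple random sampling without replacement, and a normal-approximation confidence interval. The function name `approximate_mean`, the synthetic `ratings` data and the chosen sample size are all hypothetical.

```python
import math
import random
import statistics


def approximate_mean(values, sample_size, z=1.96, seed=None):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, half_width). With z = 1.96 the interval
    estimate +/- half_width is an approximate 95% confidence interval
    (normal approximation; an assumption, not the paper's method).
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(sample_size)
    return estimate, z * std_err


# Hypothetical ratings table: 100,000 scores between 1 and 10.
data_rng = random.Random(0)
ratings = [data_rng.uniform(1, 10) for _ in range(100_000)]

estimate, half_width = approximate_mean(ratings, sample_size=2_000, seed=1)
exact = statistics.fmean(ratings)
print(f"approximate: {estimate:.3f} +/- {half_width:.3f}, exact: {exact:.3f}")
```

In this sketch only 2% of the records are read, and the reported half-width plays the role of the error bound that a consumer would accept in exchange for a lower price.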

There are several challenges in pricing approximate queries. Firstly, when a data market provides an approximate aggregate query service, the pricing function should be arbitrage-free, which means that a smart buyer cannot obtain information for less than the advertised price. Secondly, formulating a new pricing function for approximate aggregate queries requires additional work from sellers, and the coexistence of two pricing functions makes it inconvenient for consumers to estimate query payments. Therefore, the key principle for pricing approximate aggregate queries is to convert the existing arbitrage-free pricing function so that it supports the new approximate feature while keeping the arbitrage-free guarantee. Thirdly, the prices of approximate aggregate queries should decrease as error bounds grow. Since the value (accuracy) of information is not linear in the number of records that contribute to the answer, simply charging for the data used is not appropriate for either data consumers or providers. Therefore, defining the relationship between prices and the data used for bounded-error aggregate queries is quite complex.
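The non-linearity between accuracy and the number of records can be made concrete with a standard textbook relation, given here only as an illustration and not as the paper's derivation. Assume an AVG query estimated from n independently sampled records with standard deviation \sigma, and a normal approximation with quantile z_{1-\alpha/2} for confidence 1-\alpha; all symbols are assumptions introduced for this sketch.

```latex
\varepsilon = z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}
\quad\Longrightarrow\quad
n = \left(\frac{z_{1-\alpha/2}\,\sigma}{\varepsilon}\right)^{2}
```

Under these assumptions, halving the error bound \varepsilon requires four times as many records, so a price proportional to the number of records used would not track the accuracy that the consumer actually receives.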

In this work, we focus on the information value of data, which means that we do not include the computational cost and response time in our pricing function. With approximation techniques, the computational cost and response time decrease significantly, and we illustrate the reduction for different confidences and error bounds in Section 5.

The scenario is outlined in Fig. 1. In our framework, sellers do the same work as in existing data markets, i.e., they entrust data to the data market, provide price points as input and earn the revenue computed by the data market. If buyers want to query data with a bounded error, the only necessary change is to submit queries with a defined error bound and confidence. If buyers want to purchase exact queries instead, they can submit the query with zero error. On the data market side, upon receiving a data query, the market samples a subset of the data (for an exact query, the subset is the whole data set) for query execution and charges consumers for the information value.
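The market-side step described above, turning a requested error bound and confidence into a sampled subset, can be sketched as follows. This is a hedged illustration rather than the paper's procedure: it assumes an AVG query over values confined to a known range and sizes the sample with Hoeffding's inequality, which is a conservative choice. The names `required_sample_size` and `answer_avg_query` and the synthetic data are hypothetical.

```python
import math
import random


def required_sample_size(error_bound, confidence, value_range, population_size):
    """Sample size for estimating a mean within +/- error_bound.

    Uses Hoeffding's inequality for values confined to an interval of
    width `value_range` (an assumption of this sketch):
        n >= value_range^2 * ln(2 / (1 - confidence)) / (2 * error_bound^2)
    A zero error bound (or confidence of 1) means an exact query,
    so the whole data set is used.
    """
    if error_bound == 0 or confidence >= 1.0:
        return population_size
    delta = 1.0 - confidence
    n = math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * error_bound ** 2))
    return min(n, population_size)


def answer_avg_query(values, error_bound, confidence, value_range, seed=None):
    """Answer AVG on a random subset sized for the request.

    Returns (answer, number_of_records_used).
    """
    n = required_sample_size(error_bound, confidence, value_range, len(values))
    subset = values if n == len(values) else random.Random(seed).sample(values, n)
    return sum(subset) / len(subset), n


# Hypothetical data set: 100,000 ratings between 1 and 10.
rng = random.Random(0)
ratings = [rng.uniform(1, 10) for _ in range(100_000)]

approx, used = answer_avg_query(ratings, 0.1, 0.95, 9.0, seed=1)
exact, all_used = answer_avg_query(ratings, 0.0, 0.95, 9.0)
print(f"approximate answer {approx:.3f} from {used} records")
print(f"exact answer {exact:.3f} from {all_used} records")
```

Treating the exact query as the special case of zero error keeps a single code path for both kinds of request, which mirrors the uniform treatment of exact and approximate queries in the framework.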

To better illustrate our framework, we provide an intuitive example. A data seller conducts a questionnaire survey on factors that affect movie ratings, such as genre and schedule. The seller entrusts this survey data to the data market and sets a high price. An entertainment firm plans to invest in a specific kind of movie and wants to access the survey to guide the shooting. To reduce the cost, the firm can query only the specific genre. To reduce the cost further, the firm can query the approximate rating for different factors. From the seller’s perspective, price arbitrage, in which the buyer buys cheaper data and uses it to compute answers to expensive queries, is undesirable. This work provides a method that prices approximate queries and remains arbitrage-free.
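One way a buyer could exploit cheap approximate answers is to purchase several of them and average the results. The sketch below checks a pricing function against this single attack only. It is not the paper's pricing function or its arbitrage-free proof: it assumes that prices depend on the variance of an unbiased answer and that repeated purchases are independent. The function `has_averaging_arbitrage` and both example price functions are hypothetical.

```python
def has_averaging_arbitrage(price, variance, max_copies=10):
    """Return True if buying k noisy answers beats the listed price.

    Averaging k independent unbiased answers, each with variance
    `variance`, yields an answer with variance variance / k. Arbitrage
    exists if the k purchases together cost strictly less than the
    price listed for that lower variance.
    """
    for k in range(2, max_copies + 1):
        cost_of_copies = k * price(variance)
        listed_price = price(variance / k)
        # Small relative tolerance so float rounding is not reported as arbitrage.
        if cost_of_copies < listed_price * (1 - 1e-9):
            return True
    return False


# Two hypothetical pricing functions of the answer variance v.
def inverse_price(v):
    return 100.0 / v        # price doubles when variance halves


def steep_price(v):
    return 100.0 / v ** 2   # price rises faster than accuracy improves


print(has_averaging_arbitrage(inverse_price, variance=4.0))  # False
print(has_averaging_arbitrage(steep_price, variance=4.0))    # True
```

Passing this check does not establish that a pricing function is arbitrage-free in general, since it only rules out the averaging strategy for the tested variance.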

The main contributions of the paper are summarized as follows:

  • We introduce the query language used in this paper, i.e., linear queries, and extend the basic pricing concepts to support approximate aggregate queries, which use the sampling technique to provide error bound guarantees.

  • We present a basic pricing function for approximate aggregate queries, generated from price points, i.e., a set of <query, price> pairs given by sellers. With this method, exact and approximate queries need to be priced separately, which may be inconvenient for both sellers and consumers.

  • We propose a novel framework to sell both exact and approximate answers by transforming the existing data pricing schema. In our framework, exact and approximate queries are priced uniformly. The arbitrage-free property of the original pricing function is also preserved.

  • We implement a prototype of the pricing framework. Taking the relationship between accuracy loss and price discount as the metric, a comparison of our framework with two benchmark pricing methods and a case study show that our method is more suitable for pricing approximate queries. We also demonstrate the improvement in execution time and the decrease in price for different error bounds.

The remainder of the paper is organized as follows. Section 2 introduces preliminaries on query pricing and data sampling. In Section 3 we present a method to generate an arbitrage-free pricing function for approximate aggregate queries from seller-defined price points. The framework that transforms the existing pricing function to support approximate aggregate queries is introduced in Section 4. Section 5 reports some simulation results. Section 6 discusses related work. Finally, Section 7 concludes the paper and points to some future work.

Preliminaries

In this section, we introduce basic concepts and problem models for sampling and query pricing, and the extensions needed to price approximate aggregate queries. As the approximation approach is not the focus of this paper, we simply employ the random sampling technique for this purpose.

Basic pricing function

Since there may be an unlimited number of query types for a database instance, it is not possible for sellers to set a price for each query. For a flexible data market that supports ad-hoc queries, the query price should be derived automatically from a finite number of priced views. In this section, we give a basic way to generate an arbitrage-free pricing function from explicit price points; this pricing function is required to support approximate aggregate queries.

Enhanced pricing function

Generating an arbitrage-free pricing function for approximate aggregate queries from price points may not be widely accepted by data markets and their users (both buyers and sellers), because of the substantial extra work and the price confusion it causes. Therefore, finding a method that requires less work from all participants in data trading is important for popularizing the new approximate aggregate query feature.

In this section, we introduce a framework about how to translate the existing pricing

Simulation study

In this section, we evaluate the enhanced pricing function for approximate aggregate queries transformed from the existing one, presented in Section 4.

Related work

Answers to approximate queries can be generated by sampling approaches (such as stratified sampling [1], [2], [3], [6], [13], [44]), online aggregation [18], [24], [34], synopses (wavelets [21], sketches [26], histograms [35], etc.) and lossless summaries (e.g. materialized views [12], data cubes [35]). As the approximation approach is not the focus of this paper, we simply employ the random sampling technique for this purpose. We first integrate sampling technique with query-pricing to

Conclusion and future work

Aggregate queries are expensive in both price and execution time. Approximate answers to aggregate queries are sometimes acceptable to consumers. However, existing data markets do not support approximate aggregate queries. In this paper, we provide a novel framework for data markets to sell both exact and approximate aggregate queries. Sellers do not need to do any extra work to enable this feature. Buyers can purchase any aggregate queries with any error bounds and confidences to meet their

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC) (grant no. 61772228), the National Key Research and Development Program of China (grant nos. 2016YFB0201503 and 2016YFB0701101), the Major Special Research Project of the Science and Technology Department of Jilin Province (20160203008GX), the Jilin Scientific and Technological Development Program (20170520066JH) and the Graduate Innovation Fund of Jilin University (2017069).

References (47)

  • J.Y. Chang et al., Extended conditions for answering an aggregate query using materialized views, Inf. Process. Lett. (1999)
  • S. Acharya et al., Congressional samples for approximate answering of group-by queries, ACM SIGMOD Rec. (2000)
  • S. Acharya et al., The Aqua approximate query answering system, ACM SIGMOD Rec. (1999)
  • S. Agarwal et al., BlinkDB: queries with bounded errors and bounded response times on very large data, ACM European Conference on Computer Systems (2012)
  • Y. Amsterdamer et al., Provenance for aggregate queries, Comput. Sci. (2011)
  • M. Azure, Azure Marketplace, 2016, (https://datamarket.azure.com/). [Online; accessed...
  • B. Babcock et al., Dynamic sample selection for approximate query processing, ACM SIGMOD International Conference on Management of Data (2003)
  • B. Barak et al., Privacy, accuracy, and consistency too: a holistic solution to contingency table release, Twenty-Sixth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China (2007)
  • BDEX, BDEX, Data Exchange Platform, 2016, (http://www.bigdataexchange.com/). [Online; accessed...
  • M.J.A. Berry et al., Data Mining Techniques: For Marketing, Sales, and Customer Support (1997)
  • J. Bollen et al., Twitter mood predicts the stock market, Comput. Sci. (2010)
  • S. Boyd et al., Convex Optimization (2004)
  • S. Chaudhuri et al., Optimized stratified sampling for approximate query processing, ACM Trans. Database Syst. (2007)
  • Z. Cheng et al., On effective personalized music retrieval by exploring online user behaviors, Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval (2016)
  • Z. Cheng et al., Exploring user-specific information in music retrieval, Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (2017)
  • M. Choy et al., US presidential election 2012 prediction using census corrected Twitter model, Comput. Sci. (2012)
  • S. Coles, An Introduction to Statistical Modeling of Extreme Values (2008)
  • T. Condie et al., MapReduce online, USENIX Symposium on Networked Systems Design and Implementation, NSDI 2010, April 28–30, 2010, San Jose, CA, USA (2010)
  • S. Deep, P. Koutris, The Design of Arbitrage-free Data Pricing Schemes...
  • C. Engle et al., Shark: fast data analysis using coarse-grained distributed memory, ACM SIGMOD International Conference on Management of Data (2012)
  • A.C. Gilbert et al., Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, VLDB (2001)
  • A.C. Gilbert et al., Optimal and approximate computation of summary statistics for range aggregates, Proceedings of the ACM Symposium on Principles of Database Systems (2001)
  • I. Goiri et al., ApproxHadoop: bringing approximations to MapReduce frameworks, ACM SIGPLAN Not. (2015)