Abstract
Incompleteness is ubiquitous in data and makes it challenging to answer queries with guaranteed completeness. A density core-set is a subset of an incomplete dataset whose completeness approximates that of the entire dataset. Density core-sets are an effective mechanism for estimating the completeness of queries on incomplete datasets. This paper studies the problem of drawing density core-sets from incomplete relational data. To the best of our knowledge, no such proposal has been made before. (1) We study the problems of drawing density core-sets under different requirements and prove that they are all NP-complete, whether or not functional dependencies are given. (2) We propose an efficient approximate algorithm for drawing an approximate density core-set, which employs an approximate Knapsack algorithm and weighted sampling techniques to select important candidate tuples. (3) Analysis of the proposed algorithm shows that, with high probability, the relative error between the completeness of the approximate density core-set and that of a density core-set of the same size is within a given relative error bound. (4) Experiments on both real-world and synthetic datasets demonstrate the effectiveness and efficiency of the algorithm.
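To make the notion of "a subset whose completeness approximates that of the entire dataset" concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a simplified completeness measure (the fraction of non-missing cells), which the abstract does not define, and it uses plain uniform sampling as a stand-in for the Knapsack-based and weighted-sampling selection. The names `completeness`, `relative_error`, and `draw_subset` are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple

# A tuple of a relation; None marks a missing attribute value.
Row = Tuple[Optional[object], ...]


def completeness(rows: Sequence[Row]) -> float:
    """Fraction of non-missing cells (an ASSUMED, simplified measure)."""
    total_cells = sum(len(row) for row in rows)
    if total_cells == 0:
        return 0.0
    filled_cells = sum(1 for row in rows for value in row if value is not None)
    return filled_cells / total_cells


def relative_error(estimate: float, truth: float) -> float:
    """Relative error |estimate - truth| / truth, with a guard for truth == 0."""
    if truth == 0:
        return 0.0 if estimate == 0 else float("inf")
    return abs(estimate - truth) / truth


def draw_subset(rows: Sequence[Row], k: int, seed: int = 0) -> list:
    """Uniformly sample k tuples without replacement.

    This is only a placeholder for the candidate-selection step; the
    paper's selection strategy is not reproduced here.
    """
    rng = random.Random(seed)
    return rng.sample(list(rows), k)


# Toy incomplete relation with three attributes (illustrative data only).
relation = [
    ("a1", "b1", 1),
    ("a2", None, 2),
    (None, None, 3),
    ("a4", "b4", None),
]

subset = draw_subset(relation, k=2)
error = relative_error(completeness(subset), completeness(relation))
print(f"full: {completeness(relation):.3f}, subset: {completeness(subset):.3f}, "
      f"relative error: {error:.3f}")
```

Under these assumptions, a subset would count as a good candidate when `error` falls below a chosen bound; how such a subset is found efficiently is the subject of the paper itself.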
Notes
- 1. DBLP data from http://dblp.uni-trier.de/xml/. DBLP is updated continually; the data set was downloaded on July 30, 2016.
- 2.
Acknowledgments
This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010, and U1509216.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Liu, Y., Li, J., Gao, H. (2017). Drawing Density Core-Sets from Incomplete Relational Data. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_32
DOI: https://doi.org/10.1007/978-3-319-55699-4_32
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55698-7
Online ISBN: 978-3-319-55699-4
eBook Packages: Computer Science, Computer Science (R0)