Drawing Density Core-Sets from Incomplete Relational Data

Liu, Yongnan; Li, Jianzhong; Gao, Hong

doi:10.1007/978-3-319-55699-4_32

Conference paper
First Online: 22 March 2017

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10178)

Abstract

Incompleteness is ubiquitous in data and makes it challenging to answer queries with guaranteed completeness. A density core-set is a subset of an incomplete dataset whose completeness approximates that of the entire dataset. Density core-sets are an effective mechanism for estimating the completeness of queries on incomplete datasets. This paper studies the problems of drawing density core-sets from incomplete relational data. To the best of our knowledge, no such proposal has been made before. (1) We study the problems of drawing density core-sets under different requirements and prove that they are all NP-complete, whether or not functional dependencies are given. (2) We propose an efficient approximate algorithm for drawing an approximate density core-set, which employs an approximate Knapsack algorithm and weighted sampling techniques to select important candidate tuples. (3) Analysis of the proposed algorithm shows that, with high probability, the relative error between the completeness of the approximate density core-set and that of a density core-set of the same size is within a given relative error bound. (4) Experiments on both real-world and synthetic datasets demonstrate the effectiveness and efficiency of the algorithm.
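
The abstract does not give the paper's formal definitions, so the following is only a rough sketch of the quantities it mentions, not the authors' algorithm. It assumes a simplified completeness measure (the fraction of non-missing cells), uses a standard weighted sampling scheme without replacement (Efraimidis-Spirakis keys) in place of the paper's candidate selection, and omits the Knapsack step entirely. The names `completeness`, `weighted_sample`, `relative_error`, and the example weights are all hypothetical.

```python
import random


def completeness(rows):
    """Fraction of non-missing (non-None) cells.

    Assumption: a simplified stand-in for the paper's completeness measure.
    """
    cells = [value for row in rows for value in row]
    if not cells:
        return 0.0
    return sum(value is not None for value in cells) / len(cells)


def weighted_sample(rows, weights, k, seed=0):
    """Draw up to k rows without replacement, favoring larger weights.

    Uses Efraimidis-Spirakis keys u ** (1 / w); rows with weight <= 0
    are never selected.
    """
    rng = random.Random(seed)
    keyed = [
        (rng.random() ** (1.0 / w), i)
        for i, w in enumerate(weights)
        if w > 0
    ]
    top = sorted(keyed, reverse=True)[:k]
    return [rows[i] for _, i in top]


def relative_error(estimate, truth):
    """Relative error |estimate - truth| / truth; assumes truth > 0."""
    return abs(estimate - truth) / truth


# Toy relation with missing values encoded as None.
data = [
    ("a", 1, None),
    ("b", None, None),
    ("c", 3, "x"),
    ("d", 4, None),
]
# Arbitrary illustrative importance weights, one per tuple.
weights = [1.0, 1.0, 2.0, 1.0]

subset = weighted_sample(data, weights, k=2)
error = relative_error(completeness(subset), completeness(data))
```

Here `error` plays the role of the relative error that the abstract says is bounded with high probability; on a four-tuple toy relation it says nothing about the actual guarantee.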

Notes

1. DBLP data from http://dblp.uni-trier.de/xml/. Because DBLP is updated continuously, note that the data set was downloaded on July 30, 2016.
2. http://www.cars.com.

Acknowledgments

This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010, and U1509216.

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, China
Yongnan Liu, Jianzhong Li & Hong Gao

Corresponding author

Correspondence to Yongnan Liu.

Editor information

Editors and Affiliations

Arizona State University, Tempe - Phoenix, Arizona, USA
Selçuk Candan
Hong Kong University of Science and Technology, Hong Kong, China
Lei Chen
Aalborg University, Aalborg, Denmark
Torben Bach Pedersen
University of New South Wales, Sydney, New South Wales, Australia
Lijun Chang
The University of Queensland, Brisbane, Queensland, Australia
Wen Hua

About this paper

Cite this paper

Liu, Y., Li, J., Gao, H. (2017). Drawing Density Core-Sets from Incomplete Relational Data. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_32

DOI: https://doi.org/10.1007/978-3-319-55699-4_32
Published: 22 March 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55698-7
Online ISBN: 978-3-319-55699-4
eBook Packages: Computer Science, Computer Science (R0)
