Drawing Density Core-Sets from Incomplete Relational Data

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10178)

Abstract

Incompleteness is a ubiquitous data issue, and it makes answering queries with guaranteed completeness challenging. A density core-set is a subset of an incomplete dataset whose completeness approximates that of the entire dataset. Density core-sets are an effective mechanism for estimating the completeness of queries on incomplete datasets. This paper studies the problems of drawing density core-sets from incomplete relational data. To the best of our knowledge, no such proposal has been made before. (1) We study the problems of drawing density core-sets under different requirements and prove that they are all NP-complete, whether or not functional dependencies are given. (2) We propose an efficient algorithm that draws an approximate density core-set, using an approximate Knapsack algorithm and weighted sampling techniques to select important candidate tuples. (3) Analysis of the proposed algorithm shows that, with high probability, the relative error between the completeness of the approximate density core-set and that of a density core-set of the same size is within a given bound. (4) Experiments on both real-world and synthetic datasets demonstrate the effectiveness and efficiency of the algorithm.
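
To make the quantity in point (3) concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm. The abstract does not give the formal definition of completeness, so the sketch assumes it is the fraction of non-missing attribute values, with `None` marking a missing value. The subset is drawn by plain uniform sampling as a baseline, which stands in for the Knapsack and weighted-sampling selection described in the paper. All function names are hypothetical.

```python
import random
from typing import Optional, Sequence

# ASSUMPTION: a tuple is a sequence of attribute values; None marks a missing value.
Row = Sequence[Optional[object]]


def tuple_completeness(row: Row) -> float:
    """Fraction of non-missing attribute values in one tuple (assumed measure)."""
    if len(row) == 0:
        return 0.0
    return sum(value is not None for value in row) / len(row)


def dataset_completeness(rows: Sequence[Row]) -> float:
    """Mean tuple completeness over the dataset (assumed measure)."""
    if len(rows) == 0:
        return 0.0
    return sum(tuple_completeness(r) for r in rows) / len(rows)


def relative_error(reference: float, estimate: float) -> float:
    """Relative error of `estimate` with respect to a non-zero `reference`."""
    if reference == 0.0:
        raise ValueError("reference completeness is zero")
    return abs(estimate - reference) / reference


def uniform_subset(rows: Sequence[Row], k: int, seed: int = 0) -> list:
    """Baseline only: draw k distinct tuples uniformly at random.

    This is NOT the paper's selection procedure; it is a stand-in so the
    relative error of some size-k subset can be computed.
    """
    rng = random.Random(seed)
    return rng.sample(list(rows), k)


if __name__ == "__main__":
    data = [("a", None), ("b", "c"), (None, None), ("d", "e")]
    subset = uniform_subset(data, k=2)
    err = relative_error(dataset_completeness(data), dataset_completeness(subset))
    print(f"dataset={dataset_completeness(data):.3f} relative error={err:.3f}")
```

Under these assumptions, a subset is a good candidate when `relative_error` between its completeness and the dataset's stays below the chosen bound; the paper's contribution is a selection procedure that achieves this with high probability, which a uniform draw does not guarantee.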

Notes

  1. DBLP data from http://dblp.uni-trier.de/xml/. DBLP is continuously updated; the data set used here was downloaded on July 30, 2016.

  2. http://www.cars.com.

Acknowledgments

This work is supported in part by the Key Research and Development Plan of the National Ministry of Science and Technology under Grant No. 2016YFB1000703, and by the Key Program of the National Natural Science Foundation of China under Grant Nos. 61190115, 61632010, and U1509216.

Corresponding author

Correspondence to Yongnan Liu.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Liu, Y., Li, J., Gao, H. (2017). Drawing Density Core-Sets from Incomplete Relational Data. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_32

  • DOI: https://doi.org/10.1007/978-3-319-55699-4_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-55698-7

  • Online ISBN: 978-3-319-55699-4

  • eBook Packages: Computer Science, Computer Science (R0)
