Selecting Sources for Query Approximation with Bounded Resources

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 12577)

Abstract

In the big data era, the Web contains a large amount of data extracted from various sources. Exact query answering over a large number of data sources is challenging for two main reasons. First, querying big data sources is costly and sometimes infeasible. Second, because data quality is uneven and sources overlap, querying low-quality sources may return unexpected errors. It is therefore critical to study approximate querying on big data by accessing a bounded portion of the data sources. In this paper, we present an efficient method for selecting sources on big data for approximate querying. Our approach proposes a gain model for source selection that accounts for source overlaps and data quality. Under this model, we formalize source selection as two optimization problems and prove their hardness. Because the problems are NP-hard, we present two approximation algorithms to solve them, with rigorous theoretical guarantees on their performance, and devise a bitwise operation strategy to improve efficiency. Experimental results on both real-world and synthetic data show the high efficiency and scalability of our algorithms.
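The abstract does not spell out the gain model or the two algorithms, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes a simplified setting in which each source covers a set of data items encoded as an integer bitmask, each source has a positive access cost, and the gain of a source is the number of not-yet-covered items it contributes. Data quality is ignored here. The function name `greedy_select`, the source names, the masks, and the budget are all hypothetical.

```python
def greedy_select(sources, budget):
    """Greedy budgeted selection over bitmask-encoded sources.

    sources: dict mapping a source name to (coverage_mask, cost), where
             coverage_mask is a non-negative int whose set bits mark the
             items the source covers, and cost is a positive number.
    budget:  maximum total cost that may be spent.
    Returns (chosen_names, covered_mask).

    Assumption: gain = number of newly covered items. This is a stand-in
    for the paper's gain model, which also accounts for data quality.
    """
    covered = 0          # bitmask of items covered so far
    spent = 0            # total cost of the chosen sources
    chosen = []
    remaining = dict(sources)

    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (mask, cost) in remaining.items():
            if spent + cost > budget:
                continue  # this source no longer fits in the budget
            # Overlap handling with bitwise ops: keep only the bits
            # that are not already covered, then count them.
            gain = bin(mask & ~covered).count("1")
            ratio = gain / cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds new coverage
        mask, cost = remaining.pop(best_name)
        covered |= mask
        spent += cost
        chosen.append(best_name)

    return chosen, covered


# Hypothetical example: three overlapping sources, budget of 4 cost units.
example_sources = {
    "A": (0b001111, 2),
    "B": (0b001100, 1),  # fully contained in A
    "C": (0b110000, 2),
}
chosen, covered = greedy_select(example_sources, budget=4)
```

In this toy run, source B adds nothing once A is chosen because its items are already covered, which is the kind of overlap a gain model has to discount. Encoding coverage as integers keeps the marginal-gain computation to one AND, one NOT, and a population count per candidate.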

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grants 61732003, 61832003, 61972110 and U1811461.

Author information

Corresponding author

Correspondence to Jianzhong Li.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Guo, H., Li, J., Gao, H. (2020). Selecting Sources for Query Approximation with Bounded Resources. In: Wu, W., Zhang, Z. (eds) Combinatorial Optimization and Applications. COCOA 2020. Lecture Notes in Computer Science, vol 12577. Springer, Cham. https://doi.org/10.1007/978-3-030-64843-5_5

  • DOI: https://doi.org/10.1007/978-3-030-64843-5_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64842-8

  • Online ISBN: 978-3-030-64843-5

  • eBook Packages: Computer Science, Computer Science (R0)
