Abstract
In the big data era, the Web contains a large amount of data extracted from various sources. Exact query answering over a large number of data sources is challenging for two main reasons. First, querying all of the sources is costly and may even be infeasible. Second, because data quality is uneven and sources overlap, querying low-quality sources may return unexpected errors. Thus, it is critical to study approximate query answering on big data by accessing only a bounded amount of the data sources. In this paper, we present an efficient method for selecting sources on big data for approximate querying. Our approach proposes a gain model for source selection that accounts for source overlaps and data quality. Under the proposed model, we formalize source selection as two optimization problems and prove their hardness. Because these problems are NP-hard, we present two approximation algorithms to solve them, with rigorous theoretical guarantees on their performance, and devise a bitwise operation strategy to improve efficiency. Experimental results on both real-world and synthetic data show the high efficiency and scalability of our algorithms.
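The abstract does not give the gain model or the algorithms themselves, so the following is only a minimal illustrative sketch of the general idea, not the paper's method. It assumes a hypothetical gain defined as quality times the number of newly covered items, a hypothetical per-source access cost, and a plain cost-benefit greedy rule. Each source's content is stored as an integer bitmask so that overlap with already selected sources is computed with bitwise operations. All names (`Source`, `marginal_gain`, `greedy_select`) are made up for illustration, and no approximation guarantee is claimed for this simplified rule.

```python
from typing import List, NamedTuple, Tuple


class Source(NamedTuple):
    name: str
    mask: int       # bit i is set if the source contains item i (assumed encoding)
    quality: float  # assumed accuracy of the source, in [0, 1]
    cost: float     # assumed access cost, must be > 0


def marginal_gain(src: Source, covered: int) -> float:
    """Hypothetical gain: quality times the number of items not yet covered."""
    new_items = src.mask & ~covered          # bitwise overlap removal
    return src.quality * bin(new_items).count("1")


def greedy_select(sources: List[Source], budget: float) -> Tuple[List[str], int]:
    """Repeatedly pick the affordable source with the best gain per unit cost."""
    covered, spent = 0, 0.0
    chosen: List[str] = []
    remaining = list(sources)
    while remaining:
        affordable = [s for s in remaining if spent + s.cost <= budget]
        if not affordable:
            break
        best = max(affordable, key=lambda s: marginal_gain(s, covered) / s.cost)
        if marginal_gain(best, covered) <= 0:
            break                            # nothing new can be gained
        chosen.append(best.name)
        covered |= best.mask                 # merge coverage with a bitwise OR
        spent += best.cost
        remaining.remove(best)
    return chosen, covered


# Toy input: B is fully contained in A, so it adds nothing once A is chosen.
example_sources = [
    Source("A", 0b001111, 0.9, 2.0),
    Source("B", 0b001100, 0.5, 1.0),
    Source("C", 0b110000, 0.8, 1.0),
]
chosen, covered = greedy_select(example_sources, budget=3.0)
print(chosen, bin(covered))
```

On this toy input the overlap check is what steers the second pick away from B and toward C: after A is selected, B's mask contributes no new bits, so its marginal gain drops to zero.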
Acknowledgments
This work was supported by the National Natural Science Foundation of China under grants 61732003, 61832003, 61972110 and U1811461.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Guo, H., Li, J., Gao, H. (2020). Selecting Sources for Query Approximation with Bounded Resources. In: Wu, W., Zhang, Z. (eds) Combinatorial Optimization and Applications. COCOA 2020. Lecture Notes in Computer Science, vol 12577. Springer, Cham. https://doi.org/10.1007/978-3-030-64843-5_5
DOI: https://doi.org/10.1007/978-3-030-64843-5_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64842-8
Online ISBN: 978-3-030-64843-5
eBook Packages: Computer Science, Computer Science (R0)