Abstract
Exact querying over big data is challenging because of the large number of autonomous data sources. This paper proposes an efficient method for selecting sources on big data for approximate querying. A gain model for source selection is presented that accounts for the information coverage and the quality provided by the sources. Under this model, source selection is formalized as two optimization problems. Because both problems are NP-hard, two approximation algorithms are devised, one for each problem, and their approximation ratios and complexities are analyzed. To further improve efficiency, a randomized method is developed for gain estimation. With it, the time complexities of the improved algorithms are sub-linear in the number of data items. Experimental results show the high efficiency and scalability of the proposed algorithms.
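As a rough illustration of the ideas summarized above, the following sketch combines a greedy selection loop with a sampling-based gain estimate. It is not the paper's algorithm: the gain function (each covered item counted once, weighted by the best quality among the selected sources), the cardinality limit `k`, and all function and variable names are assumptions made for this example. For a monotone submodular gain such as this one, greedy selection under a cardinality constraint is known to achieve a (1 - 1/e) approximation (Nemhauser et al. 1978).

```python
import random


def coverage_gain(selected, sources, quality):
    """Hypothetical gain: every covered item counts once, weighted by the
    best quality among the selected sources that provide it."""
    best = {}
    for s in selected:
        for item in sources[s]:
            best[item] = max(best.get(item, 0.0), quality[s])
    return sum(best.values())


def greedy_select(sources, quality, k):
    """Pick at most k sources, each round adding the source with the
    largest positive marginal gain (ties broken by source name)."""
    selected, current = [], 0.0
    for _ in range(k):
        best_s, best_delta = None, 0.0
        for s in sorted(sources):
            if s in selected:
                continue
            delta = coverage_gain(selected + [s], sources, quality) - current
            if delta > best_delta:
                best_s, best_delta = s, delta
        if best_s is None:  # no remaining source improves the gain
            break
        selected.append(best_s)
        current += best_delta
    return selected, current


def sampled_gain(selected, sources, quality, universe, n_samples, rng):
    """Estimate the same gain from a uniform sample of items, so the cost
    depends on n_samples rather than on the total number of items."""
    total = 0.0
    for _ in range(n_samples):
        item = rng.choice(universe)
        total += max(
            (quality[s] for s in selected if item in sources[s]),
            default=0.0,
        )
    return total * len(universe) / n_samples


# Toy data (assumed): three sources over items 1..4.
sources = {"A": {1, 2, 3}, "B": {3, 4}, "C": {1, 2}}
quality = {"A": 1.0, "B": 1.0, "C": 0.5}
chosen, gain = greedy_select(sources, quality, k=2)
estimate = sampled_gain(chosen, sources, quality, [1, 2, 3, 4], 50, random.Random(0))
```

In this sketch the sampled estimate is an unbiased estimator of the exact gain. How many samples are needed for a given error bound depends on the gain model and is not addressed here.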
References
Codd EF et al (1972) Relational completeness of data base sublanguages
Dong XL, Saha B, Srivastava D (2012) Less is more: selecting sources wisely for integration. Proc VLDB Endow 6(2):37–48
Guo H, Li J, Gao H (2020) Selecting sources for query approximation with bounded resources. In: International conference on combinatorial optimization and applications. Springer, pp. 61–75
Karp RM, Luby M, Madras N (1989) Monte-Carlo approximation algorithms for enumeration problems. J Algorithms 10(3):429–448
Khuller S, Moss A, Naor JS (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1):39–45
Li L, Feng X, Shao H, Li J (2018) Source selection for inconsistency detection. In: International conference on database systems for advanced applications. Springer, pp. 370–385
Lin Y, Wang H, Li J, Gao H (2019) Data source selection for information integration in big data era. Inf Sci 479:197–213
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions-I. Math Program 14(1):265–294
Rekatsinas T, Dong XL, Srivastava D (2014) Characterizing and selecting fresh data sources. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data. pp. 919–930
Salloum M, Dong XL, Srivastava D, Tsotras VJ (2013) Online ordering of overlapping data sources. Proc VLDB Endow 7(3):133–144
Shachnai H, Tamir T (2018) Polynomial time approximation schemes. In: Handbook of approximation algorithms and Metaheuristics, 2nd edn. Chapman and Hall/CRC, pp. 125–156
Sun J, Li J, Gao H, Wang H (2018) Truth discovery on inconsistent relational data. Tsinghua Sci Technol 23(3):288–302
Sun L, Hong Z, Pixing Z (2000) A randomized algorithm for the union of sets problem. J Softw 11(12):1587–1593
Funding
This study was funded by the National Natural Science Foundation of China under grants 61732003 and 61832003.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
A preliminary version of this paper appeared in the Proceedings of the 14th International Conference on Combinatorial Optimization and Applications, pp. 61–75, 2020 (Guo et al. 2020).
About this article
Cite this article
Guo, H., Li, J. & Gao, H. Data source selection for approximate query. J Comb Optim 44, 2443–2459 (2022). https://doi.org/10.1007/s10878-021-00760-y
DOI: https://doi.org/10.1007/s10878-021-00760-y