
Data source selection for approximate query

Journal of Combinatorial Optimization

Abstract

Exact query processing on big data is challenging because of the large number of autonomous data sources. In this paper, an efficient method is proposed for selecting sources on big data for approximate query. A gain model is presented for source selection that considers both the information coverage and the quality provided by sources. Under this model, the source selection problem is formalized as two optimization problems. Because the proposed problems are NP-hard, two approximation algorithms are devised to solve them respectively, and their approximation ratios and complexities are analyzed. To further improve efficiency, a randomized method is developed for gain estimation. With it, the time complexities of the improved algorithms are sublinear in the number of data items. Experimental results show the high efficiency and scalability of the proposed algorithms.
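The abstract does not spell out the gain model or the algorithms, so the following is only a minimal sketch of the general idea, under stated assumptions: gain is reduced to plain item coverage (quality is ignored), each source has a hypothetical cost, and selection uses a cost-ratio greedy in the spirit of budgeted maximum coverage. The function names `coverage`, `greedy_select`, and `estimate_coverage` are illustrative and do not come from the paper. The plain ratio greedy shown here is a heuristic, and no approximation guarantee is claimed for it.

```python
import random


def coverage(selected, sources):
    """Count the distinct data items covered by the selected sources."""
    covered = set()
    for name in selected:
        covered |= sources[name]
    return len(covered)


def greedy_select(sources, costs, budget):
    """Repeatedly pick the affordable source with the best marginal
    coverage per unit cost (assumed stand-in for the paper's gain)."""
    selected, covered, spent = [], set(), 0.0
    remaining = set(sources)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic ties
            if spent + costs[name] > budget:
                continue  # source does not fit in the remaining budget
            ratio = len(sources[name] - covered) / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            break  # nothing affordable adds new items
        selected.append(best)
        covered |= sources[best]
        spent += costs[best]
        remaining.remove(best)
    return selected


def estimate_coverage(selected, sources, universe, samples, rng):
    """Estimate the covered fraction by sampling items uniformly,
    so the work depends on `samples` rather than on the universe size."""
    chosen = [sources[name] for name in selected]
    hits = 0
    for _ in range(samples):
        item = rng.choice(universe)
        if any(item in s for s in chosen):
            hits += 1
    return hits / samples


if __name__ == "__main__":
    # Hypothetical toy data: four sources over items 1..6.
    sources = {"s1": {1, 2, 3, 4}, "s2": {3, 4, 5}, "s3": {6},
               "s4": {1, 2, 3, 4, 5, 6}}
    costs = {"s1": 2.0, "s2": 1.0, "s3": 1.0, "s4": 5.0}
    picked = greedy_select(sources, costs, budget=4.0)
    print(picked, coverage(picked, sources))
    print(estimate_coverage(picked, sources, list(range(1, 7)), 200,
                            random.Random(0)))
```

The sampling step illustrates why a randomized estimate can avoid a full scan: its cost is governed by the number of samples, not by the number of data items. How the paper's estimator actually achieves its sublinear bound is not stated in the abstract.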

References

  • Codd EF et al (1972) Relational completeness of data base sublanguages

  • Dong XL, Saha B, Srivastava D (2012) Less is more: selecting sources wisely for integration. Proc VLDB Endow 6(2):37–48

  • Guo H, Li J, Gao H (2020) Selecting sources for query approximation with bounded resources. In: International conference on combinatorial optimization and applications. Springer, pp 61–75

  • Karp RM, Luby M, Madras N (1989) Monte-Carlo approximation algorithms for enumeration problems. J Algorithms 10(3):429–448

  • Khuller S, Moss A, Naor JS (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1):39–45

  • Li L, Feng X, Shao H, Li J (2018) Source selection for inconsistency detection. In: International conference on database systems for advanced applications. Springer, pp 370–385

  • Lin Y, Wang H, Li J, Gao H (2019) Data source selection for information integration in big data era. Inf Sci 479:197–213

  • Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge

  • Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions-I. Math Program 14(1):265–294

  • Rekatsinas T, Dong XL, Srivastava D (2014) Characterizing and selecting fresh data sources. In: Proceedings of the 2014 ACM SIGMOD international conference on Management of data. pp 919–930

  • Salloum M, Dong XL, Srivastava D, Tsotras VJ (2013) Online ordering of overlapping data sources. Proc VLDB Endow 7(3):133–144

  • Shachnai H, Tamir T (2018) Polynomial time approximation schemes. In: Handbook of approximation algorithms and metaheuristics, 2nd edn. Chapman and Hall/CRC, pp 125–156

  • Sun J, Li J, Gao H, Wang H (2018) Truth discovery on inconsistent relational data. Tsinghua Sci Technol 23(3):288–302

  • Sun L, Hong Z, Pixing Z (2000) A randomized algorithm for the union of sets problem. J Softw 11(12):1587–1593

Funding

This study was funded by the National Natural Science Foundation of China under grants 61732003 and 61832003.

Author information

Corresponding author

Correspondence to Hong Gao.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper appeared in the Proceedings of the 14th International Conference on Combinatorial Optimization and Applications, pp 61–75, 2020 (Guo et al. 2020).

About this article

Cite this article

Guo, H., Li, J. & Gao, H. Data source selection for approximate query. J Comb Optim 44, 2443–2459 (2022). https://doi.org/10.1007/s10878-021-00760-y

  • DOI: https://doi.org/10.1007/s10878-021-00760-y
