Data source selection for approximate query

Guo, Hongjie; Li, Jianzhong; Gao, Hong

doi:10.1007/s10878-021-00760-y

Published: 24 May 2021

Volume 44, pages 2443–2459, (2022)

Journal of Combinatorial Optimization

Abstract

Exact query processing on big data is challenging because of the large number of autonomous data sources. In this paper, an efficient method is proposed for selecting sources on big data for approximate query processing. A gain model for source selection is presented that considers the information coverage and the quality provided by the sources. Under this model, the source selection problem is formalized as two optimization problems. Because both problems are NP-hard, two approximation algorithms are devised to solve them, and their approximation ratios and complexities are analyzed. To further improve efficiency, a randomized method is developed for gain estimation. With this method, the time complexities of the improved algorithms are sub-linear in the number of data items. Experimental results show the high efficiency and scalability of the proposed algorithms.
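The abstract gives no formulas, so the following is only a minimal sketch of the general idea, not the authors' algorithms or gain model. It assumes that gain is plain item coverage (quality is ignored), that each source has a positive cost, and that selection is limited by a budget. The names `greedy_select` and `estimate_coverage`, the cost and budget parameters, and the toy data are all hypothetical. The first function is a cost-ratio greedy heuristic in the spirit of budgeted maximum coverage; the second replaces exact coverage counting with uniform sampling, so its running time depends on the sample size rather than on the number of data items.

```python
import random


def greedy_select(sources, costs, budget):
    """Greedy heuristic: repeatedly add the affordable source with the best
    marginal coverage per unit cost. Assumption: gain = number of newly
    covered items; the paper's gain model also accounts for quality."""
    selected, covered, spent = [], set(), 0.0
    remaining = set(sources)
    while remaining:
        best, best_ratio = None, 0.0
        for name in sorted(remaining):  # sorted for deterministic tie-breaking
            if spent + costs[name] > budget:
                continue  # this source no longer fits in the budget
            ratio = len(sources[name] - covered) / costs[name]
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:  # nothing affordable adds new items
            break
        selected.append(best)
        covered |= sources[best]
        spent += costs[best]
        remaining.remove(best)
    return selected, covered


def estimate_coverage(selected, sources, universe, samples, rng):
    """Estimate how many items of `universe` the selected sources cover by
    testing `samples` items drawn uniformly at random with replacement.
    A simple stand-in estimator, not the one analyzed in the paper."""
    hits = 0
    for _ in range(samples):
        item = rng.choice(universe)
        if any(item in sources[name] for name in selected):
            hits += 1
    return len(universe) * hits / samples


# Hypothetical toy instance: four sources over six data items.
sources = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {6}, "D": {1, 2, 3, 4, 5, 6}}
costs = {"A": 2, "B": 1, "C": 1, "D": 5}
chosen, covered = greedy_select(sources, costs, budget=4)
estimate = estimate_coverage(chosen, sources, [1, 2, 3, 4, 5, 6], 100, random.Random(0))
print(chosen, len(covered), estimate)
```

In this sketch the sampling estimator trades accuracy for speed: more samples tighten the estimate, and the cost of one estimate no longer grows with the size of the item universe.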

References

Codd EF et al (1972) Relational completeness of data base sublanguages
Dong XL, Saha B, Srivastava D (2012) Less is more: selecting sources wisely for integration. Proc VLDB Endow 6(2):37–48
Guo H, Li J, Gao H (2020) Selecting sources for query approximation with bounded resources. In: International conference on combinatorial optimization and applications. Springer, pp 61–75
Karp RM, Luby M, Madras N (1989) Monte-Carlo approximation algorithms for enumeration problems. J Algorithms 10(3):429–448
Khuller S, Moss A, Naor JS (1999) The budgeted maximum coverage problem. Inf Process Lett 70(1):39–45
Li L, Feng X, Shao H, Li J (2018) Source selection for inconsistency detection. In: International conference on database systems for advanced applications. Springer, pp 370–385
Lin Y, Wang H, Li J, Gao H (2019) Data source selection for information integration in big data era. Info Sci 479:197–213
Motwani R, Raghavan P (1995) Randomized algorithms. Cambridge University Press, Cambridge
Nemhauser GL, Wolsey LA, Fisher ML (1978) An analysis of approximations for maximizing submodular set functions-I. Math Program 14(1):265–294
Rekatsinas T, Dong XL, Srivastava D (2014) Characterizing and selecting fresh data sources. In: Proceedings of the 2014 ACM SIGMOD international conference on management of data. pp 919–930
Salloum M, Dong XL, Srivastava D, Tsotras VJ (2013) Online ordering of overlapping data sources. Proc VLDB Endow 7(3):133–144
Shachnai H, Tamir T (2018) Polynomial time approximation schemes. In: Handbook of approximation algorithms and metaheuristics, 2nd edn. Chapman and Hall/CRC, pp 125–156
Sun J, Li J, Gao H, Wang H (2018) Truth discovery on inconsistent relational data. Tsinghua Sci Technol 23(3):288–302
Sun L, Hong Z, Pixing Z (2000) A randomized algorithm for the union of sets problem. J Softw 11(12):1587–1593

Funding

This study was funded by the National Natural Science Foundation of China under grants 61732003 and 61832003.

Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, China
Hongjie Guo, Jianzhong Li & Hong Gao

Corresponding author

Correspondence to Hong Gao.

Ethics declarations

Conflict of Interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper appeared in the Proceedings of the 14th International Conference on Combinatorial Optimization and Applications, pp 61–75, 2020 (Guo et al. 2020).

About this article

Cite this article

Guo, H., Li, J. & Gao, H. Data source selection for approximate query. J Comb Optim 44, 2443–2459 (2022). https://doi.org/10.1007/s10878-021-00760-y

Accepted: 15 May 2021
Published: 24 May 2021
Issue Date: November 2022
DOI: https://doi.org/10.1007/s10878-021-00760-y
