ABSTRACT
In many analytic settings, join operations are fundamental because data is dispersed across different data sets (SQL or NoSQL tables, .csv files recording logs, click streams, KPIs from system/network monitoring, IoT telemetry, etc.). However, in the era of big data, the join operation can become exorbitantly expensive in terms of execution time and/or memory/space footprint.
Index Terms
- XLJoins