Abstract
The volume of textual data continues to grow, along with the need for timely and cost-effective analysis, while growth in computing power cannot keep pace. Delays in processing huge text collections can negatively impact user activity and insight. This calls for a paradigm shift from blocking to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation over text. The model is built on an incremental execution engine and computes a series of approximate results for a single query progressively, providing a smooth trade-off between accuracy and latency. As part of the model, we propose a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system, called Parrot, on top of Apache Spark and tested its performance on real-world data. Experiments demonstrate that our method is 2.4x–19.7x faster at obtaining a result within 1% error, while the confidence interval always covers the exact results well.
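To make the idea of progressive, sample-based term frequency estimation concrete, the following is a minimal single-machine sketch. It is not the Parrot implementation and does not use the bootstrap variant proposed in the paper: it assumes a plain percentile bootstrap, whitespace tokenization, and an in-memory list of documents. The function name `progressive_term_frequency` and its parameters (`batch_size`, `n_boot`, `seed`) are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def progressive_term_frequency(docs, term, batch_size=100, n_boot=200, seed=0):
    """Yield (docs_seen, estimate, ci_low, ci_high) after each batch.

    The estimate is the mean number of occurrences of `term` per document,
    computed over a growing uniform random sample of `docs`.
    Assumption: a standard percentile bootstrap, not the paper's variant.
    """
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)  # a random order makes every prefix a uniform sample
    counts = []  # per-document counts of `term` for the documents seen so far
    for start in range(0, len(order), batch_size):
        for i in order[start:start + batch_size]:
            # Naive tokenization (assumption): lowercase and split on whitespace.
            counts.append(Counter(docs[i].lower().split())[term])
        n = len(counts)
        estimate = sum(counts) / n
        # Resample the seen counts with replacement to approximate the
        # sampling distribution of the mean, then take a 95% interval.
        means = sorted(sum(rng.choices(counts, k=n)) / n for _ in range(n_boot))
        ci_low = means[int(0.025 * (n_boot - 1))]
        ci_high = means[int(0.975 * (n_boot - 1))]
        yield n, estimate, ci_low, ci_high
```

A caller can iterate over the generator and stop as soon as the interval is narrow enough, which is the accuracy-for-latency trade-off described above. Once every document has been processed, the estimate equals the exact mean.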
Notes
- 4. Precisely, at least min(k, number of documents for that distinct value).
- 7. For convenience, we use English words with the same meaning in the paper.
Acknowledgements
This work is supported by the National Key R&D Program of China (No. 2018YFB1004404 and No. 2018YFB1402600), the NSFC (No. 61732004 and No. 61802066) and the Shanghai Sailing Program (No. 18YF1401300).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, Y., Zhang, H., He, Z., Jing, Y., Zhang, K., Wang, X.S. (2020). Progressive Term Frequency Analysis on Large Text Collections. In: Nah, Y., Cui, B., Lee, SW., Yu, J.X., Moon, YS., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12113. Springer, Cham. https://doi.org/10.1007/978-3-030-59416-9_10
DOI: https://doi.org/10.1007/978-3-030-59416-9_10
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59415-2
Online ISBN: 978-3-030-59416-9
eBook Packages: Computer Science, Computer Science (R0)