Progressive Term Frequency Analysis on Large Text Collections

Zhang, Yazhong; Zhang, Hanbing; He, Zhenying; Jing, Yinan; Zhang, Kai; Wang, X. Sean

doi:10.1007/978-3-030-59416-9_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12113)

Included in the conference series: International Conference on Database Systems for Advanced Applications

Abstract

The volume of textual data continues to grow, along with the need for timely and cost-effective analysis, while growth in computing power cannot keep pace. Delays in processing huge text collections can negatively affect user activity and insight. This calls for a paradigm shift from blocking execution to progressive processing. In this paper, we propose a sample-based progressive processing model that focuses on term frequency calculation over text. The model is built on an incremental execution engine and computes a series of approximate results for a single query progressively, providing a smooth trade-off between accuracy and latency. As part of the model, we propose a new variant of the bootstrap technique to quantify result error progressively. We implemented this method in our system, called Parrot, on top of Apache Spark and evaluated its performance on real-world data. Experiments demonstrate that our method is 2.4x–19.7x faster at obtaining a result within 1% error, while the confidence interval always covers the exact results.
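
To make the idea of sample-based progressive term frequency estimation concrete, the following is a minimal single-machine sketch in Python. It is not the Parrot implementation and does not reproduce the bootstrap variant proposed in the paper: it assumes whitespace tokenization, uniform random sampling of documents, and a plain percentile bootstrap over per-document counts. All function names (`tokenize`, `bootstrap_ci`, `progressive_tf`) and parameters are hypothetical and chosen only for illustration.

```python
import random


def tokenize(doc):
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    return doc.lower().split()


def bootstrap_ci(pairs, n_boot=200, alpha=0.05, rng=None):
    """Percentile bootstrap interval for the ratio sum(hits) / sum(tokens).

    `pairs` holds one (term_hits, token_count) tuple per sampled document.
    This is the textbook bootstrap, not the variant described in the paper.
    """
    rng = rng or random.Random(0)
    n = len(pairs)
    estimates = []
    for _ in range(n_boot):
        # Resample documents with replacement and recompute the ratio.
        resample = [pairs[rng.randrange(n)] for _ in range(n)]
        hits = sum(h for h, _ in resample)
        total = sum(t for _, t in resample)
        estimates.append(hits / total if total else 0.0)
    estimates.sort()
    lo = estimates[int((alpha / 2) * (n_boot - 1))]
    hi = estimates[int((1 - alpha / 2) * (n_boot - 1))]
    return lo, hi


def progressive_tf(docs, term, batch_size, seed=0):
    """Yield (docs_seen, estimate, ci_low, ci_high) after each batch.

    Documents are visited in random order, so every prefix is a uniform
    sample and each yielded estimate refines the previous one.
    """
    rng = random.Random(seed)
    order = list(range(len(docs)))
    rng.shuffle(order)
    pairs = []
    for start in range(0, len(order), batch_size):
        for i in order[start:start + batch_size]:
            tokens = tokenize(docs[i])
            pairs.append((tokens.count(term), len(tokens)))
        hits = sum(h for h, _ in pairs)
        total = sum(t for _, t in pairs)
        estimate = hits / total if total else 0.0
        lo, hi = bootstrap_ci(pairs, rng=rng)
        yield len(pairs), estimate, lo, hi


if __name__ == "__main__":
    corpus = ["a b a", "b b c", "a c c", "a a a"] * 25
    for seen, est, lo, hi in progressive_tf(corpus, "a", batch_size=20):
        print(f"{seen:4d} docs  tf={est:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

A caller can stop consuming the generator as soon as the interval is narrow enough for its purposes, which is the accuracy and latency trade-off the abstract describes. Once every document has been processed, the estimate equals the exact term frequency.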

Notes

1. http://www.internetlivestats.com/twitter-statistics/
2. https://stanfordnlp.github.io/CoreNLP/
3. https://github.com/fxsjy/jieba
4. Precisely, at least min(k, number of documents for that distinct value).
5. https://trec.nist.gov/data/reuters/reuters.html
6. https://webhose.io/free-datasets/english-news-articles/
7. For convenience, we use English words with the same meaning in the paper.

Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2018YFB1004404 and No. 2018YFB1402600), the NSFC (No. 61732004 and No. 61802066) and the Shanghai Sailing Program (No. 18YF1401300).

Author information

Authors and Affiliations

School of Software, Fudan University, Shanghai, China
Yazhong Zhang & X. Sean Wang
School of Computer Science, Fudan University, Shanghai, China
Hanbing Zhang, Zhenying He, Yinan Jing, Kai Zhang & X. Sean Wang
Shanghai Key Laboratory of Data Science, Shanghai, China
Yazhong Zhang, Hanbing Zhang, Zhenying He, Yinan Jing, Kai Zhang & X. Sean Wang
Shanghai Institute of Intelligent Electronics and Systems, Shanghai, China
X. Sean Wang

Corresponding authors

Correspondence to Zhenying He, Yinan Jing or X. Sean Wang.

Editor information

Editors and Affiliations

Dankook University, Yongin, Korea (Republic of)
Yunmook Nah
Peking University, Haidian, China
Bin Cui
Sungkyunkwan University, Suwon, Korea (Republic of)
Sang-Won Lee
Department of System Engineering and Engineering Management, The Chinese University of Hong Kong, Hong Kong, Hong Kong
Jeffrey Xu Yu
Kangwon National University, Chunchon, Korea (Republic of)
Yang-Sae Moon
Korea Advanced Institute of Science and Technology, Daejeon, Korea (Republic of)
Steven Euijong Whang

About this paper

Cite this paper

Zhang, Y., Zhang, H., He, Z., Jing, Y., Zhang, K., Wang, X.S. (2020). Progressive Term Frequency Analysis on Large Text Collections. In: Nah, Y., Cui, B., Lee, S.-W., Yu, J.X., Moon, Y.-S., Whang, S.E. (eds) Database Systems for Advanced Applications. DASFAA 2020. Lecture Notes in Computer Science, vol 12113. Springer, Cham. https://doi.org/10.1007/978-3-030-59416-9_10

DOI: https://doi.org/10.1007/978-3-030-59416-9_10
Published: 22 September 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59415-2
Online ISBN: 978-3-030-59416-9
eBook Packages: Computer Science, Computer Science (R0)
