Zebra: A novel method for optimizing text classification query in overload scenario

Yu, Tianhuan; He, Zhenying; Yang, Zhihui; Ye, Fei; Fan, Yuankai; Jing, Yinan; Zhang, Kai; Wang, X. Sean

doi:10.1007/s11280-022-01061-y

Published: 02 June 2022

World Wide Web, Volume 26, pages 905–931 (2023)

Tianhuan Yu¹,
Zhenying He²,
Zhihui Yang³,
Fei Ye²,
Yuankai Fan²,
Yinan Jing²,
Kai Zhang² &
X. Sean Wang²

Abstract

Text classification is a crucial task in text mining, and it can be embedded in queries through user-defined functions (UDFs). In many web applications, such as Twitter mining or real-time Weibo processing, the volume of incoming text can be large enough to overload the system. In a streaming scenario, the resulting query delays degrade the user experience. This paper focuses on queries with text classification over streaming data. We propose a novel method called Zebra, which uses progressive pipelines to optimize queries under overload. The core module of Zebra is a probabilistic filter that discards a large amount of text data based on the semantic information of the query predicate. We train weak classifiers as filters using data labeled by brute-force pipelines. We then use a parameter search method to choose a suitable filter with the best settings and apply it to the progressive pipelines. Experiments with several text workloads on real-world datasets show that Zebra consistently achieves higher accuracy while answering queries in time.
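
The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which weak classifier Zebra uses or how its parameter search works, so a tiny multinomial Naive Bayes model and a recall-driven threshold search are assumed here. All names (`NaiveBayesFilter`, `choose_threshold`, `filtered_query`, `slow_sports_classifier`) and the toy data are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayesFilter:
    """Tiny multinomial Naive Bayes used as a cheap pre-filter (assumed model)."""

    def fit(self, texts, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.docs = Counter(labels)
        for text, label in zip(texts, labels):
            self.counts[label].update(tokenize(text))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        return self

    def score(self, text):
        """Return P(label = 1 | text) with add-one smoothing."""
        n_docs = sum(self.docs.values())
        log_p = {}
        for c in (0, 1):
            total = sum(self.counts[c].values()) + len(self.vocab)
            log_p[c] = math.log((self.docs[c] + 1) / (n_docs + 2))
            for tok in tokenize(text):
                if tok in self.vocab:  # unseen tokens are ignored
                    log_p[c] += math.log((self.counts[c][tok] + 1) / total)
        m = max(log_p.values())
        e0, e1 = math.exp(log_p[0] - m), math.exp(log_p[1] - m)
        return e1 / (e0 + e1)


def choose_threshold(filt, texts, labels, target_recall=0.95):
    """Pick the largest threshold whose recall on labeled data meets the target."""
    scores = [filt.score(t) for t in texts]
    positives = sum(labels)
    best = 0.0
    for thr in sorted(set(scores)):
        kept = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
        if positives == 0 or kept / positives >= target_recall:
            best = thr
    return best


def filtered_query(stream, filt, threshold, expensive_udf):
    """Call the expensive classifier only on texts that pass the cheap filter."""
    return [t for t in stream if filt.score(t) >= threshold and expensive_udf(t)]


# Stand-in for an expensive classification UDF; it counts its own invocations.
calls = 0


def slow_sports_classifier(text):
    global calls
    calls += 1
    return any(w in tokenize(text) for w in ("goal", "match", "league"))


train = [
    "great goal in the match tonight",
    "the match ended with a late goal",
    "team wins the league match",
    "new phone released today",
    "stock market falls again",
    "rain expected this weekend",
]
# Labels come from running the expensive classifier on everything (brute force).
labels = [int(slow_sports_classifier(t)) for t in train]
filt = NaiveBayesFilter().fit(train, labels)
threshold = choose_threshold(filt, train, labels, target_recall=1.0)

calls = 0
hits = filtered_query(train, filt, threshold, slow_sports_classifier)
print(hits, calls)
```

In this toy setup the filter discards low-scoring texts before the expensive classifier sees them, so the number of expensive calls drops. The price is that a text scored below the threshold is lost even if the expensive classifier would have accepted it, which is why the threshold is chosen against a recall target.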

Notes

https://blog.hootsuite.com/twitter-statistics/
AllenNLP library: https://allennlp.org/

Acknowledgements

This work was mainly supported by the National Natural Science Foundation of China under Grant Nos. 61732004 and 62072113. This work was also supported by the Research Projects of Zhejiang Lab (No. 2021PE0AC01).

We are also grateful for the insightful comments offered by the anonymous reviewers. The generosity and expertise of one and all have improved this study in innumerable ways and saved us from many errors.

Author information

Authors and Affiliations

School of Software, Fudan University, 2005, Songhu Road, Shanghai, 200438, China
Tianhuan Yu
School of Computer Science, Fudan University, 2005, Songhu Road, Shanghai, 200438, China
Zhenying He, Fei Ye, Yuankai Fan, Yinan Jing, Kai Zhang & X. Sean Wang
Research Institute of Intelligent Computing, Zhejiang Lab, Yuhang District, Hangzhou, 311100, China
Zhihui Yang

Corresponding author

Correspondence to Zhenying He.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yu, T., He, Z., Yang, Z. et al. Zebra: A novel method for optimizing text classification query in overload scenario. World Wide Web 26, 905–931 (2023). https://doi.org/10.1007/s11280-022-01061-y

Received: 20 January 2022
Revised: 06 April 2022
Accepted: 26 April 2022
Published: 02 June 2022
Issue Date: May 2023
DOI: https://doi.org/10.1007/s11280-022-01061-y
