Zebra: A novel method for optimizing text classification query in overload scenario

World Wide Web

Abstract

Text classification is a crucial task in text mining, and it can be embedded in queries through user-defined functions (UDFs). In many web applications, such as Twitter mining or real-time Weibo processing, the volume of text to be processed is enormous and the system is frequently overloaded. In a streaming scenario, the query delays caused by overload degrade the user experience. This paper focuses on queries that apply text classification to streaming data. We propose Zebra, a novel method that uses progressive pipelines to optimize queries under overload. The core module of Zebra is a probabilistic filter that discards a large amount of text data based on the semantic information of the query predicate. We train weak classifiers as filters using data labeled by brute-force pipelines. We then use a parameter search method to choose a suitable filter with the best settings and apply it to the progressive pipelines. Experiments with several text workloads on real-world datasets show that Zebra achieves consistently higher accuracy while answering queries in time.
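The filtering idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the keyword-based `expensive_classifier`, the log-odds word scorer, the recall-driven threshold search, and the toy sentences are all assumptions chosen to keep the example small. A cheap filter is trained on labels produced by the costly classifier (the brute-force pipeline), a threshold is searched so that a target share of positive texts survives, and the costly classifier then runs only on the survivors.

```python
import math
from collections import Counter


def expensive_classifier(text):
    """Hypothetical stand-in for the costly classification UDF."""
    return int(any(w in ("match", "goal", "team") for w in text.split()))


def train_weak_filter(texts, labels):
    """Fit per-word log-odds weights with add-one smoothing."""
    pos, neg = Counter(), Counter()
    for text, label in zip(texts, labels):
        (pos if label else neg).update(set(text.split()))
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return {
        w: math.log((pos[w] + 1) / (n_pos + 2)) - math.log((neg[w] + 1) / (n_neg + 2))
        for w in set(pos) | set(neg)
    }


def score(weights, text):
    """Cheap filter score: sum of the weights of the distinct known words."""
    return math.fsum(weights.get(w, 0.0) for w in set(text.split()))


def choose_threshold(weights, texts, labels, target_recall):
    """Return the highest threshold that keeps at least target_recall of positives."""
    scores = [score(weights, t) for t in texts]
    total_pos = sum(labels)
    for th in sorted(set(scores), reverse=True):
        kept_pos = sum(1 for s, y in zip(scores, labels) if y and s >= th)
        if total_pos == 0 or kept_pos / total_pos >= target_recall:
            return th
    return float("-inf")  # no labeled data: let everything through


def filtered_query(texts, weights, threshold):
    """Run the expensive classifier only on texts that pass the cheap filter."""
    survivors = [t for t in texts if score(weights, t) >= threshold]
    hits = [t for t in survivors if expensive_classifier(t)]
    return hits, len(survivors)


# Toy labeled sample; labels come from the brute-force (expensive) pipeline.
train = [
    "the team scored a late goal",
    "great match for the home team",
    "stock prices fell sharply today",
    "new phone released this week",
    "the goal came after a long match",
    "rain expected over the weekend",
]
labels = [expensive_classifier(t) for t in train]
weights = train_weak_filter(train, labels)
threshold = choose_threshold(weights, train, labels, target_recall=1.0)
hits, n_evaluated = filtered_query(train, weights, threshold)
brute_force = [t for t in train if expensive_classifier(t)]
```

On this toy sample the filtered query returns the same texts as the brute-force pass while invoking the expensive classifier on only half of the inputs. In a real setting the threshold would be validated on held-out data, since a threshold tuned on the training sample alone can drop unseen positive texts.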

Notes

  1. https://blog.hootsuite.com/twitter-statistics/

  2. AllenNLP library: https://allennlp.org/

Acknowledgements

This work was mainly supported by the National Natural Science Foundation of China under Grant Nos. 61732004 and 62072113. It was also supported by the Research Projects of Zhejiang Lab (No. 2021PE0AC01).

We are also grateful for the insightful comments offered by the anonymous reviewers. The generosity and expertise of one and all have improved this study in innumerable ways and saved us from many errors.

Author information

Corresponding author

Correspondence to Zhenying He.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Yu, T., He, Z., Yang, Z. et al. Zebra: A novel method for optimizing text classification query in overload scenario. World Wide Web 26, 905–931 (2023). https://doi.org/10.1007/s11280-022-01061-y

  • DOI: https://doi.org/10.1007/s11280-022-01061-y
