Abstract
Words don’t come easy, which helps explain the ongoing popularity of generative artificial intelligence models in widely available applications such as ChatGPT. The result is an even greater flood of online content that takes time to process. This is where Natural Language Processing tools for classification come in handy: tasks such as distinguishing fake news or recognizing event types help process everyday information. In practice, such systems must work on data streams, where fast prediction is needed. To achieve this, methods not based on neural networks can be used. These methods, however, require feature extraction to convert the text into model input. The primary methods used for this purpose are bag-of-words and n-grams, which convert a corpus of texts into a numerical format. This paper proposes a new strategy for creating n-grams, called Hollow n-grams, which can be used to build classifier ensembles with a higher generalization ability than models based on regular n-grams only.
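To make the feature extraction step concrete, the following minimal sketch shows how regular word n-grams can turn a small corpus into a count matrix. It illustrates only the standard bag-of-words and n-gram representation that the abstract mentions, not the Hollow n-grams strategy, whose construction is not described here. The function names `word_ngrams` and `vectorize`, the whitespace tokenization, and the two-sentence corpus are assumptions made for illustration.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return the contiguous word n-grams of a text (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, an assumption
    made to keep the sketch dependency-free.
    """
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def vectorize(corpus, n):
    """Convert a corpus into (vocabulary, count matrix) using word n-grams.

    With n=1 this reduces to a bag-of-words representation.
    """
    counts = [Counter(word_ngrams(doc, n)) for doc in corpus]
    # The vocabulary is the sorted set of all n-grams seen in the corpus.
    vocabulary = sorted(set().union(*counts))
    # One row per document, one column per vocabulary entry.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Illustrative toy corpus, not data from the paper.
corpus = ["fake news spreads fast", "real news spreads slowly"]
vocab, X = vectorize(corpus, 2)
```

Each row of `X` can then be passed to a non-neural classifier. Running `vectorize` with several values of `n` would give separate feature spaces, one per base classifier, which is one plausible way to form an ensemble under these assumptions.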
Acknowledgement
This work was supported by the statutory funds of the Department of Systems and Computer Networks, Faculty of Information and Communication Technology, Wroclaw University of Science and Technology.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Borek-Marciniec, W., Ksieniewicz, P. (2023). Hollow n-grams Vectorizer for Natural Language Processing Problems. In: Burduk, R., Choraś, M., Kozik, R., Ksieniewicz, P., Marciniak, T., Trajdos, P. (eds) Progress on Pattern Classification, Image Processing and Communications. CORES IP&C 2023. Lecture Notes in Networks and Systems, vol 766. Springer, Cham. https://doi.org/10.1007/978-3-031-41630-9_2
DOI: https://doi.org/10.1007/978-3-031-41630-9_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-41629-3
Online ISBN: 978-3-031-41630-9
eBook Packages: Intelligent Technologies and Robotics (R0)