An Empirical Investigation of Online News Classification on an Open-Domain, Large-Scale and High-Quality Dataset in Vietnamese

Tran, Khanh Quoc; Trinh, Phap Ngoc; Tran, Khoa Nguyen-Anh; Le, An Tran-Hoai; Ha, Luan Van; Nguyen, Kiet Van

doi:10.3233/FAIA210036

Abstract

In this paper, we present UIT-ViON (Vietnamese Online Newspaper), a new dataset collected from well-known Vietnamese online newspapers. We collect and process the data, construct the dataset, and then experiment with different machine learning models on it. In particular, we propose an open-domain, large-scale, and high-quality dataset consisting of 260,000 textual data points annotated with multiple labels for evaluating Vietnamese short text classification. In addition, we present an approach based on transformer learning (PhoBERT) for Vietnamese short text classification on the dataset, which outperforms traditional machine learning models (Naive Bayes and Logistic Regression) and deep learning models (Text-CNN and LSTM). The proposed approach achieves an F1-score of 80.62%. This is a positive result and a basis for developing an automatic news classification system. Such a system is intended to save significant time, cost, and human resources and to make it easier for readers to find news on topics that interest them. In future work, we will propose solutions to improve the quality of the dataset and the performance of the classification models.
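To make the reported metric concrete, the following is a minimal sketch of how an F1-score can be computed for a classifier's predictions. It assumes one label per article and macro-averaging over labels; the abstract does not state which averaging the authors used, and the function name `macro_f1` and the label names are hypothetical, chosen only for illustration.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1 over every label seen in gold or pred.

    Assumption: single-label classification; the averaging scheme
    used in the paper is not stated in the abstract.
    """
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for label g
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[g] += 1   # g was the true label but was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # Per-label F1 = 2*TP / (2*TP + FP + FN); 0 if the label never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not taken from UIT-ViON.
gold = ["sports", "sports", "business", "health"]
pred = ["sports", "business", "business", "health"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.4f}")
```

Macro-averaging weights every label equally regardless of how many articles carry it, which is one common choice when class sizes differ; micro- or weighted averaging would give different numbers on the same predictions.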

Contact

IOS Press Copyright 2025
