Abstract:
The detrimental effects of misinformation are diverse and severe. Despite substantial advances in automatic misinformation detection, several challenges persist, and new ones emerge with technological progress. Our hybrid approach, which combines natural language processing and machine learning models, yields promising results. We focus on detecting misinformation in Spanish using multi-level labeling and on addressing class imbalance, an issue scarcely explored in the literature. Our methodology performs feature selection in the initial stages, which simplifies the model structure, reduces computational demands, and enhances interpretability. After feature selection, we apply traditional class-balancing techniques during training. In a series of experiments on the CLNews dataset, ANOVA identified linguistic features belonging to the Surface variables and the Emotions and feelings lexicon; with this selection, the performance of the Random Forest model improved from 0.35 on the imbalanced dataset to 0.917 after random undersampling (RUS). Mutual information analysis identified linguistic features associated with both Surface variables and Readability; with this selection, the accuracy of the XGBoost model improved from 0.3 to 0.9 after random oversampling (ROS). These results support the effectiveness of our methodology. To date, no studies have surpassed our results on the CLNews dataset or addressed the multi-level classification problem using this dataset. The code for our experiments is publicly available to ensure transparency and reproducibility.
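The two stages described in the abstract, ANOVA-based feature selection followed by random undersampling (RUS) or random oversampling (ROS), can be illustrated with a minimal standard-library sketch. This is not the authors' published code: the function names (`anova_f`, `select_top_k`, `random_resample`), the three class labels, and the synthetic two-feature data are all hypothetical, and no classifier is trained here.

```python
import random
from collections import Counter, defaultdict


def anova_f(values, labels):
    """One-way ANOVA F statistic of one feature (needs at least two classes)."""
    groups = defaultdict(list)
    for v, lab in zip(values, labels):
        groups[lab].append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    means = {lab: sum(g) / len(g) for lab, g in groups.items()}
    # Between-class and within-class sums of squares.
    ss_between = sum(len(g) * (means[lab] - grand_mean) ** 2 for lab, g in groups.items())
    ss_within = sum((v - means[lab]) ** 2 for lab, g in groups.items() for v in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_k(X, y, k):
    """Indices of the k features with the highest F statistic."""
    scores = [anova_f([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def random_resample(X, y, mode, seed=0):
    """Balance classes by random undersampling ('rus') or oversampling ('ros')."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, lab in enumerate(y):
        by_class[lab].append(i)
    sizes = [len(ix) for ix in by_class.values()]
    target = min(sizes) if mode == "rus" else max(sizes)
    chosen = []
    for ix in by_class.values():
        if mode == "rus":
            chosen += rng.sample(ix, target)  # keep a random subset, no replacement
        else:
            chosen += ix + rng.choices(ix, k=target - len(ix))  # add random duplicates
    rng.shuffle(chosen)
    return [X[i] for i in chosen], [y[i] for i in chosen]


# Synthetic, imbalanced three-class data (hypothetical labels, not CLNews).
rng = random.Random(42)
y = ["true"] * 60 + ["partial"] * 25 + ["fake"] * 15
offset = {"true": 0.0, "partial": 2.0, "fake": 4.0}
# Feature 0 depends on the class; feature 1 is pure noise.
X = [[offset[lab] + rng.gauss(0, 1), rng.gauss(0, 1)] for lab in y]

keep = select_top_k(X, y, 1)                      # stage 1: feature selection
X_sel = [[row[j] for j in keep] for row in X]
X_rus, y_rus = random_resample(X_sel, y, "rus")   # stage 2a: undersampling
X_ros, y_ros = random_resample(X_sel, y, "ros")   # stage 2b: oversampling
print(keep, Counter(y_rus), Counter(y_ros))
```

In a real experiment, both the feature scores and the resampling would be computed on the training split only, so that the held-out data keeps its original class distribution.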
Date of Conference: 28-30 October 2024
Date Added to IEEE Xplore: 03 December 2024