Abstract
Today’s information society has given rise to a large number of applications that generate and consume digital data. Many of these applications are built on social networks, so their information often takes the form of unstructured text. Text from social media also tends to contain a high level of noise and untrustworthy content, which makes systems capable of handling it efficiently highly relevant. Verifying the trustworthiness of social media content requires analysing and exploring the data with text mining techniques. One of the most widespread of these techniques is text clustering, which automatically groups similar documents into categories. Because text clustering is very sensitive to noise, in this paper we propose a pre-processing pipeline based on word embeddings that selects trustworthy content and discards noise in a way that improves clustering results. To validate the proposed pipeline, we present a real use case on a Twitter dataset related to COVID-19.
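The general idea of filtering noisy texts in embedding space before clustering can be illustrated with a minimal sketch. It is not the pipeline proposed in the paper: the toy two-dimensional word vectors, the reference texts, the cosine threshold, and the helper names (`embed`, `select_trustworthy`, `kmeans`) are all assumptions made for illustration only.

```python
import math

# Hypothetical 2-D "word embeddings"; a real pipeline would load pretrained
# vectors (for example Word2Vec or fastText) instead of this toy table.
TOY_VECTORS = {
    "vaccine": (0.9, 0.1), "hospital": (0.8, 0.2), "symptoms": (0.85, 0.15),
    "lockdown": (0.1, 0.9), "travel": (0.2, 0.8), "borders": (0.15, 0.85),
    "giveaway": (-0.9, -0.1), "followers": (-0.8, -0.2),
}

def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_trustworthy(texts, reference_texts, threshold=0.5):
    """Keep texts whose embedding is close to at least one reference text.

    Assumption: closeness to a small set of trusted reference texts stands in
    for 'trustworthiness'; the paper's actual criterion may differ.
    """
    refs = [v for v in (embed(r) for r in reference_texts) if v is not None]
    kept = []
    for text in texts:
        vec = embed(text)
        if vec is not None and any(cosine(vec, r) >= threshold for r in refs):
            kept.append(text)
    return kept

def kmeans(points, k, iterations=10):
    """Plain Lloyd's k-means, seeded with the first k points (deterministic)."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

tweets = [
    "vaccine symptoms hospital",
    "lockdown travel borders",
    "giveaway followers",   # off-topic, spam-like
    "hospital vaccine",
    "travel lockdown",
    "lol omg",              # no known words, cannot be embedded
]
references = ["vaccine hospital symptoms", "lockdown travel borders"]

kept = select_trustworthy(tweets, references)
labels = kmeans([embed(t) for t in kept], k=2)
```

In this toy setting the spam-like and unembeddable tweets are dropped before clustering, so the two remaining topics separate cleanly; with real data the threshold and the choice of reference texts would need to be tuned and validated.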
Acknowledgment
The research reported in this paper was partially supported by the Andalusian government and the FEDER operational programme under the project BigDataMed (P18-RT-2947 and B-TIC-145-UGR18) and by grant PLEC2021-007681 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. Finally, the project is also partially supported by the Spanish Ministry of Education, Culture and Sport (FPU18/00150).
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Diaz-Garcia, J.A., Fernandez-Basso, C., Gutiérrez-Batista, K., Ruiz, M.D., Martin-Bautista, M.J. (2022). Improving Text Clustering Using a New Technique for Selecting Trustworthy Content in Social Networks. In: Ciucci, D., et al. Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2022. Communications in Computer and Information Science, vol 1602. Springer, Cham. https://doi.org/10.1007/978-3-031-08974-9_22
DOI: https://doi.org/10.1007/978-3-031-08974-9_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08973-2
Online ISBN: 978-3-031-08974-9
eBook Packages: Computer Science (R0)