Improving Text Clustering Using a New Technique for Selecting Trustworthy Content in Social Networks

Diaz-Garcia, J. Angel; Fernandez-Basso, Carlos; Gutiérrez-Batista, Karel; Ruiz, M. Dolores; Martin-Bautista, Maria J.

doi:10.1007/978-3-031-08974-9_22

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1602)

Included in the following conference series:

International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems

Abstract

Today’s information society has given rise to a large number of applications that generate and consume digital data. Many of these applications are built on social networks, so their information often arrives as unstructured text. Text from social media also tends to contain a high level of noise and untrustworthy content, which makes systems capable of handling it efficiently highly relevant. Verifying the trustworthiness of social media content requires analysing and exploring the data with text mining techniques. One of the most widespread of these techniques is text clustering, which automatically groups similar documents into categories. Because text clustering is very sensitive to noise, in this paper we propose a pre-processing pipeline based on word embeddings that selects trustworthy content and discards noise in a way that improves clustering results. To validate the proposed pipeline, a real use case is provided on a Twitter dataset related to COVID-19.
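
The abstract does not spell out how content is selected, so the following is only an illustrative sketch of the general idea (embed documents, discard those that look like noise, then cluster what remains), not the authors' pipeline. Everything in it is an assumption made for illustration: the toy 2-D `EMBEDDINGS` table, the `select_content` helper, the seed terms, the cosine threshold of 0.5, and the use of a plain k-means seeded with the first k points.

```python
import math

# Hypothetical 2-D word vectors; a real pipeline would load trained embeddings.
EMBEDDINGS = {
    "vaccine": (0.9, 0.1), "hospital": (0.8, 0.2), "symptoms": (0.85, 0.15),
    "lockdown": (0.1, 0.9), "travel": (0.2, 0.8), "quarantine": (0.15, 0.85),
    "lol": (-0.7, -0.7), "giveaway": (-0.8, -0.6),
}

def doc_vector(tokens):
    """Average the vectors of known tokens; return None if none are known."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        return None
    return tuple(sum(c) / len(vecs) for c in zip(*vecs))

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(*a) * math.hypot(*b)
    return dot / norm if norm else 0.0

def select_content(docs, seed_terms, threshold=0.5):
    """Assumed selection rule: keep documents close to the mean seed-term vector."""
    seed = doc_vector(seed_terms)
    kept = []
    for tokens in docs:
        vec = doc_vector(tokens)
        if vec is not None and cosine(vec, seed) >= threshold:
            kept.append((tokens, vec))
    return kept

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm, seeded with the first k points for determinism."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

docs = [
    ["vaccine", "hospital"],
    ["lockdown", "travel"],
    ["lol", "giveaway"],        # off-topic noise
    ["symptoms", "hospital"],
    ["quarantine", "travel"],
    ["unknownword"],            # no known tokens, cannot be embedded
]

kept = select_content(docs, seed_terms=["vaccine", "lockdown"])
labels = kmeans([vec for _, vec in kept], k=2)
for (tokens, _), label in zip(kept, labels):
    print(label, tokens)
```

In this toy setting the filter drops the off-topic and unembeddable documents before clustering, so the remaining points separate into a health-related group and a mobility-related group. Averaged word vectors and a fixed threshold are deliberately simple stand-ins; any real selection criterion would need to be taken from the paper itself.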

Acknowledgment

The research reported in this paper was partially supported by the Andalusian government and the FEDER operative program under the project BigDataMed (P18-RT-2947 and B-TIC-145-UGR18), and by grant PLEC2021-007681 funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. Finally, the project is also partially supported by the Spanish Ministry of Education, Culture and Sport (FPU18/00150).

Author information

Authors and Affiliations

Department of Computer Science and A.I., University of Granada, Granada, Spain
J. Angel Diaz-Garcia, Carlos Fernandez-Basso, Karel Gutiérrez-Batista, M. Dolores Ruiz & Maria J. Martin-Bautista

Corresponding author

Correspondence to J. Angel Diaz-Garcia.

Editor information

Editors and Affiliations

University of Milano-Bicocca, Milan, Italy
Davide Ciucci
University of Oviedo, Oviedo, Spain
Inés Couso
University of Cádiz, Cádiz, Spain
Jesús Medina
University of Warsaw, Warsaw, Poland
Dominik Ślęzak
University of Perugia, Perugia, Italy
Davide Petturiti
Sorbonne Université, Paris, France
Bernadette Bouchon-Meunier
Iona College, New Rochelle, NY, USA
Ronald R. Yager

Cite this paper

Diaz-Garcia, J.A., Fernandez-Basso, C., Gutiérrez-Batista, K., Ruiz, M.D., Martin-Bautista, M.J. (2022). Improving Text Clustering Using a New Technique for Selecting Trustworthy Content in Social Networks. In: Ciucci, D., et al. Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2022. Communications in Computer and Information Science, vol 1602. Springer, Cham. https://doi.org/10.1007/978-3-031-08974-9_22

DOI: https://doi.org/10.1007/978-3-031-08974-9_22
Published: 04 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08973-2
Online ISBN: 978-3-031-08974-9
eBook Packages: Computer Science, Computer Science (R0)
