Improving Text Clustering Using a New Technique for Selecting Trustworthy Content in Social Networks

  • Conference paper
Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU 2022)

Abstract

Today’s information society has given rise to a large number of applications that generate and consume digital data. Many of these applications are built on social networks, so their information often takes the form of unstructured text. Social media text also tends to contain a high level of noise and untrustworthy content, which makes systems that can handle it efficiently highly relevant. Verifying the trustworthiness of social media content requires analysing and exploring the data with text mining techniques. One of the most widespread of these techniques is text clustering, which automatically groups similar documents into categories. Because text clustering is very sensitive to noise, in this paper we propose a pre-processing pipeline based on word embeddings that selects trustworthy content and discards noise in a way that improves clustering results. To validate the proposed pipeline, we present a real use case on a Twitter dataset related to COVID-19.
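
The abstract does not spell out how the selection step works, so the following is only a minimal sketch of the general idea of embedding-based filtering before clustering, not the authors' pipeline. Everything in it is an assumption made for illustration: the toy two-dimensional vectors in `TOY_VECTORS`, the hypothetical helper `select_trustworthy`, the use of a small reference set of trusted texts, and the similarity threshold of 0.9. A real setup would load pretrained word embeddings and pass the retained documents on to a clustering algorithm.

```python
import math

# Toy 2-d word vectors, invented for this example. A real pipeline would
# load pretrained embeddings instead.
TOY_VECTORS = {
    "covid": (0.9, 0.1),
    "vaccine": (0.8, 0.2),
    "hospital": (0.85, 0.15),
    "cases": (0.7, 0.3),
    "win": (0.1, 0.9),
    "prize": (0.05, 0.95),
    "click": (0.2, 0.8),
}


def embed(text):
    """Average the vectors of known words; return None if no word is known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_trustworthy(docs, reference_docs, threshold=0.9):
    """Keep documents whose embedding lies close to the centroid of a
    reference set (hypothetical selection rule, assumed for illustration)."""
    ref_vecs = [v for v in map(embed, reference_docs) if v is not None]
    centroid = tuple(sum(component) / len(ref_vecs) for component in zip(*ref_vecs))
    kept = []
    for doc in docs:
        vec = embed(doc)
        # Documents with no known words, or far from the centroid, count as noise.
        if vec is not None and cosine(vec, centroid) >= threshold:
            kept.append(doc)
    return kept


tweets = ["covid cases hospital", "click win prize", "vaccine hospital", "lol"]
reference = ["covid vaccine", "hospital cases"]
filtered = select_trustworthy(tweets, reference)
print(filtered)
```

In this toy run the spam-like tweet and the out-of-vocabulary tweet are discarded, and only the two on-topic tweets would be handed to the clustering step.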

Acknowledgment

The research reported in this paper was partially supported by the Andalusian government and the FEDER operative program under the project BigDataMed (P18-RT-2947 and B-TIC-145-UGR18) and by grant PLEC2021-007681, funded by MCIN/AEI/10.13039/501100011033 and by the European Union NextGenerationEU/PRTR. Finally, the project is also partially supported by the Spanish Ministry of Education, Culture and Sport (FPU18/00150).

Author information

Corresponding author

Correspondence to J. Angel Diaz-Garcia.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Diaz-Garcia, J.A., Fernandez-Basso, C., Gutiérrez-Batista, K., Ruiz, M.D., Martin-Bautista, M.J. (2022). Improving Text Clustering Using a New Technique for Selecting Trustworthy Content in Social Networks. In: Ciucci, D., et al. Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2022. Communications in Computer and Information Science, vol 1602. Springer, Cham. https://doi.org/10.1007/978-3-031-08974-9_22

  • DOI: https://doi.org/10.1007/978-3-031-08974-9_22

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08973-2

  • Online ISBN: 978-3-031-08974-9

  • eBook Packages: Computer Science, Computer Science (R0)
