
Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models

Applied Intelligence

Abstract

In today’s technology-driven era, the proliferation of data is inevitable across various domains. In engineering, the sciences, and business, particularly in the context of big data, analyzing this data can yield actionable insights capable of transforming these fields. During data management and analysis, patterns or groups of interconnected data points, commonly referred to as clusters, frequently emerge. These clusters represent distinct subsets of closely related data points, exhibiting characteristics that set them apart from other clusters within the same dataset. Across disciplines such as physics, biology, business, and sales, clustering is important for understanding the essential characteristics of novel datasets, developing complex statistical models, and testing various hypotheses. However, interpreting the characteristics and physical implications of the clusters generated by different clustering algorithms is challenging for researchers unfamiliar with these algorithms’ inner workings. This research addresses the intricacies of comprehending data clustering, cluster attributes, and evaluation metrics, especially for individuals lacking proficiency in clustering or related disciplines such as statistics. The primary objective of this study is to simplify cluster analysis by furnishing users or analysts from diverse domains with succinct linguistic synopses of clustering results, circumventing the need for intricate numerical or mathematical terms. Deep learning techniques based on large language models, such as encoder-decoders (for example, the T5 model) and generative pre-trained transformers (GPTs), are employed to achieve this. The study constructs a summarization model that ingests data clusters and produces a condensed overview of the contained insights in a simplified, easily understandable linguistic format. The evaluation process revealed a clear preference among evaluators for the summaries generated by GPT, with T5 summaries following closely behind. Both GPT and T5 summaries scored well on fluency, demonstrating their ability to convey the original content in a human-like manner. In contrast, the linguistic protoform-based approach, while providing a structured framework for summarization, fails to match the quality and coherence of the GPT and T5 summaries.
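To make the described pipeline concrete, the following is a minimal sketch, not the authors' exact implementation: it clusters the Iris dataset with k-means, converts simple per-cluster statistics into a short factual description, and asks a pretrained T5 model (the publicly available t5-small checkpoint, used here as an assumed stand-in for the paper's fine-tuned summarizer) to restate that description as a fluent summary. The choice of statistics, prompt wording, and generation settings are illustrative assumptions.

```python
# Minimal sketch, assuming t5-small as a stand-in model and illustrative
# cluster statistics and prompt wording; not the authors' exact pipeline.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Cluster a toy dataset and compute a basic cluster-validity measure.
X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)

# Turn the numeric clustering results into a short factual description.
facts = [f"The dataset of {len(X)} samples was partitioned into 3 clusters "
         f"with a silhouette score of {sil:.2f}."]
for k in range(3):
    members = X[labels == k]
    facts.append(f"Cluster {k} contains {len(members)} samples with mean "
                 f"feature values {members.mean(axis=0).round(2).tolist()}.")

# Ask a pretrained encoder-decoder (T5) to restate the facts fluently.
tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
inputs = tok("summarize: " + " ".join(facts), return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=60)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```

A GPT-based variant would follow the same pattern, replacing the T5 generation step with a prompt to a GPT model, while a protoform-based baseline would instead fill linguistic templates (for example, "Most clusters are compact") using fuzzy quantifiers rather than a neural generator.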




Data Availability

Supporting data/code will be made available upon request.


Funding

This work is funded by the Science and Engineering Research Board (SERB) Startup Research Grant no. SRG/2021/000744 and SER-1796-ECD.

Author information


Contributions

All authors contributed equally towards performing the experiments and writing the manuscript.

Corresponding author

Correspondence to Dheeraj Kumar.

Ethics declarations

Competing interests

Not applicable.

Ethical Approval and Consent to participate

Not applicable.

Human Ethics

Not applicable.

Consent for Publication

The authors consent to the publication of this article if it is accepted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

K, N., Verma, S. & Kumar, D. Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models. Appl Intell 55, 317 (2025). https://doi.org/10.1007/s10489-025-06250-6

