
Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models

Applied Intelligence

Abstract

In today’s technology-driven era, the proliferation of data is inevitable across various domains. In engineering, the sciences, and business, particularly in the context of big data, analyzing this data can yield actionable insights capable of transforming these fields. During data management and analysis, patterns or groups of interconnected data points, commonly referred to as clusters, frequently emerge. These clusters represent distinct subsets of closely related data points, exhibiting characteristics that set them apart from other clusters within the same dataset. Across disciplines such as physics, biology, business, and sales, clustering is important for understanding the essential characteristics of novel datasets, developing complex statistical models, and testing various hypotheses. However, interpreting the characteristics and physical implications of the clusters generated by different clustering algorithms is challenging for researchers unfamiliar with these algorithms’ inner workings. This research addresses the intricacies of comprehending data clustering, cluster attributes, and evaluation metrics, especially for individuals lacking proficiency in clustering or related disciplines such as statistics. The primary objective of this study is to simplify cluster analysis by furnishing users or analysts from diverse domains with succinct linguistic synopses of clustering results, circumventing the need for intricate numerical or mathematical terms. Deep learning techniques based on large language models, such as encoder-decoders (for example, the T5 model) and generative pre-trained transformers (GPTs), are employed to achieve this. The study constructs a summarization model that ingests data clusters and produces a condensed overview of the contained insights in a simplified, easily understandable linguistic format. The evaluation process revealed a clear preference among evaluators for the summaries generated by GPT, with T5 summaries following closely behind. Both GPT and T5 summaries scored well on fluency, demonstrating their ability to convey the original content in a human-like manner. In contrast, the linguistic protoform-based approach, while providing a structured framework for summarization, fails to match the quality and coherence of the GPT and T5 summaries.
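To make the described pipeline concrete, the following is a minimal sketch, not the authors' exact implementation: it clusters the Iris dataset with k-means, converts simple per-cluster statistics into a short factual description, and asks a pretrained T5 model (the publicly available t5-small checkpoint, used here as an assumed stand-in for the paper's fine-tuned summarizer) to restate that description as a fluent summary. The choice of statistics, prompt wording, and generation settings are illustrative assumptions.

```python
# Minimal sketch, assuming t5-small as a stand-in model and illustrative
# cluster statistics and prompt wording; not the authors' exact pipeline.
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Cluster a toy dataset and compute a basic cluster-validity measure.
X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sil = silhouette_score(X, labels)

# Turn the numeric clustering results into a short factual description.
facts = [f"The dataset of {len(X)} samples was partitioned into 3 clusters "
         f"with a silhouette score of {sil:.2f}."]
for k in range(3):
    members = X[labels == k]
    facts.append(f"Cluster {k} contains {len(members)} samples with mean "
                 f"feature values {members.mean(axis=0).round(2).tolist()}.")

# Ask a pretrained encoder-decoder (T5) to restate the facts fluently.
tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
inputs = tok("summarize: " + " ".join(facts), return_tensors="pt", truncation=True)
output_ids = model.generate(**inputs, num_beams=4, max_new_tokens=60)
print(tok.decode(output_ids[0], skip_special_tokens=True))
```

A GPT-based variant would follow the same pattern, replacing the T5 generation step with a prompt to a GPT model, while a protoform-based baseline would instead fill linguistic templates (for example, "Most clusters are compact") using fuzzy quantifiers rather than a neural generator.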




Data Availability

Supporting data/code will be made available upon request.


Funding

This work is funded by the Science and Engineering Research Board (SERB) Startup Research Grant no. SRG/2021/000744 and SER-1796-ECD.

Author information


Contributions

All authors contributed equally towards performing the experiments and writing the manuscript.

Corresponding author

Correspondence to Dheeraj Kumar.

Ethics declarations

Competing interests

Not applicable.

Ethical Approval and Consent to participate

Not applicable.

Human Ethics

Not applicable.

Consent for Publication

The authors consent to the publication of this article if it is accepted.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

K, N., Verma, S. & Kumar, D. Enhanced interpretation of novel datasets by summarizing clustering results using deep-learning based linguistic models. Appl Intell 55, 317 (2025). https://doi.org/10.1007/s10489-025-06250-6

