Abstract
The growth of the Internet has produced a high volume of natural-language textual data. Such data can be sparse and may contain uninformative features that inflate its dimensionality. This high dimensionality, in turn, decreases the efficiency of text-mining tasks such as clustering. Transforming high-dimensional data into a lower-dimensional representation is therefore an important pre-processing step before clustering. In this paper, a dimensionality reduction method based on a deep autoencoder neural network, named DRDAE, is proposed to provide optimized and robust features for text clustering. DRDAE selects a less correlated and more salient feature space from the original high-dimensional one. To evaluate the proposed algorithm, k-means is used to cluster the text documents. The method is tested on five benchmark text datasets. Simulation results demonstrate that the proposed algorithm clearly outperforms other conventional dimensionality reduction methods from the literature in terms of the Rand index (RI).
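The pipeline the abstract describes can be sketched in a few lines: vectorize the documents, compress the sparse high-dimensional features with an autoencoder, then run k-means on the learned codes. The sketch below is a minimal illustration of that idea, not the paper's DRDAE implementation; the corpus, hidden dimension, learning rate, and epoch count are all illustrative assumptions.

```python
# Minimal sketch: TF-IDF -> single-hidden-layer autoencoder -> k-means.
# Hyperparameters here are assumptions for illustration, not the paper's values.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def train_autoencoder(X, hidden_dim=8, epochs=300, lr=0.5, seed=0):
    """Sigmoid encoder, linear decoder, full-batch gradient descent on MSE."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden_dim)); b1 = np.zeros(hidden_dim)
    W2 = rng.normal(0.0, 0.1, (hidden_dim, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))  # encoder activations (the codes)
        Xhat = H @ W2 + b2                        # linear reconstruction
        err = (Xhat - X) / n                      # dLoss/dXhat for the MSE loss
        dH = (err @ W2.T) * H * (1.0 - H)         # back-prop through the sigmoid
        W2 -= lr * (H.T @ err); b2 -= lr * err.sum(axis=0)
        W1 -= lr * (X.T @ dH);  b1 -= lr * dH.sum(axis=0)
    return W1, b1

docs = [
    "stock markets fell as investors sold shares",
    "the bank raised interest rates on loans",
    "the team won the football match yesterday",
    "a late goal decided the championship game",
]
X = TfidfVectorizer().fit_transform(docs).toarray()
W1, b1 = train_autoencoder(X, hidden_dim=8)
Z = 1.0 / (1.0 + np.exp(-(X @ W1 + b1)))          # reduced feature space
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
```

As in the paper's evaluation, clustering is applied to the reduced codes `Z` rather than the raw TF-IDF matrix; with ground-truth topic labels available, the Rand index would then score the agreement between `labels` and the true grouping.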
Cite this article
Kushwaha, N., Pant, M. Textual data dimensionality reduction - a deep learning approach. Multimed Tools Appl 79, 11039–11050 (2020). https://doi.org/10.1007/s11042-018-6900-x