Textual data dimensionality reduction - a deep learning approach

Multimedia Tools and Applications

Abstract

The growth of the Internet has produced a high volume of natural language textual data. Such data are often sparse and may contain uninformative features that inflate the dimensionality of the data. This high dimensionality, in turn, decreases the efficiency of text mining tasks such as clustering. Transforming high-dimensional data into a lower-dimensional representation is therefore an important pre-processing step before clustering. In this paper, a dimensionality reduction method based on a deep autoencoder neural network, named DRDAE, is proposed to provide optimized and robust features for text clustering. DRDAE selects a less correlated and salient feature space from the original high-dimensional feature space. To evaluate the proposed algorithm, k-means is used to cluster the text documents. The proposed method is tested on five benchmark text datasets. Simulation results demonstrate that the proposed algorithm clearly outperforms conventional dimensionality reduction methods from the literature in terms of the Rand index (RI).
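For readers who want a concrete picture of this kind of pipeline, the sketch below reduces TF-IDF document vectors through a small autoencoder bottleneck and clusters the encoded features with k-means, scoring the result with the Rand index. It is a minimal illustration only: the toy corpus, layer sizes, optimizer, and training settings are assumptions chosen for demonstration, not the DRDAE configuration reported in the paper.

```python
# Minimal autoencoder-based dimensionality reduction + k-means clustering.
# All hyperparameters and the toy corpus are illustrative assumptions,
# NOT the paper's DRDAE settings.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import rand_score
from tensorflow import keras

docs = [
    "the team won the football match",
    "a thrilling match and a late goal",
    "goal scored in the final minute of the game",
    "stocks rallied as markets closed higher",
    "the market fell on weak earnings",
    "investors sold shares after the earnings report",
]
labels = [0, 0, 0, 1, 1, 1]  # ground-truth classes for the Rand index

# High-dimensional, sparse bag-of-words representation of the corpus.
X = TfidfVectorizer().fit_transform(docs).toarray().astype("float32")
input_dim, latent_dim = X.shape[1], 4  # latent_dim is an arbitrary choice

# Symmetric autoencoder; the bottleneck output is the reduced feature space.
inputs = keras.Input(shape=(input_dim,))
h = keras.layers.Dense(16, activation="relu")(inputs)
code = keras.layers.Dense(latent_dim, activation="relu")(h)
h = keras.layers.Dense(16, activation="relu")(code)
outputs = keras.layers.Dense(input_dim, activation="sigmoid")(h)

autoencoder = keras.Model(inputs, outputs)
encoder = keras.Model(inputs, code)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=200, batch_size=2, verbose=0)  # reconstruct input

# Cluster in the learned low-dimensional space and evaluate with RI.
Z = encoder.predict(X, verbose=0)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(Z)
print("Rand index:", rand_score(labels, pred))
```

The sketch conveys only the general autoencoder-then-k-means structure the abstract describes; in DRDAE, the encoder architecture and training objective are designed specifically to yield less correlated, salient features.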



Author information

Correspondence to Neetu Kushwaha.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Kushwaha, N., Pant, M. Textual data dimensionality reduction - a deep learning approach. Multimed Tools Appl 79, 11039–11050 (2020). https://doi.org/10.1007/s11042-018-6900-x

