Denoising Autoencoder as an Effective Dimensionality Reduction and Clustering of Text Data

Leyli-Abadi, Milad; Labiod, Lazhar; Nadif, Mohamed

doi:10.1007/978-3-319-57529-2_62

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10235)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Deep learning methods are widely used in vision and face recognition; however, such methods are rarely applied to text data. In this context, the data is often represented by a sparse, high-dimensional document-term matrix. To deal with such matrices, we present in this paper a new denoising auto-encoder for dimensionality reduction, in which each document is affected not only by its own information but also by the information from its neighbors according to the cosine similarity measure. It turns out that the proposed auto-encoder can discover low-dimensional embeddings and, as a result, reveal the underlying effective manifold structure. The visual representation of these embeddings suggests that the documents can suitably be clustered with the Expectation-Maximization algorithm for Gaussian mixture models. On real-world datasets, the relevance of the presented auto-encoder for visualisation and document clustering is shown by a comparison with five widely used unsupervised dimensionality reduction methods, including the classic auto-encoder.
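
The neighbor idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not give the weighting scheme, so the code below assumes one plausible reading, in which each row of the document-term matrix is replaced by a cosine-similarity-weighted average of itself and its k most similar documents. The names `cosine`, `neighbor_smooth`, the parameter `k`, and the toy matrix are hypothetical. The auto-encoder itself and the Gaussian mixture clustering step are not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense term-count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def neighbor_smooth(docs, k=1):
    """Replace each document by a similarity-weighted average of itself
    and its k most cosine-similar neighbors (assumed weighting scheme)."""
    smoothed = []
    for i, d in enumerate(docs):
        # Similarity of document i to every other document.
        sims = [(cosine(d, other), j) for j, other in enumerate(docs) if j != i]
        top = sorted(sims, reverse=True)[:k]
        # The document itself gets weight 1; neighbors get their similarity.
        weights = [(1.0, i)] + top
        total = sum(w for w, _ in weights)
        smoothed.append([
            sum(w * docs[j][t] for w, j in weights) / total
            for t in range(len(d))
        ])
    return smoothed

# Toy document-term matrix: rows are documents, columns are term counts.
docs = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
]
smoothed = neighbor_smooth(docs, k=1)
```

In this toy example the first two documents share vocabulary, as do the last two, so each smoothed row moves toward its partner while the unrelated terms stay at zero. Under the assumption above, rows like these would serve as the inputs or reconstruction targets of the auto-encoder.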

Notes

1. http://dataexpertise.org/research

Author information

Authors and Affiliations

LIPADE, Paris Descartes University, 75006, Paris, France
Milad Leyli-Abadi, Lazhar Labiod & Mohamed Nadif

Corresponding author

Correspondence to Milad Leyli-Abadi.

Editor information

Editors and Affiliations

Kangwon National University, Chuncheon, Korea (Republic of)
Jinho Kim
Seoul National University, Seoul, Korea (Republic of)
Kyuseok Shim
University of Technology Sydney, Sydney, New South Wales, Australia
Longbing Cao
KAIST, Daejeon, Korea (Republic of)
Jae-Gil Lee
University of New South Wales, Sydney, New South Wales, Australia
Xuemin Lin
Kangwon National University, Chuncheon, Korea (Republic of)
Yang-Sae Moon

About this paper

Cite this paper

Leyli-Abadi, M., Labiod, L., Nadif, M. (2017). Denoising Autoencoder as an Effective Dimensionality Reduction and Clustering of Text Data. In: Kim, J., Shim, K., Cao, L., Lee, JG., Lin, X., Moon, YS. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2017. Lecture Notes in Computer Science, vol 10235. Springer, Cham. https://doi.org/10.1007/978-3-319-57529-2_62

DOI: https://doi.org/10.1007/978-3-319-57529-2_62
Published: 23 April 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-57528-5
Online ISBN: 978-3-319-57529-2
eBook Packages: Computer Science, Computer Science (R0)
