Abstract
One of the most critical challenges in document modeling is the efficient extraction of high-level representations. This paper proposes a document modeling method based on a deep generative model and spectral hashing. First, dense, low-dimensional features are learned by a deep generative model that takes word-count vectors as input. These features are then used to train a spectral hashing model that compresses a novel document into a compact binary code, such that the Hamming distances between codewords correlate with semantic similarity. Retrieving similar neighbors then reduces to retrieving all items whose codewords lie within a small Hamming distance of the query's codeword. This lookup can be exceedingly fast, shows superior performance compared with conventional methods, and remains applicable to large-scale datasets.
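The retrieval step described in the abstract can be illustrated with a minimal sketch. It covers only the final lookup: it assumes the binary codewords have already been produced by some hashing model and are stored as Python integers. The function names, the 8-bit example codes, and the linear scan are illustrative assumptions, not the authors' implementation; a real system would more likely probe a hash table than scan every item.

```python
from typing import List


def hamming_distance(a: int, b: int) -> int:
    """Count the bits that differ between two codewords stored as integers."""
    return bin(a ^ b).count("1")


def retrieve_within_radius(query: int, codes: List[int], radius: int) -> List[int]:
    """Return the indices of stored codewords within `radius` bits of `query`."""
    return [i for i, c in enumerate(codes) if hamming_distance(query, c) <= radius]


# Hypothetical 8-bit codewords for four documents (made up for illustration).
codes = [0b10110010, 0b10110011, 0b01001101, 0b10100010]
query = 0b10110010

# Documents whose codewords differ from the query in at most one bit.
neighbors = retrieve_within_radius(query, codes, radius=1)
print(neighbors)
```

Because the distance is an XOR followed by a bit count, each comparison is cheap, which is the property the abstract relies on for fast retrieval.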
Acknowledgments
This work is supported in part by the Beijing Natural Science Foundation under Grant No. 4162067/4142050 and the National Science Foundation of China under Grant No. 61472391/61372171.
Copyright information
© 2016 Springer International Publishing AG
About this paper
Cite this paper
Chen, H., Xu, J., Wang, Q., He, B. (2016). A Document Modeling Method Based on Deep Generative Model and Spectral Hashing. In: Lehner, F., Fteimi, N. (eds) Knowledge Science, Engineering and Management. KSEM 2016. Lecture Notes in Computer Science, vol 9983. Springer, Cham. https://doi.org/10.1007/978-3-319-47650-6_32
DOI: https://doi.org/10.1007/978-3-319-47650-6_32
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-47649-0
Online ISBN: 978-3-319-47650-6
eBook Packages: Computer Science, Computer Science (R0)