Abstract
We introduce a novel text representation method for corpora of short- and medium-length documents. The method applies Latent Dirichlet Allocation (LDA) to a corpus to infer its major topics, which are then used to represent the documents. The proposed representation has multiple levels of granularity, obtained by running LDA with different numbers of topics. We postulate that interpreting the data in a more general, lower-dimensional space can improve representation quality, and we show that choosing the right granularity of representation is an important aspect of text classification; rather than committing to a single level, we therefore combine several topical granularities into one multi-level representation. Each document is represented by its topical relevancy weights in a low-dimensional vector. Experimental results support the informative power of these multi-level representation vectors: applied to a text classification task with several well-known classification algorithms, the representation leads to very good classification performance. A further advantage is that, at a small cost in accuracy, our low-dimensional representation can be fed into many supervised or unsupervised machine learning algorithms that in practice cannot operate on conventional high-dimensional text representations.
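The core idea can be sketched in a few lines: train one LDA model per granularity level and concatenate the per-level topic-distribution vectors into a single document representation. The sketch below is illustrative only, not the authors' implementation; it uses scikit-learn's online variational LDA (the paper's inference method and corpora may differ), and the toy documents and the levels `[2, 4, 8]` are assumptions.

```python
# Minimal sketch of a multi-level LDA document representation.
# Assumptions: scikit-learn's LatentDirichletAllocation stands in for the
# paper's LDA inference; toy corpus and granularity levels are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the goalkeeper saved the penalty in the final match",
    "the striker scored twice in the league match",
    "the senate passed the budget bill after a long debate",
    "the president vetoed the new tax bill",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# One LDA model per granularity; fewer topics means a more general space.
levels = [2, 4, 8]
reps = []
for k in levels:
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    reps.append(lda.fit_transform(counts))  # (n_docs, k) topic weights

# Multi-level representation: concatenate the per-level topic distributions.
multi_level = np.hstack(reps)
print(multi_level.shape)  # (4, 14): 2 + 4 + 8 dimensions per document
```

The resulting low-dimensional matrix can then be handed to any standard classifier (e.g. logistic regression or an SVM) in place of a bag-of-words matrix whose dimensionality equals the vocabulary size.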
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Razavi, A.H., Inkpen, D. (2014). Text Representation Using Multi-level Latent Dirichlet Allocation. In: Sokolova, M., van Beek, P. (eds) Advances in Artificial Intelligence. Canadian AI 2014. Lecture Notes in Computer Science(), vol 8436. Springer, Cham. https://doi.org/10.1007/978-3-319-06483-3_19
Print ISBN: 978-3-319-06482-6
Online ISBN: 978-3-319-06483-3