Abstract
We introduce a novel text representation method for corpora of short- and medium-length documents. The method applies Latent Dirichlet Allocation (LDA) to a corpus to infer its major topics, which are then used to represent the documents. The proposed representation has multiple levels of granularity, obtained by running LDA with different numbers of topics. We postulate that interpreting the data in a more general, lower-dimensional space can improve representation quality, and we show that choosing the right granularity of representation is an important aspect of text classification; rather than committing to a single level, we therefore combine several topical granularities into one multi-level representation. Each document is represented by its topical relevancy weights in a low-dimensional vector. Experimental results support the informative power of these multi-level representation vectors: applied to a text classification task with several well-known classification algorithms, the representation leads to very good classification performance. A further advantage is that, at a small cost in accuracy, our low-dimensional representation can be fed into many supervised or unsupervised machine learning algorithms that in practice cannot operate on conventional high-dimensional text representations.
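The core idea can be sketched in a few lines: train one LDA model per granularity level and concatenate the per-level topic-distribution vectors into a single document representation. The sketch below is illustrative only, not the authors' implementation; it uses scikit-learn's online variational LDA (the paper's inference method and corpora may differ), and the toy documents and the levels `[2, 4, 8]` are assumptions.

```python
# Minimal sketch of a multi-level LDA document representation.
# Assumptions: scikit-learn's LatentDirichletAllocation stands in for the
# paper's LDA inference; toy corpus and granularity levels are illustrative.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the goalkeeper saved the penalty in the final match",
    "the striker scored twice in the league match",
    "the senate passed the budget bill after a long debate",
    "the president vetoed the new tax bill",
]

counts = CountVectorizer(stop_words="english").fit_transform(docs)

# One LDA model per granularity; fewer topics means a more general space.
levels = [2, 4, 8]
reps = []
for k in levels:
    lda = LatentDirichletAllocation(n_components=k, random_state=0)
    reps.append(lda.fit_transform(counts))  # (n_docs, k) topic weights

# Multi-level representation: concatenate the per-level topic distributions.
multi_level = np.hstack(reps)
print(multi_level.shape)  # (4, 14): 2 + 4 + 8 dimensions per document
```

The resulting low-dimensional matrix can then be handed to any standard classifier (e.g. logistic regression or an SVM) in place of a bag-of-words matrix whose dimensionality equals the vocabulary size.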
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Razavi, A.H., Inkpen, D. (2014). Text Representation Using Multi-level Latent Dirichlet Allocation. In: Sokolova, M., van Beek, P. (eds) Advances in Artificial Intelligence. Canadian AI 2014. Lecture Notes in Computer Science(), vol 8436. Springer, Cham. https://doi.org/10.1007/978-3-319-06483-3_19
Print ISBN: 978-3-319-06482-6
Online ISBN: 978-3-319-06483-3