Abstract
This paper addresses the problem of classifying documents using the kernel approaches based on topic sequences. Previously, the string kernel uses the ordered subsequence of characters as features and the word sequence kernel is proposed to use words as the subsequences. However, they both face the problem of computational complexity because of the large amount of symbols (characters or words). This paper, therefore, proposes to use sequences of topics rather than characters or words to reduce the number of symbols, thus increasing the computational efficiency. Documents that exhibit similar posterior topic proportions are expected to have similar topic sequence and then should be classified into the same category. Experiments conducted on the Reuters-21578 datasets have proven this hypothesis.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Joachims, T.: Text Categorization with Support Vector Machines. Technical report, LS VIII NO. 23. University of Dortmund (1997)
Osuna, E., Freund, R., Girosi, F.: Training support vector machines: an application to face detection. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 130–136 (1997)
Wang, J.Y.: Application of Support Vector Machines in Bioinformatics. Master’s thesis, Dept. Computer Sci. Info. Eng., National Taiwan University (2002)
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., Watkins, C.: Text classification using string kernels. The Journal of Machine Learning Research 2, 419–444 (2002)
Cancedda, N., Gaussier, E., Goutte, C., Renders, J.M.: Word sequence kernels. Journal of Machine Learning Research 3, 1059–1082 (2003)
Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
Mercer, J.: Functions of positive and negative type and their connection with the theory of integral equations. Philosophical Transactions of the Royal Society London (A) 209, 415–446 (1909)
Deerwester, S., Dumais, S., Landauer, T., Furnas, G., Harshman, R.: Indexing by latent semantic analysis. Journal of the American Society of Information Science 41(6), 391–407 (1990)
Hofmann, T.: Probabilistic latent semantic indexing. In: Research and Development in Information Retrieval, pp. 50–57 (1999)
Blei, D., Ng, A., Jordan, M.: Latent dirichlet allocation. Journal of Machine Learning Research 3, 993–1022 (2003)
Blei, D., Lafferty, J.: Dynamic topic models. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 113–120 (2006)
Blei, D., Lafferty, J.: A correlated topic model of science. Annals of Applied Statistics 1(1), 17–35 (2007)
Minka, T., Lafferty, J.: Expectation-propagation for the generative aspect model. In: Uncertainty in Artificial Intelligence, UAI (2002)
Steyvers, M., Griffiths, T.: Probabilistic topic models. In: Landauer, T., McNamara, D., Dennis, S., Kintsch, W. (eds.) Latent Semantic Analysis: A Road to Meaning. Lawrence Erlbaum (2006)
Teh, Y., Newman, D., Welling, M.: A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. In: Neural Information Processing Systems (2006)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Xu, J., Lu, Q., Liu, Z., Chai, J. (2012). Topic Sequence Kernel. In: Hou, Y., Nie, JY., Sun, L., Wang, B., Zhang, P. (eds) Information Retrieval Technology. AIRS 2012. Lecture Notes in Computer Science, vol 7675. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35341-3_41
Download citation
DOI: https://doi.org/10.1007/978-3-642-35341-3_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35340-6
Online ISBN: 978-3-642-35341-3
eBook Packages: Computer ScienceComputer Science (R0)