Abstract
This work presents a new approach to vocabulary reduction that filters words in the topic feature space rather than directly in the original word space. The main goal is to analyze how the Cumulative Count-based word filter (f_cc) behaves when applied in the word feature space (BoW: Bag of Words) compared with its application to topic descriptions (obtained by LDA: Latent Dirichlet Allocation). Three well-known text datasets (Reuters, WebKB and NewsGroup) are used to evaluate the performance of the proposed approach.
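The contrast between filtering in word space and filtering in topic space can be illustrated with a minimal sketch. The abstract does not define the filter, so the reading used here is an assumption: words are ranked by weight and kept until their cumulative share of the total weight reaches a threshold. The function names, the `keep_mass` parameter, and the toy data are all hypothetical, and the topic-word distributions are written by hand rather than estimated with LDA.

```python
from typing import Dict, List, Set


def cumulative_filter(weights: Dict[str, float], keep_mass: float) -> Set[str]:
    """Keep the heaviest words until their cumulative share of the total
    weight reaches keep_mass. This is an assumed reading of a cumulative
    count-based filter, not the paper's definition."""
    total = sum(weights.values())
    kept: Set[str] = set()
    running = 0.0
    # Visit words from heaviest to lightest; ties are broken alphabetically.
    for word, weight in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        if running >= keep_mass * total:
            break
        kept.add(word)
        running += weight
    return kept


def filter_in_word_space(corpus_counts: Dict[str, float], keep_mass: float) -> Set[str]:
    """Apply the filter once, to corpus-wide word counts (BoW view)."""
    return cumulative_filter(corpus_counts, keep_mass)


def filter_in_topic_space(topic_word: List[Dict[str, float]], keep_mass: float) -> Set[str]:
    """Apply the filter to each topic's word distribution and take the
    union of the surviving words (an assumed way to combine topics)."""
    vocabulary: Set[str] = set()
    for distribution in topic_word:
        vocabulary |= cumulative_filter(distribution, keep_mass)
    return vocabulary


# Hypothetical corpus-wide counts: one very frequent, uninformative word.
corpus_counts = {"the": 60, "market": 15, "goal": 15, "stock": 5, "match": 5}

# Hypothetical topic-word distributions (hand-written, each sums to 1).
topic_word = [
    {"market": 0.5, "stock": 0.25, "the": 0.125, "goal": 0.0625, "match": 0.0625},
    {"goal": 0.5, "match": 0.25, "the": 0.125, "market": 0.0625, "stock": 0.0625},
]

word_space_vocab = filter_in_word_space(corpus_counts, keep_mass=0.5)
topic_space_vocab = filter_in_topic_space(topic_word, keep_mass=0.5)
```

On this toy data, which was constructed to make the two views disagree, the word-space filter keeps only `the`, while the topic-space filter keeps `market` and `goal`. This says nothing about the results reported in the paper; it only shows that the same filter can select different vocabularies depending on the space it is applied in.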
This work was partially supported by FPU-AP-2009-4435 from the Spanish Ministry of Education, PROMETEO/2010/028 project from Generalitat Valenciana and P1-1B2010-27 project from the Plan de Promoció de la Investigació UJI.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fernández-Beltran, R., Montoliu, R., Pla, F. (2013). Vocabulary Reduction in BoW Representing by Topic Modeling. In: Sanches, J.M., Micó, L., Cardoso, J.S. (eds) Pattern Recognition and Image Analysis. IbPRIA 2013. Lecture Notes in Computer Science, vol 7887. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38628-2_77
DOI: https://doi.org/10.1007/978-3-642-38628-2_77
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38627-5
Online ISBN: 978-3-642-38628-2
eBook Packages: Computer Science, Computer Science (R0)