A New Dimensionality Reduction Technique Based on HMM for Boosting Document Classification

Vieira, A. Seara; Iglesias, E. L.; Borrajo, L.

doi:10.1007/978-3-319-19776-0_8

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 375)

Abstract

Many classification problems, such as text classification, must cope with the high dimensionality of structured document representations. The sheer size of the data makes computation burdensome, so there is a strong need to reduce the amount of information handled during classification. In this paper, we propose a dimensionality reduction technique for text datasets that uses a clustering method to group documents and a simple Hidden Markov Model to represent them. We applied the new method to the OHSUMED benchmark text corpus using the \(k\)-NN and SVM classifiers. The results are very satisfactory and demonstrate the suitability of the proposed technique for dimensionality reduction and document classification.
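
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of cluster-based dimensionality reduction, not the authors' method. It assumes documents are already term-frequency vectors, groups them with plain k-means, and then re-describes each document by its distance to each of the k cluster centroids. The centroid distance is a stand-in for the Hidden Markov Model that the paper uses to represent each group; the function names `kmeans` and `reduce_documents` and the toy data are hypothetical.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain k-means; the first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def reduce_documents(vectors, centroids):
    """Represent each document by its distances to the k centroids.

    Assumption: distance to a centroid replaces the per-cluster HMM
    score described in the abstract.
    """
    return [[math.dist(v, c) for c in centroids] for v in vectors]


# Hypothetical term-frequency vectors over a 6-term vocabulary.
docs = [
    [3, 2, 0, 0, 0, 1],
    [0, 0, 4, 3, 0, 0],
    [2, 3, 1, 0, 0, 0],
    [0, 1, 3, 4, 0, 0],
]

centroids = kmeans(docs, k=2)
reduced = reduce_documents(docs, centroids)  # 4 documents, 2 features each
```

Each document shrinks from one feature per vocabulary term to one feature per cluster, and the reduced vectors can then be passed to a classifier such as k-NN or an SVM.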

Acknowledgments

This work has been funded by the European Union Seventh Framework Programme [FP7/REGPOT-2012-2013.1] under grant agreement no. 316265, BIOCAPS, and by the “Platform of integration of intelligent techniques for analysis of biomedical information” project (TIN2013-47153-C3-3-R) of the Spanish Ministry of Economy and Competitiveness.

Author information

Authors and Affiliations

Computer Science Dept., University of Vigo, Escola Superior de Enxeñería Informática, Ourense, Spain
A. Seara Vieira, E. L. Iglesias & L. Borrajo

Corresponding author

Correspondence to A. Seara Vieira.

Editor information

Editors and Affiliations

Fellowship for the Interpretation of Genomes, Burr Ridge, Illinois, USA
Ross Overbeek
Centre of Biological Engineering, Department of Informatics, University of Minho, Braga, Portugal
Miguel P. Rocha
Department of Informatics, University of Vigo, ESEI: Escuela Superior de Ingeniería Informática, Ourense, Spain
Florentino Fdez-Riverola
Departamento de Informática y Automática, University of Salamanca, Facultad de Ciencias, Salamanca, Spain
Juan F. De Paz

Cite this paper

Vieira, A.S., Iglesias, E.L., Borrajo, L. (2015). A New Dimensionality Reduction Technique Based on HMM for Boosting Document Classification. In: Overbeek, R., Rocha, M., Fdez-Riverola, F., De Paz, J. (eds) 9th International Conference on Practical Applications of Computational Biology and Bioinformatics. Advances in Intelligent Systems and Computing, vol 375. Springer, Cham. https://doi.org/10.1007/978-3-319-19776-0_8

DOI: https://doi.org/10.1007/978-3-319-19776-0_8
Published: 19 May 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19775-3
Online ISBN: 978-3-319-19776-0
eBook Packages: Engineering, Engineering (R0)
