MEDLINE Abstracts Classification Based on Noun Phrases Extraction

Ruiz-Rico, Fernando; Vicedo, José-Luis; Rubio-Sánchez, María-Consuelo

doi:10.1007/978-3-540-92219-3_38

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 25)

Abstract

Many algorithms for automated text categorization have emerged in recent years. They have been studied exhaustively, leading to several variants and combinations, not only in the procedures themselves but also in the treatment of the input data. A widely used approach is to represent documents as a Bag-Of-Words (BOW) and to weight tokens with the TFIDF scheme. Many researchers have pursued improvements in precision and recall, and reductions in classification time, by enriching BOW with stemming, n-grams, feature selection, noun phrases, metadata, weight normalization, etc. We contribute to this field with a novel combination of these techniques. For evaluation purposes, we provide comparisons with previous works that use SVM on the simple BOW. The well-known OHSUMED corpus is used, and different sets of categories are selected, as previously done in the literature. The conclusion is that the proposed method can be successfully applied to existing binary classifiers such as SVM, outperforming the combination of BOW and TFIDF.
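
As an illustration of the baseline the abstract refers to, the following is a minimal sketch of a Bag-Of-Words representation with TFIDF weighting. It is not the authors' method: the whitespace tokenizer, the tf * log(N / df) variant, the L2 normalization, and the function names `tokenize` and `tfidf_vectors` are all assumptions chosen for brevity.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {token: weight} dict per document.

    Weight = term frequency * log(N / document frequency), followed by
    L2 normalization. This is one common TFIDF variant among many.
    """
    bags = [Counter(tokenize(doc)) for doc in docs]  # Bag-Of-Words counts
    n_docs = len(bags)

    # Document frequency: in how many documents each token appears.
    df = Counter()
    for bag in bags:
        df.update(bag.keys())

    vectors = []
    for bag in bags:
        vec = {tok: tf * math.log(n_docs / df[tok]) for tok, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {tok: w / norm for tok, w in vec.items()}
        vectors.append(vec)
    return vectors


if __name__ == "__main__":
    # Toy documents, invented for illustration only.
    sample = ["heart attack risk", "heart surgery", "lung cancer risk"]
    for doc, vec in zip(sample, tfidf_vectors(sample)):
        print(doc, "->", {tok: round(w, 3) for tok, w in vec.items()})
```

Tokens that occur in fewer documents receive larger weights, and a token that occurs in every document receives weight zero. Enrichments such as stemming, n-grams, or noun phrases would change what `tokenize` emits, while the weighting step could stay the same.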

Author information

Authors and Affiliations

University of Alicante, Spain
Fernando Ruiz-Rico, José-Luis Vicedo & María-Consuelo Rubio-Sánchez

Editor information

Editors and Affiliations

IST - Instituto Superior Técnico, Instituto de Telecomunicações, Av. Rovisco Pais, 1, 1049-001, Lisbon, Portugal
Ana Fred
Polytechnic Institute of Setúbal – INSTICC, Department of Systems and Informatics, Rua do Vale de Chaves - Estefanilha, 2910-761, Setúbal, Portugal
Joaquim Filipe
Institute of Telecommunications, Av. Rovisco Pais, 1, 1049-001, Lisboa, Portugal
Hugo Gamboa

About this paper

Cite this paper

Ruiz-Rico, F., Vicedo, JL., Rubio-Sánchez, MC. (2008). MEDLINE Abstracts Classification Based on Noun Phrases Extraction. In: Fred, A., Filipe, J., Gamboa, H. (eds) Biomedical Engineering Systems and Technologies. BIOSTEC 2008. Communications in Computer and Information Science, vol 25. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-92219-3_38

DOI: https://doi.org/10.1007/978-3-540-92219-3_38
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-92218-6
Online ISBN: 978-3-540-92219-3
eBook Packages: Computer Science, Computer Science (R0)
