Comparative Study of Feature Selection Methods for Medical Full Text Classification

Adriano Gonçalves, Carlos; Lorenzo Iglesias, Eva; Borrajo, Lourdes; Camacho, Rui; Seara Vieira, Adrián; Talma Gonçalves, Célia

doi:10.1007/978-3-030-17935-9_49

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 11466)

Included in the following conference series:

International Work-Conference on Bioinformatics and Biomedical Engineering

Abstract

Much of the work in text categorization uses only the title and abstract of papers. A full paper, however, contains a much larger amount of information that could be used to improve classification performance. The potential benefit of using full texts comes with an additional problem: the increased size of the data sets.

To cope with the increased size of full-text data sets, we performed an assessment study on the use of feature selection methods for full-text classification. We compared two existing feature selection methods (Information Gain and Correlation) with a novel method called k-Best-Discriminative-Terms. The assessment was conducted using the Ohsumed corpora. We ran two sets of experiments: one using title and abstract only, and one using the full text.

The results show that the novel method does not perform well on small amounts of text such as title and abstract, but performs much better on the full-text data sets and requires a much smaller number of attributes.
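As an illustration of one of the existing methods named in the abstract, the sketch below scores terms by Information Gain for a binary classification task and keeps the k highest-scoring ones. It uses the standard textbook definition, IG(t) = H(C) - H(C | t present or absent); it is not the authors' implementation and does not reproduce k-Best-Discriminative-Terms, which the abstract does not specify. The function names (`entropy`, `information_gain`, `top_k_terms`), the representation of documents as sets of words, and the toy data are assumptions made for this example only.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """IG(term) = H(class) - H(class | term present or absent).

    docs   : list of sets of tokens (hypothetical representation)
    labels : list of class labels, one per document
    """
    with_term = Counter()
    without_term = Counter()
    for tokens, label in zip(docs, labels):
        (with_term if term in tokens else without_term)[label] += 1

    n = len(docs)
    h_class = entropy(list(Counter(labels).values()))
    # Conditional entropy: weight each partition by its share of documents.
    h_cond = sum(
        (sum(part.values()) / n) * entropy(list(part.values()))
        for part in (with_term, without_term)
    )
    return h_class - h_cond


def top_k_terms(docs, labels, k):
    """Return the k terms with the highest information gain.

    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = sorted({t for tokens in docs for t in tokens})
    scored = [(information_gain(docs, labels, t), t) for t in vocab]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]


# Toy data, invented for illustration (not taken from Ohsumed).
docs = [
    {"tumor", "biopsy", "patient"},
    {"tumor", "therapy", "patient"},
    {"heart", "valve", "patient"},
    {"heart", "rhythm", "patient"},
]
labels = ["relevant", "relevant", "other", "other"]

selected = top_k_terms(docs, labels, k=2)
```

In this toy data, a term such as "patient" that occurs in every document carries no information about the class and scores zero, while terms that split the two classes cleanly score highest. With full texts the vocabulary grows considerably, which is why the number of attributes retained after selection matters.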

Notes

1. We have used single words in our study, but k-BDT can also be used with other groupings of words such as n-grams (n > 1), NERs, etc.
2. It has been used in binary text classification but can also be adapted to non-binary classification problems.

Acknowledgements

This work was supported by the Consellería de Educación, Universidades e Formación Profesional (Xunta de Galicia) under the scope of the strategic funding of ED431C2018/55-GRC Competitive Reference Group. This work was also partially funded by the ERDF through the COMPETE 2020 Programme within project POCI-01-0145-FEDER-006961, and by National Funds through the FCT as part of project UID/EEA/50014/2013.

Author information

Authors and Affiliations

Computer Science Department, University of Vigo, Escola Superior de Enxeñería Informática, Ourense, Spain
Carlos Adriano Gonçalves, Eva Lorenzo Iglesias, Lourdes Borrajo & Adrián Seara Vieira
Faculdade de Engenharia da Universidade do Porto, Rua Dr. Roberto Frias s/n, 4200-465, Porto, Portugal
Rui Camacho
LIAAD - INESC TEC, Campus da FEUP, Rua Dr. Roberto Frias s/n, 4200-465, Porto, Portugal
Carlos Adriano Gonçalves & Rui Camacho
CEOS.PP/ISCAP-P.PORTO, Rua Jaime Lopes Amorim s/n, 4465-004, Porto, Portugal
Célia Talma Gonçalves
LIACC, Rua Dr. Roberto Frias s/n, 4200-465, Porto, Portugal
Célia Talma Gonçalves

Corresponding author

Correspondence to Rui Camacho.

Editor information

Editors and Affiliations

Department of Computer Architecture and Computer Technology, Higher Technical School of Information Technology and Telecommunications Engineering, CITIC-UGR, Granada, Spain
Ignacio Rojas
ETSIIT, University of Granada, Granada, Spain
Olga Valenzuela
CITIC-UGR, University of Granada, Granada, Spain
Fernando Rojas
Fundacion Progreso y Salud, Granada, Spain
Francisco Ortuño

About this paper

Cite this paper

Adriano Gonçalves, C., Lorenzo Iglesias, E., Borrajo, L., Camacho, R., Seara Vieira, A., Talma Gonçalves, C. (2019). Comparative Study of Feature Selection Methods for Medical Full Text Classification. In: Rojas, I., Valenzuela, O., Rojas, F., Ortuño, F. (eds) Bioinformatics and Biomedical Engineering. IWBBIO 2019. Lecture Notes in Computer Science, vol 11466. Springer, Cham. https://doi.org/10.1007/978-3-030-17935-9_49

DOI: https://doi.org/10.1007/978-3-030-17935-9_49
Published: 13 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-17934-2
Online ISBN: 978-3-030-17935-9
eBook Packages: Computer Science, Computer Science (R0)