The Role of Feature Selection in Text Mining in the Process of Discovering Missing Clinical Annotations – Case Study

Płaczek, Aleksander; Płuciennik, Alicja; Pach, Mirosław; Jarząb, Michał; Mrozek, Dariusz

doi:10.1007/978-3-030-19093-4_19

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1018)

Included in the following conference series:

International Conference: Beyond Databases, Architectures and Structures

Abstract

The vocabulary that doctors use to describe the results of medical procedures changes as new standards are introduced. Text data, which a medical professional understands immediately, is difficult to use in large-scale analysis. Extracting data relevant to a given case, e.g. the Bethesda class, means taking on the challenge of normalizing free-form text and all the grammatical forms associated with it. This is particularly difficult in Polish, where words change their form significantly according to their function in the sentence. We found common black-box text mining methods inaccurate for this purpose. Here we describe a word-frequency-based method for annotating text data for Bethesda class extraction. We compare this approach with an algorithm based on the C4.5 decision tree. We show how important the choice of method and range of features is for avoiding conflicting classifications. The proposed algorithms made it possible to avoid the limitations of a rule base.
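To make the idea of word-frequency-based annotation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each class is described by the relative frequencies of the words seen in its labeled reports, and that a new report is assigned to the class whose profile gives the highest summed score. The function names (`tokenize`, `train`, `classify`) and the toy English reports are invented for illustration; real input would be Polish free text, and the lemmatization needed to merge inflected forms is omitted here.

```python
from collections import Counter, defaultdict
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def train(labeled_reports):
    """Build per-class relative word frequencies from (text, label) pairs."""
    counts = defaultdict(Counter)
    for text, label in labeled_reports:
        counts[label].update(tokenize(text))
    model = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        model[label] = {word: n / total for word, n in counter.items()}
    return model


def classify(model, text):
    """Return the best-matching class label, or None if no known word occurs."""
    tokens = tokenize(text)
    scores = {
        label: sum(freqs.get(tok, 0.0) for tok in tokens)
        for label, freqs in model.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Invented toy reports (assumption): Bethesda II = benign, V = suspicious for malignancy.
reports = [
    ("benign follicular nodule", "II"),
    ("benign colloid nodule", "II"),
    ("suspicious for papillary carcinoma", "V"),
    ("suspicious for malignancy", "V"),
]
model = train(reports)
predicted = classify(model, "colloid nodule benign")
unknown = classify(model, "unrelated words")
```

In this sketch a report that shares no words with any class profile is left unannotated (`None`) rather than forced into a class, which is one simple way to avoid a conflicting or unsupported classification. The choice of which words enter the profiles corresponds to the feature selection question raised in the abstract.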

Acknowledgments

This work was supported by the National Center for Research and Development project MILESTONE under the STRATEGMED program (contract No. STRATEGMED2/267398/4/NCBR/2015). The full study protocol was approved by an ethics committee. This work was partially supported by the Polish Ministry of Science and Higher Education as part of the Implementation Doctorate program at the Silesian University of Technology, Gliwice, Poland (contract No. 10/DW/2017/01/1).

Author information

Authors and Affiliations

Research and Development Department, WASKO S.A., ul. Berbeckiego 6, 44-100, Gliwice, Poland
Aleksander Płaczek, Alicja Płuciennik & Mirosław Pach
Institute of Informatics, Silesian University of Technology, ul. Akademicka 16, 44-100, Gliwice, Poland
Aleksander Płaczek, Mirosław Pach & Dariusz Mrozek
Institute of Automatic Control, Silesian University of Technology, ul. Akademicka 16, 44-100, Gliwice, Poland
Alicja Płuciennik
Maria Skłodowska-Curie Memorial Cancer Center and Institute of Oncology, Gliwice Branch, Gliwice, Poland
Michał Jarząb

Corresponding author

Correspondence to Aleksander Płaczek.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Dariusz Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Paweł Kasprowski
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Bożena Małysiak-Mrozek
Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Daniel Kostrzewa

Cite this paper

Płaczek, A., Płuciennik, A., Pach, M., Jarząb, M., Mrozek, D. (2019). The Role of Feature Selection in Text Mining in the Process of Discovering Missing Clinical Annotations – Case Study. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis. BDAS 2019. Communications in Computer and Information Science, vol 1018. Springer, Cham. https://doi.org/10.1007/978-3-030-19093-4_19

DOI: https://doi.org/10.1007/978-3-030-19093-4_19
Published: 27 April 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19092-7
Online ISBN: 978-3-030-19093-4
eBook Packages: Computer Science, Computer Science (R0)
