Abstract
The vocabulary that doctors use to describe the results of medical procedures changes along with new standards. Text data, although immediately understandable to a medical professional, is difficult to use in large-scale analysis. Extracting data relevant to a given case, e.g. the Bethesda class, means taking on the challenge of normalizing free-form text and all the grammatical forms associated with it. This is particularly difficult in Polish, where words change their form significantly according to their function in the sentence. We found common black-box text-mining methods inaccurate for this purpose. Here we describe a word-frequency-based method for annotating text data for Bethesda class extraction. We compare it with an algorithm based on the C4.5 decision tree. We show how important the choice of method and the range of features are for avoiding conflicting classifications. The proposed algorithms made it possible to avoid the limitations of a rule base.
Acknowledgments
This work was supported by the National Center for Research and Development project MILESTONE under the program STRATEGMED (contract No. STRATEGMED2/267398/4/NCBR/2015). The full study protocol was approved by an ethics committee. This work was partially supported by the Polish Ministry of Science and Higher Education as part of the Implementation Doctorate program at the Silesian University of Technology, Gliwice, Poland (contract No. 10/DW/2017/01/1).
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Płaczek, A., Płuciennik, A., Pach, M., Jarząb, M., Mrozek, D. (2019). The Role of Feature Selection in Text Mining in the Process of Discovering Missing Clinical Annotations – Case Study. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis. BDAS 2019. Communications in Computer and Information Science, vol 1018. Springer, Cham. https://doi.org/10.1007/978-3-030-19093-4_19
DOI: https://doi.org/10.1007/978-3-030-19093-4_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19092-7
Online ISBN: 978-3-030-19093-4
eBook Packages: Computer Science, Computer Science (R0)