The Role of Feature Selection in Text Mining in the Process of Discovering Missing Clinical Annotations – Case Study

  • Conference paper
Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis (BDAS 2019)

Abstract

The vocabulary that doctors use to describe the results of medical procedures changes along with new standards. Text data, which is immediately understandable to a medical professional, is difficult to use in large-scale analysis. Extracting data relevant to a given case, e.g. the Bethesda class, means taking on the challenge of normalizing free-form text and all the grammatical forms associated with it. This is particularly difficult in Polish, where words change their form significantly according to their function in the sentence. We found common black-box text mining methods inaccurate for this purpose. Here we describe a word-frequency-based method for annotating text data for Bethesda class extraction. We compared it with an algorithm based on the C4.5 decision tree. We showed how important the choice of method and range of features is for avoiding conflicting classifications. The proposed algorithms made it possible to avoid the limitations of a rule base.
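To make the idea of word-frequency-based annotation concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple scoring rule (summing the relative frequency of each token within a class), an optional `vocabulary` argument standing in for feature selection, and a handful of invented English reports. The function names `tokenize`, `train`, and `classify` are hypothetical.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase, split on whitespace, strip basic punctuation.
    # Assumption: real Polish reports would also need lemmatization.
    tokens = (t.strip(".,;:()") for t in text.lower().split())
    return [t for t in tokens if t]

def train(labeled_reports):
    """Count how often each token occurs in the reports of each class."""
    counts = defaultdict(Counter)
    for text, label in labeled_reports:
        counts[label].update(tokenize(text))
    return counts

def classify(text, counts, vocabulary=None):
    """Pick the class whose training reports use the report's tokens most often.

    If `vocabulary` is given, only tokens in it are scored (feature selection).
    """
    scores = {}
    for label, counter in counts.items():
        total = sum(counter.values())
        scores[label] = sum(
            counter[tok] / total
            for tok in tokenize(text)
            if vocabulary is None or tok in vocabulary
        )
    return max(scores, key=scores.get)

# Toy, invented reports for illustration only; not data from the study.
reports = [
    ("benign follicular nodule", "II"),
    ("benign colloid nodule", "II"),
    ("suspicious for malignancy", "V"),
    ("suspicious papillary carcinoma", "V"),
]
model = train(reports)
label = classify("colloid nodule benign appearance", model)
```

In this toy setting, restricting `vocabulary` can change the predicted class for a report that mixes terms from several classes, which is one way the choice of features can lead to conflicting classifications.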


Acknowledgments

This work was supported by the National Center for Research and Development project MILESTONE under the program STRATEGMED (contract No. STRATEGMED2/267398/4/NCBR/2015). The full study protocol was approved by an ethics committee. This work was partially supported by the Polish Ministry of Science and Higher Education as part of the Implementation Doctorate program at the Silesian University of Technology, Gliwice, Poland (contract No. 10/DW/2017/01/1).

Author information

Corresponding author

Correspondence to Aleksander Płaczek.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Płaczek, A., Płuciennik, A., Pach, M., Jarząb, M., Mrozek, D. (2019). The Role of Feature Selection in Text Mining in the Process of Discovering Missing Clinical Annotations – Case Study. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Paving the Road to Smart Data Processing and Analysis. BDAS 2019. Communications in Computer and Information Science, vol 1018. Springer, Cham. https://doi.org/10.1007/978-3-030-19093-4_19

  • DOI: https://doi.org/10.1007/978-3-030-19093-4_19

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-19092-7

  • Online ISBN: 978-3-030-19093-4

  • eBook Packages: Computer Science (R0)
