Similarity Search for the Content of Medical Records Using Unstructured Data

  • Conference paper
Information Technology in Biomedicine (ITIB 2018)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 762)

Abstract

Clustering large amounts of unstructured data is an important challenge in contemporary medicine and biology. This article presents an automatic clustering method for unstructured medical data. The method consists of the following main steps: transforming the document corpus into a term-frequency matrix; reducing the dimensionality of that matrix using principal component analysis (PCA); directly comparing pairs of documents using similarity measures based on the cosine and correlation distances; and, for expertly labelled data sets, finding the optimal number of groups by treating clustering as an optimization problem in which the objective function is the F-measure, optimized by selecting parameter values such as the PCA resolution and the document-pair similarity threshold. The usefulness of the methodology was demonstrated on three data sets: short sentences divided into three themes, radiological reports of aneurysms, and radiological reports of abdominal studies. A common barrier in clustering unstructured data is the difficulty of interpreting the results. To overcome this limitation, several presentation methods are described: group histograms, similarity matrices, plots of document assignment to the clusters found, F-measure interpolation, and alphabetical and term-frequency dictionaries. Apart from the labelling step, the method is fully automated and can be used as a preliminary analysis of large bodies of text to discover potentially interesting groups of topics.
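The steps listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the tokenization, the grouping rule or the exact F-measure definition, so the whitespace tokenizer, the connected-components grouping, the pairwise F-measure and every function name below are assumptions made for illustration. Only NumPy is required, and only the cosine measure is shown (a correlation-based measure could be substituted via `np.corrcoef`).

```python
import itertools
import numpy as np

def term_frequency_matrix(docs):
    """Rows are documents, columns are terms, entries are raw counts."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed tokenizer
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, toks in enumerate(tokenized):
        for tok in toks:
            tf[i, index[tok]] += 1
    return tf, vocab

def pca_reduce(tf, n_components):
    """Project centred rows onto the leading principal components via SVD."""
    centred = tf - tf.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

def cosine_similarity(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def group_by_threshold(sim, threshold):
    """Assumed grouping rule: link pairs with sim >= threshold and
    take connected components as groups."""
    n = sim.shape[0]
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i, j] >= threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

def pairwise_f_measure(predicted, expert):
    """Assumed F-measure: F1 over document pairs, where a pair is
    positive if both documents share a group."""
    tp = fp = fn = 0
    for i, j in itertools.combinations(range(len(expert)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = expert[i] == expert[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def grid_search(docs, expert, components, thresholds):
    """Return (best F-measure, PCA components, threshold) over the grid."""
    tf, _ = term_frequency_matrix(docs)
    best = (-1.0, None, None)
    for k in components:
        sim = cosine_similarity(pca_reduce(tf, k))
        for t in thresholds:
            score = pairwise_f_measure(group_by_threshold(sim, t), expert)
            if score > best[0]:
                best = (score, k, t)
    return best

if __name__ == "__main__":
    # Invented toy corpus with hypothetical expert labels.
    docs = ["aneurysm of the aorta", "aortic aneurysm seen",
            "liver and kidney normal", "kidney and liver unremarkable"]
    expert = [0, 0, 1, 1]
    print(grid_search(docs, expert, components=[1, 2], thresholds=[0.3, 0.5, 0.7]))
```

The exhaustive grid over the number of components and the threshold mirrors the abstract's framing of clustering as optimization of the F-measure over these two parameters; a real corpus would likely need stop-word removal and a finer grid.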

Author information

Corresponding author

Correspondence to Dominik Spinczyk.

Copyright information

© 2019 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Wilczek, S., Gawrysiak, K., Spinczyk, D. (2019). Similarity Search for the Content of Medical Records Using Unstructured Data. In: Pietka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technology in Biomedicine. ITIB 2018. Advances in Intelligent Systems and Computing, vol 762. Springer, Cham. https://doi.org/10.1007/978-3-319-91211-0_44
