Similarity Search for the Content of Medical Records Using Unstructured Data

Wilczek, Sylwia; Gawrysiak, Kinga; Spinczyk, Dominik

doi:10.1007/978-3-319-91211-0_44

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 762)

Included in the following conference series:

International Conference on Information Technologies in Biomedicine

Abstract

Clustering large amounts of unstructured data is an important challenge in contemporary medicine and biology. This article presents an automatic clustering method for unstructured medical data. The method consists of the following main steps: transformation of the document corpus into a term-frequency matrix; dimensionality reduction of that matrix using principal component analysis (PCA); direct pairwise comparison of documents using the cosine and correlation distances as similarity measures; and, for expertly labelled data sets, finding the optimal number of groups by treating clustering as an optimization problem in which the objective function is the F measure, optimized by selecting parameter values such as the PCA resolution and the similarity threshold for pairs of documents. The usefulness of the proposed methodology was demonstrated by performing calculations on three data sets: short sentences divided into three themes, radiological reports of aneurysms, and radiological reports of abdominal studies. A common barrier in clustering unstructured data is the difficulty of interpreting the results. To overcome this limitation, several presentation methods are described, including group histograms, similarity matrices, plots of document assignment to the clusters found, F-measure interpolation, and alphabetical and term-frequency dictionaries. Excluding the labelling step, the presented method is completely automated and can be used as a preliminary data analysis method for large bodies of text to discover potential groups of interesting topics.
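
The steps listed in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The whitespace tokenizer, the grouping of documents by connected components of the thresholded similarity graph, the pairwise definition of the F measure, the plain grid search, and all function names and toy sentences are assumptions made for illustration. Only the cosine measure is shown; the correlation distance mentioned in the abstract is omitted.

```python
import itertools
import numpy as np

def term_frequency_matrix(docs):
    """Rows are documents, columns are vocabulary terms, entries are counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: j for j, w in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for w in d.lower().split():
            tf[i, index[w]] += 1
    return tf, vocab

def pca_reduce(x, k):
    """Project mean-centred rows onto the first k principal components."""
    centred = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine_similarity(x):
    """Pairwise cosine similarity between rows (zero rows stay at zero)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.where(norms == 0, 1.0, norms)
    return unit @ unit.T

def group_by_threshold(sim, threshold):
    """Assumption: documents linked by similarity >= threshold share a group
    (connected components, found with a small union-find)."""
    n = len(sim)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(n), 2):
        if sim[i][j] >= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)]

def pairwise_f_measure(pred, truth):
    """Assumption: F measure computed over document pairs (same group or not)."""
    tp = fp = fn = 0
    for i, j in itertools.combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        tp += same_pred and same_true
        fp += same_pred and not same_true
        fn += same_true and not same_pred
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_parameters(docs, truth, ks, thresholds):
    """Grid search over PCA dimension k and similarity threshold t."""
    tf, _ = term_frequency_matrix(docs)
    best = (-1.0, None, None)
    for k in ks:
        sim = cosine_similarity(pca_reduce(tf, k))
        for t in thresholds:
            score = pairwise_f_measure(group_by_threshold(sim, t), truth)
            if score > best[0]:
                best = (score, k, t)
    return best

if __name__ == "__main__":
    # Hypothetical toy corpus with expert labels for three themes.
    docs = ["aneurysm of the aorta", "aortic aneurysm seen",
            "liver lesion in abdomen", "abdomen scan shows liver lesion",
            "weather is sunny today", "sunny weather today"]
    labels = [0, 0, 1, 1, 2, 2]
    print(best_parameters(docs, labels, ks=[1, 2, 3], thresholds=[0.3, 0.5, 0.7]))
```

The returned triple holds the best F measure found together with the PCA dimension and threshold that produced it, which mirrors the abstract's framing of parameter selection as an optimization over labelled data.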

Author information

Authors and Affiliations

Faculty of Biomedical Engineering, Silesian University of Technology, Roosevelta 40, 41-800, Zabrze, Poland
Sylwia Wilczek & Dominik Spinczyk
Warsaw School of Economy, Aleja Niepodległości 162, 00-001, Warsaw, Poland
Kinga Gawrysiak

Corresponding author

Correspondence to Dominik Spinczyk.

Editor information

Editors and Affiliations

Faculty of Biomedical Engineering, Silesian University of Technology, Zabrze, Poland
Ewa Pietka
Faculty of Biomedical Engineering, Silesian University of Technology, Zabrze, Poland
Pawel Badura
Faculty of Biomedical Engineering, Silesian University of Technology, Zabrze, Poland
Jacek Kawa
Faculty of Biomedical Engineering, Silesian University of Technology, Zabrze, Poland
Wojciech Wieclawek

About this paper

Cite this paper

Wilczek, S., Gawrysiak, K., Spinczyk, D. (2019). Similarity Search for the Content of Medical Records Using Unstructured Data. In: Pietka, E., Badura, P., Kawa, J., Wieclawek, W. (eds) Information Technology in Biomedicine. ITIB 2018. Advances in Intelligent Systems and Computing, vol 762. Springer, Cham. https://doi.org/10.1007/978-3-319-91211-0_44

DOI: https://doi.org/10.1007/978-3-319-91211-0_44
Published: 06 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-91210-3
Online ISBN: 978-3-319-91211-0
eBook Packages: Intelligent Technologies and Robotics (R0)
