Abstract
MEDLINE is a representative collection of medical documents. Each document is supplied with its original full-text natural-language abstract and with representative keywords (called MeSH-terms), which expert annotators select manually from a predefined ontology and structure according to their relation to the document. We show how these structured, manually assigned semantic descriptions can be combined with the original full-text abstracts to improve the quality of clustering the documents into a small number of clusters. As baselines, we use clustering based only on abstracts or only on MeSH-terms. Our experiments show 36% to 47% higher cluster coherence than these baselines, as well as more refined keywords for the produced clusters.
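The abstract does not say how the MeSH-terms and the abstract text are combined, so the following is only a minimal sketch of one plausible reading: each document becomes a single weighted term vector built from both sources, and the vectors are clustered by cosine similarity with a simple spherical k-means. The function names, the major/minor split of MeSH-terms, the three weights, and the toy documents are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def combined_vector(abstract, mesh_major, mesh_minor,
                    w_text=1.0, w_major=3.0, w_minor=1.5):
    """Build one unit-length term vector from abstract words and MeSH-terms.

    The weights are hypothetical; they only express the idea that manually
    assigned terms may deserve more influence than free-text words.
    """
    vec = Counter()
    for word in abstract.lower().split():
        vec["text:" + word] += w_text
    for term in mesh_major:
        vec["mesh:" + term.lower()] += w_major
    for term in mesh_minor:
        vec["mesh:" + term.lower()] += w_minor
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else {}


def cosine(u, v):
    """Dot product of two sparse vectors (the cosine when both are unit length)."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())


def spherical_kmeans(vectors, k, iterations=10):
    """Cluster unit vectors by cosine similarity, seeded with the first k vectors."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each document to its most similar centroid.
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        # Recompute each centroid as the normalized sum of its members.
        for c in range(k):
            total = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    total.update(v)
            norm = math.sqrt(sum(x * x for x in total.values()))
            if norm:
                centroids[c] = {t: x / norm for t, x in total.items()}
    return labels


# Toy documents: (abstract text, major MeSH-terms, minor MeSH-terms).
docs = [
    ("insulin therapy lowers blood glucose", ["Diabetes Mellitus"], ["Insulin"]),
    ("gene expression in tumor cells", ["Neoplasms"], ["Gene Expression"]),
    ("glucose control in diabetic patients", ["Diabetes Mellitus"], ["Blood Glucose"]),
    ("tumor suppressor gene mutations", ["Neoplasms"], ["Mutation"]),
]
vectors = [combined_vector(*d) for d in docs]
labels = spherical_kmeans(vectors, k=2)
print(labels)
```

Setting `w_major` and `w_minor` to zero, or `w_text` to zero, gives the abstracts-only and MeSH-only baselines under the same sketch, which makes it easy to compare the three representations on the same data.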
This work was done under partial support of the ITRI of Chung-Ang University, the Korean Government (KIPA Professorship for Visiting Faculty Positions in Korea), and the Mexican Government (CONACyT, SNI, IPN). The third author is currently on sabbatical leave at Chung-Ang University.
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shin, K., Han, SY., Gelbukh, A. (2004). Advanced Clustering Technique for Medical Data Using Semantic Information. In: Monroy, R., Arroyo-Figueroa, G., Sucar, L.E., Sossa, H. (eds) MICAI 2004: Advances in Artificial Intelligence. MICAI 2004. Lecture Notes in Computer Science, vol 2972. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24694-7_33
DOI: https://doi.org/10.1007/978-3-540-24694-7_33
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21459-5
Online ISBN: 978-3-540-24694-7
eBook Packages: Springer Book Archive