DOI:10.1145/1141753.1141802
Article

A comprehensive comparison study of document clustering for a biomedical digital library MEDLINE

Published: 11 June 2006

Abstract

Document clustering has been used to improve document retrieval, document browsing, and text mining in digital libraries. In this paper, we perform a comprehensive comparison study of several document clustering approaches, namely three hierarchical methods (single-link, complete-link, and average-link), Bisecting K-means, K-means, and Suffix Tree Clustering, in terms of efficiency, effectiveness, and scalability. In addition, we apply a domain ontology to document clustering to investigate whether an ontology such as MeSH improves clustering quality for MEDLINE articles. Because an ontology is a formal, explicit specification of a shared conceptualization for a domain of interest, the use of ontologies is a natural way to address traditional information retrieval problems such as the synonym, hypernym, and hyponym problems. We conducted fairly extensive experiments using different evaluation metrics, including misclassification index, F-measure, cluster purity, and entropy, on very large article sets from MEDLINE, the largest biomedical digital library.
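Two of the evaluation metrics named in the abstract, cluster purity and entropy, can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so the code below assumes the definitions commonly used in the document clustering literature: purity is the fraction of documents that belong to the majority class of their cluster, and entropy is the size-weighted average of the per-cluster class entropy (base 2, not normalized). The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict
from math import log2


def _class_counts(labels, clusters):
    """Count how many documents of each true class fall in each cluster."""
    counts = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        counts[cluster][label] += 1
    return counts


def purity(labels, clusters):
    """Fraction of documents in the majority class of their cluster (assumed definition)."""
    counts = _class_counts(labels, clusters)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(labels, clusters):
    """Size-weighted mean of per-cluster class entropy, in bits (assumed definition)."""
    counts = _class_counts(labels, clusters)
    n = len(labels)
    total = 0.0
    for c in counts.values():
        size = sum(c.values())
        h = -sum((k / size) * log2(k / size) for k in c.values())
        total += (size / n) * h
    return total


# Toy data (hypothetical): six documents, three true classes, two clusters.
true_labels = ["a", "a", "b", "b", "b", "c"]
assigned = [0, 0, 0, 1, 1, 1]

print(round(purity(true_labels, assigned), 4))
print(round(entropy(true_labels, assigned), 4))
```

Higher purity and lower entropy both indicate clusters that agree more closely with the reference classes; a clustering that reproduces the classes exactly has purity 1 and entropy 0 under these definitions.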



Published In

JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital libraries
June 2006
402 pages
ISBN:1595933549
DOI:10.1145/1141753

Publisher

Association for Computing Machinery, New York, NY, United States


Author Tags

1. comparison study
2. document clustering
3. ontology


Conference

JCDL06: Joint Conference on Digital Libraries 2006
June 11 - 15, 2006
Chapel Hill, NC, USA

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%
