Abstract
In this paper we introduce a novel document clustering approach that solves some major problems of traditional document clustering approaches. Instead of depending on the traditional vector space model, this approach represents a set of documents as a bipartite graph using domain knowledge from an ontology. In this representation, the concepts of the documents are classified according to their relationships with documents, as reflected in the bipartite graph. Using the concept groups, documents are then clustered based on the concepts' contribution to each document. Through the mutual-refinement relationship between concept groups and document groups, the two groups are recursively refined. Our experimental results on MEDLINE articles show that our approach outperforms two leading document clustering algorithms: BiSecting K-means and CLUTO. In addition to its decent performance, our approach provides a meaningful explanation for each document cluster by identifying its most contributing concepts, thereby helping users understand and interpret documents and clustering results.
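The mutual-refinement idea in the abstract can be illustrated with a small sketch. This is not the authors' algorithm: the abstract does not give the weighting scheme, the ontology mapping, or the stopping rule, so the function names (`mutual_refinement`, `top_concepts`), the seed concept groups, the additive scoring, and the toy weights below are all assumptions made for illustration only.

```python
from collections import defaultdict


def mutual_refinement(doc_concepts, seed_groups, max_iters=10):
    """Alternately refine document clusters and concept groups.

    doc_concepts: {doc_id: {concept: weight}}, the edges of a bipartite graph.
    seed_groups:  {concept: group_id}, a hypothetical initial concept grouping.
    Returns (doc_cluster, concept_group). Scoring by summed edge weight is an
    assumption, not the measure used in the paper.
    """
    concept_group = dict(seed_groups)
    doc_cluster = {}
    for _ in range(max_iters):
        # Step 1: put each document in the group whose concepts contribute most.
        new_doc_cluster = {}
        for doc, concepts in doc_concepts.items():
            score = defaultdict(float)
            for concept, weight in concepts.items():
                if concept in concept_group:
                    score[concept_group[concept]] += weight
            if score:
                # Ties go to the smallest group id so the result is deterministic.
                new_doc_cluster[doc] = max(sorted(score), key=score.get)
        # Step 2: move each concept to the cluster that holds most of its weight.
        mass = defaultdict(lambda: defaultdict(float))
        for doc, concepts in doc_concepts.items():
            if doc in new_doc_cluster:
                for concept, weight in concepts.items():
                    mass[concept][new_doc_cluster[doc]] += weight
        new_concept_group = {c: max(sorted(g), key=g.get) for c, g in mass.items()}
        # Stop once neither side of the graph changes.
        if new_doc_cluster == doc_cluster and new_concept_group == concept_group:
            break
        doc_cluster, concept_group = new_doc_cluster, new_concept_group
    return doc_cluster, concept_group


def top_concepts(doc_concepts, doc_cluster, cluster, k=2):
    """Return the k concepts with the largest total weight in one cluster."""
    totals = defaultdict(float)
    for doc, concepts in doc_concepts.items():
        if doc_cluster.get(doc) == cluster:
            for concept, weight in concepts.items():
                totals[concept] += weight
    return sorted(totals, key=lambda c: (-totals[c], c))[:k]


# Toy data: made-up documents and concept weights, not taken from MEDLINE.
docs = {
    "d1": {"insulin": 3, "glucose": 2},
    "d2": {"glucose": 1, "insulin": 1, "tumor": 1},
    "d3": {"tumor": 4, "chemotherapy": 2},
    "d4": {"chemotherapy": 3, "tumor": 1},
}
clusters, groups = mutual_refinement(docs, {"insulin": 0, "tumor": 1})
print(clusters)
print(top_concepts(docs, clusters, 0))
```

In this sketch the concept groups start from two seed concepts and grow as documents are assigned, and `top_concepts` plays the role of the per-cluster explanation mentioned in the abstract. A real implementation would replace the additive score with whatever contribution measure the paper defines.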
This research work is supported in part by the NSF Career grant (NSF IIS 0448023), NSF CCF 0514679, and the PA Dept of Health Tobacco Settlement Formula Grant (#240205, 240196).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yoo, I., Hu, X. (2006). Clustering Large Collection of Biomedical Literature Based on Ontology-Enriched Bipartite Graph Representation and Mutual Refinement Strategy. In: Ng, WK., Kitsuregawa, M., Li, J., Chang, K. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2006. Lecture Notes in Computer Science, vol 3918. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11731139_36
DOI: https://doi.org/10.1007/11731139_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33206-0
Online ISBN: 978-3-540-33207-7
eBook Packages: Computer Science, Computer Science (R0)