Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis

DOI: 10.1145/1281192.1281233
Published: 12 August 2007

Abstract

To unravel the concept structure and dynamics of the bioinformatics field, we analyze a set of 7401 publications from the Web of Science and MEDLINE databases, publication years 1981-2004. For delineating this complex, interdisciplinary field, a novel bibliometric retrieval strategy is used. Given that the performance of unsupervised clustering and classification of scientific publications is significantly improved by deeply merging textual contents with the structure of the citation graph, we proceed with a hybrid clustering method based on Fisher's inverse chi-square. The optimal number of clusters is determined by a compound semiautomatic strategy comprising a combination of distance-based and stability-based methods. We also investigate the relationship between number of Latent Semantic Indexing factors, number of clusters, and clustering performance. The HITS and PageRank algorithms are used to determine representative publications in each cluster. Next, we develop a methodology for dynamic hybrid clustering of evolving bibliographic data sets. The same clustering methodology is applied to consecutive periods defined by time windows on the set, and in a subsequent phase chains are formed by matching and tracking clusters through time. Term networks for the eleven resulting cluster chains present the cognitive structure of the field. Finally, we provide a view on how much attention the bioinformatics community has devoted to the different subfields through time.
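
As an illustration of the combination step named in the abstract, Fisher's inverse chi-square method merges k independent p-values into the statistic -2 * sum(ln p_i), which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. The sketch below is a generic version of that rule in Python, not the authors' implementation. The abstract does not say how the text-based and citation-based p-values are derived or weighted, so the function name `fisher_combine` and the example inputs are assumptions made for illustration only.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's inverse chi-square method.

    Returns (statistic, combined_p), where statistic = -2 * sum(ln p_i) and
    combined_p is the upper-tail probability of a chi-square distribution
    with 2 * len(p_values) degrees of freedom.
    """
    if not p_values:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")

    k = len(p_values)
    statistic = -2.0 * sum(math.log(p) for p in p_values)

    # For an even number of degrees of freedom (2k), the chi-square survival
    # function has the closed form exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    half = statistic / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    combined_p = math.exp(-half) * total
    return statistic, combined_p


if __name__ == "__main__":
    # Hypothetical inputs: one p-value from a text-based distance and one
    # from a citation-based distance for the same pair of publications.
    statistic, combined_p = fisher_combine([0.05, 0.05])
    print(f"statistic={statistic:.4f}, combined p={combined_p:.4f}")
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the closed form. In a hybrid clustering setting, the combined value could then serve as an integrated dissimilarity between two publications, though how the paper does this is not stated in the abstract.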

Published In

KDD '07: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2007
1080 pages
ISBN:9781595936097
DOI:10.1145/1281192

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cluster chains
  2. Fisher's inverse chi-square method

Qualifiers

  • Article

Conference

KDD07

Acceptance Rates

KDD '07 Paper Acceptance Rate 111 of 573 submissions, 19%;
Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
