Finding cohesive clusters for analyzing knowledge communities

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Documents and authors can be clustered into “knowledge communities” based on the overlap in the papers they cite. We introduce a new clustering algorithm, Streemer, which finds cohesive foreground clusters embedded in a diffuse background, and use it to identify knowledge communities as foreground clusters of papers that share common citations. To analyze the evolution of these communities over time, we build predictive models with features based on the citation structure, the vocabulary of the papers, and the affiliations and prestige of the authors. We find that scientific knowledge communities tend to grow more rapidly if their publications build on diverse information and if they use a narrow vocabulary.
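
As a rough illustration of what "overlap in the papers they cite" can mean, the sketch below scores pairs of papers by the Jaccard similarity of their reference sets. This is not the Streemer algorithm, whose details are not given in the abstract; the choice of Jaccard similarity, the function names, and the toy data are all assumptions made for illustration.

```python
from itertools import combinations


def citation_overlap(refs_a, refs_b):
    """Jaccard similarity between two collections of cited-paper identifiers."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    # Two papers with no references at all are treated as having zero overlap.
    return len(a & b) / len(union) if union else 0.0


def pairwise_overlap(citations):
    """Map each unordered pair of paper ids to its citation overlap."""
    return {
        (p, q): citation_overlap(citations[p], citations[q])
        for p, q in combinations(sorted(citations), 2)
    }


# Hypothetical toy data: paper id -> identifiers of the papers it cites.
citations = {
    "paper_a": {"r1", "r2", "r3"},
    "paper_b": {"r2", "r3", "r4"},
    "paper_c": {"r9"},
}

overlaps = pairwise_overlap(citations)
for pair, score in overlaps.items():
    print(pair, round(score, 2))
```

Under this measure, papers with high mutual overlap would be candidates for the same foreground cluster, while papers such as `paper_c`, which share no citations with the others, would fall into the background.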

Author information

Corresponding author

Correspondence to Lyle H. Ungar.

About this article

Cite this article

Kandylas, V., Upham, S.P. & Ungar, L.H. Finding cohesive clusters for analyzing knowledge communities. Knowl Inf Syst 17, 335–354 (2008). https://doi.org/10.1007/s10115-008-0135-5

  • DOI: https://doi.org/10.1007/s10115-008-0135-5
