A dynamic bibliometric model for identifying online communities

Data Mining and Knowledge Discovery

Abstract

Predictive modelling of online dynamic user-interaction recordings, and community identification from such data, are becoming increasingly important with the widespread use of online communication technologies. Despite the time-dependent nature of the problem, existing approaches to community identification are based on static or fully observed network connections. Here we present a new, dynamic generative model for the inference of communities from a sequence of temporal events produced through online computer-mediated interactions. The distinctive feature of our approach is that it tries to model the process in a more realistic manner, accounting for possible random temporal delays between the intended connections. The inference of these delays from the data then forms an integral part of our state-clustering methodology, so that the most likely communities are found on the basis of the likely intended connections rather than just the observed ones. We derive a maximum likelihood estimation algorithm for the identification of our model, which turns out to be computationally efficient for the analysis of historical data: it scales linearly with the number of non-zero observed (L + 1)-grams, where L is the Markov memory length. We also derive an incremental version of the algorithm, which could be used for real-time analysis. Results obtained on both synthetic and real-world data sets demonstrate that the approach is flexible and able to reveal novel and insightful structural aspects of online interactions. In particular, the analysis of a synchronous Internet relay chat participation sequence spanning a full day reveals the formation of an extremely clear community structure.

Author information

Corresponding author

Correspondence to Xin Wang.

Additional information

Communicated by Chang-shing Perng.

About this article

Cite this article

Wang, X., Kabán, A. A dynamic bibliometric model for identifying online communities. Data Min Knowl Disc 16, 67–107 (2008). https://doi.org/10.1007/s10618-007-0081-y

  • DOI: https://doi.org/10.1007/s10618-007-0081-y
