Elsevier

Future Generation Computer Systems

Volume 91, February 2019, Pages 361-370
Combination of links and node contents for community discovery using a graph regularization approach

https://doi.org/10.1016/j.future.2018.08.009

Abstract

With the rapid growth of networked data, community detection is drawing increasing attention from researchers. A number of algorithms have been proposed, and some have been applied successfully in research fields such as recommendation systems and information retrieval. Traditionally, community detection methods mainly use knowledge of the topological structure, which contains the most important clues for finding potential groups or communities. However, a wealth of content information exists on the nodes of real-world networks and may help community detection. To address this, we introduce a novel community detection method under the framework of nonnegative matrix factorization (NMF). To incorporate both links and node contents, we adopt the idea that two nodes with similar content are most likely to belong to the same community; that is, we employ a graph regularization term to penalize the dissimilarity of nodes as expressed by their community memberships. In addition, we introduce an intuitive manifold learning strategy, K-nearest-neighbor consistency, to recover the intrinsic geometrical structure of the content information. We also found that this framework still has a drawback because it does not consider the heterogeneous distribution of node degrees. This heterogeneous distribution can affect the function of graph regularization and isolates the original community memberships. We first propose node popularities that satisfy the above interpretation and then develop a new NMF-based model, named Combination of Links and Node Contents for Community Discovery (CLNCCD). Experiments on both artificial and real-world networks, compared with state-of-the-art methods, show that the new model obtains a significant improvement in community detection by incorporating node contents effectively.

Introduction

The development of social networking has produced a variety of massive data, such as user relation networks, product reviews and online comments. To analyze such complex networked data, researchers try to gain a first view of the data by finding the latent community structure. A community can be regarded as a group of nodes that are densely connected to one another and sparsely connected to nodes in other groups. Discovering communities of “similar” users is very important and has been applied in many areas, e.g., sociology, biology and computer science. For instance, in biology, different units belonging to an organization have related functions and are linked in a particular structure to produce the overall effect of the organization: the interaction among a set of proteins in a cell can form an RNA polymerase for the transcription of genes [1]. As Sachan et al. [2] described, finding the salient communities in an organization of people can provide guidance that is helpful for web marketing, for predicting the behavior of users belonging to a community, and for understanding the function of complex systems.
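As a small illustration of the "dense inside, sparse between" notion of a community described above, the following sketch compares the fraction of connected node pairs within groups to the fraction across groups. The function name `edge_densities` and the toy graph (two triangles joined by a single bridge edge) are assumptions made for this example only and do not come from the paper.

```python
import numpy as np

def edge_densities(A, labels):
    """Return (internal, external) edge density of an undirected 0/1 graph.

    internal: fraction of same-community node pairs that are linked.
    external: fraction of different-community node pairs that are linked.
    """
    A = np.asarray(A)
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]
    upper = np.triu_indices(A.shape[0], k=1)  # count each unordered pair once
    linked = A[upper] > 0
    same_pairs = same[upper]
    return linked[same_pairs].mean(), linked[~same_pairs].mean()

# Hypothetical toy graph: two triangles connected by one bridge edge (2, 3).
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = [0, 0, 0, 1, 1, 1]

internal, external = edge_densities(A, labels)
print(internal, external)
```

A partition matches the informal definition when the internal density is much larger than the external one, as it is for this toy graph.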

The data described above are often treated as a graph in which nodes denote units or users in real-world networks, and the relations among units or users are represented by directed or undirected links connecting those nodes. To discover communities in the graph, many earlier methods [3], [4], [5] considered network topology alone. Besides linked graphs, there are also datasets with content information alone, such as tweets and ImageNet, and researchers have presented many data clustering algorithms [6], [7], [8] for analyzing these datasets, which usually use node contents to perform clustering. In fact, linkage and content information exist simultaneously in real-world networks. For example, in social networks, the content information may be tweets, user profiles or the text on web pages. In citation networks, node content may be the abstract of a paper or the affiliation information of its authors. An algorithm that uses links or node contents alone is therefore often limited in its ability to mine the important information about community structure from an attributed graph, which consists of network topology and node contents. For example, methods that cluster using node content information have difficulty processing data points that have no content. Methods that consider the links between data points in addition to their contents may be able to overcome this difficulty. Conversely, node contents may provide guidance for discovering communities when only a few links exist in the network.

Recently, several methods [2], [9], [10], [11], [12] that combine topological and content information have been proposed. For example, Ruan et al. [12] regarded node contents as another type of edge, constructed a new graph whose edges consist of the original links and the links ‘simplified’ by node contents, and then partitioned the simplified graph into clusters. Approaches based on a generative framework [9] statistically model network topology and node contents separately, with the main idea that both kinds of information are ‘generated’ by communities. However, these methods still have some drawbacks. First, when they consider the content information in networks, the similarity between node contents is often measured in Euclidean space. The space of the content information, however, may be a manifold, i.e., a geometrical space that locally resembles Euclidean space near each point. In this scenario, it is not suitable to measure similarity based only on Euclidean space, so methods that ignore the manifold structure of node contents often have limited applicability for community detection. Second, current generative frameworks usually cannot model the unbalanced degrees of nodes, and thus the community memberships alone are insufficient to model the link–link patterns. In short, existing community detection methods mainly focus on combining network topology and node contents, but their performance still has room for improvement because they often do not consider the manifold structure intrinsically held in the space of content information and often ignore the impact of the heterogeneity of node degrees in the network topology.

In this paper, we introduce and develop a nonnegative matrix factorization (NMF) method for community detection that models links with graph regularization in order to overcome the shortcomings of existing approaches. The proposed model consists of two sub-models. The input of the first sub-model is the adjacency matrix, which describes the topological information, and the community memberships of nodes are obtained by factorizing this adjacency matrix. Next, it is intuitive that if two nodes are similar in terms of content information, they are most likely to belong to the same community, and thus we can make their community memberships similar, and vice versa. We implement this idea to construct the second sub-model using spectral clustering. We then adopt a graph regularization term to add the second sub-model to the first, which combines the linkage and node content information. The unified model not only combines node contents with network topology but also considers the manifold structure of the content information when constructing the Laplacian matrix, which improves community discovery. Last but not least, to reduce the impact of unbalanced node degrees on graph regularization, we incorporate node popularities into this unified model to enhance performance. In summary, the proposed method integrates the topology and content of networks in a simple, unified way that considers the manifold structure of content information and the heterogeneity of node degrees simultaneously. We name this method Combination of Links and Node Contents for Community Discovery (CLNCCD). The procedure of the proposed method, that is, how content information is incorporated, is shown in Fig. 1.
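The general idea of factorizing the adjacency matrix while a content-based Laplacian pulls similar nodes toward similar memberships can be sketched as follows. This is not the CLNCCD model itself: it swaps in the standard graph-regularized NMF objective ||A - W H^T||_F^2 + lam * tr(H^T L H) with its usual multiplicative updates, builds the content graph from cosine K-nearest neighbors, and omits node popularities. The function names, the parameter `lam`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def knn_graph(F, k):
    """Symmetric 0/1 K-nearest-neighbor graph from cosine similarity of rows of F."""
    F = np.asarray(F, dtype=float)
    Fn = F / np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)
    sim = Fn @ Fn.T
    np.fill_diagonal(sim, -np.inf)            # a node is not its own neighbor
    n = F.shape[0]
    nearest = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar nodes
    S = np.zeros((n, n))
    S[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(S, S.T)                 # keep an edge if either endpoint chose it

def graph_regularized_nmf(A, S, n_comm, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Multiplicative updates for A ~ W H^T with a Laplacian penalty on H."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    W = rng.random((n, n_comm))
    H = rng.random((n, n_comm))               # rows of H act as community memberships
    D = np.diag(S.sum(axis=1))                # degree matrix of the content graph
    for _ in range(n_iter):
        W *= (A @ H) / (W @ (H.T @ H) + eps)
        H *= (A.T @ W + lam * (S @ H)) / (H @ (W.T @ W) + lam * (D @ H) + eps)
    return W, H

def objective(A, S, W, H, lam):
    """Reconstruction error plus the graph regularization term tr(H^T L H)."""
    L = np.diag(S.sum(axis=1)) - S
    return np.linalg.norm(A - W @ H.T) ** 2 + lam * np.trace(H.T @ L @ H)

# Hypothetical toy data: two triangles joined by a bridge, with matching content.
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0
F = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]], dtype=float)

S = knn_graph(F, k=2)
W, H = graph_regularized_nmf(A, S, n_comm=2, lam=0.5)
print(H.argmax(axis=1))  # hard community assignment per node
```

The penalty tr(H^T L H) equals half the sum of S_ij * ||h_i - h_j||^2 over all node pairs, so nodes linked in the content graph are pushed toward similar membership rows; `lam` controls how strongly content influences the result relative to the links.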

The rest of this article is organized as follows. Section 2 briefly reviews related work. Section 3 describes the proposed model and methods. Section 4 presents the experimental evaluation and discusses how to select the parameters. Finally, Section 5 discusses the results and draws conclusions.

Section snippets

Related work

Many different types of information can be used to find communities. In this article, we survey community detection algorithms that consider topology alone, content alone, or both.

Conventional methods mainly explore the topology of networks to achieve the goal of community detection, and the most popular community detection algorithms are: Louvain algorithm [3] which is a state-of-the-art modularity-based approach, InfoMap algorithm [4] which

Framework of community detection using a graph regularization method

In general, a network with content information can be described as an attributed graph $G = (V, E, F)$, where $V$ is the set of $N$ vertices $\{v_1, \ldots, v_N\}$, $E$ is the set of edges, each of which connects two nodes in $V$, and $F$ is the set of content on the nodes. We focus only on discovering communities in an undirected and unweighted graph. Based on the above, the adjacency matrix of $G$ can be denoted by a nonnegative symmetric binary matrix $A = [a_{ij}] \in \mathbb{R}_+^{N \times N}$, in which $a_{ij}$
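A minimal sketch of how such a nonnegative, symmetric, binary adjacency matrix could be built from an edge list is given below. It assumes the common convention that an entry is 1 when an edge joins the two nodes and 0 otherwise, and it drops self-loops; the function name `adjacency_matrix` and the toy edge list are hypothetical.

```python
import numpy as np

def adjacency_matrix(n_nodes, edges):
    """Build a symmetric 0/1 adjacency matrix for an undirected, unweighted graph."""
    A = np.zeros((n_nodes, n_nodes), dtype=float)
    for i, j in edges:
        if i == j:
            continue          # assumption: self-loops are ignored
        A[i, j] = 1.0
        A[j, i] = 1.0         # undirected graph, so mirror every edge
    return A

# Hypothetical 4-node path graph; the self-loop (2, 2) is dropped.
A = adjacency_matrix(4, [(0, 1), (1, 2), (2, 3), (2, 2)])
print(A)
```

Because every edge is mirrored, each undirected edge contributes 2 to the sum of all entries, and the row sums give the node degrees.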

Experimental results

We conduct comparative experiments with our method and the baseline methods on two types of datasets, i.e., artificial networks and real-world networks. The artificial networks test the ability of our algorithm to discover communities when the networks and their communities are under control, while the experiments on real-world networks show whether the proposed algorithm is suitable for real applications.

Conclusions

In this paper, we propose a novel NMF-based method for community detection that considers both linkage and content information. The proposed method provides a novel view of communities because it characterizes the nature of the combination of the two types of information more effectively. Based on a comparison with several state-of-the-art methods, we summarize the following contributions of our method. First, we find that the space of the content information is not a Euclidean structure, but a complex

Acknowledgments

This work was supported by the National Basic Research Program of China (2013CB329301), the National Key R&D Program of China (2017YFC0820106), the Natural Science Foundation of China (61502334, 61772361) and the Technology Research and Development Program of Tianjin, China (15ZXHLGX00130).

References (28)

  • Oja E., Principal components, minor components, and linear neural networks, Neural Netw. (1992)
  • He D. et al., Identification of hybrid node and link communities in complex networks, Sci. Rep. (2015)
  • Sachan M. et al., Using content and interactions for discovering communities in social networks
  • Li Y. et al., Uncovering the small community structure in large networks: A local spectral approach
  • Jia Songwei et al., Defining and identifying cograph communities in complex networks, New J. Phys. (2015)
  • Fanuel M. et al., Magnetic eigenmaps for community detection in directed networks, Phys. Rev. E (2017)
  • Rodriguez A. et al., Clustering by fast search and find of density peaks, Science (2014)
  • Hao F. et al., K-clique community detection in social networks based on formal concept analysis, IEEE Syst. J. (2017)
  • Hofmann T., Probabilistic latent semantic indexing
  • Yang J. et al., Community detection in networks with node attributes
  • D. He, Z. Feng, D. Jin, X. Wang, W. Zhang, Joint identification of network communities and semantics via integrative...
  • Wang X. et al., Semantic community identification in large attribute networks
  • Ruan Y. et al., Efficient community detection in large networks using content and links
  • Su C. et al., A new random-walk based label propagation community detection algorithm

    Jinxin Cao received his B.S. degree from Shandong Normal University, China, in 2010. Since 2011, he has been a post-graduate and Ph.D. joint program student in the School of Computer Science and Technology at Tianjin University, China. His research interests include data mining and analysis of complex networks.

    Hongcui Wang received her B.S. degree in Computer Science and Technology in 2003 from Shandong University, and her M.S. in Software Application in 2006 from the Institute of Computing Technology, Chinese Academy of Sciences. She received her Ph.D. in Intelligent Information Processing in 2009 from Kyoto University, Japan. Since 2010, she has been a lecturer at Tianjin University. Her research interests are Speaker Recognition, Multilingual Speech Recognition, and Speech Perception Mechanism.

    Di Jin received his B.S., M.B. and Ph.D. from the College of Computer Science and Technology, Jilin University, in 2005, 2008 and 2012, respectively. He is an associate professor at Tianjin University. His current research interests include data mining, analysis of complex networks, and machine learning.

    Jianwu Dang graduated from Tsinghua Univ., China, in 1982, and received his M.S. from the same university in 1984. He worked for Tianjin Univ. as a lecturer from 1984 to 1988. He was awarded the Ph.D. from Shizuoka Univ., Japan, in 1992. Since 2001, he has been with the Japan Advanced Institute of Science and Technology (JAIST). His research interests are in all the fields of speech production, speech synthesis, and speech cognition.

    1

    These authors contributed to the work equally and should be regarded as co-first authors.
