Combination of links and node contents for community discovery using a graph regularization approach

doi:10.1016/j.future.2018.08.009

Future Generation Computer Systems

Volume 91, February 2019, Pages 361-370

https://doi.org/10.1016/j.future.2018.08.009

Abstract

With the rapid growth of networked data, community detection is drawing increasing attention from researchers. A number of algorithms have been proposed, and some have been applied successfully in research fields such as recommendation systems and information retrieval. Traditionally, community detection methods mainly use the topological structure, which contains the most important clues for finding potential groups or communities. However, a wealth of content information exists on the nodes of real-world networks and may help community detection. To exploit it, we introduce a novel community detection method under the framework of nonnegative matrix factorization (NMF). To combine links and node contents, we adopt the idea that two nodes with similar content are most likely to belong to the same community; that is, we employ a graph regularization term that penalizes dissimilarity between the community memberships of such nodes. In addition, we introduce an intuitive manifold learning strategy, K-nearest-neighbor consistency, to recover the intrinsic geometrical structure of the content information. We also found that this framework still has a drawback: it does not consider the heterogeneous distribution of node degrees. This heterogeneous distribution can affect the function of graph regularization and isolate the original community memberships. We therefore propose node popularities that account for this effect and develop a new NMF-based model, named Combination of Links and Node Contents for Community Discovery (CLNCCD). Experiments on both artificial and real-world networks, compared with state-of-the-art methods, show that the new model obtains a significant improvement in community detection by incorporating node contents effectively.

Introduction

The development of social networking has produced a variety of massive data, such as user relation networks, product reviews and online comments. To analyze such complex networked data, researchers try to obtain a first view of the data by finding the latent community structure. A community can be regarded as a group of nodes that are densely connected to each other and sparsely connected to nodes in other groups. Discovering communities consisting of “similar” users is very important and has been applied in many areas, e.g., sociology, biology and computer science. For instance, in biology, different units belonging to an organization have related functions and are linked in a particular structure to produce the overall effect of the organization: the interaction among a set of proteins in a cell can form an RNA polymerase for the transcription of genes [1]. As Sachan et al. [2] described, finding the salient communities in an organization of people can provide guidance for web marketing, for predicting the behavior of users belonging to a community, and for understanding the function of complex systems.

The data described above are often treated as a graph in which nodes denote units or users in real-world networks, and the relations among them are represented by directed or undirected links connecting those nodes. To discover communities in the graph, many earlier methods [3], [4], [5] considered network topology alone. Besides linked graphs, there are also datasets with content information alone, such as tweets and ImageNet. Researchers have presented many data clustering algorithms [6], [7], [8] for analyzing these datasets, which usually use node contents to perform clustering. In fact, linkage and content information exist simultaneously in real-world networks. For example, in social networks, the content information may be tweets, user profiles or the text on web pages. In citation networks, node content may be the abstract of a paper or the affiliation information of its authors. An algorithm that uses links or node contents alone is therefore often limited in its ability to mine important information about the community structure of an attributed graph, which consists of network topology and node contents. For example, clustering methods that rely on node content have difficulty processing data points without content. Methods that consider the links between data points in addition to their contents may overcome this difficulty. Conversely, node contents can guide community discovery when there are only a few links in the network.

Recently, several methods [2], [9], [10], [11], [12] that combine topological and content information have been proposed. For example, Ruan et al. [12] regarded node contents as another type of edge: they constructed a new graph in which the edges consist of the original links and the links ‘simplified’ by node contents, and then partitioned the simplified graph into clusters. Approaches based on a generative framework [9] statistically model network topology and node contents separately, with the main idea that both kinds of information are ‘generated’ by communities. However, these methods still have drawbacks. First, when they consider content information, the similarity between node contents is often measured in Euclidean space. The space of the content information, however, may be a manifold, that is, a geometrical space that locally resembles Euclidean space near each point. In this scenario, measuring similarity in Euclidean space alone is unsuitable, so methods that ignore the manifold structure of node contents have limited applicability for community detection. Second, current generative frameworks usually cannot model the unbalanced degrees of nodes, and thus the community memberships themselves are insufficient to model the link–link patterns. In short, existing community detection methods focus mainly on combining network topology and node contents, but they leave room for improvement because they often neglect the manifold structure intrinsic to the space of content information and the impact of heterogeneous node degrees in the network topology.
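To make the manifold argument concrete, the following is a minimal sketch of how a K-nearest-neighbor content graph and its Laplacian might be built, so that only locally similar nodes are linked. It is an illustration under stated assumptions, not the construction used in the paper: the function name `knn_content_graph`, the use of cosine similarity, the 0/1 edge weights and the "either direction" symmetrization rule are all hypothetical choices made here.

```python
import numpy as np

def knn_content_graph(F, k):
    """Build a symmetric k-nearest-neighbor graph from content vectors.

    F : (N, d) array with one content vector per node (e.g. bag-of-words).
    Returns (S, L): the 0/1 similarity matrix S and the Laplacian L = D - S.
    Cosine similarity and binary weights are assumptions of this sketch.
    """
    F = np.asarray(F, dtype=float)
    norms = np.linalg.norm(F, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # nodes without content get zero similarity
    X = F / norms
    sim = X @ X.T                      # pairwise cosine similarity
    np.fill_diagonal(sim, -np.inf)     # a node is never its own neighbor

    n = F.shape[0]
    S = np.zeros((n, n))
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # indices of the k most similar nodes
    rows = np.repeat(np.arange(n), k)
    S[rows, nbrs.ravel()] = 1.0
    S = np.maximum(S, S.T)             # link i and j if either lists the other

    L = np.diag(S.sum(axis=1)) - S     # unnormalized graph Laplacian
    return S, L
```

Keeping only the k nearest neighbors means that similarity is trusted only locally, which is where a manifold resembles Euclidean space; distant pairs contribute nothing to the Laplacian.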

In this paper, we develop a nonnegative matrix factorization (NMF) method for community detection that models links with graph regularization in order to overcome the shortcomings of existing approaches. The proposed model consists of two sub-models. The first sub-model takes as input the adjacency matrix, which describes the topological information, and obtains the community memberships of nodes by factorizing this matrix. The second sub-model rests on the intuition that if two nodes are similar in terms of content information, they are most likely to belong to the same community, so their community memberships should be similar, and vice versa. We implement this idea using spectral clustering. We then add the second sub-model to the first through a graph regularization term, which combines the linkage and node content information. The unified model not only combines node contents with network topology but also considers the manifold structure of the content information when constructing the Laplacian matrix, which improves community discovery. Last but not least, to reduce the impact of unbalanced node degrees on graph regularization, we incorporate node popularities into the unified model to enhance its performance. The resulting method integrates network topology and content in a simple, unified way that considers the manifold structure of content information and the heterogeneity of node degrees simultaneously. We name this method Combination of Links and Node Contents for Community Discovery (CLNCCD). The way the proposed method incorporates content information is shown in Fig. 1.
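The general idea of attaching a graph regularization term to an NMF of the adjacency matrix can be sketched as follows. This excerpt does not give the CLNCCD objective or its update rules, so the sketch swaps in the standard graph-regularized NMF formulation, minimizing $\|A - UV^{T}\|_{F}^{2} + \lambda\,\mathrm{tr}(V^{T} L V)$ with multiplicative updates, and omits node popularities entirely. The names `graph_regularized_nmf`, `gnmf_objective`, `lam` and `n_communities` are hypothetical, and `S` is assumed to be a symmetric nonnegative content-similarity matrix with Laplacian $L = D - S$.

```python
import numpy as np

def graph_regularized_nmf(A, S, n_communities, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Generic graph-regularized NMF (not the paper's exact model).

    A : (N, N) nonnegative adjacency matrix (network topology).
    S : (N, N) symmetric nonnegative content-similarity matrix.
    Returns nonnegative factors U, V with A ~ U @ V.T; the rows of V
    are read as soft community memberships.
    """
    A = np.asarray(A, dtype=float)
    S = np.asarray(S, dtype=float)
    n = A.shape[0]
    D = np.diag(S.sum(axis=1))         # degree matrix of the content graph
    rng = np.random.default_rng(seed)
    U = rng.random((n, n_communities))
    V = rng.random((n, n_communities))
    for _ in range(n_iter):
        # Multiplicative updates keep both factors nonnegative.
        U *= (A @ V) / (U @ (V.T @ V) + eps)
        V *= (A.T @ U + lam * (S @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V

def gnmf_objective(A, S, U, V, lam):
    """Reconstruction error plus the graph regularization penalty."""
    L = np.diag(S.sum(axis=1)) - S
    return np.linalg.norm(A - U @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)
```

The penalty $\mathrm{tr}(V^{T} L V)$ equals $\tfrac{1}{2}\sum_{i,j} s_{ij}\,\|v_{i} - v_{j}\|^{2}$, so it grows whenever two nodes with similar content receive different memberships; `lam` controls how strongly content is allowed to override the link structure. A hard partition could then be read off with `V.argmax(axis=1)`, although that step is also an assumption of this sketch.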

The rest of this article is organized as follows. Section 2 briefly reviews the related work. Section 3 describes the proposed model and methods. Section 4 presents the experimental evaluation and discusses how to select the parameters. Finally, Section 5 discusses the results and draws some conclusions.

Section snippets

Related work

Many different types of information can be applied to finding communities. In this article, we survey community detection algorithms that use topology alone, content alone, or both.

Conventional methods mainly explore the topology of networks to achieve the goal of community detection, and the most popular community detection algorithms are: Louvain algorithm [3] which is a state-of-the-art modularity-based approach, InfoMap algorithm [4] which

Framework of community detection using a graph regularization method

In general, a network with content information can be described as an attributed graph $G = (V, E, F)$, where $V$ is the set of $N$ vertices $\{v_{1}, \dots, v_{N}\}$, $E$ is the set of edges, each connecting two nodes in $V$, and $F$ is the set of content on the nodes. We focus only on discovering communities in an undirected and unweighted graph. Accordingly, the adjacency matrix of $G$ can be denoted by a nonnegative symmetric binary matrix $A = \{a_{ij}\} \in \mathbb{R}_{+}^{N \times N}$ in which $a_{ij}$
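As a small illustration of this notation, the sketch below builds the symmetric binary adjacency matrix of an undirected, unweighted graph from an edge list. It assumes the usual convention that $a_{ij} = 1$ when nodes $v_{i}$ and $v_{j}$ are linked and $0$ otherwise; the helper name `adjacency_matrix`, the 0-based node indices and the choice to skip self-loops are assumptions made here, not details taken from the paper.

```python
import numpy as np

def adjacency_matrix(n_nodes, edges):
    """Return the N x N symmetric 0/1 adjacency matrix of an undirected graph.

    edges : iterable of (i, j) pairs with 0-based node indices.
    Duplicate edges collapse to a single 1; self-loops are ignored
    (an assumption of this sketch).
    """
    A = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        if i == j:
            continue                   # skip self-loops
        A[i, j] = 1.0
        A[j, i] = 1.0                  # undirected: mirror every edge
    return A

# Node degrees are the row sums; their spread across nodes is the
# degree heterogeneity that the text refers to.
A = adjacency_matrix(4, [(0, 1), (0, 2), (0, 3)])
degrees = A.sum(axis=1)
```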

Experimental results

We conduct comparative experiments between our method and the baseline methods on two types of datasets: artificial networks and real-world networks. The experiments on artificial networks test the ability of our algorithm to discover communities when the networks and their communities are under control, while the experiments on real-world networks show whether the proposed algorithm is suitable for real applications.

Conclusions

In this paper, we propose a novel NMF-based method for community detection that considers both linkage and content information. The proposed method provides a novel view of communities because it characterizes the combination of the two types of information more effectively. Based on the comparison with several state-of-the-art methods, we summarize the following contributions of our method. First, we find that the space of the content information is not a Euclidean structure, but a complex

Acknowledgments

The work was supported by National Basic Research Program of China (2013CB329301), National Key R&D Program of China (2017YFC0820106), Natural Science Foundation of China (61502334, 61772361) and the Technology Research and Development Program of Tianjin, China (15ZXHLGX00130).

Jinxin Cao received his B.S. degree from Shandong Normal University, China, in 2010. Since 2011, he has been a post-graduate and Ph.D. joint program student in the School of Computer Science and Technology at Tianjin University, China. His research interests include data mining and analysis of complex networks.

References (28)

Oja E., Principal components, minor components, and linear neural networks, Neural Netw. (1992)
He D. et al., Identification of hybrid node and link communities in complex networks, Sci. Rep. (2015)
Sachan M. et al., Using content and interactions for discovering communities in social networks
Li Y. et al., Uncovering the small community structure in large networks: A local spectral approach
Jia Songwei et al., Defining and identifying cograph communities in complex networks, New J. Phys. (2015)
Fanuel M. et al., Magnetic eigenmaps for community detection in directed networks, Phys. Rev. E (2017)
Rodriguez A. et al., Clustering by fast search and find of density peaks, Science (2014)
Hao F. et al., K-clique community detection in social networks based on formal concept analysis, IEEE Syst. J. (2017)
Hofmann T., Probabilistic latent semantic indexing
Yang J. et al., Community detection in networks with node attributes
D. He, Z. Feng, D. Jin, X. Wang, W. Zhang, Joint identification of network communities and semantics via integrative...
Wang X. et al., Semantic community identification in large attribute networks
Ruan Y. et al., Efficient community detection in large networks using content and links
Su C. et al., A new random-walk based label propagation community detection algorithm

Hongcui Wang received her B.S. degree in Computer Science and Technology in 2003 from Shandong University, and her M.S. in Software Application in 2006 from the Institute of Computing Technology, Chinese Academy of Sciences. She received her Ph.D. in Intelligent Information Processing in 2009 from Kyoto University, Japan. Since 2010, she has been a lecturer at Tianjin University. Her research interests are Speaker Recognition, Multilingual Speech Recognition, and Speech Perception Mechanism.

Di Jin received his B.S., M.B. and Ph.D. from the College of Computer Science and Technology, Jilin University, in 2005, 2008 and 2012, respectively. He has been an associate professor at Tianjin University. His current research interests include data mining, analysis of complex networks, and machine learning.

Jianwu Dang graduated from Tsinghua Univ., China, in 1982, and received his M.S. from the same university in 1984. He worked for Tianjin Univ. as a lecturer from 1984 to 1988. He was awarded the Ph.D. from Shizuoka Univ., Japan in 1992. Since 2001, he has been with Japan Advanced Institute of Science and Technology (JAIST). His research interests are in all the fields of speech production, speech synthesis, and speech cognition.

¹: These authors contributed to the work equally and should be regarded as co-first authors.
