A novel parallel distance metric-based approach for diversified ranking on large graphs

doi:10.1016/j.future.2018.05.031

Future Generation Computer Systems

Volume 88, November 2018, Pages 79-91

https://doi.org/10.1016/j.future.2018.05.031

Highlights

• A generalized distance metric, based on a subadditive set function over the symmetric difference of the neighbor sets of a pair of nodes, is introduced to capture pairwise dissimilarity between nodes.
• Diversified ranking on graphs (DRG) is formulated as a Max-Sum k-dispersion problem with metric edge weights.
• A centralized linear-time 2-approximation algorithm, GA, is developed to solve DRG.
• A highly parallelizable algorithm is developed for DRG; it can be easily implemented in MapReduce-style parallel computation models using GA as a basic reducer.

Abstract

Diversified ranking on graphs ($DRG$) is an important and challenging problem in graph data mining. Traditionally, this problem is modeled with a submodular optimization objective and solved by cardinality-constrained monotone submodular maximization. However, existing submodular objectives do not directly capture the dissimilarity between pairs of nodes, and most existing algorithms cannot easily exploit a distributed cluster computing platform, such as Spark, to improve their efficiency. To overcome these deficiencies, this paper introduces a generalized distance metric, based on a subadditive set function over the symmetric difference of the neighbor sets of a pair of nodes, to capture pairwise dissimilarity between nodes. With this metric, $DRG$ is formulated as a Max-Sum $k$-dispersion problem with metric edge weights, which is NP-hard. A centralized linear-time 2-approximation algorithm, GA, is then developed to solve $DRG$. Moreover, we develop a highly parallelizable algorithm for $DRG$ that can be easily implemented in MapReduce-style parallel computation models using GA as a basic reducer. Finally, extensive experiments on real network datasets verify the effectiveness and efficiency of the proposed approaches.

Introduction

In recent years, with the support of Web 2.0 technology, social network applications such as Facebook and WeChat have developed rapidly and have accumulated massive amounts of data. To process and analyze these social network data effectively, many researchers have developed data analysis methods and technology platforms [[1], [2]].

Graph data is an important type of social network data. In general, graph data is composed of nodes and edges, with no total ordering over the nodes. Thus, graph ranking, which aims to bring nodes into a total order, is one of the fundamental tasks of graph data analysis [3]. PageRank [4] is a popular measure of the importance of nodes in a graph. The basic idea of PageRank is that a node has a higher rank if it is linked to by more high-ranking nodes. PageRank therefore measures the global importance of nodes in a graph; relevance to an individual query or user is not well reflected by this measure. To overcome this shortcoming, personalized PageRank (PPR) and its variations [[4], [5]] were proposed to find the nodes in a graph that are most relevant to a query or user. However, as discussed in [[6], [7]], the top-k ranking list found by PPR may contain highly similar nodes, because PPR focuses only on the query relevance of nodes. This reduces ranking effectiveness, especially when the query intent of the user is uncertain or ambiguous.
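The idea behind personalized PageRank described above can be illustrated with a short power-iteration sketch. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `personalized_pagerank`, the damping value `alpha = 0.85`, the fixed iteration count, and the small path graph are all hypothetical choices made for the example.

```python
def personalized_pagerank(adj, seeds, alpha=0.85, iters=100):
    """Power iteration for personalized PageRank (illustrative sketch).

    adj   -- dict mapping every node to a list of its out-neighbors
    seeds -- set of query nodes; the walk restarts uniformly on these
    alpha -- probability of following a link (assumed value, not from the paper)
    """
    nodes = list(adj)
    restart = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # With probability 1 - alpha the walk jumps back to a seed node.
        nxt = {v: (1.0 - alpha) * restart[v] for v in nodes}
        for u in nodes:
            out = adj[u]
            if out:
                # Spread the remaining mass evenly over the out-links.
                share = alpha * rank[u] / len(out)
                for v in out:
                    nxt[v] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for v in nodes:
                    nxt[v] += alpha * rank[u] * restart[v]
        rank = nxt
    return rank


# Hypothetical undirected path 1 - 2 - 3 - 4, queried from node 1.
path = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
scores = personalized_pagerank(path, {1})
```

In this sketch the scores form a probability distribution, and nodes near the seed score higher than nodes far from it, which is why a top-k list ranked by these scores alone tends to cluster around the query.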

Diversity has been widely recognized and studied as a way of addressing uncertainty and ambiguity in information retrieval applications [8]. To provide more desirable graph ranking results, graph ranking algorithms have to find a tradeoff solution between query relevance and diversification. Consequently, a considerable amount of work on diversified ranking on graphs ( $DRG$ ) has been undertaken to improve the diversity of top-k ranking results [[9], [10], [11], [12], [13], [14], [15], [16]]. In this paper, we focus on improving the diversity of ranking results based purely on the topological structure of the graph without accounting for information about the link and node.

Li et al. [10] formulated $DRG$ as a bi-criteria optimization problem where the relevance measure is the sum of Personalized PageRank scores, and diversity is measured by the expansion ratio over the ranking results. Küçüktunç et al. [11] presented expanded relevance, which combines both relevance and diversity into a single function as an optimization objective. Expanded relevance is actually a weighted expansion ratio using personalized PageRank scores as the weights of nodes. Both the expansion ratio and expanded relevance resort to a submodular optimization objective and apply the classic cardinality constrained monotone submodular maximization to solve their problems.

However, two issues must be considered to improve the effectiveness and efficiency of $DRG$. First, the key challenge of $DRG$ is to design well-defined diversity measures. Both expansion ratio and expanded relevance rest on the following assumption: two nodes are dissimilar if they do not share common neighbors in the graph, and a set of mutually dissimilar nodes is a set with a large expansion measure. However, neither the expansion ratio nor expanded relevance directly captures dissimilarity between pairs of nodes. In other words, a set of nodes with a high expansion ratio or expanded relevance does not always consist of nodes that are dissimilar to one another. Example 1 illustrates this observation.

Example 1

Let us first review the definition of expansion ratio introduced in [10]. Let $G = \langle V, E \rangle$ be a graph and let $S \subseteq V$ be a subset of nodes. The expansion set of $S$ is denoted by $N(S)$ and defined as $N(S) = S \cup \{v \in V - S \mid \exists u \in S, (u, v) \in E\}$. The expansion ratio of $S$ is then $er(S) = |N(S)| / |V|$. For $\{1, 2\}$ and $\{7, 8\}$ in Fig. 1 we therefore have: $\begin{aligned} er(\{1, 2\}) & = |\{1, 2\} \cup \{3, 4, 5, 6\}| / |\{1, 2, 3, 4, 5, 6, 7, 8\}| = 0.75 \\ er(\{7, 8\}) & = |\{7, 8\} \cup \{4, 6\}| / |\{1, 2, 3, 4, 5, 6, 7, 8\}| = 0.5 \end{aligned}$ Here $\{1, 2\}$ has a higher expansion ratio than $\{7, 8\}$, but nodes 1 and 2 are clearly more similar to each other than nodes 7 and 8 are, since they have more common neighbors (note that, as in [[10], [11]], we assume in this paper that the diversity measure is based purely on the topological structure of the graph, without accounting for information about links and nodes). This illustrates that the expansion ratio does not always capture the dissimilarity between pairs of nodes well. Since expanded relevance can be considered a weighted version of the expansion ratio, it suffers from a similar problem. $■$
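The arithmetic of Example 1 can be checked with a few lines of Python. Fig. 1 is not reproduced in this text, so the edge list below is a hypothetical graph chosen only so that the expansion sets match those stated in the example; the helper names `expansion_set` and `expansion_ratio` are likewise illustrative.

```python
def expansion_set(adj, S):
    """N(S): the nodes of S together with every neighbor of a node in S."""
    result = set(S)
    for u in S:
        result |= adj[u]
    return result


def expansion_ratio(adj, S):
    """er(S) = |N(S)| / |V|."""
    return len(expansion_set(adj, S)) / len(adj)


# Hypothetical edges (assumption): consistent with the sets in Example 1,
# but not necessarily the actual graph of Fig. 1.
edges = [(1, 3), (1, 4), (1, 5), (1, 6),
         (2, 3), (2, 4), (2, 5), (2, 6),
         (7, 4), (8, 6)]
adj = {v: set() for v in range(1, 9)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

er_12 = expansion_ratio(adj, {1, 2})   # 6 / 8
er_78 = expansion_ratio(adj, {7, 8})   # 4 / 8
common_12 = adj[1] & adj[2]            # shared neighbors of nodes 1 and 2
common_78 = adj[7] & adj[8]            # shared neighbors of nodes 7 and 8
```

Under this assumed graph, nodes 1 and 2 share all four of their neighbors while nodes 7 and 8 share none, yet $\{1, 2\}$ receives the higher expansion ratio, which is the mismatch the example points out.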

The second issue is that maximizing a cardinality-constrained monotone submodular function is NP-hard. A $(1 - 1/e)$-approximation greedy algorithm can be used to solve the problem [17]. The greedy algorithm starts with an empty set and, in each iteration, adds the node with the maximum marginal gain to the solution set. For a graph with $m$ nodes, the greedy algorithm needs $O(mk)$ evaluations of marginal gains. On massive graphs, the classic greedy algorithm therefore becomes too expensive to be practical. Although submodularity can be exploited to implement accelerated greedy algorithms, such as CELF [18] or lazy-greedy [19], the accelerated greedy algorithm remains inefficient as the scale of the graph increases, even for small values of $k$. Moreover, approaches based on submodular optimization are essentially sequential procedures and are not well suited to parallel processing. As a result, these algorithms cannot easily exploit a parallel graph processing platform, such as Spark GraphX [20], to improve their efficiency.
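The sequential nature of the classic greedy procedure is easy to see in code. The following sketch maximizes the size of the expansion set, a monotone submodular objective, and counts marginal-gain evaluations; the function name `greedy_expansion`, the tie-breaking by node order, and the toy graph are assumptions made for illustration, not details taken from the paper.

```python
def greedy_expansion(adj, k):
    """Classic greedy for max |N(S)| subject to |S| <= k (illustrative sketch).

    adj -- dict mapping each node to the set of its neighbors
    Returns the chosen nodes, the covered set N(S), and the number of
    marginal-gain evaluations performed.
    """
    chosen, covered = [], set()
    candidates = set(adj)
    evaluations = 0
    for _ in range(min(k, len(adj))):
        best, best_gain = None, -1
        # Every remaining node is re-evaluated in every round: O(m k) overall.
        for v in sorted(candidates):
            evaluations += 1
            gain = len(({v} | adj[v]) - covered)
            if gain > best_gain:
                best, best_gain = v, gain
        chosen.append(best)
        covered |= {best} | adj[best]
        candidates.remove(best)
    return chosen, covered, evaluations


# Hypothetical toy graph: a star centered at 0 plus a separate edge 4 - 5.
toy = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}, 4: {5}, 5: {4}}
picked, covered_nodes, n_evals = greedy_expansion(toy, 2)
```

Each round depends on the set chosen in the previous rounds, so the outer loop cannot be split across workers without changing the algorithm. Lazy variants such as CELF reduce the number of evaluations per round but keep this round-by-round dependency.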

Therefore, a natural question to ask is whether it is possible to present a distance measure to capture directly pairwise dissimilarity between nodes and formulate the diversity of ranking results based on this measure. Or, even better, is it possible to develop efficient algorithms derived from the diversity model that are suitable for parallel implementations?

To this end, this paper first introduces a distance measure that captures pairwise dissimilarity between nodes. Based on this distance metric, $DRG$ is formulated as a Max-Sum $k$-dispersion problem (MSkD) [21]. A centralized linear-time algorithm and a highly parallelizable MapReduce algorithm are then proposed to solve $DRG$.

More specifically, the key contributions in this paper can be summarized as follows:

• A generalized distance metric is introduced to measure various dissimilarities between pairs of nodes. The distance measure is defined by a set function $f$ over the symmetric difference of the neighbor sets of a pair of nodes. If $f$ is a subadditive set function [22] over the set of nodes, the distance measure can be proven to be a metric, which can be exploited to develop efficient algorithms. Furthermore, since many set functions are subadditive, such as the cardinality of a set or the sum of the weights of its nodes, the proposed distance metric is a generalized measure that can capture various dissimilarities between pairs of nodes by choosing different subadditive set functions.
• $DRG$ is formulated as a Max-Sum $k$-dispersion problem. Based on the defined distance metric, $DRG$ can be defined on a weighted complete graph $C(Q)$, where $Q$ is the set of query-dependent nodes for a Personalized PageRank. The weight of the edge between two nodes of $C(Q)$ is the distance between the corresponding nodes in the original graph. $DRG$ is then formulated as an MSkD on $C(Q)$: find a size-$k$ subset of nodes whose induced subgraph has the maximum sum of edge weights.
• A highly parallelizable MapReduce approach with an approximation guarantee is presented to solve $DRG$ on large graphs. Since MSkD is NP-hard [21], a centralized linear-time approximation algorithm, GA, is first proposed to solve $DRG$ by exploiting the metric property of the distance. Using GA as a basic reducer, we further develop a highly parallelizable approach to solve $DRG$ that can be easily implemented in MapReduce-style parallel computation models. This parallel approach also retains an approximation guarantee for $DRG$. To the best of our knowledge, this is the first method that solves $DRG$ in a MapReduce manner with approximation guarantees.
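The first two contributions can be made concrete with a small sketch. It assumes the simplest subadditive choice $f = |\cdot|$ (set cardinality) and open neighborhoods; whether the paper's neighbor sets include the node itself is not stated in this text, so that is an assumption, as are the function names and the toy graph. The exhaustive search is only a reference for tiny inputs and is not the paper's GA.

```python
from itertools import combinations


def distance(adj, u, v, f=len):
    """d(u, v) = f(N(u) symmetric-difference N(v)); f defaults to cardinality."""
    return f(adj[u] ^ adj[v])


def dispersion(adj, S, f=len):
    """Max-Sum objective: sum of pairwise distances inside S."""
    return sum(distance(adj, u, v, f) for u, v in combinations(sorted(S), 2))


def best_k_subset(adj, Q, k, f=len):
    """Exhaustive MSkD over candidate set Q (exponential; tiny inputs only)."""
    return max(combinations(sorted(Q), k), key=lambda S: dispersion(adj, S, f))


# Hypothetical 8-node graph (assumption), given as neighbor sets.
adj = {1: {3, 4, 5, 6}, 2: {3, 4, 5, 6}, 3: {1, 2}, 4: {1, 2, 7},
       5: {1, 2}, 6: {1, 2, 8}, 7: {4}, 8: {6}}

d_12 = distance(adj, 1, 2)   # identical neighborhoods
d_78 = distance(adj, 7, 8)   # disjoint neighborhoods
top_pair = best_k_subset(adj, adj.keys(), 2)
```

With cardinality as $f$, the triangle inequality follows from $A \,\triangle\, C \subseteq (A \,\triangle\, B) \cup (B \,\triangle\, C)$, and on this toy graph nodes 7 and 8 are farther apart than nodes 1 and 2, in line with the intuition of Example 1.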

Extensive experiments are conducted on several representative real-world network datasets, comparing our algorithms with existing approaches under relevance and various diversity measures. The experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.

The remainder of this paper is organized as follows. In Section 2, we review the related literature. In Section 3, we discuss the pairwise distance metric and present the formulation of our problem. The diversified algorithms are introduced in Section 4. In Section 5, we report the experimental results. Finally, we conclude the paper in Section 6.

Section snippets

Diversified ranking on graphs

Ranking nodes is a fundamental issue in the retrieval and mining of graph data. PageRank [4] provides a way to measure the global importance of nodes in a graph. Evolved from the PageRank, topic-sensitive PageRank or personalized PageRank [[5], [23]] can be used to evaluate the relevance scores of nodes in a graph. To address the redundancy problem in the ranking results, considerable work has addressed diversified ranking on graphs [[9], [10], [11], [12], [13], [14], [15], [16]]. In what

Problem formulation

In this section, we first establish a pairwise distance metric that is used to measure the dissimilarity between pairs of nodes. Then, based on this metric, we propose a novel diversified ranking measure and formulate our problem.

Diversified ranking algorithms

As discussed in the previous section, it is NP-hard to solve TopkDRG defined in Eq. (4). In this section, we first propose a 2-approximation algorithm (GA) to solve TopkDRG approximately. Then, a two-round MapReduce algorithm is proposed to solve TopkDRG in parallel using GA as the basic reducer.
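The general shape of a two-round partition-and-merge scheme can be sketched as follows. The details of GA and of the paper's MapReduce algorithm are not included in this text, so everything here is an assumption: `greedy_dispersion` is a simple farthest-insertion heuristic standing in for GA, the round-robin partitioning is arbitrary, and the sketch makes no claim about the approximation guarantee.

```python
def greedy_dispersion(dist, candidates, k):
    """Stand-in reducer (NOT the paper's GA): repeatedly add the candidate
    with the largest total distance to the nodes chosen so far."""
    pool = sorted(candidates)
    chosen = []
    while pool and len(chosen) < k:
        best = max(pool, key=lambda v: sum(dist(v, u) for u in chosen))
        chosen.append(best)
        pool.remove(best)
    return chosen


def two_round_dispersion(dist, candidates, k, num_parts):
    """Round 1: solve each partition independently. Round 2: solve the union
    of the partial solutions with a single reducer."""
    items = sorted(candidates)
    # Map step: round-robin partitioning (hypothetical choice).
    parts = [items[i::num_parts] for i in range(num_parts)]
    # Round-1 reducers: independent of each other, so they could run in
    # parallel; a list comprehension stands in for the cluster here.
    partial = [greedy_dispersion(dist, part, k) for part in parts]
    merged = [v for solution in partial for v in solution]
    # Round-2 reducer: at most k * num_parts candidates remain.
    return greedy_dispersion(dist, merged, k)


# Toy metric (assumption): points on a line with absolute difference.
def line_dist(a, b):
    return abs(a - b)


result = two_round_dispersion(line_dist, range(10), 2, 2)
```

The point of the structure is that the second round sees only a small merged candidate set, so the expensive work is spread over the first-round reducers.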

Experiments

In this section, we evaluate the proposed algorithms experimentally. We first describe the experimental setup in Section 5.1. We report the effects of several parameters which may influence the performance of our algorithms and test the scalability of our algorithms in Section 5.2. Finally, the comparison with existing diversified ranking algorithms is presented in Section 5.3.

Conclusion

In this paper, we present a distance metric-based approach to solve the problem of diversified ranking on graphs. The experimental results on real network datasets demonstrate that our approach achieves relevance and diversification performance comparable to existing methods under existing measures, such as expansion ratio and expanded relevance, while running faster than the existing state-of-the-art methods because it exploits parallel computation. Additionally, for the average distance, which

Acknowledgments

The authors acknowledge the financial support from the following foundations: National Natural Science Foundation of China (61562091, 61472345, 61663046), Natural Science Foundation of Yunnan Province, China (2014FA023, 2016FB110, 2016FB104), Foundation of Backbone Teacher Development of Yunnan University, China (XT412003), Program for Excellent Young Talents of Yunnan University, China (XT412003), and Open Foundation of Key Laboratory of Software Engineering, China, Yunnan Province (2012SE303,

Jin Li received the B.Sc. degree in computer science, the M.Sc. degree in computational mathematics and the Ph.D. degree in telecommunication and information system from Yunnan University in 1998, 2004, and 2012 respectively. He is currently with the National Pilot School of Software, Yunnan University, Kunming, China, as an Associate Professor of Machine learning. His current research interests include machine learning, data mining, social network analysis.

References (36)

Chang V., A cybernetics social cloud, J. Syst. Softw. (2017)
Chang V., A proposed social network analysis platform for big data analytics, Technol. Forecast. Soc. Change (2018)
Hassin R. et al., Approximation algorithms for maximum dispersion, Oper. Res. Lett. (1997)
Yang Y. et al., Bi-weighted ensemble via HMM-based approaches for temporal data clustering, Pattern Recognit. (2018)
Yang Y. et al., HMM-based hybrid meta-clustering ensemble for temporal data, Knowl. Based Syst. (2014)
Getoor L. et al., Introduction to the special issue on link mining. A survey, ACM SIGKDD Explor. Newslett. (2005)
Page L. et al., The pagerank citation ranking: bringing order to the web, Stanford Digital Libraries Working Paper (1998)
Haveliwala T.H., Topic-sensitive pagerank
Q. Mei, J. Guo, D.R. Radev, Divrank: The interplay of prestige and diversity in information networks, in: Acm Sigkdd...
X. Zhu, A.B. Goldberg, J.V. Gael, D. Andrzejewski, Improving in diversity in ranking using absorbing random walks, in:...
K. Zheng, H. Wang, Z. Qi, et al., A survey of query result diversification, in: Knowledge & Information Systems, 2016,...
Tong H. et al., Diversified ranking on large graphs: an optimization viewpoint
Li R.H. et al., Scalable diversified ranking on large graphs, IEEE Trans. Knowl. Data Eng. (2013)
Küçüktunç O. et al., Diversified recommendation on graphs: Pitfalls, measures, and algorithms
Küçüktunç O. et al., Diversifying citation recommendations, ACM Trans. Intell. Syst. Technol. (TIST) (2015)
A. Dubey, S. Chakrabarti, C. Bhattacharyya, Diversity in ranking via resistive graph centers, in: KDD, 2011, pp....
Mottin D. et al., Graph query reformulation with diversity
L. Yuan, L. Qin, X. Lin, L. Chang, W. Zhang, Diversified top-k clique search, in: 31st IEEE international conference on...


Yun Yang received the B.Sc. (Hons.) degree in information technology and telecommunication from Lancaster University, Lancaster, U.K., in 2004, the M.Sc. degree in advanced computing from Bristol University, Bristol, U.K., in 2005, and the M.Phil. degree in informatics and the Ph.D. degree in computer science from the University of Manchester, Manchester, U.K., in 2006 and 2011, respectively. He was a Research Fellow with the University of Surrey, Surrey, U.K., from 2012 to 2013. He is currently with the National Pilot School of Software, Yunnan University, Kunming, China, as a full Professor of Machine learning. His current research interests include machine learning, data mining, pattern recognition and temporal data process and analysis.

Xiaoling Wang received the bachelors, masters, and doctoral degrees from Southeast University in 1997, 2000, and 2003, respectively. She is currently a professor, vice dean in Software Engineering Institute, East China Normal University. She achieved the Programs of New-Century Talent of Ministry of Education of China. She is a member of China Computer Federation Technical Committee on Databases. She has published more than 100 papers and some papers were published in international conferences and journals such as SIGMOD, WWW, SIGIR, AAAI, IJCAI, CIKM, DASFAA and ICWS. Her research interests mainly include web data management, data mining and data service technology. She is a member of the IEEE.

Zhiming Zhao obtained his Ph.D. in computer science in 2004 from University of Amsterdam (UvA). He is a senior researcher in the System and Network Engineering group at UvA. He is the scientific coordinator of the European H2020 SWITCH project and leads the Data for Science theme in the ENVRIPLUS project. His research interests include software defined networking, cloud computing, time critical systems and big data management.

Tong Li got the Ph.D. degree in Software Engineering in February 2007 from De Montfort University, U.K, the B.Sc. degree in Computer Science in July 1983 and the M.Sc. degree in Computer Science in July 1988, all from Yunnan University, Kunming, China. He is a professor in computer science of Yunnan University. His current research interests include software process and data mining.
