A novel parallel distance metric-based approach for diversified ranking on large graphs

https://doi.org/10.1016/j.future.2018.05.031

Highlights

  • A generalized distance metric, based on a subadditive set function over the symmetric difference of the neighbor sets of a pair of nodes, is introduced to capture pairwise dissimilarity between nodes.

  • Diversified ranking on graphs (DRG) is formulated as a Max-Sum k-dispersion problem with metric edge weights.

  • A centralized linear-time 2-approximation algorithm, GA, is developed to solve the DRG problem.

  • A highly parallelizable algorithm is developed for DRG; it can be easily implemented in MapReduce-style parallel computation models using GA as a basic reducer.

Abstract

Diversified ranking on graphs (DRG) is an important and challenging problem in graph data mining. Traditionally, this problem is modeled with a submodular optimization objective and solved by cardinality-constrained monotone submodular maximization. However, existing submodular objectives do not directly capture the dissimilarity between pairs of nodes, and most algorithms cannot easily exploit a distributed cluster computing platform, such as Spark, to improve their efficiency. To overcome these deficiencies, this paper introduces a generalized distance metric, based on a subadditive set function over the symmetric difference of the neighbor sets of a pair of nodes, to capture pairwise dissimilarity between nodes. In our approach, DRG is formulated as a Max-Sum k-dispersion problem with metric edge weights, which is NP-hard. Building on the proposed distance metric, a centralized linear-time 2-approximation algorithm, GA, is developed to solve DRG. Moreover, we develop a highly parallelizable algorithm for DRG that can be easily implemented in MapReduce-style parallel computation models using GA as a basic reducer. Finally, extensive experiments on real network datasets verify the effectiveness and efficiency of the proposed approaches.

Introduction

In recent years, supported by Web 2.0 technology, social network applications such as Facebook and WeChat have developed rapidly and accumulated massive amounts of data. To process and analyze these social network data effectively, researchers have developed a variety of data analysis methods and technology platforms [[1], [2]].

Graph data is an important type of social network data. In general, graph data consists of nodes and edges, with no total order over the nodes. Graph ranking, which aims to bring nodes into a total order, is therefore one of the fundamental tasks in graph data analysis [3]. PageRank [4] is a popular measure of the importance of nodes in a graph. Its basic idea is that a node has a higher rank if it is linked to by more high-ranking nodes. PageRank thus measures the global importance of nodes, and individual relevance is not well reflected by this measure. To overcome this shortcoming, personalized PageRank (PPR) and its variations [[4], [5]] were proposed to find the nodes in a graph that are most relevant to a query or user. However, as discussed in [[6], [7]], the top-k ranking list found by PPR may contain highly similar nodes because PPR focuses only on the query relevance of nodes. This reduces ranking effectiveness, especially when the query intent of users is uncertain or ambiguous.

Diversity has been widely recognized and studied as a way of addressing uncertainty and ambiguity in information retrieval applications [8]. To provide more desirable graph ranking results, graph ranking algorithms have to find a trade-off between query relevance and diversification. Consequently, a considerable amount of work on diversified ranking on graphs (DRG) has been undertaken to improve the diversity of top-k ranking results [[9], [10], [11], [12], [13], [14], [15], [16]]. In this paper, we focus on improving the diversity of ranking results based purely on the topological structure of the graph, without accounting for information about links and nodes.

Li et al. [10] formulated DRG as a bi-criteria optimization problem in which relevance is measured by the sum of Personalized PageRank scores and diversity is measured by the expansion ratio over the ranking results. Küçüktunç et al. [11] presented expanded relevance, which combines relevance and diversity into a single function used as the optimization objective. Expanded relevance is in effect a weighted expansion ratio that uses personalized PageRank scores as node weights. Both the expansion ratio and expanded relevance rely on a submodular optimization objective and apply classic cardinality-constrained monotone submodular maximization to solve their problems.

However, two issues must be addressed to improve the effectiveness and efficiency of DRG. First, the key challenge of DRG is to design well-defined diversity measures. Both expansion ratio and expanded relevance rest on the assumption that two nodes are dissimilar if they do not share common neighbors in a graph, and that a set of mutually dissimilar nodes therefore has a large expansion measure. However, neither the expansion ratio nor expanded relevance directly captures dissimilarity between pairs of nodes. In other words, a set of nodes with a high expansion ratio or expanded relevance does not always consist of nodes that are dissimilar to one another. Example 1 illustrates this observation.

Example 1

We first review the definition of expansion ratio introduced in [10]. Let $G=(V,E)$ be a graph and let $S \subseteq V$ be a subset of nodes. The expansion set of $S$ is denoted by $N(S)$ and defined as $N(S) = S \cup \{v \in V \setminus S \mid \exists u \in S, (u,v) \in E\}$. The expansion ratio of $S$ is then $er(S) = |N(S)|/|V|$. For $\{1,2\}$ and $\{7,8\}$ in Fig. 1, we therefore have

$$er(\{1,2\}) = \frac{|\{1,2\} \cup \{3,4,5,6\}|}{|\{1,2,3,4,5,6,7,8\}|} = 0.75, \qquad er(\{7,8\}) = \frac{|\{7,8\} \cup \{4,6\}|}{|\{1,2,3,4,5,6,7,8\}|} = 0.5.$$

Here $\{1,2\}$ has a higher expansion ratio than $\{7,8\}$, but nodes 1 and 2 are clearly more similar to each other than nodes 7 and 8 are, since they have more common neighbors (note that, as in [[10], [11]], we assume in this paper that the diversity measure is based purely on the topological structure of the graph, without accounting for information about links and nodes). This illustrates that the expansion ratio does not always capture the dissimilarity between pairs of nodes well. Since expanded relevance can be regarded as a weighted version of the expansion ratio, it has a similar problem.
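The arithmetic in Example 1 can be checked with a short Python sketch. Fig. 1 is not reproduced here, so the edge list below is a hypothetical graph chosen only so that its expansion sets match the ones quoted in the example; the function names are likewise illustrative.

```python
# Hypothetical edge set: Fig. 1 is not available in this text, so these edges
# are an assumption chosen only to match the expansion sets in Example 1.
EDGES = [(1, 3), (1, 4), (1, 5), (1, 6),
         (2, 3), (2, 4), (2, 5), (2, 6),
         (7, 4), (8, 6)]
NODES = set(range(1, 9))


def neighbors(node):
    """Return the set of nodes adjacent to `node` in the undirected graph."""
    return {b for a, b in EDGES if a == node} | {a for a, b in EDGES if b == node}


def expansion_set(subset):
    """N(S): S together with every node adjacent to some member of S."""
    result = set(subset)
    for u in subset:
        result |= neighbors(u)
    return result


def expansion_ratio(subset):
    """er(S) = |N(S)| / |V|."""
    return len(expansion_set(subset)) / len(NODES)


def common_neighbors(u, v):
    """Neighbors shared by u and v."""
    return neighbors(u) & neighbors(v)


print(expansion_ratio({1, 2}))  # 0.75
print(expansion_ratio({7, 8}))  # 0.5
print(len(common_neighbors(1, 2)), len(common_neighbors(7, 8)))  # 4 0
```

In this assumed graph, the pair with the higher expansion ratio is also the pair that shares every neighbor, which is the mismatch the example points out.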

The second issue is that maximizing a cardinality-constrained monotone submodular function is NP-hard. A $(1-1/e)$-approximation greedy algorithm can be used to solve the problem [17]. The greedy algorithm starts with an empty set and, in each iteration, adds the node with the maximum marginal gain to the solution set. For a graph with $m$ nodes, the greedy algorithm needs $O(mk)$ evaluations of marginal gains. When dealing with massive graphs, however, the classic greedy algorithm becomes expensive and infeasible. Although submodularity can be exploited to implement accelerated greedy algorithms, such as CELF [18] or lazy-greedy [19], the accelerated greedy algorithm is still inefficient as the scale of a graph increases, even for small values of $k$. Moreover, approaches based on submodular optimization are essentially sequential and are not suited to parallel processing. As a result, these algorithms cannot easily exploit a parallel graph processing platform, such as Spark GraphX [20], to improve their efficiency.
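To make the cost of the classic greedy procedure concrete, the following minimal sketch uses the size of the expansion set as the submodular objective and counts marginal-gain evaluations. The function name, the toy adjacency map, and the tie-breaking rule are assumptions made for illustration; this is not code from [10], [11], or [17].

```python
def greedy_expansion(adjacency, k):
    """Pick up to k nodes greedily by marginal gain in expansion-set size.

    `adjacency` maps each node to the set of its neighbors.
    Returns the chosen nodes (in pick order) and the number of
    marginal-gain evaluations that were performed.
    """
    selected, covered, evaluations = [], set(), 0
    for _ in range(min(k, len(adjacency))):
        best_node, best_gain = None, -1
        for node, nbrs in adjacency.items():
            if node in selected:
                continue
            evaluations += 1
            # Marginal gain: nodes newly added to the expansion set.
            gain = len(({node} | nbrs) - covered)
            if gain > best_gain:  # ties keep the earlier node
                best_node, best_gain = node, gain
        selected.append(best_node)
        covered |= {best_node} | adjacency[best_node]
    return selected, evaluations


# Toy graph (illustrative only): a small star plus a separate edge.
toy = {1: {2, 3}, 2: {1}, 3: {1}, 4: {5}, 5: {4}}
selected, evaluations = greedy_expansion(toy, 2)
print(selected, evaluations)  # [1, 4] 9
```

Each round rescans every unselected node, so the evaluation count grows as roughly $m \cdot k$, and each round depends on the previous one, which is why the procedure is hard to parallelize directly.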

Therefore, a natural question is whether a distance measure can be defined that directly captures pairwise dissimilarity between nodes, so that the diversity of ranking results can be formulated on top of it. Better still, can efficient algorithms that are suitable for parallel implementation be derived from such a diversity model?

To this end, this paper first introduces a distance measure that captures pairwise dissimilarity between nodes. Based on the defined distance metric, DRG is formulated as a Max-Sum $k$-dispersion problem (MS$k$D) [21]. A centralized linear-time algorithm and a highly parallelizable MapReduce algorithm are then proposed to solve DRG.

More specifically, the key contributions in this paper can be summarized as follows:

  • A generalized distance metric is introduced to measure various dissimilarities between pairs of nodes. The distance measure is defined by a set function $f$ over the symmetric difference of the neighbor sets of a pair of nodes. Moreover, if $f$ is a subadditive set function [22] over the set of nodes, the distance measure can be proven to be a metric, which can be exploited to develop efficient algorithms. Furthermore, since many set functions are subadditive, such as the cardinality of a set or the sum of the weights of its nodes, the proposed distance metric is a generalized measure that can capture various dissimilarities between pairs of nodes by choosing different subadditive set functions.

  • DRG is formulated as a Max-Sum $k$-dispersion problem. Based on the defined distance metric, DRG can be defined on a weighted complete graph $C_Q$, where $Q$ is the set of query-dependent nodes for a Personalized PageRank. The weight of the edge between two nodes is the metric distance between the corresponding nodes in the original graph. DRG is thus formulated as an MS$k$D on $C_Q$, which seeks a size-$k$ subset of nodes whose induced subgraph has the maximum sum of edge weights.

  • A highly parallelizable MapReduce approach with an approximation guarantee is presented to solve DRG on large graphs. Since MS$k$D is NP-hard [21], a centralized linear-time approximation algorithm, GA, is first proposed to solve DRG by exploiting the metric distance. Using GA as a basic reducer, we further develop a highly parallelizable approach to DRG that can be easily implemented in MapReduce-style parallel computation models. This parallel approach also retains approximation guarantees for DRG. To the best of our knowledge, this is the first method that solves DRG in a MapReduce manner with approximation guarantees.
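The first two contributions can be illustrated with a small Python sketch that uses cardinality as the subadditive set function. The adjacency map is hypothetical, and `greedy_dispersion` is a generic farthest-pair greedy heuristic used only to make the Max-Sum objective concrete: it is not the paper's GA, whose details are not given in this text, and its farthest-pair seeding step is quadratic rather than linear.

```python
from itertools import combinations


def sym_diff_distance(adjacency, u, v, f=len):
    """d(u, v) = f(N(u) symmetric-difference N(v)); f defaults to cardinality."""
    return f(adjacency[u] ^ adjacency[v])


def dispersion(adjacency, subset, f=len):
    """Max-Sum objective: sum of pairwise distances inside `subset`."""
    return sum(sym_diff_distance(adjacency, u, v, f)
               for u, v in combinations(sorted(subset), 2))


def greedy_dispersion(adjacency, candidates, k, f=len):
    """Illustrative heuristic (NOT the paper's GA): seed with the farthest
    pair, then repeatedly add the candidate with the largest total distance
    to the nodes already chosen."""
    candidates = sorted(candidates)
    if k <= 0 or not candidates:
        return []
    if k == 1 or len(candidates) == 1:
        return candidates[:1]
    chosen = list(max(combinations(candidates, 2),
                      key=lambda p: sym_diff_distance(adjacency, p[0], p[1], f)))
    while len(chosen) < min(k, len(candidates)):
        rest = [c for c in candidates if c not in chosen]
        chosen.append(max(rest, key=lambda c: sum(
            sym_diff_distance(adjacency, c, s, f) for s in chosen)))
    return chosen


# Hypothetical graph and query-dependent node set Q (assumptions for the demo).
adjacency = {1: {3, 4, 5, 6}, 2: {3, 4, 5, 6}, 3: {1, 2}, 4: {1, 2, 7},
             5: {1, 2}, 6: {1, 2, 8}, 7: {4}, 8: {6}}
Q = [1, 2, 7, 8]
top3 = greedy_dispersion(adjacency, Q, 3)
print(top3, dispersion(adjacency, top3))  # [1, 7, 8] 8
```

Passing a different `f`, for example a sum of node weights, changes the dissimilarity being measured without changing the rest of the code, which mirrors the "generalized" nature of the metric described above.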

Extensive experiments are conducted on several representative real-world network datasets, comparing our algorithms with existing approaches under relevance and various diversity measures. The experimental results demonstrate the effectiveness and efficiency of the proposed algorithms.

The remainder of this paper is organized as follows. In Section 2, we review the related literature. In Section 3, we discuss the pairwise distance metric and present the formulation of our problem. The diversified algorithms are introduced in Section 4. In Section 5, we report the experimental results. Finally, we conclude the paper in Section 6.

Diversified ranking on graphs

Ranking nodes is a fundamental issue in the retrieval and mining of graph data. PageRank [4] provides a way to measure the global importance of nodes in a graph. Evolving from PageRank, topic-sensitive PageRank or personalized PageRank [[5], [23]] can be used to evaluate the relevance scores of nodes in a graph. To address the redundancy problem in the ranking results, considerable work has addressed diversified ranking on graphs [[9], [10], [11], [12], [13], [14], [15], [16]]. In what

Problem formulation

In this section, we first establish a pairwise distance metric that measures the dissimilarity between a pair of nodes. Then, based on this metric, we propose a novel diversified ranking measure and formulate our problem.
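As a sketch of why such a distance can satisfy the triangle inequality, write $N(u)$ for the neighbor set of $u$ and $d_f(u,v)$ for the distance; both symbols are introduced here for illustration. The argument below additionally assumes that $f$ is nonnegative and monotone, which this text does not state explicitly.

```latex
\begin{align*}
d_f(u,v) &= f\bigl(N(u)\,\triangle\,N(v)\bigr),\\
N(u)\,\triangle\,N(w) &\subseteq \bigl(N(u)\,\triangle\,N(v)\bigr)\cup\bigl(N(v)\,\triangle\,N(w)\bigr),\\
d_f(u,w) &\le f\Bigl(\bigl(N(u)\,\triangle\,N(v)\bigr)\cup\bigl(N(v)\,\triangle\,N(w)\bigr)\Bigr) && \text{(monotonicity)}\\
&\le d_f(u,v)+d_f(v,w) && \text{(subadditivity)}.
\end{align*}
```

Symmetry follows from the symmetry of $\triangle$. With $f$ equal to cardinality, two distinct nodes with identical neighbor sets are at distance zero, so under this reading the measure behaves as a pseudometric on such pairs.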

Diversified ranking algorithms

As discussed in the previous section, solving TopkDRG as defined in Eq. (4) is NP-hard. In this section, we first propose a 2-approximation algorithm (GA) to approximately solve TopkDRG. Then, a two-round MapReduce algorithm is proposed to solve TopkDRG in parallel, using GA as the basic reducer.
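The two-round structure can be simulated in plain Python as follows. This is a sketch under stated assumptions: the round-robin partitioning, the function names, and `greedy_reducer` (a simple stand-in for GA, whose details are not given in this text) are all hypothetical, and the sketch makes no claim about the approximation guarantee of the actual algorithm.

```python
from itertools import combinations


def abs_dist(a, b):
    """Toy metric on numbers, standing in for the node distance."""
    return abs(a - b)


def greedy_reducer(items, k, dist):
    """Stand-in for GA (assumption): repeatedly add the item with the
    largest total distance to the items already chosen."""
    items = list(items)
    if k <= 0 or not items:
        return []
    chosen = items[:1]
    while len(chosen) < min(k, len(items)):
        rest = [x for x in items if x not in chosen]
        chosen.append(max(rest, key=lambda x: sum(dist(x, c) for c in chosen)))
    return chosen


def two_round_select(items, k, num_parts, dist):
    """Round 1: every partition selects k local candidates.
    Round 2: a single reducer selects k items from the merged candidates."""
    items = list(items)
    parts = [items[i::num_parts] for i in range(num_parts)]    # map: partition
    local = [greedy_reducer(p, k, dist) for p in parts if p]   # reduce, round 1
    merged = [x for picks in local for x in picks]
    return greedy_reducer(merged, k, dist)                     # reduce, round 2


def total_dispersion(subset, dist):
    """Sum of pairwise distances inside `subset`."""
    return sum(dist(a, b) for a, b in combinations(subset, 2))


points = [0, 1, 2, 3, 10, 11, 12, 20]
picked = two_round_select(points, k=3, num_parts=2, dist=abs_dist)
print(sorted(picked), total_dispersion(picked, abs_dist))
```

In a real MapReduce or Spark job, the round-1 calls would run on separate workers, and only the at most `k * num_parts` merged candidates would need to reach the final reducer.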

Experiments

In this section, we evaluate the proposed algorithms experimentally. We first describe the experimental setup in Section 5.1. We report the effects of several parameters which may influence the performance of our algorithms and test the scalability of our algorithms in Section 5.2. Finally, the comparison with existing diversified ranking algorithms is presented in Section 5.3.

Conclusion

In this paper, we present a distance metric-based approach to the problem of diversified ranking on graphs. The experimental results on real network datasets demonstrate that our approach achieves relevance and diversification performance comparable to existing methods under existing measures, such as expansion ratio and expanded relevance, while running faster than the existing state-of-the-art methods because it exploits the power of parallel computation. Additionally, for the average distance, which

Acknowledgments

The authors acknowledge the financial support from the following foundations: National Natural Science Foundation of China (61562091, 61472345, 61663046), Natural Science Foundation of Yunnan Province, China (2014FA023, 2016FB110, 2016FB104), Foundation of Backbone Teacher Development of Yunnan University, China (XT412003), Program for Excellent Young Talents of Yunnan University, China (XT412003), and Open Foundation of Key Laboratory of Software Engineering, China, Yunnan Province (2012SE303,

Jin Li received the B.Sc. degree in computer science, the M.Sc. degree in computational mathematics and the Ph.D. degree in telecommunication and information system from Yunnan University in 1998, 2004, and 2012 respectively. He is currently with the National Pilot School of Software, Yunnan University, Kunming, China, as an Associate Professor of Machine learning. His current research interests include machine learning, data mining, social network analysis.

References (36)

  • K. Zheng, H. Wang, Z. Qi, et al., A survey of query result diversification, in: Knowledge & Information Systems, 2016,...
  • H. Tong, et al., Diversified ranking on large graphs: an optimization viewpoint
  • R.H. Li, et al., Scalable diversified ranking on large graphs, IEEE Trans. Knowl. Data Eng. (2013)
  • O. Küçüktunç, et al., Diversified recommendation on graphs: Pitfalls, measures, and algorithms
  • O. Küçüktunç, et al., Diversifying citation recommendations, ACM Trans. Intell. Syst. Technol. (TIST) (2015)
  • A. Dubey, S. Chakrabarti, C. Bhattacharyya, Diversity in ranking via resistive graph centers, in: KDD, 2011, pp....
  • D. Mottin, et al., Graph query reformulation with diversity
  • L. Yuan, L. Qin, X. Lin, L. Chang, W. Zhang, Diversified top-k clique search, in: 31st IEEE international conference on...
    Yun Yang received the B.Sc. (Hons.) degree in information technology and telecommunication from Lancaster University, Lancaster, U.K., in 2004, the M.Sc. degree in advanced computing from Bristol University, Bristol, U.K., in 2005, and the M.Phil. degree in informatics and the Ph.D. degree in computer science from the University of Manchester, Manchester, U.K., in 2006 and 2011, respectively. He was a Research Fellow with the University of Surrey, Surrey, U.K., from 2012 to 2013. He is currently with the National Pilot School of Software, Yunnan University, Kunming, China, as a full Professor of machine learning. His current research interests include machine learning, data mining, pattern recognition, and temporal data processing and analysis.

    Xiaoling Wang received the bachelor's, master's, and doctoral degrees from Southeast University in 1997, 2000, and 2003, respectively. She is currently a professor and vice dean of the Software Engineering Institute, East China Normal University. She was selected for the Program for New-Century Talents of the Ministry of Education of China. She is a member of the China Computer Federation Technical Committee on Databases. She has published more than 100 papers, some of them in international conferences and journals such as SIGMOD, WWW, SIGIR, AAAI, IJCAI, CIKM, DASFAA and ICWS. Her research interests mainly include web data management, data mining and data service technology. She is a member of the IEEE.

    Zhiming Zhao obtained his Ph.D. in computer science in 2004 from University of Amsterdam (UvA). He is a senior researcher in the System and Network Engineering group at UvA. He is the scientific coordinator of the European H2020 SWITCH project and leads the Data for Science theme in the ENVRIPLUS project. His research interests include software defined networking, cloud computing, time critical systems and big data management.

    Tong Li received the Ph.D. degree in Software Engineering from De Montfort University, U.K., in February 2007, and the B.Sc. and M.Sc. degrees in Computer Science from Yunnan University, Kunming, China, in July 1983 and July 1988, respectively. He is a professor of computer science at Yunnan University. His current research interests include software process and data mining.