Elsevier

Information Sciences

Volume 339, 20 April 2016, Pages 369-394
Efficient pattern matching on big uncertain graphs

https://doi.org/10.1016/j.ins.2015.12.034

Abstract

A significant amount of research has been devoted to seeking efficient solutions to the problem of pattern matching over graphs. This interest is largely due to the many applications that require such efficient solutions, including protein complex prediction, social network analysis, and structural pattern recognition. However, in many real applications, the graph data are often noisy, incomplete, and inaccurate. In other words, there exist many uncertain graphs. Therefore, in this paper, we study pattern matching in the context of large uncertain graphs. Specifically, we want to retrieve all qualified matches of a query pattern in the uncertain graph. Though pattern matching over uncertain graphs is NP-hard, we employ a filtering-and-verification framework to speed up the search. In the filtering phase, we propose a probabilistic matching tree (PM-tree) built from match cuts obtained by a cut selection process. Based on the PM-tree, we devise a collective pruning strategy to prune a large number of unqualified matches. During the verification phase, we develop an efficient sampling algorithm to validate the remaining candidates. Extensive experimental results demonstrate the effectiveness and efficiency of the proposed algorithms. Finally, we show how our solution can be applied to querying knowledge graphs.

Introduction

Graphs constitute a generic data model with wide applicability in numerous domains, such as social networks, biological networks, and the World Wide Web. Indeed, it is often less complex for users to shoehorn semi-structured or sparse data into a vertex-edge-vertex data model than into a relational data model. Furthermore, it is also most natural for users to reason about an increasing number of popular datasets, such as the underlying networks of Twitter, Facebook, or LinkedIn, within a graph paradigm. Various types of queries over graph data have been investigated, such as subgraph search queries [62], [69], [73], shortest-path queries [6], [23], reachability queries [34], [57], and pattern matching queries [18], [42]. Reachability, or shortest-path, queries focus on the relation between two vertices in a graph. On the other hand, pattern matching queries are concerned with the connectivity among sets of vertices. Thus, a pattern matching query is more informative than a simple shortest-path, or reachability, query. Furthermore, a pattern matching query can be answered in polynomial time [20], while processing a subgraph query is #P-complete [25]. Therefore, the database community has devoted considerable effort to the study of the pattern matching query problem [18], [19], [20], [42], [74].

Interestingly, all of the aforementioned studies focus exclusively on applications where the edges of the graph are deterministic. Yet, in most applications, there is inherent uncertainty about the presence of edges due to often inevitable noise, incompleteness, and delays during data collection. For example, in a protein-protein interaction (PPI) network, the interactions obtained from experiments may include non-existing protein interactions or, on the contrary, miss existing ones [10], [28], [52], [54]; in social networks, graphs are often used to represent communities of users, where probabilities can be assigned to edges to model the degree of influence among users [1], [40], [46]; in communication or road networks, edge probabilities are used to quantify the connectivity between nodes, or to take traffic uncertainty into consideration [9], [30]; finally, the uncertainty in a Resource Description Framework (RDF) graph is caused by data errors or semantic extraction inaccuracy in the data integration process [13], [31], [39].

Based on the above discussion, in this paper, we study pattern matching queries over large uncertain graphs. In the following, we describe the problem of probabilistic graph pattern matching and outline our contributions.

We first introduce graph pattern matching on deterministic graphs, and then proceed to discuss uncertain graph pattern matching.

Given a graph pattern query q with n vertices {v1, …, vn} and a deterministic graph gc, a deterministic pattern matching query retrieves all matches of q in gc. For a given q and an n-vertex set m={u1, …, un} in gc, m is a match for q in gc if (1) the n vertices {u1, …, un} in gc have the same labels as the corresponding vertices {v1, …, vn} in q; and (2) for any two adjacent vertices vi and vj in q, the shortest-path distance between the two corresponding vertices ui and uj in gc is no larger than a given threshold γ [19], [74].
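The two conditions above can be sketched in code. The following is a minimal illustration, assuming unit edge weights (so distance is a hop count) and assuming the correspondence between query and data vertices is given as a dictionary; the names `bounded_distance`, `is_match`, and `mapping` are hypothetical and do not come from the paper.

```python
from collections import deque

def bounded_distance(adj, source, target, gamma):
    """Hop distance from source to target found by BFS, or None if it exceeds gamma."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == gamma:          # never expand beyond the distance threshold
            continue
        for nxt in adj.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None

def is_match(q_labels, q_edges, adj, labels, mapping, gamma):
    """Check conditions (1) and (2) for a candidate; mapping is query vertex -> data vertex."""
    # Condition (1): corresponding vertices carry the same label.
    if any(labels[mapping[v]] != q_labels[v] for v in q_labels):
        return False
    # Condition (2): adjacent query vertices map to data vertices within distance gamma.
    return all(bounded_distance(adj, mapping[a], mapping[b], gamma) is not None
               for a, b in q_edges)
```

On a toy path graph 1-2-3-4-5 with labels A, B, D, D, C and a query A-B-C, the vertex set {1, 2, 5} is a match for γ = 3 but not for γ = 2, because vertices 2 and 5 are three hops apart.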

Example 1

Consider the pattern query q and the deterministic graph ugc in Fig. 1. For this example, the probability on each edge can be ignored. Let the weight of each edge be 1 and the distance constraint γ be 3. Vertices {2, 5, 7} or {5, 6, 7} form a match for q in ugc, since their vertex labels are the same as those of q, namely, {A, B, C}, and the shortest-path distance between each pair of vertices is at most 3. Though the vertex set {1, 5, 7} also has labels {A, B, C}, it is not a match because the shortest-path distance between vertices 1 and 7 is 4, which violates the distance constraint.

The semantics of pattern matching queries have many real life applications [19], [20], [74]. For example, suppose that Fig. 1 is a graph model of LinkedIn, where vertices represent active users and edges indicate the friendship relations among users. Job attributes are used to label the vertices, e.g., {A, B, C} = {Scientist, Professor, Student}. The pattern matching query q looks for relations among scientists, professors and students. Finding such patterns may help social science researchers discover close connections (due to the distance constraint) between a successful scientist and his/her circle of students or professors.

For the uncertain graph pattern matching problem, we focus here on threshold-based probabilistic pattern matching (T-PM) over large uncertain graphs, where vertices are deterministic and edges are uncertain. Specifically, let g be an uncertain graph, let q be a graph pattern query, and let ϵ be a probability threshold. A T-PM query retrieves all vertex sets m={u1, …, un} in g (i.e., n vertices in g), such that the pattern matching probability (PMP) of m in g is at least ϵ. We will formally define PMP later.

We employ the possible world semantics [53], which has been widely used for modeling query processing over uncertain databases, to explain the semantics of PMP. A possible world graph (PWG) of an uncertain graph is a possible instance of the uncertain graph. It contains all of the vertices and a subset of the edges of the uncertain graph, and its weight is the product of the probabilities associated with its edges: the existence probability of each edge it contains and the complementary probability of each edge it omits. Then, for a graph pattern query q with n vertices {v1, …, vn} and an n-vertex set m={u1, …, un} in an uncertain graph g, the probability of m being a match for q is the sum of the weights of those PWGs g′ of g in which m is a match for q. For m to be a match for q in g′, it must satisfy the two conditions of deterministic graph pattern matching defined above.
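Written out as formulas, the semantics above may be sketched as follows. This sketch assumes that edges exist independently of one another, which is an assumption made here for illustration; the notation PWG(g) for the set of possible world graphs of g is likewise introduced only for this sketch.

```latex
\Pr(g') \;=\; \prod_{e \in E(g')} \Pr(e) \;\cdot \prod_{e \in E(g) \setminus E(g')} \bigl(1 - \Pr(e)\bigr),
\qquad
\mathrm{PMP}(m) \;=\; \sum_{\substack{g' \in \mathrm{PWG}(g) \\ m \text{ is a match for } q \text{ in } g'}} \Pr(g')
```

Under this reading, the weights of all possible worlds sum to 1, so PMP(m) is a probability in [0, 1] that can be compared directly with the threshold ϵ.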

Example 2

Fig. 2 shows a couple of the PWGs of the uncertain graph ug of Fig. 1 and their respective weights. There are altogether 2^9 = 512 PWGs for ug, and the sum of all weights is 1. To decide if a vertex set m={5,6,7} is a match for q in the uncertain graph ug, we first find all of ug’s PWGs that contain m as a match for q. Again, recall that m is a match for q in g′ if (1) vertices in m and q have the same labels, and (2) each pair of corresponding vertices in m has a shortest-path distance of at most 3 (γ=3). Here, the result includes both of the PWGs depicted in Fig. 2, as well as many others. Next, we sum the probabilities of all of these PWGs: 0.01248 + 0.009126 + … = 0.65. If a threshold of 0.6 is used for the query, then m is a qualified match for q in the uncertain graph ug.

The above example gives a naive solution to T-PM query processing. We call it SCAN, as it needs to enumerate all PWGs of the uncertain graph and to conduct a pattern matching between the query and each PWG. SCAN is very inefficient because the number of PWGs grows exponentially with the number of edges. Therefore, in this paper, we propose a filtering-and-verification method to reduce the search space.
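A minimal sketch of the SCAN idea, assuming independent edges, unit edge weights, and vertex labels that have already been checked. The function names `within` and `scan_pmp` and the `pairs` argument (the data-vertex pairs that correspond to adjacent query vertices) are hypothetical and not taken from the paper.

```python
from collections import deque
from itertools import product

def within(adj, s, t, gamma):
    """True if t can be reached from s in at most gamma hops (BFS)."""
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == t:
            return True
        if d < gamma:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
    return False

def scan_pmp(edge_prob, pairs, gamma):
    """Exact PMP by enumerating all 2^|E| possible worlds.

    edge_prob: {(u, v): existence probability} for undirected edges.
    pairs: data-vertex pairs that must be within distance gamma.
    """
    edges = list(edge_prob)
    total = 0.0
    for present in product((False, True), repeat=len(edges)):
        weight = 1.0
        adj = {}
        for (u, v), keep in zip(edges, present):
            p = edge_prob[(u, v)]
            weight *= p if keep else 1.0 - p   # weight of this possible world
            if keep:
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        if all(within(adj, a, b, gamma) for a, b in pairs):
            total += weight                    # this world contains the match
    return total
```

The loop body runs once per possible world, so the cost doubles with every additional uncertain edge; this is the blow-up that the filtering steps are meant to avoid.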

Specifically, given a graph pattern query q and a large uncertain graph g, our solution performs T-PM query processing in three steps, namely structural pruning, probabilistic pruning, and verification. In the structural pruning step, we run q on a deterministic graph gc that removes uncertainty from g, and get a match candidate set SCq. In the probabilistic pruning step, we first obtain a tight upper bound for PMP via a pre-computed index, which is based on edge cuts of gc. Next we refine the set of candidates in SCq, by pruning those potential matches whose upper bound is smaller than the probability threshold. In the verification phase, we validate each remaining candidate match to determine the final answer set.
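The control flow of the three steps can be sketched as below. This is only a skeleton under stated assumptions: the candidate set from structural pruning, the upper-bound function, and the verification function are passed in as parameters, and all names (`tpm_query`, `pmp_upper_bound`, `verify_pmp`) are hypothetical placeholders rather than the paper's API.

```python
def tpm_query(structural_candidates, pmp_upper_bound, verify_pmp, epsilon):
    """Filter-then-verify skeleton for a threshold query.

    structural_candidates: matches of q in the deterministic graph gc (step 1 output).
    pmp_upper_bound(m): any valid upper bound on the PMP of candidate m.
    verify_pmp(m): exact or estimated PMP of candidate m.
    """
    answers = []
    for m in structural_candidates:
        # Step 2, probabilistic pruning: if even the upper bound is below
        # the threshold, the true PMP is below it as well.
        if pmp_upper_bound(m) < epsilon:
            continue
        # Step 3, verification: only surviving candidates pay the expensive check.
        if verify_pmp(m) >= epsilon:
            answers.append(m)
    return answers
```

The design point is that the pruning test is safe for any valid upper bound, so a tighter bound only reduces the number of calls to the expensive verification step and never changes the answer set.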

The following is a summary of the contributions we make with this paper.

  • We give a general framework for answering pattern matching queries over large uncertain graphs.

  • We calculate a very tight upper bound for removing a large number of false candidates. We also devise the “Collective Pruning” strategy to speed up the pruning process.

  • We propose a lightweight index to avoid storing the exponential number of cuts, and devise a query cost model to maximize the pruning capability of the index with a small number of cuts.

  • We propose an efficient hybrid sampling algorithm to rapidly validate the final query answers.

  • We conduct extensive experiments to confirm the efficiency and effectiveness of our proposed approaches on real uncertain graph datasets.

Our earlier work [64] set the stage for the more in-depth study of the uncertain graph pattern matching problem found here. The extension includes the following new content. First, we include proofs for all of the theorems. Second, in our earlier work we used a probabilistic index consisting of edge cuts of the graph, and we found that the number of cuts was extremely large, which led to a very large index. Therefore, in this paper, we propose an optimal cut selection algorithm, so that the index has great pruning power and a very small size. Third, we design a basic sampling algorithm to verify the candidates, so as to avoid the hard problem of computing pattern matching probabilities exactly. To speed up the basic algorithm, we use a hybrid sampling approach, based on unequal probability sampling techniques, that samples many possible worlds at once. Fourth, we show how to apply uncertain graph pattern matching to the problem of querying knowledge graphs. The experimental results show that the proposed approach is significantly better than state-of-the-art methods in terms of both efficiency and match quality.

The remainder of this paper is organized as follows. We formally define T-PM queries over uncertain graphs, and give the complexity of the problem in Section 2. In Section 3, we give an overview of our approach, while Section 4 details the algorithms for efficient probabilistic pruning and the derivation of the upper bounds of the PMP. Index construction and sampling-based verification algorithms are presented in Sections 5 and 6, respectively. We discuss the results of performance tests on real datasets in Section 7. Section 8 shows how our solution can be applied to querying knowledge graphs. Related work is presented in Section 9. Finally, Section 10 concludes the paper.

Section snippets

Problem definition

In this section, we define some necessary concepts and discuss the complexity of the graph matching problem.

Definition 1 Uncertain graph

An undirected deterministic graph gc is denoted by (V, E, Σ, L), where V is a set of vertices, E is a set of edges (⊆ V × V), Σ is a set of labels, and L: V → Σ is a function that assigns labels to vertices. An uncertain graph is defined as g=(gc, Pr), where Pr: E → (0, 1] is a function that assigns existence probabilities to edges in E.
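One possible in-memory representation of Definition 1 is sketched below. The class name `UncertainGraph` and its fields and methods are hypothetical choices made for illustration; the paper's own implementation is in C++ and is not shown in this excerpt.

```python
from dataclasses import dataclass, field

@dataclass
class UncertainGraph:
    """g = (gc, Pr): labelled vertices plus undirected edges with existence probabilities."""
    labels: dict                                    # L: vertex -> label in Sigma
    edge_prob: dict = field(default_factory=dict)   # Pr: frozenset({u, v}) -> probability

    def add_edge(self, u, v, p):
        if u == v:
            raise ValueError("self-loops are not supported in this sketch")
        if u not in self.labels or v not in self.labels:
            raise KeyError("both endpoints must be labelled vertices")
        if not 0.0 < p <= 1.0:                      # Pr maps into (0, 1]
            raise ValueError("existence probability must lie in (0, 1]")
        self.edge_prob[frozenset((u, v))] = p

    def deterministic_adjacency(self):
        """Adjacency sets of gc, i.e. the graph with all uncertainty removed."""
        adj = {v: set() for v in self.labels}
        for edge in self.edge_prob:
            u, v = tuple(edge)
            adj[u].add(v)
            adj[v].add(u)
        return adj
```

Keying edges by `frozenset` makes (u, v) and (v, u) the same edge, which matches the undirected setting of the definition.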

Definition 2 Possible world graph

A PWG g=(V,E,Σ,L) is an instantiation of an

Overview of our approach

Fig. 3 gives a high-level overview of our general framework for a pattern matching query q over an uncertain graph g. It consists of three phases, namely Structural pruning, Probabilistic pruning, and Verification. The first two phases belong to the filtering step, and the last one is the verification step. We briefly present each step in what follows.

Structural pruning. The idea of structural pruning is straightforward. For n vertices m={u1, …, un} in g, if we remove all of the uncertainty in

Probabilistic pruning

As mentioned above, we first conduct structural pruning to obtain a set of qualified candidate matches of q in g. We then use probabilistic pruning techniques to further filter the remaining match set, SCq.

The idea behind probabilistic pruning is to compute and use an upper bound for PMP. To facilitate this process, we propose an indexing structure, called probabilistic matching tree (PM-tree).
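As a hedged illustration of how an edge cut can yield such an upper bound, consider the following standard reliability-style argument. It assumes that edges exist independently, and it is not necessarily the bound derived in the paper; the symbol C is introduced here only for illustration. Let vi and vj be adjacent in q, and let C be any set of edges of gc whose removal disconnects ui from uj. In every possible world in which m is a match, ui and uj are within distance γ and hence connected, so at least one edge of C must be present. Therefore:

```latex
\mathrm{PMP}(m) \;\le\; \Pr\bigl(\text{at least one edge of } C \text{ exists}\bigr)
\;=\; 1 - \prod_{e \in C} \bigl(1 - \Pr(e)\bigr)
```

If the right-hand side is smaller than the threshold ϵ, the candidate m can be discarded without computing its PMP.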

Before we describe the structure of PM-trees, we begin with some definitions. Given a deterministic

Probabilistic matching tree

Definition 6 introduced the structure and properties of PM-trees. Here, we first describe how to construct PM-trees, and then show that PM-trees have effective pruning capabilities.

Recall from Definition 6 that a PM-tree is a tree T=(V(T),E(T)), where V(T)=V(gc) and each edge e ∈ E(T) satisfies the following property.

Property 1

For each pair of distinct nodes (s, t) and edge e on the unique path between s and t, deleting e from T separates V(T) into two components, X and Y, such that (X, Y) is an s-t cut

Verification

In this section, we compute the PMP of a match in Cq to determine the final answer set. Specifically, given the hardness of computing PMP, we propose sampling algorithms to estimate PMP.
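A basic Monte Carlo estimator of PMP can be sketched as follows. This is plain independent sampling of possible worlds, not the hybrid unequal-probability scheme mentioned in the introduction; it assumes independent edges and unit edge weights, and the names `within`, `estimate_pmp`, `pairs`, `samples`, and `seed` are hypothetical.

```python
import random
from collections import deque

def within(adj, s, t, gamma):
    """True if t can be reached from s in at most gamma hops (BFS)."""
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        node, d = frontier.popleft()
        if node == t:
            return True
        if d < gamma:
            for nxt in adj.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, d + 1))
    return False

def estimate_pmp(edge_prob, pairs, gamma, samples=10_000, seed=0):
    """Estimate PMP as the fraction of sampled possible worlds that contain the match.

    edge_prob: {(u, v): existence probability} for undirected edges.
    pairs: data-vertex pairs that must be within distance gamma.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        adj = {}
        for (u, v), p in edge_prob.items():
            if rng.random() < p:               # keep each edge with its own probability
                adj.setdefault(u, set()).add(v)
                adj.setdefault(v, set()).add(u)
        if all(within(adj, a, b, gamma) for a, b in pairs):
            hits += 1
    return hits / samples
```

Each sample costs one pass over the edges plus a few bounded searches, so the running time is linear in the number of samples rather than exponential in the number of edges; the price is that the result is an estimate whose error shrinks as the sample count grows.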

Performance evaluation

In this section, we report on the effectiveness and efficiency of our proposed approach. Our methods are implemented on a Windows XP machine with a Core 2 Duo CPU (2.8 GHz) and 8 GB main memory. Programs are compiled using Microsoft Visual C++ 2010.

Real-world uncertain dataset. We use the real-world uncertain graph, Yeast, from the STRING database. Yeast contains all known and predicted protein interactions. The graph consists of 5862 vertices, 16,651 edges and 91 distinct

Application: Querying knowledge graphs

As knowledge graphs, such as DBpedia [2], YAGO [17], Probase [58] and Freebase [4], keep track of millions of entities (e.g., persons, products, organizations) together with their relationships, the potential for querying these graphs is tremendous. We now show how the graph pattern matching techniques described above may be applied in this context.

A knowledge graph can be represented as a tuple KG=(V,E,LV, LE, c), where V, E, LV and LE denote nodes, edges, node labels and edge labels,

Querying uncertain data

The topic most related to our work is managing and mining uncertain graphs, and it can be divided into two categories. The first category uses online algorithms, i.e., sampling approaches, to answer queries. Zou et al. [76], [77] study frequent subgraph mining on uncertain graph data. Potamias et al. [50] study k-nearest neighbor queries (k-NN) over uncertain graphs. Gao et al. [24] study the probability distribution of the diameter in uncertain graphs. Jin et al. [32] develop fast peeling

Conclusions

Uncertain graphs are pervasive in many real-world applications, such as bioinformatics, where data often exhibit uncertainties. In this paper, we study the problem of retrieving matches from large uncertain graphs that satisfy a query graph pattern with high confidence. To efficiently tackle this problem, we propose a tree index structure to enable an adaptive pruning process, designed according to a formal cost model, so that the index not only has a small size but also has powerful pruning

Acknowledgments

Ye Yuan is supported by the NSFC (grant nos. 61100024 and 61173029) and the Fundamental Research Funds for the Central Universities (grant no. N130504006). Guoren Wang is supported by the NSFC (grant nos. 61025007, 61328202 and U1401256), the National Basic Research Program of China (973, grant no. 2011CB302200-G), and the National High Technology Research and Development 863 Program of China (grant no. 2012AA011004). Lei Chen is supported by the NSFC (grant no. 61328202). Bo Ning is supported by the NSFC

References (77)

  • J. Cheng et al.

    Fg-index: Towards verification free query processing on graph databases

    Proceedings of Special Interest Group on Management of Data

    (2007)
  • J. Cheng et al.

    Fast graph pattern matching

  • Y. Cheng et al.

    Threshold-based shortest path query over large correlated uncertain graphs

    J. Comput. Sci. Technol.

    (2015)
  • H. Chui et al.

    Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions

    Bioinformatics

    (2007)
  • C.J. Colbourn

    The Combinatorics of Network Reliability

    (1987)
  • W.J. Cook et al.

    Combinatorial Optimization

    (1997)
  • D. Dimitrov et al.

    Query operators for comparing uncertain graphs

    (2015)
  • D.S. Hochbaum

    Approximation Algorithms for NP-Hard Problems

    (1997)
  • P. Ernst et al.

    Knowlife: A versatile approach for constructing a large knowledge graph for biomedical sciences

    BMC Bioinform.

    (2015)
  • M. Fabian et al.

    YAGO: A core of semantic knowledge unifying WordNet and Wikipedia

  • W. Fan et al.

    Incremental graph pattern matching

  • W. Fan et al.

    Adding regular expressions to graph reachability and pattern queries

  • W. Fan et al.

    Graph pattern matching: From intractable to polynomial time

    Proceedings of Very Large Data Base

    (2010)
  • L. Fang et al.

    Rex: Explaining relationships between entity pairs

    Proc. VLDB Endow.

    (2011)
  • G.S. Fishman

    A Monte Carlo sampling plan based on product form estimation

    Proceedings of the 23rd Conference on Winter Simulation

    (1991)
  • A.W.-C. Fu et al.

    Is-label: An independent-set based labeling scheme for point-to-point distance querying

    Proc. VLDB Endow.

    (2013)
  • M.R. Garey et al.

    Computers and Intractability: A Guide to the Theory of NP-Completeness

    (1979)
  • R.E. Gomory et al.

    Multi-terminal network flows

    SIAM

    (1961)
  • J.L. Herman et al.

    Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

    BMC Bioinform.

    (2015)
  • A. Hogan et al.

    Towards fuzzy query-relaxation for RDF

    The Semantic Web: Research and Applications

    (2012)
  • M. Hua et al.

    Probabilistic path queries in road networks: traffic uncertainty aware path selection

  • H. Huang et al.

    Query evaluation on probabilistic RDF databases

  • R. Jin et al.

    Discovering highly reliable subgraphs in uncertain graphs

  • R. Jin et al.

    Distance-constraint reachability computation in uncertain graphs

  • R. Jin et al.

    Simple, fast, and scalable reachability oracle

    Proc. VLDB Endow.

    (2013)
  • G. Kasneci et al.

    Ming: Mining informative entity relationship subgraphs

    Proceedings of the Conference on Information and Knowledge Management

    (2009)
  • G. Kasneci et al.

    Naga: Searching and ranking knowledge

    International Conference on Data Engineering

    (2008)
  • A. Khan et al.

    Nema: Fast graph search with label similarity

    Proc. VLDB Endow.

    (2013)