Elsevier

Pattern Recognition

Volume 42, Issue 9, September 2009, Pages 2003-2012
Median graphs: A genetic approach based on new theoretical properties

https://doi.org/10.1016/j.patcog.2009.01.034

Abstract

Given a set of graphs, the median graph has been presented theoretically as a useful concept for inferring a representative of the set. However, computing the median graph is a highly complex task, and its practical application has been very limited up to now. In this work we present two major contributions. On the theoretical side, we show new properties of the median graph. On the algorithmic side, using these new properties, we present a new approximate algorithm based on genetic search that improves the computation of the median graph. Finally, we perform a set of experiments on real data, to which none of the existing algorithms for median graph computation could be applied up to now because of their computational complexity. With these results, we show that the median graph can be used in real applications and is no longer a purely theoretical concept, demonstrating, from a practical point of view, that it can be a useful tool to represent a set of graphs.

Introduction

In structural pattern recognition, the concept of median graph [22] has been presented as a useful tool to represent a set of graphs. Given a set of graphs S, the median graph is defined as the graph that minimizes the sum of distances (SOD) to all the graphs in S. It aims to extract the essential information of a set of graphs into a single prototype. Potential applications of median graphs include graph clustering and prototype learning. For instance, it has been successfully applied to different areas such as the synthesis of graphical symbols [21], image clustering [20], optical character recognition [22] and graphical symbol recognition [16].

Nevertheless, the computation of the generalized median graph is a highly complex task. Several exact and approximate algorithms have been developed in the past. Optimal algorithms include a tree search approach called multimatch [27] and a more efficient algorithm that takes advantage of certain conditions on the distance between two graphs [18]. Suboptimal methods include genetic algorithms [8], [22], greedy-based algorithms [20], [19] and spectral-based approaches such as those of [14], [35]. Despite this wide range of algorithmic tools, all of them are very limited in their application. They are often restricted to small graphs that satisfy particular conditions. None of them has been applied to real data.

In spite of these efforts to develop new and more efficient algorithms, little work exists on the theoretical properties of the median graph. In [22], some interesting properties of the median graph related to its size and its SOD are presented. Concretely, the authors give general bounds for both the size and the SOD of the median graph. Unfortunately, these original bounds are sometimes very coarse and cannot easily be used to reduce the complexity of its computation. Thus, tightening these bounds may be crucial to computing the median graph more efficiently or to obtaining better approximations.

In this paper we make theoretical and algorithmic contributions to the computation of the median graph that result in a new genetic algorithm, computationally more efficient than existing approaches, that can be applied to real data sets with large graphs. The most important contribution of this work is theoretical: we show that, under a particular cost function and a distance based on the maximum common subgraph, the original bounds given in [22] can be reduced. We then use these new theoretical results to present the second major contribution of this paper: a new approximate algorithm for median graph computation based on genetic search. This algorithm validates the new bounds not only theoretically but also practically, by implementing a new strategy for median graph computation. As a result, the computation time of the median graph is reduced. To show the usefulness of the new approach, we perform a set of preliminary experiments using a real database of 2340 webpages, split into six classes. Each webpage is represented as a graph with between 100 and 300 nodes. In a first experiment we show that the median graph can be computed in a reasonable time, compared with previously existing algorithms. In a second experiment we assess the accuracy of the computed median by comparing its SOD with the SOD of the set median graph. We show that this new approach yields graphs with lower SOD than the set median, which indicates that we are obtaining good approximations of the median graph. Finally, although it is not the main objective of this work, we try to validate the median graph as a representative of a class of graphs. Up to now, existing algorithms could only be applied to very limited sets of graphs, and the median graph could not be evaluated in practice as a representative of a class. To that end, we perform a preliminary classification experiment using the median graph. In some cases, we obtain slightly better results than a nearest-neighbor classifier with a much lower computation time. In this way, we demonstrate, for the first time, that the median graph can be a feasible alternative to represent a set of graphs in real applications.

The rest of the paper is organized as follows. In Section 2, we present some theoretical concepts required to understand the rest of this work. Then, in Section 3, the concept of the median graph and its theoretical properties are introduced. After that, in Section 4, we present the new theoretical properties of the median graph. Section 5 introduces a new genetic algorithm for median graph computation that takes advantage of the new theoretical results. Section 6 presents our experiments and the results we obtained. Finally, we conclude with some remarks and possible future lines of research.

Section snippets

Basic definitions

Let $L$ be a finite alphabet of labels for nodes and edges. A graph is a four-tuple $g = (V, E, \alpha, \beta)$, where $V$ is the finite set of nodes, $E$ is the set of edges, $\alpha$ is the node labelling function ($\alpha: V \to L$) and $\beta$ is the edge labelling function ($\beta: E \to L$). We assume that our graphs are fully connected. Consequently, the set of edges is implicitly given (i.e. $E = V \times V$). This assumption is only for notational convenience, and it does not restrict the generality of our results. In the case where no

Generalized median graph

Given a set of graphs, the concept of median graph has been presented as a useful tool to compute a representative of the set. Let $U$ be the set of graphs that can be constructed using labels from $L$. Given $S = \{g_1, g_2, \ldots, g_n\} \subseteq U$, the generalized median graph $\bar{g}$ of $S$ is defined as the graph $g \in U$ whose SOD to all the graphs in $S$ is minimum:

$$\bar{g} = \arg\min_{g \in U} \sum_{g_i \in S} d(g, g_i) = \arg\min_{g \in U} \mathrm{SOD}(g)$$

Notice that $\bar{g}$ is not usually a member of $S$ and, in general, more than one generalized median graph can be found for a
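The definition above can be made concrete with a minimal Python sketch of the SOD and of the set median, that is, the member of S with the lowest SOD (the generalized median searches all of U instead, so its SOD can never be higher). The function names `sum_of_distances` and `set_median` are hypothetical, and the toy distance on label sets is an assumption made only for illustration; it is not the graph distance used in the paper.

```python
from typing import Callable, Sequence, TypeVar

G = TypeVar("G")


def sum_of_distances(candidate: G, graphs: Sequence[G],
                     dist: Callable[[G, G], float]) -> float:
    """SOD(candidate): sum of distances from the candidate to every graph in S."""
    return sum(dist(candidate, g) for g in graphs)


def set_median(graphs: Sequence[G], dist: Callable[[G, G], float]) -> G:
    """Member of S with minimum SOD (search restricted to S, not to all of U)."""
    return min(graphs, key=lambda g: sum_of_distances(g, graphs, dist))


# Toy stand-in (assumption): each "graph" is just a set of node labels and
# the distance counts the labels that appear in only one of the two sets.
def toy_distance(a: frozenset, b: frozenset) -> int:
    return len(a ^ b)


toy_graphs = [frozenset("ab"), frozenset("abc"), frozenset("bcd")]
toy_median = set_median(toy_graphs, toy_distance)
```

Any distance function with the same signature can be passed in, so the same two helpers apply unchanged once a real graph distance is available.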

New theoretical results on the median graph

The theoretical properties mentioned above can be used to bound the search space of the median, either by limiting the size of the candidate medians or discarding some of these candidate medians based on the bounds of the SOD, for instance. Nevertheless, as mentioned in [22], these bounds are sometimes too coarse and may not be very useful to reduce the complexity of the median graph computation. Using the concepts of mcs(S) and MCS(S) defined in Section 2.2, the cost function introduced in

Genetic search algorithm

In this section we show how the new bounds presented in the previous section can be used to develop a new sub-optimal algorithm for the computation of the generalized median graph that takes advantage of these theoretical results to reduce the search space of the median graph. Computing the median graph is an optimization problem where a search space has to be explored to find the optimal solution. Among the several optimization techniques—such as Tabu search, genetic algorithms, etc.—that
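To illustrate the general idea of a genetic search for a low-SOD candidate, the following Python sketch may help. It is not the authors' algorithm: the encoding, the operators and every parameter are assumptions chosen for brevity. Each graph is reduced to a set of unique node labels, the distance is the number of labels present in only one of the two sets, and candidates are restricted to subsets of the labels that occur in the input graphs (a restriction chosen for this sketch, not a restatement of the paper's bounds). The names `sod` and `genetic_median` are hypothetical.

```python
import random


def sod(candidate, graphs):
    """Sum of distances, with distance = labels present in only one of the two sets."""
    return sum(len(candidate ^ g) for g in graphs)


def genetic_median(graphs, generations=50, pop_size=20,
                   mutation_rate=0.05, seed=0):
    """Approximate a median label set with a simple elitist genetic search."""
    rng = random.Random(seed)
    labels = sorted(set().union(*graphs))  # search space: subsets of these labels

    def decode(bits):
        return frozenset(l for l, b in zip(labels, bits) if b)

    def encode(graph):
        return tuple(l in graph for l in labels)

    def fitness(bits):
        return sod(decode(bits), graphs)  # lower is better

    # Seed the population with the input graphs, then pad with random candidates.
    population = [encode(g) for g in graphs]
    while len(population) < pop_size:
        population.append(tuple(rng.random() < 0.5 for _ in labels))

    def tournament():
        a, b = rng.sample(population, 2)
        return a if fitness(a) <= fitness(b) else b

    for _ in range(generations):
        children = [min(population, key=fitness)]  # elitism: keep the best so far
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            # Uniform crossover followed by bit-flip mutation.
            child = tuple(rng.choice(pair) for pair in zip(p1, p2))
            child = tuple((not bit) if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        population = children

    return decode(min(population, key=fitness))
```

Because the initial population contains every input graph and the best individual is always carried over, the returned candidate never has a higher SOD than the set median under this toy distance.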

Experimental results

In order to experimentally evaluate both the new theoretical properties and the new genetic approach, we present in this section three experiments using a real database of graphs representing webpages. Such graphs have a large number of nodes (around 200), but they belong to a particular class of graphs with unique node labels. This kind of graph allows the maximum common subgraph of two graphs to be computed in polynomial time [24]. That makes the computation of the edit distance based on the
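A minimal Python sketch shows why unique node labels make the maximum common subgraph cheap to compute: the node correspondence is fixed by the labels, so the common subgraph reduces to set intersections. The class `LabelledGraph` and the functions `mcs` and `mcs_distance` are hypothetical names. Two points are assumptions and not taken from the text above: measuring graph size by node count, and the distance d(g1, g2) = |g1| + |g2| - 2|mcs(g1, g2)|, a commonly used mcs-based distance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelledGraph:
    nodes: frozenset  # unique node labels
    edges: frozenset  # (source_label, target_label) pairs between those nodes


def mcs(g1: LabelledGraph, g2: LabelledGraph) -> LabelledGraph:
    """Common subgraph of two uniquely labelled graphs via set intersection."""
    # With unique labels a node can only match the node carrying the same
    # label, so no search over node correspondences is needed.
    return LabelledGraph(g1.nodes & g2.nodes, g1.edges & g2.edges)


def mcs_distance(g1: LabelledGraph, g2: LabelledGraph) -> int:
    """Assumed distance: |g1| + |g2| - 2|mcs(g1, g2)|, sizes counted in nodes."""
    return len(g1.nodes) + len(g2.nodes) - 2 * len(mcs(g1, g2).nodes)


page_a = LabelledGraph(frozenset({"a", "b", "c"}),
                       frozenset({("a", "b"), ("b", "c")}))
page_b = LabelledGraph(frozenset({"b", "c", "d"}),
                       frozenset({("b", "c"), ("c", "d")}))
common = mcs(page_a, page_b)
```

Both intersections take time roughly linear in the size of the smaller operand, which is what keeps repeated distance evaluations affordable for graphs with a few hundred nodes.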

Conclusions

The median graph has been presented as a good alternative for computing the representative of a set of graphs. Although some theoretical properties and algorithms related to the median graph have been introduced so far, existing methods do not permit its use in real pattern recognition applications.

In this paper we have derived new theoretical properties of the median graph, and we have developed a new algorithm using these properties that makes it possible to extend the computation of the median graph to

Acknowledgments

This work has been partially supported by the research Fellowship 401-027 (UAB), the Cicyt project TIN2006-15694-C02-02 (Ministerio Ciencia y Tecnología) and the Spanish research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018). We would like to thank K. Riesen from the University of Bern for his help with the webpages database and for making it available to us.

About the Author: MIQUEL FERRER was born in Terrassa on 24 October 1975. He received his Telecommunications Engineering Degree from the Universitat Ramon Llull, La Salle (Barcelona) in 2003. In 2004 he joined the Departament d’Informàtica i Matemàtiques, Universitat Rovira i Virgili. In 2005 he moved to the Computer Vision Center, Departament de Ciències de la Computació, Universitat Autònoma de Barcelona, where he obtained his Ph.D. in June 2008.

References (35)

  • H. Bunke et al., On the minimum common supergraph of two graphs, Computing (2000)
  • D. Conte et al., Challenging complexity of maximum common subgraph detection algorithms: a performance analysis of three algorithms on a wide database of graphs, Journal of Graph Algorithms and Applications (2007)
  • P.J. Durand, R. Pasari, J.W. Baker, C. che Tsai, An efficient algorithm for similarity analysis of molecules, Internet...
  • M. Ferrer, F. Serratosa, A. Sanfeliu, Synthesis of median spectral graph, in: J.S. Marques, N.P. de la Blanca, P. Pina...
  • M. Ferrer, F. Serratosa, E. Valveny, On the relation between the median and the maximum common subgraph of a set of...
  • M. Ferrer, E. Valveny, F. Serratosa, Spectral median graphs applied to graphical symbol recognition, in: J.F.M....
  • M. Ferrer, E. Valveny, F. Serratosa, Bounding the size of the median graph, in: J. Martí, J.-M. Benedí, A.M. Mendonça,...

About the Author: ERNEST VALVENY is an Associate Professor at the Computer Science Department of the Universitat Autònoma de Barcelona (UAB), where he obtained his Ph.D. degree in 1999. He is also a member of the Computer Vision Center (CVC) at UAB. His research work has mainly focused on symbol recognition in graphic documents. Other areas of interest are in the field of computer vision and pattern recognition, more specifically in the domain of document analysis, including shape representation, character recognition, document indexing and layout analysis. He is currently a member of IAPR TC-10, the Technical Committee on Graphics Recognition, and of IAPR TC-5 on Benchmarking and Software. He has been co-chair of the two editions of the International Contest on Symbol Recognition, supported by IAPR TC-10. He has worked on several industrial projects developed at the CVC and has published several papers in national and international conferences and journals.

About the Author: FRANCESC SERRATOSA was born in Barcelona on 24 May 1967. He received his Computer Science Engineering degree from the Universitat Politècnica de Catalunya (Barcelona) in 1993. Since then, he has been active in research in the areas of computer vision, robotics and structural pattern recognition. He received his Ph.D. from the same university in 2000. He is currently an Associate Professor of Computer Science at the Universitat Rovira i Virgili (Tarragona). He has published more than 40 papers and is an active reviewer for several conferences and journals. He is the author of two patents in the computer vision field.
