1 Introduction

Searching for information, a fundamental activity of human beings in all eras, has never been so complex, so challenging, and so demanding as it is today, when the world is flooded with data and information. Many sophisticated information retrieval (IR) systems have been developed to retrieve desired and valuable information from diverse sources of data in efficient ways. In order to speak about information being desired or valuable, one has to introduce suitable mechanisms to specify desired information and to use quantitative estimates to measure the value or usefulness of each piece of information. For example, some IR systems rank documents according to numerical scores assigned on the basis of the vector-space model and the probabilistic model [16].

We are concerned with searching for information (more concretely, for geometric knowledge objects) from knowledge bases for which pattern matching is essential (see, e.g., [20]). For knowledge base queries, there are mainly two kinds: factual and conceptual [19]. The former identifies pieces of information relevant to the input through expansion of the query terms and expansion of themes, while the latter identifies potential existence of information in particular areas by specifying terminologies. In addition, ontology-based retrieval models or algorithms (e.g., the one described in [14]) may be used to support semantic search and knowledge-base exploitation.

Mathematics is constituted by multi-layer knowledge with various kinds of representations. Most mathematical objects may be represented using functions and relations in which complex structures may not be obvious. It is rather difficult to retrieve mathematical objects merely based on keywords. Mathematical information retrieval is accomplished in general with query construction, normalization, indexing, and matching [21]. Currently available systems for searching mathematical expressions are based mostly on tree indexing (see, for instance, [6, 10–13]). The retrieval of mathematical information from natural language text is another challenging issue, involving the use of techniques and tools from computational linguistics and artificial intelligence. The reader is referred to [8] for the system mArachna, which is capable of retrieving certain mathematical information from scientific books in German.

Figures are widely used to represent or illustrate mathematical knowledge in general, and geometric knowledge in particular. How to retrieve geometric information and how to discover geometric knowledge from figures are both interesting questions that may be asked. For existing work related to these questions, we may mention [9], in which declarative, procedural, and analytic approaches are used to describe geometric figures and a search mechanism based on a graph database is developed, and [2, 17], in which it is shown how geometric theorems can be discovered automatically from images of diagrams.

In this paper, we present an inexact query method for searching geometric theorems stored in the open geometric knowledge base OpenGeo [18]. Our method is different from the query method presented in [9]. For queries and for theorems stored in the database, we construct undirected graphs, which are easier to build than directed graphs. We use types to describe nodes and edges. It is not necessary to consider directed edges when types of nodes and edges are fixed. For example, if the types of two nodes are point and line, respectively, and the type of the edge between them is incident, then the only possibility is that the point is incident to the line. In addition, weights are used to measure the importance of nodes in an undirected graph. For an inexact query, we weaken the query graph according to the weights of nodes in order to find theorems, so it is possible to emphasize different parts of the query graph in the weakening process. If one wants to keep the nodes that have perpendicular relations, higher weights can be given to these nodes than to others. Different allocations of weights lead to different search results, satisfying various application requests, and our method allows users to allocate weights as they wish. Finally, to reduce the search space and save time, irrelevant theorems that cannot possibly satisfy the query are filtered out beforehand.

In detail, for a given image of a diagram, the method works by first retrieving geometric features (including geometric objects and their relations) implied in the image using pattern-recognition methods and numerical verification techniques [2, 17] and then constructing a graph G corresponding to the diagram in the input image from the retrieved features. The graph G is simplified and weakened to match graphs produced from theorems in OpenGeo, and the degree of relevance between G and the graph of each theorem found in OpenGeo is calculated and used to rank the resulting theorems. This inexact query method is capable of finding theorems that are highly relevant to the diagram in the input image and may be used to explore properties of similar diagrams, to find relevant theorems for illustration, and to seek analogous techniques of theorem proving. After a short review of the formal representation of geometric theorems and the structure and implementation of OpenGeo in Sect. 2, we outline the process of searching for theorems in OpenGeo using features retrieved from images of diagrams in Sect. 3 and define the degree of relevance for ranking search results in Sect. 4. Experimental results with a preliminary implementation of our searching method are reported in Sect. 5. Conclusions drawn from this work are discussed briefly, together with future work, in Sect. 6.

2 OpenGeo: A Formalized Geometric Knowledge Base

OpenGeo [18] is an open online geometric knowledge base, containing typical geometric knowledge objects (such as definitions, theorems, and proofs) together with web-based interfaces and tools that help users manage the knowledge objects stored in it.

2.1 Representation of Geometric Knowledge Objects in OpenGeo

Geometric knowledge objects in OpenGeo are categorized into specific classes, including definition, axiom, theorem, proof, problem, and algorithm (which are interconnected according to the structure of geometric knowledge objects [3]). Each class may contain several data items. For example, the class of theorem contains knowledgeName (identifiers for the theorem), formalRepresentation (processable by other tools for automated reasoning, computation, transformation, etc.), naturalRepresentation (for human users to read), algebraicRepresentation (algebraic expressions for the theorem), diagramInstruction (instructions for drawing diagrams of the theorem), nondegeneracyCondition (constraints to make the theorem rigorous and unambiguous), figure (images or diagrams constructed by using dynamic geometry software for the theorem), and keyWords.

The data stored in the formalRepresentation item is represented in GDL, a formal Geometry Description Language [1] which is readable and processable, and can be easily transformed into or from natural languages. For example, the theorem named after Robert Simson, illustrated by the figure in Fig. 1 and stated in English as “the feet of the perpendiculars from a point to the sides of a triangle are collinear if and only if the point lies on the circumcircle of the triangle,” may be represented in GDL as

Fig. 1. The figure of Simson’s theorem

$$\begin{array}{l} \text {Theorem}(\text {Simson},\ \text {Theorem}, \\ \quad \text {assume}(\{A,B,C,D\}:=\{\texttt {point}(), \texttt {point}(), \texttt {point}(), \texttt {point}()\}, \\ \qquad \texttt {incident}(D, \texttt {circumcircle}(\texttt {triangle}(A,B,C))), \\ \quad \text {show}(\texttt {collinear}(\texttt {foot}(D,\texttt {line}(A,B)), \texttt {foot}(D,\texttt {line}(A,C)), \texttt {foot}(D,\texttt {line}(B,C))))). \end{array}$$

This formal representation of the theorem can be used for automated proving of the theorem and automated generation of diagrams illustrating the theorem [1]. We will use the data stored in the formalRepresentation item of theorem class to construct graphs of theorems for searching in Sect. 3.

2.2 OpenGeo Extension

To search for theorems in OpenGeo with an image of a diagram as query input, it is effective to first retrieve geometric features from both the input image and the theorems in OpenGeo and then compare the retrieved features to determine the degrees of relevance between the diagram in the input image and the theorems. By geometric features of a diagram or a theorem, we mean the geometric entities and their relations which are involved in the diagram or the theorem. They can be represented as a graph in which each entity is mapped to a node and each relation is mapped to an undirected edge. Formally, a pair (ID, Type) is used to represent a node, where ID is an integer automatically assigned to the node and Type indicates the type of the entity (see Table 1); a triple (FirstNodeID, SecondNodeID, Type) is used to represent an edge, where FirstNodeID and SecondNodeID are the IDs of the two nodes that the edge connects and Type indicates the type of the relation (see Table 2). Different types of nodes are related by different types of edges. For example, D (in Table 1) means the distance between two points, no matter whether the line segment between the two points exists in the diagram or not. But if L (in Table 1) appears, there must be a line appearing in the diagram. Relations between entities of type L may be perpendicular (of type perp in Table 2) or parallel (of type para in Table 2), while quantities of type D may be related by the equivalence relation (of type equ in Table 2).

Table 1. Types of nodes
Table 2. Types of edges

For instance, the graph for the geometric features of Simson’s theorem may be represented as follows.

  • Nodes:

    $$\begin{array}{llllll} ~(0, \texttt {L}),~ &{} ~(1, \texttt {L}),~ &{} ~(2, \texttt {L}),~ &{} ~(3, \texttt {C}),~ &{} ~(4, \texttt {P}),~ &{} ~(5, \texttt {L}),~ \\ ~(6, \texttt {P}),~ &{} ~(7, \texttt {L}),~&{} ~(8, \texttt {P}),~ &{} ~(9, \texttt {L}),~ &{} ~(10, \texttt {L}),~ &{} ~(11, \texttt {TRI}),~ \\ ~(12, \texttt {P}),~ &{} ~(13, \texttt {P}),~ &{} ~(14, \texttt {P}),~ &{} ~(15, \texttt {P}). &{} &{} \end{array}$$
  • Edges:

    $$\begin{array}{llllll} (0, 5, \texttt {perp}), &{} (1, 7, \texttt {perp}), &{} (2, 9, \texttt {perp}), &{} (12, 11, \texttt {tri}), &{} (13, 11, \texttt {tri}), &{} (14, 11, \texttt {tri}), \\ (12, 0, \texttt {inc}), &{} (13, 0, \texttt {inc}), &{} (12, 1, \texttt {inc}), &{} (14, 1, \texttt {inc}), &{} (13, 2, \texttt {inc}), &{} (14, 2, \texttt {inc}), \\ (4, 5, \texttt {inc}), &{} (4, 0, \texttt {inc}), &{} (15, 5, \texttt {inc}), &{} (6, 7, \texttt {inc}), &{} (6, 1, \texttt {inc}), &{} (15, 7, \texttt {inc}), \\ (8, 9, \texttt {inc}), &{} (8, 2, \texttt {inc}), &{} (15, 9, \texttt {inc}), &{} (4, 10, \texttt {inc}), &{} (6, 10, \texttt {inc}), &{} (8, 10, \texttt {inc}), \\ (15, 3, \texttt {inc}), &{} (12, 3, \texttt {inc}), &{} (13, 3, \texttt {inc}), &{} (14, 3, \texttt {inc}). &{} &{} \end{array}$$

To facilitate the fetching of geometric features of theorems, OpenGeo has been extended by adding a data item, named feature, to store graphs generated automatically from formal representations of theorems.
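As an illustration of the (ID, Type) pairs and (FirstNodeID, SecondNodeID, Type) triples described above, the Simson graph could be held in plain Python structures along the following lines. This is a minimal sketch, not the storage format of the feature item in OpenGeo; the names `Node`, `Edge`, `edge_key`, `SIMSON_NODES`, and `SIMSON_EDGES` are hypothetical and introduced only for this example.

```python
from collections import Counter
from typing import NamedTuple


class Node(NamedTuple):
    id: int      # integer identifier assigned to the entity
    type: str    # entity type, e.g. "P", "L", "C", "D", "TRI"


class Edge(NamedTuple):
    first: int   # ID of one endpoint
    second: int  # ID of the other endpoint
    type: str    # relation type, e.g. "inc", "perp", "tri"


def edge_key(edge):
    """Order-independent key: edges are undirected, so (0, 5) equals (5, 0)."""
    return frozenset((edge.first, edge.second)), edge.type


# Node types of the Simson graph listed in the text, indexed by node ID 0..15.
SIMSON_NODE_TYPES = ["L", "L", "L", "C", "P", "L", "P", "L",
                     "P", "L", "L", "TRI", "P", "P", "P", "P"]
SIMSON_NODES = [Node(i, t) for i, t in enumerate(SIMSON_NODE_TYPES)]

# Edges of the Simson graph listed in the text, grouped by relation type.
SIMSON_EDGES = (
    [Edge(a, b, "perp") for a, b in [(0, 5), (1, 7), (2, 9)]]
    + [Edge(a, 11, "tri") for a in (12, 13, 14)]
    + [Edge(a, b, "inc") for a, b in [
        (12, 0), (13, 0), (12, 1), (14, 1), (13, 2), (14, 2),
        (4, 5), (4, 0), (15, 5), (6, 7), (6, 1), (15, 7),
        (8, 9), (8, 2), (15, 9), (4, 10), (6, 10), (8, 10),
        (15, 3), (12, 3), (13, 3), (14, 3)]]
)

# Number of edges per relation type, as used later for filtering.
edge_type_counts = Counter(e.type for e in SIMSON_EDGES)
```

Because node and edge types are fixed, an order-independent key such as `edge_key` is enough to compare edges without storing a direction.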

3 Searching for Geometric Theorems in OpenGeo

The process of searching for geometric theorems in OpenGeo consists of three main steps: (1) retrieving geometric features from an input image of diagram; (2) filtering out irrelevant theorems using the retrieved features; (3) matching the features of each remaining theorem with the features of the diagram in the input image to obtain theorems of high relevance. We detail these steps in the following three subsections.

3.1 Retrieving Geometric Features from Diagrams

Chen, Song, and Wang [2, 17] proposed a method to detect basic geometric entities (such as points, lines, and circles), to recognize labels of basic geometric entities, and to mine basic geometric relations (such as incidence, parallelism, perpendicularity, and equivalence) from images of diagrams by using techniques and tools of pattern recognition and numerical verification. The retrieved geometric features, represented in GDL, can be transformed into graph representations. For example, the following graph representation may be produced for the diagram shown in Fig. 2.

Fig. 2. An image of diagram as query input

  • Nodes:

    $$\begin{array}{lllllll} ~(0, \texttt {P}),~&{}~(1, \texttt {L}),~&{}~(2, \texttt {D}),~&{}~(3, \texttt {D}),~&{}~(4, \texttt {P}),~&{}~(5, \texttt {L}),~&{}~(6, \texttt {L}),\\ ~(7, \texttt {L}),~&{}~(8, \texttt {C}),~&{}~(9, \texttt {L}),~&{}~(10, \texttt {L}),~&{}~(11, \texttt {L}),~&{}~(12, \texttt {P}),~&{}~(13, \texttt {D}),\\ ~(14, \texttt {D}),~&{}~(15, \texttt {D}),~&{}~(16, \texttt {TRI}),~&{}~(17, \texttt {L}),~&{}~(18, \texttt {L}),~ &{}(19, \texttt {L}),~&{}~(20, \texttt {P}),\\ ~(21, \texttt {P}),~&{}~(22, \texttt {P}).&{} &{} &{} &{} &{} \end{array}$$
  • Edges:

    $$\begin{array}{llllll} (0, 2, \texttt {ind}), &{} (21, 2, \texttt {ind}), &{} (0, 3, \texttt {ind}), &{} (22, 3, \texttt {ind}), &{} (4, 13, \texttt {ind}), &{} (20, 13, \texttt {ind}), \\ (4, 14, \texttt {ind}), &{} (21, 14, \texttt {ind}), &{} (4, 15, \texttt {ind}), &{} (22, 15, \texttt {ind}), &{} (1, 9, \texttt {perp}), &{} (5, 10, \texttt {perp}), \\ (6, 11, \texttt {perp}), &{} (0, 1, \texttt {inc}), &{} (21, 1, \texttt {inc}), &{} (22, 1, \texttt {inc} ), &{} (22, 5, \texttt {inc}), &{} (20, 5, \texttt {inc}), \\ (20, 6, \texttt {inc}), &{} (21, 6, \texttt {inc}), &{} (4, 7, \texttt {inc}), &{} (0, 7, \texttt {inc}), &{} (20, 9, \texttt {inc}), &{} (21, 10, \texttt {inc}), \\ (22, 11, \texttt {inc}), &{} (9, 12, \texttt {inc}), &{} (10, 12, \texttt {inc}), &{} (11, 12, \texttt {inc}), &{} (4, 17, \texttt {inc}), &{} (20, 17, \texttt {inc}), \\ (4, 18, \texttt {inc}), &{} (21, 18, \texttt {inc}), &{} (4, 19, \texttt {inc}), &{} (22, 19, \texttt {inc}), &{} (20, 8, \texttt {inc}), &{} (21, 8, \texttt {inc}), \\ (22, 8, \texttt {inc}), &{} (20, 16, \texttt {tri}), &{} (21, 16, \texttt {tri}), &{} (22, 16, \texttt {tri}), &{} (2, 3, \texttt {equ}), &{} (13, 15, \texttt {equ}), \\ (14, 15, \texttt {equ}). &{} &{} &{} &{} &{} \end{array}$$

Geometric information retrieved from images of diagrams may contain entities and relations irrelevant to the geometric features that the diagrams are expected to illustrate. To obtain geometric features for the purpose of searching, the following simple rules may be applied to remove redundant points, lines, and circles.

  • If a point is involved in no more than two relations and types of these relations are inc, then remove this point and the corresponding relations. For example, if a point in a diagram is just the intersection of two lines, then it does not show any important geometric features of the diagram. Therefore, this point and the two relations of inc can be removed.

  • If a line or a circle is not involved in any relations, then remove it. In other words, a line or a circle without relations in a diagram does not show any important geometric features of the diagram, so it can be removed.
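The two simplification rules above can be sketched as follows. This is only an illustrative reading of the rules: it assumes that the rules are applied in a single pass (points first, then lines and circles), that a point with no relations at all also counts as redundant, and that nodes are given as a dictionary from ID to type and edges as (FirstNodeID, SecondNodeID, Type) triples. The function name `simplify` is hypothetical.

```python
def simplify(nodes, edges):
    """Remove redundant points, lines, and circles from a feature graph.

    nodes: dict mapping node ID -> type ("P", "L", "C", ...)
    edges: list of (first_id, second_id, relation_type) triples
    Returns a new (nodes, edges) pair; the inputs are not modified.
    """
    # Collect the relations every node takes part in.
    incident = {n: [] for n in nodes}
    for edge in edges:
        incident[edge[0]].append(edge)
        incident[edge[1]].append(edge)

    # Rule 1: a point involved in at most two relations, all of type inc,
    # is removed together with those relations.
    dropped = {
        n for n, t in nodes.items()
        if t == "P"
        and len(incident[n]) <= 2
        and all(e[2] == "inc" for e in incident[n])
    }
    kept_edges = [e for e in edges
                  if e[0] not in dropped and e[1] not in dropped]

    # Rule 2: a line or circle that is not involved in any remaining
    # relation is removed.
    used = {n for e in kept_edges for n in e[:2]}
    kept_nodes = {
        n: t for n, t in nodes.items()
        if n not in dropped and not (t in ("L", "C") and n not in used)
    }
    return kept_nodes, kept_edges
```

Whether the rules should instead be iterated until nothing changes is a design choice that the text leaves open; the single pass here is the more conservative option.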

3.2 Filtering Out Irrelevant Theorems Using Features

For efficient searching in OpenGeo, it is necessary to filter out irrelevant theorems before starting the process of feature matching. In view of the importance of geometric relations in the construction of diagrams, we adopt the following rules to filter out some irrelevant theorems.

For any graph s produced for a theorem in OpenGeo and a graph q produced for the diagram in the input image, let the types of edges of s and q be collected in sets \(\mathbf {C}_s\) and \(\mathbf {C}_q\) and the numbers of edges of the same types of s and q be collected in sets \(\mathbf {N}_s\) and \(\mathbf {N}_q\), respectively.

  1. If \(\mathbf {C}_q\setminus \mathbf {C}_s\ne \emptyset \), then the theorem with graph s is considered as irrelevant.

  2. If \(\mathbf {C}_q \subset \mathbf {C}_s\) and there exists a c in \(\mathbf {C}_q\) such that the number of edges of type c in \(\mathbf {N}_q\) is greater than that in \(\mathbf {N}_s\), then the theorem with graph s is considered as irrelevant.

If either of the two conditions is satisfied, it is impossible that q is a subgraph of s. So the theorem with graph s is irrelevant to the query.
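The two filtering rules can be sketched with per-type edge counts. The following is a minimal illustration, assuming edges are given as (FirstNodeID, SecondNodeID, Type) triples; the function names `is_irrelevant` and `filter_candidates` are hypothetical and not taken from the implementation described in this paper.

```python
from collections import Counter


def is_irrelevant(query_edges, theorem_edges):
    """Return True if the theorem graph s cannot contain the query graph q."""
    n_q = Counter(edge_type for _, _, edge_type in query_edges)
    n_s = Counter(edge_type for _, _, edge_type in theorem_edges)

    # Rule 1: the query uses an edge type that the theorem does not have.
    if set(n_q) - set(n_s):
        return True

    # Rule 2: for some type, the query has more edges than the theorem.
    return any(n_q[c] > n_s[c] for c in n_q)


def filter_candidates(query_edges, theorems):
    """Keep the names of theorems that survive both rules.

    theorems: dict mapping a theorem name -> list of edge triples
    """
    return [name for name, s_edges in theorems.items()
            if not is_irrelevant(query_edges, s_edges)]
```

Both checks only need edge-type counts, so they are cheap compared with subgraph matching and can be run over the whole knowledge base before any matching starts.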

3.3 Matching Geometric Objects and Relations

By representing geometric features as graphs (see Sect. 2.2), the problem of feature matching can be converted to that of graph matching. For the latter there is a universal method, called GraphGrep and introduced by Giugno and Shasha [7]. This method proceeds by first creating a database, then parsing the query graph and filtering the database, and finally finding subgraphs matching the query graph. The resulting graphs produced by GraphGrep contain the query graph as a subgraph. With such exact matching, it is hardly possible to find theorems which are relevant to the query diagram only to some degree.

What we actually want is inexact matching. To achieve this, we add a weakening process before using GraphGrep, that is, first weakening the query graph by eliminating certain nodes and edges and then using GraphGrep to find graphs for theorems in OpenGeo that match the weakened query graph exactly. The following steps can be used to weaken the query graph.

  1. Compute weights of nodes. Let \(\mathbf {R}\) be the set {inc, ind, perp, para, equ, tri, quad} of types of edges, where the weight for each \(\texttt {T}\in \mathbf {R}\) is pre-given and denoted by \(w_\texttt {T}\). Let \(\mathbf {W_R} = [w_\texttt {inc}, w_\texttt {ind}, w_\texttt {perp}, w_\texttt {para}, w_\texttt {equ}, w_\texttt {tri}, w_\texttt {quad}]\), \(\mathbf {V} = \{v_1, v_2,\ldots , v_p\}\) be the set of nodes, and \(\mathbf {E} = \{e_1, e_2,\ldots , e_q\}\) be the set of edges of the query graph. The weight of an edge \(e_i\), denoted by \(w_{i}^{e}\), is defined to be \(w_{\texttt {T}_{i}}\), where \(\texttt {T}_{i}\) is the type of \(e_{i}\), and the weight of a node \(v_i\) connected by \(e_{i_{1}}, e_{i_{2}},\ldots ,e_{i_{n}}~(1\le i_{1},i_{2},\ldots ,i_{n}~\le ~q)\) is defined to be \(w_{\texttt {T}_{i_{1}}}+ w_{\texttt {T}_{i_{2}}}+\cdots +w_{\texttt {T}_{i_{n}}}\). Let the weight of \(v_i\) be denoted by \(w_{i}^v\) and \(\mathbf {W_V} = [w_{1}^v, w_{2}^v, \ldots , w_{p}^v]\).

  2. Sort nodes with respect to a specific order. Let the types of nodes be ordered as D \(\lessdot \) P \(\lessdot \) L \(\lessdot \) C \(\lessdot \) TRI \(\lessdot \) QUAD. Sort the nodes in \(\mathbf {V}\) with respect to the order \(\prec \), introduced according to the following rules:

    (a) if \(w_{i}^v < w_{j}^v\), then \(v_i \prec v_j\); if \(w_{i}^v = w_{j}^v\), then go to (b);

    (b) if the number of edges connected to \(v_i\) is greater than that of edges connected to \(v_j\), then \(v_i \prec v_j\); if the two numbers are equal, then go to (c);

    (c) if the type of \(v_i\) \(\lessdot \) the type of \(v_j\), then \(v_i \prec v_j\); if the types are identical, then go to (d);

    (d) if the ID of \(v_i\) is less than that of \(v_j\), then \(v_i \prec v_j\).

  3. Remove a node and the edges connected to the node. Let the nodes of the query graph be ordered as \(v_{s_1} \prec v_{s_2} \prec \cdots \prec v_{s_p}\) and denote by \(\mathbf {E}_{v_{s_i}}(1\le i \le p)\) the set of edges that are connected to \(v_{s_i}\). Then \(\mathbf {V}\setminus \{v_{s_i}\}\) and \(\mathbf {E}\setminus \mathbf {E}_{v_{s_i}}\) are respectively the set of nodes and the set of edges of the weakened graph, obtained from the query graph by removing the node \(v_{s_i}\) from \(\mathbf {V}\) and all the edges connected to \(v_{s_i}\) from \(\mathbf {E}\).
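The three weakening steps above can be sketched as follows. The sketch assumes that each weakening removes the smallest node with respect to \(\prec \) (that is, \(v_{s_1}\)), which is one natural reading of step 3, and that nodes are given as a dictionary from ID to type and edges as (FirstNodeID, SecondNodeID, Type) triples. The function names and any weight values passed in are hypothetical; the paper's pre-given weights are not reproduced here.

```python
# Rank of node types in the order D < P < L < C < TRI < QUAD.
TYPE_RANK = {"D": 0, "P": 1, "L": 2, "C": 3, "TRI": 4, "QUAD": 5}


def node_weights(nodes, edges, type_weights):
    """Step 1: weight of a node = sum of the weights of its edges' types."""
    weight = {n: 0 for n in nodes}
    degree = {n: 0 for n in nodes}
    for first, second, edge_type in edges:
        for n in (first, second):
            weight[n] += type_weights[edge_type]
            degree[n] += 1
    return weight, degree


def sort_nodes(nodes, edges, type_weights):
    """Step 2: sort node IDs by rules (a)-(d).

    (a) smaller weight first, (b) on ties, more edges first,
    (c) then lower type rank first, (d) then smaller ID first.
    """
    weight, degree = node_weights(nodes, edges, type_weights)
    return sorted(
        nodes,
        key=lambda n: (weight[n], -degree[n], TYPE_RANK[nodes[n]], n),
    )


def weaken(nodes, edges, type_weights):
    """Step 3: remove the first node in the order and all its edges.

    Assumes the graph has at least one node.
    """
    victim = sort_nodes(nodes, edges, type_weights)[0]
    new_nodes = {n: t for n, t in nodes.items() if n != victim}
    new_edges = [e for e in edges if victim not in e[:2]]
    return new_nodes, new_edges
```

Encoding rules (a) to (d) as a single tuple key relies on Python's lexicographic tuple comparison, so each later rule is consulted only when all earlier components are equal. Integer weights keep the tie test in rule (a) exact; with floating-point weights, ties may need a tolerance.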

4 Processing Results of Searching

Using the method of inexact matching presented in the preceding section, one can find a set of theorems in OpenGeo whose graphs match the query graph of the diagram in the given image. It remains to rank the found theorems, so that those which are most relevant to what the diagram may illustrate are placed on the top.

4.1 Computing Degrees of Relevance

Given the image of a diagram \(\text {D}\) as query input, we want to define, for each theorem \(\text {T}\) whose graph matches the graph of \(\text {D}\), a quantity \(\mathrm{rel}_{\text {T}}^{\text {D}}\), ranging from \(0\,\%\) to \(100\,\%\) and called the degree of relevance between \(\text {D}\) and \(\text {T}\), to measure how relevant \(\text {T}\) is to \(\text {D}\). For two theorems \(\text {T}_1\) and \(\text {T}_2\), if \(\mathrm{rel}_{\text {T}_1}^{\text {D}} < \mathrm{rel}_{\text {T}_2}^{\text {D}}\), then theorem \(\text {T}_2\) is said to be more relevant to \(\text {D}\) than theorem \(\text {T}_1\). The degree of relevance should meet the following three requirements.

  • Complete. Let \((V,E)\) and \((V_\text {T},E_\text {T})\) be the graph representations for \(\text {D}\) and \(\text {T}\), respectively. If \(V=V_\text {T}\) and \(E=E_\text {T}\), then \(\mathrm{rel}_\text {T}^{\text {D}} = 100\,\%\); if \(V\cap V_\text {T}=\emptyset \) and \(E\cap E_\text {T}=\emptyset \), then \(\mathrm{rel}_\text {T}^{\text {D}}=0\,\%\).

  • Intuitive. Let \((V,E)\), \((V_{\text {T}_1},E_{\text {T}_1})\), and \((V_{\text {T}_2},E_{\text {T}_2})\) be the graph representations for diagram \(\text {D}\) and theorems \(\text {T}_1\) and \(\text {T}_2\), respectively, and let \(\mathrm{m}_k=\frac{|V\cap V_{\text {T}_k}|+|E\cap E_{\text {T}_k}|}{|V_{\text {T}_k}|+|E_{\text {T}_k}|}\) for \(k=1,2\). If \(\mathrm{m}_{k_1} < \mathrm{m}_{k_2}\), then \(\mathrm{rel}_{\text {T}_{k_1}}^{\text {D}} < \mathrm{rel}_{\text {T}_{k_2}}^{\text {D}}\) \((k_1,k_2\in \{1,2\})\).

  • Orderly. Let \(\text {D}_n\) be the diagram for which the graph is obtained by weakening the query graph n times (\(n=1,2,\ldots , |V|\)). Suppose that theorems \(\text {T}_a\) and \(\text {T}_b\) match \(\text {D}_a\) and \(\text {D}_b\), respectively. If \(a < b\), then \(\mathrm{rel}_{\text {T}_a}^{\text {D}} > \mathrm{rel}_{\text {T}_b}^{\text {D}}\).

The degree of relevance may be defined in different ways to meet the above requirements. In what follows, we provide one definition and show its soundness. Similarity of graphs has been studied in the past. Maximum common edge subgraphs are used for calculation of graph similarity in [15] and Dehmer and others [5] use generalized trees which are directed and hierarchical graphs to measure structural similarity of graphs. Most of the methods focus on general graphs. Our method is based on weighted and undirected graphs and takes into account geometric characteristics.

Let

$$\begin{aligned} \mathbf {G}_{\text {D}_g} = ( \{v_{r_{g,1}}, v_{r_{g,2}},\ldots , v_{r_{g,m_g}}\}, \{e_{r_{g,1}}, e_{r_{g,2}},\ldots , e_{r_{g,n_g}}\}) \end{aligned}$$

be the representation of the graph resulting from \(\mathbf {G}_{\text {D}}\) after being weakened g times (\(g=0,1,\ldots ,m_0-1\)), and

$$\begin{aligned} \mathbf {G}_{\text {T}_g} = ( \{v_{t_{g,1}}, v_{t_{g,2}},\ldots , v_{t_{g,l_g}}\}, \{e_{t_{g,1}}, e_{t_{g,2}},\ldots , e_{t_{g,h_g}}\}) \end{aligned}$$

be the graph representation for a theorem \(\text {T}_g\) whose graph matches the query graph of \(\text {D}_g\) exactly. Let the set \(\mathbf {W_{R}}\) of weights be given, the set of weights of edges in the graph of \(\text {D}\) be \(\{w_{r_{0,1}}^e, w_{r_{0,2}}^e,\ldots , w_{r_{0,n_0}}^e\}\), and the set of weights of edges in the graph of \(\text {D}_g\) be \(\{w_{r_{g,1}}^e, w_{r_{g,2}}^e,\ldots , w_{r_{g,n_g}}^e\}\). Then the degree of relevance between \(\text {T}_g\) and \(\text {D}\) is defined as

$$\begin{aligned} \mathrm{rel}_{\text {T}_g}^{\text {D}} = \mathrm{mat}_{g} \cdot (\mathrm{mtr}_{{g}} - \mathrm{mtr}_{{g+1}}) + \mathrm{mtr}_{{g+1}}, \end{aligned} \tag{1}$$

where

$$\begin{aligned} \mathrm{mat}_{g} = \displaystyle \frac{1}{2} \cdot \left( \frac{m_g}{l_g} + \frac{n_g}{h_g}\right) , \end{aligned} \tag{2}$$

$$\begin{aligned} \mathrm{mtr}_{k} = \displaystyle \frac{1}{2} \cdot \left( \frac{m_k}{m_0} + \frac{\sum _{j=1}^{n_k} w_{r_{k,j}}^e}{\sum _{j=1}^{n_0} w_{r_{0,j}}^e}\right) ,\quad k=g,\, g+1, \end{aligned} \tag{3}$$

and \(g=0,1,\ldots ,m_0-1\). In the above definition, \(\mathrm{mat}_{{g}}\) and \(\mathrm{mtr}_{{g}}\) measure the degree of matching between \(\mathbf {G}_{\text {D}_g}\) and \(\mathbf {G}_{\text {T}_g}\) and the degree of matching between \(\mathbf {G}_{\text {D}_g}\) and \(\mathbf {G}_{\text {D}}\), respectively.
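Formulas (1)–(3) translate directly into code. The following minimal sketch assumes that the node counts, edge counts, and edge-weight sums have already been computed, and that the query graph and the theorem graph each have at least one edge, so that no denominator is zero. The function names and the numbers in the example are hypothetical and are not taken from the experiments.

```python
def mat(m_g, n_g, l_g, h_g):
    """Formula (2): matching degree between G_{D_g} and G_{T_g}.

    m_g, n_g: numbers of nodes and edges of the weakened query graph
    l_g, h_g: numbers of nodes and edges of the theorem graph
    """
    return 0.5 * (m_g / l_g + n_g / h_g)


def mtr(m_k, weight_sum_k, m_0, weight_sum_0):
    """Formula (3): matching degree between G_{D_k} and G_D.

    m_k, weight_sum_k: node count and total edge weight after k weakenings
    m_0, weight_sum_0: node count and total edge weight of the query graph
    """
    return 0.5 * (m_k / m_0 + weight_sum_k / weight_sum_0)


def relevance(mat_g, mtr_g, mtr_next):
    """Formula (1): rel = mat_g * (mtr_g - mtr_{g+1}) + mtr_{g+1}."""
    return mat_g * (mtr_g - mtr_next) + mtr_next


# Hypothetical example: a query graph with 4 nodes and total edge weight 6
# that keeps 3 nodes and total edge weight 5 after one weakening, matched
# without weakening (g = 0) against a theorem graph with 8 nodes and 12 edges.
mtr_0 = mtr(4, 6, 4, 6)
mtr_1 = mtr(3, 5, 4, 6)
mat_0 = mat(4, 4, 8, 12)
rel_0 = relevance(mat_0, mtr_0, mtr_1)
```

Formula (1) interpolates linearly between \(\mathrm{mtr}_{g+1}\) and \(\mathrm{mtr}_{g}\) with \(\mathrm{mat}_{g}\) as the interpolation parameter, so a theorem found after g weakenings receives a score within the band determined by those two values.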

Assertion. The degree of relevance defined above is complete, intuitive, and orderly.

The correctness of this assertion can be seen from the following arguments.

  1. Complete. If \(\mathbf {G_{\text {D}}}\) and \(\mathbf {G_{\text {T}_0}}\) are equivalent, then \(\mathbf {G_{\text {D}}}\), \(\mathbf {G}_{\text {D}_0}\), and \(\mathbf {G_{\text {T}_0}}\) are all the same. Therefore, \(\mathrm{mtr}_{0} = 1\), \(\mathrm{mat}_{0} =1\), and thus \(\mathrm{rel}_{\text {T}_0}^{\text {D}} = 1\), which means that the degree of relevance is 100 %. If for any theorem \(\text {T}_0\), there is neither node nor edge of \(\mathbf {G_{\text {D}}}\) which matches the nodes or edges of \(\mathbf {G_{\text {T}_0}}\), then \(\mathrm{mat}_{0} =0\), \(\mathrm{mtr}_{{1}} = 0\), and thus \(\mathrm{rel}_{\text {T}_0}^{\text {D}} = 0\), which means that the degree of relevance is 0 %.

  2. Intuitive. According to the definition, \(\mathrm{mtr}_{g}-\mathrm{mtr}_{{g+1}}>0\) holds, because each weakening removes one node (so \(m_{g+1} < m_g\)) and, provided that the weights are nonnegative, cannot increase the total weight of the remaining edges. Therefore, when \(\mathrm{mtr}_{g}\) and \(\mathrm{mtr}_{{g+1}}\) are fixed, the larger \(\mathrm{mat}_{g}\) is, the higher \(\mathrm{rel}_{\text {T}_g}^{\text {D}}\) is.

  3. Orderly. From the formulae in the definition, it is easy to deduce that \(\mathrm{rel}_{\text {T}_g}^{\text {D}} > \mathrm{mtr}_{{g+1}}\) (because \(\mathrm{mat}_{g} > 0\)) and \(\mathrm{rel}_{\text {T}_{g+1}}^{\text {D}} \le \mathrm{mtr}_{{g+1}}\) (because \(\mathrm{mat}_{g+1} \le 1\)). Therefore, \(\mathrm{rel}_{\text {T}_g}^{\text {D}} > \mathrm{rel}_{\text {T}_{g+1}}^{\text {D}}\).

4.2 Ranking the Results

Retrieved theorems can be ranked according to their degrees of relevance to the query diagram. For example, if five theorems \(\text {T}_1,\ldots ,\text {T}_5\) are found with degrees of relevance 85 %, 90 %, 45 %, 92 %, and 79 %, respectively, then they are ranked top-down in the order \(\text {T}_4,\text {T}_2,\text {T}_1,\text {T}_5,\text {T}_3\). From the ranking, it is easy to see which theorems are most relevant to the query input.
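The ranking in this example amounts to sorting by degree of relevance in descending order. A minimal sketch, with the hypothetical helper name `rank_theorems` and the five degrees from the example above:

```python
def rank_theorems(degrees):
    """Return theorem names sorted from most to least relevant.

    degrees: dict mapping a theorem name -> degree of relevance in [0, 1]
    """
    return sorted(degrees, key=degrees.get, reverse=True)


# The five degrees of relevance from the example in the text.
example = {"T1": 0.85, "T2": 0.90, "T3": 0.45, "T4": 0.92, "T5": 0.79}
ranking = rank_theorems(example)  # ['T4', 'T2', 'T1', 'T5', 'T3']
```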

5 Implementation and Experimental Results

Now we explain how the searching method presented in the previous sections has been implemented using Python and provide some experimental results to show the performance of the method with our preliminary implementation.

5.1 Implementation Issues

The searching procedure contains five modules: parsing, filtering, exact matching (GraphGrep), similarity measuring, and reducing. Through the parsing module, both the input image of a query diagram and the formal representations of theorems in OpenGeo are parsed to yield graph representations of geometric features. By comparing the numbers of entities and relations in the query graph with those in the graph of each theorem in OpenGeo, the filtering module serves to reduce search space and produce a set of candidate graphs. Then GraphGrep is used to determine which candidate graphs match the query graph exactly. For each resulting graph after exact matching, the degree of relevance between this graph and the query graph is calculated in the module of similarity measuring. While the degrees of relevance are higher than a pre-specified percentage (threshold) and the given number of weakening operations is not reached, the query graph is (further) weakened in the reducing module and the procedure repeats with the weakened query graph instead of the query graph.

Fig. 3. (a) Searching result of exact matching; (b)–(c) Searching results after weakening the query graph once; (d) Searching result after weakening the query graph twice.

Table 3. Selected experimental results

5.2 Examples and Experiments

To see how well the searching procedure performs, let us take the image of the diagram shown in Fig. 2 as an example. With this image as query input, the procedure can find one theorem in OpenGeo, which is illustrated by the diagram shown in Fig. 3(a). The degree of relevance between this found theorem and the query diagram is 99.61 %. If the query graph is weakened once, then two other theorems from OpenGeo, illustrated by the diagrams shown in Fig. 3(b) and (c), can be found and the degrees of relevance between these two theorems and the query diagram are 81.97 % and 81.04 %, respectively. If the query graph is weakened twice, then another theorem, illustrated by the diagram shown in Fig. 3(d), can be found and the degree of relevance between this theorem and the query diagram is 71.95 %. These searching results indicate that the procedure we have implemented is capable of finding geometric theorems with images of diagrams as query input and the measure we have introduced for the degrees of relevance between found theorems and query diagrams is sound.

We have run experiments on more than 40 images of diagrams (scanned from the book [4]) to test our searching procedure. Selected experimental results are given in Table 3. The first column shows the input query images and the second column presents the number of theorems stored in OpenGeo for searching. With each input query image, the procedure may find some theorems, the numbers of which are recorded in the three sub-columns of the third column: the first sub-column counts the found theorems whose degrees of relevance to the query diagram belong to the interval \((80\,\%, 100\,\%]\), and the second and third sub-columns count the found theorems whose degrees of relevance belong to the intervals \((60\,\%, 80\,\%]\) and \((50\,\%, 60\,\%]\), respectively. The last column of Table 3 shows the running time of the searching procedure. The experimental results demonstrate that our searching procedure works effectively for most input query images. For the input images in the last two rows of Table 3, the procedure can only find theorems whose degrees of relevance are less than 50 %.

6 Conclusion and Future Work

We have proposed a method to tackle the problem of searching for geometric theorems with images of diagrams as query input. The method uses geometric features retrieved from the images of diagrams and is based on graph matching. It also uses weakened query graphs as a bridge from the query graph to relevant theorems. It is capable not only of finding theorems which the query diagrams likely illustrate, but also of ranking the found theorems according to their degrees of relevance to the query diagrams. Preliminary experiments show that our method and its implementation work effectively for theorem searching in OpenGeo. In future work, we will improve and generalize the method (e.g., by including more types of geometric entities and relations), extend our implementation to search for theorems in other geometric knowledge bases (e.g., graph databases), try various methods to calculate degrees of relevance, and develop a user-friendly interface for geometric theorem processing.