
1 Introduction

Semantic Based Regularization (SBR) [5] is a Statistical Relational Learning (SRL) framework which integrates the ability to learn from examples and data distributions, as in traditional semi-supervised learning, with the inference process over high-level background knowledge that is typical of logic reasoning. Prior knowledge in SBR is expressed via a set of First Order Logic (FOL) clauses expressing relationships among the tasks, relationships among the patterns, or a partial definition of the mapping between the input and the output. The main advantage of SBR over other Statistical Relational Learning approaches like Markov Logic Networks (MLNs) [14] or Probabilistic Soft Logic (PSL) [3] is its tighter integration of logic with the processing of the feature-based continuous sensorial input that is available in many real-world applications. Indeed, while an MLN can capture a logistic regression model [7, 8], it requires dealing with a large number of weights and groundings. More complex correlations between features and classes cannot be captured, as the resulting models would be too large to be tractable.

Deep Neural Networks [16] have been shown to be relatively successful at performing feature selection and inference over pattern constituents in their hidden layers. However, this process is opaque, and it is generally unclear how much training data is required to correctly instantiate it during training. SBR provides a way to integrate (deep or shallow) learning with any explicit knowledge about the task at hand, making the learning process more controlled, easier to understand, and less demanding in terms of labeled data. In applications where this knowledge is available, it seems natural to exploit it to force the learning machine to develop more targeted intermediate pattern representations.

Unfortunately, learning is typically hard for all SRL approaches with high generality: the integration of learning with logic inference transforms the intractability of the latter (in a general setting) into the complexity of the numerical optimization problem that needs to be solved during learning. This issue also applies to SBR, for which getting good solutions often requires heuristics and experience. Some attempts at breaking the complexity of learning by subdividing the learning process into small and easier sequential tasks have been hinted at by Bengio et al. [2] and later studied in a more systematic way by Yang et al. [19] and Friesen et al. [9]. This paper studies under which conditions training in SBR becomes easy. In particular, it will be shown that there exists a large class of knowledge that can be expressed as a set of convex constraints in SBR. These constraints can be exploited during training in a very efficient and effective way, and they provide a natural way to break the complexity of learning by building a training plan that uses the convex constraints as an effective initialization step for the final full optimization problem. The experimental results show the effectiveness of this training plan. Another contribution of this paper is to employ Neural Networks for the first time in the context of SBR, showing the generality and flexibility of the framework. Experimental results on image classification are presented to validate the approach.

The paper is organized as follows: Sect. 2 provides an introduction to the SBR learning framework and Sect. 3 shows how to build real-valued constraints from a FOL knowledge base. The experimental results are shown in Sect. 4. Finally, some conclusions are drawn in Sect. 5.

2 Semantic Based Regularization

Consider a multi-task learning problem where a set of T functions (the query or unknown functions) must be estimated. Let \({\varvec{f}}=\{ f_1, \ldots , f_T\}\) indicate the vector of functions.

A set of H functional constraints in the form \(1-\varPhi _h({\varvec{f}})=0, 0 \le \varPhi _h({\varvec{f}}) \le 1, h=1,\ldots ,H\) are provided to describe how the query functions should behave. These functionals can express a property of a single function or correlate multiple functions, so that learning can be helped by exploiting these correlations.

The j-th function is associated to a set \({\varvec{\mathcal {X}}}^\circ _j\), which is a sample of the patterns input to the function. Each pattern in this set is represented via a vector of features. We assume that this set of patterns is partially labeled, so that the desired function output is also provided for some patterns in the sample. Multiple functions can share the same sample of patterns (e.g. \({\varvec{\mathcal {X}}}^\circ _j={\varvec{\mathcal {X}}}^\circ _i, ~ i \ne j\)). Some functions may express relations across multiple patterns, and the pattern representations associated to these functions can be generally expressed as combinations of the patterns from a set of finite domains: \({\varvec{\mathcal {X}}}^\circ _j = {\varvec{\mathcal {X}}}_{j1} \times {\varvec{\mathcal {X}}}_{j2} \times \ldots \).

Let \(f_k({\varvec{\mathcal {X}}}^\circ _k)\) indicate the vector of values obtained by applying the function \(f_k\) to the set of patterns \({\varvec{\mathcal {X}}}^\circ _k\) and \({\varvec{f}}( {\varvec{\mathcal {X}}} ) = f_1({\varvec{\mathcal {X}}}^\circ _1) \cup f_2({\varvec{\mathcal {X}}}^\circ _2) \cup \ldots \) collects the groundings for all functions.

The satisfaction of the constraints can be enforced by penalizing their violation on the sample of data:

$$\begin{aligned} \begin{array}{rcl} C_{e}[{\varvec{f}}( {\varvec{\mathcal {X}}} )]= & {} \sum \limits _{k=1}^T ||f_k||^2+ \sum \limits _{h=1}^H \lambda _h \Big ( 1 - \varPhi _h \big ({\varvec{f}}({\varvec{\mathcal {X}}})\big ) \Big ), \end{array} \end{aligned}$$
(1)

where the first term is a regularization term penalizing non-smooth solutions and \(\lambda _h\) is the weight for the h-th constraint.
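As a concrete sketch of Eq. (1), the cost can be computed as a squared-norm regularizer plus the weighted constraint penalties. The function name and toy values below are purely illustrative assumptions, not the paper's implementation:

```python
# Illustrative sketch of Eq. (1): sum_k ||f_k||^2 + sum_h lambda_h * (1 - Phi_h).
# 'weights' stands in for the parameters of the task functions, 'phis' for the
# constraint satisfaction degrees Phi_h in [0, 1].

def sbr_cost(weights, phis, lambdas):
    reg = sum(w_i * w_i for w in weights for w_i in w)       # sum_k ||f_k||^2
    penalty = sum(lam * (1.0 - phi) for lam, phi in zip(lambdas, phis))
    return reg + penalty

# a fully satisfied constraint (Phi = 1) contributes no penalty,
# while Phi = 0.8 adds lambda * 0.2 to the cost
cost = sbr_cost([[0.5, -0.5]], phis=[1.0, 0.8], lambdas=[1.0, 2.0])
```

Here `sbr_cost` returns 0.9: 0.5 from the regularizer and 0.4 from the partially violated second constraint.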

The weights are optimized via gradient descent using a back-propagation schema, where the derivative of the cost function with respect to the j-th weight of the i-th function \(w_{ij}\) is:

$$\begin{aligned} \frac{\partial C_{e}}{\partial w_{ij}} = \sum _k \frac{\partial C_{e}}{\partial \varPhi _k} \cdot \frac{\partial \varPhi _k}{\partial w_{ij}} = \sum _k \frac{\partial C_{e}}{\partial \varPhi _k} \cdot \left( \sum _{t_{\varPhi _k}} \frac{\partial \varPhi _k}{\partial {t_{\varPhi _k}}} \cdot \frac{\partial {t_{\varPhi _k}}}{\partial f_i} \cdot \frac{\partial f_i}{\partial w_{ij}}\right) . \end{aligned}$$
(2)
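Equation (2) is the chain rule applied through the constraint functional. The following toy check, under illustrative assumptions (a one-weight sigmoid unit for f and a constraint \(\varPhi \) that simply averages f over the sample), verifies the analytic chain-rule gradient against a numeric one:

```python
import math

# Toy instance of Eq. (2): the penalty lambda*(1 - Phi) is differentiated
# through Phi (the average of f over the sample) down to the weight w
# of f(x) = sigmoid(w * x). Names and values are illustrative.

def f(w, x):
    return 1.0 / (1.0 + math.exp(-w * x))

def penalty(w, xs, lam=2.0):
    phi = sum(f(w, x) for x in xs) / len(xs)
    return lam * (1.0 - phi)

def penalty_grad(w, xs, lam=2.0):
    # chain rule: dC/dw = dC/dPhi * sum_x dPhi/df(x) * df(x)/dw
    dC_dphi = -lam
    total = 0.0
    for x in xs:
        fx = f(w, x)
        total += (1.0 / len(xs)) * fx * (1.0 - fx) * x   # sigmoid derivative
    return dC_dphi * total

xs = [0.5, -1.0, 2.0]
w = 0.3
eps = 1e-6
numeric = (penalty(w + eps, xs) - penalty(w - eps, xs)) / (2 * eps)
assert abs(penalty_grad(w, xs) - numeric) < 1e-6
```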

2.1 Collective Classification

Collective classification (CC) [17] is the task of performing inference over a set of instances that are connected among each other via a set of relationships. Collective classification in SBR [6] enforces that the classification output is consistent with the FOL knowledge used during training.

In particular, let \(f_k({\varvec{\mathcal {X}}}^\prime _k)\) indicate the vector of values obtained by evaluating the function \(f_k\) over the data points of the test set \({\varvec{\mathcal {X}}}^\prime _k\). The set of vectors will be compactly referred to as: \({\varvec{f}}({\varvec{\mathcal {X}}}^\prime ) = f_1({\varvec{\mathcal {X}}}^\prime _1) \cup \ldots \cup f_T({\varvec{\mathcal {X}}}^\prime _T) \). If no neural network has been trained for \(f_k\) (no examples or no feature representations were available during training), \(f_k({\varvec{\mathcal {X}}}^\prime _k)\) is assumed to be just filled with default values equal to 0.5.

Collective classification searches for the values \(\bar{{\varvec{f}}}({\varvec{\mathcal {X}}}^\prime ) = \bar{f_1}({\varvec{\mathcal {X}}}^\prime _1) \cup \ldots \cup \bar{f_T}({\varvec{\mathcal {X}}}^\prime _T)\) respecting the FOL formulas on the test data, while being close to the prior values established by the trained predictors over the test data:

$$ C_{cc}[\bar{{\varvec{f}}}({\varvec{\mathcal {X}}}^\prime ), {\varvec{f}}({\varvec{\mathcal {X}}}^\prime )] = \frac{1}{2}\sum _{k=1}^{T}|\bar{f_k}({\varvec{\mathcal {X}}}^\prime _k) - f_k({\varvec{\mathcal {X}}}^\prime _k)|^2 + \sum _{h} \Big (1 - \varPhi _h\big (\bar{{\varvec{f}}} ( {\varvec{\mathcal {X}}}^\prime ) \big ) \Big ) $$

Optimization can be performed via gradient descent by computing the derivative with respect to the function values.
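A minimal sketch of this step, under illustrative assumptions (two outputs, a single Lukasiewicz implication constraint between them, and a numeric gradient), could look like:

```python
# Illustrative collective classification: adjust test-time outputs fbar toward
# the network priors f while enforcing the Lukasiewicz implication f1 => f2,
# whose truth degree is Phi = min(1, 1 - fbar1 + fbar2).

def cc_objective(fbar, prior):
    f1, f2 = fbar
    phi = min(1.0, 1.0 - f1 + f2)
    prox = 0.5 * sum((a - b) ** 2 for a, b in zip(fbar, prior))
    return prox + (1.0 - phi)

def collective_step(fbar, prior, lr=0.05, eps=1e-4):
    # one numeric gradient-descent step, clipping the outputs to [0, 1]
    base = cc_objective(fbar, prior)
    grad = []
    for i in range(len(fbar)):
        bumped = list(fbar)
        bumped[i] += eps
        grad.append((cc_objective(bumped, prior) - base) / eps)
    return [max(0.0, min(1.0, v - lr * g)) for v, g in zip(fbar, grad)]

prior = [0.9, 0.2]           # network outputs violating f1 => f2
fbar = list(prior)
for _ in range(200):
    fbar = collective_step(fbar, prior)
# fbar moves f1 down and f2 up until the implication is (softly) satisfied
```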

2.2 Logic and Constraints

This section will show how to convert any First Order Logic (FOL) knowledge into a set of constraints \(\varPhi _h\) that can be integrated into learning using Eq. 2.

Our approach is a variation of the fuzzy generalizations of First Order Logic first proposed by Novak [13], which can transform any FOL knowledge base into a real-valued constraint.

T-norm and Residuum. A t-norm fuzzy logic [11, 20] is defined by its t-norm \(t(a_1, a_2)\) that models the logical AND.

Given a Boolean variable \(\bar{a}\) with continuous generalization a in [0, 1], its negation \(\lnot \bar{a}\) corresponds to \(1-a\). Once the functions corresponding to \(\wedge \) and \(\lnot \) are defined, they can be composed to generalize any logic proposition. Different t-norm fuzzy logics have been proposed in the literature. For example, given two Boolean values \(\bar{a}_1, \bar{a}_2\) and their continuous generalizations \(a_1, a_2\) in [0, 1], the product t-norm is defined as: \( (\bar{a}_1 \wedge \bar{a}_2) ~\rightarrow ~ t(a_1,a_2) = a_1 \cdot a_2. \) The Lukasiewicz t-norm is instead defined as

$$ \begin{array}{lcl} (\bar{a}_1 \wedge \bar{a}_2) ~~~\rightarrow ~ ~~ t(a_1,a_2) = \max (0, a_1 + a_2 - 1). \end{array} $$

Any t-norm features a binary operator called residuum, which is used to generalize implications when dealing with continuous variables [11]. For example, the Lukasiewicz t-norm has a residuum defined as:

$$ (\bar{a}_1 \Rightarrow \bar{a}_2) ~~~\longrightarrow ~ ~~ t(a_1,a_2) = \left\{ \begin{array}{lcl} 1 &{} ~~~ &{}a_1 \le a_2 \\ 1 - a_1 + a_2 &{} ~~~ &{}a_1 > a_2 \end{array} \right. $$
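The operators above can be sketched directly in code (a minimal illustration; the function names are ours):

```python
# The two t-norms discussed above, plus the Lukasiewicz residuum.

def t_product(a1, a2):
    return a1 * a2

def t_lukasiewicz(a1, a2):
    return max(0.0, a1 + a2 - 1.0)

def residuum_lukasiewicz(a1, a2):
    # generalizes a1 => a2; fully satisfied whenever a1 <= a2
    return 1.0 if a1 <= a2 else 1.0 - a1 + a2

# at the Boolean extremes the operators match classical logic
assert t_product(1, 1) == 1 and t_lukasiewicz(1, 0) == 0
assert residuum_lukasiewicz(1, 0) == 0 and residuum_lukasiewicz(0, 1) == 1
```

Note that the residuum evaluates to 1 on a whole region of the unit square, not just at isolated points; this is what Sect. 3.1 exploits.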

Quantifiers. With no loss of generality, we focus our attention on FOL formulas in the Prenex Normal Form, having all the quantifiers at the beginning of the expression. The quantifier-free part of the expression is an assertion in fuzzy propositional logic once all the quantified variables are grounded. Hence, a t-norm fuzzy logic can be used to convert it into a continuous function. Let’s consider a FOL formula with variables \(x_1, x_2, \ldots \) assuming values in the finite sets \(\mathcal{X}_1, \mathcal{X}_2, \ldots \). \(\mathcal{P}=\{p_1, p_2, \ldots \}\) is the vector of predicates, where the j-th n-ary predicate is grounded from \(\mathcal{X}^\circ _{j} = \mathcal{X}_{j1} \times \mathcal{X}_{j2} \times \ldots \). Let \(p_j(\mathcal{X}^\circ _{j})\) indicate the set of possible groundings for the j-th predicate, and \(\mathcal{P}(\mathcal{X})\) indicate all possible grounded predicates, such that \(\mathcal{P}(\mathcal{X}) = p_1(\mathcal{X}^\circ _{1}) \cup p_2(\mathcal{X}^\circ _2) \cup \ldots \).

If the atoms \(\mathcal{P}(\mathcal{X})\) are generalized to assume real values in [0, 1], the degree of truth of a formula containing an expression E with a universally quantified variable \(x_i\) is the average of the t-norm generalization \(t_E(\cdot )\), when grounding \(x_i\) over \(\mathcal{X}_i\) (see Diligenti et al. [5] for more details):

$$ \begin{array}{l} {\forall x_i ~~ E\big (\mathcal{P}(\mathcal{X})\big ) ~~~\longrightarrow ~ ~~ \varPhi _\forall (\mathcal{P}\big ( \mathcal{X})\big ) = {\frac{1}{|\mathcal{X}_i|}}} \mathop {\sum }\limits _{{x_{i} \in {\mathcal{X}_i}}} {t_E\big (\mathcal{P}(\mathcal{X})\big )} \end{array} $$

For the existential quantifier, the truth degree is instead defined as the maximum of the t-norm expression over the domain of the quantified variable:

$$\exists x_i ~ E\big (\mathcal{P}(\mathcal{X})\big ) ~~ ~\longrightarrow ~ ~~ \varPhi _{\exists }\big (\mathcal{P}(\mathcal{X})\big ) = \max _{x_i \in \mathcal{X}_i} \; t_E\big (\mathcal{P}(\mathcal{X}) \big ) $$

When multiple universally or existentially quantified variables are present, the conversion is recursively performed from the outer to the inner variables. Please note that the fuzzy formula expression is continuous and differentiable with respect to the fuzzy value of a predicate, and it can therefore easily be integrated into learning.
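The two quantifier translations can be sketched as follows, assuming a hypothetical fuzzy unary predicate p with precomputed truth values:

```python
# Universal quantification averages the t-norm expression over the groundings
# of the quantified variable; existential quantification takes the maximum.

def forall_degree(domain, expr):
    return sum(expr(x) for x in domain) / len(domain)

def exists_degree(domain, expr):
    return max(expr(x) for x in domain)

p = {"a": 1.0, "b": 0.5, "c": 0.0}          # fuzzy truth values of p(x)
domain = list(p)

universal = forall_degree(domain, lambda x: p[x])     # (1.0 + 0.5 + 0.0) / 3
existential = exists_degree(domain, lambda x: p[x])   # max over the domain
```

For nested quantifiers, `expr` would itself invoke `forall_degree` or `exists_degree` on the inner variable, mirroring the outer-to-inner recursive conversion described above.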

2.3 Building Constraints from Logic

Let us assume that a knowledge base KB is available, consisting of a set of FOL formulas and a finite set of groundings of the variables. We assume that some of the predicates are unknown: the SBR learning process aims at finding a good approximation of each unknown predicate, so that the estimated predicates satisfy the FOL formulas over the sample of inputs. In particular, the function \(f_j(\cdot )\) will be learned as an approximation of the j-th unknown predicate. The variables in the KB that are input to any \(f_j\) are replaced with the feature-based representation of the object grounded by the variable, and we will indicate as \(\varvec{x}_i\) the representation of the object grounded by \(x_i\). The groundings \(\mathcal{X}_i\) of the i-th variable are therefore replaced by the set \({\varvec{\mathcal {X}}}_i\) of feature-based representations of the groundings. One constraint \(1-\varPhi _i(\cdot )=0\) is built for each formula \(F_i\) in the knowledge base by taking the fuzzy FOL generalization \(\varPhi _i(\cdot )\) of the formula, where the unknown predicates are replaced by the learned functions, and the variables input to the learned functions are replaced by their duals iterating over the feature-based representations of the groundings. Previous literature on Semantic Based Regularization [5, 6] has focused on Kernel Machines to implement the functions \(f_j(\cdot )\). However, the SBR framework does not pose any restriction on the machine learning machinery used to approximate the unknown functions. In particular, Neural Networks are used for the first time in the experimental section of this paper.

3 Constraints and Local Minima

The constraint resulting from a FOL formula can be hard to optimize during learning. Let us consider universally quantified FOL formulas in disjunctive normal form (DNF):

$$ \forall x_1 \ldots \forall x_n ~ \overbrace{\big (n_{11} P_1(x_1) \wedge \ldots \wedge n_{1n} P_n(x_n)\big )}^{minterm\ 1} \vee \ldots \vee \overbrace{\big (n_{k1} P_1(x_1) \wedge \ldots \wedge n_{kn} P_n(x_n)\big )}^{minterm\ k} $$

where \(n_{ij}\) determines whether the j-th variable in the i-th minterm is negated or not. The following expression for each grounding can be obtained by applying a double negation and using the DeMorgan rule:

$$ \lnot \Big ( \lnot \big (n_{11} P_1(x_1) \wedge \ldots \wedge n_{1n} P_n(x_n)\big ) \wedge \ldots \wedge \lnot \big (n_{k1} P_1(x_1) \wedge \ldots \wedge n_{kn} P_n(x_n) \big )\Big ) $$

For any given grounding, the resulting propositional expression can be converted using the product t-norm; replacing the atoms with the unknown function approximations yields the constraint:

$$ 1 - \varPhi ({\varvec{f}}({\varvec{\mathcal {X}}})) = \frac{1}{\prod \limits _{i=1}^{n} |{\mathcal X}_i|} \sum _{x_1 \in {\mathcal X}_1} \ldots \sum _{x_n \in {\mathcal X}_n} \prod _{r=1}^k \Big (1 - \prod _{i \in A^p_r} f_i(x_i) \prod _{j \in A^n_r} (1 - f_j(x_j)) \Big ) = 0 $$

where \(A^p_r\) and \(A^n_r\) are the sets of non-negated and negated atoms in the r-th minterm. It is clear that a null contribution to the summation for a given grounding is obtained as a solution of a polynomial equation, where the r-th solution corresponds to the assignment satisfying the r-th minterm, that is:

$$ \prod _{i \in A^p_r} f_i(x_i) \prod _{j \in A^n_r} (1-f_j(x_j))=1. $$

Since all minterms are by construction different, and the polynomial expression is continuous and assumes values greater than or equal to zero (as guaranteed by any t-norm), the resulting expression has as many local minima as the number of true configurations in the truth table of the grounded propositional formula, which is in turn equal to the number of minterms of the initial DNF.

This shows that there is a duality between the number of possible assignments of the atoms satisfying the FOL formula for a given grounding of the variables, and the number of local minima in the expression generalizing the formula to a continuous domain. The intractability of unrestricted FOL inference is therefore translated into a SBR cost function that is plagued by many local minima.
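As a small illustration of this duality, the per-grounding product-t-norm penalty of a two-minterm DNF can be checked to vanish exactly at the assignments satisfying one of its minterms (the names below are purely illustrative):

```python
# Per-grounding penalty for the DNF (P1 and not P2) or (not P1 and P2),
# translated with the product t-norm as described above: the double negation
# turns the disjunction into a product of factors (1 - minterm_r), and each
# factor contributes one global minimum of the penalty.

def penalty(f1, f2):
    m1 = f1 * (1 - f2)          # minterm 1: P1 and not P2
    m2 = (1 - f1) * f2          # minterm 2: not P1 and P2
    return (1 - m1) * (1 - m2)

# the two satisfying Boolean assignments are exactly the zeros (minima)
assert penalty(1, 0) == 0 and penalty(0, 1) == 0
# the two falsifying assignments give maximal penalty
assert penalty(0, 0) == 1 and penalty(1, 1) == 1
```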

3.1 Convexity of the Constraints

While optimization remains generally intractable, using t-norm residua to translate logic implications significantly increases the portion of constraints that can be efficiently exploited in learning.

T-norm residua are consistent with modus ponens at the extremes of the variable range, but they soften the conditions under which the formula is verified. Indeed, any t-norm residuum returns the value 1 whenever the head holds a value larger than the body, thereby specifying an interval of admissible solutions. On the other hand, the t-norm translation of the implication via modus ponens has only 3 singular points where it is fully satisfied. An interesting consequence of applying the t-norm residuum in SBR is that a much larger set of formulas corresponds to convex constraints than would be the case with a modus ponens based translation.

In particular, let us consider the class of universally quantified FOL formulas for which the propositional clause resulting from the evaluation of the predicates for any grounding is a definite clause (i.e. having a conjunctive body of positive atoms and a single literal head). A generic formula in this class has the following structure:

$$ \forall x_1 \ldots \forall x_v ~ P_1(x_{i(1)}) \wedge \ldots \wedge P_n(x_{i(n)}) \Rightarrow P_{n+1}(x_{i(n+1)}) $$

where i(j) is the index of the variable used by the j-th predicate. Letting \(\varvec{x}= \{x_1, \ldots , x_v\}\) and replacing the predicates with the functions \({\varvec{f}}\) to be learned, the constraint can be written as:

$$\begin{aligned} 1 - \varPhi ({\varvec{f}}({\varvec{\mathcal {X}}})) = \frac{1}{\prod \limits _{j=1}^{v} |{\mathcal X}_j|} \sum _{x_1 \in {\mathcal X}_1} \ldots \sum _{x_{v} \in {\mathcal X}_v} \big ( 1 - t({\varvec{f}}, \varvec{x})\big ) = 0 \end{aligned}$$
(3)

where \(t(\cdot )\) is the t-norm representation of the definite clause.

Theorem 1

The function \(1 - \varPhi (\cdot )\) translating a generic FOL formula with any number of nested universal quantifiers, conjunctive body and a single head is convex with respect to the function values if using the Lukasiewicz t-norm.

Proof

Equation 3 shows the general form of the constraint. Since a positive summation of convex functions is convex, we only need to prove that each single contribution \(1 - t({\varvec{f}}, \varvec{x})\) to the summation is convex with respect to the \(f_i(x)\) values.

The translation of a conjunction of n variables \(A_1 \wedge \ldots \wedge A_n\) using the Lukasiewicz t-norm is \(\max (0, \sum _{j=1}^n A_j - n +1)\). Therefore the body of the clause translates to \(\max (0, \sum _{j=1}^n f_j(x_{i(j)}) - n +1)\), while the head translates to \(f_{n+1}(x_{i(n+1)})\).

The residuum definition for the Lukasiewicz t-norm is:

$$ A_1\Rightarrow A_2 ~~\longrightarrow ~~ \left\{ \begin{array}{ll} 1 &{} A_1-A_2 < 0\\ 1-A_1+A_2 &{} else \end{array}\right. $$

Let us define \(h({\varvec{f}}, \varvec{x}) = \max (0, \sum _{j=1}^n f_j(x_{i(j)}) - n +1) - f_{n+1}(x_{i(n+1)})\). Therefore,

$$\begin{aligned} 1 - t({\varvec{f}}, \varvec{x})= & {} g(h({\varvec{f}}, \varvec{x})) = \\= & {} 1 - \left\{ \begin{array}{ll} 1 &{} ~~~~h({\varvec{f}}, \varvec{x})< 0 \\ 1 - h({\varvec{f}}, \varvec{x}) &{}~~~~ else \end{array}\right. \\= & {} \left\{ \begin{array}{ll} 0 &{} ~~~~h({\varvec{f}}, \varvec{x}) < 0 \\ h({\varvec{f}}, \varvec{x}) &{}~~~~ else \end{array}\right. \end{aligned}$$

\(g(\cdot )\) is convex and non-decreasing in \(h(\cdot )\), while \(h(\cdot )\) is convex. Therefore, the composition \(g(h(\cdot ))\) is convex as well.
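The claim can also be checked numerically. The sketch below (illustrative, for a clause \(P_1 \wedge P_2 \Rightarrow P_3\)) verifies the defining inequality of convexity for the per-grounding penalty \(\max (0, h)\) on random points:

```python
import random

# Per-grounding Lukasiewicz penalty 1 - t = max(0, Luk(body) - head)
# for the clause P1 and P2 => P3, an illustrative instance of Theorem 1.

def penalty(f):
    f1, f2, f3 = f
    body = max(0.0, f1 + f2 - 1.0)   # Lukasiewicz conjunction of the body
    return max(0.0, body - f3)       # 1 - residuum(body, head)

random.seed(0)
for _ in range(1000):
    u = [random.random() for _ in range(3)]
    v = [random.random() for _ in range(3)]
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(u, v)]
    # convexity: penalty(mix) <= lam * penalty(u) + (1 - lam) * penalty(v)
    assert penalty(mix) <= lam * penalty(u) + (1 - lam) * penalty(v) + 1e-12
```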

Let’s now see some special constraints that fall in this class.

Constraint and supervised data. Let \({\mathcal{X}}^+_k\) be the set of positive examples for the k-th unknown predicate \(p_k\). The following logic formula expresses the fact that \(p_k\) is constrained by the values assumed over the supervised data, as it should evaluate to 1 on a positive example:

$$ \begin{array}{l} \forall x ~ P_k(x) \Rightarrow p_k(x) \end{array} $$

where \(x \in \mathcal{X}_k\) and the predicate \(P_k(x)\) is an evidence function holding true iff x is a positive example for the query predicate \(p_k\) (e.g. \(\varvec{x}\in {\mathcal{X}}^+_k\)). Using the Lukasiewicz t-norm and replacing \(p_k\) with its approximation \(f_k\), this corresponds to the following constraint:

$$ \begin{array}{lcl} 1 - \varPhi \big ( f_k({\varvec{\mathcal {X}}}^+_k) \big )= & {} \frac{1}{|{\varvec{\mathcal {X}}}^+_k|} \sum \limits _{\varvec{x}\in {\varvec{\mathcal {X}}}^+_k} \max \big (0, 1-f_k(\varvec{x})\big ) =0 \end{array} $$

This is an example showing how training with the hinge loss \(\max (0, 1-f_k(\varvec{x}))\) emerges when the fitting of the supervised data is expressed via a definite clause. As predicted by Theorem 1, this corresponds to a convex cost function to optimize when using a linear model (as when an SVM is used to implement \(f_k\)  [4]).
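In code, the constraint above is just the average hinge loss over the positive examples (a minimal sketch; the function name is ours):

```python
# Supervision constraint with the Lukasiewicz residuum: fitting the positive
# examples of predicate p_k reduces to the average hinge loss on f_k.

def supervision_penalty(outputs):
    """outputs: the values f_k(x) over the positive examples of p_k."""
    return sum(max(0.0, 1.0 - f) for f in outputs) / len(outputs)

# perfectly confident positives incur no penalty
assert supervision_penalty([1.0, 1.0]) == 0.0
# an uncertain positive is penalized linearly, exactly as a hinge loss
assert abs(supervision_penalty([1.0, 0.4]) - 0.3) < 1e-12
```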

Manifold Regularization. Let’s consider the formula expressing a manifold based on some relation R:

$$ \begin{array}{lcl} \forall x \forall y ~R(x,y) \Rightarrow \big ( P_k(x) \Leftrightarrow P_k(y)\big ) \end{array} $$

which is equivalent to the conjunction of the following two FOL formulas:

$$ \begin{array}{lcl} \forall x \forall y ~R(x,y) \wedge P_k(x) \Rightarrow P_k(y) \\ \forall x \forall y ~R(x,y) \wedge P_k(y) \Rightarrow P_k(x) \end{array} $$

According to Theorem 1, the resulting constraint for these formulas must be convex. Indeed, the constraint is:

$$ \begin{array}{lcl} 1 - \varPhi ({\varvec{f}}_k({\varvec{\mathcal {X}}}_k)) &{}=&{} 1 - \frac{1}{|\mathcal {X}_k|^2} \Big (|\mathcal {X}_k|^2 - |\mathcal {R}| + \sum \limits _{(x,y) \in \mathcal {R}} \max \Big (0, -1 \\ &{}+&{} \left\{ \begin{array}{ll} 1 - f_k(x) + f_k(y) &{}~~~~~ f_k(x)> f_k(y) \\ 1 &{}~~~~~ else \end{array}\right. \\ &{}+&{}\left\{ \begin{array}{ll} 1 - f_k(y) + f_k(x) &{} ~~~~~f_k(y) > f_k(x) \\ 1 &{}~~~~~ else \end{array}\right. \Big ) \Big ) \\ &{}=&{}\frac{|\mathcal {R}|}{|\mathcal {X}_k|^2} - \frac{1}{|\mathcal {X}_k|^2} \sum \limits _{(x,y) \in \mathcal {R}} \max (0, 1 - |f_k(x) - f_k(y)|) \\ &{}=&{} \frac{1}{|\mathcal {X}_k|^2} \sum \limits _{(x,y) \in \mathcal {R}} |f_k(x) - f_k(y)| = 0 \end{array} $$

which is the L1 variation of the classical manifold regularization constraint [1].
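This equivalence is easy to verify numerically. The sketch below (with illustrative values for \(f_k\) and the relation \(\mathcal {R}\)) compares the direct Lukasiewicz translation of the two implications against the simplified L1 form; since \(f_k \in [0, 1]\), \(\max (0, 1 - |f_k(x) - f_k(y)|) = 1 - |f_k(x) - f_k(y)|\) and the two coincide:

```python
# Check that the Lukasiewicz translation of the manifold rule collapses
# to an L1 penalty over the related pairs. Values are illustrative.

def manifold_penalty(f, relation, n):
    # direct translation: conjunction of the two residua for each pair in R
    total = 0.0
    for x, y in relation:
        r1 = 1.0 if f[x] <= f[y] else 1.0 - f[x] + f[y]   # f(x) => f(y)
        r2 = 1.0 if f[y] <= f[x] else 1.0 - f[y] + f[x]   # f(y) => f(x)
        total += max(0.0, r1 + r2 - 1.0)
    return len(relation) / n**2 - total / n**2

def l1_penalty(f, relation, n):
    return sum(abs(f[x] - f[y]) for x, y in relation) / n**2

f = {0: 0.9, 1: 0.2, 2: 0.6}
rel = [(0, 1), (1, 2)]
assert abs(manifold_penalty(f, rel, 3) - l1_penalty(f, rel, 3)) < 1e-12
```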

3.2 Teaching Plans

The results shown in the previous section suggest a natural heuristic to deal with harder SBR problems:

  • Solve the optimization problem introducing only the convex constraints: optimize via gradient descent until the gradient vanishes (e.g. its modulus falls below some threshold) and the learning process has found a good approximation of the best solution of the convex problem.

  • Introduce the remaining constraints into the problem and run the training process until convergence.

The second step can be further subdivided into multiple stages by first enforcing the formulas with a lower number of valid (i.e. satisfying) assignments to the atoms to be learned. As explained in the previous sections, these formulas introduce a lower number of local minima into the cost function. This heuristic is similar to what is done in constraint satisfaction programming [15], where the variables with the smallest number of admissible values remaining in their domains are selected first during the search over the possible assignments [10].
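A hypothetical sketch of this two-stage plan is given below; the optimizer, the numeric gradient, and the two toy penalties are illustrative stand-ins, not the paper's actual setup:

```python
# Two-stage training plan: optimize with the convex constraints until the
# gradient (numerically) vanishes, then switch on the remaining constraints.

def num_grad(params, penalties, eps=1e-6):
    def cost(ps):
        return sum(pen(ps) for pen in penalties)
    base = cost(params)
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += eps
        grad.append((cost(bumped) - base) / eps)
    return grad

def train(params, convex_penalties, other_penalties, lr=0.1,
          tol=1e-3, max_iters=10000):
    def run(penalties):
        nonlocal params
        for _ in range(max_iters):
            grad = num_grad(params, penalties)
            if max(abs(g) for g in grad) < tol:   # gradient has vanished
                break
            params = [p - lr * g for p, g in zip(params, grad)]
    run(convex_penalties)                       # stage 1: convex warm start
    run(convex_penalties + other_penalties)     # stage 2: full problem
    return params

convex = [lambda p: max(0.0, 1.0 - p[0])]           # convex hinge constraint
nonconvex = [lambda p: (1.0 - p[0] * p[1]) ** 2]    # non-convex constraint

params = train([0.0, 0.1], convex, nonconvex)
# the convex stage pushes p[0] toward 1; the full stage then also
# satisfies the product constraint p[0] * p[1] = 1
```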

4 Experimental Results

The experimental analysis has been carried out on an animal identification benchmark proposed by P. Winston [18], which was initially designed to show the ability of logic programming to determine the class of an animal from some initial clues regarding its features. Unlike in the original challenge, the test phase is not provided with a sufficient set of clues to perform classification, but only with the raw images, leaving the learning framework the task of developing the intermediate clues over which to perform inference.

The dataset is composed of 5605 images, taken from the ImageNet database, equally divided into 7 classes, each one representing one animal category: albatross, cheetah, giraffe, ostrich, penguin, tiger and zebra. The feature vector used to represent each image is composed of bag-of-features and color histogram descriptors. In particular, SIFT descriptors [12] have been extracted from each image and then clustered into 600 visual words. A vector containing the normalized count of each visual word in the given image is provided as representation. We also added to the feature representation a 12-dimension normalized color histogram for each channel in the RGB color space (Fig. 1).

Fig. 1. The feature vector representation for each image in the Winston benchmark.

Table 1. The KB used for training the SBR model. The rules are divided into groups: only the first “definite” group is formed by definite clauses that were originally proposed by Winston to classify the animals. The “excl” rule states the fact that one and only one class should be assigned to each image. The “inter” rules add another intermediate classification level that can be exploited to perform classification over the final classes.

The images have been split into two initial sets: the first one is composed of 2100 images utilized for building the visual vocabulary, while the second set is composed of 3505 images used in the learning process. The experimental analysis has been carried out by randomly splitting the available supervisions into training, validation and test sets, containing 50%, 25% and 25% of the labels, respectively.

Fig. 2. Experimental results obtained in a transductive setting using standard Neural Networks with no constraints, and SBR with different sets of rules and learning schemas.

Fig. 3. Experimental results obtained by performing SBR collective classification with different sets of rules and learning schemas.

Knowledge base. The knowledge domain is expressed in terms of FOL rules. Table 1 shows the full set of rules used in this task. A total of 33 predicates are available in the KB, but only 7 of them are considered in evaluating the results, while the other ones are intermediate predicates helping to determine the final classes during the inference process.

The rules in KB can be subdivided into subsets:

  • the original set of rules as provided in the original problem definition by Winston: these rules are definite clauses resulting in a convex constraint, and they are marked as definite in the table;

  • the excl rule states that each pattern should belong to one and only one final class; this rule does not translate into a convex constraint;

  • the inter rules show how it is possible to inject any amount of additional knowledge into the classification problem. These rules do not translate into convex constraints.

4.1 Results

The first set of experiments tests the performance of SBR in a transductive context, where all the images are available at training time, but only the training labels are provided. One Neural Network with one hidden layer, using a sigmoidal activation function on the output layer and rectified linear activations in the hidden layer, is trained for each of the 33 predicates in the KB. Figure 2 reports a summary of the results, evaluated over the training data for the 7 final predicates corresponding to the final classes in the Winston benchmark. Using the convex constraints coming from the definite clauses provides an improvement of 2 points of F1. A small additional improvement can be obtained by adding all the available constraints at the beginning of training. A larger improvement can be obtained by using a training plan where the convex constraints are added first, and the remaining constraints are added once the cost function has converged under the first set of constraints.

Even if the constraints are already enforced on the test data in the transductive context, there is no guarantee that the input representations are powerful enough to allow the neural networks to respect the constraints on the test data. Therefore, it can be beneficial to further perform a collective classification step as described in Sect. 2.1, where the constraints are enforced over the output label assignments on the test set. The assignments are initialized using the output of the Neural Networks from the transductive step. Figure 3 reports the results of collective classification on the test set. It is clear that collective classification significantly improves the results obtained by transductive classification. In this round of experiments, the subset of convex constraints already provides a large improvement, which cannot be pushed higher by adding the more complex constraints. This is likely due to the high number of local minima in the resulting cost function, which prevents the training process from discovering better solutions. As in the transductive learning case, breaking the learning complexity into stages turns out to be very useful. Indeed, using the training plan described in Sect. 3.2 delivers a significant boost in classification performance.

5 Conclusions

Semantic Based Regularization seamlessly integrates First Order Logic into multi-task learning, making it possible to tackle complex learning problems even when supervised data is scarce. This is achieved by leveraging unsupervised data and any domain knowledge available in the field. However, the integration sometimes requires solving a challenging optimization problem during the learning process. This paper identifies a large class of FOL knowledge that can be integrated into learning while keeping the resulting optimization problem easy. By leveraging this class of clauses, the paper shows how to improve the trained solution in the more general case by breaking the complexity of learning into multiple stages, which are initialized using a solution built over the “easy” clauses. The experimental results on image classification show the effectiveness of the framework and of the proposed training heuristic. While providing extensive prior knowledge is cumbersome in large and complex experimental setups, we believe that the integration of prior knowledge and learning will be a required step to achieve real human-level capabilities in vision and language understanding. As future work, we plan to extend the experimental evaluation to other, larger image datasets.