1 Introduction

A great deal of effort has been spent on representing the uncertainty in type-1 fuzzy sets [1,2,3]. This uncertainty, called the fuzziness of a type-1 fuzzy set, is essentially an attempt at measuring the lack of distinction between a set and its negation, as suggested by Yager [2, 3]. Since any crisp set is deemed to have zero fuzziness, this suggestion amounts to finding the difference between the uncertainty measure and the measure of specificity [4] of a fuzzy subset, which is related to the degree to which the set contains one and only one element.

In this work, we make an attempt to represent the uncertainty in a type-1 fuzzy set using Hanman–Anirban entropy function [5]. Before embarking on this let us see the need for expanding the scope of fuzzy sets in the realm of uncertainty representation.

1.1 The motivation for the uncertainty representation

The primary objective of this paper is to represent the uncertainty associated with the information source values (attribute values) in a fuzzy set, called the possibilistic uncertainty. In the literature, only the probabilistic uncertainty in the probabilities of the information source values is addressed. Fuzzy set theory has no provision to represent this uncertainty as it treats each information source value and its membership function value separately; this pair is an element of a fuzzy set. What we need is a connecting link between the two to pave the way for uncertainty representation. The representation of uncertainty in spatially varying and time-varying information source values is another problem. To solve these problems, the uncertainty in the information source values forming a type-1 fuzzy set is sought to be quantified by the entropy function, leading to the information set theory that expands the scope of a fuzzy set by assigning the role of an agent to the membership function.

Another motivation stems from the desire to analyze, modify, and evaluate the information set based features formulated in this paper on a real-life application like face recognition by developing classifiers based on information processing.

1.2 A brief literature on face recognition

We have chosen face recognition as an important application of the proposed information set theory. An intuitive way to recognize a face is to extract its important features and compare them with similar ones on other faces. Thus, a majority of the contributions to the biometric-based recognition of the human face have focused on detecting prominent parts such as the eyes, nose, mouth, and head contour.

The methods in vogue in face recognition are broadly classified into: (i) holistic matching methods [6,7,8], in which the whole face acts as an input to the recognition system, (ii) feature-based (local) matching methods [9,10,11,12,13,14], which deal with local features such as the eyes, mouth, and nose and feed their statistics into the recognition system, and (iii) hybrid methods [15,16,17,18,19,20], where the recognition system makes use of both the local features and the whole facial region.

During the past two decades, appearance-based face recognition techniques such as principal component analysis (PCA) [8] and linear discriminant analysis (LDA) [21] have dominated the scene. These two algorithms seek a compact Euclidean space for efficient face recognition. A number of manifold learning algorithms attempt to unearth the geometric properties of high-dimensional feature spaces, including locality-preserving projections (LPP) [22], discriminant LPP (DLPP) [23], orthogonal DLPP (ODLPP) [24] and uncorrelated DLPP (UDLPP) [25]. Dai and Yuen [26] have introduced a regularized discriminant analysis (RDA) to address the problem of small sample size (sss) and to enhance the recognition performance of LDA. A parametric regularized locality-preserving projections (PRLPP) method is presented in [27] for face recognition. In this case, the LPP space is regulated parametrically to tap the discriminant information from the whole feature space instead of the projection subspace of PCA as in [26]. To address the problem of small sample size (sss), direct LDA [28] and maximal margin criteria (MMC) [29] are advocated.

2D discriminant analysis has been increasingly used in PCA and LDA for face recognition, giving rise to 2-D PCA [30, 31] and 2-D LDA [32, 33], with (2-D)\(^{2}\) PCA [34] and (2-D)\(^{2}\) LDA [35,36,37] as their offshoots. Wang et al. [38] put forward a bidirectional PCA plus LDA method \((\hbox {BDPCA}+\hbox {LDA})\) in which LDA is performed in the BDPCA space. A non-uniform selection of Gabor features from faces with variations in pose and illumination is made in [39] to capture their local statistics and to classify faces using PCA and LDA.

While PCA, LDA and their variants, neural network-based approaches and their variants, etc., are widely employed for face recognition, these are all holistic approaches and we will now survey some patch-based approaches.

The patch-based methods, which retain important local information, are more suitable for the recognition and analysis of facial expressions. A few such methods are discussed here.

Hu et al. [40] address the problem of face recognition with a small sample size (SSS). They have also implemented a patch-based CRC (collaborative representation-based classification), known as the PCRC method, that classifies the query sample by combining the recognition outputs of all the overlapped patches, each of which is collaboratively represented by the corresponding patches of the training samples.

The organization of the paper is as follows: Sect. 2 presents the concept of information set and its properties. The formulation of the Hanman filter (HF) and the Hanman transform (HT) is given under the higher form of information sets in Sect. 3. The three new classifiers named inner product classifier (IPC), normed error classifier (NEC), and Hanman classifier (HC) are described as part of information processing in Sect. 4. The face databases and the experimental results are detailed in Sect. 5. The conclusions are given in Sect. 6.

2 The concept of information set and its properties

The concept of an information set was introduced by Hanmandlu in a guest editorial [5] to enlarge the scope of a fuzzy set using the Hanman–Anirban entropy function [41]. The information set arises from representing the uncertainty in a fuzzy set. Features based on information sets have been used for ear-based authentication in [42] and for infrared face recognition in [43]. The concept is elaborated here along with the properties of information sets for wider foreseeable applications.

Consider a fuzzy set whose elements are pairs of gray levels \(I=\left\{ {I_{{ ij}} } \right\} \) in a window and the corresponding membership function values \(\left\{ {\mu _{{ ij}} } \right\} \) that represent the degree of association of the gray levels with the set. Let g versus h(g) be the histogram plot, where g stands for the distinct gray levels and h(g) is the probability of occurrence of the gray levels in the same window. The probabilistic uncertainty represented by the well-known entropy functions such as the Shannon entropy function uses only the probabilities. The possibilistic uncertainty as represented by fuzzy entropy functions gives only the uncertainty in the membership function. In both these uncertainty representations, the gray levels are disregarded. Our concern here is to represent both probabilistic and possibilistic uncertainties by the same entropy function.

2.1 Derivation of information set

The multimedia components (i.e., an image, speech, text, or video) are the information sources. After granulation, the granules are considered as the information sources. Granulation amounts to partitioning into windows in the case of an image or text, whereas it amounts to sampling into frames in the case of speech and video. In this paper, we have employed granulation as a means of partitioning an image into windows of different sizes. In this context, the work on information granules by Pedrycz and his co-researchers [44,45,46,47,48] merits a mention. In a broader sense, information granules refer to the information sets that arise out of partitioning the information source values by a fuzzy equivalence relation. The partitioning of information sources here is not based on the fuzzy equivalence relation, which is a different direction. Thus our granulation differs from that of Pedrycz and his co-researchers in [44,45,46,47,48], where it is applied on interval sets, type-2 sets, rough sets, etc., to generate different information granules.

The property values, attributes, or cues comprising the information sources contained in windows or frames form the fuzzy sets. The distribution of these information sources in the fuzzy sets requires an appropriate membership function. Let us consider the commonly used exponential and Gaussian-type membership functions, which are given by

$$\begin{aligned} \mu _{{ ij}}^\mathrm{e} =\mathrm{e}^{-\left\{ {\frac{\left| {I_{{ ij}} -I\left( \mathrm{ref} \right) } \right| }{f_h^2 }} \right\} };\quad \mu _{{ ij}}^\mathrm{g} =\mathrm{e}^{-\left[ {\frac{I_{{ ij}} -I\left( \mathrm{ref} \right) }{\sqrt{2}f_h }} \right] ^{2}} \end{aligned}$$
(1)

The fuzzifier \(f_h^2 \) in (1) was devised by Hanmandlu and Jha [46]; it gives the spread of the attribute values with respect to the chosen reference (symbolized as ref). It is defined as

$$\begin{aligned} f_h^2 =\frac{{\sum }_{i=1}^W {\sum }_{j=1}^W \left( {I\left( \mathrm{ref} \right) -I_{{ ij}} } \right) ^{4}}{{\sum }_{i=1}^W {\sum }_{j=1}^W \left( {I\left( \mathrm{ref} \right) -I_{{ ij}} } \right) ^{2}} \end{aligned}$$
(2)

One can take \(I (\text {ref}) =I_{\mathrm{avg}}\) or \(I_{\mathrm{max}}\) or \(I_{\mathrm{med}}\) from the values in a window. It may be noted that the above fuzzifier gives more spread than is possible with variance as used in the Gaussian function.
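As a quick illustration of (1) and (2), the following minimal NumPy sketch computes the fuzzifier and the two membership functions for a single \(W\times W\) window; the window contents, the choice of \(I(\mathrm{ref})=I_{\mathrm{avg}}\), and the assumption of a non-constant window are illustrative only.

```python
import numpy as np

def fuzzifier(window, ref):
    """Hanmandlu-Jha fuzzifier f_h^2 of Eq. (2); assumes a non-constant window."""
    d2 = (ref - window) ** 2
    return (d2 ** 2).sum() / d2.sum()

def memberships(window, ref):
    """Exponential and Gaussian membership functions of Eq. (1)."""
    fh2 = fuzzifier(window, ref)
    mu_e = np.exp(-np.abs(window - ref) / fh2)           # exponential form
    mu_g = np.exp(-((window - ref) ** 2) / (2.0 * fh2))  # Gaussian form
    return mu_e, mu_g

window = np.array([[120., 130., 125.],
                   [128., 140., 135.],
                   [122., 138., 131.]])
mu_e, mu_g = memberships(window, window.mean())          # I(ref) = I_avg
```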

Our objective here is to convert fuzzy sets into information sets. The Hanman–Anirban entropy function [41] can be used to do the conversion. Consider the non-normalized form of this 1D entropy, given by

$$\begin{aligned} H=\sum p\mathrm{e}^{-\left( {ap^{3}+bp^{2}+cp+d} \right) } \end{aligned}$$
(3)

where a, b, c, and d are real-valued parameters, p is the probability, and \(ap^{3}+bp^{2}+cp+d\) is assumed positive.

The Hanman–Anirban entropy function was originally defined in terms of probabilities to provide a measure of probabilistic uncertainty. We now adapt it to represent the possibilistic uncertainty by replacing the probabilities with the information source values, which can be attribute/property values or gray levels in the case of an image considered as an information source, as mentioned above. In the context of a face image, which is our chosen application of the proposed information set theory, \(p=I_{{ ij}}\), so we need the 2D form of (3), i.e.,

$$\begin{aligned} H=\sum \sum I_{{ ij}} \mathrm{e}^{-\left( {aI_{{ ij}}^3 +bI_{{ ij}}^2 +cI_{{ ij}} +d} \right) } \end{aligned}$$
(4)

Taking \(a=0\), \(b=0\), \(c=\frac{1}{2f_h^2}\); \(d=-\frac{I\left( \mathrm{ref} \right) }{2f_h^2}\) in (4) leads to:

$$\begin{aligned} H_\mathrm{e} =\sum \sum I_{{ ij}} \mu _{{ ij}}^\mathrm{e} \end{aligned}$$
(5)

Substituting the above parameters in (A.2) gives:

$$\begin{aligned} H_{\mathrm{Ne}} (I_{{ ij}} )=\frac{[H(I_{{ ij}} )-C_\mathrm{e} ]}{\lambda _\mathrm{e}} \end{aligned}$$
(6)

where \(C_\mathrm{e}\) and \(\lambda _\mathrm{e}\) are constants and \(H(I_{{ ij}})=H_\mathrm{e}\).

With a minor adaptation of (4) using the following parameters

$$\begin{aligned} a=0,\quad b=\frac{1}{2f_h^2 },\quad c=-\frac{2I(\mathrm{ref})}{2f_h^2 },\quad d=\frac{I^{2}(\mathrm{ref})}{2f_h^2} \end{aligned}$$

it takes the form:

$$\begin{aligned} H_\mathrm{g} =\sum \sum I_{{ ij}} \mu _{{ ij}}^\mathrm{g} \end{aligned}$$
(7)

Substituting the above parameters in (A.2) gives:

$$\begin{aligned} H_{\mathrm{Ng}} (I_{{ ij}} )=\frac{[H(I_{{ ij}} )-C_\mathrm{g}]}{\lambda _\mathrm{g}} \end{aligned}$$
(8)

where \(C_{\mathrm{g}}\) and \(\lambda _\mathrm{g}\) are constants and \(H(I_{{ ij}} )=H_\mathrm{g}\).

For any membership function, there are three representations: (i) the normalized information \(H_\mathrm{N}=(H-C)/\lambda \), which says that the normalized information \(H_{\mathrm{N}}\) results from the information H after it is corrected by C and scaled by \(\lambda \), (ii) the corrected information \(H_\mathrm{C}=H-C\), and (iii) the basic information \(H=\sum \sum I_{{ ij}} \mu _{{ ij}} \). In case (iii), we do not convert the information sources into probabilities by taking \(I_{{ ij}} /\sum \sum I_{{ ij}} \), because this makes the information value too small to have any discriminating power.

In the real-life scenario, the received information is invariably pruned either by correcting or by normalizing. The information source values received by our senses are perceived as differing information values because the perception is different depending on how much importance we attach to the source. Like fuzzy variables, information values are also natural variables.

But for simplicity and for imparting the discriminating power, we choose the basic information H, which is the product of the information source values and their membership function values. This product may mislead readers into thinking that information sets are in no way different from fuzzy sets. However, combining the information source values and the membership values provides a new paradigm for the representation of uncertainty of either type, probabilistic or possibilistic.

Definition

Information set: Any fuzzy set defined over a universe of discourse can be converted into an information set. Its elements, which are the products of the information sources and their membership grades, are called information values. The information set, comprising the information values, is expressed as:

$$\begin{aligned} \mathcal{H}\left( I \right) =\left\{ {I_{{ ij}} \mu _{{ ij}} } \right\} =\{H_{{ ij}} \left( I \right) \};\quad I\in \left[ {0,1} \right] \end{aligned}$$
(9)

As we do not know the suitable membership function, we need to try out well-defined memberships like the exponential and Gaussian functions. If they do not fit, any arbitrary membership function may be sought without affecting the definition of the information value. One such function is:

$$\begin{aligned} \mu _{{ ij}} =\frac{\left| {I_{{ ij}} -I\left( \mathrm{ref} \right) } \right| }{f_h^2 } \end{aligned}$$
(10)
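A minimal sketch of (9), assuming 8-bit gray levels scaled to [0, 1] and reusing the `memberships` helper sketched after (2); the elements of the resulting array are the information values \(I_{ij}\mu _{ij}\).

```python
import numpy as np

window = np.random.randint(0, 256, (3, 3)) / 255.0   # assumed normalization to [0, 1]
mu_e, _ = memberships(window, window.mean())         # agent: exponential membership of Eq. (1)
H = window * mu_e                                    # information set {I_ij mu_ij}, Eq. (9)
```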

Seven properties of information sets, referred to as Property-i, where \(i=1,2,\ldots ,7\), are now elaborated.

2.2 Properties of information set

2.2.1 The information set can be converted into different forms

It has been observed that the basic information set may not be effective. Hence we convert it into different forms for dealing with different problems by assuming the information value as a unit of information. We can apply any function on the information values.

For example, the information value \(\left\{ {I_{{ ij}} \mu _{{ ij}} } \right\} \) on the application of a sigmoid function S leads to

$$\begin{aligned} S\left( {I_{{ ij}} \mu _{{ ij}} } \right) =S_{{ ij}} =\frac{1}{1+\mathrm{e}^{-I_{{ ij}} \mu _{{ ij}} }} \end{aligned}$$
(11)

The thus modified information set \(\{S_{{ ij}}\}\) provides better features than those generated from the basic information set \(\left\{ {I_{{ ij}} \mu _{{ ij}} } \right\} \). Similarly, we can generate \(\log \left\{ {I_{{ ij}} \mu _{{ ij}} } \right\} \) features. It is also possible to derive information sets from texture images, which are in turn obtained from an image by applying either local binary pattern (LBP) or local directional pattern (LDP) operators. The resulting information sets that carry texture information are denoted by \(\left\{ {\mathrm{LBP}(I_{{ ij}} )\mu _{{ ij}} } \right\} \) or \(\left\{ {\mathrm{LDP}(I_{{ ij}} )\mu _{{ ij}} } \right\} \). We can use the sigmoid function to operate on these information sets as well. If we desire dimensionality reduction, then PCA or 2DPCA can be applied on these information sets.
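The following short sketch illustrates Property-1: the basic information set is recomputed for a random window and then modified by a sigmoid and a logarithm; the random window and the exponential membership are illustrative assumptions.

```python
import numpy as np

window = np.random.randint(0, 256, (3, 3)) / 255.0                 # assumed [0, 1] gray levels
ref = window.mean()
d2 = (ref - window) ** 2
mu = np.exp(-np.abs(window - ref) / ((d2 ** 2).sum() / d2.sum()))  # exponential membership, Eq. (1)
H = window * mu                                                    # basic information set {I_ij mu_ij}

S = 1.0 / (1.0 + np.exp(-H))    # sigmoid information set {S_ij}, Eq. (11)
L = np.log(1.0 + H)             # log-type information features
```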

2.2.2 Probability and possibility can be addressed very easily through information sets

For example, the gray levels g(k) are represented by membership function values \(\mu \left( k \right) \) and the frequency of their occurrence by the probability h(k). The histogram is a plot of g(k) versus h(k).

From the kth gray level, we can get two types of information values: the possibilistic uncertainty given by \(g\left( k \right) \mu \left( k \right) \) and the possibilistic-probabilistic uncertainty given by \(h\left( k \right) g\left( k \right) \mu \left( k \right) \).
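A small sketch of Property-2, assuming a random window and a Gaussian-type membership over the distinct gray levels; it produces the possibilistic values \(g(k)\mu (k)\) and the possibilistic-probabilistic values \(h(k)g(k)\mu (k)\).

```python
import numpy as np

window = np.random.randint(0, 256, (9, 9))              # assumed gray-level window
g, counts = np.unique(window, return_counts=True)       # distinct gray levels g(k)
h = counts / counts.sum()                               # probabilities h(k)
d2 = (g.mean() - g.astype(float)) ** 2
mu = np.exp(-d2 / (2.0 * (d2 ** 2).sum() / d2.sum()))   # Gaussian membership mu(k), Eqs. (1)-(2)

possibilistic = g * mu                                  # g(k) * mu(k)
possibilistic_probabilistic = h * g * mu                # h(k) * g(k) * mu(k)
```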

2.2.3 The desired components can be captured from the information by a weighting function

Let us consider the weighted entropy,

$$\begin{aligned} H=f\left( p\right) p\mathrm{e}^{-\left( {ap^{3}+bp^{2}+cp+d} \right) } \end{aligned}$$
(12)

The above can be written in the form

$$\begin{aligned} H=I_{{ ij}} \mu _{{ ij}} f\left( {I_{{ ij}} } \right) \end{aligned}$$
(13)

by replacing p with \(I_{{ ij}}\) and by the appropriate choice of parameters. The weighting function is chosen to get the desired information. As we will see later, a particular weighting function f acting on the information converts it into a filter.

2.2.4 The spatial and time variation of 1-D (signals) and 2-D (images) can be characterized effectively by the Information sets

The spatial variation of a variable represented by a histogram and the time variation represented through a time function are discussed later in connection with the formulation of the Hanman transform.

2.2.5 The information sets make the fuzzy modeling easier in the absence of the output information

We will now explore the role of information sets in the fuzzy modeling. Consider the generalized fuzzy model (GFM) proposed by Ahmad et al. [47]. The fuzzy rule underlying the model is of the form:

$$\begin{aligned}&\hbox {GFM Rule:}\,\mathbf{If}\,x_{1}\,\mathrm{is}\, A(x_{1})\quad \hbox {and}\quad x_{2}\,\hbox {is}\,A(x_{2})\quad \hbox {and}\quad \ldots \ldots x_{n}\,\hbox {is}\,A(x_{n})\nonumber \\&\mathbf{Then}\,y = (B, f(x)) \end{aligned}$$
(14)

where \(x_{1},x_{2},{\ldots },x_{n}\) are the fuzzy variables and \(A(x_{1}), {\ldots },A(x_{n})\) are fuzzy sets. B is a fuzzy set of y and f(x) is its centroid value. The generalized fuzzy model (GFM) becomes the Takagi–Sugeno model if \(B=\phi \) and the Mamdani–Larsen model if \(f(x)=0\). The GFM rule can be converted into an information rule as follows:

$$\begin{aligned}&\hbox {Information Rule:}\,\mathbf{If}\,H_{1}(x_{1})\,\hbox {is}\, \{A(x_{1})\mu (x_{1})\}\quad \hbox {and}\quad H_{2}(x_{2})\,\hbox {is}\nonumber \\&\{A(x_{2})\mu (x_{2})\}\ldots ,\quad \hbox {and}\quad H_{n}(x_{n})\,\hbox {is}\, \{A(x_{n})\mu (x_{n})\},\nonumber \\&\mathbf{Then}\,y = (B({\bar{A}_i }, \bar{\mu }_i ),\,f(x)) \end{aligned}$$
(15)

Assuming that the information sets are independent and of proportions \(p_{\mathrm{i}}\) as in Gaussian mixture model (GMM), we obtain

$$\begin{aligned} \bar{A}_i =\frac{{\sum }_{x_i } p_i H_i \left( {x_i } \right) }{{\sum }_{x_i } p_i \mu \left( {x_i } \right) };\quad \bar{\mu }_i =\frac{{\sum }_{x_i } p_i H_i \left( {x_i } \right) }{{\sum }_{x_i } A\left( {x_i } \right) }\quad \hbox {and}\quad f\left( x \right) =\frac{{\sum }_{i=1}^n \bar{A}_i \bar{\mu }_i }{{\sum }_{i=1}^n \bar{\mu }_i } \end{aligned}$$
(16)

If all information sets have the same cardinality, then \(p_i=1\). Note that we are able to derive all the arguments of \(y=(B({\bar{A}_i },\,\bar{\mu }_i ),\,f(x))\) easily without any knowledge of y, thus bringing us into the realm of unsupervised learning. If y is known, then the estimated output f(x) tells us the modeling error \((y-f(x))\). Based on (16), we state a lemma that demonstrates the usefulness of information sets in the context of fuzzy modeling.

Lemma

The representation of the fuzzy sets in the GFM by the information sets converts the antecedent part of the rule into the output, thus facilitating unsupervised learning. As a consequence of this lemma, unsupervised neural networks can be modified to facilitate easy learning.
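A hedged sketch of (16) with \(p_i=1\) (equal cardinality): the information values \(H_i=A_i\mu _i\) of each antecedent fuzzy set yield \(\bar{A}_i\), \(\bar{\mu }_i\), and the estimated output f(x) without any knowledge of y; the fuzzy sets and memberships below are made-up illustrations.

```python
import numpy as np

def gfm_output(A_sets, mu_sets):
    """Centroids A_bar_i, mu_bar_i and the output f(x) of Eq. (16) with p_i = 1."""
    A_bar, mu_bar = [], []
    for A, mu in zip(A_sets, mu_sets):
        H = A * mu                                # information values H_i(x_i)
        A_bar.append(H.sum() / mu.sum())
        mu_bar.append(H.sum() / A.sum())
    A_bar, mu_bar = np.array(A_bar), np.array(mu_bar)
    return (A_bar * mu_bar).sum() / mu_bar.sum()  # f(x)

A_sets  = [np.array([0.2, 0.5, 0.8]), np.array([0.3, 0.6, 0.9])]   # illustrative fuzzy sets
mu_sets = [np.array([0.9, 0.7, 0.4]), np.array([0.5, 0.8, 0.6])]   # their membership grades
f_x = gfm_output(A_sets, mu_sets)
```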

Interactive Information: If the information sets have overlapping information, f(x) requires a Choquet integral type of computation [48]. The interactive information set features based on S-norms are already presented in [43], but these are different from what we propose here. We will make use of the adaptive Hanman–Anirban entropy function to see whether it can be converted into the Choquet integral form as follows:

Substituting \(p_{i}=\bar{A}_i \), \(a=b=0\), \(d=-\bar{{A}}_{i-1} \), and \(c=1\) in (3), we get

$$\begin{aligned}&\bar{y}=f(x)=\sum _{i=1}^n \bar{A}_{i}\mathrm{e}^{-\left( \bar{A}_{i}-\bar{A}_{i-1}\right) } =\sum _{i=1}^n {\bar{{A}}_i } \mathrm{e}^{-\Delta \bar{{A}}_i }\nonumber \\&\hbox {s.t.}\,\bar{{A}}_n =\bar{{A}}_0 \end{aligned}$$
(17)

In the Choquet integral [48], the fuzzy measures are estimated from the input sets, starting with a one-element set, then a two-element set, and finally ending with the complete set, as follows.

$$\begin{aligned} x_1 =\left\{ {\bar{A} _1 } \right\} ,\quad x_2 =\left\{ {\bar{A} _2 ,\bar{A} _1 } \right\} ,\ldots ,\quad x_{d-1} =\left\{ {\bar{A} _{d-1} ,\ldots ,\bar{A} _2 ,\bar{A} _1 } \right\} ,\quad x_d =\left\{ {\bar{A} _d ,\ldots ,\bar{A} _2 ,\bar{A} _1 } \right\} . \end{aligned}$$

As can be seen from (17), it cannot be put in the Choquet integral form, as the exponential gain function is a function of the difference between two adjacent values, whereas a fuzzy measure is a function of all the previous values. In order to convert (17) into the Choquet integral form, we need to modify the Hanman–Anirban entropy function as follows:

$$\begin{aligned} \bar{y}= & {} f(x)=\sum _{i=1}^d g\left( {x_i } \right) \mathrm{e}^{\left( {\bar{{A}}_i -\bar{{A}}_{i-1} } \right) }\nonumber \\= & {} \sum _{i=1}^d g\left( {x_i } \right) \mathrm{e}^{\left( {\Delta \bar{{A}}_i } \right) } \end{aligned}$$
(18)

In this form, \(g(x_{\mathrm{i}})\) being a fuzzy measure requires learning of the interaction parameters.

2.2.6 The information sets can be extended to information rough and rough information sets

In real life, only the information values are available. For example, during the admission of a candidate to a program, each expert x of the interview committee gives only the relative marks \(H\left( x \right) =I\left( x \right) \mu \left( x \right) \), where I(x) refers to the candidate's performance and \(\mu \left( x \right) \) is his relative grade as perceived by the expert x with respect to the previous performance of the candidates interviewed so far.

Consider different membership functions (agents) representing the same fuzzy set. Some information values are on the higher side, \(\left\{ {H\left( x \right) =I\left( x \right) \mu \left( x \right) ;\mu \left( x \right) \ge \alpha } \right\} \), and correspond to the lower Information Rough set, while the others are on the lower side and correspond to the upper Information Rough set. When there is a divergence in the evaluations or attributes, roughness arises.

If \(\left\{ {H\left( x \right) \ge M} \right\} \), where M is a threshold, it is the lower Rough Information set; otherwise, it is the upper rough information set. There are also several other aspects of the rough set theory that can be easily embedded into the information set theory, but these are not addressed here.

2.2.7 Information sets allow the application of agent theory

When different membership functions (agents) judge the information source values differently, we can aggregate the membership function values through t-norms or s-norms to improve the representation. If we have one sample of a user fitted with two membership functions, then these functions can be aggregated to provide a better representation. An agent is a higher form of membership function. We define an agent as one that generates the information when its parameter is varied.

Definition of an Agent: Consider the exponential gain function \(\mathrm{e}^{-\left( {cp_i +d} \right) }\) obtained from (3) with \(a=b=0\), which, when differentiated with respect to c, gives us

$$\begin{aligned} -p_{i}\,\mathrm{e}^{-\left( {cp_i +d} \right) } \end{aligned}$$
(19)

The absolute value of this derivative is the information value associated with \(p_{i}\).

In the context of an agent, it is imperative to define two types of information.

Auto Information set HA(x): If the membership \(\mu \left( x \right) \) is obtained from the statistics of the information source I(x), then we can derive the auto information value, i.e., \(HA\left( x \right) =I\left( x \right) \mu \left( x \right) \).

Hetero Information set HC(x): If the membership \(\mu \left( y \right) \) is obtained from the statistics of another information source I(y), then we can derive the hetero information value, i.e., \(HC\left( x \right) =I\left( x \right) \mu \left( y \right) \).

Some important notations used in this paper: I stands for the information source (say, an image) and \(I_{{ ij}}\) is an information source value. \(\mu _{{ ij}}\) stands for the membership function (also an agent). H stands for the information and \(\mathcal{H}\) stands for the information set \(\{H_{{ ij}}\}\); note that \(H_{{ ij}}\) is an information value. \(H_{\mathrm{t}}\) is the Hanman transform; the subscript “t” denotes that it is a transform. The other subscripts on H, such as e, g, \(N_{\mathrm{e}}\), and \(N_{\mathrm{g}}\), denote the exponential, Gaussian, normalized exponential, and normalized Gaussian forms, respectively.

\(H_{\mathrm{N}}\) is the normalized entropy and H(p) is the entropy as a function of the probability p. \(\mathcal{H}\left( s\right) \) is an information set as a function of s and \(\mathcal{H}\left( {s,F} \right) \) is the Hanman filter. \(F_{{ ij}}(s,u)\) is the filter function and \(\mu _{{ ij}} \left( s \right) \) is the membership function as a function of s. The superscripts e and g on \(\mu _{{ ij}} \) indicate the type of membership function (exponential or Gaussian). \(f_{{ ij}}\) is the feature vector. In the context of classifier design, \(e_{{ ij}}\) is taken as the error vector and \(E_{{ ij}}\) as the normed error vector. The less important notations are omitted to save space.

3 Higher form of information sets

We will now make use of information sets in the formulation of the Hanman filter and the Hanman transform, which are higher forms of information sets.

3.1 Hanman filter

The development of a filter is motivated by the desire to change the information set as per our requirement. For example, an image convolved with a Gabor filter displays the highlighted frequency components, and an underexposed image can be given a pleasing look by applying an enhancement operator. The original information that is not very useful needs to be modified by a filter or an operator. There are two ways to change the information: (i) by changing the membership function/agent that multiplies the information source, and (ii) by devising a filter function or an operator.

We will now discuss the change of information, or the generation of different information sets, by changing the parameter of a membership function. Consider a set of information values originating from the information sources (gray levels) in a window and assume that they are fitted with a Gaussian-type membership function that is a function of the fuzzifier. When the fuzzifier is varied by a scale factor, it gives rise to different membership functions. The generation of the information sets \(\mathcal{H}\left( s \right) \), accomplished by varying the scale factor s in the membership function \(\mu \left( s \right) \), is expressed as

$$\begin{aligned} \mathcal{H}\left( s \right)= & {} \left\{ {\mu _{{ ij}} \left( s \right) I_{{ ij}} } \right\} \nonumber \\ \mu _{{ ij}} \left( s \right)= & {} \mathrm{e}^{-\left[ {\frac{\left( {I_{{ ij}} -I_{\mathrm{avg}}} \right) ^{2}}{sf_h^2 }} \right] }\quad \hbox {for}\quad s \in \left\{ {0.4,0.6,0.8,1} \right\} \end{aligned}$$
(20)

where \(I_{{ ij}}\) is the gray level in a window. The membership function need not be Gaussian and \(\mathcal{H}\left( s \right) \) can also be modified by applying any function such as sigmoid function.

We will now see the second way of changing the information sets, in which the original information set is modified by the choice of a filter function or an operator. As an example, we consider a suitable cosine function to achieve our objective.

Following Property-1 and Property-3, the desired frequency components can be filtered out (captured) from the information sets by the cosine function \(\cos \left( {2\pi F_{{ ij}} \left( {s,u} \right) } \right) \), called the Hanman filter. Invoking this filter modifies the information sets in (20) to:

$$\begin{aligned} \mathcal{H}\left( {s,F} \right) =\mu _{{ ij}} \left( s \right) I_{{ ij}}\,\hbox {cos}\left( {2\pi F_{{ ij}} \left( {s,u} \right) } \right) \end{aligned}$$
(21)

Note that unlike the Gabor filter, Hanman filter is not a function of orientation but is a function of scale s, frequency u and translation of the information source \(I_{\mathrm{ij}}\) by the amount \(I_{\mathrm{avg}}\). Thus, it has the capability of a wavelet function. The filter function \(F_{\mathrm{ij}}(s, u)\) acts on the original information to separate out the frequency components. It is chosen as:

$$\begin{aligned} F_{{ ij}} \left( {s,u} \right) =F_u \left[ {\frac{\left| {I_{{ ij}} -I_{\mathrm{avg}}} \right| }{2^{\mathrm{s}}}} \right] ;\quad s=0.4,0.6,0.8,1.0 \end{aligned}$$
(22)

where we have taken \(F_u =\frac{F_{\mathrm{max}} }{2^{\left( {u/2} \right) }}\) with \(u=1,2,3\) and \(F_{\max }=0.25\). While the symbol of the scale parameter in (22) could have been changed for generality, this has been avoided for simplicity. Another function that one could opt for instead of the cosine is \(\mathrm{e}^{-i2\pi F_{{ ij}} \left( {s,u} \right) }\), but this produces both real and imaginary components. To simplify (22) further, one can do away with \(F_{u}\) by incorporating its effect as follows:

$$\begin{aligned} F_{{ ij}} \left( {s,u} \right) =\frac{\left| {I_{{ ij}} -I_{\mathrm{avg}} } \right| }{2^{su}} \end{aligned}$$
(23)

Neglecting s and taking appropriate value for u in the range 3–5 converts \(F_{\mathrm{ij}}(s,\,u)\) into

$$\begin{aligned} F_{{ ij}} \left( u \right) =\left[ {\frac{I_{{ ij}} -I_{\mathrm{avg}} }{2^{u}}} \right] \end{aligned}$$
(24)

It may be noted that we have not accounted for the orientation in the filter (21) for the simple reason that the face images are bereft of large pose variations. As regards the frequency content, it is hard to determine the most suitable frequency components for a particular problem, but the frequency components of our choice can be retrieved. For instance, the uth frequency component is determined from (21) using \(s=1\) and \(u=2\) as:

$$\begin{aligned} \mathcal{H}\left( {1,F} \right) =\sum _{i=1}^W \sum _{j=1}^W I_{{ ij}} \mu _{{ ij}} \left( 1 \right) \,\hbox {cos}\left( {2\pi F_{{ ij}} \left( {1,2} \right) } \right) \end{aligned}$$
(25)

Algorithm: The steps for the Hanman filter features from (20)–(21) are as follows: (1) Generate 12 information sets from a window of size \(W\times W\) for \(W=3,5,7,9\) in an image by taking 3 values of u and four values of s, (2) Compute the composite information set by aggregating all 12 sets, (3) Compute the average value from each window as the feature, (4) Repeat Steps 1–3 until all windows in a face image are covered, thereby producing a feature vector, and (5) Generate different features corresponding to different values of W.
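A minimal sketch of steps 1–3 of the above algorithm for one \(W\times W\) window, following (20)–(22); the aggregation of the 12 filtered information sets by summation and the random test window are assumptions, and steps 4–5 would simply slide the window over the image and vary W.

```python
import numpy as np

def hanman_filter_feature(window, scales=(0.4, 0.6, 0.8, 1.0), freqs=(1, 2, 3), f_max=0.25):
    """One Hanman filter feature per window: aggregate the 12 sets of Eq. (21) and average."""
    i_avg = window.mean()
    d2 = (window - i_avg) ** 2
    fh2 = (d2 ** 2).sum() / d2.sum()                     # fuzzifier, Eq. (2)
    acc = np.zeros_like(window, dtype=float)
    for s in scales:
        mu = np.exp(-d2 / (s * fh2))                     # mu_ij(s), Eq. (20)
        for u in freqs:
            f_u = f_max / 2.0 ** (u / 2.0)
            F = f_u * np.abs(window - i_avg) / 2.0 ** s  # F_ij(s, u), Eq. (22)
            acc += mu * window * np.cos(2 * np.pi * F)   # H(s, F), Eq. (21)
    return acc.mean()                                    # window-level feature (step 3)

feature = hanman_filter_feature(np.random.randint(0, 256, (5, 5)).astype(float))
```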

The effectiveness of the Hanman filter over the Gabor filter is demonstrated on face recognition by Sayeed and Hanmandlu [49]. The generality of the information values, arising from the flexibility that they can be changed in several ways, bestows immense power on the Hanman filter, whereas the Gabor filter is tied to the Gaussian membership function only.

3.2 Hanman transforms

The second and fourth properties of information sets are used to derive a transform to assess a higher form of uncertainty in the information source values in a window of an image based on the initial uncertainty representation. This can be accomplished by a possibilistic version of the adaptive Hanman–Anirban entropy function having variable parameters. The transforms have realistic applications. For example, we gather information about an unknown person of some interest to us; this is the first level of information (set), and we then evaluate him again to get the second level of information combined with the first one.

We will now see the formulation of the transforms using the Hanman–Anirban entropy function. For this, the parameters of the Hanman–Anirban entropy function (4) are selected as \(a=b=d=0\) and \(c=\mu _{{ ij}} /I_{\mathrm{max}} \), leading to the Hanman transform:

$$\begin{aligned} H_t \left( I \right) =\sum _{i=1}^W \sum _{j=1}^W I_{{ ij}} \mathrm{e}^{-\left( {\mu _{{ ij}} I_{{ ij}} /I_{\mathrm{max}} } \right) } \end{aligned}$$
(26)

Here one of the parameters, c, is taken to be a function of \(\mu _{{ ij}} \). As can be seen from (26), the information source is weighted by a function of the information value. This can be observed in social contexts, for example, where a person (information source) is judged by the opinions of others (information value).

If we have some prior information \(H_0\) in (26), then the exponential gain depends on the relative information as follows:

$$\begin{aligned} H_t \left( I \right) =\sum _{i=1}^W \sum _{j=1}^W I_{{ ij}} \mathrm{e}^{-\left( {\mu _{{ ij}} I_{{ ij}} -H_0 } \right) /I_{\mathrm{max}} } \end{aligned}$$
(27)

If the subimage is represented as a histogram of g(k) versus h(k), where g(k) is the kth gray level and h(k) is the frequency of occurrence of kth gray level, then the transform using Property-2 becomes

$$\begin{aligned} H_t \left( g \right) =\sum _k h\left( k \right) g\left( k \right) \mathrm{e}^{-\mu _k g\left( k \right) } \end{aligned}$$
(28)

where \(\mu _k\) is the membership function value of kth gray level. An integral form of the transform not satisfying the properties of the entropy function is expressed as:

$$\begin{aligned} H_t \left( h \right) =\int _g h\left( g \right) \mathrm{e}^{-\mu \left( g \right) h\left( g \right) }\hbox {d}g \end{aligned}$$
(29)

A simple representation of the histogram is to take h(k) as the membership function and g(k) as the information source values:

$$\begin{aligned} H_t \left( g \right) =\sum _k g\left( k \right) \mathrm{e}^{-h\left( k \right) g\left( k \right) } \end{aligned}$$
(30)

In some applications, a probability density function (PDF), e.g., the life expectancy h(a) of a person having the age a(y), serves as the membership function, and in such cases (30) can be written as

$$\begin{aligned} H_t \left( a \right) =\sum _y a\left( y \right) \mathrm{e}^{-h\left( a \right) a\left( y \right) } \end{aligned}$$
(31)

The extension of (30) to a time-varying signal of some fixed duration is straightforward:

$$\begin{aligned} H_t \left( t \right) =\sum _t g\left( t \right) \mathrm{e}^{-\mu _t g\left( t \right) } \end{aligned}$$
(32)

However, this work is limited to spatial variations.

Algorithm: The Hanman transform features are extracted from (26) in the following steps: (1) Compute the membership value associated with each gray level in a window of size \(W\times W\), (2) Compute the information as the product of the gray level and its membership function value, divided by the maximum gray level in the window, (3) Take the exponential of the negative of this normalized information and multiply it by the gray level, (4) Repeat steps 1–3 on all gray levels in a window and sum the values to obtain a feature, (5) Repeat steps 1–4 on all windows in a face image to get all features, and (6) Repeat steps 1–5 for \(W=3,5,7,9\) for the performance evaluation.
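A sketch of steps 1–4 of the above algorithm for one window, following (26) with a Gaussian membership; sliding it over all windows and window sizes (steps 5–6) gives the full feature vector. The random test window is an assumption.

```python
import numpy as np

def hanman_transform_feature(window):
    """One Hanman transform feature per window, Eq. (26), with a Gaussian membership."""
    i_avg, i_max = window.mean(), window.max()
    d2 = (window - i_avg) ** 2
    fh2 = (d2 ** 2).sum() / d2.sum()                      # fuzzifier, Eq. (2)
    mu = np.exp(-d2 / (2.0 * fh2))                        # Gaussian membership, Eq. (1)
    return (window * np.exp(-mu * window / i_max)).sum()  # H_t(I), Eq. (26)

feature = hanman_transform_feature(np.random.randint(0, 256, (7, 7)).astype(float))
```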

The Hanman transform was used to transform the structure function for the representation of multispectral palmprints in [50] by Grover and Hanmandlu. Aggarwal and Hanmandlu [51] provide a comprehensive treatment of possibilistic uncertainty using higher order Shannon transforms along with several uncertainty measures. These transforms are offshoots of Hanman transforms that were proposed much before the Shannon transforms.

3.3 The new entropy function

With a view to representing the uncertainty in the information source values under the unconstrained conditions that exist in surveillance applications, a new entropy function called the Mamta–Hanman entropy is proposed in [52]. Unified features are derived from it to represent the three modalities, IR face, iris, and ear, for the development of a multimodal biometric system by fusing them using score-level fusion in [53].

The Mamta–Hanman entropy function is of the form

$$\begin{aligned} H=\sum \limits _{i=1}^n \sum _{j=1}^n I_{{ ij}} ^{\gamma }\mathrm{e}^{-\left( {cI_{{ ij}} ^{\alpha }+d} \right) ^{\beta }} \end{aligned}$$
(33)

In view of this, the basic information set now becomes

$$\begin{aligned} \mathcal{H}\left( I \right) =\left\{ {I_{{ ij}}^\gamma \mu _{{ ij}} } \right\} \end{aligned}$$
(34)

The membership function can be chosen appropriately. The Hanman transform in (26) can be written as

$$\begin{aligned} H_t \left( I \right) =\sum _{i=1}^W \sum _{j=1}^W I_{{ ij}}^\gamma \mathrm{e}^{-\left( {\mu _{{ ij}} I_{{ ij}}^\alpha /I_{{\max }} } \right) } \end{aligned}$$
(35)

As can be seen, (35) offers a lot of flexibility in the choice of parameters, but at the cost of increased learning.
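A hedged sketch of (35): the exponents \(\alpha \) and \(\gamma \) are free parameters that would have to be fixed empirically or learned, and the membership used below is only a convenient placeholder.

```python
import numpy as np

def mamta_hanman_transform(window, mu, alpha=1.0, gamma=1.0):
    """Generalized transform of Eq. (35); mu holds the membership values of the window."""
    return (window ** gamma * np.exp(-mu * window ** alpha / window.max())).sum()

win = np.random.randint(1, 256, (5, 5)).astype(float)
mu = np.exp(-np.abs(win - win.mean()) / win.var())   # placeholder membership function
feature = mamta_hanman_transform(win, mu, alpha=1.0, gamma=2.0)
```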

4 Information processing

Here we make use of Property-5 of the information sets to facilitate information processing. We define two rules: information modeling (IM) for the extraction of features based on information sets and information processing (IP) for the matching of the training features with the test features. Let \(I_{\mathrm{tr1}}(l),{\ldots },I_{\mathrm{trn}}(l)\) be the n sub-images (windows) of the lth training sample, \(f_{\mathrm{tr}}(l,j)\) be the corresponding training feature vector, and \(f_{\mathrm{te}}(j)\) be the test feature vector. The IM-Rule is of the form:

$$\begin{aligned}&\hbox {IM-Rule:}\,\mathbf{If}\,f_{\mathrm{tr}}(i,j),\; i=1,2,\ldots ,M,\quad \hbox {and}\quad f_{\mathrm{te}}(j)\, \hbox {are the training and test}\nonumber \\&\hbox {feature vectors}\,\mathbf{Devise}\,\hbox {a criterion function} \end{aligned}$$
(36)

The devise construct is intended to provide a choice for the user to devise any objective function. The Information set theory underlying the IPC and the Hanman integral has been encapsulated in the above rule.

Development of classifiers

An attempt is made to formulate both inner product classifier (IPC) and Hanman classifier (HC) from Hanman–Anirban conditional entropy function, given the feature vectors of the training set and the test set.

4.1 Inner product classifier

This classifier is built on the basis of the training features and the absolute errors between the training and test sample features. We consider the average of two training feature vectors and the aggregation of their errors using t-norms for the development of the classifier. The purpose of the aggregation is to account for the interactions between the errors. The inner product between the average of the training features and the fused errors must be the least for the test features to match the training features. This is the concept behind the proposed classifier.

The aggregated error vectors act as the support vectors which when projected onto the average of the feature vectors, become the inner products that are akin to the margins in support vector machine (SVM). The difference between the highest and the lowest inner products gives the range of margins. The training feature vectors associated with the lowest margin give the identity to the test feature vector. As the absolute errors are considered, the margin is toward the positive side of the projection plane, i.e., the hyperplane. The other forms of errors like square of the errors can also be investigated in the future.

The t-norms generalize the logical conjunction of two fuzzy variables (here, feature values) x and y in the interval [0,1]. If the training set contains a number of sample faces, then the t-norm is taken between the feature vectors of two training samples to increase the difference, i.e., the margin, between them, thus facilitating easy classification. The choice of a t-norm suitable for a feature set is made by trial. Of the many families of t-norms, the Yager t-norm is found to be most suitable for face recognition as it gives the maximum margin. Its parametric form is given by [54]:

$$\begin{aligned} t_Y =\hbox {max}\left[ {1-\left[ {\left( {1-x} \right) ^{p}+\left( {1-y} \right) ^{p}} \right] ^{1/p},0} \right] ;\quad p>0 \end{aligned}$$
(37)
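A one-line sketch of the Yager t-norm in (37), used later to fuse the absolute errors of two training samples; the default p = 22 reflects the value reported for the face data.

```python
import numpy as np

def yager_tnorm(x, y, p=22.0):
    """Yager t-norm of Eq. (37), applied element-wise to arrays in [0, 1]."""
    return np.maximum(1.0 - ((1.0 - x) ** p + (1.0 - y) ** p) ** (1.0 / p), 0.0)
```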

In our case \(p=22\). We will now present the steps involved in the IPC algorithm.

Algorithm for IPC

It may be noted that the normalization is done on the entire feature data. Each feature vector is normalized using only the minimum and maximum feature values. Even if we use the maximum and minimum of the training feature vectors, the results will not be affected. The discriminating power of the classifier comes from the use of appropriate t-norms.

Normalize all the features of all users (\(\forall i)\) column-wise (\(j=1,2,\ldots ,N\)) using

$$\begin{aligned} \bar{f} \left( {i,j} \right) =\frac{f\left( {i,j} \right) -\min \left( {f\left( {i,j} \right) } \right) }{\max \left( {f\left( {i,j} \right) } \right) -\hbox {min}\left( {f\left( {i,j} \right) } \right) } \end{aligned}$$
(38)

where \(f\left( {i,j} \right) \) is the jth feature of ith sample.

1.

    Divide the normalized feature set \(\left\{ {\bar{f} \left( {i,j} \right) } \right\} \) into the training set \(\{f_{\mathrm{tr}} \left( {i,j} \right) \}\) and test set \(\left\{ {f_{\mathrm{te}} \left( j \right) } \right\} \).

    Here \(i =1,2,{\ldots },M;\,j= 1,2,{\ldots },N\); M being the total number of samples for each user in the training set and N being the total number of features from a sample; \(f_{\mathrm{tr}}\) and \(f_{\mathrm{te}}\) are the feature vectors of the training and test samples respectively.

2.

    Calculate the absolute errors \(e_{{ ij}} \) between the features of the ith and kth training samples of a user and any test sample as:

    $$\begin{aligned} e_{{ ij}}= & {} \left| {f_{\mathrm{tr}} \left( {i,j} \right) -f_{\mathrm{te}} \left( j \right) } \right| \nonumber \\ e_{kj}= & {} \left| {f_{\mathrm{tr}} \left( {k,j} \right) -f_{\mathrm{te}} \left( j \right) } \right| \end{aligned}$$
    (39)
3.

    Fuse the absolute errors of ith and kth training samples by the Yager t-norm denoted by

    $$\begin{aligned} E_{ik} (j)=t_Y \{e_{{ ij}},e_{kj}\},i\ne k \end{aligned}$$
    (40)

    We consider all possible combinations of the training sample errors in (40), entailing a marginal extra computation but with the prospect of obtaining the least value of \(E_{ik} (j)\).

4.

    Find the average feature value of the ith and kth training samples

    $$\begin{aligned} f_{ik} \left( j \right) =1/2\left\{ {f_{\mathrm{tr}} \left( {i,j} \right) +f_{\mathrm{tr}}\left( {k,j} \right) } \right\} \end{aligned}$$
    (41)

The normed error vectors in (40) behave as the support vectors and the average feature vectors in (41) as the weights of the SVM. So it is necessary and sufficient that the inner product of \(E_{ik}(j)\) and \(f_{ik}(j)\) be the least for the training sample to be close to the test sample.

$$\begin{aligned} h_{ik} \left( l \right) =\sum _{j=1}^{N} f_{ik} \left( j\right) E_{ik} \left( j \right) =\left\langle {f_{ik} ,E_{ik}} \right\rangle ;\quad i\ne k \end{aligned}$$
(42)

As \(i,k=1,2,\ldots ,M\), the number of inner products generated from (42) is \({\sum }_{i=2}^M \left( {M-i+1} \right) \). The minimum of \(h_{ik} \left( l \right) \) is the measure of dissimilarity corresponding to the lth user. While matching, whichever user corresponds to the infimum of \(h_{ik} \left( l \right) \) over all l gives the identity of the test user. Note that \(f_{ik} \left( j \right) \) is the jth information source (feature) and the fusion of the two errors gives the confidence about the information. As per the experiments, another variant of (42), \(h_{ik} \left( l \right) ={\sum }_{j=1}^{N} f_{ik} \left( j \right) {\sum }_{j=1}^{N} E_{ik} \left( j \right) \), cannot be overlooked and must be given a trial. An interesting result emerges if we introduce the membership functions of the terms in this relation:

$$\begin{aligned} H_{ik} \left( l \right) =\sum _{j=1}^{N} f_{ ik} \left( j\right) \mu _{\mathrm{fik}} \left( j \right) \sum _{ j=1}^{N} E_{ik} \left( j \right) \mu _{\mathrm{Eik}} \left( j \right) =H_{\mathrm{fik}} H_{\mathrm{Eik}} , \end{aligned}$$

which is the product of the information of the training features and that of the errors.
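A hedged end-to-end sketch of the IPC of (39)–(42): for each pair of training samples of a user, the absolute errors are fused by the Yager t-norm and projected onto the averaged training features, and the user with the smallest inner product claims the test sample. It assumes the features are already min-max normalized as in (38) and reuses the `yager_tnorm` helper sketched after (37); the data shapes are assumptions.

```python
import numpy as np
from itertools import combinations

def ipc_score(f_tr, f_te):
    """Dissimilarity of one user: f_tr is (M, N) training features, f_te is (N,)."""
    scores = []
    for i, k in combinations(range(f_tr.shape[0]), 2):
        e_i = np.abs(f_tr[i] - f_te)          # Eq. (39)
        e_k = np.abs(f_tr[k] - f_te)
        E = yager_tnorm(e_i, e_k)             # fused errors, Eq. (40)
        f_ik = 0.5 * (f_tr[i] + f_tr[k])      # averaged features, Eq. (41)
        scores.append(np.dot(f_ik, E))        # inner product, Eq. (42)
    return min(scores)

def ipc_classify(train_sets, f_te):
    """train_sets: one (M, N) array per user; returns the index of the claimed user."""
    return int(np.argmin([ipc_score(f_tr, f_te) for f_tr in train_sets]))
```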

4.2 Hanman classifier and normed error classifier (NEC)

The conditional Hanman–Anirban entropy of a partition \(A_{\mathrm{i}}\), given that \(B_{\mathrm{j}}\) has occurred, is expressed as:

$$\begin{aligned} H[A_i |B_j ]=\sum _{i=1}^n {p_{i|j} \mathrm{e}^{-[ap_{i|j} ^{3}+bp_{i|j} ^{2}+cp_{i|j} +d]}} \end{aligned}$$
(43)

where \(p_{i|j} =p_{{ ij}} /q_j =\Pr [A_i |B_j ]=\Pr [A_i B_j ]/\Pr [B_j ]\)

The Bayesian conditional entropy is not applicable here as we do not have the joint probability density function of \(A_{i}\) and \(B_{j}\). We now propose the possibilistic versions of the above conditional entropy function. Assuming that \(A=\{A_i =f_{\mathrm{tr}} (i,j)\}\) and \(B=\{B_i =f_{\mathrm{ts}} (j)\}\) which refer to the training and the test information sets, respectively, the conditional possibility cposs is defined as

$$\begin{aligned} \hbox {cposs}\left( {A/B} \right) =\left\{ {f_{\mathrm{tr}} \left( {i,j} \right) -f_{\mathrm{ts}} \left( j \right) } \right\} =\left\{ {e_{{ ij}} } \right\} \end{aligned}$$
(44)

If we take \(A=\{f_{\mathrm{tr}} (i,j)\mu _{\mathrm{tr}} (i,j)\}\) and \(B=\{f_{\mathrm{ts}} (j)\mu _{\mathrm{ts}} (j)\}\), then (44) becomes

$$\begin{aligned} \hbox {cposs}\left( {A/B} \right)= & {} \left[ {f_{\mathrm{tr}} \left( {i,j} \right) \mu _{\mathrm{tr}} \left( {i,j} \right) -f_{\mathrm{ts}} \left( j \right) \mu _{\mathrm{ts}} \left( j \right) } \right] \nonumber \\= & {} \left\{ {e_{{ ij}} } \right\} \end{aligned}$$
(45)

The above definition of the possibility justifies the fact that if we already have some information A and new information B is received, then it is easy to observe its difference from A. Taking \(a=b=0\) in (43) gives rise to the Hanman distance, given by

$$\begin{aligned} H\left( {A|B} \right)= & {} \hbox {cposs}\left( {A|B} \right) \mathrm{e}^{-\left[ {c\,\hbox {cposs}\left( {A|B} \right) +d} \right] }\nonumber \\= & {} \sum _{i=1}^n e_{{ ij}} \mathrm{e}^{-\left[ {c\,e_{{ ij}} +d} \right] } \end{aligned}$$
(46)

It can be proved that this is more general than the Euclidean and Mahalanobis distances. Taking \(d=-1\) and \(c= -1/\hbox {cov}\,(e_{{ ij}})\) with the approximation \(\hbox {exp}({-}x) \approx 1-x\) in (46) results in the Mahalanobis distance, and with \(d=-1\) and \(c=-1\), the Euclidean distance.

We will extend the conditional possibility to the case of two training feature vectors, \(\{A_i =f_{\mathrm{tr}} (i,j)\}\), \(\{C_i =f_{\mathrm{tr}} (k,j)\}\) and one test feature vector \(\{B=f_{\mathrm{te}} (j)\}\). The conditional possibility can now be written as:

$$\begin{aligned} \{\hbox {cposs}(A\cap C/B)\}=(A-B)\cap (C-B) \end{aligned}$$
(47)

Substituting for A, C and B we get

$$\begin{aligned}&\hbox {cposs}\left\{ {f_{\mathrm{tr}} (i,j)\cap f_{\mathrm{tr}} (k,j)/f_{\mathrm{te}} (j)} \right\} \nonumber \\&\quad =\left\{ f_{\mathrm{tr}} (i,j)-f_{\mathrm{te}} (j)\right\} \cap \left\{ f_{\mathrm{tr}} (k,j)-f_{\mathrm{te}} (j)\right\} \nonumber \\&\quad =\left\{ e_{{ ij}} \cap e_{kj} \right\} =\left\{ t(e_{{ ij}} ,e_{kj} )\right\} =E_{ik} (j) \end{aligned}$$
(48)

where \(E_{ik} (j)\) is the normed error vector. As our aim is to build a classifier similar to the IPC, the error transform named the Hanman classifier is obtained by replacing \(e_{{ ij}}\) with \(E_{ik}(j)\) from (48) in (46):

$$\begin{aligned} H(E_{ik} )=\sum _{j=1}^n {E_{ik} (j)\mathrm{e}^{-[c.E_{ik}(j)+d]}} \end{aligned}$$
(49)

To avoid learning, we have taken \(c=1\) and \(d=0\) for the implementation on the databases. Let \(\varphi _{i} \left( x \right) =x\mathrm{e}^{x}\,\forall i=1,2,\ldots n\) such that \(\varphi _i^{{\prime }{\prime }} \left( x \right) =\left( {x+2} \right) \mathrm{e}^{x}>0\,\forall \,x\in R\). Thus \(\varphi _{i}\) is convex and twice differentiable and hence acts as a splitting function, and so does \(H(E_{ik})\). If the exponential gain is ignored in (49), we get the normed error classifier (NEC) given by

$$\begin{aligned} h_{ik} \left( l \right) =\sum _{j=1}^n E_{{\mathrm{ik}}} \left( j \right) \quad \hbox {with}\quad E_{ik} (j)=t_Y (e_{{ ij}},e_{kj} ) \end{aligned}$$
(50)

The entropy function in (49) permits another form for (50) such as \(H(E_{ik} )={\sum }_{j=1}^n {E_{ik} (j)\mathrm{e}^{-[E_{ik}^2 (j)/2]}}\) which can be seen as the product of the error function and the Gaussian membership function with zero mean and unit variance.
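A sketch of the Hanman classifier of (49) with c = 1 and d = 0, together with the normed error classifier of (50); both reuse the Yager-fused error vector \(E_{ik}(j)\) of (48) and the `yager_tnorm` helper sketched after (37). Selecting the user with the smallest score, mirroring the IPC decision rule, is an assumption.

```python
import numpy as np
from itertools import combinations

def hc_nec_scores(f_tr, f_te):
    """Per-user Hanman classifier and NEC scores; f_tr is (M, N), f_te is (N,)."""
    hc, nec = [], []
    for i, k in combinations(range(f_tr.shape[0]), 2):
        E = yager_tnorm(np.abs(f_tr[i] - f_te), np.abs(f_tr[k] - f_te))  # Eq. (48)
        hc.append((E * np.exp(-E)).sum())   # Hanman classifier, Eq. (49), c = 1, d = 0
        nec.append(E.sum())                 # normed error classifier, Eq. (50)
    return min(hc), min(nec)
```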

5 Results of face recognition

5.1 Face databases

The information set based features are tested on three face databases using SVM, IPC, and HC. The first one, the ORL (AT&T) database [55], has 40 users with 10 samples per user, of which 7 are used for training and 3 for testing. The second is the Indian face database [56] that contains 53 persons, each having 11 images. The orientations of the face (both male and female) include: looking front, looking left, looking right, looking up, looking up toward left, looking up toward right, and looking down. The emotions include: neutral, smile, laughter, and sad/disgust. The third is the Sheffield face database (UMIST database) [57] containing 20 users, each having 23 sample images with a range of pose variations from profile to frontal. In addition to these three databases, we have used two more databases, viz., Faces-95 [58] having 72 users with 1440 images and FEI [59] having 100 users with 1400 images.

In the case of the HF features, the Euclidean distance, Bayesian LDC, SVM, IPC, NEC, and HC give (see Table 1) maximum recognition rates of 90% (\(7\times 7\)), 95.83% (\(9\times 9\)), 96.67% (\(9\times 9\), polynomial of degree 1), 96.67% (\(3\times 3\)), 97.5% (\(7\times 7\)), and 98.83% (\(5\times 5\) and \(7\times 7\)), respectively, for the training to test ratio of 7:3. Note that the recognition rate on the HF features is 98.83% with HC on the \(7\times 7\) window and 97.5% with NEC, but it drops to 96.67% with SVM (PR-Tools) for the polynomial of degree 3 on the \(9\times 9\) window and also with IPC on the \(3\times 3\) window.

The recognition rates obtained on the HT features using two versions of SVM (PR-Tools and LIB) [60,61,62] differ widely. PR-Tools gives the best recognition rate of 93.33% (\(7\times 7\) for the polynomial of degree 3 and \(9\times 9\) for degree 2) and LIB gives 98.33% (on both \(5\times 5\) and \(7\times 7\)) for the polynomial of degree 1 in Table 2. The same result is also obtained with IPC on the HT features on the window size of \(7\times 7\), but a slightly improved rate of 99.2% is obtained with both NEC and HC on the \(5\times 5\) and \(7\times 7\) windows. However, with Bayesian LDC, the performance deteriorates to 94.17% (\(5\times 5\)). Thus, the HT features are found superior to the HF features because of the difficulty in the selection of appropriate frequency components in the latter. Here HC is more consistent, and its performance fares well over that of IPC and is slightly better than that of NEC and SVM. The recognition rate using Gabor features with HC and NEC is 97.5% but is 95.3% with SVM, as in Table 3.

The performance of the Gabor, HF, and HT features is also evaluated on the IIT Kanpur Indian face database and the UMIST face database using the three classifiers and SVM in Tables 4, 5, 6, 7, 8 and 9. The best recognition rate of 98.48% is achieved with HC on the HF features with a \(9\times 9\) window size. In the case of the UMIST database, the best result of 95% is obtained with HC and NEC.

The Gabor filter features are found to yield the best result of 96.22% recognition rate with the HC on IIT Kanpur Face Database whereas the same features give 95% on the UMIST database with three new classifiers and SVM.

Table 1 Recognition rates with HF features for the ratio of 7:3 (AT&T database)
Table 2 Recognition rates with HT features for the ratio of 7:3 (AT&T database)
Table 3 Recognition rates with Gabor features for the ratio of 7:3 (AT&T database)
Table 4 Recognition rates with HF features for the ratio of 8:3 (Indian face database)
Table 5 Recognition rates with HT features for the ratio of 8:3 (Indian face database)
Table 6 Recognition rates with Gabor features for the ratio of 8:3 (Indian face database)
Table 7 Recognition rates with HF features for the ratio of 18:5 (UMIST database)
Table 8 Recognition rates with HT features for the ratio of 18:5 (UMIST database)
Table 9 Recognition rates with Gabor features for the ratio of 18:5 (UMIST database)
Table 10 Recognition rates with different classifiers for Faces-95 database with HF and HT features
Table 11 Recognition rates with different classifiers for FEI face database with HF and HT features
Table 12 Recognition rates with different classifiers and database with LBP features
Table 13 Recognition rates with Gabor information features for the ratio of 6:4 (AT&T database)

In addition to the above three databases, we have also tested the HF and HT features on two more databases, Faces-95 and FEI, using SVM, IPC, NEC, HC, and cosine similarity. The results are tabulated in Tables 10 and 11, which show that Faces-95 gives a good performance with both features, whereas the FEI database gives a poor performance because it has large variations in poses and illumination. However, cosine similarity fares well over all the other classifiers on FEI.

5.2 A comparison with LBP, SIFT and other features

The local binary pattern (LBP) and scale-invariant feature transform (SIFT) features are implemented on all five databases. The results of LBP are given in Table 12, which shows good results only on the IIT Kanpur Indian face database; those of SIFT are not given as the results are extremely poor.

For the derivation of the Gabor information features, we use a Gabor filter bank consisting of a set of 2D Gabor filters \(f_{{ ij}}\) at different orientations and frequencies. Each Gabor filter is convolved with the original gray-level image \(I_{{ ij}}\), resulting in the convolved image or Gabor image.

All the outputs of the Gabor filter bank, called Gabor images (12 Gabor filters with 3 frequencies and 4 orientations), are aggregated, and the Gabor information features are extracted from the aggregated Gabor image for different window sizes by finding the information using the membership value of each pixel in the window.

On similar lines, the wavelet transform is applied on the image to yield the approximation image, which is divided into windows, and the membership function value of every pixel in a window is computed to obtain the wavelet information features [49]. The performance of the Gabor information features on the AT&T database is given in Table 13 and that of the wavelet information features in Table 14. These tables indicate that HC gives the best results of 98.1 and 98.7% recognition rates on these features, respectively. However, the best accuracy of 99.2% is obtained on this database with the HT features in this work.

6 Conclusions

This paper presents a brief theory of information sets and information processing. The elements of an information set, called information values, are shown to be the products of the information source values (gray levels) and their membership function or agent values. The usefulness of information sets is brought out by devising the Hanman filter that modifies the information values and the Hanman transforms that evaluate the information sources. Six features are developed using the information set, the Hanman filter, and the Hanman transforms. The generalized fuzzy rules describing the generalized fuzzy model (GFM) are transformed into information rules that facilitate unsupervised learning, and three classifiers are developed as part of information processing: the inner product classifier (IPC), the normed error classifier (NEC), and the Hanman classifier (HC).

Table 14 Recognition rates with wavelet information features for the ratio of 6:4 (AT&T database)

The information values modified by a cosine function are the outcome of the HF, whose features have been tested on three databases, the ORL, Indian, and UMIST face databases, using the Euclidean distance, Bayesian LDC, SVM (PR-Tools), IPC, NEC, and HC. The maximum recognition rates of 98.33% (ORL), 98.48% (IIT Kanpur), and 95% (UMIST) are obtained using HC. The next best is NEC with the corresponding recognition rates (97.7, 96.2, 95%), followed by SVM with (96.7, 91.82, and 93%). The Gabor features give a recognition rate of 96.22% using HC on the IIT Kanpur face database and 95% on the UMIST database. This shows that the HF features fare well over the Gabor features. The HF features have also been implemented on Faces-95 and FEI, and the best performance of 95.48% is obtained with IPC on Faces-95.

It may be noted that the Gabor-information features and wavelet information features yield 98.1 and 98.7% recognition rates respectively with HC.

The HT features provide superior recognition rates over the HF features on the three databases. The recognition rates due to HC are 99.2% (AT&T), 95.5% (Indian face database), and 95% (UMIST), and those due to NEC are 99.2% (AT&T), 96.62% (Indian face database), and 95% (UMIST). The SVM has the corresponding recognition rates of (98.33, 94.33, 93%). The best result with IPC on Faces-95 is 95.48%.

It is observed that the performance of the new classifiers depends on the choice of t-norms for a particular modality. Out of several t-norms, Yager t-norms are found to be suitable on face databases.

The contributions of the paper include the proposition of information sets and information processing as well as the development of two feature extraction methods, viz., HF and HT, and three classifiers, IPC, NEC, and HC.