1 Introduction

Texture is an intrinsic property of images that plays a vital role in image classification, image segmentation, and image synthesis. In the image processing literature, texture is defined as the spatial variation of pixel intensities across an image. A major problem, however, is that textures in the real world are often not uniform, owing to variations in scale, orientation, illumination, and other aspects of visual appearance. Texture classification is concerned with the repetition and flow of patterns within the texture, unlike typical image classification, which focuses only on recognizing prominent objects in the foreground of an image. Figure 1 shows the textures of different commonplace objects used in daily life.

Fig. 1. Examples of different textures taken from the KTH-TIPS2b dataset. From top left: Aluminum Foil, Wool, Lettuce Leaf, Brown Bread, Wood, Cracker

Research on texture classification has received significant attention in recent years because of its applications in various real-life technologies: industrial and biomedical surface inspection, identification of diseases, ground classification and segmentation using satellite or aerial imagery, segmentation of textured regions in document analysis, and content-based access to image databases. On the other hand, the non-uniformity of texture surfaces poses limitations that must be considered. Differences in orientation, scale, and illumination conditions account for significant variability in image textures, and addressing this variability precisely is the central challenge for any texture classification algorithm. Researchers have therefore incorporated invariance with respect to properties such as spatial scale, orientation, and grayscale. Capsule networks are a state-of-the-art method that explicitly captures the spatial and relative relationships between extracted features by using vectorial representations. This property of capsule networks helps in learning the regular patterns embedded in texture images. In this work, we study texture classification on various standard datasets by implementing a two-layer capsule network.

2 Previous Work

The seeds of the study of texture analysis were sown in 1962, when Julesz [1] proposed a texture perception model and conducted visual pattern discrimination experiments based on brightness and hue to explain the human visual perception of texture. One of the earliest texture descriptor methods, the Gray-Level Co-occurrence Matrix (GLCM) [2], was presented in 1973 by Haralick. It computes the probability of the joint occurrence of two pixel intensities at a specified angle and a specified distance; the statistical measures computed from the co-occurrence matrix serve as the features of the image. In the late 20th century, researchers turned to feature extraction for texture analysis using filtering approaches such as Gabor filters [3], Gabor wavelets [4], and differences of Gaussians [5]. Statistical modeling approaches assumed that texture images are drawn from probability distributions on Markov Random Fields (MRFs), as in the MRF texture model [6], Gaussian MRFs [7], or fractal models [8]. In the first decade of the 21st century, a computationally efficient, grayscale- and rotation-invariant texture classification method was presented using Local Binary Patterns (LBP) [9], in which the LBPs of an image serve as its texture features. Multiresolution analysis in that work was achieved by combining multiple operators for detecting LBP patterns.
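To make the GLCM idea concrete, the sketch below computes a co-occurrence matrix and a few Haralick-style statistics. It is our illustration using scikit-image (the `graycomatrix`/`graycoprops` names assume scikit-image >= 0.19), not code from the works cited above.

```python
# Haralick-style GLCM texture features, in the spirit of [2].
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_image, distances=(1,), angles=(0, np.pi / 2)):
    """gray_image: 2-D uint8 array with pixel intensities in [0, 255]."""
    # Joint occurrence counts of pixel pairs at the given offsets,
    # normalised so entries can be read as probabilities.
    glcm = graycomatrix(gray_image, distances=distances, angles=angles,
                        levels=256, symmetric=True, normed=True)
    # Statistical measures of the matrix serve as the texture features.
    return {prop: graycoprops(glcm, prop).mean()
            for prop in ("contrast", "homogeneity", "energy", "correlation")}

# Example: features of a random 64x64 "texture".
features = glcm_features(np.random.randint(0, 256, (64, 64), dtype=np.uint8))
```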

Deep learning methods came to dominate image recognition research after a major breakthrough in 2012, when Alex Krizhevsky [10] applied deep convolutional neural networks to ImageNet [11] for image classification. The five-layer convolutional network with 60 million parameters achieved tremendous results, overshadowing the previously existing hand-crafted texture descriptors. Later, translational invariance in images was captured by wavelet scattering networks [12], in which the scattering representation of stationary processes was used to compute higher-order moments and the Fourier power spectrum was used to discriminate textures. In 2015, encouraging results for the classification of materials and surface attributes in clutter were obtained when researchers from Oxford put forward a novel texture descriptor using Fisher vector pooling of a Convolutional Neural Network (CNN) filter bank [13]. Around the same time, a deep learning architecture specific to texture recognition, Texture CNN, was demonstrated; it embeds learnable filter banks in the architecture and pools an energy measure from the last convolutional layer to accomplish the task [14].

3 Capsule Networks

3.1 Disadvantages of Convolutional Neural Networks

Geoffrey Hinton's idea in 2012 [10] of using a deep CNN for image recognition laid a strong platform for the deep learning resurgence that followed. CNNs have continued to dominate this area over the past few years.

The extraction of features using a set of convolutional and pooling layers is the principle underlying the working of a CNN. The convolution operations, performed using weight filters, detect the key features in the image; the values of each weight filter are learned so that the filter is activated by a certain set of features. Unfortunately, the disadvantage of CNNs lies in the fact that the convolution operations do not consider the spatial arrangement of features or their relative relationships; rather, they merely look for their absolute presence. The lower-level neurons of a CNN fail to send their outputs only to the relevant higher-level neurons.

3.2 Advantages of Capsule Networks

A generic CNN involves convolutions of scalar quantities, namely the features extracted from the preceding layer and the kernel matrix of the present layer. Hinton suggested in [15] that robust image recognition can be achieved through richer feature representations using vectors, since more relative and relational information between features can be embedded in vectors.

A generic CNN must be supplied with different variants of the same image in order to classify the image into the correct group. A variant can differ from the original image in size, angle, translation, light intensity, and so on. A CNN records different internal scalar representations for the original image and each of its variants; hence, the network has to be trained on the different variants of an image for it to capture this diversity.

In contrast, a capsule network learns an internal representation of an image that encapsulates the orientation and likelihood of its different features. The vectors are capable of learning the general representation of a class rather than memorizing the feature notation of every variant of an image. If all the features are transformed by the same amount and in the same direction, these variations are preserved in the vector notation: under such a transformation of the image, the vector changes its orientation while its length remains constant. This property, termed equivariance, is a way of detecting a group of things that can be transformed into one another. It helps capsule networks classify new and unseen variants of an image correctly; a modified image of the same underlying object can be classified precisely and with high probability. Capsule networks have been shown to give accurate results with comparatively small amounts of data [10].

3.3 What Is a Capsule?

Every capsule network begins with a normal convolutional layer, where each pixel of the input image serves as input to the first layer. The image is convolved with appropriate kernels to extract the corresponding features, with the spatial dimensions and stride adjusted accordingly. The resulting feature maps are then passed through a rectified linear unit (ReLU) to induce nonlinearity. The number of convolutional layers that precede the first capsule layer depends on the problem under consideration.

After the feature maps have been convolved with the required number of kernels in the convolutional layers, the output of each neuron is a scalar whose value may represent an internal feature. In a capsule layer, these scalar outputs from the preceding layer are stacked into small decks; each deck is split into smaller sets of values termed capsules, and each deck as a whole is referred to as a capsule layer. Thus, a capsule is a set of values represented as a vector that stores information about the instantiation parameters, such as size, angle, hue, and thickness, of an object or object part. This vector is called the activity vector: its length (norm) gives the probability that the entity exists, and its orientation encodes the instantiation parameters. A minimal sketch of this stacking is shown below.
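The following sketch illustrates the stacking in Keras with the TensorFlow backend (the framework used in Sect. 4.2); the filter counts, kernel size, and input shape are assumptions chosen for illustration.

```python
# Stacking scalar feature maps into 16-D capsules (illustrative shapes).
import tensorflow as tf
from tensorflow.keras import layers

feature_maps = layers.Input(shape=(20, 20, 256))   # output of earlier conv layers

# Convolve once more, then cut the channel axis into decks of 16 values:
# every group of 16 scalars becomes one capsule (activity vector).
x = layers.Conv2D(filters=256, kernel_size=9, strides=2, padding='valid')(feature_maps)
capsules = layers.Reshape((-1, 16))(x)             # (num_capsules, 16)

# The norm of each activity vector encodes the probability that the
# corresponding entity is present; its orientation encodes the parameters.
lengths = layers.Lambda(lambda v: tf.norm(v, axis=-1))(capsules)
```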

3.4 Working of a Capsule

The working of an individual capsule layer is explained below with the help of Fig. 2.

Fig. 2. Internal working of a capsule, showing the affine transform and squash function

Input Vector. A capsule receives a vector as its input and produces a vector as its output. The different dimensions of these vectors correspond to different instantiation parameters. The different capsules are represented by \(u_1\), \(u_2\), \(u_3\) in Fig. 2.

Affine Transformation. As shown in Fig. 2, transformation matrices (\(w_{1j}\), \(w_{2j}\), \(w_{3j}\)) of appropriate dimensions are multiplied with the vectors from the previous layer. These matrices can capture relationships between features that might have gone unnoticed in the previous layer. Their purpose is to map each input vector to the predicted position of the output vector representing the next higher level of features. The output vector \(v_1\) represents the predicted position of the next higher feature with respect to \(u_1\), \(v_2\) the prediction with respect to \(u_2\), and so on. The final prediction of the higher-level feature is a combination of the outputs \(v_1\), \(v_2\), \(v_3\); if all of them make the same prediction, the certainty in the presence of the next-level feature increases. The weighted sum of these output vectors is computed, and a non-linear activation function is applied.

Dynamic Routing. Although capsule networks were conceived some time ago, the lack of a proper training algorithm hampered their application. The sudden rise in their usage can be attributed to the invention of dynamic routing [15], an algorithm for training capsule networks. Thus, it is not the capsule concept itself but the training algorithm that created the revolution.

This algorithm determines how strongly each input capsule is coupled to each output capsule; the affine transformation matrices themselves are learned by backpropagation. The lower-level capsules compute their different projections onto the higher-level capsules, and the coupling coefficients are iteratively adjusted to maximize the fit with the most relevant capsule and minimize the fit with the other capsules in decreasing order of relevance. A simplified sketch follows.
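The sketch below illustrates routing-by-agreement in the style of [15] using NumPy. The shapes, iteration count, and the inline `squash` (defined formally in Sect. 3.4) are assumptions for illustration, not our exact training code.

```python
# Routing-by-agreement between a lower and an upper capsule layer.
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    # Non-linearity of Sect. 3.4: short vectors shrink to ~0, long ones to ~unit length.
    n2 = np.sum(s ** 2, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: predictions of shape (num_lower, num_upper, dim_upper),
    i.e. each lower-level capsule's projection onto every upper capsule."""
    num_lower, num_upper, _ = u_hat.shape
    b = np.zeros((num_lower, num_upper))                       # routing logits
    for _ in range(iterations):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # coupling coefficients
        s = (c[..., None] * u_hat).sum(axis=0)                 # weighted sum per upper capsule
        v = squash(s)                                          # candidate outputs
        # Increase the logit where prediction and output agree (dot product).
        b += np.einsum('ijk,jk->ij', u_hat, v)
    return v

# Example: affine transforms W map 8-D lower capsules to 16-D predictions.
rng = np.random.default_rng(0)
u = rng.normal(size=(32, 8))                # 32 lower-level capsules
W = rng.normal(size=(32, 10, 16, 8))        # one matrix per (lower, upper) pair
u_hat = np.einsum('ijkl,il->ijk', W, u)     # predicted output vectors
v = dynamic_routing(u_hat)                  # 10 upper capsules, 16-D each
```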

Squashing. Squashing is a non-linear activation function that maps a vector to a vector. Since the length of the output vector corresponds to a probability, the function ensures that this length lies between 0 and 1. To preserve the direction of the vector, the function scales a unit vector in the same direction by the required magnitude. The mathematical formula is shown below: for the \(j^{th}\) class, \(v_j\) is the output of the squashing function for the input vector \(s_j\).

$$\begin{aligned} v_j&= \frac{\Vert s_j\Vert ^2}{1 + \Vert s_j\Vert ^2} \, \frac{s_j}{\Vert s_j\Vert }\\ v_j&\approx 0 \text { when } \Vert s_j\Vert \text { is small}\\ v_j&\approx \frac{s_j}{\Vert s_j\Vert } \text { when } \Vert s_j\Vert \text { is large} \end{aligned}$$
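A quick numeric check (our illustration) of the two limiting cases:

```python
# Verify the small-input and large-input behaviour of the squash function.
import numpy as np

def squash(s, eps=1e-9):
    n2 = np.sum(s ** 2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

print(np.linalg.norm(squash(np.array([0.01, 0.0]))))   # ~0.0001: small input -> ~0
print(np.linalg.norm(squash(np.array([100.0, 0.0]))))  # ~0.9999: large input -> ~1
```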

4 Proposed Architecture

Transfer learning is an effective deep learning technique in which features learned on a first task are re-used on a related second task. The first model is trained to extract the features to be transferred, and these features serve as the starting point for the second, target task. Depending on the setting, all of the first model or only a fraction of it is used in training the second model. The most important requirement is that the learned features be general enough to suit both tasks, as in the sketch below.
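A minimal sketch of this idea in Keras (illustrative; how many layers to keep frozen is a design choice, not prescribed here):

```python
# Re-use pretrained ImageNet features as the starting point for a new task.
from tensorflow.keras.applications import Xception

base = Xception(weights='imagenet', include_top=False,
                input_shape=(256, 256, 3))   # first task: ImageNet features
base.trainable = False                       # or unfreeze a fraction of the layers
# `base.output` now serves as the starting point for the second task.
```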

Fig. 3. Overview of the proposed network architecture

Under the assumption that features extracted through a series of convolutional layers trained on a particular dataset are useful for other, similar problems, we used the Xception network [22] as the base model for the transfer learning task and built upon it. Xception replaces the Inception modules with depthwise separable convolutions, yielding a lighter but more efficient model. The network has been extensively trained on the trimmed ImageNet visual database [11] of more than 1.2 million hand-annotated images, and its weights were used as the initial weights of our model. Images from the datasets in use were fed to the pre-trained Xception network, and the extracted feature vectors were sent to the subsequent layers, namely the primary capsule layer and the dense capsule layer. The primary capsule layer consists of a convolutional layer followed by a reshaping layer, which reshapes the output tensor into a series of capsules of 16 dimensions each; the dynamic routing algorithm is implemented in this layer. The resulting capsules are passed through the squashing activation function and finally to the dense capsule layer, which produces a 16-dimensional output vector for each texture class. The output class is chosen as the texture class with the minimum \(\hbox {L}_{2}\) norm of the error. Figure 3 shows our comprehensive architecture in detail, and a condensed sketch is given below. For all the prescribed datasets, the normalized classification accuracy is used as the primary evaluation metric.
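The following condensed Keras sketch follows the pipeline of Fig. 3. Layer sizes are assumptions for illustration, and for brevity the routed dense capsule layer is approximated by a plain dense layer; the actual routing is the procedure of Sect. 3.4.

```python
# Condensed sketch of the Xception + capsule pipeline (illustrative sizes).
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import Xception

NUM_CLASSES, CAPS_DIM = 11, 16               # e.g. KTH-TIPS 2b

def squash(s, eps=1e-9):
    n2 = tf.reduce_sum(tf.square(s), axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / tf.sqrt(n2 + eps)

base = Xception(weights='imagenet', include_top=False, input_shape=(256, 256, 3))

# Primary capsule layer: convolution, then reshape into 16-D capsules.
x = layers.Conv2D(256, kernel_size=3, strides=2, padding='valid')(base.output)
primary = layers.Reshape((-1, CAPS_DIM))(x)
primary = layers.Lambda(squash)(primary)

# Dense capsule layer: one 16-D output capsule per texture class
# (dynamic routing between the two capsule layers is omitted here).
digit = layers.Dense(NUM_CLASSES * CAPS_DIM)(layers.Flatten()(primary))
digit = layers.Reshape((NUM_CLASSES, CAPS_DIM))(digit)
out = layers.Lambda(lambda v: tf.norm(v, axis=-1))(digit)   # per-class lengths

model = models.Model(base.input, out)
```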

4.1 Loss Function

The cost function or error function to be optimized in capsule networks is called the margin loss, \(L_c\), computed for each texture category c with class vector \(v_c\). It plays the role that cross entropy plays in multi-class classification, applied independently to each class. The cost function is a function of the weights, and its equation is given below.

$$\begin{aligned} L_c = T_c \, \max (0, m^+ - \Vert v_c\Vert )^2 + \lambda \, (1 - T_c) \, \max (0, \Vert v_c\Vert - m^-)^2 \end{aligned}$$

where \(T_c = 1\) for an image of texture class c, and \(T_c = 0\) otherwise. We set \(m^+ = 0.9\) and \(m^- = 0.1\) throughout our experiments. The hyper-parameter ‘\(\lambda \)’ is set to 0.5; it down-weights the loss for absent classes, which stops the initial learning from shrinking the activity vectors of all classes. The total loss is the sum of the losses over all classes.
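A direct transcription of this loss into TensorFlow (our sketch; the tensor layout is an assumption):

```python
# Margin loss: squared hinge terms on the class-vector lengths.
import tensorflow as tf

M_POS, M_NEG, LAMBDA = 0.9, 0.1, 0.5

def margin_loss(T, v_norm):
    """T: one-hot labels, shape (batch, num_classes).
    v_norm: lengths ||v_c|| of the class activity vectors, same shape."""
    present = T * tf.square(tf.maximum(0.0, M_POS - v_norm))
    absent = LAMBDA * (1.0 - T) * tf.square(tf.maximum(0.0, v_norm - M_NEG))
    # Total loss: sum over classes, averaged over the batch.
    return tf.reduce_mean(tf.reduce_sum(present + absent, axis=-1))
```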

4.2 Training

We ran our model using the Keras Python framework with the TensorFlow backend on an Nvidia GTX 1080 Ti GPU. All input images were resized to \(256 \times 256\) before being given to the model. We trained our architecture separately on each dataset for 50 epochs to obtain the test set accuracy. For the KTH-TIPS 2b database, following the pre-established norm, we took each sample set in turn as the training set and the remaining three sample sets as the testing set, and we report the average classification accuracy over these splits. The Kylberg, UIUC, DTD, and CUReT datasets were randomly split in the ratio 0.5:0.2:0.3 into training, validation, and test sets; the FMD dataset was split in the ratio 0.6:0.1:0.3. For these datasets, we randomly split the data five times and report the median classification accuracy. We used the Adam optimizer to update the weights after every batch, with a batch size of 16. As data augmentation, each image in a batch was rotated by a random angle between \(0^{\circ }\) and \(90^{\circ }\). We used 3 routing iterations for each update of the capsule layers' weights, an initial learning rate of 0.001, and a decay factor of 0.9 after every epoch of training. A sketch of this configuration follows.
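A sketch of this training setup (illustrative; `model` and `margin_loss` refer to the architecture and loss of Sects. 4 and 4.1, and the exact generators and callbacks in our code may differ):

```python
# Optimizer, learning-rate decay, and rotation augmentation as described above.
import numpy as np
import tensorflow as tf
from scipy.ndimage import rotate
from tensorflow.keras.callbacks import LearningRateScheduler

BATCH_SIZE = 16

def augment(batch):
    # Rotate each image in the batch by a random angle in [0, 90] degrees.
    return np.stack([rotate(img, np.random.uniform(0, 90), reshape=False)
                     for img in batch])

# Adam optimizer, initial learning rate 0.001, decayed by 0.9 every epoch.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
lr_decay = LearningRateScheduler(lambda epoch: 1e-3 * (0.9 ** epoch))

# model.compile(optimizer=optimizer, loss=margin_loss)  # see Sects. 4 and 4.1
```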

5 Datasets

We performed experiments to assess the texture classification ability using the aforementioned architecture. Our proposed model was trained on different texture databases, namely:

5.1 Kylberg

It consists of 28 texture classes [16] in total. The textures include different types of fabrics, grass, and stone surfaces imaged in local surroundings; other textured surfaces were obtained by placing articles such as rice, lentils, and sesame seeds on a flat surface, and types of ceilings and floors are also included. Four images of each material were acquired, and each image was divided into 40 square patches of \(576 \times 576\) pixels, resulting in 160 unique, unrotated samples per class. The patches were saved as 8-bit grayscale data, all normalized to the same mean gray value and standard deviation.

5.2 UIUC

The UIUC database [17] contains 40 images of each of 25 distinct texture classes, adding up to 1000 uncalibrated, unregistered images. These are grayscale images with a resolution of \(640 \times 480\) pixels. The classes include surfaces whose texture is due to albedo variation (e.g., wood and marble), 3D shape (e.g., gravel and fur), or a mixture of both (e.g., carpet and brick). The dataset has relatively few sample images per class but high intra-class variability, manifested as non-homogeneous textures and unconstrained non-rigid deformations; viewpoint and scale variations are also strongly evident. These characteristics make it a particularly challenging testbed for texture classification.

5.3 KTH TIPS 2b

The database [18] comprises 4 planar samples of each of 11 different materials, namely linen, wood, cork, lettuce leaf, brown bread, white bread, crumpled aluminum foil, wool, corduroy, cotton, and cracker. Each class has samples with variations in scale, pose, and illumination: 3 poses (frontal, rotated \(22.5^{\circ }\) left, and rotated \(22.5^{\circ }\) right), 4 illumination conditions, and 9 scales equally spaced logarithmically over two octaves. At each scale, 12 images are taken, combining the three poses and four illumination conditions, giving a total of \(12 \times 9 = 108\) images for each of the 44 samples.

5.4 DTD

The Describable Textures Dataset (DTD) [19] is organized according to a list of 47 categories inspired by human perception, with 120 images per category, making a total of \(120 \times 47 = 5640\) images. Image sizes range between \(300 \times 300\) and \(640 \times 640\) pixels, and at least 90% of each image's surface represents its category.

5.5 FMD

As its name suggests, the Flickr Material Database (FMD) [20, 21] was collected from the site Flickr.com (under Creative Commons licenses). It contains 100 images in each of 10 distinct categories, pictured to capture the real-world appearance of common materials: fabric, foliage, glass, leather, metal, paper, plastic, stone, water, and wood.

Table 1. Performance observed for various datasets

6 Results and Comparison

As seen in Table 1, we obtained results on par with state-of-the-art models. The accuracy for the KTH-TIPS 2b database was slightly lower than the state of the art [13] because the procured images are heavily preprocessed, with background objects removed; the benefit of capturing spatial relationships between different objects therefore cannot be observed as clearly as in the other datasets. This dataset, however, is not representative of real-world natural images, since real-world images of textures will most likely contain various other objects. On the other datasets, the model produces accuracies on par with state-of-the-art results ([13] for DTD, FMD, and CUReT, [23] for UIUC, and [24] for Kylberg, as seen in Table 2). Overall, our model achieved commensurate results across all datasets, in contrast to other models that show promising results on a few datasets but not on the rest. The lower accuracies on the DTD and FMD datasets are due to the high diversity among images belonging to the same texture class.

Table 2. Performance observed for various datasets against the existing State of the Art models

7 Conclusion

To address the problem of texture classification, we conducted experiments on multiple datasets to investigate the performance of our proposed model, which comprises an Xception network followed by two capsule layers. Among the standard networks used for transfer learning, with proper fine-tuning of the weights, we found that the Xception network has the best performance-to-computation ratio. We achieved admirable overall results on all the datasets using the proposed architecture. We also observed that the capsule network effectively preserves the spatial interdependence between features, thereby mitigating the challenge of data scarcity that is highly prevalent in the texture classification problem.