Unsupervised Rotation Factorization in Restricted Boltzmann Machines

Finding suitable image representations for the task at hand is critical in computer vision. Different approaches extending the original Restricted Boltzmann Machine (RBM) model have recently been proposed to offer rotation-invariant feature learning. In this paper, we present a novel extended RBM that learns rotation-invariant features by explicitly factorizing for rotation nuisance in 2D image inputs within an unsupervised framework. While the goal is to learn invariant features, our model infers an orientation per input image during training, using information related to the reconstruction error. The training process is regularized by a Kullback-Leibler divergence, offering stability and consistency. We used the $\gamma$-score, a measure that calculates the amount of invariance, to mathematically and experimentally demonstrate that our approach indeed learns rotation-invariant features. We show that our method outperforms the current state-of-the-art RBM approaches for rotation-invariant feature learning on three different benchmark datasets, by measuring the performance with the test accuracy of an SVM classifier. Our implementation is available at https://bitbucket.org/tuttoweb/rotinvrbm.


Mario Valerio Giuffrida and Sotirios A. Tsaftaris, Senior Member, IEEE

Index Terms: Machine learning, neural networks, rotation-invariant features, restricted Boltzmann machines.

I. INTRODUCTION
THE unsupervised learning of image representations is an important computer vision task, allowing suitable features to be learned from unlabeled data. Several algorithms have been proposed [1], such as kernel PCA [2], autoencoders [3], and Restricted Boltzmann Machines [4]. When features are learned in an unsupervised manner, it is typically unclear what can be considered a "good" representation [5]. However, it is widely acknowledged that invariant features possess good properties, as they disentangle irrelevant variation from the dataset [6].
There are two potential approaches to learning invariant representations. On one side, we hope that an algorithm will learn to factor out invariance implicitly [7], assuming the presence of sufficient training data. On the other side, invariance can be learned explicitly, where a feature extractor $f$ has some prior knowledge of invariance in the form $f(x) = f(Tx)$ [8], [9]. In this case, the transformation $T$ is modeled during training and the algorithm learns to factorize it out.

Fig. 1. Representation of the proposed rotation-invariant RBM. In this example, we have $S = 4$ rotations, corresponding to the equidistant angles $\Phi = \{\phi_0 = 0^\circ, \phi_1 = 90^\circ, \phi_2 = 180^\circ, \phi_3 = 270^\circ\}$. Each angle is associated with a matrix $W_s$. When an image is provided to the network, the weight matrix minimizing the reconstruction error is chosen, as highlighted in bold red. We depict the unfolded steps of the CD-1 [13].
It comes as no surprise that explicitly encoding prior knowledge should help, instead of trying to learn it (and factor it out) implicitly. Findings in human understanding do actually point in this direction, noting the distinct benefits of explicit learning (such as explicability of actions and interpretation) [10], [11]. Thus, in the context of learning from few data, one must openly consider that shallow neural network architectures that are specifically designed to be invariant to some form of nuisance can be beneficial.
We endow a shallow unsupervised generative model with invariance to rotations, which are a ubiquitous transformation in several computer vision problems [12]. Specifically, we propose an extension of Restricted Boltzmann Machines (RBMs) [4]. RBMs are shallow neural networks characterized by a bipartite graph, whose sides are called the visible (denoted as x) and hidden (denoted as h) layers respectively. In their original formulation, RBMs do not accommodate geometric transformations occurring in an image. The most straightforward way to learn variability in a dataset is to provide the network with a sufficient amount of data. However, training sets may lack variability, resulting in models with poor generalization capabilities. To cope with this, other approaches to regularize the learning process are considered, such as dataset augmentation. The drawback of this approach is that it does not explicitly force the network to learn transformation-invariant features. Therefore, our aim is to extend the original RBM into a model that is capable of learning features invariant to rotations.
In this paper, we present RBMs that explicitly factorize rotations in 2D images in an unsupervised manner. Our architecture, represented in Fig. 1, uses one weight matrix for each dominant orientation of the input images. During training, an input image is passed through all weight matrices. To determine the orientation of the input image, the reconstruction error (per orientation) is computed. The weight matrix that best reconstructs the input is chosen, and then the gradient for that matrix is computed to update the parameters. Furthermore, the contribution of each input is also shared with the other weight matrices, by applying proper rotations to the gradient [14]. This step of sharing the gradients is essential to achieve rotation invariance. The training process is regularized with a KL-divergence term that enforces a prior distribution of the rotations. We measure the performance of our method by training an SVM classifier [14]-[16].
The contributions of this paper are multi-fold:
i. rotation-invariant features: we mathematically prove, using the $\gamma$-score [17], that our model learns rotation-invariant features. This is further shown with quantitative experimental results on several benchmark datasets. The feature space that our method learns is compact (a limited number of hidden units is required);
ii. robust dominant orientation inference: we show that our model can infer robust dominant orientations, rather than relying on exogenous inputs. During training, the use of a regularization term maintains the balance among the predictions of the dominant orientations. Our inference method obtains similar test-time performance with respect to supervised methods;
iii. no augmentation: we show that our approach outperforms baseline and state-of-the-art methods when trained with limited data, without requiring augmentation.
The rest of this paper is structured as follows. Section II discusses related works. In Section III, we offer the theoretical background of the original RBM. Then, Section IV discusses our proposed approach, showing mathematical proofs of its validity. Section V describes the algorithm to infer the dominant rotation of an input image, offering a stability analysis that shows the robustness of our approach. In Section VI we present experimental results on three different datasets, including ablation experiments demonstrating how our approach benefits from the gradient sharing method [14] and the KL-divergence based regularization term. We also experimentally show how consistent our unsupervised dominant orientation inference method is, including a comparison with supervised methods. Finally, Section VII concludes the manuscript.
II. RELATED WORKS
Several approaches have been proposed in recent years to improve the original RBM model to accommodate geometric image transformations. Goodfellow [18] showed that Deep Belief Networks (DBNs), obtained by stacking several RBMs, produce high-level features that carry a certain amount of invariance, as the number of layers increases. The authors of [19] proposed a different way to train DBNs, such that filters were transformed at the end of each training epoch to account for geometric transformations. A more sophisticated DBN model that learns rotation-equivariant features is STEER-DBN [20], where the authors aim to learn steerable filters [21]-[24] to achieve translation and rotation equivariance. Filters are defined as steerable when they can be expressed as a linear combination of (directional) basis filters [25]. Sohn and Lee [16] proposed an RBM model that can learn transformation-invariant features (TI-RBM), using a set of transformations $\mathcal{S}$. The transformations are incorporated during training and the actual image representation is then obtained using max-pooling. The authors of [26] proposed a block RBM, where invariance is achieved by pre-aligning the input patches according to their dominant orientation and scale, computed using the SIFT descriptor [27]. Schmidt and Roth [28] proposed a convolutional RBM that learns rotation-equivariant features [9].
Extensive efforts have also been made to explicitly learn transformation-invariant features using deep networks. The Convolutional RBM (C-RBM) [29] learns shift-invariant features, extending the original RBM to accommodate the convolution. The Spatial Transformer Network aligns input images in a common reference space, allowing end-to-end training [30]. Laptev et al. [31] introduce TI-POOLING, a siamese network that extracts features from transformed versions of the inputs using convolutional layers. Then, the outputs of the sub-networks are merged using a max-pool operation. Marcos et al. [9] proposed the RotEqNet, a convolutional neural network that accounts for rotation-equivariant features. A similar approach producing rotation-invariant features was proposed in [32]. Recently, harmonic networks have been proposed to learn deep translation- and rotation-equivariant features [33]. They replace the convolutional filters with circular harmonics, then max-pooling is applied to obtain the orientation at each location of the receptive field.
Despite the important contributions that deep learning brings to the computer vision community, optimizing such networks often involves learning millions of parameters [34], which is not suitable for all datasets and tasks. As an example, Romera-Paredes and Torr [35] and Ren and Zemel [36] adapt the network architecture to perform image segmentation for each of the datasets they tested. Furthermore, deep networks also have a vast number of hyper-parameters that need to be tuned, such as the number and size of the filters, or the depth (in terms of layers) of the network. The main drawbacks of deep neural networks are essentially two: i) they are prone to overfitting, especially for reduced training sets; and ii) they are computationally expensive [37].

A. Adopted Notation and Conventions
In this paper, we will adopt the following notation. Matrices are written in bold capital letters (e.g., W), while vectors are written in bold lowercase letters (e.g., x). Vectors are always of size $n \times 1$ (column vectors). Scalars are written in italic lowercase letters, using both the Latin and Greek alphabets (e.g., $a$ or $\lambda$), and constants are written in capital italic Latin letters (e.g., $H$). Vector elements (e.g., $x_k$) are considered as scalars. The notation $W_{kj}$ refers to the item located at the $k$-th row and $j$-th column of the matrix W. Then, we will use $W_{k\bullet}$ and $W_{\bullet j}$ to refer to the $k$-th row and $j$-th column respectively. Capital Greek letters or calligraphic capital Latin letters are typically used for sets (e.g., $\mathcal{X}$ or $\Phi$). The corresponding lowercase letter generally denotes an item in such a set (e.g., $x \in \mathcal{X}$ or $\phi \in \Phi$). Lastly, the notation $\mathcal{X}_s$ denotes a (finite) partition of a set $\mathcal{X}$.

B. Restricted Boltzmann Machine
An RBM is a probabilistic shallow neural network that maximizes the following joint probability density function:

$$p(x, h) = \frac{1}{Z} e^{-E(x, h)}, \tag{1}$$

where $E$ is an energy function, taking as input the visible layer $x \in \{0, 1\}^V$ and the hidden layer $h \in \{0, 1\}^H$, and $Z$ is the partition function that ensures that eq. (1) is a probability density function (the integral over all possible values of x and h is 1). The energy term $E$ is defined as follows:

$$E(x, h) = -x^\top W h - c^\top x - b^\top h, \tag{2}$$

where $W \in \mathbb{R}^{V \times H}$ is a weight matrix, and $c \in \mathbb{R}^V$ and $b \in \mathbb{R}^H$ are the bias vectors for the input and hidden layer respectively. The network parameters to be optimized are $\Theta = \{W, b, c\}$. To achieve this, the negative log-likelihood $-\log p(x|\Theta)$ is minimized, using the Contrastive Divergence algorithm [38]. The update rules for the parameters are [39]:

$$\nabla W = x\, h(x)^\top - \tilde{x}\, h(\tilde{x})^\top, \tag{3}$$
$$\nabla c = x - \tilde{x}, \tag{4}$$
$$\nabla b = h(x) - h(\tilde{x}). \tag{5}$$

The function $h(\cdot)$ in eqs. (3) and (5) computes the conditional probability of h given x (this probability is explained in the next paragraph). The first term in the update rules in eqs. (3) to (5) is usually referred to as the positive phase, whereas the second term is called the negative phase [39]. The positive phase is computationally easy to calculate, whereas the negative phase is generally intractable, due to the partition function $Z$ that appears in the computation of the partial derivatives. To overcome this problem, the negative phase is approximated with Gibbs sampling, which produces the reconstructed input $\tilde{x}$.
Gibbs sampling performs alternate inference and sampling of h and x, using the following conditional probabilities:

$$p(h_j = 1 | x) = \sigma\Big(b_j + \sum_{k} W_{kj} x_k\Big), \tag{6}$$
$$p(x_k = 1 | h) = \sigma\Big(c_k + \sum_{j} W_{kj} h_j\Big), \tag{7}$$

where the function $\sigma(\cdot)$ is the logistic activation function, $h_j$ is the $j$-th element in the hidden layer, and $x_k$ is the $k$-th element in x. The conditional probability in eq. (6) is the one used to compute the function $h(\cdot)$ in eqs. (3) and (5), thus $h(x) \equiv p(h|x)$. This formulation of the RBM is typically referred to in the literature as the Bernoulli-Bernoulli RBM (BB-RBM), as eqs. (6) and (7) define the success probability of a Bernoulli distribution.
In general, the update rule for a generic parameter $\theta \in \Theta$ is applied as follows:

$$\theta^{(t+1)} = \theta^{(t)} + \eta \nabla\theta^{(t)}, \tag{8}$$

where $\eta$ is the learning rate, and the superscript $t$ refers to the current iteration number.
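The Contrastive Divergence step described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' Theano implementation: the function name `cd1_step`, the single-sample update, the use of hidden probabilities (rather than samples) in the gradient, and the toy sizes are all choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cd1_step(x, W, b, c, eta, rng):
    """One CD-1 update of a Bernoulli-Bernoulli RBM on a single binary input x."""
    # Positive phase: p(h | x) for the observed input.
    h_prob = sigmoid(b + x @ W)                                   # shape (H,)
    h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
    # Negative phase: one Gibbs step yields the reconstructed input.
    x_prob = sigmoid(c + W @ h_sample)                            # shape (V,)
    x_tilde = (rng.random(x_prob.shape) < x_prob).astype(float)
    h_tilde = sigmoid(b + x_tilde @ W)
    # Gradients: positive phase minus negative phase.
    grad_W = np.outer(x, h_prob) - np.outer(x_tilde, h_tilde)
    grad_c = x - x_tilde
    grad_b = h_prob - h_tilde
    # Plain update with learning rate eta (no momentum or regularizers).
    return W + eta * grad_W, b + eta * grad_b, c + eta * grad_c

rng = np.random.default_rng(0)
V, H = 6, 3                                   # toy sizes, chosen for illustration
W = 0.01 * rng.standard_normal((V, H))
b, c = np.zeros(H), np.zeros(V)
x = rng.integers(0, 2, size=V).astype(float)
W, b, c = cd1_step(x, W, b, c, eta=0.003, rng=rng)
```

In practice the update is averaged over a mini-batch; the single-sample form is kept here only to mirror the per-input notation of the text.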

IV. PROPOSED ROTATION INVARIANT FACTORIZATION
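In our formulation of the rotation-invariant RBM, we assume a set $\mathcal{S}$ of $S$ rotations, where each rotation $R_i \in \mathcal{S}$ is associated with an angle $\phi_i$ in the set of equidistant angles $\Phi$, for all $\phi_i \in \Phi$. Details of how the rotation matrices $R_i$ are generated are in the Supplemental Material Sec. I.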
Given that angles have a periodicity of $2\pi$ radians, it can be proven that $\phi_t \in \Phi$ also for $|t| \geq S$. As an example, assuming $S = 4$, then $\phi_5 = \phi_1$ (the argument holds also for negative indices). In general, $\phi_{i+j} = \phi_{m(i, j)}$, where $m(i, j)$ is the modulo function, which allows for cyclical indexing. The modulo function is defined as $m(i, j) = (i + j) \bmod S$, where $i, j \in \{0, \pm 1, \pm 2, \ldots, \pm(S - 1)\}$. The definition of the modulo function is useful to support the proof of the theorem showing that our approach learns rotation-invariant features.
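The cyclic indexing can be checked with a short Python sketch. The helper name `m` follows the text; passing `S` as an explicit argument and listing the angles in degrees are conveniences assumed here.

```python
def m(i, j, S):
    """Cyclic index (i + j) mod S; Python's % already returns a value in [0, S) for S > 0."""
    return (i + j) % S

S = 4
angles = [360.0 * s / S for s in range(S)]   # equidistant angles in degrees

print(angles[m(5, 0, S)])    # 90.0, the same angle as index 1
print(angles[m(-1, 0, S)])   # 270.0, a negative index wraps around
```

Note that the sign convention of the remainder differs between languages; in C or Java, `%` can return a negative value for negative operands, so the wrap-around would need an explicit correction there.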

A. Revised Energy Function
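Differently from the original RBM formulation [4], we associate a specific weight matrix with each of the rotations in $\mathcal{S}$. Therefore, W can be seen as a third-order tensor of dimension $V \times H \times S$. A revised version of eq. (2), taking into account rotations, takes the form:

$$E(x, h, r) = -\sum_{s=0}^{S-1} r_s\, x^\top W_s h - c^\top x - b^\top h. \tag{9}$$

In this formulation, a new binary vector $r \in \{0, 1\}^S$ is introduced, with one non-zero entry (i.e., only one orientation is active at one time). This constraint can be formalized as:

$$\sum_{s=0}^{S-1} r_s = 1. \tag{10}$$

In addition, we will say that, if $r_s = 1$, then the dominant orientation is $\phi_s$ (how r is inferred is discussed in Section V). Consequently, the conditional probabilities in eqs. (6) and (7) are revised accordingly:

$$p(h_j = 1 | x, r) = \sigma\Big(b_j + \sum_{s} r_s \sum_{k} W_{kjs} x_k\Big), \tag{11}$$
$$p(x_k = 1 | h, r) = \sigma\Big(c_k + \sum_{s} r_s \sum_{j} W_{kjs} h_j\Big). \tag{12}$$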
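As a small illustration of how the one-hot vector r selects a single slice of the tensor W, the following sketch evaluates an energy of the kind described above. It assumes, as a reading of the text rather than a confirmed detail, that the bias vectors b and c are shared across orientations and that only W gains the extra dimension; the function name `energy` and the toy values are hypothetical.

```python
import numpy as np

def energy(x, h, r, W, b, c):
    """Energy of (x, h) for the slice of W selected by the one-hot vector r."""
    if r.sum() != 1 or not np.all((r == 0) | (r == 1)):
        raise ValueError("r must be one-hot: exactly one active orientation")
    s = int(np.argmax(r))                      # index of the active orientation
    # Assumption: biases b and c are shared by all orientations.
    return float(-(x @ W[:, :, s] @ h) - c @ x - b @ h)

rng = np.random.default_rng(0)
V, H, S = 4, 3, 2                              # toy sizes
W = rng.standard_normal((V, H, S))
b, c = rng.standard_normal(H), rng.standard_normal(V)
x = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 1.0])
e0 = energy(x, h, np.array([1, 0]), W, b, c)   # orientation 0 active
e1 = energy(x, h, np.array([0, 1]), W, b, c)   # orientation 1 active
```

With a one-hot r, the rotation-aware model reduces to an ordinary RBM that uses the selected slice, which is why the standard conditional probabilities carry over with W replaced by that slice.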

B. Sharing the Gradients
Optimizing the third-order tensor W via eqs. (11) and (12) has the drawback that inputs with a specific dominant orientation will contribute to updating only the corresponding slice in W. This is equivalent to splitting the training set $\mathcal{X}$ into several non-overlapping partitions $\mathcal{X}_s$ and training a separate RBM for each of them. This will negatively affect the learned features: each "individual" RBM is trained on a rather limited dataset [14]. To overcome this problem, the contribution of the gradient $\nabla W_s$ computed on $\mathcal{X}_s$ can be shared across the other slices in W. Therefore, we apply the gradient sharing step during training as proposed in [14], which is an essential step of our training procedure to guarantee rotation invariance.
For the sake of clarity, let us consider an example with only two rotations, $R_0$ and $R_1$, which account for the $0^\circ$ and $180^\circ$ rotations respectively. Since $\nabla W_0$ and $\nabla W_1$ were computed on different portions of the data, namely $\mathcal{X}_0$ and $\mathcal{X}_1$, we want to transfer the contribution of $\nabla W_1$ to $\nabla W_0$ (and vice versa). To do so, we add a version of $\nabla W_1$ rotated by $-180^\circ$ (we denote such a rotation as $R_{-1}$) to $\nabla W_0$. In this example, we can define the new gradients used to update the parameters of the network as $\nabla^{\bullet} W_0 = \nabla W_0 + R_{-1}(\nabla W_1)$ ($\nabla^{\bullet} W_1$ can be defined similarly). Using this example, the update rules for W can be generalized as follows:

$$\nabla^{\bullet} W_s = \sum_{i=0}^{S-1} R_{s-i}(\nabla W_i). \tag{13}$$

We use the gradients computed in eq. (13) to update the parameters of the network $\theta \in \Theta$ using eq. (8).
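The gradient sharing step can be sketched for the interpolation-free case of $S = 4$ rotations by multiples of $90^\circ$. The sketch below assumes that each column of a gradient slice is a flattened square filter and that rotating a gradient means rotating every filter; the function names, the index convention (rotation index equals target slice minus source slice, matching the two-rotation example), and the counter-clockwise direction of `np.rot90` are assumptions made for illustration.

```python
import numpy as np

def rotate_filters(G, k, side):
    """Rotate each column of G (a flattened side x side filter) by k * 90 degrees."""
    imgs = G.T.reshape(-1, side, side)             # (H, side, side)
    rotated = np.rot90(imgs, k=k, axes=(1, 2))     # rotate every filter at once
    return rotated.reshape(-1, side * side).T      # back to (V, H)

def share_gradients(grads, side):
    """grads: list of 4 arrays of shape (V, H), one per orientation."""
    S = len(grads)
    if S != 4:
        raise ValueError("this sketch only covers S = 4 (multiples of 90 degrees)")
    # Slice s receives every gradient i, rotated by (s - i) steps.
    return [sum(rotate_filters(grads[i], (s - i) % S, side) for i in range(S))
            for s in range(S)]

rng = np.random.default_rng(0)
side, H = 3, 2                                     # toy 3 x 3 filters, 2 hidden units
grads = [rng.standard_normal((side * side, H)) for _ in range(4)]
shared = share_gradients(grads, side)
# Rotating one shared gradient by one step gives the next shared gradient.
print(np.allclose(rotate_filters(shared[0], 1, side), shared[1]))
```

Because rotation is linear and periodic, every shared gradient in this sketch is a rotated copy of the others, which is the property the following subsection relies on.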

C. Rotational Equivalence
Following eq. (13), using the periodicity of rotations as discussed in Section IV, the following holds:

$$R_{-1}(\nabla^{\bullet} W_0) = R_{-1}(\nabla W_0) + R_{-2}(\nabla W_1).$$

Given that $m(-1, 0) = 1$ and $m(-1, -1) = 0$, the above relation becomes:

$$R_{-1}(\nabla^{\bullet} W_0) = R_1(\nabla W_0) + R_0(\nabla W_1) = \nabla^{\bullet} W_1,$$

due to the rotational periodicity (e.g., rotating by $\pm 180^\circ$ produces the same result). This example with $S = 2$ shows that all gradients of the form $\nabla^{\bullet} W_s$ are rotated versions of each other. We can generalize this property for eq. (13) as follows:

$$R_r(\nabla^{\bullet} W_s) = \nabla^{\bullet} W_{m(s, r)}. \tag{14}$$

In order to facilitate the proof of the theorem stating that our approach learns rotation-invariant features, we need the following lemma that makes use of this rotational equivalence.

Lemma 1: If the tensor $W \in \mathbb{R}^{V \times H \times S}$ is optimized as described above for $t > 0$ iterations, then $W^{(t)}_{s'} = R_\kappa(W^{(t)}_s)$, with $\kappa$ given by eq. (15).

Proof is provided in Appendix A. This lemma states that all the slices in the tensor W are rotated versions of each other.

D. Measuring the Invariance
We adopt the $\gamma$-score proposed in [17] to measure invariance. Considering a set of transformations $\mathcal{S}$ and a dataset $\mathcal{X}$, the mean activation of the $j$-th hidden unit $h_j$ over all the transformations $T \in \mathcal{S}$ is computed as in eq. (16), where $h_j(x) \equiv p(h_j = 1 | x, r)$. It is important to note that r is a function of x, hence when the transformation $T(x)$ is applied, the vector r has to be recomputed accordingly. The $\gamma$-score is then defined as the ratio in eq. (17). We employed the $\gamma$-score because it is bounded to the interval $[0, 1]$, where values close to 1 indicate features invariant to the set of transformations $\mathcal{S}$. The $\gamma$-score is closely related to the auto-correlation [17] and does not require extra parameters to be computed, such as the firing threshold in [18]. Although false positives (e.g., the ratio in eq. (17) is 1 but full invariance is not achieved) might occur (as reported in [17]), we will show in the next section that such cases do not arise in our rotation-invariant RBM formulation.

E. Proving Rotation Invariance of the Proposed Method
In this section, we will prove that our model can learn rotation-invariant features on the basis of the $\gamma$-score (cf. Section IV-D). Our theorem is based on the hypothesis that the model is trained using the revised energy model and the further adaptations shown in Section IV. The proof shows that the $\gamma$-score reaches the highest value (the numerator and denominator in eq. (17) are equal).
Theorem 1: Under the hypotheses of Lemma 1 and given a support set $\mathcal{S}$ of $S$ rotations, $\gamma = 1$ for our revised rotation-invariant RBM model.
Proof is provided in Appendix B. The proof of this theorem shows that our method achieves full invariance w.r.t. the $\gamma$-score. We can make the following remarks about the theorem:
Remark 1: Theorem 1 does not make any assumptions about how the r vector is computed, as long as eq. (10) holds.
Remark 2: Theorem 1 is a theoretical result and does not account for artifacts due to the discrete nature of inputs and rotations. Empirical computations of the $\gamma$-score might result in slightly lower values.
Remark 3: Lemma 1 is compatible with any typical additional terms that can be added in eq. (8), such as momentum and an $L_2$ regularizer [40].
Remark 4: Optimizing eq. (9) as in Section IV, the negative log-likelihood $-\ln p(x|\Theta)$ is minimized as well.

V. INFERENCE OF THE DOMINANT ORIENTATION
In this section, we describe how to infer the optimal r vector for an input x in the dataset. We propose an approach that exploits the intrinsic information learned by the network during training, using the reconstruction error. In our formulation, we can define the reconstruction function $\varphi(x, r)$ as $\varphi(x, r) = v(h(x, r), r)$, where $h(x, r) = p(h|x, r)$ and $v(h, r) = p(x|h, r)$. We define the dominant orientation for an input x as the one that minimizes the following function:

$$\hat{s} = \arg\min_{s} \| x - \varphi(x, r) \|_2^2, \quad \text{such that } r_t = 0 \text{ and } r_s = 1,\ s \neq t. \tag{18}$$

Thus, the corresponding r for the input x is $r_{\hat{s}} = 1$ and $r_t = 0$, $t \neq \hat{s}$. This satisfies the one-hot encoding constraint in eq. (10). The optimization of eq. (18) can be easily computed for all the possible values of r, as it comes automatically during the forward pass of the training process.
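The inference of the dominant orientation amounts to trying every slice of W and keeping the one with the lowest reconstruction error. The sketch below is one way to write this in NumPy; it assumes a mean-field reconstruction (probabilities instead of sampled binary units), shared biases, and a hypothetical function name `infer_orientation`, none of which is confirmed by the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def infer_orientation(x, W, b, c):
    """Return (s_hat, r, errors): the slice of W with the lowest reconstruction error."""
    S = W.shape[2]
    errors = np.empty(S)
    for s in range(S):
        h = sigmoid(b + x @ W[:, :, s])         # p(h | x, r) with r_s = 1
        x_rec = sigmoid(c + W[:, :, s] @ h)     # p(x | h, r), mean-field reconstruction
        errors[s] = np.sum((x - x_rec) ** 2)    # squared L2 reconstruction error
    s_hat = int(np.argmin(errors))
    r = np.zeros(S)
    r[s_hat] = 1.0                              # one-hot encoding of the orientation
    return s_hat, r, errors

rng = np.random.default_rng(0)
V, H, S = 16, 5, 4                              # toy sizes
W = rng.standard_normal((V, H, S))
b, c = np.zeros(H), np.zeros(V)
x = rng.integers(0, 2, size=V).astype(float)
s_hat, r, errors = infer_orientation(x, W, b, c)
```

The same routine can serve at test time: the hidden activations computed with the selected slice are the features passed to the classifier.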

A. Implementation Details
Training: Our training algorithm is an extended version of Contrastive Divergence [38]. In our implementation, we represent the third-order tensor W explicitly, although all the slices are rotated versions of each other. As discussed in the previous section, the core part of our architecture is the inference of the dominant orientation. For each mini-batch $B$, we compute the reconstructed input using all weight matrices in W. For each image in $B$, the dominant orientation is inferred by selecting the weight matrix that best reconstructs the input (cf. Fig. 1). Then, the gradients to update the parameters are computed and the contribution of all the $\nabla^{\bullet} W_s$ is shared with the other weight matrices. Other gradients coming from, e.g., the sparsity regularizer [40], the KL-divergence in eq. (19), or momentum are also used to update the parameters of the network. Details are shown in Algorithm 1. Our Theano [41] implementation takes ∼0.8 s per batch for training on CPU (Intel Xeon E5-1660), and ∼0.4 s per batch on GPU (TITAN Xp).
Testing: During inference, an input image is provided to the network, and is passed through to obtain activations using all the weight matrices (one per orientation). The hidden layer activations produced by the weight matrix minimizing the reconstruction error are selected. The chosen activations represent the features of the input image (cf. Fig. 1).

B. KL-Divergence Regularization Term to Improve Dominant Orientation Inference
The rotation estimation approach may potentially assign most inputs to one dominant orientation. To avoid this, we opted to regularize the training process, by forcing a prior on the distribution of orientations across the dataset. We achieve this by minimizing the KL-divergence term in eq. (19), where $p$ is a prior distribution, $\bar{r}$ is the average assignment of the dominant orientation of the images in the training set, as discussed in Section V, and $\lambda_r$ is a positive constant weighing the strength of the regularizer. Following [42], we compute the average prediction vector $\bar{r}$ over a mini-batch, rather than the whole training set.

Algorithm 1. Training procedure of our proposed rotation-invariant Restricted Boltzmann Machine.
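To illustrate the role of the regularizer, the following sketch evaluates a KL-divergence penalty between a prior and the mean one-hot assignment of a mini-batch. The direction of the divergence (prior first), the smoothing constant `eps`, and the function name `kl_regularizer` are assumptions made here; the sketch only computes the value of the penalty and does not show how it enters the parameter updates.

```python
import numpy as np

def kl_regularizer(r_batch, prior, lam, eps=1e-8):
    """lam * KL(prior || r_bar), with r_bar the mean one-hot assignment of a mini-batch."""
    r_bar = r_batch.mean(axis=0)                 # fraction of inputs per orientation
    # eps avoids log(0) when an orientation receives no inputs (assumption).
    return lam * np.sum(prior * (np.log(prior + eps) - np.log(r_bar + eps)))

S = 4
prior = np.full(S, 1.0 / S)                      # uniform prior over orientations
balanced = np.eye(S)[[0, 1, 2, 3] * 5]           # 20 inputs, evenly assigned
collapsed = np.eye(S)[[0] * 20]                  # all 20 inputs assigned to orientation 0

penalty_balanced = kl_regularizer(balanced, prior, lam=100.0)
penalty_collapsed = kl_regularizer(collapsed, prior, lam=100.0)
```

A balanced mini-batch yields a penalty of zero, while a mini-batch whose inputs all collapse onto one orientation is penalized heavily, which is the behaviour the regularizer is meant to discourage.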

C. Consistency Analysis
In this section, we want to assess the consistency of the predictions performed by our approach to infer the dominant orientations, computing what we define as the change probability.
During training, we tracked the predictions made by our algorithm to infer the dominant orientation of each image in the training set. Then, we analyzed how many times each image has been assigned to a dominant orientation over time. We computed the probability at each epoch that an assignment change occurs and we plotted the result in Fig. 2. It can be observed that our inference method stabilizes in fewer than 10 epochs, becoming very consistent in ≈ 40 epochs (the probability of a reassignment is very close to 0). Furthermore, Fig. 2 contains two inset transition matrices, where we show how assignments are redistributed between two consecutive epochs. Specifically, the left-hand side transition matrix shows reassignments occurring between the first two epochs, whereas the right-hand side inset shows the changes occurring in the last two epochs. These two plots show that, although the initial predictions of the dominant orientation are unstable, the network is able to automatically assign each image to the same class of orientation.
In Supplemental Material Sec. II, we show a consistency analysis of the orientation estimation. In particular, we show the robustness of our approach to estimation errors during training, and the consistency of its predictions at test time. Experimental results show that our algorithm can 'self-correct' estimation errors that occurred during training and we also show high consistency during testing.
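The change probability and the transition matrices discussed in this section can be computed from the orientation assignments recorded at two consecutive epochs. The sketch below shows one possible bookkeeping, with hypothetical function names and toy assignments; it assumes the change probability is the fraction of training images whose assignment differs between the two epochs.

```python
import numpy as np

def change_probability(prev, curr):
    """Fraction of images whose inferred orientation differs between two epochs."""
    return float(np.mean(prev != curr))

def transition_matrix(prev, curr, S):
    """T[a, b] counts images assigned to orientation a at one epoch and b at the next."""
    T = np.zeros((S, S), dtype=int)
    np.add.at(T, (prev, curr), 1)        # unbuffered add handles repeated index pairs
    return T

prev = np.array([0, 1, 2, 3, 0, 1])      # toy assignments at epoch t
curr = np.array([0, 1, 2, 0, 0, 2])      # toy assignments at epoch t + 1
prob = change_probability(prev, curr)    # 2 of the 6 assignments changed
T = transition_matrix(prev, curr, S=4)
```

A transition matrix that is close to diagonal corresponds to a change probability close to zero, which is the stable regime reported for the later epochs.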

VI. EXPERIMENTAL RESULTS
We demonstrate our model on the following datasets: mnist-rot [8], the MPEG-7 Shape Silhouette database [43], and a rotated version of the zalando fashion-mnist dataset [44]. We compared our performance with the following baseline and state-of-the-art approaches:
• Support Vector Machine [15]: SVM trained directly on the data, without preprocessing.
• RBM [4]: the original model will provide a baseline result for our experiments.
• RBM [4] with data augmentation: we also compare with the original model trained with augmented data (we will refer to this method as RBM+).
• TI-RBM [16]: state-of-the-art method for learning transformation-invariant features. Specifically, we only used rotations as transformations.
• ERI-RBM [14]: state-of-the-art approach that computes rotation-invariant features, using a histogram-of-gradients approach to split the dataset according to the dominant orientation of the inputs.

A. Parameters
If not otherwise stated, we ran all the experiments using the following parameters, which were set the same for all methods. We trained RBMs with $H = 500$ hidden units for 100 epochs, using a learning rate $\eta = 0.003$. We also adopted a sparsity regularizer target $p = 0.1$ with regularization constant $\lambda = 0.003$ [40]. Furthermore, for the regularizer in eq. (19), we set the constant $\lambda_r = 100$ and the prior probability distribution $p = \mathcal{U}(0, S - 1)$. We used a momentum $\alpha = 0.5$ for the first 5 epochs, then we increased it to $\alpha = 0.9$. If it is not explicitly specified, the number of rotations is set to $S = 4$, as angles that are multiples of $90^\circ$ avoid pixel interpolation (this allows fair comparison with the other baseline and state-of-the-art approaches, but we show later the effect of changing the number of rotations $S$). We initialized the weight matrices using the Glorot method [45], using random numbers sampled from a Gaussian distribution with zero mean and standard deviation of $\sqrt{2/(V + H)}$, whereas bias terms are initialized with 0s. For classification, we followed the same protocol as in [14], adopting an SVM [15] with an RBF kernel. For the classifier, we set the spread parameter $\sigma = 0.002$. The loss parameter $C$ is set accordingly for each dataset. Experiments were repeated 5 times, with different initializations, and the mean and standard deviation were computed.
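The initialization and momentum schedule described above can be written down as a short NumPy sketch. The visible size `V = 28 * 28` is an assumption based on the image size used in the experiments, epochs are assumed to be counted from zero, and only a single weight matrix is initialized because the text does not state how the individual slices of the tensor are seeded.

```python
import numpy as np

V, H = 28 * 28, 500                   # V assumed from 28 x 28 inputs; H from the text
eta = 0.003                           # learning rate from the text
rng = np.random.default_rng(0)

# Glorot-style Gaussian initialization: zero mean, std sqrt(2 / (V + H)).
std = np.sqrt(2.0 / (V + H))
W = rng.normal(loc=0.0, scale=std, size=(V, H))
b = np.zeros(H)                       # hidden biases start at zero
c = np.zeros(V)                       # visible biases start at zero

def momentum(epoch):
    """Momentum 0.5 for the first 5 epochs (0 to 4), then 0.9."""
    return 0.5 if epoch < 5 else 0.9

schedule = [momentum(e) for e in range(100)]
```

Keeping the schedule as a plain function makes it straightforward to reuse the same settings across the compared RBM variants, as the text states that the parameters were shared by all methods.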

B. Experiments
Tests on Mnist-Rot [8]: This well-known benchmark dataset of hand-written digits contains 10,000 images for training and 50,000 for testing. For training, we adopted the parameters discussed in Section VI-A, setting $C = 10$ for the classifier.
Table I shows the results of our experiments. Overall, TI-RBM and ERI-RBM perform similarly on this setup, outperforming the baseline by ≈ 10%. Our proposed method obtains the best performance, achieving more than 92% test accuracy. This result shows that our method of inferring rotations is more reliable than, e.g., max-pooling across all rotations [16], or relying on exogenous methods [14]. Then, we empirically computed the $\gamma$-score as described in eq. (17) and our method scored $\gamma = 0.98$, as expected from Theorem 1. In Fig. 3, we show a subset of the filters learned by our method and, as can be observed, filters are rotated versions of each other, providing experimental evidence for Lemma 1. As mentioned in Section I, our method learns compact representations with high discriminative power. We also compared with the Contractive Autoencoder on the same dataset [46]. This method minimizes a regularization term based on the Jacobian matrix of the encoder step of the network to learn invariant features. Their results with 1,000 hidden units showed a lower test accuracy of 90.34%. In comparison, our method learns highly discriminative features with half of the hidden units, thus learning a more compact representation.
Tests on a Small Training Set: Here, we want to demonstrate that our method learns robust features even when trained on a small dataset. We used the MPEG-7 Shape Silhouette database [43], containing only 1,400 images belonging to 70 categories. Since the images have a variable size, we cropped and resized them to 28 × 28 pixels. We randomly split the dataset into 700 images for training and 700 for testing, maintaining class balance. In this case, we set the loss parameter for the SVM to $C = 100$.
Results on this dataset are also reported in Table I. Our method outperforms all other approaches, reducing the testing error by ≈ 10%. Specifically, we can observe that TI-RBM [16] and ERI-RBM [14] suffer from the lack of data in the training set, obtaining a testing accuracy lower than the SVM and RBM baselines. On the other hand, RBM is not able to accommodate the rotational variance in this dataset, causing it to perform poorly compared with our approach. Therefore, our method can learn better rotation-invariant features also in the case of a reduced training set.
Tests on the Rotated Zalando Fashion Mnist Dataset: We also tested our approach on a customized version of the zalando fashion mnist dataset [44]. Specifically, the original dataset contains images of 10 categories of clothes. Images are grayscale and 28 × 28 pixels in size. For these experiments, we generated a rotated version of the dataset, using uniformly distributed random rotations. To create this customized dataset, we adopted the original code from [8] used to generate mnist-rot. We generated 10,000 images for training and 50,000 for testing [8]. To the best of our knowledge, an equivalent of mnist-rot for the zalando fashion mnist has not been created yet. We refer to such a generated dataset as zalando fashion mnist-rot. Results on this version of the zalando dataset are also shown in Table I. Overall, we can observe that our method outperforms all the other approaches on this dataset as well.

Fig. 4. Comparison of our approach and TI-RBM [16] with respect to the number of rotations $S$ on the mnist-rot test dataset [8]. Overall, TI-RBM requires twice the number of rotations ($S = 8$) to match our performance.
The Effect of Increasing the Number of Rotations S: Here, we assess the effect of increasing the number $S$ of rotations during training. Comparing with TI-RBM [16], we determine the minimum number of rotations $S$ required by TI-RBM to match our performance. In Fig. 4, we report the classification accuracy when the methods are trained on the mnist-rot and zalando fashion mnist-rot datasets. Overall, TI-RBM requires $S = 8$ rotations to match our best performance. Our method improves when $S = 8$ is used, although the increase in performance is minimal. Thus, our method can learn highly discriminative rotation-invariant features with $S = 4$.
Discussion: From our experiments, it appears that it is better to rely on the intrinsic information encoded in the network to infer the dominant orientation. In this way, our model explicitly cancels out the nuisance given by rotations, producing fully rotation-invariant image representations with a reduced number of rotations. We showed that our approach is better than marginalizing across all possible rotations, as happens in TI-RBM [16], or using an exogenous process to estimate orientations, as in ERI-RBM [14]. Our method is also better than the standard approach of training a neural network with data augmentation (RBM+ in Table I). This shows that data augmentation does not automatically ensure the learning of invariant features, as the network still needs to learn the variability in the dataset [31]. In addition, we showed that our method works particularly well on small datasets. To further demonstrate this, we compared our method with TI-POOLING [31], a recent deep learning method for transformation-invariant feature learning. Most importantly, and differently from all the methods described in Table I, TI-POOLING is a deep network, trained end-to-end with supervision.
We trained TI-POOLING with 4 rotations and we set the size of the last fully connected layers to 500 to maintain a setup similar to the unsupervised methods described in Section VI. Although TI-POOLING outperformed our method by ∼ 5% on the mnist-rot and zalando-rot datasets, its classification accuracy on the MPEG-7 dataset was 78.11% ± 1.28 (best result was 79.57%), compared to 85.71% for ours. This indicates that deep architectures require large datasets to train their network parameters effectively.

C. Ablation Experiments
We want to assess how our approach benefits from the gradient sharing step [14] and the KL-Divergence based regularizer described in Section V-B. Experiments were performed using the same protocol as discussed in the previous sections. We show the results of our experiments in Table II.
To establish a reference baseline, we trained the original RBM model [4]. Next, we trained our model disabling gradient sharing and eq. (19). In this case, our model has lower performance compared to the baseline. Training our network without shared gradients is similar to training S different RBMs, such that the s-th model is trained with only the X_s partition of the data. This means that the training set X is split and each RBM is trained independently on a smaller portion of the data. This procedure is closely related to the Oriented RBM baseline method described in [14]. By enabling the shared gradients, the performance of our approach improves by 20% (even ≈ 40% on mnist-rot), showing that this technique effectively improves the training. The gradient sharing step also ensures the learning of rotation-invariant features (see Theorem 1), thanks to the rotation equivalence property in eq. (14). When the regularizer in Section V-B is also enabled during training, performance improves further, as the inference of the dominant orientation becomes more robust and reliable.
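The data-splitting argument above can be illustrated with a minimal sketch. It is not taken from the authors' implementation: the function name `partition_by_orientation` and the toy arrays are hypothetical, and the orientation indices are assumed to be already available for each sample.

```python
import numpy as np

def partition_by_orientation(X, orientations, S):
    """Split X (N x D) into S partitions X_0, ..., X_{S-1}.

    Partition s holds the samples whose orientation index equals s; without
    gradient sharing, the s-th weight matrix would only see this subset.
    """
    orientations = np.asarray(orientations)
    return [X[orientations == s] for s in range(S)]

# Toy example: 6 samples, S = 4 orientation indices (hypothetical values).
X = np.arange(12, dtype=float).reshape(6, 2)
orientations = [0, 1, 0, 3, 1, 0]
partitions = partition_by_orientation(X, orientations, S=4)
sizes = [len(p) for p in partitions]  # number of samples seen by each model
```

With roughly uniform orientations, each partition holds about N/S samples, which is consistent with the lower performance reported when shared gradients are disabled.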

D. Unsupervised vs. Supervised Rotation Inference
In this experiment, we want to assess the performance of our unsupervised method to infer the dominant orientation (cf. Section V). It is important to point out that estimating the actual orientation correctly is not the purpose of our method. What is indeed important is the consistency of the predictions during training. As a first test, we compared the performance of our algorithm with an SVM classifier in the dominant orientation prediction task, using the ground truth rotations (cf. Section VI-B), setting C = 1 and γ = 0.02 as parameters. As expected, the supervised SVM classifier outperformed our unsupervised approach. Specifically, on the zalando mnist-rot dataset, our method predicts the correct dominant orientation with an accuracy of 60% vs. 92% obtained by the SVM.
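For intuition, the unsupervised inference can be sketched as choosing, among the S weight matrices, the one with the lowest reconstruction error, and then scoring the predictions against ground truth orientation indices. This is a minimal sketch under stated assumptions, not the authors' implementation: the one-step mean-field reconstruction, the squared-error measure, the biases shared across orientations, and all function names are illustrative choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_errors(v, weights, b_vis, b_hid):
    """Squared reconstruction error of v under each weight matrix W_s.

    v: (D,) visible vector; weights: (S, H, D); b_vis: (D,); b_hid: (H,).
    Assumption: one deterministic up-down pass stands in for a CD-1 step.
    """
    errors = []
    for W in weights:
        h = sigmoid(W @ v + b_hid)        # hidden activation probabilities
        v_rec = sigmoid(W.T @ h + b_vis)  # reconstructed visible units
        errors.append(np.sum((v - v_rec) ** 2))
    return np.array(errors)

def infer_orientation(v, weights, b_vis, b_hid):
    """Index s of the weight matrix with minimal reconstruction error."""
    return int(np.argmin(reconstruction_errors(v, weights, b_vis, b_hid)))

def orientation_accuracy(predicted, ground_truth):
    """Fraction of samples whose predicted orientation index is correct."""
    predicted = np.asarray(predicted)
    ground_truth = np.asarray(ground_truth)
    return float(np.mean(predicted == ground_truth))

# Toy usage with random parameters (hypothetical sizes: S=4, H=5, D=9).
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 5, 9))
b_vis, b_hid = np.zeros(9), np.zeros(5)
v = rng.random(9)
s_hat = infer_orientation(v, weights, b_vis, b_hid)
```

Because the selection depends only on the model's own reconstructions, no rotation labels are needed at training time; labels enter only through `orientation_accuracy` when evaluating the predictions.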
Given this result, we trained our rotation-invariant RBM using a supervised SVM classifier loss: we replaced our inference process with the SVM classifier to infer the dominant orientation during training. In Table III, we show the result of this test. Although using a classifier loss minimally improves performance on mnist-rot, the performance of the unsupervised and supervised approaches is the same on zalando mnist-rot. This shows that it is important to make a consistent decision in predicting the dominant orientation during training.
As a further test, we trained our rotation-invariant RBM using the actual ground truth rotations, without employing any (un)supervised processes. This test establishes an upper bound on the performance of our rotation-invariant RBM. Test accuracy is reported in Table III. Overall, the performance of our unsupervised approach is close to the upper bound computed using ground truth rotations. Although our unsupervised method may make errors in predicting the actual rotation, its performance is comparable to both the supervised and the upper bound performance. Overall, using the actual ground truth rotations improves the test accuracy by ∼ 2%.

VII. CONCLUSION
Finding suitable features for the task at hand is considered hard in computer vision. We presented a novel method to extract rotation-invariant features, extending the original model for Restricted Boltzmann Machines (RBMs). The core part of our method is the inference of the dominant orientation of the input, which is done by minimizing the reconstruction error. In order to have a more robust inference of such a dominant orientation, we regularize the learning process with a term derived from the KL-Divergence. This regularizer enforces a prior distribution over the dominant orientation in the dataset.
We evaluated our method on three publicly available datasets and it outperformed baseline and state-of-the-art approaches in all cases. Our approach scored γ = 0.98, demonstrating full rotation invariance. Furthermore, we also showed that our method can learn highly discriminative features in the case of a reduced training set. Our ablation experiments showed that our method benefits from the gradient sharing method [14], as well as the regularizer based on the KL-Divergence. This allows our method to compete with supervised methods, such as an SVM trained to infer dominant orientations. This was further demonstrated by a consistency analysis, where rotated images obtain the correct dominant orientation prediction w.r.t. the unrotated version. We also showed that the training of the network is not affected by errors in the estimation of the dominant orientation. We showed that our approach outperforms the other shallow state-of-the-art methods, although it is still challenged by the size of the input image. A future research direction is to embed our method within a deep network, where a fully-connected layer can be extended with a third-order tensor to accommodate a set of rotations S to extract rotation-invariant features.
In conclusion, our proposed method explicitly factorizes rotational nuisance from the training set, learning highly discriminative and compact features. In fact, experimental evidence showed that explicit unsupervised feature learning performs better than the others (e.g., [14], [16]). Our Python implementation is available at https://bitbucket.org/tuttoweb/rotinvrbm.

Fig. 1. Representation of the proposed rotation-invariant RBM. In this example, we have S = 4 rotations, corresponding to the equidistant angles φ_0 = 0°, φ_1 = 90°, φ_2 = 180°, φ_3 = 270°. Each angle is associated with a matrix W_s. When an image is provided to the network, the weight matrix minimizing the reconstruction error is chosen, as highlighted in bold red. We depict the unfolded steps of CD-1 [13].

Fig. 2. Probability (in log scale) that the predicted dominant orientation changes during training on the mnist-rot dataset (change probability). The stability of our inference method increases exponentially over time. We also show two inset transition matrices: the left-hand side inset shows the amount of misclassification between the first two epochs, whereas the right-hand side inset shows it for the last two epochs.
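The two quantities named in the Fig. 2 caption can be computed from the orientation predictions of two consecutive epochs, as in the following minimal sketch. The function names and toy predictions are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def change_probability(prev_pred, curr_pred):
    """Fraction of samples whose predicted orientation index changed."""
    prev_pred = np.asarray(prev_pred)
    curr_pred = np.asarray(curr_pred)
    return float(np.mean(prev_pred != curr_pred))

def transition_matrix(prev_pred, curr_pred, S):
    """S x S count matrix: entry (i, j) counts samples moving from i to j."""
    T = np.zeros((S, S), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat.
    np.add.at(T, (np.asarray(prev_pred), np.asarray(curr_pred)), 1)
    return T

# Toy predictions for 5 samples at two consecutive epochs (S = 4).
prev_pred = [0, 1, 2, 3, 0]
curr_pred = [0, 1, 3, 3, 1]
p_change = change_probability(prev_pred, curr_pred)
T = transition_matrix(prev_pred, curr_pred, S=4)
```

Off-diagonal mass in `T` corresponds to changed predictions, so a matrix that becomes diagonal over the epochs indicates a stable inference.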

Fig. 3. Filters learned by our model IRI-RBM for the mnist-rot dataset [8]. For brevity, we display a set of 5 filters for each of the S = 4 weight matrices.

TABLE II
ABLATION RESULTS SHOWING THAT OUR METHOD BENEFITS FROM SHARED GRADIENTS AND THE REGULARIZER PRESENTED IN SECTION V-B

TABLE III
COMPARISON OF OUR UNSUPERVISED DOMINANT ORIENTATION INFERENCE METHOD WITH TWO SUPERVISED METHODS. 2nd Row: THE INFERENCE METHOD IN SECTION V WAS REPLACED WITH AN SVM TRAINED DIRECTLY ON GROUND TRUTH ROTATIONS. 3rd Row: RBM IS TRAINED ON GROUND TRUTH ROTATIONS