Using ANNs for Feature Learning and Dimensionality Reduction

The following article describes work in progress. It might be published in a more complete form at a conference or journal. If you want to use this technique, please quote the web article.

Update (12/13/2013): The work on distance based ANN training has been published this December at ICMLA 2013. You can find the extended and refined paper here.


This work is inspired by the current need of hand-designed features for all computer vision tasks. While the 'perfect' feature can solve every machine learning task it can hence be expected to be equally hard to find as the solution to the original task. However, it might still be desirable to find at least 'helpful' features automatically to ease the learning process for a successively applied machine learning algorithm.

A 'helpful' feature space transformation for a classification task (for the sake of simplicity, this first version the article focuses solely on this problem setting) is defined by the following three properties:

  1. It maps input vectors with similar classification labels on as similar outputs as possible.
  2. The distance between mapping values of input vectors with different labels is as large as possible.
  3. Noise in the input vector is ignored that is irrelevant for the label assignment.

These properties are completed with the usual requirement for generalization. Note that in these requirements only distances play a role, but no requirements are given to map inputs on specific values: (1. and 2.) similarity between outputs of the feature space transformation can be measured using the euclidean distance. Noise translates to higher differences in the original space that can be reduced in the feature space.

It is possible to formulate the error function of an Artificial Neural Network for regression such, that it only uses the desired distance between points, instead of using target values. Since in the above requirements towards a good feature only distances play a role, such a network is the appropriate means to automatically learn a useful feature space transformation as described before.

The benefits of using a distance based error function instead of a target value distance function are twofold:

  • The amount of network output nodes does not have to be equal to the dimension of the network annotations, i.e. it can learn a feature space transformation into a space with an arbitrary number of dimensions, independent of annotations and input space.
  • Many variants of annotation are possible from the class label over PCA data to the original distances. The first case leads to a feature learner, fulfilling the aforementioned requirements. The latter case leads to the network learning a distance preserving transformation into a lower-dimensional space, similar to MDS (multi-dimensional scaling) but with the advantage that 'new' points not included in the dataset at training time can be projected to the lower-dimensional space later.

This short web article will not cover a complete review of related work. It is however interesting to note that to the authors knowledge no one used a similar approach to feature learning yet. The only mentioning in literature of ANNs with a distance based error function is in the work "A similarity-based neural network for facial expression analysis" by Suzuki et al. (see here). However, that work focuses on the use of the distance based error function in the specific context of facial expression analysis, and further features of the network are not investigated.

Distance based ANN training

The basic rules for the distance based neural network training can be derived by applying gradient descent to an appropriate error function. The error function should capture the difference of differences between two examples in

  • the resulting values by forward-propagation through the network and
  • a desired difference for the two examples.

In general, this can be expressed by specifying two distance functions, one for the outputs of the ANN $\delta_{o}\left(x_{1},x_{2}\right)$ and one for the desired difference of the examples $\delta_{d}\left(x_{1},x_{2}\right)$.

Edge weight updates

Let the forward propagation operation of the network be denoted by $y$. With these definitions, the error for two examples is defined as

\begin{eqnarray*} \mbox{E}\left(x_{1},x_{2}\right) & = & \frac{1}{2}\left\Vert \delta_{o}\left(y\left(x_{1}\right),y\left(x_{2}\right)\right)-\delta_{d}\left(x_{1},x_{2}\right)\right\Vert ^{2}\ & = & \frac{1}{2}\left(\delta_{o}\left(y\left(x_{1}\right),y\left(x_{2}\right)\right)-\delta_{d}\left(x_{1},x_{2}\right)\right)^{2} \end{eqnarray*}

In this article, a fully-connected two-layer network layout is assumed (i.e. one input layer, a non-linear hidden layer with a tanh activation function and a linear output layer), since this layout is sufficient to learn arbitrary continuous functions. The difference function for the network output is chosen as $$ \delta_{o}\left(x_{1},x_{2}\right)=\sum_{r=1}^{O}\left(y\left(x_{1}\right)_{r}-y\left(x_{2}\right)_{r}\right)^{2}, $$

$\newcommand{\leftexp}[2]{{\vphantom{#2}}^{#1}{#2}}$ with $O$ being the total number of output nodes. For a network edge weight $\leftexp2w_{n}^{m}$ in the second layer, connecting hidden node $m$ with output node $n$, the gradient is defined as \begin{eqnarray*} \frac{\partial E\left(x_{1},x_{2}\right)}{\partial\leftexp2w_{n}^{m}} & = &\underset{\delta_{t}\left(x_{1},x_{2}\right)}{\underbrace{\left(\delta_{o}\left(y\left(x_{1}\right),y\left(x_{2}\right)\right)-\delta_{d}\left(x_{1},x_{2}\right)\right)}}\cdot\ & & \left(y\left(x_{1}\right)_{n}-y\left(x_{2}\right)_{n}\right)\cdot\left(i\left(x_{1}\right)_{m}-i\left(x_{2}\right)_{m}\right), \end{eqnarray*} with $i\left(x\right)$ denoting the result vector of the hidden layer for the network input $x$. Similarly, the gradient for an edge weight $\leftexp1w_{n}^{m}$ in the first layer is given by \begin{eqnarray*} \frac{\partial E\left(x_{1},x_{2}\right)}{\partial\leftexp1w_{n}^{m}} & = & \delta_{t}\left(x_{1},x_{2}\right) \left(\sum_{r=1}^{O}\left(y\left(x_{1}\right)_{r}-y\left(x_{2}\right)_{r}\right)\cdot\leftexp2w_{r}^{n}\right)\cdot\ & & \left(i'\left(x_{1}\right)_{n}\cdot x_{1_{m}}-i'\left(x_{2}\right)_{n}\cdot x_{2_{m}}\right) \end{eqnarray*} with $i'\left(x\right)$ denoting the derivation of the activation function for input $x$. With the definition of the gradient, gradient descent can be used to adjust network connection weights. For our network, we use an automatic step-size estimation method described by Paweł Wawrzyński in ''Fixed point method of step-size estimation for on-line neural network training'' (see here), leading to results superior to manual step size tuning.


The following examples shows a series of sample points and learning results in 3D space. The color represents the annotation. The network learns a mapping to a new 3D representation in which points with similar color are close, while points with different colors are further away. Click on the images to see additional information.

Point example Point example Point example

The distance based training leads to impressive learning results. The network has the advantage that it can ''place'' the samples at well-fitting positions in its output space, as long as they have the appropriate distances. The absolute network output positions can hence be very different over several runs, since their error is equivalent for the error function if the positions are equivalent modulo a distance-preserving transformation.

In the given examples the network learns a mapping from 3D to 3D space using the annotations represented by the color of the points. Similarly colored points should be mapped to the same spot, whereas differently colored points should be mapped to different spots. Note that it is not possible to use a linear classifier in the original space to classify differently colored points (e.g. yellow vs. blue), but it is possible easily in the learned ''feature space'' using the mapping of the DANN depicted in the last image.


Since the distance function for the desired distance between two examples $\delta_{d}\left(x_{1},x_{2}\right)$ occurs in the gradient terms only in its original form, it is possible to use arbitrary functions or a distance matrix to realise it. Theoretically, the distance matrix may contain arbitrary (e.g. non-symmetric) distances, though it might take a large network or a higher-dimensional target space to obtain a low error.

Similar to Support Vector Machines, the distance function $\delta_{d}\left(x_{1},x_{2}\right)$ can be replaced by an arbitry kernel function $\kappa\left(x_{1},x_{2}\right)$ that returns a value representing the desired distance between two input vectors. In contrast to SVMs, this function does not have to satisfy any axioms, giving a lot of design freedom to the user. However, it should be symmetric to facilitate learning. This is an assumption that holds for the most useful metrics. In the following sections, I present the kernels that were implemented for our experiments.

$\kappa_{\mbox{Class}}\left(x_{1},x_{2}\right)=\left(l\left(x_{1}\right)-l\left(x_{2}\right)\right)^{2}$ : The distance should be the squared distance in the annotation space. This function is a straight-forward choice to make the feature mapping useful for classification. It is used to find the mapping in the preceeding examples.

$\kappa_{\mbox{Original}}\left(x_{1},x_{2}\right)=\left(x_{1}-x_{2}\right)^{2}$ : The desired distance should be equal to the original distance. This function leads to a completely unsupervised learning method, similar to the Autoencoder and MDS. In contrast to the Autoencoder, however, it does not develop a latent representation of the data but tries to preserve the original distances. This method does not necessarily find a representation of the data that is useful for classification.

$\kappa_{\mbox{Original+Class}}\left(x_{1},x_{2}\right)=\left(x_{1}-x_{2}\right)^{2}+\left(l\left(x_{1}\right)-l\left(x_{2}\right)\right)^{2}$ : This kernel tries to combine the benefits of both aforementioned kernels. It tries to find a distance preserving transform including the class distance.

$\kappa_{\mbox{DPCA}}\left(x_{1},x_{2}\right)=\left(\mbox{DPCA}^{-1}\left(\mbox{DPCA}\left(x_{1}-x_{2}\right)\right)\right)^{2}$ : The desired distance should be equal to the distance reduced on the main differences between samples from the different classes. This leads to a completely unsupervised setting that learns a useful feature.

For the last Kernel, DPCA refers to the projection of the difference vector into a PCA space. This space is learned before network training by calculating an arbitrary number of differences between class instances of different classes. The PCA hence learns the ''principal differences'' between the classes. Applying the inverse mapping immediately yields the distance reduced to the main differences between the classes. Note that this unsupervised variant of DANNs has additional parameters: the amount of difference vectors for calculating the PCA and the amount of principal components to use.


Experimental Setup

To make a comparison of performance simple and give the reader the possibility to compare the DANN for different amounts of output nodes, we designed the experiment as follows: for each dataset, we trained the DANN with ten different amounts of output nodes constantly increasing in the range from one feature dimension to more than the original data dimension.

For each amount of output nodes, we increased the amount of hidden nodes step-wise until the error on the validation set did not decrease any more. Similarly, we used an early- stopping criterion to determine the amount of stochastic gradient descent steps.

The network weights are initialized by fan-in. The data is normalized before being presented to the network by subtracting the mean and normalizing each feature variance to 1.0 as suggested as best practice by LeCun et al. in their paper ''Efficient BackProp''. To account for the random results of stochastic gradient descent and the random weight initialization, each experiment is conducted eight times and only the best result is used.

In the resulting feature space, we use a linear SVM classifier. The implementation used is LibLinear. To have a fair comparison of SVM performance, the $C$ parameter is evaluated for each SVM at each point in the range from $10^{-3}$ up to $10^{3}$. The performance is evaluated after every ten epochs (the "strip-length") of DANN training with a maximum of 120 strips. This upper limit is hardly ever reached and training usually stops due to the other stopping criteria.

Since there are very unbalanced datasets used during the experiments (see the table below), an evaluation metric must be used that truly reflects the performance. To be able to use the same metric for all experiments, including multiclass classification ones, we use the mean balanced accuracy. The balanced accuracy for one class is defined as $$ 0.5\frac{tp}{tp+fn}+0.5\frac{tn}{tn+fp}. $$


The first set of experiments is conducted on some datasets of the Proben1 ANN benchmark set. We use the version that is published together with the FANN library, since it is particularly easy to use. The training part of the data is used unchanged, the test data is split into validation and test parts with equal size. An overview of the data can be found in the following table:

  thyroid mushroom gene
Dimensions 21 125 120
Classes 3 2 3
Samples (train/val/test) 3600/1800/1800 4062/2031/2031 1588/794/794
Balance 2.5%/95%/2.5% 52%/48% 25%/50%/25%
Lin. SVM 0.81 1.00 0.93
RBF SVM 0.92 1.00 0.94
Lin. SVM after DANN 0.92 1.00 0.86

The Lin. SVM performance denotes the performance of a Linear SVM trained on the original data, RBF SVM the performance of an RBF kernel SVM on the original data and the Lin. SVM after DANN the best performance of a Linear SVM trained on the data transformed by the presented method.


Results on the gene dataset Results on the mushroom dataset Results on the thyroid dataset

The black solid line shows the level of a linear SVM applied to the original data, while the black dashed line shows the performance of an RBF kernel SVM on the original data.

The results of the first experiment on the thyroid dataset show the potential of the presented method: the classification performance can be substantially increased with most kernels, even for a significantly lower amount of data dimensions. The performance of the RBF kernel SVM on 21 dimensions is reached with this method and the kernel $\kappa_{\mbox{class}}$ with only five feature dimensions!

The second experiment is performed on the mushroom dataset. This dataset is very "easy" to separate properly; the linear SVM can reach 1.0 performance on the test set on the original data. The results on this dataset show that linear separability can be maintained by a trained DANN up to a strongly reduced number of data dimensions (for a feature dimension of 25, all kernels still achieve near 1.0 performance).

The last figure shows the performance on the gene dataset. This last experiment shows, that depending on the dataset, the performance of the original linear SVM might not be topped on the data, but can be kept on a near constant level while strongly reducing the data dimensionality. This can be especially interesting when data size matters, e.g. when storing large amounts of feature vectors.


In this article, I introduced the method of DANNs as a flexible and reliable method for feature learning and compression. Results show that linear classifiers on a strongly reduced feature space can reach the performance of non-linear methods in a much larger space. Since the DANN reduction method can be used in an unsupervised way, it maybe could be applied for layer initialization of deep learning networks and partially replace RBMs in this setting.

Applying the concept of difference based learning on convolutional neural networks remains an issue for further research. This would make this method applicable to higher-dimensional inputs such as plain images.

My implementation of the learning algorithm in C using the Intel MKL library might be published on this page soon. It provides the user with a convenient python interface, tied together using Cython. If you're interested in the topic or the implementation, write me an email or leave a comment!