discriminative feature transformations based on
maximum mutual information

This page presents a few video clips illustrating the convergence of learning for dimension-reducing feature transformations that aim to preserve class discrimination. The criterion is to maximize the mutual information between class labels and transformed features, as described in the references. In the first four cases the transformation is linear, parametrized by Givens rotations, and maps onto a two-dimensional space for simple visualization.
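The sketch below shows one way a linear projection with orthonormal columns can be built from Givens rotation angles. The function name `givens_projection`, the choice of rotation planes, and the order in which they are applied are assumptions made for illustration; the references describe the parametrization actually used.

```python
import numpy as np

def givens_projection(angles, dim, out_dim=2):
    """Build a (dim x out_dim) matrix with orthonormal columns from angles.

    Assumption of this sketch: one angle per plane (i, j) with
    i < out_dim and i < j < dim, applied in a fixed order.
    """
    planes = [(i, j) for i in range(out_dim) for j in range(i + 1, dim)]
    if len(angles) != len(planes):
        raise ValueError(f"expected {len(planes)} angles, got {len(angles)}")
    rotation = np.eye(dim)
    for theta, (i, j) in zip(angles, planes):
        c, s = np.cos(theta), np.sin(theta)
        g = np.eye(dim)          # identity except in the (i, j) plane
        g[i, i], g[j, j] = c, c
        g[i, j], g[j, i] = -s, s
        rotation = rotation @ g  # product of rotations stays orthogonal
    # The first out_dim columns of an orthogonal matrix are orthonormal.
    return rotation[:, :out_dim]

# Project a few 3-dimensional points onto two dimensions.
rng = np.random.default_rng(0)
W = givens_projection(rng.uniform(-np.pi, np.pi, size=3), dim=3)
Y = rng.uniform(size=(5, 3)) @ W
```

Because every factor is a rotation, the columns of `W` stay orthonormal for any angle values, so an optimizer can adjust the angles freely without a separate re-orthogonalization step.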

Classes are depicted in different colors. Each projected point is drawn with a small arrow that starts at the point's actual location. The arrows represent "information forces": the directions and magnitudes in which the data samples would like the transform to move them in order to maximize the criterion.
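As a rough illustration of what such arrows could be, the sketch below evaluates a quadratic mutual information estimate with Gaussian Parzen windows and takes its gradient with respect to each projected point. It assumes the criterion is the integrated squared difference between p(c, y) and P(c)p(y), with an isotropic kernel of width `sigma`; the function names and the synthetic data are hypothetical, and the exact formulation in the references may differ.

```python
import numpy as np

def pair_weights(labels):
    """Weight of each pair (i, j) in the quadratic-MI double sum."""
    labels = np.asarray(labels).ravel()
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    priors = counts / labels.size
    p = priors[inverse]                      # prior of each sample's class
    same = (labels[:, None] == labels[None, :]).astype(float)
    return same + np.sum(priors**2) - p[:, None] - p[None, :]

def gauss_pairs(Y, sigma):
    """Gaussian with covariance 2*sigma^2*I evaluated at all y_i - y_j."""
    m = Y.shape[1]
    diff = Y[:, None, :] - Y[None, :, :]
    sq = np.sum(diff**2, axis=-1)
    G = np.exp(-sq / (4 * sigma**2)) / (4 * np.pi * sigma**2) ** (m / 2)
    return G, diff

def quadratic_mi(Y, labels, sigma):
    G, _ = gauss_pairs(Y, sigma)
    return np.sum(pair_weights(labels) * G) / len(Y) ** 2

def information_forces(Y, labels, sigma):
    """Gradient of quadratic_mi with respect to every projected point."""
    G, diff = gauss_pairs(Y, sigma)
    coeff = pair_weights(labels) * G
    return -np.einsum("ij,ijk->ik", coeff, diff) / (len(Y) ** 2 * sigma**2)

# Synthetic stand-in data: two classes already projected to two dimensions.
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
               rng.normal(3.0, 0.3, size=(20, 2))])
labels = np.repeat([0, 1], 20)
forces = information_forces(Y, labels, sigma=0.5)  # one arrow per point
```

Each row of `forces` is the direction in which moving that point would increase the estimate fastest; chaining these gradients through the transform gives the gradient with respect to its parameters.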

Here are the examples:

  • Two classes in 3-dimensional space. The blue class is a uniform distribution inside a unit cube. The red class is a slab completely INSIDE the blue class.  Simple gradient ascent is used for learning. QuickTime video (4.2 MB)

  • Three classes in 3-dimensional space.  The blue class is again a uniform distribution inside a unit cube. The red class is a thin rod through the center of the cube. There is a third class (green), which is a Gaussian distribution, also inside the blue class. Simple gradient ascent is used for learning. QuickTime video (3.6 MB)

  • Three classes in 12-dimensional space. This is the oil-pipeflow data from NCRG at Aston University. Levenberg-Marquardt optimization is used this time. QuickTime video (1.7 MB)

  • Six classes in 36-dimensional space. This is the Landsat satellite image database from Statlog-project and UCI Machine learning repository. Levenberg-Marquardt optimization. QuickTime video (4.8 MB)
An example of a nonlinear transform (see ref. 3 below):
  • The same three classes in 3-dimensional space as in the second example. Now the transform is a Radial Basis Function network.  Simple gradient ascent is used for learning. QuickTime video (6.0 MB)
In the following two examples, instead of Parzen density estimation coupled with quadratic mutual information, a Gaussian Mixture Model (GMM) is fitted to the data in the output space after the initial (random) transform (see ref. 4 below). A GMM is then constructed from the same samples in the high-dimensional input space, and a transform is learned to separate the GMMs of the different classes. The covariance ellipsoids of the mixture components are plotted together with the actual samples, but only the GMMs, not the samples, are used in the adaptation:
  • Oil-pipeflow data (same as 3rd example). 12-dimensional data in three classes transformed into two dimensions. The GMM has 2-3 spherical components per class. QuickTime video (2.5 MB)
  • Phoneme data. 20-dimensional data in 20 classes transformed into two dimensions. The GMM has 2-3 diagonal components per class. QuickTime video (4.5 MB)
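For a linear transform, a Gaussian mixture component in the input space maps to a Gaussian in the output space in closed form, so the plotted ellipsoids can be obtained without touching the samples. The sketch below shows only that mapping, not the adaptation procedure of ref. 4; `project_gaussian` and the example component are hypothetical.

```python
import numpy as np

def project_gaussian(mean, cov, W):
    """Mean and covariance of y = W.T @ x when x ~ N(mean, cov)."""
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    return W.T @ mean, W.T @ cov @ W

# A spherical 12-dimensional component projected to two dimensions
# through a random matrix with orthonormal columns.
rng = np.random.default_rng(0)
W, _ = np.linalg.qr(rng.normal(size=(12, 2)))
mean_y, cov_y = project_gaussian(rng.normal(size=12), 0.25 * np.eye(12), W)
```

With orthonormal columns in `W`, a spherical component stays spherical after projection, while a diagonal component generally becomes a full-covariance ellipse.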
An example of an on-line stochastic gradient algorithm (see ref. 5 below):
  • Oil-pipeflow data (same as 3rd example). 12-dimensional data in three classes transformed into two dimensions. Gradient ascent is used for learning, but rather than computing all million mutual interactions between the thousand data points at each iteration, a random sample of a mere 30 interactions (pairs of points) is evaluated at each iteration (which corresponds to one frame of the movie clip). QuickTime video (1.5 MB)
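The idea of sampling pairs can be sketched as follows. A quadratic mutual information estimate with Gaussian Parzen windows is an average over all ordered pairs of points, so averaging over a few randomly drawn pairs gives an unbiased estimate of both the criterion and its gradient. The code assumes that form of the criterion, a plain projection matrix `W` rather than rotation angles, and synthetic stand-in data; all function names are hypothetical and ref. 5 describes the actual algorithm.

```python
import numpy as np

def _pair_terms(X, labels, W, pairs, sigma):
    labels = np.asarray(labels).ravel()
    _, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
    priors = counts / labels.size
    i, j = pairs[:, 0], pairs[:, 1]
    # Weight of pair (i, j) in the quadratic-MI double sum.
    w = ((labels[i] == labels[j]) + np.sum(priors**2)
         - priors[inverse[i]] - priors[inverse[j]])
    dx = X[i] - X[j]                       # differences in the input space
    dy = dx @ W                            # differences after projection
    m = W.shape[1]
    g = np.exp(-np.sum(dy**2, axis=1) / (4 * sigma**2)) / (4 * np.pi * sigma**2) ** (m / 2)
    return w, g, dx, dy

def pair_objective(X, labels, W, pairs, sigma):
    """Average pairwise term; equals the full estimate when given all pairs."""
    w, g, _, _ = _pair_terms(X, labels, W, pairs, sigma)
    return np.mean(w * g)

def pair_gradient(X, labels, W, pairs, sigma):
    """Gradient of pair_objective with respect to W."""
    w, g, dx, dy = _pair_terms(X, labels, W, pairs, sigma)
    coeff = -w * g / (2 * sigma**2)
    return np.einsum("p,pa,pb->ab", coeff, dx, dy) / len(pairs)

# Synthetic stand-in: 1000 points, 12 dimensions, three classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 12))
labels = rng.integers(0, 3, size=1000)
W, _ = np.linalg.qr(rng.normal(size=(12, 2)))
pairs = rng.integers(0, len(X), size=(30, 2))   # 30 of the 1000 * 1000 pairs
grad = pair_gradient(X, labels, W, pairs, sigma=1.0)
```

Each such estimate costs 30 kernel evaluations instead of a million, at the price of a noisier gradient direction per iteration.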
An example with high-dimensional data:
  • A subset of the Reuters-21578 database was used. This subset consists of 6535 brief news articles. The 5718-dimensional term-histogram data is transformed into two dimensions. This data set has 52 classes. The GMM approach described above was used. In this example an attempt was made to find a projection that could simultaneously separate all 52 classes. Two features are obviously too few, but we can see how the method concentrates on separating the two major classes ("earn" and "acq") from the rest. AVI video (1.7 MB)
  • Reuters-21578 again. This time the aim is to find a projection that discriminates a smaller class ("coffee") from the rest. AVI video (0.8 MB)
References
  1. K. Torkkola and W. Campbell, "Mutual Information in Learning Feature Transformations", Proceedings of ICML 2000, Stanford, CA, June 29-July 2, 2000. (.ps.gz), (.pdf)
  2. K. Torkkola, "Visualizing Class Structure in Data Using Mutual Information", Proceedings of NNSP 2000, Sydney, Australia, December 11-13, 2000. (.ps.gz), (.pdf)
  3. K. Torkkola, "Nonlinear Feature Transforms Using Maximum Mutual Information", Proceedings of IJCNN 2001, Washington DC, USA, July 14-19, 2001. (.pdf)
  4. K. Torkkola, "Learning discriminative feature transforms to low dimensions in low dimensions", in Advances in Neural Information Processing Systems 14 (NIPS 2001), Vancouver, BC, Canada, December 3-8, 2001. MIT Press. (.pdf). Same as a pdf-poster.
  5. K. Torkkola, "On feature extraction by mutual information maximization", in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2002), Orlando, FL, May 13-17, 2002. (.pdf)
  6. K. Torkkola, "Feature Extraction by Non-Parametric Mutual Information Maximization", Journal of Machine Learning Research, 3(Mar):1415-1438, 2003. [abs] [pdf] [ps.gz] [ps]
  7. J.C. Principe, J.W. Fisher III and D. Xu, "Information Theoretic Learning", in Simon Haykin (ed.), Unsupervised Adaptive Filtering, Wiley, 2000.

