Michael Biehl, earlier research in machine learning

Earlier research (1989-2003) in the area of machine learning

Overview   (back to the top)
A key issue in my scientific activities is the analysis, modelling, and application of learning in adaptive information processing systems. This includes, for instance, neural networks for classification and regression tasks as well as problems of unsupervised learning and data analysis.
Frequently, training can be formulated as an optimization problem which is guided by an appropriate cost function. One particularly successful approach is suitable for the investigation of systems with many adaptive parameters, the choice of which is based on high-dimensional randomized example data.
Monte Carlo computer simulations of training processes play an important role in this context. Furthermore, analytical methods borrowed from statistical physics allow for the computation of typical learning curves in the framework of model scenarios. In classification, for instance, one is interested in the generalization error, i.e. the expected error rate when classifying novel data, as a function of the number of examples used in training.
Key ingredients of the appoach are:
* the consideration of very large systems in the so-called thermodynamic limit which allows for
* the description in terms of a few macroscopic quantities or order parameters
* performing averages over randomized examples and the stochastic training process.
The thermodynamic limit corresponds to infinite dimensional data, formally. However, in most cases results are in excellent quantitative agreement with observations in systems of, say, a few hundred degrees of freedom. The analysis of such systematic finite size effects is of course one important aspect of the approach.
Two, in a sense extreme cases of training have been considered with particular success:

• Off-line or batch learning
• If a given set of example data is used to define a cost function for training, the latter can be interpreted as an energy in statistical physics language. A formal temperature controls the minimization and typical properties of the system can be evaluated from the corresponding free energy. Technical subtleties arise in the evaluation of the corresponing quenched average over random data. They require, for instance, a saddle-point integration in terms of macroscopic order parameters and the application of the replica formalism known from disorder statistical physics.

• On-line training
If examples are presented one-by-one in a temporal sequence, it is possible to analyse the learning dynamics exactly. Under simplifying assumptions, a system of non-linear ordinary differential equations (ODE) describe the temporal evolution of order parameters, and hence the learning curve. The formalism can also be applied in the case of purely heuristic training schemes with no obvious relation to a cost function.

All these investigations aim at an understanding of phenomena and problems which occur in practical applications of learning systems. The insights obtained from the model scenarios, ultimately, enable to improve and optimize existing algorithms and -moreover- to develop novel and efficient training schemes. Examples of this strategy and related key publications are given below.

Some reviews and longer papers   (back to the top)

• Statistical mechanics of on-line learning and generalization
M. Biehl and N. Caticha
in: Handbook of Brain Theory and Neural Networks (2nd ed.)
M.A. Arbib (ed.), MIT Press (2003)

• Learning in freed-forward neural networks: dynamics and phase transitions
M. Biehl, in: Adaptivity and Learning, R. Kühn et al. (eds.), Springer (2003)

• Statistical physics of learning: phase transitions in neural networks
M. Biehl, M. Ahr, E. Schlösser,
in: Advances in Solid State Physics 40, B. Kramer (ed.), Vieweg (2000)

• The statistical mechanics of learning a rule
T.L.H. Watkin, A. Rau, and M. Biehl, Reviews of Modern Physics 65 (1993) 499.

For a more complete overview of my activities, see also my related, electronically
available articles and the complete list of publications.

• Off-line perceptron training, the AdaTron algorithm   (back)
A statistical physics analysis of perceptron training, due to M. Opper, led to the development of this training prescription. It maintains the favorable convergence properties of the Adaline algorithm (Widrow and Hoff) and - at the same time - yields the perceptron of optimal stability in linearly separable cases.
Recently, the algorithm has re-gained popularity in the context of Support Vector Machines, see the book by N. Christianini and J. Shawe-Taylor, for instance. Besides typical properties of the AdaTron, we have addressed numerous other problems in off-line perceptron training in the statistical physics framework. This includes, among others, learning from clustered examples and the typical ROC curve of the perceptron as discussed in the context of medical diagnosis.

the influence of sample size and prevalence

A. Freking, M. Biehl, C. Braun, W. Kinzel, M. Meesmann, Phys. Rev. E 60 (1999) 5926.

Supervised learning from clustered input examples
C. Marangi, M. Biehl, and S.A. Solla, Europhysics Letters 30 (1995) 117.

Perceptron learning by constrained optimization: the AdaTron algorithm
M. Biehl, J.K. Anlauf, and W. Kinzel
in: Neurodynamics, F. Pasemann and H.D. Doebner (eds.), World Scientific (1991)

J.K. Anlauf and M. Biehl, Europhysics Letters 10 (1989) 687.

• On-line perceptron training   (back)
The single layer perceptron can be interpreted as the basic building block from which to construct more complex devices, including dynamical networks of the Little-Hopfield type and multilayered feed-forward architectures. Consequently, it has served as a prototype system for the investigation of learning processes, in general. This applies in particular to the framework of on-line learning. Results are directly applicable to multilayer networks with non-overlapping receptive fields. The following are just a few major questions addressed in this context:

* Variational optimization of on-line algorithms
* Learning from noisy data and unrealizable rules
* Learning of drifting rules in non-stationary environments

Learning from noisy data: an exactly solvable model
M. Biehl, P. Riegler, and M. Stechert, Physical Review E 52 Rapid Comm. (1995) R4624.

On-line learning with a perceptron
M. Biehl and P. Riegler, Europhysics Letters 28 (1994) 525.

On-line learning of a time-dependent rule
M. Biehl and H. Schwarze, Europhysics Letters 20 (1992) 733.

• Unsupervised learning   (back)
As opposed to problems of classification and regression, example data are unlabelled in unsupervised learning. Here the aim is to extract relevant inforamtion from input data only. Potential applications of unsupervised learning include feature detection for low-dimensional representation, data compression, or as a pre-processing step in supervised learning. Both, on-line and off-line unsupervised learning can be modelled and analysed along the same lines as supervised training. Example topics include

* Principal component analysis and density estimation
* Competitive learning and clustering
* Variational optimization of unsupervised training

Optimisation of on-line principal component analysis
E. Schlösser, D. Saad, M. Biehl, Journal of Physics A 32 (1999) 4061.

Dynamics of on-line competitive learning
M. Biehl, A. Freking, G. Reents, Europhysics Letters 38 (1997) 427.

Statistical mechanics of unsupervised learning
M. Biehl and A. Mietzner, Europhysics Letters 24 (1993) 421.

• Learning by gradient descent and backpropagation   (back)
In regression problems it is possible to base learning on a continous error measure and use gradient based methods for the choice of, for instance, the weights in a neural network. Our studies of on-line gradient descent initiated the first analytic treatment of the learning dynamics in multilayered neural networks. A topic of major importance in this context is the discussion of quasistationary states in the learning dynamics. These so-called plateaus are observed in a variety of learning scenarios and are well known from practical applications of machine learning. Insights into the nature of the plateau problem were the basis for the later development of novel, more efficient training prescriptions for, both, regression and classification.

Transient dynamics of online-learning in two-layered neural networks
M.Biehl, P. Riegler, and C. Wöhler, Journal of Physics A 29 (1996) 4769.

M. Biehl and H. Schwarze, Journal of Physics A 28 (1995) 643.

On-line backpropagation in two-layered neural networks
P. Riegler and M. Biehl, Journal of Physics A 28 (Letters) (1995) L507.

• Specialization processes in on-line and off-line learning   (back)
The above mentioned plateaus in on-line training are just one example for specialization process which reflect inevitable symmetries of the learning scenario. Often, elements of the trained device can be exchanged with no effect on the performance. A permutation symmetry holds, for instance, for the branches of a multilayered neural network or for the prototypes in unsupervised competitive learning. Successful learning requires the breaking of such symmetries or, in other words, the specialization of sub-systems, e.g. the network branches.
The counterpart of on-line plateaus in off-line or batch training are phase transitions in the learning curve: frequently, the success of learning depends on the number of examples in a discontinous way. Only above a critical size of the training set, specialization and good generalization performance become possible.

Efficiently learning multilayer perceptrons
C. Bunzmann, M. Biehl, and R. Urbanczik, Physical Review Letters 86 (2001) 2166.

Statistical physics and practical training of soft-committee machines
M. Ahr, M. Biehl, R. Urbanczik, European Physics Journal B 10 (1999) 583.

Specialization processes in on-line unsupervised learning
M. Biehl, A. Freking, G. Reents, and E. Schlösser, Philosophical Magazine B 77 (1998) 1487.

(back to the top)     last update February 2005