Nordic Informatics Network in the Agricultural Sciences

Pattern Recognition in high dimensional data and complex structures

Background

Modern equipment for measuring processes within the agricultural field delivers huge amounts of output data, which must then be analysed. One main objective is to reduce the dimension of the data and identify patterns. The course will focus on the following four issues, of which the first two are more theoretical and the last two consider applications.

  1. Pattern recognition and classical statistical tools
    The focus here is mainly on discriminant (identification) analysis and cluster analysis. The main ideas of principal component analysis will also be considered. The objective of discriminant analysis is to decide whether an object belongs to a specific class from a family of classes or does not belong to the family at all. The decision is based either on a distance (discriminant score) or on the likelihood. In cluster analysis there are no predefined classes as in discriminant analysis. Instead one tries to split the set of observations into subsets. We will focus on hierarchical clustering methods. Initially, each observation constitutes a cluster by itself. The two clusters that are closest in some sense are merged to form a new cluster that replaces the two old ones. Merging of the closest clusters is repeated until only one cluster is left. Clustering methods differ in how the distance between two clusters is computed. Principal component analysis reduces the dimension of the data by constructing linear functions of the data. These functions can be used instead of the original variables as input to other analyses, for example discriminant analysis.
  2. Pattern recognition and data mining
    Choosing a model of appropriate complexity is important for drawing accurate conclusions. Simple models are used for learning simple functions of the data. Complex models are required for learning complex functions. For data mining models, one way to increase the complexity of a model is to add variables. Other ways to increase complexity depend on the type of model: In regression models, one can add interactions and polynomial terms. In neural networks, one can add hidden units. In tree-based models, one can grow a larger tree.
     
    For both regression and neural nets, the simplest models are linear functions of the input variables. Regression and neural nets are therefore both well suited to learning linear functions, whereas tree-based models require many branches to approximate them. With many input variables, learning becomes difficult because of the curse of dimensionality. To learn non-linear functions, all modelling methods require a degree of complexity that grows exponentially with the number of inputs. That is, as the number of inputs increases, the number of interactions and polynomial terms required in a regression model grows exponentially, the number of hidden units required in a neural network grows exponentially, and the number of branches required in a tree grows exponentially. The amount of data and the amount of training time required to learn such models also grow exponentially.
  3. Pattern recognition and images
    Feature extraction from signals will be discussed. Remote sensing data and image transformations are of particular interest. Signal representations relevant to machine learning will also be considered.
  4. Pattern recognition and microarrays
    Classification and analysis of a huge number of cellular gene expression profiles measured by means of DNA microarrays will be considered. Useful criteria for performance evaluation and methods for estimating reliability will be presented. Subset selection algorithms will be related to the curse of dimensionality.
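
The hierarchical clustering procedure described under issue 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not course material: the function names, the toy points and the choice of single and complete linkage as examples of "distance between two clusters" are assumptions made here.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def single_linkage(a, b, points):
    """Cluster distance = smallest distance between any two members."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def complete_linkage(a, b, points):
    """Cluster distance = largest distance between any two members."""
    return max(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, linkage=single_linkage):
    """Merge the two closest clusters until one is left.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    where clusters are tuples of observation indices.
    """
    # Each observation starts as a cluster by itself.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that is closest under the chosen linkage.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage(pair[0], pair[1], points),
        )
        d = linkage(a, b, points)
        # The merged cluster replaces the two old clusters.
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a, b, d))
    return merges


# Hypothetical toy data: two well-separated pairs of observations.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Passing linkage=complete_linkage changes only how the distance between two clusters is computed, which is the sense in which the clustering methods differ from each other.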

A common thread in the above is high dimensional data. Handling this kind of data is a challenge for statistics, among other disciplines. High dimensional data are usually characterized by many variables relative to the number of independent experimental units. On top of this, there may be complex or dynamic relations. Classical statistical asymptotic results do not apply. Multiple testing is often used, but determining appropriate significance levels is hard work. At the same time, new technologies are emerging whose purpose is to extract information from high dimensional data and which may be classified as pattern recognition methods. These methods often contain some statistical ingredient, which is fairly natural because data are random, but conclusions are in general not based on probabilistic reasoning. Instead, simulations convince users of the correctness of the analysis.
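
Why significance levels are hard to set under multiple testing can be illustrated with a small calculation. The sketch below assumes that all tests are independent and that every null hypothesis is true, which is a simplification that rarely holds for real data such as gene expression profiles. The function names and the Bonferroni adjustment are choices made here for illustration, not methods named in the course description.

```python
def familywise_error_rate(alpha, m):
    """Probability of at least one false positive among m independent
    tests, each carried out at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_level(alpha, m):
    """Per-test level that keeps the family-wise error rate below alpha."""
    return alpha / m


alpha = 0.05
for m in (1, 10, 100, 1000):
    unadjusted = familywise_error_rate(alpha, m)
    adjusted = familywise_error_rate(bonferroni_level(alpha, m), m)
    print(f"m={m:5d}  unadjusted={unadjusted:.4f}  bonferroni={adjusted:.4f}")
```

With many tests the unadjusted family-wise error rate approaches one, while the adjusted per-test level becomes very small. This trade-off is one reason a nominal significance level is difficult to interpret when many tests are performed.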

Aim of the course

The main objective is to introduce the participants to two worlds. One is the classical statistical framework and the other is computer science. In particular, participants should become familiar with the terminology used by the different disciplines. For example, training and trained model, terms often used in scientific computing, are synonyms of estimation and fitted model in statistics. Emphasis will be put on what kind of conclusions can be drawn from data. For example, in the analysis of microarrays a large number of statistical tests are often performed, and it soon becomes clear that for this procedure the significance level can serve only as a guide. During the course, various techniques and approaches will be illustrated extensively on computers. Participants will have to work a lot with computers in order to solve problems and to understand the basic principles.

After participating in the course, the PhD students will be able to use various methods to deal with large and complex problems. Besides getting a glimpse of statistics, they should be able to apply pattern recognition methods, understand their output and know their limitations.

Required knowledge

Familiarity with computers at user level, and with basic probability calculus and related concepts.

Topics and Key Words

This summer school will focus on the following main topics:

The material will be illustrated with various examples and supplemented with guest lectures on related issues. Throughout the course, the theory will be supplemented with exercises and computer assignments. At the end of the course, the students will work on a two-day project that involves both modelling and computing aspects.

Scientific responsibility

Professor Geoffrey McLachlan, Department of Mathematics, University of Queensland, Australia; Hans Liljenström, Department of Biometry and Informatics, Swedish University of Agricultural Sciences, Sweden; Dietrich von Rosen, Department of Biometry and Informatics, Swedish University of Agricultural Sciences, Sweden.

Organisational responsibility

Dietrich von Rosen, Department of Biometry and Informatics, Swedish University of Agricultural Sciences, Sweden.

External Lecturers

See the appendices for a presentation of the teachers.

Teaching methods

Lectures alternate with intensive computer exercises. The availability of network-connected computers is therefore essential for the students. A small project is carried out by the students at the end of the course (individually or, preferably, in small groups).

Examination

Examination will be based on a written project report handed in at the end of the course in combination with an oral presentation. The number of credits proposed is 6 ECTS.

Author: phd@dina.kvl.dk. Updated: 04 November 2003