Nordic Informatics Network in the Agricultural Sciences

Pattern Recognition in high dimensional data and complex structures
Background
Modern equipment for measuring processes in agriculture delivers huge
amounts of output data, which then have to be analysed. One main objective
is to reduce the dimension of the data and to identify patterns. The course
will focus on the following four topics, of which the first two are more
theoretical and the last two consider applications.
- Pattern recognition and classical statistical tools
The focus here will mainly be on
discriminant (identification) analysis and cluster analysis. The main ideas
of principal component analysis will also be considered. The aim of
discriminant analysis is to decide whether an object belongs to a specific
class within a family of classes or does not belong to the family at all.
The decision is based either on a distance (discriminant score) or on the
likelihood. In cluster analysis, unlike discriminant analysis, there are no
predefined classes. Instead one tries to split the set of observations
into subsets. We will focus on hierarchical clustering methods. Initially
each observation constitutes a cluster by itself. The two clusters that are
closest in some sense are merged to form a new cluster that replaces the two
old ones. Merging of the closest clusters is repeated until only one cluster
is left. Clustering methods differ in how the distance between two
clusters is computed. Principal component analysis reduces the dimension of
the data by constructing linear functions of the data. These functions can
be used instead of the original variables as input to other analyses, for
example discriminant analysis.
- Pattern recognition and data mining
Choosing a model of appropriate complexity is important for
drawing accurate conclusions. Simple models are used for learning simple
functions of the data. Complex models are required for learning complex
functions. For data mining models, one way to increase the complexity of a
model is to add variables. Other ways to increase complexity depend on the
type of model: In regression models, one can add interactions and polynomial
terms. In neural networks, one can add hidden units. In tree-based models,
one can grow a larger tree.
For both regression and neural nets, the simplest models are
linear functions of the input variables. Regression and neural nets are
therefore both well suited to learning linear functions, whereas tree-based
models require many branches to approximate them. With many input variables,
learning becomes difficult
because of the curse of dimensionality. To learn non-linear functions, all
modelling methods require a degree of complexity that grows exponentially
with the number of inputs. That is, as the number of inputs increases, the
number of interactions and polynomial terms required in a regression model
grows exponentially, the number of hidden units required in a neural network
grows exponentially, and the number of branches required in a tree grows
exponentially. The amount of data and the amount of training time required
to learn such models also grow exponentially.
- Pattern recognition and images
Feature extraction from signals will be discussed. Remote sensing data
and image transformations are of particular interest. Signal
representations suited to machine learning will also be considered.
- Pattern recognition and microarrays
Classification and analysis of a huge number of cellular
gene expression profiles measured by means of DNA microarrays will be
considered. Useful criteria for performance evaluation and methods for
estimating reliability will be presented. Subset selection algorithms will
be related to the curse of dimensionality.
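The hierarchical clustering procedure described under the first topic (start with one cluster per observation, merge the two closest clusters, repeat until one cluster is left) can be sketched as follows. This is only a minimal illustration and not course material: the choice of single linkage as the between-cluster distance, the Euclidean metric, the function names and the toy points are all assumptions made for the example.

```python
from itertools import combinations
from math import dist


def single_linkage(a, b, points):
    """Between-cluster distance (assumed choice): smallest pairwise point distance."""
    return min(dist(points[i], points[j]) for i in a for j in b)


def agglomerate(points, linkage=single_linkage):
    """Merge the two closest clusters until one is left; return the merge history."""
    clusters = [(i,) for i in range(len(points))]  # each observation starts alone
    merges = []
    while len(clusters) > 1:
        # Find the pair of clusters that is closest under the chosen linkage.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: linkage(pair[0], pair[1], points))
        merges.append((a, b, linkage(a, b, points)))
        # The merged cluster replaces the two old clusters.
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
    return merges


# Hypothetical toy data: two tight pairs of points that lie far apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(points)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

Passing a different `linkage` function (for example one that returns the largest pairwise distance) changes only how the distance between two clusters is computed, which is exactly where hierarchical methods differ.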
A common thread in the above is high-dimensional data. Handling this kind
of data is a challenge for statistics, among other disciplines.
High-dimensional data are usually characterized by many variables relative to
the number of independent experimental units. On top of this there may be
complex or dynamic relations. Classical asymptotic results in statistics do
not apply. Multiple testing is often used, but it is hard to determine
appropriate significance levels. At the same time new technologies are
emerging whose purpose is to extract information from high-dimensional data
and which may be classified as pattern recognition methods. These methods
often contain some statistical ingredient, which is natural because the data
are random, but conclusions are in general not based on probabilistic
reasoning. Instead, simulations convince users of the correctness of the
analysis.
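Why significance levels are hard to set under multiple testing can be seen from a small calculation. The sketch below is an illustration only. It assumes that all tests are independent and that every null hypothesis is true, and it uses the Bonferroni correction as one standard remedy, which the text above does not name. The function names and the numbers of tests are hypothetical.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """P(at least one false positive) for n_tests independent tests at level
    alpha, assuming every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_level(alpha: float, n_tests: int) -> float:
    """Per-test level that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    for m in (1, 10, 100, 10_000):  # hypothetical numbers of tests
        print(m, round(familywise_error_rate(0.05, m), 4), bonferroni_level(0.05, m))
```

With 100 tests at the 5% level, at least one false positive is almost certain under these assumptions, while the corrected per-test level becomes very strict as the number of tests grows.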
Aim of the course
The main objective is to introduce the participants to two
worlds. One is the classical statistical framework and the other is computer
science. In particular, participants should become familiar with the
terminology used by the different disciplines. For example, "training" and
"trained model", often used in scientific computing, correspond to
"estimation" and "fitted model" in statistics. Emphasis will be put on what
kind of conclusions can be drawn from data. For example, in the analysis of
microarrays a large number of statistical tests are often performed, and it
soon becomes clear that for this procedure the significance level can only
serve as a guide. During the course various techniques and approaches will be
illustrated extensively on computers. Participants will have to work a lot
with computers in order to solve problems and to understand the basic
principles.
After participating in the course, the PhD students will
be able to use various methods to deal with large and complex problems.
Besides getting a glimpse of statistics, they should be able to apply pattern
recognition methods, understand their output, and know their limitations.
Required knowledge
Familiarity with computers at user level, and with basic probability calculus
and related concepts.
Topics and Key Words
This summer school will focus on the following main topics:
- pattern recognition
- data mining
- discriminant analysis
- cluster analysis
- signal processing
- analysis of microarrays
The material will be illustrated with various examples and supplemented with
guest lectures on related issues. Throughout the course, the theory will be
supplemented with exercises and computer assignments. At the end of the course,
the students will work on a two-day project that involves both modelling and
computing aspects.
Scientific responsibility
Professor Geoffrey McLachlan, Department of Mathematics, University of
Queensland, Australia; Hans Liljenström, Department of Biometry and
Informatics, Swedish University of Agricultural Sciences, Sweden; Dietrich von
Rosen, Department of Biometry and Informatics, Swedish University of
Agricultural Sciences, Sweden.
Organisational responsibility
Dietrich von Rosen, Department of Biometry and Informatics, Swedish
University of Agricultural Sciences, Sweden.
External Lecturers
- Professor Geoff McLachlan, Department of Mathematics, University of
Queensland, Brisbane, Australia.
- Associate Professor Mats Gustafsson, Signal and Systems, Uppsala
University, Sweden.
- Professor Jan Komorowski, The Linnaeus Centre for Bioinformatics, Uppsala
University, Sweden.
See the appendices for a presentation of the teachers.
Teaching methods
Lectures alternate with intensive computer exercises.
The availability of network-connected computers is
therefore essential for the students. A small project is carried
out by the students at the end of the course (individually or preferably in
small groups).
Examination
Examination will be based on a written
project report handed in at the end of the course in combination with an oral
presentation. The number of credits proposed is 6 ECTS.

Author: phd@dina.kvl.dk. Updated: 04 November 2003