Nordic Informatics Network in the Agricultural Sciences

Likelihood-based inference for hierarchical/mixed statistical models

Background

The application of mixed models has a long history in agricultural research. A mixed model is a statistical model where variation is attributed to both fixed effects and random effects. Fixed effects are non-random, and a goal of the statistical analysis is often to estimate the sizes of the fixed effects. Random effects are random in the sense that they are considered to be realisations of random variables with a specified distribution. The fixed effects in a model often comprise the effects of covariates or factors with a moderate number of levels which are under the experimenter's control, such as the amount of fertilizer applied within a plot or the type or amount of fodder given to a domestic animal. Random effects are often used to model sources of variation which may not be of intrinsic interest but nevertheless influence the variability of the data. In a field trial, for instance, random effects are used to model heterogeneity in soil fertility, whereas random effects associated with animals can be used to model individual-specific variability in the response to a treatment. The random effects may, on the other hand, also be of primary interest. This is especially the case when the distributional characteristics of the population of random effects are important. In quantitative genetics, for example, individual-specific random effects are used to model the genetic effect influencing the phenotype of an individual. The estimation of the genetic variance is then a main objective, since the size of this variance determines how much the phenotype can be changed using a breeding programme. In a Bayesian statistical approach all unknown quantities are considered random, but the distinction between the different types of effects exemplified above may still be relevant.
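As a point of reference, a linear mixed model is commonly written in the matrix form below. The notation (y for the data, X and Z for the design matrices of the fixed and random effects, G and R for the covariance matrices) is conventional and is assumed here for illustration only; it does not appear in the course description itself.

```latex
y = X\beta + Zu + \varepsilon, \qquad
u \sim N(0, G), \quad \varepsilon \sim N(0, R), \quad
u \text{ and } \varepsilon \text{ independent}
```

Here \beta holds the fixed effects to be estimated and u the random effects, whose covariance matrix G contains, for example, a genetic variance.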

Random effects often appear naturally in a hierarchy where the random effects are associated with nested groups of observations. Examples are studies based on measurements on plants within fields within farms, or similarly animals within herds within breeds. The random effects then serve to model the correlation between observations belonging to the same group within a level of the hierarchy. In hierarchical (also known as multilevel) modelling the hierarchy is accounted for explicitly, since the model is specified in terms of conditional distributions of the random effects at each level given the random effects at the level above. In the Bayesian approach an extra top level is added to the hierarchy when prior distributions are imposed for unknown parameters. It is often useful to display the structure of a hierarchical model using a directed graph.
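The plants-within-fields-within-farms example can be sketched as a chain of conditional normal distributions. The symbols (m_i for the mean of farm i, m_{ij} for the mean of field j on farm i, y_{ijk} for the measurement on plant k) and the choice of normal distributions are assumptions made for this illustration.

```latex
m_i \sim N(\mu, \sigma_{\mathrm{farm}}^2), \qquad
m_{ij} \mid m_i \sim N(m_i, \sigma_{\mathrm{field}}^2), \qquad
y_{ijk} \mid m_{ij} \sim N(m_{ij}, \sigma^2)
```

Under these assumptions two plants in the same field have covariance \sigma_{\mathrm{farm}}^2 + \sigma_{\mathrm{field}}^2, while two plants in different fields on the same farm have covariance \sigma_{\mathrm{farm}}^2, which is how the random effects induce correlation within groups.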

When a mixed or hierarchical model is specified in terms of linear normal models, the marginal distribution of the data, and hence the likelihood function, is known in closed form. Maximum likelihood estimation is therefore straightforward in principle, although computational problems involving high-dimensional covariance matrices may occur in practice. However, linear normal modelling is often not appropriate, either because the data are discrete or skewed or because the relations between the observations and the fixed or random effects are non-linear. Such situations call for generalised linear mixed models. Computation of the likelihood function then involves an integral over the random effects that is usually intractable, and explicit expressions for the likelihood function are rarely available. Alternatives such as marginal or penalized quasi-likelihood estimation have therefore been suggested. These methods, however, are not always reliable and may for example produce biased results when applied to binary data. With the computing power of modern computers it is becoming increasingly feasible to evaluate the likelihood function using numerical integration or Monte Carlo methods, and maximum likelihood estimation for hierarchical models may therefore gain popularity in the future.
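To make the two evaluation strategies concrete, the following minimal sketch computes the likelihood contribution of a single cluster in a random-intercept logistic model, once by Gauss-Hermite quadrature and once by simple Monte Carlo. The model, the function names, the data and the parameter values are all assumptions chosen for illustration; they are not taken from the course material or from any of the software packages mentioned.

```python
import numpy as np


def cluster_likelihood_given_u(y, beta, u):
    """Conditional likelihood of the binary responses y for each value in u.

    Assumed model: y_j | u ~ Bernoulli(p) with logit(p) = beta + u.
    """
    u = np.atleast_1d(u)
    p = 1.0 / (1.0 + np.exp(-(beta + u)))
    successes = np.sum(y)
    failures = len(y) - successes
    return p**successes * (1.0 - p) ** failures


def marginal_likelihood_quadrature(y, beta, sigma, n_nodes=30):
    """Integrate out u ~ N(0, sigma^2) by Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_nodes)
    u = np.sqrt(2.0) * sigma * x  # change of variables to the N(0, sigma^2) scale
    return np.sum(w * cluster_likelihood_given_u(y, beta, u)) / np.sqrt(np.pi)


def marginal_likelihood_monte_carlo(y, beta, sigma, n_draws=200_000, seed=0):
    """Integrate out u ~ N(0, sigma^2) by averaging over simulated draws."""
    rng = np.random.default_rng(seed)
    u = sigma * rng.standard_normal(n_draws)
    return np.mean(cluster_likelihood_given_u(y, beta, u))


# Hypothetical data: five binary observations from one cluster.
y = np.array([1, 1, 0, 1, 0])
lq = marginal_likelihood_quadrature(y, beta=0.3, sigma=1.0)
lm = marginal_likelihood_monte_carlo(y, beta=0.3, sigma=1.0)
print(lq, lm)
```

The two estimates should agree closely for a one-dimensional integral like this one. With several correlated random effects per observation the integral becomes high-dimensional, quadrature grows expensive, and Monte Carlo methods become the more practical choice.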

The Bayesian approach is also based on the likelihood function, and Bayesian statistics for non-normal or non-linear models might hence be expected to suffer from the same intractable likelihood function. However, in the Bayesian approach one may use a demarginalization strategy and consider the joint posterior distribution of all unknown quantities, random effects included. Explicit knowledge of the (marginal) likelihood function is then not required. The joint posterior may still be analytically intractable, but it can be explored using Markov chain Monte Carlo (MCMC) methods, which have gained much popularity within the last two decades. In agricultural science the frequentist approach to statistics dominates, and the more or less subjective nature of Bayesian inference may not fit well into the existing scientific tradition. The Bayesian approach to hierarchical models has, on the other hand, gained much popularity within the statistics community because of its modelling flexibility and because it avoids the dependence on asymptotic results, which are often not well justified for mixed models.
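The demarginalization idea can be illustrated with a toy random-walk Metropolis sampler that targets the joint posterior of the fixed effect, the random-effect standard deviation and the random effects themselves, so that the integral over the random effects is never computed. Everything below is an assumption made for the sketch: the random-intercept logistic model, the simulated data, the priors (N(0, 10^2) on beta and N(0, 1) on log sigma), the step size and the function names. It is not the algorithm used by any particular software package.

```python
import numpy as np


def log_joint(theta, y, cluster):
    """Unnormalised log joint posterior of (beta, log sigma, u_1, ..., u_m)."""
    beta, log_sigma, u = theta[0], theta[1], theta[2:]
    sigma = np.exp(log_sigma)
    eta = beta + u[cluster]
    # Bernoulli log-likelihood given the random effects (no integration needed)
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    # Random effects: u_i ~ N(0, sigma^2), up to an additive constant
    log_random = np.sum(-0.5 * (u / sigma) ** 2 - log_sigma)
    # Assumed priors: beta ~ N(0, 10^2), log sigma ~ N(0, 1)
    log_prior = -0.5 * (beta / 10.0) ** 2 - 0.5 * log_sigma**2
    return loglik + log_random + log_prior


def metropolis(y, cluster, n_clusters, n_iter=5000, step=0.15, seed=1):
    """Random-walk Metropolis updating all unknown quantities jointly."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(2 + n_clusters)
    current = log_joint(theta, y, cluster)
    draws = np.empty((n_iter, theta.size))
    accepted = 0
    for t in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.size)
        candidate = log_joint(proposal, y, cluster)
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
            accepted += 1
        draws[t] = theta
    return draws, accepted / n_iter


# Simulated data: 8 clusters (e.g. herds) with 10 binary observations each.
rng = np.random.default_rng(0)
n_clusters, n_per_cluster = 8, 10
cluster = np.repeat(np.arange(n_clusters), n_per_cluster)
u_true = rng.normal(0.0, 1.0, n_clusters)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.3 + u_true[cluster]))))

draws, rate = metropolis(y, cluster, n_clusters)
print("acceptance rate:", rate)
print("posterior mean of beta (after burn-in):", draws[1000:, 0].mean())
```

A joint random-walk update keeps the sketch short; in practice one would check convergence, tune the proposal, and probably update the parameters in blocks or one at a time.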

Aim of the course

The course will provide an introduction to likelihood-based statistical analysis for mixed and hierarchical models. The course will focus on modelling issues, on how a statistical analysis can be carried out for a mixed model, and on interpretation of models and results. The background for the various existing software packages will be discussed, as well as the potential and limitations of the different packages. The course will consider software packages based on both the "classical" (frequentist) and the Bayesian approach. A theme of the summer school will therefore also be to discuss the benefits and disadvantages of the classical and Bayesian approaches to statistics for hierarchical models, in relation to current scientific practice in the agricultural sciences and to practical implementation. After participation in the course, the PhD students will be able to formulate a mixed model for a given data set and carry out an analysis of the data using one of the software packages considered in the course. They will also have an understanding of how to interpret parameter estimates or posterior distributions and how to present the results of the statistical analysis in their scientific papers.

Required knowledge

Familiarity with basic statistical inference (estimation, confidence intervals, tests) and a good working knowledge of practical statistical modelling and analysis, including regression models; as a general rule this would correspond to at least two statistics courses. Familiarity with computers at user level, as well as with one "advanced" statistical software package (S-Plus/R, WinBUGS, MLwiN, SAS, Stata etc.).

Topics and Key Words

This summer school will focus on the following main topics:

Scientifically responsible

William Browne, School of Mathematical Sciences, University of Nottingham, and Henrik Stryhn, Atlantic Veterinary College, University of Prince Edward Island.

Organisational responsible

Anders Ringgaard Kristensen, The Royal Veterinary and Agricultural University of Denmark, and Rasmus Waagepetersen, Department of Mathematics, Aalborg University.

Teaching methods

Lectures alternating with intensive use of computer exercises. The availability of network connected computers is therefore essential for the benefit of the students. The material will be illustrated with various case studies. A two-day project is carried out by the students at the end of the course (individually or preferably in small groups). The project will involve both modelling and computing aspects. The students are encouraged to use their own datasets in the project.

Examination

Examination (pass/no-pass) will be based on a written project report handed in at the end of the course in combination with an oral presentation. The number of credits proposed is 6 ECTS.

 

Author: phd@dina.kvl.dk. Updated: 17 September 2004