Prepared by The Statistics Group, KVL - Last modified: Feb 27, 2004

Module 11: Repeated measures I, simple methods

11.1  Notes
    11.1.1  Main example: Activity of rats
    11.1.2  Separate analysis for each time-point
    11.1.3  Analysis of summary statistic
    11.1.4  Random effects approach
    11.1.5  Pros and cons of simple approaches


11.1  Notes


This module describes various simple approaches for analyzing "repeated measurements" and shows how these analyses can be carried out in SAS. Data referred to as "repeated measurements" (or sometimes as "longitudinal data") are characterized by having several measurements on the same individuals or experimental units. These measurements are typically taken at different times, or at different positions within the individuals. Consider for instance the following experimental design for comparing two drugs (A and B) intended to reduce blood pressure:

  1. Twenty individuals were selected randomly from the relevant population.
  2. Half of these were given drug A and half were given drug B (randomly selected).
  3. For a period of two months these individuals had their blood pressure measured every week, which resulted in eight measurements on each individual.
The problem is that data collected this way might violate the standard assumption of independent measurements. It seems fair to expect two measurements from the same individual to be positively correlated, that is, to be more similar than two measurements from different individuals. Furthermore, two measurements taken on the same individual might be highly correlated if they are taken at two time-points close to each other, but less correlated (or maybe independent) if they are taken far apart.

This module describes some fairly simple (and maybe crude) methods for analyzing these data types. These methods include:

  • Separate analysis for each time-point
  • Analysis of summary statistic
  • Random effects approach
The analysis of repeated measurements will continue in the next module, where some more explicit covariance models will be shown.

The simplest approach to analyse repeated measurements would be to include time as a factor, and ignore the dependence between two observations on the same individual. Such an approach may lead to completely wrong conclusions. The essence of the problem is that this is the same as pretending to have more observations than are actually available. Two correlated observations contain less information than two independent observations, because one is partly explained by the other. This approach is unacceptable.
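The loss of information can be made concrete with a small worked example. As a sketch only, assume (this is an illustration, not part of the rats example) that the n measurements on one individual all have variance σ² and the same pairwise correlation ρ; the values n = 8 and ρ = 0.5 are hypothetical.

```latex
\[
\operatorname{Var}(\bar y) \;=\; \frac{\sigma^2}{n}\,\bigl(1+(n-1)\rho\bigr),
\qquad
n_{\text{eff}} \;=\; \frac{n}{1+(n-1)\rho}
\]
\[
n = 8,\ \rho = 0.5:\qquad
1+(n-1)\rho = 4.5,
\qquad
n_{\text{eff}} = \frac{8}{4.5} \approx 1.78
\]
```

Under these assumptions the eight weekly measurements on one individual carry about as much information about that individual's mean level as 1.8 independent measurements would, which is why treating them as eight independent observations overstates the precision.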


11.1.1  Main example: Activity of rats

To investigate the effect of a certain type of exposure on the activity of rats, the following experiment was carried out. The experimental unit was a cage with two rats. During the entire experimental period the rats were exposed daily to the matter under investigation, in a concentration of 1, 2 or 3 units (treatment 1, 2 and 3, respectively). Once per month for 10 months the activity of the rats was measured by placing the rats from one cage in a chamber in which each intersection of a light beam was counted. The total count over a period of 57 hours was used as the result for that cage. Notice that in this setting the "individual" variable is cage.

Summary of experiment:

  • 3 treatments: 1, 2, 3 (concentration)
  • 10 cages per treatment
  • 10 contiguous months
  • The response is activity (count of intersections of light beam during 57 hours).
Here y=log(counts) is used for the analysis, because a residual plot showed that this was the most reasonable scale. The observations are listed in table 11.1 and plotted in figure 11.1. From the figure it seems that the activity decreases from month 1 to month 10 (maybe as a linear function?), and that there may be a small difference between the doses. Plotting the individual curves is a very useful tool in the analysis of repeated measures and should always be the first step. Quite often the main conclusions from the analysis can already be seen from a good plot of the data.

                                          Month
Dose  Cage      1      2      3      4      5      6      7      8      9     10
   1     1  20584  15439  17376  14785  11189  10366   8725   9974   9576   6849
   1     3  23265  16956  16200  12934  13763  11893   9949  10490   8674   7153
   1     5  17065  12429  14757  10524  11783   8828   9016   9635   8028   8099
   1     7  19265  19316  20598  16619  16092  13422  10532  10614   9466   9494
   1     9  21062  14095  13267  12543  12734  12268  12219  11791  10379   8463
   1    11  23456  10939  13270  14089  12986  13723  11878  13338  12442  10094
   1    13  13383  11899  12531  15081  14295  13650   9988  11518  11915   7844
   1    15  22717  22434  23151  13163  10029  10408   9119  10188   9549  11153
   1    17  17437  13950  15535  14199  11540   9568   8481   9143   8117   5765
   1    19  18546  12520  15394  10137   9218   7343   6702   7173   7257   5708
   2    37  18536  16827  19185  12445  13227  10412   9855   9169   9639   6853
   2    39  18831  14043  16493  12562  10397   8568   8599   8818   6011   5062
   2    41  15016  13765  16648  14537  13929  10778   9897   9225   9491   5523
   2    43  22276  15497  22024  15616  12440  11454  10290   9456   9567   7003
   2    45  18943  14834  18403  16232  13085  12679  10489   9495  10896   8836
   2    47  13598  10233  13392  10457   9236   8847   9445   9501   8509   5656
   2    49  20498  22136  22094  19825  18157  11452  14809  14564  14503  10643
   2    51  19586  12710  12745   7294  15757  15296  14097  14308  13933  10210
   2    53  11474   8108  17714  16795  17364  16766  15016  13475  14349   8698
   2    55  10284  10760  15628  10692   8420   5842   6138  10271   8435   4486
   3    73  18459  15805  19924  18337  24197  18790  19333  22234  18291  11595
   3    75  16186  11750  16470  18637  14862  14695  14458  14228  12909   9079
   3    77   9614   8319  11375   9446  13157  11153  10540  11476   8976   6123
   3    79  15688  15016  20929  12706  17351  15089  14605  15952  14795  10434
   3    81  15864  13169  20991  20655  19763  19180  19003  18172  15025  11790
   3    83  17721  14489  19085  21333  17011  16148  15280  14762  15745  10477
   3    85  17606   7558  15646  15194  13036  10316   8172   8977   8378   3962
   3    87  34907  29247  35831  15093   9754  10061   9042  11732   8716   4922
   3    89  15189  14046  14909  14713  14999  14201  13184  13073  14639  10330
   3    91  16388  14538  17548  19416  22034  17761  14488  16068  14773  10595
Table 11.1: The rats data set, here the raw activity counts are listed.

Figure 11.1: The log(counts) for each cage plotted against month. The solid black lines are cages receiving dose=1, dashed blue lines are dose=2, and dotted red lines are dose=3



11.1.2  Separate analysis for each time-point

One way to avoid the problem of correlated measurements is to do a separate analysis for each point in time. This way only one observation from each individual is used in each analysis, and hence the observations are independent. This way of analyzing repeated measurements is not wrong, but it is very inefficient, as all the remaining observations are wasted. This approach avoids the problem instead of dealing with it.

Separate analysis can be carried out for all the observed time-points, but it will likely be very difficult to reach a coherent conclusion from all these sub-tests. These sub-tests will be correlated, and because the correlation structure is not part of the model, it is not possible to tell how strong this correlation is.

Separate analysis can be carried out for selected time-points "far apart". This will (hopefully) cause the separate sub-tests to be uncorrelated, or at least less correlated. Even with uncorrelated tests it will be difficult to reach a coherent conclusion, because of a problem known as mass significance (or multiplicity). For instance, if 20 tests are carried out at a 5% significance level, then on average one of them will be falsely significant even when there is no real effect. This problem is partly solved by using the Bonferroni correction when performing n tests (one for each time-point). The Bonferroni correction simply states that the significance level 0.05/n should be used for each test instead of the usual 0.05.
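The size of the multiplicity problem can be worked out directly. The calculation below is a sketch that assumes the tests are independent and that there is no true effect in any of them; both are simplifying assumptions for the illustration.

```latex
\[
P(\text{at least one false significance}) \;=\; 1-(1-\alpha)^n
\]
\[
n = 20,\ \alpha = 0.05:\qquad 1 - 0.95^{20} \approx 0.64
\]
\[
\text{Bonferroni with } n = 10:\qquad
\frac{\alpha}{n} = \frac{0.05}{10} = 0.005,
\qquad
1 - 0.995^{10} \approx 0.049 \le 0.05
\]
```

So with 20 uncorrected tests there is roughly a 64% chance of at least one false significance, while testing each of ten time-points at level 0.005 keeps the overall risk just below 5%.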

When selecting time-points far apart, it is important that the selection is done independently of the actual observations. Naturally the time-points must not be selected systematically where there is a large (or small) difference between treatments. Ideally the time-points should be selected before the data are collected.

Example: Activity of rats analyzed separately for each month

For the analysis in this section it is assumed that the data is read into a SAS data set named rats with the columns treatm (=1,2,3), cage (=cage number), month (=1,...,10), and lnc (=log(counts)). The data set has 300 lines. Here are the first few lines:

Obs    treatm    cage    month      lnc
1       1        1        1      9.9323
2       1        1        2      9.6447
3       1        1        3      9.7628
4       1        1        4      9.6014
5       1        1        5      9.3227
6       1        1        6      9.2463
.       .        .        .       .
.       .        .        .       .
.       .        .        .       .

To analyze the rats data set separately for each month, a simple one way analysis of variance model with treatment treatm as the only factor is used. The information about cage can not be included, as we only have one observation from each cage in each monthly analysis. The model for each month is:
$$\mathrm{lnc}_i = \mu + \alpha(\mathrm{treatm}_i) + \varepsilon_i, \qquad \varepsilon_i \sim \text{i.i.d. } N(0,\sigma^2), \qquad i = 1,\dots,30$$
To analyze this via proc mixed in SAS, write:

proc sort data=rats; by month; run;
proc mixed data=rats method=REML;
  class treatm month;
  model lnc = treatm;
  by month;
run;
The first line ensures that the data set is sorted in ascending order by month. This is required by proc mixed in order to use the by statement. The second line calls proc mixed with the rats data set, and requests that the restricted/residual maximum likelihood method is used to estimate the parameters. The third line specifies the factors, and the fourth specifies the model. The fifth line states that the model should be fitted for each month separately. This line effectively cuts the data set into ten data sets (one for each observed month) and runs the model on each of them independently.
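When the model is fitted once per month, it can be convenient to gather the ten F-tests in one data set rather than reading them off ten separate outputs. The following is a minimal sketch of one way to do this with the ODS output facility; the data set name ftests is a hypothetical name chosen for this illustration.

```sas
/* Sketch: collect the by-month F-tests for treatm in a single data set.   */
/* "ftests" is a hypothetical data set name, not part of the module.       */
proc sort data=rats; by month; run;

ods output Tests3=ftests;   /* Tests3 is the "Type 3 Tests of Fixed Effects" table */
proc mixed data=rats method=REML;
  class treatm;
  model lnc = treatm;
  by month;
run;

/* One row per month: numerator/denominator DF, F-value and P-value */
proc print data=ftests noobs;
  var month Effect NumDF DenDF FValue ProbF;
run;
```

The model is the same one-way analysis as above; only the ods output statement and the final proc print are added.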

The SAS output is summarized in the following table of F-tests for no treatment effect:

Month        1     2     3     4     5     6     7     8     9    10
F-value   1.22  0.27  1.02  2.30  3.87  4.10  4.70  7.29  4.09  0.88
These F-values should be compared with $F_{95\%,2,27}=3.35$, or with $F_{99.5\%,2,27}=6.49$ if the Bonferroni correction is used.

Five of the ten F-values (months 5 to 9) exceed the uncorrected critical value 3.35, and one of them (month 8) also exceeds the Bonferroni-corrected value 6.49, so the conclusion should be that weak evidence of a difference between the treatment groups has been seen.
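The two critical values quoted above can be looked up with the FINV quantile function in a data step. This is a small sketch; the variable names are hypothetical. The degrees of freedom are 3-1=2 for treatment and 30-3=27 for the residual in each monthly analysis.

```sas
/* Sketch: critical F-values for one monthly test, with and without Bonferroni. */
/* Variable names (alpha, ntests, f_plain, f_bonf) are illustrative only.       */
data _null_;
  alpha  = 0.05;                              /* overall significance level   */
  ntests = 10;                                /* one test per month           */
  f_plain = finv(1 - alpha, 2, 27);           /* uncorrected 95% quantile     */
  f_bonf  = finv(1 - alpha/ntests, 2, 27);    /* Bonferroni: 99.5% quantile   */
  put f_plain= 8.2 f_bonf= 8.2;               /* written to the SAS log       */
run;
```

The values written to the log should agree with the 3.35 and 6.49 used for the comparison above.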



It is possible to make a correct analysis time-by-time, but it is weak and often confusing, because it does not combine all information into one test.


11.1.3  Analysis of summary statistic

Another way to avoid the problem of correlated measurements is to choose a single measure to summarize the individual curves, and then base the analysis on this measure. This again reduces the data set to independent "observations", one for each individual. To analyze the summary data set, standard methods for independent observations, for instance analysis of variance, can be used.

The key is to choose a good summary measure. One possibility is to choose the value at a given time-point, which reduces this summary method to the separate time-point analysis described in the previous section. This choice is poor in most cases, because all other measurements are wasted.

It is difficult to give general advice about the choice of summary measure. Ideally, the summary measure should capture the most important feature of the curve. In some situations the most important feature is the net growth (last minus first), the average growth (slope), or the time to reach the maximum point. It depends on the problem at hand.

Some common choices of summary measures are:

  • Average over time
  • Slope in regression with time (or higher order polynomial coefficients)
  • Total increase (last point minus first point)
  • Area under curve (AUC)
  • Maximum or minimum point

With the right choice of summary measure this type of analysis can be very useful, at least as a first step. These models have relatively few assumptions, and they can be checked via standard residual methods. Of course the downside of this method is that information may be lost by reducing each curve to one single measure.
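Two of the summary measures listed above, the average over time and the slope in a regression on time, can be computed per individual with standard procedures. The following is a sketch only, written for the rats data set described in section 11.1.2; the data set names cageMean and cageSlope and the variable names meanLnc and slope are hypothetical.

```sas
/* Sketch: one summary value per cage. Output names are illustrative only. */
proc sort data=rats; by treatm cage; run;

/* Average over time: the mean of lnc over the ten months for each cage */
proc means data=rats mean noprint;
  var lnc;
  by treatm cage;
  output out=cageMean mean=meanLnc;
run;

/* Slope over time: a simple linear regression of lnc on month for each cage. */
/* OUTEST= stores the estimated coefficient of month in a variable named      */
/* month, which is renamed to slope here.                                     */
proc reg data=rats outest=cageSlope(rename=(month=slope)) noprint;
  model lnc = month;
  by treatm cage;
run;
quit;
```

Each resulting data set has one line per cage, so it can be analyzed with a one-way analysis of variance with treatm as the only factor.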

Example: Activity of rats analyzed via summary measure

The choice of summary measure for the rats data set is partly inspired by figure 11.1. It seems that the average slope is similar for the three treatments, but that the curves from dose=3 tend to be slightly higher than the rest of the curves. To see if this is a significant difference, the logarithm of the total count during all ten months, lnTot=log(Total count), is used as summary measure.

To calculate this summary measure from the previously described data set, the variable containing the log counts from each month lnc must be transformed back to the original counts, then the sum must be calculated, and finally the logarithm must be applied to the sum. These operations can be done in SAS by writing:

data ratsTot; set rats; count=exp(lnc); run;
proc sort data=ratsTot; by cage treatm; run;
proc means data=ratsTot sum noprint;
  var count;
  by cage treatm;
  output out=TenMonthTot sum=count10;
run;
data TenMonthTot; set TenMonthTot; lnTot=log(count10); run;

The new data set is called TenMonthTot and the variable containing the logarithm of the total counts is called lnTot.

This summary data set consists of independent measurements, as each cage is only used to generate one summary observation. Because the observations are now independent, they can be analyzed with a simple one-way ANOVA model:
$$\mathrm{lnTot}_i = \mu + \alpha(\mathrm{treatm}_i) + \varepsilon_i, \qquad \varepsilon_i \sim \text{i.i.d. } N(0,\sigma^2), \qquad i = 1,\dots,30$$
The following lines implement this model in SAS:

proc mixed data=TenMonthTot;
  class treatm;
  model lnTot = treatm;
run;
Notice that the new data set TenMonthTot and the response variable lnTot are used.

The P-value for no treatment effect in this summary model is 5.23%. This is above the standard 5% significance level, but only slightly. In this analysis the entire curve has been summarized into a single measure, so a lot of information has been lost. A P-value this low for the crude summary analysis could indicate that a significant treatment effect might be found with a more sophisticated analysis.




11.1.4  Random effects approach

The two approaches described above both illustrated ways to reduce the data set to independent measures. This section explains the first step in modeling the actual covariance structure in the data set.

As seen in previous modules, for instance the module about hierarchical random effects, the effect of adding a random effect is that two observations from the same level are allowed to be positively correlated. Adding the "individual" factor to the model as a random effect will therefore allow two observations from the same individual to be positively correlated.

Example: Activity of rats analyzed via random effects model

It is reasonable to assume that two observations from the same cage could be correlated, so the model with cage as random effect is used. The factor month and the interaction between month and treatment are included. This was not possible in the previous models, because each curve was reduced into one number. In this analysis all observations are included into one coherent analysis. The model is:
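$$\mathrm{lnc}_i = \mu + \alpha(\mathrm{treatm}_i) + \beta(\mathrm{month}_i) + \gamma(\mathrm{treatm}_i,\mathrm{month}_i) + d(\mathrm{cage}_i) + \varepsilon_i,$$
where $i = 1,\dots,300$, $d(\mathrm{cage}_i) \sim N(0,\sigma_d^2)$, $\varepsilon_i \sim N(0,\sigma^2)$, and all are independent.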

Recall from previous modules that the covariance structure for this model is:
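$$\mathrm{cov}(y_{i_1}, y_{i_2}) = \begin{cases} 0, & \text{if } \mathrm{cage}_{i_1} \neq \mathrm{cage}_{i_2} \text{ and } i_1 \neq i_2 \\ \sigma_d^2, & \text{if } \mathrm{cage}_{i_1} = \mathrm{cage}_{i_2} \text{ and } i_1 \neq i_2 \\ \sigma_d^2 + \sigma^2, & \text{if } i_1 = i_2 \end{cases}$$
In other words, this is the variance structure where two observations from different cages are uncorrelated, and two observations from the same cage are positively correlated with correlation coefficient $\sigma_d^2/(\sigma_d^2+\sigma^2)$.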

The following lines implement this model in SAS:

proc mixed data=rats;
  class treatm cage month;
  model lnc = treatm month treatm*month / ddfm=satterth;
  random cage;
run;
Notice the random cage statement to specify the random cage effect, and the option /ddfm=satterth to choose Satterthwaite's approximation of the degrees of freedom. The relevant part of the SAS output is listed below:
Covariance Parameter Estimates
Cov Parm     Estimate
cage          0.02748
Residual      0.03790

-2 Res Log Likelihood      8.6

                 Num     Den
Effect            DF      DF    F Value    Pr > F
treatm             2      27       3.22    0.0557
month              9     243      46.11    <.0001
treatm*month      18     243       2.12    0.0059

This output gives estimates of the variance parameters ($\sigma_d^2=0.02748$ and $\sigma^2=0.03790$), twice the negative restricted/residual log likelihood ($-2\ell_{re}=8.6$), and an ANOVA table for the fixed effects of the model. From this ANOVA table it is seen that the interaction between treatment and month is significant with a P-value of 0.0059. The conclusion from this model is that treatment does have an effect on the activity, but the effect is not the same in all ten months.
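As a small worked step, the two variance estimates can be inserted into the correlation formula of the random effects model to see how strongly two observations from the same cage are estimated to be correlated:

```latex
\[
\frac{\sigma_d^2}{\sigma_d^2+\sigma^2}
\;=\; \frac{0.02748}{0.02748+0.03790}
\;=\; \frac{0.02748}{0.06538}
\;\approx\; 0.42
\]
```

Under this model the estimated correlation of about 0.42 is the same for any two months from the same cage, whether they are adjacent or nine months apart.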



The main problem with this random effects approach is that all measurements on the same individual are assumed equally correlated, but some measurements are taken far apart and some measurements are taken close to each other, so this assumption is not always valid. The next module will suggest a few ways to deal with this problem.

However, this random effects approach may give reasonable results for short series (with 2, 3, or 4 measurements on each individual) since the assumption of equal correlation may be ok in those cases.

This random effects approach is also known as the split-plot approach, or the split-plot model. It is possible to view repeated measurements data as resulting from a kind of split-plot experiment, with individuals as the "main-plots" to which the treatments are applied. The "sub-plots" are then the single measurements on each individual. This interpretation is a bit weak, as the single measurements on each individual (typically at different times) cannot be randomized within the individual.


11.1.5  Pros and cons of simple approaches

In this module a few simple approaches to the analysis of repeated measurements have been described. In many practical cases these simple approaches, especially the summary method, will give a sufficient and useful analysis of the data. Even in those cases where more sophisticated models are needed, it is often helpful to run a few simple models first. Here follow a few pros and cons of the different methods:

Separate analysis for each time-point

    + Not wrong
    -- Can be confusing
    -- Difficult to reach coherent conclusion
    -- In general not very informative
Analysis of summary statistic

    + Good method with few and easily checked assumptions
    -- Important to choose good summary measure(s)
Random effects approach

    + Good method for short series
    + Uses all observations
    -- Usually not good for long series

