This post was originally part of my previous post about linear models. However, I later decided to split it into several texts because it was effectively too long and complex to navigate.
If you struggle to follow the code in this page please refer to this post (for example for the necessary packages): Linear Models (lm, ANOVA and ANCOVA) in Agriculture
There are several differences between nlme and lme4 and I am not sure which is actually better. What I found is that probably lme4 is the most popular, but nlme is used for example to fit generalized addictive mixed effects models in the package mgcv. For this reason probably the best thing would be to know how to use both packages.
As you can see from the summary above, in this table there are no p-values, so it is a bit difficult to know which levels are significant for the model. We can solve this by installing and/or loading the package lmerTest. If we load lmerTest and run again the same function we would obtain the following summary table:
As you can see now the p-values are showing and we can assess the significance for each term.
Clearly, all the functions we used above for the function lme can be used also with the package lme4 and lmerTest. For example, we can produce the anova table with the function anova or compute the R Squared with the function r.squaredGLMM.
When we are dealing with random slope and intercept we would use the following syntax:
If you struggle to follow the code in this page please refer to this post (for example for the necessary packages): Linear Models (lm, ANOVA and ANCOVA) in Agriculture
Linear Mixed-Effects Models
This class of models is used to account for more than one
source of random variation. For example, assume we have a dataset where we are trying to model yield as a function of nitrogen levels. However, the data were collected in many different farms. In this case, each farm would need
to be consider a cluster and the model would need to take this clustering into
account. Another common set of experiments where linear mixed-effects models
are used is repeated measures, where time provides an additional source of
correlation between measures. For these models we do not need to worry about
the assumptions from previous models, since these are very robust against all
of them. For example, for unbalanced design with blocking, probably these
methods should be used instead of the standard ANOVA.
At the beginning on this tutorial we explored the equation
that supports linear model:
This equation can be divided into two components, the
fixed and random effects. For fixed effect we refer to those variables we are
using to explain the model. These may be factorial (in ANOVA), continuous or a
mixed of the two (ANCOVA) and they can also be the blocks used in our design.
The other component in the equation is the random effect, which provides a
level of uncertainty that it is difficult to account for in the model. For example,
when we work with yield we might see differences between plants grown from
similar soils and conditions. These may be related to the seeds or to other
factors and are part of the within-subject variation that we cannot explain.
There are times however where in the data there are multiple
sources of random variation. For example, data may be clustered in separate
fields or separate farms. This will provide an additional source of random
variation that needs to be taken into account in the model. To do so the
standard equation can be amended in the following way:
This is referred to as a random intercept
model, where the random variation is split into a cluster specific variation u and the standard error term. Effectively, this model assumes that each cluster will only have an effect on the slope of the linear model. In other word, we assume that data collected at different farm will have the same correlation pattern but will be shifted, see image below (source: Mixed Models):
A more complex form, that is normally used for repeated measures is the random slope and intercept model:
A more complex form, that is normally used for repeated measures is the random slope and intercept model:
Where we add a new source of random variation v related to time T. In this case we assume that the random variation happends not only by changing the intercept of the linear model, but also its slope. The image below from the same website should again clarify things:
As a general rule we can use plotting to determine if and what random effects to use for modelling our data. In the examples above, a simple xy plot with colour would provide a lot of information. Alternatively, we could use the plotting method with ggplot2 and the function facet_wrap to divide our scatterplots by factors and see if there are changes only in intercept or also slope.
Random Intercept Model for Clustered Data
In the following examples we will use the function lme in the package nlme, so please install and/or load the package first. For this example we are using the same dataset lasrosas.corn from package agridat we used in the previous post Linear Models in Agriculture
Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describes some clusters. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:
Just to explain the syntax to use linear mixed-effects model in R for cluster data, we will assume that the factorial variable rep in our dataset describes some clusters. To fit a mixed-effects model we are going to use the function lme from the package nlme. This function can work with unbalanced designs:
lme1 = lme(yield ~ nf + bv * topo, random= ~1|rep, data=dat)
The syntax is very similar to all the models we fitted
before, with a general formula describing our target variable yield and all the
treatments, which are the fixed effects of the model. Then we have the option
random, which allows us to include an additional random component for the
clustering factor rep. In this case the ~1 indicates that the random effect
will be associated with the intercept.
Once again we can use the function summary to explore our
results:
> summary(lme1)
Linear mixed-effects model fit by REML
Data: dat
AIC BIC logLik
27648.36 27740.46 -13809.18
Random effects:
Formula: ~1 | rep
(Intercept) Residual
StdDev: 0.798407 13.3573
Fixed effects: yield ~ nf + bv * topo
Value Std.Error DF t-value p-value
(Intercept) 327.3304 14.782524 3428 22.143068 0
nfN1 3.9643 0.788049 3428 5.030561 0
nfN2 5.2340 0.790104 3428 6.624449 0
nfN3 5.4498 0.789084 3428 6.906496 0
nfN4 7.5286 0.789551 3428 9.535320 0
nfN5 7.7254 0.789111 3428 9.789976 0
bv -1.4685 0.085507 3428 -17.173569 0
topoHT -233.3675 17.143956 3428 -13.612232 0
topoLO -251.9750 20.967003 3428 -12.017693 0
topoW -146.4066 16.968453 3428 -8.628162 0
bv:topoHT 1.1945 0.097696 3428 12.226279 0
bv:topoLO 1.4961 0.123424 3428 12.121624 0
bv:topoW 0.7873 0.097865 3428 8.044485 0
We can also use the function Anova
to display the ANOVA table:
> Anova(lme2, type=c("III"))
Analysis of Deviance Table (Type III tests)
Response: yield
Chisq Df Pr(>Chisq)
(Intercept) 752.25 1 < 2.2e-16 ***
nf 155.57 5 < 2.2e-16 ***
bv 291.49 1 < 2.2e-16 ***
topo 236.52 3 < 2.2e-16 ***
year 797.13 1 < 2.2e-16 ***
bv:topo 210.38 3 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
We might be interested in understanding if fitting a more
complex model provides any advantage in terms of accuracy, compared with a
model where no additional random effect is included. To do so we can compare
this new model with mod6, which we created with the gls
function and includes the same treatment structure. We can do that with the function anova ,
specifying both models:
> anova(lme1, mod6)
Model df AIC BIC logLik Test L.Ratio p-value
lme1 1 15 27648.36 27740.46 -13809.18
mod6 2 14 27651.21 27737.18 -13811.61 1 vs 2 4.857329 0.0275
As you can see there is a decrease in AIC for the model
fitted with lme, and the difference is
significant (p-value below 0.05). Therefore this new model where clustering is
accounted for is better than the one without an additional random effect, even
though only slightly. In this case we would need to decide if fitting a more
complex model (which is probably more difficult to explain to readers) is the
best way.
Another way to assess the accuracy of GLS and Mixed Effects Models is through the use of pseudo R squared, which are indexes that can be interpreted as the normal R-Squared but calculated in different ways, since in these more complex models we do not calculate the sum of squares.
There are two important functions for this, both included in the package MuMIn:
Another way to assess the accuracy of GLS and Mixed Effects Models is through the use of pseudo R squared, which are indexes that can be interpreted as the normal R-Squared but calculated in different ways, since in these more complex models we do not calculate the sum of squares.
There are two important functions for this, both included in the package MuMIn:
library(MuMIn)
> r.squaredLR(mod6)
[1] 0.5469906
attr(,"adj.r.squared")
[1] 0.5470721
> r.squaredGLMM(lme1)
R2m R2c
0.5459845 0.5476009
The first function r.squaredLR can be used for GLS models and provides both and R-Squared and an Adjusted R-Squared. The second function, r.squaredGLMM, is specific for mixed-effects models and provides two measures: R2m and R2c. The first reports the R2 of the model with just fixed effects, while the second the R squared of the full model.
In this case we can see again that the R squared are similar between models, and most importantly R2c is only slightly different compared to R2m, which means that including random effects does not improve the accuracy.
In this case we can see again that the R squared are similar between models, and most importantly R2c is only slightly different compared to R2m, which means that including random effects does not improve the accuracy.
Random Intercept and Slope for repeated measures
If we collected data at several time steps we are looking at
a repeated measures analysis, which is most cases can be treated as mixed random slope and intercept model. Again, we cannot just assume that because we have collected data over time we have a random slope and intercept, we always need to do some plotting first and take a closer look at our data. In cases like this where we are dealing with a factorial variable, we may be forced to rely on barcharts, divided by years. In such cases it may be difficult to understand whether we need a model this complex. So it may be that the only way is to just compare different models with anova (this function can be used with more than 2 models if needed).
The code to create such a model is the following:
The code to create such a model is the following:
lme2 = lme(yield ~ nf + bv * topo + year, random= ~year|rep, data=dat)
summary(lme2)
Anova(lme2, type=c("III"))
The syntax is very similar to what we wrote before, except
that now the random component includes both time and clusters. Again we can use
summary to get more info about the model. We can also use again the function anova to compare this with the previous model:
> anova(lme1, lme2)
Model df AIC BIC logLik Test L.Ratio p-value
lme1 1 15 27648.36 27740.46 -13809.18
lme2 2 18 26938.83 27049.35 -13451.42 1 vs 2 715.5247 <.0001
Warning message:
In anova.lme(lme1, lme2) :
fitted objects with different fixed effects. REML comparisons are not meaningful.
From this output it is clear that the new model is better
that the one before and their difference in highly significant. If this happens it is generally better to adopt the more complex model.
We can extract only the effects for the random components
using the function ranef:
> ranef(lme2)
(Intercept) year
R1 -0.3468601 -1.189799e-07
R2 -0.5681688 -1.973702e-07
R3 0.9150289 3.163501e-07
This tells us the changes in yield for each cluster and time
step.
We can also do the same for the fixed effects, and this will return the coefficients of the model:
We can also do the same for the fixed effects, and this will return the coefficients of the model:
> fixef(lme2)
(Intercept) nfN1 nfN2 nfN3 nfN4
-1.133614e+04 3.918006e+00 5.132136e+00 5.368513e+00 7.464542e+00
nfN5 bv topoHT topoLO topoW
7.639337e+00 -1.318391e+00 -2.049979e+02 -2.321431e+02 -1.136168e+02
year bv:topoHT bv:topoLO bv:topoW
5.818826e+00 1.027686e+00 1.383705e+00 5.998379e-01
To have an idea of their confidence interval we can use the
function intervals (package nlme):
> intervals(lme2, which = "fixed")
Approximate 95% confidence intervals
Fixed effects:
lower est. upper
(Intercept) -1.214651e+04 -1.133614e+04 -1.052576e+04
nfN1 2.526139e+00 3.918006e+00 5.309873e+00
nfN2 3.736625e+00 5.132136e+00 6.527648e+00
nfN3 3.974809e+00 5.368513e+00 6.762216e+00
nfN4 6.070018e+00 7.464542e+00 8.859065e+00
nfN5 6.245584e+00 7.639337e+00 9.033089e+00
bv -1.469793e+00 -1.318391e+00 -1.166989e+00
topoHT -2.353450e+02 -2.049979e+02 -1.746508e+02
topoLO -2.692026e+02 -2.321431e+02 -1.950836e+02
topoW -1.436741e+02 -1.136168e+02 -8.355954e+01
year 5.414742e+00 5.818826e+00 6.222911e+00
bv:topoHT 8.547273e-01 1.027686e+00 1.200644e+00
bv:topoLO 1.165563e+00 1.383705e+00 1.601846e+00
bv:topoW 4.264933e-01 5.998379e-01 7.731826e-01
attr(,"label")
[1] "Fixed effects:"
Syntax with lme4
Another popular package to perform mixed-effects models we could also use the package lme4 and the function lmer.
For example, to fit the model with random intercept (what we called lme1) we would use the following syntax in lme4:
> lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)
>
> summary(lmer1)
Linear mixed model fit by REML ['lmerMod']
Formula: yield ~ nf + bv * topo + (1 | rep)
Data: dat
REML criterion at convergence: 27618.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.4267 -0.7767 -0.1109 0.7196 3.6892
Random effects:
Groups Name Variance Std.Dev.
rep (Intercept) 0.6375 0.7984
Residual 178.4174 13.3573
Number of obs: 3443, groups: rep, 3
Fixed effects:
Estimate Std. Error t value
(Intercept) 327.33043 14.78252 22.143
nfN1 3.96433 0.78805 5.031
nfN2 5.23400 0.79010 6.624
nfN3 5.44980 0.78908 6.906
nfN4 7.52862 0.78955 9.535
nfN5 7.72537 0.78911 9.790
bv -1.46846 0.08551 -17.174
topoHT -233.36750 17.14396 -13.612
topoLO -251.97500 20.96700 -12.018
topoW -146.40655 16.96845 -8.628
bv:topoHT 1.19446 0.09770 12.226
bv:topoLO 1.49609 0.12342 12.122
bv:topoW 0.78727 0.09786 8.044
Correlation matrix not shown by default, as p = 13 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
There are several differences between nlme and lme4 and I am not sure which is actually better. What I found is that probably lme4 is the most popular, but nlme is used for example to fit generalized addictive mixed effects models in the package mgcv. For this reason probably the best thing would be to know how to use both packages.
As you can see from the summary above, in this table there are no p-values, so it is a bit difficult to know which levels are significant for the model. We can solve this by installing and/or loading the package lmerTest. If we load lmerTest and run again the same function we would obtain the following summary table:
> lmer1 = lmer(yield ~ nf + bv * topo + (1|rep), data=dat)
>
> summary(lmer1)
Linear mixed model fit by REML t-tests use Satterthwaite approximations to degrees of freedom [
lmerMod]
Formula: yield ~ nf + bv * topo + (1 | rep)
Data: dat
REML criterion at convergence: 27618.4
Scaled residuals:
Min 1Q Median 3Q Max
-3.4267 -0.7767 -0.1109 0.7196 3.6892
Random effects:
Groups Name Variance Std.Dev.
rep (Intercept) 0.6375 0.7984
Residual 178.4174 13.3573
Number of obs: 3443, groups: rep, 3
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 327.33043 14.78252 3411.00000 22.143 < 2e-16 ***
nfN1 3.96433 0.78805 3428.00000 5.031 5.14e-07 ***
nfN2 5.23400 0.79010 3428.00000 6.624 4.03e-11 ***
nfN3 5.44980 0.78908 3428.00000 6.906 5.90e-12 ***
nfN4 7.52862 0.78955 3428.00000 9.535 < 2e-16 ***
nfN5 7.72537 0.78911 3428.00000 9.790 < 2e-16 ***
bv -1.46846 0.08551 3428.00000 -17.174 < 2e-16 ***
topoHT -233.36750 17.14396 3429.00000 -13.612 < 2e-16 ***
topoLO -251.97500 20.96700 3430.00000 -12.018 < 2e-16 ***
topoW -146.40655 16.96845 3430.00000 -8.628 < 2e-16 ***
bv:topoHT 1.19446 0.09770 3429.00000 12.226 < 2e-16 ***
bv:topoLO 1.49609 0.12342 3430.00000 12.122 < 2e-16 ***
bv:topoW 0.78727 0.09786 3430.00000 8.044 1.33e-15 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation matrix not shown by default, as p = 13 > 12.
Use print(x, correlation=TRUE) or
vcov(x) if you need it
As you can see now the p-values are showing and we can assess the significance for each term.
Clearly, all the functions we used above for the function lme can be used also with the package lme4 and lmerTest. For example, we can produce the anova table with the function anova or compute the R Squared with the function r.squaredGLMM.
When we are dealing with random slope and intercept we would use the following syntax:
lmer2 = lmer(yield ~ nf + bv * topo + year + (year|rep), data=dat)