Assessing the accuracy of our model
There are several ways to check the accuracy of our models. Some are printed directly in R within the summary output; others are just as easy to calculate with specific functions. Please take a look at my previous post for more info on the code.
R Squared
This is probably the most commonly used statistic, and it tells us the percentage of variance in the target variable that is explained by the model. It can be computed as the ratio of the regression sum of squares to the total sum of squares. This is one of the standard measures of accuracy that R prints out, through the function summary, for linear models and ANOVAs.
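As a minimal sketch of the ratio described above, the R squared can be recomputed by hand and compared with the value stored in the summary output. The built-in mtcars data and the formula mpg ~ wt + hp are only illustrative assumptions, not the model used in the previous post.

```r
# Illustrative model on a built-in dataset (an assumption, not the post's data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

y <- mtcars$mpg
ss_total      <- sum((y - mean(y))^2)            # total sum of squares
ss_regression <- sum((fitted(fit) - mean(y))^2)  # regression sum of squares

r2_manual  <- ss_regression / ss_total
r2_summary <- summary(fit)$r.squared             # value printed by summary()

# The two should agree for a least-squares model with an intercept
all.equal(r2_manual, r2_summary)
```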
Adjusted R Squared
This is a form of R squared that is adjusted for the number of predictors in the model. It can be computed as follows:

$$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}$$

where R^2 is the R squared of the model, n is the sample size and p is the number of terms (or predictors) in the model. This index is extremely useful to determine whether our model is overfitting the data. Overfitting happens particularly when the sample size is small: in such cases, if we fill the model with more predictors, we may end up increasing the R squared simply because the model starts adapting to the noise (or random error) rather than properly describing the data. It is generally a good sign if the adjusted R squared is similar to the standard R squared.
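The adjustment can be sketched in a few lines, using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The mtcars data, the model formula and the two random "noise" predictors below are assumptions made purely for illustration.

```r
# Illustrative model on a built-in dataset (an assumption, not the post's data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

r2 <- summary(fit)$r.squared
n  <- nrow(mtcars)            # sample size
p  <- length(coef(fit)) - 1   # number of predictors, intercept excluded

adj_r2_manual  <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2_summary <- summary(fit)$adj.r.squared
all.equal(adj_r2_manual, adj_r2_summary)

# Adding pure-noise predictors can never lower the plain R squared,
# which is why the adjusted version is worth comparing against it
set.seed(1)
noisy <- mtcars
noisy$noise1 <- rnorm(n)
noisy$noise2 <- rnorm(n)
fit_noise <- lm(mpg ~ wt + hp + noise1 + noise2, data = noisy)

r2_noise     <- summary(fit_noise)$r.squared
adj_r2_noise <- summary(fit_noise)$adj.r.squared
c(r2 = r2, r2_noise = r2_noise, adj_r2 = adj_r2_summary, adj_r2_noise = adj_r2_noise)
```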
Root Mean Squared Deviation or Root Mean Squared Error
The previous indexes measure the amount of variance in the target variable that can be explained by our model. This is a good indication, but in some cases we are more interested in quantifying the error in the same unit of measurement as the variable. In such cases we need to compute indexes that average the residuals of the model. The problem is that residuals are both positive and negative, and their distribution should be fairly symmetrical around zero (this is actually one of the assumptions in most linear models, so if this is not the case we should be worried). This means that their average will be zero or very close to it (exactly zero for a least-squares fit with an intercept). So we need to find other indexes to quantify the average residual, for example by averaging the squared residuals:

$$RMSE = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(\hat{Y}_t - Y_t\right)^2}$$

This is the square root of the mean of the squared residuals, with $\hat{Y}_t$ being the estimated value at point t, $Y_t$ being the observed value at t and n being the sample size. The RMSE has the same unit of measurement as the variable Y.
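A minimal sketch of this calculation in R is shown below, again on an illustrative mtcars model (an assumption). Note that this version divides by n, so it is slightly smaller than the residual standard error reported by summary, which divides by the residual degrees of freedom.

```r
# Illustrative model on a built-in dataset (an assumption, not the post's data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)   # observed minus fitted values

mean(res)               # effectively zero, so not informative on its own

mse  <- mean(res^2)     # mean of the squared residuals (squared units)
rmse <- sqrt(mse)       # back in the units of the response (miles per gallon here)

rmse
summary(fit)$sigma      # residual standard error: divides by n - p - 1 instead of n
```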
Mean Squared Deviation or Mean Squared Error
This is simply the mean of the squared residuals, i.e. the RMSE before taking the square root, so it is expressed in squared units and is reported less often. The issue with both the RMSE and the MSE is that, since they square the residuals, they tend to be more affected by extreme values. This means that even if our model explains the large majority of the variation in the data very well, a few exceptions will inflate the value of the RMSE if their discrepancy between observed and predicted values is large. Since these large residuals may be caused by potential outliers, this issue may lead to an overestimation of the error.
Mean Absolute Deviation or Mean Absolute Error
To reduce the problem with potential outliers, we can use the mean absolute error, where we average the absolute value of the residuals:

$$MAE = \frac{1}{n}\sum_{t=1}^{n}\left|\hat{Y}_t - Y_t\right|$$

where $\hat{Y}_t$ is the estimated value, $Y_t$ is the observed value and n is the sample size. This index is more robust against large residuals. Since the RMSE is still widely used, even though its problems are well known, it is always better to calculate and present both in a research paper.
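The different sensitivity of the two indexes can be seen with a small made-up example. The helper functions rmse and mae and the toy vectors below are hypothetical, introduced only for illustration; the predictions are kept fixed so that only the single extreme observation changes.

```r
# Hypothetical helper functions (not part of base R)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))

# Made-up data: every residual has magnitude 0.5
obs  <- c(10, 12, 11, 13, 12, 11)
pred <- c(10.5, 11.5, 11.5, 12.5, 12.5, 10.5)

rmse(obs, pred)   # 0.5
mae(obs, pred)    # 0.5

# Replace one observation with an extreme value (residual of 10.5)
obs_out <- obs
obs_out[6] <- 21

rmse_out <- rmse(obs_out, pred)
mae_out  <- mae(obs_out, pred)

# The RMSE grows much more than the MAE for the same single outlier
c(rmse = rmse_out, mae = mae_out)
```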
Akaike Information Criterion
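This is another popular index we have used in previous posts to compare different models. It is very popular because it balances the goodness of fit (measured through the likelihood) against the number of terms in the model, thus allowing us to account for overfitting. It can be computed as follows:

$$AIC = 2p - 2\ln(\hat{L})$$

where $\hat{L}$ is the maximized value of the likelihood of the model and p is again the number of terms in the model (strictly speaking, the number of estimated parameters). When comparing models fitted to the same data, lower values are preferred.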
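As a sketch, the AIC can be obtained with the base R functions AIC and logLik and checked against the general definition, two times the number of estimated parameters minus two times the maximized log-likelihood. The two mtcars models are illustrative assumptions. Note that for linear models R also counts the residual variance as an estimated parameter.

```r
# Two illustrative nested models on a built-in dataset (an assumption)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# Side-by-side comparison; the model with the lower AIC is preferred
AIC(fit_small, fit_large)

# Manual calculation for one model
ll    <- logLik(fit_large)
n_par <- attr(ll, "df")   # estimated parameters: coefficients plus residual variance
aic_manual <- 2 * n_par - 2 * as.numeric(ll)

all.equal(aic_manual, AIC(fit_large))
```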