# Extras
## Extracting Summary Statistics From a Model Fit in R
If you are new to R, here we will run through some details on the type of objects these data are stored in and how to access specific elements. This can be helpful for writing automated analysis scripts. Because the output of a regression model fit needs to contain different types of data in different formats and structures, it is stored in a bespoke object, with slots for the different parts of the output. These slots are named and can be accessed using the `$` operator. For example, to extract just the table of estimated regression coefficients, which is named `coefficients`, we use the following command:
```r
model <- lm(bmi ~ age + sex, demoDat)
summary(model)$coefficients
```
|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 24.22402808 | 2.26589809 | 10.6906962 | 3.581804e-22 |
| age | 0.04090023 | 0.04560467 | 0.8968430 | 3.706760e-01 |
| sexmale | 0.07188731 | 0.52593207 | 0.1366856 | 8.913907e-01 |
We can determine the type of object the coefficients table is stored in using the function `class()`.

```r
class(summary(model)$coefficients)
mode(summary(model)$coefficients)
```

- 'matrix'
- 'array'
The output of the command tells us it is stored in a matrix, a data type in R where you have rows and columns. A similar data type is the data.frame. The difference between the two is that a matrix can only contain one type of data, which we can determine with the function `mode()`; here it contains exclusively numeric values. In contrast, in a data.frame each column can be a different data type. Our demoDat data are stored in a data.frame, and the output of the `str()` function tells us the data type assigned to each column.
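To make the distinction concrete, here is a minimal sketch contrasting the two data types. The small data frame below is made up for illustration and is not the demoDat data:

```r
## Hypothetical toy data for illustration only; not the demoDat data.
df <- data.frame(age = c(25, 32, 41), sex = c("female", "male", "male"))

class(df)  ## "data.frame" - each column keeps its own type
str(df)    ## age is numeric, sex is character

## Coercing to a matrix forces a single shared type:
m <- as.matrix(df)
mode(m)    ## "character" - the numeric ages were converted to text
```

Because a matrix must hold a single type, mixing numbers and text silently converts everything to character, which is why the coefficients table, being all numeric, can safely live in a matrix.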
Let’s say we wanted to extract a single value from this matrix; there are a number of commands we can use. For example, let’s extract the p-value for the age regression slope parameter using the slicing operator `[`.
We can provide the row (2nd) and column (4th) number of the matrix entry we want:

```r
summary(model)$coefficients[2, 4]
```
Alternatively, we can refer to the row or column by name:

```r
summary(model)$coefficients["age", 4]
```
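Row and column names can also be combined. The sketch below simulates a stand-in for demoDat (the values are made up, so the estimates will differ from the table above) and checks that all three slicing styles return the same entry:

```r
## Simulated stand-in for demoDat, for illustration only.
set.seed(42)
demoDat <- data.frame(bmi = rnorm(100, 25, 3),
                      age = rnorm(100, 40, 10),
                      sex = sample(c("male", "female"), 100, replace = TRUE))
model <- lm(bmi ~ age + sex, demoDat)
coefs <- summary(model)$coefficients

coefs[2, 4]               ## row and column numbers
coefs["age", 4]           ## row name, column number
coefs["age", "Pr(>|t|)"]  ## row and column names - all three are identical
```

Using names rather than positions makes scripts more robust: the extracted value stays correct even if predictors are later added to or reordered in the model formula.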
We can see a list of all the components we can extract from the output of `lm()` by running `names()` on the lm object. All of these can be extracted with the `$` operator.

```r
names(model)
model$call ## example of how to extract any of the components listed by the previous command
```
- 'coefficients'
- 'residuals'
- 'effects'
- 'rank'
- 'fitted.values'
- 'assign'
- 'qr'
- 'df.residual'
- 'contrasts'
- 'xlevels'
- 'call'
- 'terms'
- 'model'
lm(formula = bmi ~ age + sex, data = demoDat)
Similarly, we can run `names()` on the `summary(model)` object, as shown here, to get a list of all the slots available from that object.

```r
names(summary(model))
```
- 'call'
- 'terms'
- 'residuals'
- 'coefficients'
- 'aliased'
- 'sigma'
- 'df'
- 'r.squared'
- 'adj.r.squared'
- 'fstatistic'
- 'cov.unscaled'
Note these are different from those available for the model object; for example, the $R^{2}$ and $\overline{R}^{2}$ are only extractable from the `summary(model)` object.

```r
summary(model)$r.squared
summary(model)$adj.r.squared
```
Note that as well as directly accessing these slots using the `$` operator, there also exist predefined functions to extract the commonly requested outputs from the model fit. We have already taken advantage of one of these, `summary()`; others include `coef()`, `effects()`, `residuals()`, `fitted()` and `predict.lm()`.
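As a sketch of these helpers in action, again using simulated stand-in data rather than the real demoDat:

```r
## Simulated stand-in for demoDat, for illustration only.
set.seed(42)
demoDat <- data.frame(bmi = rnorm(100, 25, 3),
                      age = rnorm(100, 40, 10),
                      sex = sample(c("male", "female"), 100, replace = TRUE))
model <- lm(bmi ~ age + sex, demoDat)

coef(model)             ## equivalent to model$coefficients
head(fitted(model))     ## fitted value for each observation
head(residuals(model))  ## observed bmi minus fitted value

## predict() can also be applied to new observations:
predict(model, newdata = data.frame(age = 50, sex = "male"))
```

These extractor functions are generally preferable to reaching into the object with `$`, since they work consistently across many model classes (`lm`, `glm` and so on).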
## Link Between the F-test and t-test in Simple Linear Regression
We can also use an F-test to test a single predictor variable in our model.
```r
model <- lm(weight ~ height)
summary(model)
anova(model)
```
```
Call:
lm(formula = weight ~ height)

Residuals:
    Min      1Q  Median      3Q     Max
-33.487  -8.135   0.541   7.116  35.621

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -39.52197    7.54114  -5.241 3.42e-07 ***
height        0.69876    0.04063  17.197  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.51 on 248 degrees of freedom
Multiple R-squared:  0.5439, Adjusted R-squared:  0.5421
F-statistic: 295.7 on 1 and 248 DF,  p-value: < 2.2e-16
```
|  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| height | 1 | 46302.06 | 46302.0633 | 295.7402 | 3.618647e-44 |
| Residuals | 248 | 38827.71 | 156.5633 | NA | NA |
In the `summary(model)` output, we can see at the bottom the results of testing the full model with an F-test. If we want to see the full table of sums of squares statistics, we can use the `anova()` function on our fitted regression model.

Comparing this table with the coefficients table, we can see that the p-value from the t-test of the height regression parameter and the p-value from the F-test for the full model are identical. This is not a coincidence; it is always true in the specific case of simple linear regression.
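The link can be checked numerically: with a single predictor, the F statistic is exactly the square of the slope's t statistic. The sketch below uses simulated height and weight data, since the original data are not reproduced here:

```r
## Simulated height/weight data for illustration only.
set.seed(42)
height <- rnorm(250, 170, 10)
weight <- -40 + 0.7 * height + rnorm(250, sd = 12)
model <- lm(weight ~ height)

tval <- summary(model)$coefficients["height", "t value"]
fval <- summary(model)$fstatistic["value"]
tval^2        ## squaring the t statistic ...
unname(fval)  ## ... reproduces the F statistic exactly
```

Because the two statistics are deterministically related ($F = t^{2}$), the two tests must also assign the same p-value to the model.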
## Extracting Variance Explained Statistics
Finally, we will look at the $R^{2}$ and $\overline{R}^{2}$ statistics. We can see from the `summary(model)` output above that these are automatically calculated. For the simple linear regression model we have fitted, they can be extracted with `summary(model)$r.squared` and `summary(model)$adj.r.squared` respectively.
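As a quick check of what the extracted slot represents, $R^{2} = 1 - RSS/TSS$ can be computed by hand from the residuals and compared with the stored value, here on the same simulated data as above:

```r
## Simulated height/weight data for illustration only.
set.seed(42)
height <- rnorm(250, 170, 10)
weight <- -40 + 0.7 * height + rnorm(250, sd = 12)
model <- lm(weight ~ height)

rss <- sum(residuals(model)^2)         ## residual sum of squares
tss <- sum((weight - mean(weight))^2)  ## total sum of squares
1 - rss / tss             ## manual R-squared
summary(model)$r.squared  ## matches the extracted slot
```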