Multiple Linear Regression#

Learning Objectives#

  • Learn how regression analysis can be extended beyond simple linear regression to handle multiple predictor variables and different types of relationships.

  • Understand the concept and applications of multiple linear regression, including how to include multiple predictor variables in a regression model.

  • Gain proficiency in using the lm() function in R to fit multiple linear regression models and interpret the output.

  • Learn how to assess the effect and significance of multiple predictor variables simultaneously on an outcome variable.

  • Identify and understand the assumptions that must be met for multiple linear regression, including the linear relationship, normal distribution of errors, independence of samples, equal variances, and non-collinearity of predictor variables.

Expanding the regression framework#

So far we have considered a single type of regression analysis: one continuous predictor variable, one continuous outcome variable, and a straight line fitted between them. This has enabled us to get to grips with the core concepts, but regression can do so much more than this. It is an incredibly flexible framework that can handle

  • different types of variables

  • multiple variables

  • different relationships between variables

We will now look at how we extend the methodology to allow more complex analysis designs. You should think of regression as a modular approach in which you select the necessary components depending on

  • the properties of the outcome variable(s)

  • the properties of the predictor variable(s)

  • the relationship that you want to model between each predictor variable and outcome variable.

Different types of regression analysis you may be familiar with include:

Simple Linear Regression

  • 1 continuous outcome variable

  • 1 predictor variable

Multiple Linear Regression

  • 1 continuous outcome variable

  • \(> 1\) predictor variables

Multivariate Linear Regression

  • multiple correlated dependent variables

  • \(\geq 1\) predictor variables

Next, we are going to discuss Multiple Linear Regression, which allows us to look at the effect of multiple predictor variables simultaneously on an outcome variable.

Multiple Linear Regression#

To include multiple predictor variables in our existing regression framework, we simply need to add these additional terms to our model/equation.

For example, if we have two variables \(X_1\) and \(X_2\), and we want to use these together to predict Y, we can fit the model

\[Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 \]

To accommodate the extra predictor variable, we need to include an extra regression coefficient. This is still considered a linear regression model because we are combining the effects of these predictor variables in a linear (additive) manner.

In R, we don’t need any new functions to fit a multiple regression model; we can fit these models with the same lm() function. As before, R will automatically add the right number of regression coefficients.

Let’s look at an example where we model the effect of both age and height on weight.

%%R
modelM<-lm(weight ~ age + height, data = demoDat)
summary(modelM)
Call:
lm(formula = weight ~ age + height, data = demoDat)

Residuals:
    Min      1Q  Median      3Q     Max 
-32.294  -8.691   0.535   6.920  34.761 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -49.53577    9.89400  -5.007 1.05e-06 ***
age           0.21382    0.13730   1.557    0.121    
height        0.69564    0.04057  17.148  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 12.48 on 247 degrees of freedom
Multiple R-squared:  0.5483,	Adjusted R-squared:  0.5447 
F-statistic: 149.9 on 2 and 247 DF,  p-value: < 2.2e-16

To report the results of this extra regression parameter, we have a third row in the coefficients table. Each regression coefficient included in this table, regardless of how many there are, is tested under the same hypothesis framework we discussed for simple linear regression. The results for each coefficient are interpreted in the same way. From this analysis we can see that height is significantly associated with weight (p-value < 0.05) but age is not (p-value > 0.05).
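If you want to work with these results programmatically rather than reading them off the printed output, the coefficients table can be extracted from the summary object. A minimal sketch using the modelM object fitted above:

%%R
## extract the coefficients table as a matrix; columns are
## Estimate, Std. Error, t value and Pr(>|t|)
coefTable <- summary(modelM)$coefficients
coefTable

## pull out just the p-values for each term
coefTable[, "Pr(>|t|)"]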

All of the assumptions from simple linear regression hold for multiple linear regression with the addition of one more:

  • The dependent variable Y has a linear relationship with the independent variables X.

  • Errors (residuals) are normally distributed.

  • Samples are independent.

  • The errors have constant variance (homoscedasticity).

  • Independent variables are not highly correlated with each other.

We can again use the plot() function to inspect these for a multiple regression model fit.
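For example, a brief sketch using the model fitted above (plot() on an lm object produces the standard diagnostic plots):

%%R
## arrange the four standard diagnostic plots in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(modelM)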

Multiple Regression Analysis Exercise#

Let’s practice fitting a regression model with multiple predictor variables.

Fit and interpret the following regression models (one possible way to get started is sketched after the list).

  1. bmi predicted by age and exercise_hours.

  2. bmi predicted by age, exercise_hours and alcohol_units.
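One possible starting point, assuming demoDat contains columns named bmi, age, exercise_hours and alcohol_units as described in the exercise:

%%R
## model 1: bmi predicted by age and exercise hours
exerciseModel1 <- lm(bmi ~ age + exercise_hours, data = demoDat)
summary(exerciseModel1)

## model 2: also including alcohol units
exerciseModel2 <- lm(bmi ~ age + exercise_hours + alcohol_units, data = demoDat)
summary(exerciseModel2)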


Assessing the Effect of Multiple Predictor Variables Simultaneously#

There are many reasons why you might want to model multiple predictor variables simultaneously.

  1. You are interested in understanding the effects of multiple different factors.

  2. You think there are some variables that might bias or confound your analysis and you want to adjust for this.

In the second scenario, some predictor variables are just included so that their effect is captured, but you are not explicitly interested in their results.

In the first scenario, you might be interested in quantifying the effect of each predictor term individually. This can be achieved by looking at the regression coefficients for each term in turn, as we did for the example above. Alternatively, you might be interested in the combined effect of multiple terms in predicting an outcome. We can do this from our regression analysis by reframing the null and alternative hypotheses.

Let’s define our regression model for two predictor variables \(X_1\) and \(X_2\) as:

\[Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 \]

Previously we were interested in whether a single regression coefficient was non-zero, because if it was, that term would add some quantity to our prediction. The null hypothesis for testing the joint effect of two predictors is based on the same underlying concept: that the predictor variables have no effect on the outcome. The mathematical definition needs to change, though, to reflect that we now have multiple predictor variables and regression coefficients.

If there is no effect of \(X_1\) and \(X_2\) on Y, then the regression coefficients for both terms will be zero, both terms cancel out of the equation, and it simplifies to just the intercept. For there to be an improvement in the predictive capability of the model, at least one of the two predictor variables needs to have a non-zero coefficient.

This gives us the null hypothesis of

\[H_{null}: \theta_1 = \theta_2 = 0\]

and the alternative hypothesis of

\[H_{alternative}: \theta_1 \neq 0\text{ or }\theta_2 \neq 0\]

Another way of thinking about this analysis is that we are comparing two models, with and without the terms of interest. Our statistical question boils down to: which of these models is a better fit to the data?

The more complex model 1:

\[Y = \theta_0 + \theta_1 X_1 + \theta_2 X_2 \]

or

the simpler model 2:

\[Y = \theta_0 \]

To perform this statistical test we need to use the F-distribution rather than the t-distribution. Our test statistic is now based on the idea of variance explained. Mention of either the F-distribution or an analysis of variance explained might trigger an association with ANOVA (analysis of variance). Indeed, there is a link here: in fact, the R function you need to run this analysis is anova(). But first we need to define and fit the two models we want to compare. We use just a 1 on the right-hand side of the formula to specify that we only want to include an intercept term.

%%R
## the more complex "full" model:
model.full <- lm(weight ~ age + height, data = demoDat)
## the simpler "null" model:
model.null <- lm(weight ~ 1, data = demoDat)


anova(model.null, model.full)
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ age + height
  Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
1    249 85130                                  
2    247 38450  2     46680 149.93 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In the output we see a table of test statistics in which an F-value is reported along with its p-value. In this example we have a (very) significant p-value. Therefore, we conclude that a model with both age and height is significantly better than a model with just the intercept. From our analysis of the individual regression coefficients, we know that this is predominantly due to the height variable.
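If we wanted to check this formally, we could use the same model-comparison idea (developed further below) to ask whether adding age improves on a model that already contains height. A sketch:

%%R
## does adding age improve a model that already contains height?
model.height <- lm(weight ~ height, data = demoDat)
model.both   <- lm(weight ~ age + height, data = demoDat)
anova(model.height, model.both)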

Now it may be that, apart from the use of the F-distribution, this overlap with ANOVA does not fit with your current understanding of when you use ANOVA (typically with a categorical variable with > 2 groups). In the next section we will draw out this relationship further.

Modelling Categorical Predictor Variables#

So far we have only considered continuous predictor variables, such as height and age, or variables that are not strictly continuous but can reasonably be treated as if they were. Next we will look at variables that are harder to analyse as continuous, that is, categorical variables.

We will consider a categorical variable as any scenario where the responses can be grouped into a finite number of categories or classes. In R, this type of variable is called a factor.

First, let’s consider the simplest type of categorical variable to model: one with only two options or groups. Sometimes this specific instance is called a binary variable. While in your data collection each group might have a text label to describe what the response means for that observation, for mathematical modelling we need to represent this binary variable numerically. This is achieved by recoding the variable such that one group is represented by 0 and the other group by 1. With this coding it might be referred to as an indicator or dummy variable, where 1 is used to specify the presence of some feature (like an exposure) and 0 the absence of that feature.

For example, for the variable sex, we could represent females with 0 and males with 1 and the variable therefore indicates who is male. Note that in R if you include a factor variable in your regression model which has text labels to indicate the levels, it will automatically do this numerical recoding for you without you needing to know about it. The default behaviour is for the first category alphabetically to be assigned 0 and the second category to be assigned 1. This is important to remember for interpretation later.
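You can check which level is being treated as the baseline, and change it if you wish. A brief sketch, assuming sex is stored as a factor in demoDat (if it is character text, wrap it in factor() first):

%%R
## the first level listed is the baseline, coded as 0
levels(demoDat$sex)

## relevel() can be used to choose a different baseline;
## here we make a copy with "male" as the reference level
sexMaleRef <- relevel(demoDat$sex, ref = "male")
levels(sexMaleRef)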

We then define our regression model as we did for continuous variables. For example, to test for a relationship between sex and weight we can use the following code:

%%R
lm(weight ~ sex, data = demoDat)
Call:
lm(formula = weight ~ sex, data = demoDat)

Coefficients:
(Intercept)      sexmale  
     85.525        8.383  

As before, R will automatically add the relevant regression parameters so this model can be written mathematically as:

\[weight = \theta_0 + \theta_1 sex\]

We have our intercept regression coefficient and our regression “slope” coefficient for the sex variable. In this equation, the variable sex takes either the value 0 or the value 1 for each observation, depending on whether that observation is female or male. We also use this coding strategy to make predictions from this equation. Let’s demonstrate this: for males and females we can derive an equation that represents their prediction.

For females, \(sex\) = 0. If we substitute this in we get

\[weight = \theta_0 + \theta_1 0 = \theta_0\]

Because the regression parameter for the sex variable is multiplied by 0, it will always equal 0, regardless of the value of \(\theta_1\). So the prediction for females is the value of the intercept only.

Let’s do the same thing for males, where sex = 1:

\[weight = \theta_0 + \theta_1 1 = \theta_0 + \theta_1\]

In this equation we multiply \(\theta_1\) by 1 which can be simplified to \(\theta_1\). So the equation for males is the intercept plus the slope coefficient for the sex variable.

This helps us understand how we interpret the value of the regression coefficient for a binary variable. \(\theta_1\) is the only part of the equation that differs between the predictions for males and females. Therefore, it has to capture the necessary (mean) difference between the groups. More than that, it is not present in the equation for females but is present in the equation for males, so it specifically captures the difference in males relative to females. That means if it is positive, males have a higher mean than females. If it is negative, males have a lower mean than females.
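We can see this numerically with predict(). A minimal sketch, refitting the model so the code is self-contained and assuming the levels are labelled “female” and “male” as in the output above:

%%R
## refit the model and predict for one female and one male observation;
## the difference between the two predictions equals the sex coefficient
sexModel <- lm(weight ~ sex, data = demoDat)
predict(sexModel, newdata = data.frame(sex = c("female", "male")))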

The choice of how we code this variable is academic: the magnitude of the estimate will be the same, but the direction (i.e. the sign) would switch if instead we had males = 0 and females = 1.

Let’s fit this model to our data:

%%R 
lm(weight ~ sex, data = demoDat)
Call:
lm(formula = weight ~ sex, data = demoDat)

Coefficients:
(Intercept)      sexmale  
     85.525        8.383  

We can see the estimated slope coefficient is 8.383. It is positive, so we can interpret this as males having a mean weight 8.383 kg higher than females.

R tells us which group the coefficient refers to by appending the level that is coded as 1 to the name of the variable. In this example it is the level male.

But does this mean that there is a significant relationship between sex and weight? Significance testing of regression coefficients for binary variables is performed in exactly the same way as it was for continuous variables: we test whether the coefficient is non-zero. We can display these results by saving the model and using the summary() function.

%%R 
model <- lm(weight ~ sex, data = demoDat)
summary(model)
Call:
lm(formula = weight ~ sex, data = demoDat)

Residuals:
    Min      1Q  Median      3Q     Max 
-40.581 -12.899  -0.832  10.759  56.348 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   85.525      1.565  54.658  < 2e-16 ***
sexmale        8.383      2.287   3.665 0.000302 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.05 on 248 degrees of freedom
Multiple R-squared:  0.05138,	Adjusted R-squared:  0.04756 
F-statistic: 13.43 on 1 and 248 DF,  p-value: 0.0003024

If we look at the p-value column for the second row in the coefficients table, we can see that it is less than 0.05. Therefore we can reject the null hypothesis that the regression coefficient is 0 and report that, based on these data, sex does have a significant effect on weight.

Recall here that we have used the t-distribution to test for a significant relationship between sex and weight. If you wanted to test for a difference in means between two groups, what test would you use? The t-test. If you did that here, you would get an identical result.

%%R 
t.test(weight ~ sex, data = demoDat, var.equal = TRUE)
	Two Sample t-test

data:  weight by sex
t = -3.6651, df = 248, p-value = 0.0003024
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -12.887784  -3.877993
sample estimates:
mean in group female   mean in group male 
            85.52452             93.90741 

That is because these are mathematically equivalent. If your regression model has a continuous outcome variable and a single binary predictor variable, then you have implemented a t-test. It is worth bearing in mind that regression is a much more flexible framework than a t-test. It can handle the inclusion of more than one predictor variable. So if you think a t-test is what you need but you also want to be able to control for other confounders, regression will allow you to do this.
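For example, if we suspected that age also influences weight, we could estimate the sex effect while adjusting for age. A sketch using variables already in demoDat:

%%R
## the sexmale coefficient now estimates the male-female difference
## in mean weight after adjusting for age
summary(lm(weight ~ sex + age, data = demoDat))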

Categorical Variables with > 2 Groups#

Regardless of how many groups your categorical variable has, it needs to be recoded numerically for it to be included in any regression framework. The fundamental principle here is that for a categorical variable with \(m\) different groups, we need \(m-1\) binary variables to parsimoniously code enough unique combinations so that each group can be differentiated in the model. We saw this already for categorical variables with two groups, which needed the addition of one binary variable.

Let’s consider a slightly more complicated example: a categorical variable with three options. Here we will use participant ethnicity as an example. In our dataset there are three groups: “African American”, “Asian” and “European”. To represent this numerically as a series of binary variables we need two indicator variables. The table below shows how, across these two indicator variables (“Asian” and “European”), we can uniquely code each of the three options. For each participant, we replace the categorical variable with the corresponding values of the two new indicator variables. A third indicator variable for the third group (“African American”) would be redundant and would introduce a problem for hypothesis testing, because the three indicator variables would be perfectly collinear, violating the assumption that predictor variables are not highly correlated.

| Ethnicity | Asian | European |
|-----------|-------|----------|
| African American | 0 | 0 |
| Asian | 1 | 0 |
| European | 0 | 1 |
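If you want to see the dummy variables that R creates behind the scenes, you can inspect the model (design) matrix directly. A quick sketch, assuming ethnicity is stored as a factor or character column in demoDat:

%%R
## each row shows the intercept column plus the two dummy variables
## that together encode the three ethnicity groups
head(model.matrix(~ ethnicity, data = demoDat))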

To add this three-level factor into our model we need to add both of these dummy variables. Each variable will also then need its own regression coefficient. So if we were interested in the effect on weight of ethnicity, our regression model would look like this:

\[weight = \theta_0 + \theta_1 Asian + \theta_2 European\]

As before, when we specify the model we want to fit in R, we don’t need to explicitly state the regression coefficients. Similarly, we don’t need to manually add the relevant dummy variables: if you tell R to fit a regression model with a categorical variable, it will automatically add the relevant number of dummy variables. It will do this even if the variable is stored as text (a character vector) rather than explicitly coded as a factor.

For this example the following code is sufficient for this model.

%%R
model <- lm(weight ~ ethnicity, data = demoDat)
summary(model)
Call:
lm(formula = weight ~ ethnicity, data = demoDat)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.511 -11.603  -0.206  12.561  53.830 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)         86.179      3.428  25.143   <2e-16 ***
ethnicityAsian       6.366      4.142   1.537    0.126    
ethnicityEuropean    2.634      3.729   0.706    0.481    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 18.46 on 247 degrees of freedom
Multiple R-squared:  0.01149,	Adjusted R-squared:  0.003481 
F-statistic: 1.435 on 2 and 247 DF,  p-value: 0.2401

We can see from the coefficients table in the output that there are three rows for the three regression parameters: the intercept and the two slope parameters for the two dummy variables. This is despite only having one variable on the right-hand side when we coded the model. We can tell which coefficient is which because R appends the group that each dummy variable indicates onto the row name.

As these are dummy variables, the interpretation of the estimated effect (i.e. regression coefficient) for each of these is the same as for the binary variable. It captures the mean difference for that group relative to the baseline or reference group. Remember, by default R sets the first group alphabetically as the baseline. Therefore, the ethnicityAsian row estimates the mean difference between Asians and African Americans, while the ethnicityEuropean row estimates the mean difference between Europeans and African Americans. In this model it is also worth remembering that the intercept gives you the mean value of the outcome (i.e. weight) when the predictor variables are equal to 0. In the table above we can see that when both predictor variables are 0, the observation is an African American. Therefore we can interpret the value of the intercept as the mean of the African Americans (i.e. the reference group). By adding the value of either regression coefficient to the intercept we can get the mean value of the other two ethnic groups.
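For example, we could recover the three group means from the fitted coefficients. A brief sketch using the model object fitted above:

%%R
## mean weight in the reference (African American) group
coef(model)["(Intercept)"]

## mean weight in the Asian and European groups
coef(model)["(Intercept)"] + coef(model)["ethnicityAsian"]
coef(model)["(Intercept)"] + coef(model)["ethnicityEuropean"]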

Regression with Categorical Variables Exercise#

Let’s practise fitting a regression model with a categorical variable.

Fit a linear regression model to test for differences in weight by socioeconomic_class_status.


We can do significance testing for each dummy variable in the same way as we have done up to this point. Determining that a regression coefficient is non-zero means that the term significantly influences the outcome, and a relationship between that variable and the outcome can be concluded. In the example above between weight and socioeconomic group, the p-values for both dummy variables are < 0.05. Therefore we can conclude that there is a significant difference in mean weight between the high and low socioeconomic groups AND between the middle and high socioeconomic groups. In this situation it is fairly straightforward to draw a conclusion.

But often when we have a categorical variable with more than two groups, our question is more broadly ‘is there a difference across the groups?’. In this situation we are interested in whether any of the dummy variables have an effect on the model. We have a slightly different null hypothesis for this statistical test.

For the regression model

\[weight = \theta_0 + \theta_1 Middle + \theta_2 High\]

if there were no relationship between socioeconomic status and weight, then the regression parameters for both dummy variables would be zero. To remove them from the prediction equation both need to be equal to zero, which gives us the null hypothesis of:

\[H_{null}: \theta_1 = \theta_2 = 0\]

For there to be any benefit in predicting weight from knowing socioeconomic status, at least one of these slope coefficients would need to be non-zero.

So our alternative hypothesis becomes

\[H_{alternative}: \text{there exists } \theta_i \neq 0,\text{ for i = 1,2.}\]

As before we are interested in the joint effect of multiple predictor variables.

This approach can also be thought of as testing these two models:

Model 1:

\[weight = \theta_0\]

Model 2:

\[weight = \theta_0 + \theta_1 Middle + \theta_2 High\]

And asking whether Model 2 (the more complex model) is a significantly better predictor than Model 1 (which we sometimes refer to as the null model).

We can code this statistical comparison in R as follows:

%%R
model.null <- lm(weight ~ 1, data = demoDat)
model.full <- lm(weight ~ socioeconomic_status, data = demoDat)
anova(model.null, model.full)
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ socioeconomic_status
  Res.Df   RSS Df Sum of Sq      F Pr(>F)
1    249 85130                           
2    247 84912  2    218.23 0.3174 0.7283

We have done an ANOVA and used the F-distribution. In the output we see a table of test statistics in which an F-value is reported along with its p-value. In this example we have a non-significant p-value, i.e. one that is greater than 0.05.

When we are testing a categorical variable with more than two groups, we are asking the question ‘is the mean of the outcome different across these groups?’. Posing the question like this might remind you of the situation in which you may have been told to use an ANOVA. And indeed, we did just use one. However, this is one very specific example of where an ANOVA can be used. Reframing your understanding of ANOVA as a comparison of regression models makes it a much more widely usable concept. Again, like with the t-test, if you think of it in this context you can then use it in combination with the other benefits of regression analysis, such as different types of outcome variables or multiple predictor variables.

We can extend this principle of testing the joint effect of multiple dummy variables to test for joint effects of any set of multiple predictor variables. For example, we could test if the addition of height and ethnicity together improves the model for weight:

%%R
model.full <- lm(weight ~ height + ethnicity, data = demoDat)
model.null <- lm(weight ~ 1, data = demoDat)
anova(model.null, model.full)
Analysis of Variance Table

Model 1: weight ~ 1
Model 2: weight ~ height + ethnicity
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    249 85130                                 
2    246 37429  3     47700 104.5 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

In fact, the null model does not have to be as simple as including just the intercept. It just needs to contain a subset of the variables in the full model. This can be a really powerful strategy if you have some variables you want to adjust for but aren’t interested in their relationship to the outcome. Maybe in our example of socioeconomic status, we know that age and height might confound the effect of socioeconomic status. So in our significance testing we want to be able to adjust for them while evaluating the effect of socioeconomic status. We can do this as follows:

%%R
model.null <- lm(weight ~ age + height, data = demoDat)
model.full <- lm(weight ~ socioeconomic_status + age + height, data = demoDat)
anova(model.null, model.full)
Analysis of Variance Table

Model 1: weight ~ age + height
Model 2: weight ~ socioeconomic_status + age + height
  Res.Df   RSS Df Sum of Sq  F  Pr(>F)  
1    247 38450                          
2    245 37531  2    919.12  3 0.05162 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can see here that, having adjusted for the potential confounders, the relationship between socioeconomic status and weight is now much closer to significance (p ≈ 0.052), although it still falls just short of the conventional 0.05 threshold.

Comparing Regression Models Exercise#

Use an F-test to evaluate the relationship between BMI and ethnicity.
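One possible approach, assuming demoDat contains a bmi column as in the earlier exercises:

%%R
## null model: intercept only
model.null <- lm(bmi ~ 1, data = demoDat)
## full model: bmi predicted by ethnicity
model.full <- lm(bmi ~ ethnicity, data = demoDat)

## F-test comparing the two nested models
anova(model.null, model.full)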