Statistical Analysis#
R is synonymous with data analysis. Here we will learn how to perform a number of common statistical tests in R. Please note that the focus is on how to perform a given test in R, not on the merits of each test or the scenarios in which you would use it.
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Learning Objectives#
Understand the process of loading and organizing datasets
Learn how to calculate summary statistics, including mean, median, standard deviation, and variance
Analyze data subsets using the tapply and aggregate functions
Create frequency tables and cross-tabulations for categorical data
Conduct and interpret common statistical tests such as t-tests and ANOVA
Loading the dataset#
To do statistics we need some observed data. We will use a dataset contained in the csv file bp_dataset.csv. This dataset is based on a clinical trial to test the effect of two drugs (Drug A and Drug B), compared with an inactive control drug (placebo), on blood pressure in people who have high blood pressure. The purpose of the drugs is to reduce blood pressure. Some additional information about the people in the trial, such as their age and sex, is also provided (see codebook for further information about the variables).
Let’s start by loading the dataset. You need to ensure that the dataset is located in your current working directory.
We will read in the csv file and assign it to a variable called bp_dataset. The header=TRUE argument means R will take the entries in the first row and use these to set the column headings.
%%R
bp_dataset<-read.csv(file="data/bp_dataset_session4.csv", header=TRUE)
head(bp_dataset)
ptid male age intervention bp_baseline bp_3m bp_6m
1 1 1 41 Drug B 218 153 163
2 2 0 46 Drug B 200 162 177
3 3 0 37 Drug A 198 122 166
4 4 0 46 Drug A 202 148 128
5 5 1 44 Control 142 196 231
6 6 1 36 Drug B 148 191 164
Let’s find out how many rows (observations) and variables (columns) are in this dataset using the dim command (for dimensions).
%%R
dim(bp_dataset)
[1] 100 7
Data Organisation#
When it comes to programming, how you organise and store your data becomes important for efficient processing. This is beyond the remit of this course - but you can find out more about the principles of tidy data here.
We also want to check the types of variable in the dataframe. The str() command shows the structure of the dataframe, telling us the type of each variable in our dataset.
%%R
str(bp_dataset)
'data.frame': 100 obs. of 7 variables:
$ ptid : int 1 2 3 4 5 6 7 8 9 10 ...
$ male : int 1 0 0 0 1 1 0 0 1 1 ...
$ age : int 41 46 37 46 44 36 65 69 58 65 ...
$ intervention: chr "Drug B" "Drug B" "Drug A" "Drug A" ...
$ bp_baseline : int 218 200 198 202 142 148 166 201 206 140 ...
$ bp_3m : int 153 162 122 148 196 191 140 141 173 125 ...
$ bp_6m : int 163 177 166 128 231 164 152 113 164 115 ...
In this dataset we have both integer and character variables. Gender is provided as a binary variable (also known as an indicator or dummy variable) called male, coded as 0 (for female) and 1 (for male). This coding is not strictly necessary: if it were coded as a factor with the levels “Female” and “Male”, R would happily use it in the statistical functions. Coding it this way, however, is more aligned with how such variables are used in statistics, and may make interpretation easier.
The variable intervention is a character variable, although arguably this should be coded as a factor. In most cases, however, R will convert it to a factor within the function calls if needed.
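To illustrate, one way to create factor versions of these variables is sketched below. The new column names sex and intervention_f are our own choices here; the rest of this section continues to use the original coding.
%%R
# illustrative only: factor versions of the male and intervention columns
# (the rest of this section keeps the original coding)
bp_dataset$sex <- factor(bp_dataset$male, levels=c(0, 1),
                         labels=c("Female", "Male"))
bp_dataset$intervention_f <- factor(bp_dataset$intervention)
str(bp_dataset[, c("sex", "intervention_f")])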
Summary statistics#
In this dataset, there are 4 numeric variables: age, and blood pressure measured at three timepoints (bp_baseline, bp_3m, bp_6m).
We want to find the mean, median, standard deviation and variance for these variables. Let’s start with the mean. To find the mean for age we can use:
%%R
mean(bp_dataset$age)
[1] 49.03
We can attach the dataframe bp_dataset, so that we don’t need to use the $ notation for each command. Remember to use the detach() command when you have finished working with the dataframe.
%%R
attach(bp_dataset)
mean(age)
[1] 49.03
We can also calculate the median, standard deviation, variance, minimum, maximum, range and sum fairly easily using base R functions.
%%R
sd(age)
[1] 12.31116
%%R
median(age)
[1] 47
%%R
var(age)
[1] 151.5647
%%R
min(age)
[1] 26
%%R
max(age)
[1] 81
%%R
range(age)
[1] 26 81
%%R
sum(age)
[1] 4903
Note: the commands sd() and var() calculate the sample standard deviation and variance, not the population versions.
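Base R has no dedicated functions for the population versions, but they are easy to derive from the sample ones - a quick sketch:
%%R
# the population variance rescales the sample variance by (n-1)/n
n <- length(age)
var(age) * (n - 1) / n        # population variance
sqrt(var(age) * (n - 1) / n)  # population standard deviation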
We can also calculate percentiles. Using the quantile() command, we need to specify which quantiles (as proportions) we want to calculate; the 0.5 quantile represents the median.
%%R
quantile(age, probs = c(.25, .5, .75))
25% 50% 75%
40.00 47.00 56.25
To get the inter-quartile range (75th percentile minus 25th percentile) we can use the IQR() function.
%%R
IQR(age)
[1] 16.25
We can also calculate multiple descriptive summary statistics of a numeric variable simultaneously using the function summary().
%%R
summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max.
26.00 40.00 47.00 49.03 56.25 81.00
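summary() can also be applied to a whole dataframe, in which case it reports these statistics for every numeric column at once (character columns are just described by their length and class):
%%R
summary(bp_dataset)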
Activity#
Find the mean, median and range for the variables bp_baseline and bp_3m.
Summary statistics by groups#
We can calculate statistics for different subsets or groups of the data using the R command tapply(). The tapply() command requires 3 arguments:
a numeric variable which we want to summarise (in the example below this is age)
a categorical variable indicating the subgroups which we want to group by (in the example below this is male)
the function we wish to call on each subgroup (in the example below this is mean)
%%R
tapply(age, male, mean)
0 1
51.94595 47.31746
If we want to calculate a summary statistic for a single subgroup, we can subset the data and provide the subset to the mean() function. For example, to calculate the mean age of just the males (i.e. where male==1). Note the use of == for specifying an equality condition.
%%R
mean(age[male==1])
[1] 47.31746
We can use the tapply() function to calculate the medians for the intervention groups.
%%R
tapply(age, intervention, median)
Control Drug A Drug B
49.5 49.5 42.5
If we want to calculate summary statistics for combinations of groups - for example by sex and intervention group - we can use the aggregate() command. We use the formula method to specify which variables we want to summarise (on the left-hand side of the ~) and which we want to group by (on the right-hand side of the ~). If all the variables are included in a single data.frame, we can construct the formula using just the column names, and include the argument data to specify which object these are found in. The FUN argument specifies which function we want to apply to these data, which is the mean in this example.
Note that we need to include the dataset as an argument, even though we have attached bp_dataset.
%%R
aggregate(age ~ male + intervention, data=bp_dataset, FUN=mean)
male intervention age
1 0 Control 53.84615
2 1 Control 49.33333
3 0 Drug A 56.63636
4 1 Drug A 47.73913
5 0 Drug B 46.07692
6 1 Drug B 45.72000
Activity#
Calculate the mean, SD, median, 10th and 90th percentiles for bp_baseline for each intervention group.
Calculate the 25th centile, 50th centile (median) and 75th centile for age, for each combination of sex and intervention group.
Summary statistics for categorical data#
We can create a frequency table for categorical variables using the table() command.
%%R
table(male)
male
0 1
37 63
We can also produce cross-tabulations for two categorical variables. The first variable will form the rows, and the second variable the columns.
%%R
table(male, intervention)
intervention
male Control Drug A Drug B
0 13 11 13
1 15 23 25
In fact, we can produce tables for more than 2 categorical variables. R will print these as a series of 2-dimensional tables for fixed values of the other variables.
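For example, a sketch using a derived logical variable (whether each person is aged over 50) as the third dimension:
%%R
# one 2-dimensional table of sex by intervention for each value of age > 50
table(male, intervention, age > 50)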
If we also wish to calculate proportions or percentages, we can use the prop.table() command. We first need to create the table and then pass it to prop.table(). You can either do this in two stages:
%%R
table.male<-table(male)
prop.table(table.male)
male
0 1
0.37 0.63
Or as a nested function call:
%%R
prop.table(table(male, intervention))
intervention
male Control Drug A Drug B
0 0.13 0.11 0.13
1 0.15 0.23 0.25
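The proportions above are of the overall total. Although not covered above, prop.table() also accepts a margin argument: margin=1 calculates proportions within each row and margin=2 within each column. A quick sketch:
%%R
# proportion in each intervention group within each sex (rows sum to 1)
prop.table(table(male, intervention), margin=1)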
We can also create a table for a subgroup of the data by providing just a subset of the data to the table() function. For example, to count the number of each sex, only for people aged over 50:
%%R
table(male[age>50])
0 1
17 19
Activity#
Create a table of frequencies of the number of individuals in each intervention group. Convert this into a table of percentages.
Create a table of frequencies of each intervention group stratified by whether the individual’s baseline blood pressure is greater than or equal to 180.
Common statistical tests: One-sample t-test#
There are several types of t-test. We will start with the simplest: a one-sample two-sided t-test to test the null hypothesis that the true mean value of a continuous variable is equal to a pre-specified value. The default behaviour, and the most common application, is to compare to a value of 0.
%%R
t.test(age)
One Sample t-test
data: age
t = 39.826, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
46.5872 51.4728
sample estimates:
mean of x
49.03
We can see in the console that we get a more verbose output than we have seen before. This is mainly because the result of a statistical test often includes multiple statistics, and the writers of the t.test() function have made an effort to present these back to the user in an easy-to-interpret way.
We can see there is a statement at the top of the output reminding or confirming which statistical test we have performed, and underneath this is a confirmation of which variable/data this was performed on.
We then have a line of multiple test statistics, including the p-value. Here we can see our test result is highly significant, with p < 2.2e-16. Given that we are analysing a population of adults, who are all a lot older than 0, this is not surprising. The function writers have taken the executive decision to report the p-value as < 2.2e-16 rather than give the specific value. In some fields, journals or research groups, this approximation is not good enough. Later we will show you how to extract a more precise p-value.
Underneath this we have a confirmation of the alternative hypothesis the statistic was considered against; then we have the confidence interval and the estimated mean of the sample.
We can vary the width of the confidence interval provided as part of the t-test output by including the argument conf.level. The default is the 95% CI. For the 90% confidence interval, we can run:
%%R
t.test(age, conf.level=0.90)
One Sample t-test
data: age
t = 39.826, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
46.98587 51.07413
sample estimates:
mean of x
49.03
Here we can see that the majority of the output is unchanged; it is just the confidence interval that is different.
The default behaviour is to perform a two-sided t-test. We can specify a one-sided t-test, testing whether the true mean of the variable is greater than or less than the specified value, by including the argument alternative and setting it to either greater or less. To test whether the true mean age is greater than 0 we use:
%%R
t.test(age, alternative="greater")
One Sample t-test
data: age
t = 39.826, df = 99, p-value < 2.2e-16
alternative hypothesis: true mean is greater than 0
95 percent confidence interval:
46.98587 Inf
sample estimates:
mean of x
49.03
Although this is a very uncommon application (due to the need to justify the choice of value), you can perform a one-sample t-test against a value other than 0 by including the argument mu. For example, we can use a t-test to test whether the true mean age is 50.
%%R
t.test(age, mu=50)
One Sample t-test
data: age
t = -0.7879, df = 99, p-value = 0.4326
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
46.5872 51.4728
sample estimates:
mean of x
49.03
If you are only performing one (or a few) statistical tests and working iteratively, you might be able to manually copy the result from the console. However, there will likely be times when you want to extract the result from the test for further processing, for example to enter it into a table and save it to your computer. We can save the output of a t.test() (or indeed any other statistical test or function) to a variable, meaning we can manipulate it further.
%%R
t.out<-t.test(age, mu=50)
t.out
One Sample t-test
data: age
t = -0.7879, df = 99, p-value = 0.4326
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
46.5872 51.4728
sample estimates:
mean of x
49.03
As before, when we defined or created a variable, there is no output to the console. We can see the output by entering the name of the object. Let’s explore the object that we have created. If we use class() we can learn what type of object it is.
%%R
class(t.out)
[1] "htest"
htest - that’s a new one. This is a type of object that has been specifically defined to hold the results of hypothesis tests such as the t-test. It consists of different elements or slots where different parts of the result are stored. We can get a list of these elements with the function names().
%%R
names(t.out)
[1] "statistic" "parameter" "p.value" "conf.int" "estimate"
[6] "null.value" "stderr" "alternative" "method" "data.name"
We can see 10 items listed. All of these are named elements stored within our htest object, which we can extract by name using the $ notation. For example, we can get just the p-value as follows:
%%R
t.out$p.value
[1] 0.432636
We can get the estimated mean:
%%R
t.out$estimate
mean of x
49.03
We can get the confidence interval:
%%R
t.out$conf.int
[1] 46.5872 51.4728
attr(,"conf.level")
[1] 0.95
While the p-value and estimated mean were single values, the confidence interval has been returned as a vector of length 2. The htest object contains a range of different elements.
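Being able to extract elements by name makes it straightforward to collect results into a table for saving, as mentioned earlier. A minimal sketch - the data.frame layout and the file name results.csv are our own choices:
%%R
# gather the key results from the htest object into a one-row data.frame
t.results <- data.frame(estimate = t.out$estimate,
                        ci.lower = t.out$conf.int[1],
                        ci.upper = t.out$conf.int[2],
                        p.value  = t.out$p.value)
t.results
# write.csv(t.results, "results.csv", row.names=FALSE)  # save if needed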
Common statistical tests: Two-sample unpaired t-test#
The two-sample unpaired t-test, also known as an independent sample t-test, is used to compare the mean values of a continuous variable for two independent groups where the data points across the two groups are not matched or paired in any way.
In this example, we want to compare the mean age between males and females, or in other words we want to test whether the true mean age is equal for males and females.
As we have all the data for our response variable (also called the outcome or dependent variable), age, in one object, and a second object which indicates which entries are female and which are male, we will use the formula method (signalled by the ~) to specify the comparison we want to make.
%%R
t.test(age~male)
Welch Two Sample t-test
data: age by male
t = 1.739, df = 63.594, p-value = 0.08687
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.6891826 9.9461538
sample estimates:
mean in group 0 mean in group 1
51.94595 47.31746
The output looks very similar to the output for the one-sample t-test, with a few differences:
The name of the test has changed to “Welch Two Sample t-test”.
The alternative hypothesis is different.
There are two sample estimates, one for each group.
Instead of using the formula method, we can code a two-sample test where the data for each group is stored in a separate object. To demonstrate this we will create two numeric vectors, one with the ages of the female participants and one with the ages of the male participants. We then provide the two vectors as the first two arguments to the t.test() function.
%%R
age.males<-age[male==1]
age.females<-age[male==0]
t.test(age.males, age.females)
Welch Two Sample t-test
data: age.males and age.females
t = -1.739, df = 63.594, p-value = 0.08687
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-9.9461538 0.6891826
sample estimates:
mean of x mean of y
47.31746 51.94595
We can perform a one-sided test using the alternative argument, as shown above.
%%R
t.test(age.males, age.females, alternative = "less")
Welch Two Sample t-test
data: age.males and age.females
t = -1.739, df = 63.594, p-value = 0.04343
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
-Inf -0.1859434
sample estimates:
mean of x mean of y
47.31746 51.94595
The default behaviour of the t.test() function is to assume unequal variances. If we wish to repeat the t-test under the assumption of equal variances, we can include the argument var.equal and set it to TRUE.
%%R
t.test(age~male, var.equal=TRUE)
Two Sample t-test
data: age by male
t = 1.8368, df = 98, p-value = 0.06927
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-0.3721683 9.6291395
sample estimates:
mean in group 0 mean in group 1
51.94595 47.31746
If you are unsure whether you can assume equal variances, you can run a statistical test called an F test using the function var.test(). The null hypothesis for this test is that the variances are equal, so a significant result means that the assumption of equal variances is rejected. To compare the variances in age by gender:
%%R
var.test(age ~ male)
F test to compare two variances
data: age by male
F = 1.5248, num df = 36, denom df = 62, p-value = 0.1431
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.8663612 2.8110952
sample estimates:
ratio of variances
1.524841
An alternative statistical test for equal variances is Bartlett’s test (again, the null hypothesis is that the variances are equal for each group):
%%R
bartlett.test(age ~ male)
Bartlett test of homogeneity of variances
data: age by male
Bartlett's K-squared = 2.0664, df = 1, p-value = 0.1506
Common statistical tests: Paired t-test#
A paired t-test is used to compare two variables that are matched or paired in some way, for example, measurements made on the same person at two different times. The paired t-test uses the differences between matched pairs of measurements to test whether the true means are equal, i.e. whether the difference between them is 0. Therefore, the length of both vectors needs to be the same.
For example, we may want to perform a paired t-test to test whether the true mean values for BP at baseline (bp_baseline) and BP at 3 months (bp_3m) are the same, taking into account that the blood pressure measurements come from the same individual, i.e. for every baseline measurement there is a ‘matched’ measurement taken at 3 months. To do a paired t-test we need to include the argument paired and set it to TRUE.
%%R
t.test(bp_3m,bp_baseline, paired=TRUE)
Paired t-test
data: bp_3m and bp_baseline
t = -4.5608, df = 99, p-value = 1.462e-05
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-23.707166 -9.332834
sample estimates:
mean difference
-16.52
We can see the output is similar to before, but with the following differences:
The test title has changed to “Paired t-test”.
The alternative hypothesis is different.
We have a single sample estimate, which is the mean difference between the groups.
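Because the paired t-test works on the within-pair differences, an equivalent way to obtain the same result (a useful sanity check rather than a separate method) is a one-sample t-test on the differences:
%%R
# gives an identical result to t.test(bp_3m, bp_baseline, paired=TRUE)
t.test(bp_3m - bp_baseline)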
Note that the results are different if we do not take into account the paired nature of the variables.
%%R
t.test(bp_3m,bp_baseline,paired=FALSE)
Welch Two Sample t-test
data: bp_3m and bp_baseline
t = -4.8468, df = 197.48, p-value = 2.534e-06
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-23.241544 -9.798456
sample estimates:
mean of x mean of y
165.88 182.40
Activity#
Perform a single sample 2-sided t-test to test whether the true mean of baseline BP is equal to 170.
Perform an unpaired t-test to compare mean BP at 3 months between Drug A and the control group.
Perform a paired t-test to test the null hypothesis that mean difference is 0 comparing BP at 6 months with BP at baseline.
Common statistical tests: Mann-Whitney test#
A non-parametric alternative to a t-test is a Mann-Whitney U test, which is performed using function wilcox.test()
.
It works in a very similar way to the two sample t- test, and many of the arguments are shared.
%%R
wilcox.test(age~male, alternative = "greater")
Wilcoxon rank sum test with continuity correction
data: age by male
W = 1374, p-value = 0.06865
alternative hypothesis: true location shift is greater than 0
The output is shorter (as the test has fewer components), but it follows a similar format to that of the t-test.
We can save this output as a variable and extract the p-value in the same way as for the t-test. The elements within the variable are different from the t-test output, but they are accessed in the same way.
%%R
m.out<-wilcox.test(age~male, alternative = "greater")
m.out$p.value
[1] 0.06865345
%%R
names(m.out)
[1] "statistic" "parameter" "p.value" "null.value" "alternative"
[6] "method" "data.name"
Common statistical tests: ANOVA#
We perform a one-way ANOVA (ANalysis Of VAriance test) to compare means across three or more groups. The null hypothesis is that the mean is equal across groups. There are two ways to perform an ANOVA in R, both use the formula method to specify the comparision we want to make.
For example, to compare mean BP at 3 months across Drug A, Drug B, Control, first using the aov()
function:
%%R
aov(bp_3m ~ intervention)
Call:
aov(formula = bp_3m ~ intervention)
Terms:
intervention Residuals
Sum of Squares 12718.47 41840.09
Deg. of Freedom 2 97
Residual standard error: 20.76875
Estimated effects may be unbalanced
%%R
summary(aov(bp_3m ~ intervention))
Df Sum Sq Mean Sq F value Pr(>F)
intervention 2 12718 6359 14.74 2.57e-06 ***
Residuals 97 41840 431
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the aov() call doesn’t output the complete test result we want. We have to additionally use the function summary() to extract and print the required test result. This is not an uncommon approach for performing statistical tests in R.
We can get the same result using the anova() function (give or take a rounding difference). This version of the ANOVA requires a linear model to be fitted first using the lm() function.
%%R
anova(lm(bp_3m ~ intervention))
Analysis of Variance Table
Response: bp_3m
Df Sum Sq Mean Sq F value Pr(>F)
intervention 2 12718 6359.2 14.743 2.567e-06 ***
Residuals 97 41840 431.3
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Activity#
Perform a one-way ANOVA to compare age across the three intervention groups.
Common statistical tests: Chi-square test#
Chi square tests are used to test for a relationship between two categorical variables.
The function is chisq.test()
and requires a cross tabulation of the two variables as the input.
%%R
chisq.test(table(male, intervention))
Pearson's Chi-squared test
data: table(male, intervention)
X-squared = 1.5097, df = 2, p-value = 0.4701
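The result of chisq.test() is again an htest object, so its components can be extracted by name as before. For example, the expected element holds the expected cell counts under the null hypothesis of no association, which is useful for checking the common rule of thumb that expected counts should be at least 5:
%%R
chi.out <- chisq.test(table(male, intervention))
chi.out$expected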
Common statistical tests: Correlation#
When you have two continuous variables, it is likely that a correlation statistic is what you want to calculate. We can do this with the function cor().
For example, to calculate the correlation between age and BP at baseline we can run:
%%R
cor(age, bp_baseline)
[1] 0.01138201
The output of this function is very simple compared to the tests we looked at before: it is just a single value. If you compare the function name to the names of the other functions for performing statistical tests, it doesn’t have the suffix .test. This function simply calculates the value of the correlation statistic and does not perform any hypothesis testing with it. To do that we need the cor.test() function.
We can use cor.test() to find the correlation between age and bp_baseline:
%%R
cor.test(age, bp_baseline)
Pearson's product-moment correlation
data: age and bp_baseline
t = 0.11268, df = 98, p-value = 0.9105
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.1854507 0.2073366
sample estimates:
cor
0.01138201
The output here is now more similar to what we had before in that it reports a number of statistics relating to the test.
The default method for calculating the correlation is Pearson’s; we can instead calculate Spearman’s rank correlation coefficient with the argument method:
%%R
cor.test(age, bp_baseline, method="spearman")
Spearman's rank correlation rho
data: age and bp_baseline
S = 165406, p-value = 0.9412
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.007464794
In addition: Warning message:
In cor.test.default(age, bp_baseline, method = "spearman") :
Cannot compute exact p-value with ties
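This warning is informational: because some values are tied, R cannot compute an exact p-value for Spearman’s test and uses an approximation instead. If you want to request the approximate p-value explicitly (and avoid the warning), you can set the exact argument to FALSE:
%%R
# same approximate p-value as above, without the warning
cor.test(age, bp_baseline, method="spearman", exact=FALSE)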