Strings and Dates#
Download Rmd Version#
If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.
Exercises incomplete: Download Strings_and_dates.Rmd Exercises complete: Download Strings_and_dates_complete.Rmd
Learning Objectives#
Understand the process of cleaning and manipulating string data in R using the
stringr
packageLearn how to clean and format text data by removing whitespace and replacing substrings
Understand how to split and concatenate string columns to create new identifiers
Gain proficiency in converting strings to dates and accessing timestamp components using the
lubridate
packageApply the knowledge of string and date manipulation in practical examples
Introduction#
In this notebook we’ll be looking at ways to work with two common kinds of data: text data and dates/timestamps. When we first create and collect data, it may not be in its cleanest form, especially when it comes to working with text data and timestamps. The following packages provide lots of useful functions for working with strings / timestamps:
stringr
: working with strings (included withtidyverse
).lubridate
: working with dates/timestamps (a non-core Tidyverse package).
To demonstrate these packages, we’ll be working with two datasets:
A toy dataset consisting of some coin flipping experiments (thrilling stuff).
Weather data from the Palmer Station in Antarctica from 1989 - 2019, where data are available for each day[LTE23].
Note We’ll be making use of the pipe operator |>
throughout; see
the 04_Piping.Rmd
notebook.
Working with text - heads and tails#
We have a toy dataset for getting familiar with working with text. The
heads_and_tails.csv
file contains data on some coin flipping experiments
performed by two people, Joe Bloggs and Jane Doe. The experiments were performed
on different days and are split up into different parts (coin flipping is
tiring work).
The data is not in a clean state:
Joe and Jane’s names are recorded in multiple ways.
The results are inconsistently formatted.
The experiments look pretty consistent, but note that in some rows the date is separated from the
part n
bit by a colon:
whereas in other rows it’s separated by a space.
%%R
# Note: reading all columns as strings
experiments <- readr::read_csv("./data/heads_and_tales.csv",
col_types = list(.default = readr::col_character()))
experiments
# A tibble: 13 × 3
Experiment Technician Result
<chr> <chr> <chr>
1 2024-01-04:part 1 joe bloggs H,H,H,T,T,H,H,T,H,H,T,H,H,H,T,T,H,H,T,T
2 2024-01-04:part 2 Joe Bloggs T,T,H,T,H,T,T,H,T,T,T,T,H,T,H,H,T,H,T,T
3 2024-01-04:part 3 Joe Bloggs H,T,H,T,T,H,T,H,H,T,H,T,H,T,T,T,H,H,H,T,T,H,T,…
4 2024-01-04 part 1 jane doe HTHHTHTHTTHTHTHTTT
5 2024-01-04 part 2 jane doe HHTTTTHHHTTHTHHTHHTTT
6 2024-01-05 part 1 Jane doe THTTTHHTTTH
7 2024-01-05 part 2 Jane doe TTTTTHTHTHTTTTTHTTHT
8 2024-01-06:part 1 Joe Bloggs HTHTTTTHHHHTTTHHHHHTHH
9 2024-01-06:part 2 Joe Bloggs TTTTTTHTTTHTH
10 2024-01-06 part 1 Jane Doe HTTTTHHHTHHTTHHHTTTTTHTTH
11 2024-01-06 part 2 Jane Doe TTTTHHTHHTTHTHHHTHTTTH
12 2024-01-06 part 3 Jane Doe HHTHTTTTTHTTTHTHTHT
13 2024-01-07:part 1 joe bloggs TTHTTHHHHTHHHHTTTHHT
We’ll use some of the functions in stringr
to clean up this dataset.
Cleaning whitespace#
Before tidying up the columns, let’s first rename the current columns to indicate
they contain the ‘raw’ values. We can use the function rename
from the dplyr
package to do this:
%%R
# Rename columns to *_raw
experiments <- experiments |>
dplyr::rename(Experiment_raw = Experiment,
Technician_raw = Technician,
Result_raw = Result)
print(colnames(experiments))
[1] "Experiment_raw" "Technician_raw" "Result_raw"
Let’s first tackle cleaning up the technician’s names. Look at the unique values in the ‘Technician_raw’ column:
%%R
unique(experiments$Technician_raw)
[1] "joe bloggs" "Joe Bloggs" "jane doe" "Jane doe" "Joe Bloggs"
[6] "Jane Doe"
Let’s make the case consistent. There are several stringr
functions we could
use for this, of the form str_to_*
%%R
name <- "aLbErt EinStEin"
print(stringr::str_to_lower(name)) # lower case
print(stringr::str_to_upper(name)) # upper case
print(stringr::str_to_title(name)) # title case
print(stringr::str_to_sentence(name)) # sentence case
[1] "albert einstein"
[1] "ALBERT EINSTEIN"
[1] "Albert Einstein"
[1] "Albert einstein"
Let’s convert the technicians’ names to title case. Do we now have consistent names for Joe and Jane?
%%R
experiments <- experiments |>
dplyr::mutate(Technician = stringr::str_to_title(Technician_raw))
unique(experiments$Technician)
[1] "Joe Bloggs" "Jane Doe" "Joe Bloggs"
Not quite. We can replace the two spaces in the middle of “Joe Bloggs” with a
single space by using stringr::str_squish
:
%%R
# Remove repeated whitespace characters
experiments <- experiments |>
dplyr::mutate(Technician = stringr::str_squish(Technician))
unique(experiments$Technician)
[1] "Joe Bloggs" "Jane Doe"
Note that a closely related stringr
function, str_trim
, can be used for only
removing leading and trailing whitespace, if for some reason we wanted to keep
all the whitespace between the words:
%%R
# \t = tab character, \n = new line character
name <- "\tAda Lovelace \n"
cat(name)
cat(stringr::str_trim(name))
Ada Lovelace
Ada Lovelace
Replacing characters / substrings#
Next, let’s make the ‘Result’ entries consistently formatted. We’ll remove
the commas from the first few experiments and instead just have a string of
H
s and T
s. We can view this as replacing each comma with the empty string,
""
.
In general, to perform text replacement, we can use one of the following
stringr
functions:
%%R
vowels = "a--e--i--o--u"
# Replace the *first* instance of "--" with ", "
print(stringr::str_replace(vowels, pattern = "--", replacement = ", ")) # a, e--i--o--u
# Replace *all* instances of "--" with ", "
print(stringr::str_replace_all(vowels, pattern = "--", replacement = ", ")) # a, e, i, o, u
[1] "a, e--i--o--u"
[1] "a, e, i, o, u"
You may wonder why the parameter name ‘pattern’ is used for the string we want to replace. The reason is that we actually need to supply a regular expression to match on. We’ll come back to this.
Let’s now clean up the ‘Result’ column by removing the commas:
%%R
experiments <- experiments |>
dplyr::mutate(Result = stringr::str_replace_all(Result_raw, ",", ""))
experiments |>
dplyr::select(Result_raw, Result)
# A tibble: 13 × 2
Result_raw Result
<chr> <chr>
1 H,H,H,T,T,H,H,T,H,H,T,H,H,H,T,T,H,H,T,T HHHTTHHTHHTHHHTTHHTT
2 T,T,H,T,H,T,T,H,T,T,T,T,H,T,H,H,T,H,T,T TTHTHTTHTTTTHTHHTHTT
3 H,T,H,T,T,H,T,H,H,T,H,T,H,T,T,T,H,H,H,T,T,H,T,H,T HTHTTHTHHTHTHTTTHHHTTHTHT
4 HTHHTHTHTTHTHTHTTT HTHHTHTHTTHTHTHTTT
5 HHTTTTHHHTTHTHHTHHTTT HHTTTTHHHTTHTHHTHHTTT
6 THTTTHHTTTH THTTTHHTTTH
7 TTTTTHTHTHTTTTTHTTHT TTTTTHTHTHTTTTTHTTHT
8 HTHTTTTHHHHTTTHHHHHTHH HTHTTTTHHHHTTTHHHHHTHH
9 TTTTTTHTTTHTH TTTTTTHTTTHTH
10 HTTTTHHHTHHTTHHHTTTTTHTTH HTTTTHHHTHHTTHHHTTTTTHTTH
11 TTTTHHTHHTTHTHHHTHTTTH TTTTHHTHHTTHTHHHTHTTTH
12 HHTHTTTTTHTTTHTHTHT HHTHTTTTTHTTTHTHTHT
13 TTHTTHHHHTHHHHTTTHHT TTHTTHHHHTHHHHTTTHHT
Exercise: replacing strings#
Put the ‘Experiment’ column into a consistent format, so that the experiment is
recorded as <date>:part <n>
where <date>
is a date string and <n>
is a
number 1, 2, etc.
Solution
%%R
experiments <- experiments |>
dplyr::mutate(Experiment = stringr::str_replace(Experiment_raw, " part", ":part"))
experiments |>
dplyr::select(Experiment_raw, Experiment)
Splitting and concatenating columns#
With the columns cleaned up, we no longer need the ‘raw’ columns, so let’s remove them:
%%R
experiments <- experiments |>
dplyr::select(-Experiment_raw, -Technician_raw, -Result_raw)
It would be good to have the ‘Experiment’ column consist of a unique identifier for each experiment. We could just give it an integer label, but let’s do something more sophisticated. Each row is uniquely identifiable from the current ‘Experiment’ string and the name of the Technician. So let’s combine these together to make a new ID string, as follows:
Joe Bloggs + 2024-01-04:part 2 --> 2024-01-04:2:JoeBloggs
Note that we’ve removed the “part “ string and removed the inner space in the name. How can we accomplish this transformation in smaller steps?
First split the experiment label “2024-01-04:part 2” into two pieces: “2024-01-04” and “2”.
Remove the space from the name, “Joe Bloggs” to “JoeBloggs”.
Paste the results together using colons: “2024-01-04” + “2” + “JoeBloggs” –> “2024-01-04:2:JoeBloggs”.
We know how to do the second step (use str_replace
), but steps 1 and 3 are new.
The trick is to do this column-wise in the dataframe. Se we’ll make new columns containing the pieces we want to paste together, and then create a new column by pasting these two columns together.
To perform the splitting, we can use the separate
function from the
tidyr
package (not stringr
!) to split a column on a separator. The result is
one column for each piece of the split; we thus need to also provide names for
the new columns:
tidyr::separate(data,
col,
into = c("NewCol1", "NewCol2", ...),
sep = <string_to_split_on>)
To paste columns back together with a custom separator, we can use the
str_c
function from stringr
:
%%R
stringr::str_c("foo", "bar", "baz", sep = " -- ") # foo -- bar -- baz
[1] "foo -- bar -- baz"
Exercise: splitting and concatenating columns#
Use tidyr::separate
and stringr::str_c
to create a unique experiment ID by
pasting together:
The experiment date
The sub-experiment number
The technician’s name (without a space)
E.g.
Joe Bloggs + 2024-01-04:part 2 --> 2024-01-04:2:JoeBloggs
Hint: don’t try to do this all in one go. Instead, build it up step-by-step and
use the pipe operator |>
join the steps together.
Solution
%%R
experiments |>
tidyr::separate(col = Experiment,
into = c("Experiment_date", "Sub_experiment"),
sep = ":part ") |> # don't forget the space!
dplyr::mutate(Experiment_id = stringr::str_c(Experiment_date,
Sub_experiment,
stringr::str_remove(Technician, " "),
sep = ":"))
Finding substrings#
How many of the experiments features a run of 5 heads or 5 tails?
We can answer this as follows:
For each row, determine whether it contains a run of 5 heads or 5 tails.
Filter on the result.
To perform the first step, we can use str_detect
from stringr
. This simply
returns TRUE
or FALSE
depending on whether it finds a given pattern in
a string:
stringr::str_detect(a_string, pattern)
The pattern
argument is interpreted as a regular expression. We mentioned
this earlier when looking at str_replace
/ str_replace_all
. Regular expressions
are a mini-language for specifying matches on strings. We’re not going to go into
this much here, but they are useful to know about, especially if you work with
text data a lot; check out the great
stringr cheat sheet
for a summary, or consult the
chapter on regular expressions in R for Data Science (2e)
for an introduction. Here are some example regular expressions in action, using
str_detect
to indicate whether the given string matches or not:
%%R
# Match on a string as-is (case sensitive)
stringr::str_detect("Regex is cool.", pattern = "ex") |> print() # true
stringr::str_detect("Regex is cool.", pattern = "COOL") |> print() # false
# Match on 1 or more instances
stringr::str_detect("Regex is cool.", pattern = "o+") |> print() # true
stringr::str_detect("Regex is cool.", pattern = "coolcool+") |> print() # false
# Regex-special characters need escaping e.g. to match on a period, use \\
stringr::str_detect("Regex is cool.", pattern = "\\.") |> print() # true
# Match on either one pattern or another with pipe |
stringr::str_detect("Regex is cool.", pattern = "cool|meh") |> print() # true
stringr::str_detect("Regex is meh.", pattern = "cool|meh") |> print() # true
stringr::str_detect("Regex is cool.", pattern = "meh|lame") |> print() # false
# Match at the beginning of the string only with ^
stringr::str_detect("Regex is cool.", pattern = "^Regex") |> print() # true
stringr::str_detect("Regex is cool.", pattern = "^cool") |> print() # false
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
Exercise: finding substrings#
With the help of str_detect
and a suitable regular expression, write code to
work out how many experiments feature a run of at least 5 straight heads or 5
straight tails.
Solution
%%R
experiments |>
dplyr::mutate(Has_run_of_5 = stringr::str_detect(Result, pattern = "HHHHH|TTTTT")) |>
dplyr::filter(Has_run_of_5) |>
nrow()
Working with dates#
We have weather data from the Palmer Station in Antarctica from 1989 - 2019[LTE23], where data are available for each day.
%%R
weather <- read.csv("./data/PalmerStation_Daily_Weather.csv") # using base R read.csv here!
str(weather)
'data.frame': 12477 obs. of 21 variables:
$ Date : chr "1989-04-01" "1989-04-02" "1989-04-03" "1989-04-04" ...
$ Temperature.High..C. : num 2.8 1.1 -0.6 1.1 -0.6 2.5 -1.4 -0.8 -1 -1.5 ...
$ Temperature.Low..C. : num -1 -2.7 -3.5 -4.4 -2.9 -3.1 -3.2 -4.5 -4 -3.8 ...
$ Temperature.Average..C. : num NA NA NA NA NA NA NA NA NA NA ...
$ Sea.Surface.Temperature..C.: num NA NA NA NA NA NA NA NA NA NA ...
$ Sea.Ice..WMO.Code. : chr "" "" "" "" ...
$ Pressure.High..mbar. : num 1004 998 998 1002 1002 ...
$ Pressure.Low..mbar. : num 998 995 995 998 997 ...
$ Pressure.Average..mbar. : num NA NA NA NA NA NA NA NA NA NA ...
$ Wind.Peak..knots. : num 18 24 13 14 14 15 16 39 27 16 ...
$ Wind.5.Sec.Peak..knots. : num NA NA NA NA NA NA NA NA NA NA ...
$ Wind.2.Min.Peak..knots. : num NA NA NA NA NA NA NA NA NA NA ...
$ Wind.Average..knots. : num 4 9 8 6 4 4 8 14 11 6 ...
$ Wind.Peak.Direction : num 110 30 60 210 230 180 40 120 50 60 ...
$ Wind.5.Sec.Peak.Direction : num NA NA NA NA NA NA NA NA NA NA ...
$ Wind.2.Min.Peak.Direction : num NA NA NA NA NA NA NA NA NA NA ...
$ Wind.Prevailing.Direction : chr "SW" "NE" "NE" "NW" ...
$ Precipitation.Melted..mm. : chr "0.0" "0.0" "0.0" "0.0" ...
$ Precipitation.Snow..cm. : chr "0" "0" "0" "0" ...
$ Depth.at.Snowstake..cm. : chr "" "" "" "" ...
$ Sky.Coverage..tenths. : num 4 2 5 5 10 2 9 5 0 10 ...
However, the structure of the data above indicates that the first variable ‘Date’ is a ‘character’. This means that R is not understanding these as representing dates, and so it may be hard to manipulate them any further (e.g. extracting certain years).
%%R
head(weather$Date)
class(weather$Date)
[1] "character"
(Note: if we’d used read_csv
from the readr
package then these would
automatically have been coerced into dates. We’re avoiding this for now to
demonstrate how to convert raw strings into dates.)
We’ll use the lubridate
package (loaded in when we load in tidyverse
)
to convert these variables into dates and create a column for the year.
Converting from strings#
The lubridate
package has very intuitive function names and argument
structures. For instance, you can convert a string into a date type by
using a function with letters representing the date format. In lubridate
,
y = year, m = month, d = day, h = hour, m = minute, s = second, etc.
(You can see all of these formats by using the auto-complete box in RStudio,
e.g. to see all functions starting with ‘y’, begin typing lubridate::y
.)
We can see that our data is formatted in the year-month-day format, and so we
can use the ymd
function to indicate to R what format we want it to detect as
it converts characters to dates. If we use this function and assign the output
to a date vector, we can then evaluate the class to see that it is a date.
%%R
date_vector <- lubridate::ymd(weather$Date)
class(date_vector)
[1] "Date"
Remember that manipulating a column does not mean it automatically saves into the dataframe. Instead, we need to explicitly overwrite our dataframe with a new dataframe that contains the modified column. We can use the mutate function to achieve this:
%%R
# Change the Date column to be a date type
weather <- weather |>
dplyr::mutate(Date = lubridate::ymd(Date))
class(weather$Date)
[1] "Date"
Accessing timestamp components#
Now that R recognizes these as dates, we can extract certain date-related
features, such as the year, month, and day. The lubridate
package has functions
conveniently named year
, month
, day
that can be used with a date argument,
and that element of the date will be extracted. For example:
%%R
# Extract the year
weather <- weather |>
dplyr::mutate(Year = lubridate::year(Date))
head(weather) |>
dplyr::select(Date, Year)
Date Year
1 1989-04-01 1989
2 1989-04-02 1989
3 1989-04-03 1989
4 1989-04-04 1989
5 1989-04-05 1989
6 1989-04-06 1989
Exercise: working with dates#
Create a summary of the weather data that gives the maximum, minimum, and mean of the average temperatures for each month of the year.
Solution
%%R
weather |>
dplyr::mutate(Month = lubridate::month(Date)) |>
dplyr::group_by(Month) |>
dplyr::summarize(Average_temp = mean(Temperature.Average..C., na.rm = TRUE),
Max_temp = max(Temperature.High..C., na.rm = TRUE),
Min_temp = min(Temperature.Low..C., na.rm = TRUE))
Acknowledgement#
The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.