Strings and Dates#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Exercises incomplete: Download Strings_and_dates.Rmd Exercises complete: Download Strings_and_dates_complete.Rmd

Learning Objectives#

  • Understand the process of cleaning and manipulating string data in R using the stringr package

  • Learn how to clean and format text data by removing whitespace and replacing substrings

  • Understand how to split and concatenate string columns to create new identifiers

  • Gain proficiency in converting strings to dates and accessing timestamp components using the lubridate package

  • Apply the knowledge of string and date manipulation in practical examples

Introduction#

In this notebook we’ll be looking at ways to work with two common kinds of data: text data and dates/timestamps. When we first create and collect data, it may not be in its cleanest form, especially when it comes to working with text data and timestamps. The following packages provide lots of useful functions for working with strings / timestamps:

  • stringr: working with strings (included with tidyverse).

  • lubridate: working with dates/timestamps (a non-core Tidyverse package).

To demonstrate these packages, we’ll be working with two datasets:

  • A toy dataset consisting of some coin flipping experiments (thrilling stuff).

  • Weather data from the Palmer Station in Antarctica from 1989 - 2019, where data are available for each day[LTE23].

Note We’ll be making use of the pipe operator |> throughout; see the 04_Piping.Rmd notebook.

Working with text - heads and tails#

We have a toy dataset for getting familiar with working with text. The heads_and_tails.csv file contains data on some coin flipping experiments performed by two people, Joe Bloggs and Jane Doe. The experiments were performed on different days and are split up into different parts (coin flipping is tiring work).

The data is not in a clean state:

  • Joe and Jane’s names are recorded in multiple ways.

  • The results are inconsistently formatted.

  • The experiments look pretty consistent, but note that in some rows the date is separated from the part n bit by a colon : whereas in other rows it’s separated by a space.

%%R
# Note: reading all columns as strings
experiments <- readr::read_csv("./data/heads_and_tales.csv",
                               col_types = list(.default = readr::col_character()))
experiments
# A tibble: 13 × 3
   Experiment        Technician  Result                                         
   <chr>             <chr>       <chr>                                          
 1 2024-01-04:part 1 joe bloggs  H,H,H,T,T,H,H,T,H,H,T,H,H,H,T,T,H,H,T,T        
 2 2024-01-04:part 2 Joe Bloggs  T,T,H,T,H,T,T,H,T,T,T,T,H,T,H,H,T,H,T,T        
 3 2024-01-04:part 3 Joe Bloggs  H,T,H,T,T,H,T,H,H,T,H,T,H,T,T,T,H,H,H,T,T,H,T,…
 4 2024-01-04 part 1 jane doe    HTHHTHTHTTHTHTHTTT                             
 5 2024-01-04 part 2 jane doe    HHTTTTHHHTTHTHHTHHTTT                          
 6 2024-01-05 part 1 Jane doe    THTTTHHTTTH                                    
 7 2024-01-05 part 2 Jane doe    TTTTTHTHTHTTTTTHTTHT                           
 8 2024-01-06:part 1 Joe Bloggs  HTHTTTTHHHHTTTHHHHHTHH                         
 9 2024-01-06:part 2 Joe  Bloggs TTTTTTHTTTHTH                                  
10 2024-01-06 part 1 Jane Doe    HTTTTHHHTHHTTHHHTTTTTHTTH                      
11 2024-01-06 part 2 Jane Doe    TTTTHHTHHTTHTHHHTHTTTH                         
12 2024-01-06 part 3 Jane Doe    HHTHTTTTTHTTTHTHTHT                            
13 2024-01-07:part 1 joe bloggs  TTHTTHHHHTHHHHTTTHHT                           

We’ll use some of the functions in stringr to clean up this dataset.

Cleaning whitespace#

Before tidying up the columns, let’s first rename the current columns to indicate they contain the ‘raw’ values. We can use the function rename from the dplyr package to do this:

%%R
# Rename columns to *_raw
experiments <- experiments |>
  dplyr::rename(Experiment_raw = Experiment,
                Technician_raw = Technician,
                Result_raw = Result)
print(colnames(experiments))
[1] "Experiment_raw" "Technician_raw" "Result_raw"    

Let’s first tackle cleaning up the technician’s names. Look at the unique values in the ‘Technician_raw’ column:

%%R
unique(experiments$Technician_raw)
[1] "joe bloggs"  "Joe Bloggs"  "jane doe"    "Jane doe"    "Joe  Bloggs"
[6] "Jane Doe"   

Let’s make the case consistent. There are several stringr functions we could use for this, of the form str_to_*

%%R
name <- "aLbErt EinStEin"
print(stringr::str_to_lower(name))  # lower case
print(stringr::str_to_upper(name))  # upper case
print(stringr::str_to_title(name))  # title case
print(stringr::str_to_sentence(name))  # sentence case
[1] "albert einstein"
[1] "ALBERT EINSTEIN"
[1] "Albert Einstein"
[1] "Albert einstein"

Let’s convert the technicians’ names to title case. Do we now have consistent names for Joe and Jane?

%%R
experiments <- experiments |>
  dplyr::mutate(Technician = stringr::str_to_title(Technician_raw))

unique(experiments$Technician)
[1] "Joe Bloggs"  "Jane Doe"    "Joe  Bloggs"

Not quite. We can replace the two spaces in the middle of “Joe Bloggs” with a single space by using stringr::str_squish:

%%R
# Remove repeated whitespace characters
experiments <- experiments |>
  dplyr::mutate(Technician = stringr::str_squish(Technician))

unique(experiments$Technician)
[1] "Joe Bloggs" "Jane Doe"  

Note that a closely related stringr function, str_trim, can be used for only removing leading and trailing whitespace, if for some reason we wanted to keep all the whitespace between the words:

%%R
# \t = tab character, \n = new line character
name <- "\tAda    Lovelace  \n"
cat(name)
cat(stringr::str_trim(name))
	Ada    Lovelace  
Ada    Lovelace

Replacing characters / substrings#

Next, let’s make the ‘Result’ entries consistently formatted. We’ll remove the commas from the first few experiments and instead just have a string of Hs and Ts. We can view this as replacing each comma with the empty string, "".

In general, to perform text replacement, we can use one of the following stringr functions:

%%R
vowels = "a--e--i--o--u"

# Replace the *first* instance of "--" with ", "
print(stringr::str_replace(vowels, pattern = "--", replacement = ", "))  # a, e--i--o--u

# Replace *all* instances of "--" with  ", "
print(stringr::str_replace_all(vowels, pattern = "--", replacement = ", "))  # a, e, i, o, u
[1] "a, e--i--o--u"
[1] "a, e, i, o, u"

You may wonder why the parameter name ‘pattern’ is used for the string we want to replace. The reason is that we actually need to supply a regular expression to match on. We’ll come back to this.

Let’s now clean up the ‘Result’ column by removing the commas:

%%R
experiments <- experiments |>
  dplyr::mutate(Result = stringr::str_replace_all(Result_raw, ",", ""))

experiments |>
  dplyr::select(Result_raw, Result)
# A tibble: 13 × 2
   Result_raw                                        Result                   
   <chr>                                             <chr>                    
 1 H,H,H,T,T,H,H,T,H,H,T,H,H,H,T,T,H,H,T,T           HHHTTHHTHHTHHHTTHHTT     
 2 T,T,H,T,H,T,T,H,T,T,T,T,H,T,H,H,T,H,T,T           TTHTHTTHTTTTHTHHTHTT     
 3 H,T,H,T,T,H,T,H,H,T,H,T,H,T,T,T,H,H,H,T,T,H,T,H,T HTHTTHTHHTHTHTTTHHHTTHTHT
 4 HTHHTHTHTTHTHTHTTT                                HTHHTHTHTTHTHTHTTT       
 5 HHTTTTHHHTTHTHHTHHTTT                             HHTTTTHHHTTHTHHTHHTTT    
 6 THTTTHHTTTH                                       THTTTHHTTTH              
 7 TTTTTHTHTHTTTTTHTTHT                              TTTTTHTHTHTTTTTHTTHT     
 8 HTHTTTTHHHHTTTHHHHHTHH                            HTHTTTTHHHHTTTHHHHHTHH   
 9 TTTTTTHTTTHTH                                     TTTTTTHTTTHTH            
10 HTTTTHHHTHHTTHHHTTTTTHTTH                         HTTTTHHHTHHTTHHHTTTTTHTTH
11 TTTTHHTHHTTHTHHHTHTTTH                            TTTTHHTHHTTHTHHHTHTTTH   
12 HHTHTTTTTHTTTHTHTHT                               HHTHTTTTTHTTTHTHTHT      
13 TTHTTHHHHTHHHHTTTHHT                              TTHTTHHHHTHHHHTTTHHT     

Exercise: replacing strings#

Put the ‘Experiment’ column into a consistent format, so that the experiment is recorded as <date>:part <n> where <date> is a date string and <n> is a number 1, 2, etc.

Splitting and concatenating columns#

With the columns cleaned up, we no longer need the ‘raw’ columns, so let’s remove them:

%%R
experiments <- experiments |>
  dplyr::select(-Experiment_raw, -Technician_raw, -Result_raw)

It would be good to have the ‘Experiment’ column consist of a unique identifier for each experiment. We could just give it an integer label, but let’s do something more sophisticated. Each row is uniquely identifiable from the current ‘Experiment’ string and the name of the Technician. So let’s combine these together to make a new ID string, as follows:

Joe Bloggs  +  2024-01-04:part 2  -->  2024-01-04:2:JoeBloggs

Note that we’ve removed the “part “ string and removed the inner space in the name. How can we accomplish this transformation in smaller steps?

  • First split the experiment label “2024-01-04:part 2” into two pieces: “2024-01-04” and “2”.

  • Remove the space from the name, “Joe Bloggs” to “JoeBloggs”.

  • Paste the results together using colons: “2024-01-04” + “2” + “JoeBloggs” –> “2024-01-04:2:JoeBloggs”.

We know how to do the second step (use str_replace), but steps 1 and 3 are new.

The trick is to do this column-wise in the dataframe. Se we’ll make new columns containing the pieces we want to paste together, and then create a new column by pasting these two columns together.

To perform the splitting, we can use the separate function from the tidyr package (not stringr!) to split a column on a separator. The result is one column for each piece of the split; we thus need to also provide names for the new columns:

tidyr::separate(data,
                col,
                into = c("NewCol1", "NewCol2", ...),
                sep = <string_to_split_on>)

To paste columns back together with a custom separator, we can use the str_c function from stringr:

%%R
stringr::str_c("foo", "bar", "baz", sep = " -- ")  # foo -- bar -- baz
[1] "foo -- bar -- baz"

Exercise: splitting and concatenating columns#

Use tidyr::separate and stringr::str_c to create a unique experiment ID by pasting together:

  • The experiment date

  • The sub-experiment number

  • The technician’s name (without a space)

E.g.

Joe Bloggs  +  2024-01-04:part 2  -->  2024-01-04:2:JoeBloggs

Hint: don’t try to do this all in one go. Instead, build it up step-by-step and use the pipe operator |> join the steps together.

Finding substrings#

How many of the experiments features a run of 5 heads or 5 tails?

We can answer this as follows:

  • For each row, determine whether it contains a run of 5 heads or 5 tails.

  • Filter on the result.

To perform the first step, we can use str_detect from stringr. This simply returns TRUE or FALSE depending on whether it finds a given pattern in a string:

stringr::str_detect(a_string, pattern)

The pattern argument is interpreted as a regular expression. We mentioned this earlier when looking at str_replace / str_replace_all. Regular expressions are a mini-language for specifying matches on strings. We’re not going to go into this much here, but they are useful to know about, especially if you work with text data a lot; check out the great stringr cheat sheet for a summary, or consult the chapter on regular expressions in R for Data Science (2e) for an introduction. Here are some example regular expressions in action, using str_detect to indicate whether the given string matches or not:

%%R
# Match on a string as-is (case sensitive)
stringr::str_detect("Regex is cool.", pattern = "ex") |> print()  # true
stringr::str_detect("Regex is cool.", pattern = "COOL") |> print()  # false

# Match on 1 or more instances
stringr::str_detect("Regex is cool.", pattern = "o+") |> print()  # true
stringr::str_detect("Regex is cool.", pattern = "coolcool+") |> print()  # false

# Regex-special characters need escaping e.g. to match on a period, use \\
stringr::str_detect("Regex is cool.", pattern = "\\.") |> print()  # true

# Match on either one pattern or another with pipe |
stringr::str_detect("Regex is cool.", pattern = "cool|meh") |> print()  # true
stringr::str_detect("Regex is meh.", pattern = "cool|meh") |> print()  # true
stringr::str_detect("Regex is cool.", pattern = "meh|lame") |> print()  # false

# Match at the beginning of the string only with ^
stringr::str_detect("Regex is cool.", pattern = "^Regex") |> print()  # true
stringr::str_detect("Regex is cool.", pattern = "^cool") |> print()  # false
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] TRUE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE

Exercise: finding substrings#

With the help of str_detect and a suitable regular expression, write code to work out how many experiments feature a run of at least 5 straight heads or 5 straight tails.

Working with dates#

We have weather data from the Palmer Station in Antarctica from 1989 - 2019[LTE23], where data are available for each day.

%%R
weather <- read.csv("./data/PalmerStation_Daily_Weather.csv")  # using base R read.csv here!
str(weather)
'data.frame':	12477 obs. of  21 variables:
 $ Date                       : chr  "1989-04-01" "1989-04-02" "1989-04-03" "1989-04-04" ...
 $ Temperature.High..C.       : num  2.8 1.1 -0.6 1.1 -0.6 2.5 -1.4 -0.8 -1 -1.5 ...
 $ Temperature.Low..C.        : num  -1 -2.7 -3.5 -4.4 -2.9 -3.1 -3.2 -4.5 -4 -3.8 ...
 $ Temperature.Average..C.    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Sea.Surface.Temperature..C.: num  NA NA NA NA NA NA NA NA NA NA ...
 $ Sea.Ice..WMO.Code.         : chr  "" "" "" "" ...
 $ Pressure.High..mbar.       : num  1004 998 998 1002 1002 ...
 $ Pressure.Low..mbar.        : num  998 995 995 998 997 ...
 $ Pressure.Average..mbar.    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Wind.Peak..knots.          : num  18 24 13 14 14 15 16 39 27 16 ...
 $ Wind.5.Sec.Peak..knots.    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Wind.2.Min.Peak..knots.    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Wind.Average..knots.       : num  4 9 8 6 4 4 8 14 11 6 ...
 $ Wind.Peak.Direction        : num  110 30 60 210 230 180 40 120 50 60 ...
 $ Wind.5.Sec.Peak.Direction  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Wind.2.Min.Peak.Direction  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ Wind.Prevailing.Direction  : chr  "SW" "NE" "NE" "NW" ...
 $ Precipitation.Melted..mm.  : chr  "0.0" "0.0" "0.0" "0.0" ...
 $ Precipitation.Snow..cm.    : chr  "0" "0" "0" "0" ...
 $ Depth.at.Snowstake..cm.    : chr  "" "" "" "" ...
 $ Sky.Coverage..tenths.      : num  4 2 5 5 10 2 9 5 0 10 ...

However, the structure of the data above indicates that the first variable ‘Date’ is a ‘character’. This means that R is not understanding these as representing dates, and so it may be hard to manipulate them any further (e.g. extracting certain years).

%%R
head(weather$Date)
class(weather$Date)
[1] "character"

(Note: if we’d used read_csv from the readr package then these would automatically have been coerced into dates. We’re avoiding this for now to demonstrate how to convert raw strings into dates.)

We’ll use the lubridate package (loaded in when we load in tidyverse) to convert these variables into dates and create a column for the year.

Converting from strings#

The lubridate package has very intuitive function names and argument structures. For instance, you can convert a string into a date type by using a function with letters representing the date format. In lubridate, y = year, m = month, d = day, h = hour, m = minute, s = second, etc.

(You can see all of these formats by using the auto-complete box in RStudio, e.g. to see all functions starting with ‘y’, begin typing lubridate::y.)

We can see that our data is formatted in the year-month-day format, and so we can use the ymd function to indicate to R what format we want it to detect as it converts characters to dates. If we use this function and assign the output to a date vector, we can then evaluate the class to see that it is a date.

%%R
date_vector <- lubridate::ymd(weather$Date)
class(date_vector)
[1] "Date"

Remember that manipulating a column does not mean it automatically saves into the dataframe. Instead, we need to explicitly overwrite our dataframe with a new dataframe that contains the modified column. We can use the mutate function to achieve this:

%%R
# Change the Date column to be a date type
weather <- weather |>
  dplyr::mutate(Date = lubridate::ymd(Date))

class(weather$Date)
[1] "Date"

Accessing timestamp components#

Now that R recognizes these as dates, we can extract certain date-related features, such as the year, month, and day. The lubridate package has functions conveniently named year, month, day that can be used with a date argument, and that element of the date will be extracted. For example:

%%R
# Extract the year
weather <- weather |>
  dplyr::mutate(Year = lubridate::year(Date))

head(weather) |>
  dplyr::select(Date, Year)
        Date Year
1 1989-04-01 1989
2 1989-04-02 1989
3 1989-04-03 1989
4 1989-04-04 1989
5 1989-04-05 1989
6 1989-04-06 1989

Exercise: working with dates#

Create a summary of the weather data that gives the maximum, minimum, and mean of the average temperatures for each month of the year.

Acknowledgement#

The material in this notebook is adapted from Eliza Wood’s Tidyverse: Data wrangling & visualization course, which is licensed under Creative Commons BY-NC-SA 4.0. This in itself is based on material from UC Davis’s R-DAVIS course, which draws heavily on Carpentries R lessons.