Manipulating Datasets#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Download manipulating_data.Rmd

Often we need to manipulate or extract parts of our dataset prior to doing any analysis or plotting with it.

Learning Objectives#

  • Learn how to subset a dataset using conditional subsetting.

  • Understand different methods for comparing and matching datasets in R.

  • Learn how to combine one and two dimensional datasets using paste(), cbind(), and rbind().

  • Learn how to rename, add and remove rows and/or columns in a two dimensional dataset.

  • Learn how to use the seq() function to generate regular sequences of numbers.

Conditional Subsetting#

We have already looked at slicing subsets, where we knew the indexes of the rows or columns of the entries we wanted. There may be times when, instead, we want to select rows based on a specific condition. This would require a conditional statement. Conditional commands check if criteria are met and return either TRUE or FALSE in response.

Let’s find which rows of iris have a Sepal.Length less than 7.

iris$Sepal.Length < 6
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. TRUE
  9. TRUE
  10. TRUE
  11. TRUE
  12. TRUE
  13. TRUE
  14. TRUE
  15. TRUE
  16. TRUE
  17. TRUE
  18. TRUE
  19. TRUE
  20. TRUE
  21. TRUE
  22. TRUE
  23. TRUE
  24. TRUE
  25. TRUE
  26. TRUE
  27. TRUE
  28. TRUE
  29. TRUE
  30. TRUE
  31. TRUE
  32. TRUE
  33. TRUE
  34. TRUE
  35. TRUE
  36. TRUE
  37. TRUE
  38. TRUE
  39. TRUE
  40. TRUE
  41. TRUE
  42. TRUE
  43. TRUE
  44. TRUE
  45. TRUE
  46. TRUE
  47. TRUE
  48. TRUE
  49. TRUE
  50. TRUE
  51. FALSE
  52. FALSE
  53. FALSE
  54. TRUE
  55. FALSE
  56. TRUE
  57. FALSE
  58. TRUE
  59. FALSE
  60. TRUE
  61. TRUE
  62. TRUE
  63. FALSE
  64. FALSE
  65. TRUE
  66. FALSE
  67. TRUE
  68. TRUE
  69. FALSE
  70. TRUE
  71. TRUE
  72. FALSE
  73. FALSE
  74. FALSE
  75. FALSE
  76. FALSE
  77. FALSE
  78. FALSE
  79. FALSE
  80. TRUE
  81. TRUE
  82. TRUE
  83. TRUE
  84. FALSE
  85. TRUE
  86. FALSE
  87. FALSE
  88. FALSE
  89. TRUE
  90. TRUE
  91. TRUE
  92. FALSE
  93. TRUE
  94. TRUE
  95. TRUE
  96. TRUE
  97. TRUE
  98. FALSE
  99. TRUE
  100. TRUE
  101. FALSE
  102. TRUE
  103. FALSE
  104. FALSE
  105. FALSE
  106. FALSE
  107. TRUE
  108. FALSE
  109. FALSE
  110. FALSE
  111. FALSE
  112. FALSE
  113. FALSE
  114. TRUE
  115. TRUE
  116. FALSE
  117. FALSE
  118. FALSE
  119. FALSE
  120. FALSE
  121. FALSE
  122. TRUE
  123. FALSE
  124. FALSE
  125. FALSE
  126. FALSE
  127. FALSE
  128. FALSE
  129. FALSE
  130. FALSE
  131. FALSE
  132. FALSE
  133. FALSE
  134. FALSE
  135. FALSE
  136. FALSE
  137. FALSE
  138. FALSE
  139. FALSE
  140. FALSE
  141. FALSE
  142. FALSE
  143. TRUE
  144. FALSE
  145. FALSE
  146. FALSE
  147. FALSE
  148. FALSE
  149. FALSE
  150. TRUE

Where is says TRUE means the criteria have been met and FALSE not. We can use this to subset the rows of iris

iris[iris$Sepal.Length < 6,]
A data.frame: 83 × 5
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.83.01.40.1setosa
144.33.01.10.1setosa
155.84.01.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.61.00.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
265.03.01.60.2setosa
275.03.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa
605.22.73.91.4versicolor
615.02.03.51.0versicolor
625.93.04.21.5versicolor
655.62.93.61.3versicolor
675.63.04.51.5versicolor
685.82.74.11.0versicolor
705.62.53.91.1versicolor
715.93.24.81.8versicolor
805.72.63.51.0versicolor
815.52.43.81.1versicolor
825.52.43.71.0versicolor
835.82.73.91.2versicolor
855.43.04.51.5versicolor
895.63.04.11.3versicolor
905.52.54.01.3versicolor
915.52.64.41.2versicolor
935.82.64.01.2versicolor
945.02.33.31.0versicolor
955.62.74.21.3versicolor
965.73.04.21.2versicolor
975.72.94.21.3versicolor
995.12.53.01.1versicolor
1005.72.84.11.3versicolor
1025.82.75.11.9virginica
1074.92.54.51.7virginica
1145.72.55.02.0virginica
1155.82.85.12.4virginica
1225.62.84.92.0virginica
1435.82.75.11.9virginica
1505.93.05.11.8virginica

Matching#

There are several circumstances we will need to check for matching and use that information. There are several ways we can do this using R depending on what we need.

Using identical(), we can check if values or collections of values are identical.

# Checking if the first and second row values in the "Species" column of iris are identical
identical(iris$Species[1], iris$Species[2])
TRUE
# Checking if the first and 51st row values in the "Species" column of iris are identical
identical(iris$Species[1], iris$Species[51])
FALSE

Using all.equal() is similar to identical(), but allows for some tolerance in how similar values can be. For example, we may want to check two numbers with lots of decimal places, but only need them to be within 0.01 of each other. Therefore, we can give a tolerance of 0.01.

x1 <- 1.232529
x2 <- 1.23366
all.equal(x1, x2, tolerance=0.01)
TRUE
all.equal(x1, x2, tolerance=0.0001)
'Mean relative difference: 0.0009176255'

We can use “==” as a selector to pull all matching entries. We can give a numeric value or a character in quotations.

iris[iris$Species == "setosa", ]
A data.frame: 50 × 5
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.83.01.40.1setosa
144.33.01.10.1setosa
155.84.01.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.61.00.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
265.03.01.60.2setosa
275.03.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa
314.83.11.60.2setosa
325.43.41.50.4setosa
335.24.11.50.1setosa
345.54.21.40.2setosa
354.93.11.50.2setosa
365.03.21.20.2setosa
375.53.51.30.2setosa
384.93.61.40.1setosa
394.43.01.30.2setosa
405.13.41.50.2setosa
415.03.51.30.3setosa
424.52.31.30.3setosa
434.43.21.30.2setosa
445.03.51.60.6setosa
455.13.81.90.4setosa
464.83.01.40.3setosa
475.13.81.60.2setosa
484.63.21.40.2setosa
495.33.71.50.2setosa
505.03.31.40.2setosa

We can use objects or parts of objects to select rows and columns within [ ] using “%in%”.

select <- "versicolor"
iris[iris$Species %in% select, ]
A data.frame: 50 × 5
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
517.03.24.71.4versicolor
526.43.24.51.5versicolor
536.93.14.91.5versicolor
545.52.34.01.3versicolor
556.52.84.61.5versicolor
565.72.84.51.3versicolor
576.33.34.71.6versicolor
584.92.43.31.0versicolor
596.62.94.61.3versicolor
605.22.73.91.4versicolor
615.02.03.51.0versicolor
625.93.04.21.5versicolor
636.02.24.01.0versicolor
646.12.94.71.4versicolor
655.62.93.61.3versicolor
666.73.14.41.4versicolor
675.63.04.51.5versicolor
685.82.74.11.0versicolor
696.22.24.51.5versicolor
705.62.53.91.1versicolor
715.93.24.81.8versicolor
726.12.84.01.3versicolor
736.32.54.91.5versicolor
746.12.84.71.2versicolor
756.42.94.31.3versicolor
766.63.04.41.4versicolor
776.82.84.81.4versicolor
786.73.05.01.7versicolor
796.02.94.51.5versicolor
805.72.63.51.0versicolor
815.52.43.81.1versicolor
825.52.43.71.0versicolor
835.82.73.91.2versicolor
846.02.75.11.6versicolor
855.43.04.51.5versicolor
866.03.44.51.6versicolor
876.73.14.71.5versicolor
886.32.34.41.3versicolor
895.63.04.11.3versicolor
905.52.54.01.3versicolor
915.52.64.41.2versicolor
926.13.04.61.4versicolor
935.82.64.01.2versicolor
945.02.33.31.0versicolor
955.62.74.21.3versicolor
965.73.04.21.2versicolor
975.72.94.21.3versicolor
986.22.94.31.3versicolor
995.12.53.01.1versicolor
1005.72.84.11.3versicolor

Merging and Binding#

We will often need to bring multiple two-dimensional objects together. We can do this multiple ways.

Using rbind() and cbind(), we can combine objects together. rbind() allows us to bind together rows.

# First we look at the dimension of "iris"
dim(iris)
  1. 150
  2. 5
# Using rbind() to put together two copies of  causing double rows
rbind(iris, iris) -> iris.r
dim(iris.r)
  1. 300
  2. 5
iris.r[ ,1]
  1. 5.1
  2. 4.9
  3. 4.7
  4. 4.6
  5. 5
  6. 5.4
  7. 4.6
  8. 5
  9. 4.4
  10. 4.9
  11. 5.4
  12. 4.8
  13. 4.8
  14. 4.3
  15. 5.8
  16. 5.7
  17. 5.4
  18. 5.1
  19. 5.7
  20. 5.1
  21. 5.4
  22. 5.1
  23. 4.6
  24. 5.1
  25. 4.8
  26. 5
  27. 5
  28. 5.2
  29. 5.2
  30. 4.7
  31. 4.8
  32. 5.4
  33. 5.2
  34. 5.5
  35. 4.9
  36. 5
  37. 5.5
  38. 4.9
  39. 4.4
  40. 5.1
  41. 5
  42. 4.5
  43. 4.4
  44. 5
  45. 5.1
  46. 4.8
  47. 5.1
  48. 4.6
  49. 5.3
  50. 5
  51. 7
  52. 6.4
  53. 6.9
  54. 5.5
  55. 6.5
  56. 5.7
  57. 6.3
  58. 4.9
  59. 6.6
  60. 5.2
  61. 5
  62. 5.9
  63. 6
  64. 6.1
  65. 5.6
  66. 6.7
  67. 5.6
  68. 5.8
  69. 6.2
  70. 5.6
  71. 5.9
  72. 6.1
  73. 6.3
  74. 6.1
  75. 6.4
  76. 6.6
  77. 6.8
  78. 6.7
  79. 6
  80. 5.7
  81. 5.5
  82. 5.5
  83. 5.8
  84. 6
  85. 5.4
  86. 6
  87. 6.7
  88. 6.3
  89. 5.6
  90. 5.5
  91. 5.5
  92. 6.1
  93. 5.8
  94. 5
  95. 5.6
  96. 5.7
  97. 5.7
  98. 6.2
  99. 5.1
  100. 5.7
  101. 6.3
  102. 5.8
  103. 7.1
  104. 6.3
  105. 6.5
  106. 7.6
  107. 4.9
  108. 7.3
  109. 6.7
  110. 7.2
  111. 6.5
  112. 6.4
  113. 6.8
  114. 5.7
  115. 5.8
  116. 6.4
  117. 6.5
  118. 7.7
  119. 7.7
  120. 6
  121. 6.9
  122. 5.6
  123. 7.7
  124. 6.3
  125. 6.7
  126. 7.2
  127. 6.2
  128. 6.1
  129. 6.4
  130. 7.2
  131. 7.4
  132. 7.9
  133. 6.4
  134. 6.3
  135. 6.1
  136. 7.7
  137. 6.3
  138. 6.4
  139. 6
  140. 6.9
  141. 6.7
  142. 6.9
  143. 5.8
  144. 6.8
  145. 6.7
  146. 6.7
  147. 6.3
  148. 6.5
  149. 6.2
  150. 5.9
  151. 5.1
  152. 4.9
  153. 4.7
  154. 4.6
  155. 5
  156. 5.4
  157. 4.6
  158. 5
  159. 4.4
  160. 4.9
  161. 5.4
  162. 4.8
  163. 4.8
  164. 4.3
  165. 5.8
  166. 5.7
  167. 5.4
  168. 5.1
  169. 5.7
  170. 5.1
  171. 5.4
  172. 5.1
  173. 4.6
  174. 5.1
  175. 4.8
  176. 5
  177. 5
  178. 5.2
  179. 5.2
  180. 4.7
  181. 4.8
  182. 5.4
  183. 5.2
  184. 5.5
  185. 4.9
  186. 5
  187. 5.5
  188. 4.9
  189. 4.4
  190. 5.1
  191. 5
  192. 4.5
  193. 4.4
  194. 5
  195. 5.1
  196. 4.8
  197. 5.1
  198. 4.6
  199. 5.3
  200. 5
  201. 7
  202. 6.4
  203. 6.9
  204. 5.5
  205. 6.5
  206. 5.7
  207. 6.3
  208. 4.9
  209. 6.6
  210. 5.2
  211. 5
  212. 5.9
  213. 6
  214. 6.1
  215. 5.6
  216. 6.7
  217. 5.6
  218. 5.8
  219. 6.2
  220. 5.6
  221. 5.9
  222. 6.1
  223. 6.3
  224. 6.1
  225. 6.4
  226. 6.6
  227. 6.8
  228. 6.7
  229. 6
  230. 5.7
  231. 5.5
  232. 5.5
  233. 5.8
  234. 6
  235. 5.4
  236. 6
  237. 6.7
  238. 6.3
  239. 5.6
  240. 5.5
  241. 5.5
  242. 6.1
  243. 5.8
  244. 5
  245. 5.6
  246. 5.7
  247. 5.7
  248. 6.2
  249. 5.1
  250. 5.7
  251. 6.3
  252. 5.8
  253. 7.1
  254. 6.3
  255. 6.5
  256. 7.6
  257. 4.9
  258. 7.3
  259. 6.7
  260. 7.2
  261. 6.5
  262. 6.4
  263. 6.8
  264. 5.7
  265. 5.8
  266. 6.4
  267. 6.5
  268. 7.7
  269. 7.7
  270. 6
  271. 6.9
  272. 5.6
  273. 7.7
  274. 6.3
  275. 6.7
  276. 7.2
  277. 6.2
  278. 6.1
  279. 6.4
  280. 7.2
  281. 7.4
  282. 7.9
  283. 6.4
  284. 6.3
  285. 6.1
  286. 7.7
  287. 6.3
  288. 6.4
  289. 6
  290. 6.9
  291. 6.7
  292. 6.9
  293. 5.8
  294. 6.8
  295. 6.7
  296. 6.7
  297. 6.3
  298. 6.5
  299. 6.2
  300. 5.9

cbind() allows us to bind together columns.

# Using cbind() to put together two copies of iris causing double columns
cbind(iris, iris) -> iris.c
dim(iris.c)
  1. 150
  2. 10
iris.c[1,]
A data.frame: 1 × 10
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesSepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct><dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa5.13.51.40.2setosa

Both rbind and cbind will work, if you try to combine vectors of different lengths. In this case it will recycle the shorted vector until it matches the length of the longer vector. You will get a warning in this situation.

Using merge(), we can merge objects together assigning what we bind by using “by =”. For example, we can bind using the rownames of our objects using “by = row.names”, we can merge by a specific column present in both objects (e.g. by = “Name”), or different columns in each object (by.x = “Species”, by.y = “Name”).

# Merging iris by row names
merge(iris, iris, by = "row.names") -> iris.double
dim(iris.double)
  1. 150
  2. 11
iris.double[1, ]
A data.frame: 1 × 11
Row.namesSepal.Length.xSepal.Width.xPetal.Length.xPetal.Width.xSpecies.xSepal.Length.ySepal.Width.yPetal.Length.yPetal.Width.ySpecies.y
<I<chr>><dbl><dbl><dbl><dbl><fct><dbl><dbl><dbl><dbl><fct>
115.13.51.40.2setosa5.13.51.40.2setosa

Activity#

Create two objects, one containing numbers 1-10, one containing numbers 11-20

  • Bind them together to make an object of two rows, row 1 being 1:10, row 2 being 11-20

  • Bind them together to make an object of two columns, columns 1 being 11-20, column 2 being 1-10

Paste#

The function paste() is a way of concatenating vectors together. It can be applied to a characters or numbers, vector and column(s) of a data frame or matrix. You can define what you want the separator to be (sep =), or use paste0() or paste() with the argument sep = “” for no space. You can also provide a string as an argument to add the same component to a character or vector.

# Adding text to a value in iris
paste("Species", iris$Species[1])
'Species setosa'
# Pasting together two columns of iris
paste(iris$Species[1:10], iris$Sepal.Length[1:10], sep = ":")
  1. 'setosa:5.1'
  2. 'setosa:4.9'
  3. 'setosa:4.7'
  4. 'setosa:4.6'
  5. 'setosa:5'
  6. 'setosa:5.4'
  7. 'setosa:4.6'
  8. 'setosa:5'
  9. 'setosa:4.4'
  10. 'setosa:4.9'

Renaming columns and rows#

By using rownames() and colnames(), we can look at what the rownames and colnames of an object are. We can also use this to replace the rownames and colnames of the object by assigning using <-.

# Renaming the colnames in iris
iris -> iris2
colnames(iris2)
  1. 'Sepal.Length'
  2. 'Sepal.Width'
  3. 'Petal.Length'
  4. 'Petal.Width'
  5. 'Species'
colnames(iris2) <- c("S.Length", "S.Width", "P.Length", "P.Width", "Type")
colnames(iris2)
  1. 'S.Length'
  2. 'S.Width'
  3. 'P.Length'
  4. 'P.Width'
  5. 'Type'

Adding and removing variables#

Adding data to your objects can be very useful. Adding an extra column is very easy using the assignment operator and giving the new column a name.

# Adding a new column
iris -> iris2
iris2$new.column <- 1:nrow(iris2)

head(iris2)
A data.frame: 6 × 6
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesnew.column
<dbl><dbl><dbl><dbl><fct><int>
15.13.51.40.2setosa1
24.93.01.40.2setosa2
34.73.21.30.2setosa3
44.63.11.50.2setosa4
55.03.61.40.2setosa5
65.43.91.70.4setosa6

Removing a column can be done by assigning the relevant column the “NULL” value.

# Removing a column
iris2$new.column <- NULL

Generating a sequence of numbers#

To generate regular sequences, we can use seq(). We provide it a value to start from (from =), and where to end (to =) and then a value to increase by (by =).

# Create a sequence from 0 to 100 increasing by 5 each time
seq(from = 0, to = 100, by = 5)
  1. 0
  2. 5
  3. 10
  4. 15
  5. 20
  6. 25
  7. 30
  8. 35
  9. 40
  10. 45
  11. 50
  12. 55
  13. 60
  14. 65
  15. 70
  16. 75
  17. 80
  18. 85
  19. 90
  20. 95
  21. 100

Activity#

Create a copy of iris

  • Rename the columns of iris by prefixing with the word “flower” and the separator “_”

  • In your copy, duplicate the Species column

  • Add a column to your copy which contains the numbers from 4 to 600 increasing by 4 each time.

Summary Quiz#

What is the purpose of conditional subsetting in R?

Which function is used to concatenate strings in R?

How can you combine two datasets by rows in R?