Manipulating Datasets#

Download Rmd Version#

If you wish to engage with this course content via Rmd, then please click the link below to download the Rmd file.

Download manipulating_data.Rmd

Often we need to manipulate or extract parts of our dataset prior to doing any analysis or plotting with it.

Learning Objectives#

  • Learn how to subset a dataset using conditional subsetting.

  • Understand different methods for comparing and matching datasets in R.

  • Learn how to combine one and two dimensional datasets using paste(), cbind(), and rbind().

  • Learn how to rename, add and remove rows and/or columns in a two dimensional dataset.

  • Learn how to use the seq() function to generate regular sequences of numbers.

Conditional Subsetting#

We have already looked at slicing subsets, where we knew the indexes of the rows or columns of the entries we wanted. There may be times when, instead, we want to select rows based on a specific condition. This would require a conditional statement. Conditional commands check if criteria are met and return either TRUE or FALSE in response.

Let’s find which rows of iris have a Sepal.Length less than 6.

iris$Sepal.Length < 6
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. TRUE
  9. TRUE
  10. TRUE
  11. TRUE
  12. TRUE
  13. TRUE
  14. TRUE
  15. TRUE
  16. TRUE
  17. TRUE
  18. TRUE
  19. TRUE
  20. TRUE
  21. TRUE
  22. TRUE
  23. TRUE
  24. TRUE
  25. TRUE
  26. TRUE
  27. TRUE
  28. TRUE
  29. TRUE
  30. TRUE
  31. TRUE
  32. TRUE
  33. TRUE
  34. TRUE
  35. TRUE
  36. TRUE
  37. TRUE
  38. TRUE
  39. TRUE
  40. TRUE
  41. TRUE
  42. TRUE
  43. TRUE
  44. TRUE
  45. TRUE
  46. TRUE
  47. TRUE
  48. TRUE
  49. TRUE
  50. TRUE
  51. FALSE
  52. TRUE
  53. TRUE
  54. TRUE
  55. TRUE
  56. TRUE
  57. TRUE
  58. TRUE
  59. TRUE
  60. TRUE
  61. TRUE
  62. TRUE
  63. TRUE
  64. TRUE
  65. TRUE
  66. TRUE
  67. TRUE
  68. TRUE
  69. TRUE
  70. TRUE
  71. TRUE
  72. TRUE
  73. TRUE
  74. TRUE
  75. TRUE
  76. TRUE
  77. TRUE
  78. TRUE
  79. TRUE
  80. TRUE
  81. TRUE
  82. TRUE
  83. TRUE
  84. TRUE
  85. TRUE
  86. TRUE
  87. TRUE
  88. TRUE
  89. TRUE
  90. TRUE
  91. TRUE
  92. TRUE
  93. TRUE
  94. TRUE
  95. TRUE
  96. TRUE
  97. TRUE
  98. TRUE
  99. TRUE
  100. TRUE
  101. TRUE
  102. TRUE
  103. FALSE
  104. TRUE
  105. TRUE
  106. FALSE
  107. TRUE
  108. FALSE
  109. TRUE
  110. FALSE
  111. TRUE
  112. TRUE
  113. TRUE
  114. TRUE
  115. TRUE
  116. TRUE
  117. TRUE
  118. FALSE
  119. FALSE
  120. TRUE
  121. TRUE
  122. TRUE
  123. FALSE
  124. TRUE
  125. TRUE
  126. FALSE
  127. TRUE
  128. TRUE
  129. TRUE
  130. FALSE
  131. FALSE
  132. FALSE
  133. TRUE
  134. TRUE
  135. TRUE
  136. FALSE
  137. TRUE
  138. TRUE
  139. TRUE
  140. TRUE
  141. TRUE
  142. TRUE
  143. TRUE
  144. TRUE
  145. TRUE
  146. TRUE
  147. TRUE
  148. TRUE
  149. TRUE
  150. TRUE

Where is says TRUE means the criteria have been met and FALSE not. We can use this to subset the rows of iris

iris[iris$Sepal.Length < 6,]
A data.frame: 83 × 5
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.83.01.40.1setosa
144.33.01.10.1setosa
155.84.01.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.61.00.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
265.03.01.60.2setosa
275.03.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa
605.22.73.91.4versicolor
615.02.03.51.0versicolor
625.93.04.21.5versicolor
655.62.93.61.3versicolor
675.63.04.51.5versicolor
685.82.74.11.0versicolor
705.62.53.91.1versicolor
715.93.24.81.8versicolor
805.72.63.51.0versicolor
815.52.43.81.1versicolor
825.52.43.71.0versicolor
835.82.73.91.2versicolor
855.43.04.51.5versicolor
895.63.04.11.3versicolor
905.52.54.01.3versicolor
915.52.64.41.2versicolor
935.82.64.01.2versicolor
945.02.33.31.0versicolor
955.62.74.21.3versicolor
965.73.04.21.2versicolor
975.72.94.21.3versicolor
995.12.53.01.1versicolor
1005.72.84.11.3versicolor
1025.82.75.11.9virginica
1074.92.54.51.7virginica
1145.72.55.02.0virginica
1155.82.85.12.4virginica
1225.62.84.92.0virginica
1435.82.75.11.9virginica
1505.93.05.11.8virginica

Matching#

There are several circumstances we will need to check for matching and use that information. There are several ways we can do this using R depending on what we need.

Using identical(), we can check if values or collections of values are identical.

# Checking if the first and second row values in the "Species" column of iris are identical
identical(iris$Species[1], iris$Species[2])
TRUE
# Checking if the first and 51st row values in the "Species" column of iris are identical
identical(iris$Species[1], iris$Species[51])
FALSE

Using all.equal() is similar to identical(), but allows for some tolerance in how similar values can be. For example, we may want to check two numbers with lots of decimal places, but only need them to be within 0.01 of each other. Therefore, we can give a tolerance of 0.01.

x1 <- 1.232529
x2 <- 1.23366
all.equal(x1, x2, tolerance=0.01)
TRUE
all.equal(x1, x2, tolerance=0.0001)
'Mean relative difference: 0.0009176255'

We can use “==” as a selector to pull all matching entries. We can give a numeric value or a character in quotations.

iris[iris$Species == "setosa", ]
A data.frame: 50 × 5
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
115.43.71.50.2setosa
124.83.41.60.2setosa
134.83.01.40.1setosa
144.33.01.10.1setosa
155.84.01.20.2setosa
165.74.41.50.4setosa
175.43.91.30.4setosa
185.13.51.40.3setosa
195.73.81.70.3setosa
205.13.81.50.3setosa
215.43.41.70.2setosa
225.13.71.50.4setosa
234.63.61.00.2setosa
245.13.31.70.5setosa
254.83.41.90.2setosa
265.03.01.60.2setosa
275.03.41.60.4setosa
285.23.51.50.2setosa
295.23.41.40.2setosa
304.73.21.60.2setosa
314.83.11.60.2setosa
325.43.41.50.4setosa
335.24.11.50.1setosa
345.54.21.40.2setosa
354.93.11.50.2setosa
365.03.21.20.2setosa
375.53.51.30.2setosa
384.93.61.40.1setosa
394.43.01.30.2setosa
405.13.41.50.2setosa
415.03.51.30.3setosa
424.52.31.30.3setosa
434.43.21.30.2setosa
445.03.51.60.6setosa
455.13.81.90.4setosa
464.83.01.40.3setosa
475.13.81.60.2setosa
484.63.21.40.2setosa
495.33.71.50.2setosa
505.03.31.40.2setosa

Merging and Binding#

We will often need to bring multiple two-dimensional objects together. We can do this multiple ways.

Using rbind() and cbind(), we can combine objects together. rbind() allows us to bind together rows.

# First we look at the dimension of "iris"
dim(iris)
  1. 150
  2. 5
# Using rbind() to put together two copies of  causing double rows
iris.r <- rbind(iris, iris)
dim(iris.r)
  1. 300
  2. 5
iris.r[ ,1]
  1. 5.1
  2. 4.9
  3. 4.7
  4. 4.6
  5. 5
  6. 5.4
  7. 4.6
  8. 5
  9. 4.4
  10. 4.9
  11. 5.4
  12. 4.8
  13. 4.8
  14. 4.3
  15. 5.8
  16. 5.7
  17. 5.4
  18. 5.1
  19. 5.7
  20. 5.1
  21. 5.4
  22. 5.1
  23. 4.6
  24. 5.1
  25. 4.8
  26. 5
  27. 5
  28. 5.2
  29. 5.2
  30. 4.7
  31. 4.8
  32. 5.4
  33. 5.2
  34. 5.5
  35. 4.9
  36. 5
  37. 5.5
  38. 4.9
  39. 4.4
  40. 5.1
  41. 5
  42. 4.5
  43. 4.4
  44. 5
  45. 5.1
  46. 4.8
  47. 5.1
  48. 4.6
  49. 5.3
  50. 5
  51. 7
  52. 6.4
  53. 6.9
  54. 5.5
  55. 6.5
  56. 5.7
  57. 6.3
  58. 4.9
  59. 6.6
  60. 5.2
  61. 5
  62. 5.9
  63. 6
  64. 6.1
  65. 5.6
  66. 6.7
  67. 5.6
  68. 5.8
  69. 6.2
  70. 5.6
  71. 5.9
  72. 6.1
  73. 6.3
  74. 6.1
  75. 6.4
  76. 6.6
  77. 6.8
  78. 6.7
  79. 6
  80. 5.7
  81. 5.5
  82. 5.5
  83. 5.8
  84. 6
  85. 5.4
  86. 6
  87. 6.7
  88. 6.3
  89. 5.6
  90. 5.5
  91. 5.5
  92. 6.1
  93. 5.8
  94. 5
  95. 5.6
  96. 5.7
  97. 5.7
  98. 6.2
  99. 5.1
  100. 5.7
  101. 6.3
  102. 5.8
  103. 7.1
  104. 6.3
  105. 6.5
  106. 7.6
  107. 4.9
  108. 7.3
  109. 6.7
  110. 7.2
  111. 6.5
  112. 6.4
  113. 6.8
  114. 5.7
  115. 5.8
  116. 6.4
  117. 6.5
  118. 7.7
  119. 7.7
  120. 6
  121. 6.9
  122. 5.6
  123. 7.7
  124. 6.3
  125. 6.7
  126. 7.2
  127. 6.2
  128. 6.1
  129. 6.4
  130. 7.2
  131. 7.4
  132. 7.9
  133. 6.4
  134. 6.3
  135. 6.1
  136. 7.7
  137. 6.3
  138. 6.4
  139. 6
  140. 6.9
  141. 6.7
  142. 6.9
  143. 5.8
  144. 6.8
  145. 6.7
  146. 6.7
  147. 6.3
  148. 6.5
  149. 6.2
  150. 5.9
  151. 5.1
  152. 4.9
  153. 4.7
  154. 4.6
  155. 5
  156. 5.4
  157. 4.6
  158. 5
  159. 4.4
  160. 4.9
  161. 5.4
  162. 4.8
  163. 4.8
  164. 4.3
  165. 5.8
  166. 5.7
  167. 5.4
  168. 5.1
  169. 5.7
  170. 5.1
  171. 5.4
  172. 5.1
  173. 4.6
  174. 5.1
  175. 4.8
  176. 5
  177. 5
  178. 5.2
  179. 5.2
  180. 4.7
  181. 4.8
  182. 5.4
  183. 5.2
  184. 5.5
  185. 4.9
  186. 5
  187. 5.5
  188. 4.9
  189. 4.4
  190. 5.1
  191. 5
  192. 4.5
  193. 4.4
  194. 5
  195. 5.1
  196. 4.8
  197. 5.1
  198. 4.6
  199. 5.3
  200. 5
  201. 7
  202. 6.4
  203. 6.9
  204. 5.5
  205. 6.5
  206. 5.7
  207. 6.3
  208. 4.9
  209. 6.6
  210. 5.2
  211. 5
  212. 5.9
  213. 6
  214. 6.1
  215. 5.6
  216. 6.7
  217. 5.6
  218. 5.8
  219. 6.2
  220. 5.6
  221. 5.9
  222. 6.1
  223. 6.3
  224. 6.1
  225. 6.4
  226. 6.6
  227. 6.8
  228. 6.7
  229. 6
  230. 5.7
  231. 5.5
  232. 5.5
  233. 5.8
  234. 6
  235. 5.4
  236. 6
  237. 6.7
  238. 6.3
  239. 5.6
  240. 5.5
  241. 5.5
  242. 6.1
  243. 5.8
  244. 5
  245. 5.6
  246. 5.7
  247. 5.7
  248. 6.2
  249. 5.1
  250. 5.7
  251. 6.3
  252. 5.8
  253. 7.1
  254. 6.3
  255. 6.5
  256. 7.6
  257. 4.9
  258. 7.3
  259. 6.7
  260. 7.2
  261. 6.5
  262. 6.4
  263. 6.8
  264. 5.7
  265. 5.8
  266. 6.4
  267. 6.5
  268. 7.7
  269. 7.7
  270. 6
  271. 6.9
  272. 5.6
  273. 7.7
  274. 6.3
  275. 6.7
  276. 7.2
  277. 6.2
  278. 6.1
  279. 6.4
  280. 7.2
  281. 7.4
  282. 7.9
  283. 6.4
  284. 6.3
  285. 6.1
  286. 7.7
  287. 6.3
  288. 6.4
  289. 6
  290. 6.9
  291. 6.7
  292. 6.9
  293. 5.8
  294. 6.8
  295. 6.7
  296. 6.7
  297. 6.3
  298. 6.5
  299. 6.2
  300. 5.9

cbind() allows us to bind together columns.

# Using cbind() to put together two copies of iris causing double columns
iris.c <- cbind(iris, iris)
dim(iris.c)
  1. 150
  2. 10
iris.c[1,]
A data.frame: 1 × 10
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesSepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct><dbl><dbl><dbl><dbl><fct>
15.13.51.40.2setosa5.13.51.40.2setosa

Both rbind and cbind will work, if you try to combine vectors of different lengths. In this case it will recycle the shortened vector until it matches the length of the longer vector. You will get a warning in this situation.

Using merge(), we can merge objects together assigning what we bind by using “by =”. For example, we can bind using the rownames of our objects using “by = row.names”, we can merge by a specific column present in both objects (e.g. by = “Name”), or different columns in each object (by.x = “Species”, by.y = “Name”).

# Merging iris by row names
iris.double <- merge(iris, iris, by = "row.names") 
dim(iris.double)
  1. 150
  2. 11
iris.double[1, ]
A data.frame: 1 × 11
Row.namesSepal.Length.xSepal.Width.xPetal.Length.xPetal.Width.xSpecies.xSepal.Length.ySepal.Width.yPetal.Length.yPetal.Width.ySpecies.y
<I<chr>><dbl><dbl><dbl><dbl><fct><dbl><dbl><dbl><dbl><fct>
115.13.51.40.2setosa5.13.51.40.2setosa

Activity#

Create two objects, one containing numbers 1-10, one containing numbers 11-20

  • Bind them together to make an object of two rows, row 1 being 1:10, row 2 being 11-20

  • Bind them together to make an object of two columns, column 1 being 11-20, column 2 being 1-10

Paste#

The function paste() is a way of concatenating vectors together. It can be applied to characters or numbers, vectors and column(s) of a data frame or matrix. You can define what you want the separator to be (sep =), or use paste0() or paste() with the argument sep = “” for no spaces. You can also provide a string as an argument to add the same component to a character or vector.

# Adding text to a value in iris
paste("Species", iris$Species[1])
'Species setosa'
# Pasting together two columns of iris
paste(iris$Species[1:10], iris$Sepal.Length[1:10], sep = ":")
  1. 'setosa:5.1'
  2. 'setosa:4.9'
  3. 'setosa:4.7'
  4. 'setosa:4.6'
  5. 'setosa:5'
  6. 'setosa:5.4'
  7. 'setosa:4.6'
  8. 'setosa:5'
  9. 'setosa:4.4'
  10. 'setosa:4.9'

Renaming columns and rows#

By using rownames() and colnames(), we can look at what the rownames and colnames of an object are. We can also use this to replace the rownames and colnames of the object by assigning using <-.

# Renaming the colnames in iris
iris2 <- iris
colnames(iris2)
  1. 'Sepal.Length'
  2. 'Sepal.Width'
  3. 'Petal.Length'
  4. 'Petal.Width'
  5. 'Species'
colnames(iris2) <- paste("Flower.",colnames(iris2),sep="")
colnames(iris2)
  1. 'Flower.Sepal.Length'
  2. 'Flower.Sepal.Width'
  3. 'Flower.Petal.Length'
  4. 'Flower.Petal.Width'
  5. 'Flower.Species'

Adding and removing variables#

Adding data to your objects can be very useful. Adding an extra column is very easy using the assignment operator and giving the new column a name.

# Adding a new column
iris -> iris2
iris2$new.column <- 1:nrow(iris2)

head(iris2)
A data.frame: 6 × 6
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpeciesnew.column
<dbl><dbl><dbl><dbl><fct><int>
15.13.51.40.2setosa1
24.93.01.40.2setosa2
34.73.21.30.2setosa3
44.63.11.50.2setosa4
55.03.61.40.2setosa5
65.43.91.70.4setosa6

Removing a column can be done by assigning the relevant column the “NULL” value.

# Removing a column
iris2$new.column <- NULL

Generating a sequence of numbers#

To generate regular sequences, we can use seq(). We provide it a value to start from (from =), and where to end (to =) and then a value to increase by (by =).

# Create a sequence from 0 to 100 increasing by 5 each time
seq(from = 0, to = 100, by = 5)
  1. 0
  2. 5
  3. 10
  4. 15
  5. 20
  6. 25
  7. 30
  8. 35
  9. 40
  10. 45
  11. 50
  12. 55
  13. 60
  14. 65
  15. 70
  16. 75
  17. 80
  18. 85
  19. 90
  20. 95
  21. 100

Activity#

Create a copy of iris

  • Rename the columns of iris by prefixing with the word “flower” and the separator “_”

  • In your copy, duplicate the Species column

  • Add a column to your copy which contains the numbers from 4 to 600 increasing by 4 each time.

Summary Quiz#

What is the purpose of conditional subsetting in R?

Which function is used to concatenate strings in R?

How can you combine two datasets by rows in R?