R for psychologists 2: data wrangling

Alejandro de la Vega
8/14/14

Review

We covered the basics last time

  • Data types (character, numeric, logical, integer)
  • Data objects (data frames are awesome)
  • Factors (categorical variables & subject IDs)
  • Indexing (data[ROW, COLUMN] & data$name)
  • Transform (adding a column of same length)
  • Subset (Selecting part of data frame based on condition)
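As a quick refresher, the ideas above can be sketched on a toy data frame. The `scores` data frame, its column names, and its values are made up for illustration and are not from last session's materials.

```r
# Toy data frame; names and values are invented for this recap.
scores <- data.frame(
  id    = c("s1", "s2", "s3", "s4"),
  group = c("control", "treatment", "control", "treatment"),
  rt    = c(512, 430, 601, 455),
  stringsAsFactors = FALSE
)

# Factors: store a categorical variable with a fixed set of levels
scores$group <- factor(scores$group)

# Indexing: data[ROW, COLUMN] and data$name
first_rt <- scores[1, "rt"]
all_rts  <- scores$rt

# Transform: add a column of the same length as the data frame
scores$rt_sec <- scores$rt / 1000

# Subset: keep only the rows that meet a condition
fast <- subset(scores, rt < 500)
```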

Some loose ends

Time for some disparate but necessary loose ends

Getting help

  • help() or ?
  • ?? to search the help pages of all installed packages
?scale

Column names

colnames(iris)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width" 
[5] "Species"     
colnames(iris)[1] = "Length"
colnames(iris) = c("SLength", "SWidth", "PLength", "PWidth", "Species")
colnames(iris)
[1] "SLength" "SWidth"  "PLength" "PWidth"  "Species"
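Renaming by position breaks if the column order ever changes. A small sketch of renaming by name instead, working on a copy (`flowers` is a name chosen here) so the built-in data set stays untouched:

```r
# Copy the built-in data set so the original is not modified
flowers <- datasets::iris

# Find the column by its current name rather than by its position
colnames(flowers)[colnames(flowers) == "Sepal.Length"] <- "SLength"

colnames(flowers)
```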

A note on missing values

  • Missing values stored as NA
  • A value of “NA” in the input file is read in as missing
  SLength SWidth PLength PWidth Species
1     5.1    3.5     1.4     NA  setosa
2     4.9    3.0     1.4    0.2  setosa
3      NA    3.2     1.3    0.2  setosa
4     4.6    3.1     1.5    0.2  setosa
na.omit(iris) # removes rows with NAs
  SLength SWidth PLength PWidth Species
2     4.9    3.0     1.4    0.2  setosa
4     4.6    3.1     1.5    0.2  setosa
  • Note the row numbers: the original numbering is kept, so the gaps show which rows were dropped
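A minimal sketch of both points, using a tiny made-up CSV (the `sub` and `score` columns are hypothetical) read from a string instead of a file:

```r
# Made-up input; the literal text NA stands in for a missing score
csv_text <- "sub,score
1,5.1
2,NA
3,4.7"

ratings <- read.csv(text = csv_text)

# The "NA" entry was coded as missing, and the column is still numeric
is.na(ratings$score)

# na.omit() drops the incomplete row but keeps the original row names
complete <- na.omit(ratings)
rownames(complete)
```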

Dealing with NAs without removing

mean(iris$SLength)
[1] NA
  • Why?
mean(iris$SLength, na.rm = TRUE)
[1] 4.867
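The answer to "why?" is that NA means "unknown", so any arithmetic that touches it is unknown too. A short sketch on a plain vector holding the four SLength values shown earlier:

```r
x <- c(5.1, 4.9, NA, 4.6)

# One unknown value makes the whole mean unknown
mean_with_na <- mean(x)

# na.rm = TRUE drops the NAs before computing
mean_without_na <- mean(x, na.rm = TRUE)

# Count how many values are missing
n_missing <- sum(is.na(x))
```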

Installing & loading packages

  • Automatically downloads from the internet
  • Simply use the name of the package
install.packages('package_name')
  • Or: Tools -> Install packages in RStudio
  • To load:
library(package_name)
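One common pattern, sketched here, is to install a package only if it is missing and then load it. The package name is a placeholder: "stats" ships with R, so nothing is downloaded when this runs as written.

```r
# Placeholder package name; swap in the package you actually need
pkg <- "stats"

# requireNamespace() returns FALSE if the package is not installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable
library(pkg, character.only = TRUE)
```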

Data manipulation

  • Now that you've got some fundamentals down…
  • Crucial to be able to manipulate your data in R
  • If you know the proper tools, no need to ever go to Excel!

"Tidying" your data

  • Data is commonly stored in “wide-format”
    • Easier to enter and for humans to read
    • One row per “group”
  • R likes long-format data because it's a computer
    • One row per observation
  • Hadley Wickham - paper on “tidy data” (canonical structure)

How would you convert to long normally?

  sub angry neutral sad happy
1   1     2       5   3     7
2   2     1       4   3     9
3   3     3       6   3     7
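One possible answer in base R, as a rough sketch: repeat the subject IDs, repeat the column names, and stack the values by hand. The `face_ratings` data frame is rebuilt here from the table above; the variable names are my own.

```r
# Rebuild the wide table shown above
face_ratings <- data.frame(
  sub     = 1:3,
  angry   = c(2, 1, 3),
  neutral = c(5, 4, 6),
  sad     = c(3, 3, 3),
  happy   = c(7, 9, 7)
)

emotions <- c("angry", "neutral", "sad", "happy")

# Stack the four rating columns on top of each other, one block per emotion
long_ratings <- data.frame(
  sub    = rep(face_ratings$sub, times = length(emotions)),
  type   = rep(emotions, each = nrow(face_ratings)),
  rating = unlist(face_ratings[emotions], use.names = FALSE)
)
```

It works, but the bookkeeping (times vs. each, column order) is easy to get wrong, which is the motivation for a dedicated tool.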

Tidyr package

  • tidyr package makes this very easy

  • gather() - takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.

    • In other words, splits up columns into rows, making wide data long
  • spread() - takes two columns (key & value) and spreads them into multiple columns: it makes “long” data wider.

  • separate() - for when you need to parse the text of a column to split it into more detail

(credit: Hadley Wickham)

gather()

gather(dataframe, key, value, cols_to_gather)
gather(face_ratings, type, rating, angry:happy)
   sub    type rating
1    1   angry      2
2    2   angry      1
3    3   angry      3
4    1 neutral      5
5    2 neutral      4
6    3 neutral      6
7    1     sad      3
8    2     sad      3
9    3     sad      3
10   1   happy      7
11   2   happy      9
12   3   happy      7

Alternative syntax

gather(face_ratings, type, rating, -sub)
   sub    type rating
1    1   angry      2
2    2   angry      1
3    3   angry      3
4    1 neutral      5
5    2 neutral      4
6    3 neutral      6
7    1     sad      3
8    2     sad      3
9    3     sad      3
10   1   happy      7
11   2   happy      9
12   3   happy      7
gather(face_ratings, type, rating, 2:5)
   sub    type rating
1    1   angry      2
2    2   angry      1
3    3   angry      3
4    1 neutral      5
5    2 neutral      4
6    3 neutral      6
7    1     sad      3
8    2     sad      3
9    3     sad      3
10   1   happy      7
11   2   happy      9
12   3   happy      7
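The slides do not show how face_ratings was created. Assuming it is a plain data frame matching the wide table shown earlier, this sketch rebuilds it and checks that the three ways of selecting columns give the same result. It needs the tidyr package installed.

```r
library(tidyr)

# Assumed reconstruction of face_ratings from the wide table
face_ratings <- data.frame(
  sub     = 1:3,
  angry   = c(2, 1, 3),
  neutral = c(5, 4, 6),
  sad     = c(3, 3, 3),
  happy   = c(7, 9, 7)
)

# Three ways to say "gather every column except sub"
by_range     <- gather(face_ratings, type, rating, angry:happy)
by_exclusion <- gather(face_ratings, type, rating, -sub)
by_position  <- gather(face_ratings, type, rating, 2:5)

identical(by_range, by_exclusion)
identical(by_range, by_position)
```

Selecting by name or by exclusion is usually safer than by position, since positions shift when columns are added.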

More complicated example

How would you tidy this data set up?

messy
  id       trt work.T1 home.T1 work.T2 home.T2
1  1 treatment 0.08514  0.6158  0.1135  0.0519
2  2   control 0.22544  0.4297  0.5959  0.2642
3  3 treatment 0.27453  0.6517  0.3580  0.3988
4  4   control 0.27231  0.5677  0.4288  0.8361

First, gather columns

gather(messy, key, time, -id, -trt)
   id       trt     key    time
1   1 treatment work.T1 0.08514
2   2   control work.T1 0.22544
3   3 treatment work.T1 0.27453
4   4   control work.T1 0.27231
5   1 treatment home.T1 0.61583
6   2   control home.T1 0.42967
7   3 treatment home.T1 0.65166
8   4   control home.T1 0.56774
9   1 treatment work.T2 0.11351
10  2   control work.T2 0.59593
11  3 treatment work.T2 0.35805
12  4   control work.T2 0.42881
13  1 treatment home.T2 0.05190
14  2   control home.T2 0.26418
15  3 treatment home.T2 0.39879
16  4   control home.T2 0.83613

Next, use separate()

  • Split up keys using “regular expressions”
  • If more than one observation per column, will aggregate
gathered <- gather(messy, key, time, -id, -trt)
# "timepoint" avoids a second column called "time"
separate(gathered, key, 
  into = c("location", "timepoint"), sep = "\\.") 
   id       trt location timepoint    time
1   1 treatment     work        T1 0.08514
2   2   control     work        T1 0.22544
3   3 treatment     work        T1 0.27453
4   4   control     work        T1 0.27231
5   1 treatment     home        T1 0.61583
6   2   control     home        T1 0.42967
7   3 treatment     home        T1 0.65166
8   4   control     home        T1 0.56774
9   1 treatment     work        T2 0.11351
10  2   control     work        T2 0.59593
11  3 treatment     work        T2 0.35805
12  4   control     work        T2 0.42881
13  1 treatment     home        T2 0.05190
14  2   control     home        T2 0.26418
15  3 treatment     home        T2 0.39879
16  4   control     home        T2 0.83613
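To try the two steps end to end, here is a sketch that rebuilds the messy data frame from the values printed above and runs gather() then separate(). The measurement column is called `value` here (my choice, not from the slides) so it cannot clash with the separated `time` column. It needs the tidyr package installed.

```r
library(tidyr)

# Rebuilt from the printed output; values are rounded as displayed
messy <- data.frame(
  id      = 1:4,
  trt     = c("treatment", "control", "treatment", "control"),
  work.T1 = c(0.08514, 0.22544, 0.27453, 0.27231),
  home.T1 = c(0.61583, 0.42967, 0.65166, 0.56774),
  work.T2 = c(0.11351, 0.59593, 0.35805, 0.42881),
  home.T2 = c(0.05190, 0.26418, 0.39879, 0.83613)
)

# Step 1: one row per id and measured column
gathered <- gather(messy, key, value, -id, -trt)

# Step 2: split keys such as "work.T1" at the literal dot
tidy <- separate(gathered, key, into = c("location", "time"), sep = "\\.")
```

The dot is escaped as "\\." because sep is treated as a regular expression, where a bare dot matches any character.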

Back to wide format

long_ratings <- gather(face_ratings, type, rating, -sub)
head(long_ratings)
  sub    type rating
1   1   angry      2
2   2   angry      1
3   3   angry      3
4   1 neutral      5
5   2 neutral      4
6   3 neutral      6
spread(long_ratings, type, rating)
  sub angry neutral sad happy
1   1     2       5   3     7
2   2     1       4   3     9
3   3     3       6   3     7
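A round trip is a handy sanity check: gathering and then spreading should give back the data you started with. A sketch, assuming face_ratings is the plain data frame from the wide table earlier; it needs the tidyr package installed.

```r
library(tidyr)

# Assumed reconstruction of face_ratings from the wide table
face_ratings <- data.frame(
  sub     = 1:3,
  angry   = c(2, 1, 3),
  neutral = c(5, 4, 6),
  sad     = c(3, 3, 3),
  happy   = c(7, 9, 7)
)

# Wide -> long -> wide again
long_ratings <- gather(face_ratings, type, rating, -sub)
wide_ratings <- spread(long_ratings, type, rating)

# Column order may differ after spread(), so compare in the original order
all.equal(wide_ratings[, names(face_ratings)], face_ratings,
          check.attributes = FALSE)
```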

dplyr package

  • Split-Apply-Combine
  • Composed of many “verbs” or actions you can take on data frames
  • Many are incremental upgrades on base R (10x to 10,000x faster)
  • Verbs play nicely with each other and are simple
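To make "Split-Apply-Combine" concrete before meeting the dplyr verbs, here is the idea spelled out in base R on the built-in iris data. This is a sketch of the concept, not of how dplyr implements it; the variable names are my own.

```r
flowers <- datasets::iris

# Split: one vector of sepal lengths per species
pieces <- split(flowers$Sepal.Length, flowers$Species)

# Apply: compute the mean within each piece
means <- sapply(pieces, mean)

# Combine: put the per-group results back into one data frame
result <- data.frame(
  Species     = names(means),
  mean_length = unname(means)
)
```

In dplyr, group_by() handles the split and summarise() handles the apply and combine steps.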