Fall 2020

Agenda

  • Wrapping up Lecture 1 content

  • Importing data

  • Simple summaries of categorical and continuous data

  • Coding style

  • Review homework grading rubric

  • Lab 2

Wrapping up Lecture 1 content

Load tidyverse

  • There are many different ways to perform the data import and manipulation tasks that we will cover in today’s class
  • We won’t go full-tidyverse just yet, but we will start to preview a handful of basic functions
  • Begin by loading tidyverse
# Load tidyverse
library(tidyverse)
  • If you do not have tidyverse installed yet, run the following code in your Console:
install.packages("tidyverse")

“Base R” vs tidyverse

  • In this class we’ll learn both about “base R” and the “tidyverse”

  • The “tidyverse” describes a set of packages developed by RStudio in an attempt to streamline and unify data import, export, manipulation, summarization, and visualization tasks

  • To be able to write custom functions and work with packages outside of the tidyverse (most R packages are not “tidy”!) it’s worth learning some base R

Importing data

  • Start with survey results from “Homework 0”

  • To import tabular data into R, we use the read.table() function

survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv",
                     header=TRUE, sep=",")
  • Let’s parse this command one component at a time
    • The data is in a file called survey_data2020.csv, which is a file on the course website
    • The file contains a header as its first row
    • The csv format means that the data is comma-separated, so sep=","
  • We could also have used read.csv(), which is read.table() with different defaults, including header=TRUE and sep=","
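
As a minimal sketch of how read.table() and read.csv() relate, the code below writes a small made-up CSV to a temporary file and reads it both ways. The file contents and the names toy.path, toy.table, and toy.csv are illustrative assumptions, not part of the course data.

```r
# Hypothetical toy data written to a temporary file (not the course survey)
toy.path <- tempfile(fileext = ".csv")
writeLines(c("Program,TVhours", "PPM,10.5", "MISM,0", "Other,3"), toy.path)

# read.table() needs to be told about the header row and the separator
toy.table <- read.table(toy.path, header = TRUE, sep = ",")

# read.csv() already assumes a header row and comma-separated values
toy.csv <- read.csv(toy.path)

# Both calls produce the same data frame for this simple file
identical(toy.table, toy.csv)
```

For files with quoting or comment characters the two functions can behave differently, because read.csv() changes a few other defaults as well.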

Exploring the data

  • R imports data into a data.frame object. This is the standard base R data object.
class(survey)
## [1] "data.frame"
  • To view the first few rows of the data, use head()
head(survey, 3)
##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1     PPM         Some experience       Never used         Windows    10.5
## 2   Other    Extensive experience Basic competence        Mac OS X     3.0
## 3    MISM Never programmed before Basic competence         Windows     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word
  • head(data.frame, n) returns the first n rows of the data frame

  • In the Console, you can also use View(survey) to get a spreadsheet view

Simple summary

  • Use the str() function to get a simple summary of your data frame object
str(survey)
## 'data.frame':    57 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 2 1 3 3 3 3 3 3 2 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 1 2 2 2 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 4 1 1 4 4 1 4 3 1 1 ...
##  $ OperatingSystem: Factor w/ 3 levels "Linux/Unix","Mac OS X",..: 3 2 3 3 3 2 2 2 3 3 ...
##  $ TVhours        : num  10.5 3 0 10 4 0 2 20 4 0 ...
##  $ Editor         : Factor w/ 5 levels "Excel","LaTeX",..: 4 3 3 1 3 3 3 4 3 3 ...


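  • This says that TVhours is a numeric variable, while all the rest are factors (categorical)
  • In R 4.0.0 and later, read.table() keeps text columns as character (chr) by default; pass stringsAsFactors=TRUE to get the factor columns shown above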
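
To see how str() reports the two kinds of columns, here is a small sketch on a made-up data frame. The name toy and its values are assumptions for illustration; stringsAsFactors = TRUE is passed explicitly so that the text column is stored as a factor regardless of the R version.

```r
# Hypothetical toy data frame (not the course survey)
toy <- data.frame(Program = c("PPM", "MISM", "PPM"),
                  TVhours = c(10.5, 0, 4),
                  stringsAsFactors = TRUE)

str(toy)               # Program is reported as a Factor, TVhours as num
class(toy$Program)     # "factor"
levels(toy$Program)    # the distinct categories, in alphabetical order
```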

Another simple summary

summary(survey)
##   Program                      PriorExp                Rexperience
##  MISM : 9   Extensive experience   : 8   Basic competence    :24  
##  Other:10   Never programmed before: 8   Experienced         : 6  
##  PPM  :38   Some experience        :41   Installed on machine: 7  
##                                          Never used          :20  
##                                                                   
##                                                                   
##    OperatingSystem    TVhours                  Editor  
##  Linux/Unix: 2     Min.   : 0.000   Excel         : 1  
##  Mac OS X  :19     1st Qu.: 3.000   LaTeX         : 5  
##  Windows   :36     Median : 5.000   Microsoft Word:40  
##                    Mean   : 6.763   Other         : 8  
##                    3rd Qu.:10.000   R Markdown    : 3  
##                    Max.   :21.000

Data frame basics

  • We will talk more about lists and data frames (and their “tidy” variants, tibbles) next week, but here are a few basics

  • To see what an R object is made up of, you can use attributes()

attributes(survey)
## $names
## [1] "Program"         "PriorExp"        "Rexperience"     "OperatingSystem"
## [5] "TVhours"         "Editor"         
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57

An R data frame is a list whose columns you can refer to by name or index

  • Those $ symbols are what tell you it’s a list of some kind
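
A quick way to convince yourself that a data frame is a list of columns is sketched below. The data frame toy is a made-up example, not the survey data.

```r
# Hypothetical toy data frame
toy <- data.frame(Program = c("PPM", "MISM", "Other"),
                  TVhours = c(10.5, 0, 3))

is.list(toy)                                # TRUE: a data frame is a list
length(toy)                                 # 2: one list element per column
identical(toy$TVhours, toy[["TVhours"]])    # TRUE: two ways to refer by name
identical(toy[[2]], toy[["TVhours"]])       # TRUE: refer by index instead
```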

Data frame dimensions

  • We can use nrow() and ncol() to determine the number of survey responses and the number of survey questions
nrow(survey) # Number of rows (responses)
## [1] 57
ncol(survey) # Number of columns (questions)
## [1] 6
  • When writing reports, you will often want to say how large your sample size was
  • To do this inline, use the syntax:
`r nrow(survey)`
  • This allows us to write “57 students responded to the survey”, and have the number displayed automatically change when the value of nrow(survey) changes.

Inline code chunks example

Here’s a more complex example of inline code use.

We collected data on `r ncol(survey)` survey questions from `r nrow(survey)` respondents.  
Respondents represented `r length(unique(survey[["Program"]]))` CMU programs.  
`r sum(survey[["Program"]] == "PPM")` of the respondents were from PPM.

Which results in:

We collected data on 6 survey questions from 57 respondents. Respondents represented 3 CMU programs. 38 of the respondents were from PPM.


IMPORTANT: You are expected to use inline code chunks instead of copying and pasting output whenever possible.

Indexing data frames

  • There are many different ways of indexing the same piece of a data frame
    • Each vector below contains 57 entries. For display purposes, the settings have been adjusted so that only the first 22 are shown below
survey[["Program"]]  # "Program" element
##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM
survey$Program # "Program" element
##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM
survey[,1] # Data from 1st column
##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM

More indexing

  • Note that single brackets and double brackets have different effects
survey[["Program"]]  # Returns the Program column as a vector
##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM
survey["Program"] #  single column data frame containing only "Program"
##    Program
## 1      PPM
## 2    Other
## 3     MISM
## 4      PPM
## 5      PPM
## 6      PPM
## 7      PPM
## 8      PPM
## 9      PPM
## 10   Other
## 11     PPM
## 12     PPM
## 13     PPM
## 14     PPM
## 15     PPM
## 16     PPM
## 17     PPM
## 18    MISM
## 19    MISM
## 20     PPM
## 21     PPM
## 22   Other
## 23     PPM
## 24     PPM
## 25   Other
## 26   Other
## 27   Other
## 28    MISM
## 29     PPM
## 30     PPM
## 31     PPM
## 32    MISM
## 33     PPM
## 34     PPM
## 35     PPM
## 36     PPM
## 37    MISM
## 38    MISM
## 39   Other
## 40     PPM
## 41     PPM
## 42   Other
## 43     PPM
## 44     PPM
## 45     PPM
## 46     PPM
## 47     PPM
## 48     PPM
## 49    MISM
## 50     PPM
## 51   Other
## 52     PPM
## 53     PPM
## 54     PPM
## 55     PPM
## 56   Other
## 57    MISM
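
The difference between single and double brackets is easiest to see by checking the class of what comes back. The sketch below uses a made-up data frame called toy (an assumption for illustration); the same pattern applies to survey.

```r
# Hypothetical toy data frame
toy <- data.frame(Program = c("PPM", "MISM", "Other"),
                  TVhours = c(10.5, 0, 3))

class(toy[["TVhours"]])         # "numeric": the column itself, as a vector
class(toy["TVhours"])           # "data.frame": a one-column data frame
class(toy[, 2])                 # "numeric": df[, j] also simplifies to a vector
class(toy[, 2, drop = FALSE])   # "data.frame": drop = FALSE keeps the data frame
```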

Bar plot (categorical data)

Here we’ll use qplot() from the ggplot2 package (part of tidyverse)

qplot(survey$Program)

Histogram (continuous data)

qplot(survey[["TVhours"]], binwidth = 3, fill = I("steelblue"))

Indexing multiple columns

# Data from 1st and 5th columns
survey[, c(1,5)] 
##    Program TVhours
## 1      PPM    10.5
## 2    Other     3.0
## 3     MISM     0.0
## 4      PPM    10.0
## 5      PPM     4.0
## 6      PPM     0.0
## 7      PPM     2.0
## 8      PPM    20.0
## 9      PPM     4.0
## 10   Other     0.0
## 11     PPM    15.0
## 12     PPM     5.0
## 13     PPM    20.0
## 14     PPM    10.0
## 15     PPM     5.0
## 16     PPM     2.0
## 17     PPM    14.0
## 18    MISM    10.0
## 19    MISM     4.0
## 20     PPM     3.0
## 21     PPM     6.0
## 22   Other    10.0
## 23     PPM     2.0
## 24     PPM     3.0
## 25   Other     3.0
## 26   Other     1.0
## 27   Other     1.0
## 28    MISM     3.0
## 29     PPM     5.0
## 30     PPM    10.0
## 31     PPM    20.0
## 32    MISM     0.0
## 33     PPM     9.0
## 34     PPM     3.0
## 35     PPM     4.0
## 36     PPM     8.0
## 37    MISM     7.0
## 38    MISM     8.0
## 39   Other    10.0
## 40     PPM    10.0
## 41     PPM     4.0
## 42   Other    10.0
## 43     PPM     4.0
## 44     PPM     0.0
## 45     PPM     1.0
## 46     PPM     7.0
## 47     PPM     2.0
## 48     PPM    15.0
## 49    MISM     8.0
## 50     PPM    10.0
## 51   Other     2.0
## 52     PPM     3.0
## 53     PPM     4.0
## 54     PPM    21.0
## 55     PPM    10.0
## 56   Other    20.0
## 57    MISM     0.0
# Data from "Program" and "Editor"
survey[c("Program", "Editor")] 
##    Program         Editor
## 1      PPM          Other
## 2    Other Microsoft Word
## 3     MISM Microsoft Word
## 4      PPM          Excel
## 5      PPM Microsoft Word
## 6      PPM Microsoft Word
## 7      PPM Microsoft Word
## 8      PPM          Other
## 9      PPM Microsoft Word
## 10   Other Microsoft Word
## 11     PPM Microsoft Word
## 12     PPM          LaTeX
## 13     PPM Microsoft Word
## 14     PPM Microsoft Word
## 15     PPM          Other
## 16     PPM          Other
## 17     PPM Microsoft Word
## 18    MISM          LaTeX
## 19    MISM Microsoft Word
## 20     PPM Microsoft Word
## 21     PPM Microsoft Word
## 22   Other Microsoft Word
## 23     PPM Microsoft Word
## 24     PPM          Other
## 25   Other          LaTeX
## 26   Other Microsoft Word
## 27   Other Microsoft Word
## 28    MISM Microsoft Word
## 29     PPM     R Markdown
## 30     PPM     R Markdown
## 31     PPM Microsoft Word
## 32    MISM Microsoft Word
## 33     PPM Microsoft Word
## 34     PPM Microsoft Word
## 35     PPM Microsoft Word
## 36     PPM Microsoft Word
## 37    MISM Microsoft Word
## 38    MISM     R Markdown
## 39   Other Microsoft Word
## 40     PPM Microsoft Word
## 41     PPM Microsoft Word
## 42   Other Microsoft Word
## 43     PPM Microsoft Word
## 44     PPM Microsoft Word
## 45     PPM          Other
## 46     PPM Microsoft Word
## 47     PPM          Other
## 48     PPM Microsoft Word
## 49    MISM          LaTeX
## 50     PPM Microsoft Word
## 51   Other Microsoft Word
## 52     PPM Microsoft Word
## 53     PPM Microsoft Word
## 54     PPM Microsoft Word
## 55     PPM          LaTeX
## 56   Other Microsoft Word
## 57    MISM          Other

tidy column selection: select()

It is preferable to use the select() function to select subsets of columns

select(survey, Program, Editor)
##    Program         Editor
## 1      PPM          Other
## 2    Other Microsoft Word
## 3     MISM Microsoft Word
## 4      PPM          Excel
## 5      PPM Microsoft Word
## 6      PPM Microsoft Word
## 7      PPM Microsoft Word
## 8      PPM          Other
## 9      PPM Microsoft Word
## 10   Other Microsoft Word
## 11     PPM Microsoft Word
## 12     PPM          LaTeX
## 13     PPM Microsoft Word
## 14     PPM Microsoft Word
## 15     PPM          Other
## 16     PPM          Other
## 17     PPM Microsoft Word
## 18    MISM          LaTeX
## 19    MISM Microsoft Word
## 20     PPM Microsoft Word
## 21     PPM Microsoft Word
## 22   Other Microsoft Word
## 23     PPM Microsoft Word
## 24     PPM          Other
## 25   Other          LaTeX
## 26   Other Microsoft Word
## 27   Other Microsoft Word
## 28    MISM Microsoft Word
## 29     PPM     R Markdown
## 30     PPM     R Markdown
## 31     PPM Microsoft Word
## 32    MISM Microsoft Word
## 33     PPM Microsoft Word
## 34     PPM Microsoft Word
## 35     PPM Microsoft Word
## 36     PPM Microsoft Word
## 37    MISM Microsoft Word
## 38    MISM     R Markdown
## 39   Other Microsoft Word
## 40     PPM Microsoft Word
## 41     PPM Microsoft Word
## 42   Other Microsoft Word
## 43     PPM Microsoft Word
## 44     PPM Microsoft Word
## 45     PPM          Other
## 46     PPM Microsoft Word
## 47     PPM          Other
## 48     PPM Microsoft Word
## 49    MISM          LaTeX
## 50     PPM Microsoft Word
## 51   Other Microsoft Word
## 52     PPM Microsoft Word
## 53     PPM Microsoft Word
## 54     PPM Microsoft Word
## 55     PPM          LaTeX
## 56   Other Microsoft Word
## 57    MISM          Other

Indexing rows and columns

  • Data frames have two dimensions to index across
  • You can use square bracket notation df[rows, cols] to extract specified rows and cols from a data frame df.
  • There are also other approaches, as illustrated below
survey[6, 5] # row 6, column 5
## [1] 0
survey[6, "Program"] # Program of 6th survey respondent 
## [1] PPM
## Levels: MISM Other PPM
survey[["Program"]][6]  # Program of 6th survey respondent 
## [1] PPM
## Levels: MISM Other PPM

More indexing rows and columns

  • If you leave the rows value blank in df[rows, cols], R returns all of the rows for the specified cols

  • Leaving cols blank pulls all the columns for the specified rows

survey[6,] # 6th row
##   Program        PriorExp      Rexperience OperatingSystem TVhours
## 6     PPM Some experience Basic competence        Mac OS X       0
##           Editor
## 6 Microsoft Word
survey[,2] # 2nd column
##  [1] Some experience         Extensive experience   
##  [3] Never programmed before Never programmed before
##  [5] Never programmed before Some experience        
##  [7] Never programmed before Some experience        
##  [9] Some experience         Some experience        
## [11] Never programmed before Extensive experience   
## [13] Extensive experience    Some experience        
## [15] Some experience         Some experience        
## [17] Some experience         Some experience        
## [19] Some experience         Some experience        
## [21] Extensive experience    Extensive experience   
## [23] Some experience         Some experience        
## [25] Some experience         Some experience        
## [27] Some experience         Some experience        
## [29] Some experience         Some experience        
## [31] Some experience         Some experience        
## [33] Some experience         Some experience        
## [35] Some experience         Some experience        
## [37] Extensive experience    Some experience        
## [39] Never programmed before Some experience        
## [41] Some experience         Some experience        
## [43] Some experience         Some experience        
## [45] Extensive experience    Never programmed before
## [47] Some experience         Some experience        
## [49] Some experience         Some experience        
## [51] Never programmed before Some experience        
## [53] Some experience         Some experience        
## [55] Extensive experience    Some experience        
## [57] Some experience        
## 3 Levels: Extensive experience ... Some experience

More indexing

In Lab 1, you were introduced to the colon operator :

We can use this operator for indexing

survey[1:3,]  # equivalent to head(survey, 3)
##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1     PPM         Some experience       Never used         Windows    10.5
## 2   Other    Extensive experience Basic competence        Mac OS X     3.0
## 3    MISM Never programmed before Basic competence         Windows     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word
survey[3:5, c(1,5)] 
##   Program TVhours
## 3    MISM       0
## 4     PPM      10
## 5     PPM       4

Subsets of data

We are often interested in learning something about a specific subset of the data

survey[survey$Program=="MISM", ] # Data from the MISM students
survey[which(survey$Program=="MISM"), ] # Does the same thing
##    Program                PriorExp      Rexperience OperatingSystem
## 3     MISM Never programmed before Basic competence         Windows
## 18    MISM         Some experience      Experienced         Windows
## 19    MISM         Some experience Basic competence        Mac OS X
## 28    MISM         Some experience Basic competence        Mac OS X
## 32    MISM         Some experience      Experienced         Windows
## 37    MISM    Extensive experience      Experienced        Mac OS X
## 38    MISM         Some experience      Experienced         Windows
## 49    MISM         Some experience       Never used      Linux/Unix
## 57    MISM         Some experience Basic competence        Mac OS X
##    TVhours         Editor
## 3        0 Microsoft Word
## 18      10          LaTeX
## 19       4 Microsoft Word
## 28       3 Microsoft Word
## 32       0 Microsoft Word
## 37       7 Microsoft Word
## 38       8     R Markdown
## 49       8          LaTeX
## 57       0          Other
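
The two subsetting forms agree on the survey data, which has no missing values in Program. As a hedged sketch of where they can differ, the made-up data frame below (toy is a hypothetical name) contains an NA: logical indexing keeps a row of NAs for it, while which() drops it.

```r
# Hypothetical toy data frame with one missing Program value
toy <- data.frame(Program = c("MISM", NA, "PPM", "MISM"),
                  TVhours = c(0, 5, 10, 8))

by.logical <- toy[toy$Program == "MISM", ]        # comparison yields TRUE, NA, FALSE, TRUE
by.which <- toy[which(toy$Program == "MISM"), ]   # which() keeps only the TRUE positions

nrow(by.logical)   # 3: the NA comparison produces an extra all-NA row
nrow(by.which)     # 2: only the actual MISM rows
```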

More subset examples

Let’s pull all of the PPM students who have never used R before

survey[survey$Program=="PPM" & survey$Rexperience=="Never used", ]
##    Program                PriorExp Rexperience OperatingSystem TVhours
## 1      PPM         Some experience  Never used         Windows    10.5
## 4      PPM Never programmed before  Never used         Windows    10.0
## 5      PPM Never programmed before  Never used         Windows     4.0
## 7      PPM Never programmed before  Never used        Mac OS X     2.0
## 11     PPM Never programmed before  Never used         Windows    15.0
## 12     PPM    Extensive experience  Never used      Linux/Unix     5.0
## 15     PPM         Some experience  Never used         Windows     5.0
## 17     PPM         Some experience  Never used         Windows    14.0
## 24     PPM         Some experience  Never used         Windows     3.0
## 31     PPM         Some experience  Never used         Windows    20.0
## 43     PPM         Some experience  Never used         Windows     4.0
## 44     PPM         Some experience  Never used         Windows     0.0
## 45     PPM    Extensive experience  Never used        Mac OS X     1.0
## 46     PPM Never programmed before  Never used         Windows     7.0
## 53     PPM         Some experience  Never used         Windows     4.0
##            Editor
## 1           Other
## 4           Excel
## 5  Microsoft Word
## 7  Microsoft Word
## 11 Microsoft Word
## 12          LaTeX
## 15          Other
## 17 Microsoft Word
## 24          Other
## 31 Microsoft Word
## 43 Microsoft Word
## 44 Microsoft Word
## 45          Other
## 46 Microsoft Word
## 53 Microsoft Word

“tidy” subsetting with filter()

  • In general, it is preferable to use the filter() function

  • Here’s an example of selecting all responses from students who are either in PPM or Other and who listed their R experience as “Basic competence”.

filter(survey, 
       (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")
##    Program             PriorExp      Rexperience OperatingSystem TVhours
## 1    Other Extensive experience Basic competence        Mac OS X       3
## 2      PPM      Some experience Basic competence        Mac OS X       0
## 3      PPM      Some experience Basic competence         Windows       4
## 4    Other      Some experience Basic competence         Windows       0
## 5      PPM Extensive experience Basic competence        Mac OS X      20
## 6      PPM      Some experience Basic competence         Windows      10
## 7      PPM      Some experience Basic competence         Windows       3
## 8      PPM Extensive experience Basic competence         Windows       6
## 9    Other Extensive experience Basic competence        Mac OS X      10
## 10     PPM      Some experience Basic competence        Mac OS X       2
## 11   Other      Some experience Basic competence        Mac OS X       3
## 12     PPM      Some experience Basic competence        Mac OS X       5
## 13     PPM      Some experience Basic competence         Windows      10
## 14     PPM      Some experience Basic competence         Windows       9
## 15     PPM      Some experience Basic competence         Windows      10
## 16   Other      Some experience Basic competence         Windows      10
## 17     PPM      Some experience Basic competence        Mac OS X       2
## 18     PPM      Some experience Basic competence         Windows      15
## 19     PPM      Some experience Basic competence         Windows       3
## 20     PPM      Some experience Basic competence         Windows      21
##            Editor
## 1  Microsoft Word
## 2  Microsoft Word
## 3  Microsoft Word
## 4  Microsoft Word
## 5  Microsoft Word
## 6  Microsoft Word
## 7  Microsoft Word
## 8  Microsoft Word
## 9  Microsoft Word
## 10 Microsoft Word
## 11          LaTeX
## 12     R Markdown
## 13     R Markdown
## 14 Microsoft Word
## 15 Microsoft Word
## 16 Microsoft Word
## 17          Other
## 18 Microsoft Word
## 19 Microsoft Word
## 20 Microsoft Word

filter() allows you to split conditions across lines

  • On the previous slide we had
filter(survey, 
       (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")
  • This is equivalent to the easier to parse call:
filter(survey, 
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence")
##    Program             PriorExp      Rexperience OperatingSystem TVhours
## 1    Other Extensive experience Basic competence        Mac OS X       3
## 2      PPM      Some experience Basic competence        Mac OS X       0
## 3      PPM      Some experience Basic competence         Windows       4
## 4    Other      Some experience Basic competence         Windows       0
## 5      PPM Extensive experience Basic competence        Mac OS X      20
## 6      PPM      Some experience Basic competence         Windows      10
## 7      PPM      Some experience Basic competence         Windows       3
## 8      PPM Extensive experience Basic competence         Windows       6
## 9    Other Extensive experience Basic competence        Mac OS X      10
## 10     PPM      Some experience Basic competence        Mac OS X       2
## 11   Other      Some experience Basic competence        Mac OS X       3
## 12     PPM      Some experience Basic competence        Mac OS X       5
## 13     PPM      Some experience Basic competence         Windows      10
## 14     PPM      Some experience Basic competence         Windows       9
## 15     PPM      Some experience Basic competence         Windows      10
## 16   Other      Some experience Basic competence         Windows      10
## 17     PPM      Some experience Basic competence        Mac OS X       2
## 18     PPM      Some experience Basic competence         Windows      15
## 19     PPM      Some experience Basic competence         Windows       3
## 20     PPM      Some experience Basic competence         Windows      21
##            Editor
## 1  Microsoft Word
## 2  Microsoft Word
## 3  Microsoft Word
## 4  Microsoft Word
## 5  Microsoft Word
## 6  Microsoft Word
## 7  Microsoft Word
## 8  Microsoft Word
## 9  Microsoft Word
## 10 Microsoft Word
## 11          LaTeX
## 12     R Markdown
## 13     R Markdown
## 14 Microsoft Word
## 15 Microsoft Word
## 16 Microsoft Word
## 17          Other
## 18 Microsoft Word
## 19 Microsoft Word
## 20 Microsoft Word
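
The sketch below checks, on a small made-up data frame, that comma-separated conditions in filter() behave like conditions joined with &. It also shows %in% as a compact alternative to chaining | comparisons. The names toy, with.and, with.commas, and with.in are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(
  Program = c("PPM", "Other", "MISM", "PPM"),
  Rexperience = c("Basic competence", "Never used",
                  "Basic competence", "Basic competence"),
  TVhours = c(10.5, 3, 0, 4)
)

# One combined condition joined with &
with.and <- filter(toy,
                   (Program == "PPM" | Program == "Other") &
                     Rexperience == "Basic competence")

# Separate conditions: filter() keeps rows where all of them hold
with.commas <- filter(toy,
                      Program == "PPM" | Program == "Other",
                      Rexperience == "Basic competence")

# %in% tests membership in a set of values
with.in <- filter(toy,
                  Program %in% c("PPM", "Other"),
                  Rexperience == "Basic competence")

identical(with.and, with.commas)   # TRUE
identical(with.commas, with.in)    # TRUE
```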

“tidy” selection of rows and columns

  • What if we wanted to select a subset of rows and columns with a combination of filter() and select()?
  • Here’s one strategy
# First, get the desired rows
row.subset <- filter(survey, 
                     Program == "PPM" | Program == "Other",
                     Rexperience == "Basic competence")
# Then, get the right columns
select(row.subset, TVhours, Editor)
##    TVhours         Editor
## 1        3 Microsoft Word
## 2        0 Microsoft Word
## 3        4 Microsoft Word
## 4        0 Microsoft Word
## 5       20 Microsoft Word
## 6       10 Microsoft Word
## 7        3 Microsoft Word
## 8        6 Microsoft Word
## 9       10 Microsoft Word
## 10       2 Microsoft Word
## 11       3          LaTeX
## 12       5     R Markdown
## 13      10     R Markdown
## 14       9 Microsoft Word
## 15      10 Microsoft Word
## 16      10 Microsoft Word
## 17       2          Other
## 18      15 Microsoft Word
## 19       3 Microsoft Word
## 20      21 Microsoft Word

Piping with %>%

  • Here’s a better strategy, which uses “piping” to supply the output of one computation as an argument into the next
  • Most of our data processing/summarization pipelines in R will involve lots of piping
  • The symbol %>% is pronounced “pipe”
filter(survey, 
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)
##    TVhours         Editor
## 1        3 Microsoft Word
## 2        0 Microsoft Word
## 3        4 Microsoft Word
## 4        0 Microsoft Word
## 5       20 Microsoft Word
## 6       10 Microsoft Word
## 7        3 Microsoft Word
## 8        6 Microsoft Word
## 9       10 Microsoft Word
## 10       2 Microsoft Word
## 11       3          LaTeX
## 12       5     R Markdown
## 13      10     R Markdown
## 14       9 Microsoft Word
## 15      10 Microsoft Word
## 16      10 Microsoft Word
## 17       2          Other
## 18      15 Microsoft Word
## 19       3 Microsoft Word
## 20      21 Microsoft Word
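
To make the idea of piping concrete: x %>% f(y) is another way of writing f(x, y). Here is a minimal sketch on a made-up vector (toy.hours is a hypothetical name), loading dplyr so that %>% is available.

```r
library(dplyr)   # provides the %>% operator

toy.hours <- c(1, 4, 9)

# Nested calls are read inside-out
nested <- sum(sqrt(toy.hours))

# The piped version is read left to right
piped <- toy.hours %>% sqrt() %>% sum()

identical(nested, piped)   # TRUE; both equal 6

# The left-hand side becomes the first argument of the next call
first.two <- toy.hours %>% head(2)   # same as head(toy.hours, 2)
```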

Piping: preferred style

When piping, it is best to pipe right from the start

# OK:
filter(survey, 
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)

# Better:
survey %>%
  filter(Program == "PPM" | Program == "Other",
         Rexperience == "Basic competence") %>%
  select(TVhours, Editor)

Splitting a long expression

  • As your function calls get longer and more complicated, you may find it useful to split them over multiple lines

  • Suppose you had something like this:

survey[(survey$Program == "PPM" | survey$Program == "Other") & survey$Rexperience == "Basic competence", ]
  • You can split this across multiple lines by putting a line break after an operator
survey[(survey$Program == "PPM" | survey$Program == "Other") & 
         survey$Rexperience == "Basic competence", ]
  • Note that the line break occurs after the & operator
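
Why break after the operator rather than before it? Outside of any brackets, R treats a line as finished as soon as it forms a complete expression. The sketch below illustrates this with made-up variable names (total.after and total.before).

```r
# Break AFTER the operator: the first line is incomplete, so R keeps reading
total.after <- 1 +
  2

# Break BEFORE the operator: the first line is already complete,
# so the second line is evaluated as a separate expression (+2)
total.before <- 1
  + 2

total.after    # 3
total.before   # 1
```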

Some simple calculations

mean(survey$TVhours[survey$Program == "PPM"]) # Average TV hours for PPM students
## [1] 7.513158
mean(survey$TVhours[survey$Program == "MISM"]) # Average TV hours for MISM students
## [1] 4.444444
mean(survey$TVhours[survey$Program == "Other"]) # Average TV hours for "Other" students
## [1] 6

(Preview of) “tidy” data summaries with group_by and summarize

Here’s a much easier and cleaner way of getting the average TV hours watched by students in each program, using group_by() and summarize().

survey %>%
  group_by(Program) %>%
  summarize(mean(TVhours))
## # A tibble: 3 x 2
##   Program `mean(TVhours)`
##   <fct>             <dbl>
## 1 MISM               4.44
## 2 Other              6   
## 3 PPM                7.51
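
The summary column above gets the default name `mean(TVhours)`. As a sketch on made-up data, you can name summary columns yourself and compute several summaries at once. The names toy, tv.by.program, mean.tv, and n.students are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data frame
toy <- data.frame(Program = c("PPM", "PPM", "MISM", "MISM"),
                  TVhours = c(10, 4, 0, 8))

tv.by.program <- toy %>%
  group_by(Program) %>%
  summarize(mean.tv = mean(TVhours),   # named summary column
            n.students = n())          # n() counts the rows in each group

tv.by.program
```

Named columns are easier to refer to later, for example tv.by.program$mean.tv.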

Defining variables

  • If we wanted to focus on a particular column of the data frame, we could always define it as a new variable (e.g., if you want to easily experiment on a column)
tv.hours <- survey$TVhours  # Vector of TVhours watched
mean(tv.hours)              # Average time spent watching TV
## [1] 6.763158
sd(tv.hours)                # Standard deviation of TV watching time
## [1] 5.778737
sum(tv.hours >= 5)   # How many people watched 5 or more hours of TV?
## [1] 29
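
The sum(tv.hours >= 5) call works because a comparison returns a logical vector, and TRUE counts as 1 when summed. A small sketch on a made-up vector (toy.hours is a hypothetical name):

```r
toy.hours <- c(0, 3, 5, 10.5, 20)

at.least.five <- toy.hours >= 5   # FALSE FALSE TRUE TRUE TRUE

sum(at.least.five)    # 3: number of values that are 5 or more
mean(at.least.five)   # 0.6: proportion of values that are 5 or more
```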

R coding style

  • Coding style (and code commenting) will become increasingly important as we get into more advanced and involved programming tasks

  • Borrowing Hadley Wickham’s words:
    You don’t have to use my style, but you really should use a consistent style.

  • This style guide is short and easy to follow

  • We’ll revisit the question of coding style several times over the course of the class

Enforced style: assignment operator

Assignment operator. Use <- not =

  • Style guides uniformly promote the use of <- instead of = as the assignment operator
student.names <- c("Eric", "Hao", "Jennifer")  # Good
student.names = c("Eric", "Hao", "Jennifer") # Bad
  • Use = when specifying function arguments
sort(tv.hours, decreasing=TRUE) # Good
sort(tv.hours, decreasing<-TRUE) # Runs, but also creates a variable named decreasing
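
Here is a sketch of what goes wrong with <- inside a function call, using a made-up vector (toy.hours and the result names are hypothetical). The call still sorts correctly, but only because TRUE happens to land in the second argument position, and it leaves a stray variable behind.

```r
toy.hours <- c(3, 10.5, 0)

# = names the argument; nothing is created in your workspace
sorted.good <- sort(toy.hours, decreasing = TRUE)

# <- assigns TRUE to a new variable called decreasing,
# then passes that value as the second argument by position
sorted.side.effect <- sort(toy.hours, decreasing <- TRUE)

identical(sorted.good, sorted.side.effect)          # TRUE
created <- exists("decreasing", inherits = FALSE)   # TRUE: a stray variable now exists
```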

Enforced style: Spacing

  • Binary operators should have spaces around them

  • Commas should have a space after, but not before (just like in writing)

3 * 4 # Good
3*4 # Bad
which(student.names == "Eric") # Good
which(student.names=="Eric") # Bad
  • For specifying arguments, spacing around = is optional
sort(tv.hours, decreasing=TRUE) # Accepted
sort(tv.hours, decreasing = FALSE) # Accepted

Enforced style: Variable names

  • To make code easy to read, debug, and maintain, you should use concise but descriptive variable names

  • Terms in variable names should be separated by _ or .

# Accepted
day_one   day.one   day_1   day.1   day1

# Bad
d1   DayOne   dayone   

# Can be made more concise:
first.day.of.the.month
  • Avoid variable names that are already the names of built-in variables or functions in R
# EXTREMELY bad:
c   T   pi   sum   mean   
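
As a sketch of why reusing built-in names is risky, consider T, which R defines as a shortcut for TRUE but allows you to overwrite. The names masked.value and restored.value are assumptions for illustration.

```r
T <- 0                        # legal, but masks the built-in shortcut T
masked.value <- isTRUE(T)     # FALSE: any code relying on T now misbehaves

rm(T)                         # remove our copy; the built-in T is visible again
restored.value <- isTRUE(T)   # TRUE
```

Writing TRUE and FALSE in full avoids this problem, since those names cannot be reassigned.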

Assignments

  • Homework 1 will be posted today
    • Due: Wednesday, November 4, 1:30PM ET
    • Submit your .Rmd and .html files on Canvas (do not zip them together)
    • Course website contains grading rubric
  • Lab 2 is now available
    • Remember: To earn a participation point for today’s class, you must submit Lab 1 and Lab 2 by Friday evening (ET)