Lecture 2: Importing data and more basics

Fall 2020

Agenda

Wrapping up Lecture 1 content
Importing data
Simple summaries of categorical and continuous data
Coding style
Review homework grading rubric
Lab 2

Wrapping up Lecture 1 content

Let’s go back to where we left off in the Lecture 1 slides.

Load tidyverse

There are many different ways to perform the data import and manipulation tasks that we will cover in today’s class
We won’t go full-tidyverse just yet, but we will start to preview a handful of basic functions
Begin by loading tidyverse

# Load tidyverse
library(tidyverse)

If you do not have tidyverse installed yet, run the following code in your Console:

install.packages("tidyverse")

“Base R” vs tidyverse

In this class we’ll learn both about “base R” and the “tidyverse”
The “tidyverse” describes a set of packages developed by RStudio in an attempt to streamline and unify data import, export, manipulation, summarization, and visualization tasks
To be able to write custom functions and work with packages outside of the tidyverse (most R packages are not “tidy”!) it’s worth learning some base R

Importing data

Start with survey results from “Homework 0”
To import tabular data into R, we use the read.table() command

survey <- read.table("http://www.andrew.cmu.edu/user/achoulde/94842/data/survey_data2020.csv",
                     header=TRUE, sep=",")

Let’s parse this command one component at a time
- The data is in a file called survey_data2020.csv, which is a file on the course website
- The file contains a header as its first row
- The csv format means that the data is comma-separated, so sep=","
We could also have used read.csv(), which is read.table() with defaults suited to csv files, including header=TRUE and sep=","
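A minimal sketch of that equivalence. It reads a tiny inline csv (made-up values, not the course file) through the text argument so that it runs without a network connection.

```r
# A tiny csv supplied as text; the values are illustrative stand-ins
csv_text <- "Program,TVhours
PPM,10.5
Other,3
MISM,0"

# read.table() needs header and sep spelled out for csv input
via_table <- read.table(text = csv_text, header = TRUE, sep = ",")

# read.csv() uses header = TRUE and sep = "," by default
via_csv <- read.csv(text = csv_text)

# Compare the two results
all.equal(via_table, via_csv)
```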

Exploring the data

R imports data into a data.frame object. This is the standard base R data object.

class(survey)

## [1] "data.frame"

To view the first few rows of the data, use head()

head(survey, 3)

##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1     PPM         Some experience       Never used         Windows    10.5
## 2   Other    Extensive experience Basic competence        Mac OS X     3.0
## 3    MISM Never programmed before Basic competence         Windows     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word

head(df, n) returns the first n rows of the data frame df
In the Console, you can also use View(survey) to get a spreadsheet view

Simple summary

Use the str() function to get a simple summary of your data frame object

str(survey)

## 'data.frame':    57 obs. of  6 variables:
##  $ Program        : Factor w/ 3 levels "MISM","Other",..: 3 2 1 3 3 3 3 3 3 2 ...
##  $ PriorExp       : Factor w/ 3 levels "Extensive experience",..: 3 1 2 2 2 3 2 3 3 3 ...
##  $ Rexperience    : Factor w/ 4 levels "Basic competence",..: 4 1 1 4 4 1 4 3 1 1 ...
##  $ OperatingSystem: Factor w/ 3 levels "Linux/Unix","Mac OS X",..: 3 2 3 3 3 2 2 2 3 3 ...
##  $ TVhours        : num  10.5 3 0 10 4 0 2 20 4 0 ...
##  $ Editor         : Factor w/ 5 levels "Excel","LaTeX",..: 4 3 3 1 3 3 3 4 3 3 ...

This says that TVhours is a numeric variable, while all the rest are factors (categorical)
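Whether text columns come in as factors depends on your R version: since R 4.0.0 the default is stringsAsFactors = FALSE, so text columns are read as character unless you ask otherwise. A small sketch, using made-up inline data rather than the survey file, that requests factors explicitly:

```r
# A tiny csv supplied as text; the values are illustrative stand-ins
csv_text <- "Program,TVhours
PPM,10.5
Other,3
MISM,0"

# Ask explicitly for factors so the result does not depend on the R version
toy <- read.table(text = csv_text, header = TRUE, sep = ",",
                  stringsAsFactors = TRUE)

str(toy)
levels(toy$Program)   # factor levels are stored in sorted order
```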

Another simple summary

summary(survey)

##   Program                      PriorExp                Rexperience
##  MISM : 9   Extensive experience   : 8   Basic competence    :24  
##  Other:10   Never programmed before: 8   Experienced         : 6  
##  PPM  :38   Some experience        :41   Installed on machine: 7  
##                                          Never used          :20  
##                                                                   
##                                                                   
##    OperatingSystem    TVhours                  Editor  
##  Linux/Unix: 2     Min.   : 0.000   Excel         : 1  
##  Mac OS X  :19     1st Qu.: 3.000   LaTeX         : 5  
##  Windows   :36     Median : 5.000   Microsoft Word:40  
##                    Mean   : 6.763   Other         : 8  
##                    3rd Qu.:10.000   R Markdown    : 3  
##                    Max.   :21.000

Data frame basics

We will talk more about lists and data frames (and their “tidy” variants, tibbles) next week, but here are a few basics
To see what an R object is made up of, you can use attributes()

attributes(survey)

## $names
## [1] "Program"         "PriorExp"        "Rexperience"     "OperatingSystem"
## [5] "TVhours"         "Editor"         
## 
## $class
## [1] "data.frame"
## 
## $row.names
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
## [47] 47 48 49 50 51 52 53 54 55 56 57

An R data frame is a list whose columns you can refer to by name or index

Those $ symbols are what tell you it’s a list of some kind
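A quick way to convince yourself of the list structure, sketched on a small made-up data frame (toy is a hypothetical stand-in for survey):

```r
# Small illustrative data frame
toy <- data.frame(
  Program = c("PPM", "Other", "MISM"),
  TVhours = c(10.5, 3, 0)
)

is.list(toy)    # a data frame is stored as a list of columns
names(toy)      # the column names are the names of the list elements
length(toy)     # number of list elements, i.e. number of columns

# The same column, reached by name and by position
identical(toy[["TVhours"]], toy[[2]])
```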

Data frame dimensions

We can use nrow() and ncol() to determine the number of survey responses and the number of survey questions

nrow(survey) # Number of rows (responses)

## [1] 57

ncol(survey) # Number of columns (questions)

## [1] 6

When writing reports, you will often want to say how large your sample size was
To do this inline, use the syntax:

`r nrow(survey)`

This allows us to write “57 students responded to the survey”, and have the number displayed automatically change when the value of nrow(survey) changes.

Inline code chunks example

Here’s a more complex example of inline code use.

We collected data on `r ncol(survey)` survey questions from `r nrow(survey)` respondents.  
Respondents represented `r length(unique(survey[["Program"]]))` CMU programs.  
`r sum(survey[["Program"]] == "PPM")` of the respondents were from PPM.

Which results in:

We collected data on 6 survey questions from 57 respondents. Respondents represented 3 CMU programs. 38 of the respondents were from PPM.

IMPORTANT: You are expected to use inline code chunks instead of copying and pasting output whenever possible.

Indexing data frames

There are many different ways of indexing the same piece of a data frame
- Each vector below contains 57 entries. For display purposes, the settings have been adjusted so that only the first 22 are shown below

survey[["Program"]]  # "Program" element

##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM

survey$Program # "Program" element

##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM

survey[,1] # Data from 1st column

##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM

More indexing

Note that single brackets and double brackets have different effects

survey[["Program"]]  # Returns the Program column as a vector

##  [1] PPM   Other MISM  PPM   PPM   PPM   PPM   PPM   PPM   Other PPM  
## [12] PPM   PPM   PPM   PPM   PPM   PPM   MISM  MISM  PPM   PPM   Other
## Levels: MISM Other PPM

survey["Program"] #  single column data frame containing only "Program"

##    Program
## 1      PPM
## 2    Other
## 3     MISM
## 4      PPM
## 5      PPM
## 6      PPM
## 7      PPM
## 8      PPM
## 9      PPM
## 10   Other
## 11     PPM
## 12     PPM
## 13     PPM
## 14     PPM
## 15     PPM
## 16     PPM
## 17     PPM
## 18    MISM
## 19    MISM
## 20     PPM
## 21     PPM
## 22   Other
## 23     PPM
## 24     PPM
## 25   Other
## 26   Other
## 27   Other
## 28    MISM
## 29     PPM
## 30     PPM
## 31     PPM
## 32    MISM
## 33     PPM
## 34     PPM
## 35     PPM
## 36     PPM
## 37    MISM
## 38    MISM
## 39   Other
## 40     PPM
## 41     PPM
## 42   Other
## 43     PPM
## 44     PPM
## 45     PPM
## 46     PPM
## 47     PPM
## 48     PPM
## 49    MISM
## 50     PPM
## 51   Other
## 52     PPM
## 53     PPM
## 54     PPM
## 55     PPM
## 56   Other
## 57    MISM
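The difference is easiest to see by checking the class of each result. A sketch on a small made-up data frame (toy is a hypothetical stand-in for survey):

```r
# Small illustrative data frame
toy <- data.frame(
  Program = factor(c("PPM", "Other", "MISM")),
  TVhours = c(10.5, 3, 0)
)

class(toy[["Program"]])   # the column itself (here a factor)
class(toy["Program"])     # a data frame with one column

# With [rows, cols] indexing, a single column is simplified to a vector
# unless drop = FALSE is supplied
class(toy[, "Program"])
class(toy[, "Program", drop = FALSE])
```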

Bar plot (categorical data)

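Here we’ll use qplot() from the ggplot2 package (part of tidyverse). Note that qplot() is deprecated as of ggplot2 3.4.0, so newer versions will show a deprecation warning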

qplot(survey$Program)

Histogram (continuous data)

qplot(survey[["TVhours"]], binwidth = 3, fill = I("steelblue"))

Indexing multiple columns

# Data from 1st and 5th columns
survey[, c(1,5)]

##    Program TVhours
## 1      PPM    10.5
## 2    Other     3.0
## 3     MISM     0.0
## 4      PPM    10.0
## 5      PPM     4.0
## 6      PPM     0.0
## 7      PPM     2.0
## 8      PPM    20.0
## 9      PPM     4.0
## 10   Other     0.0
## 11     PPM    15.0
## 12     PPM     5.0
## 13     PPM    20.0
## 14     PPM    10.0
## 15     PPM     5.0
## 16     PPM     2.0
## 17     PPM    14.0
## 18    MISM    10.0
## 19    MISM     4.0
## 20     PPM     3.0
## 21     PPM     6.0
## 22   Other    10.0
## 23     PPM     2.0
## 24     PPM     3.0
## 25   Other     3.0
## 26   Other     1.0
## 27   Other     1.0
## 28    MISM     3.0
## 29     PPM     5.0
## 30     PPM    10.0
## 31     PPM    20.0
## 32    MISM     0.0
## 33     PPM     9.0
## 34     PPM     3.0
## 35     PPM     4.0
## 36     PPM     8.0
## 37    MISM     7.0
## 38    MISM     8.0
## 39   Other    10.0
## 40     PPM    10.0
## 41     PPM     4.0
## 42   Other    10.0
## 43     PPM     4.0
## 44     PPM     0.0
## 45     PPM     1.0
## 46     PPM     7.0
## 47     PPM     2.0
## 48     PPM    15.0
## 49    MISM     8.0
## 50     PPM    10.0
## 51   Other     2.0
## 52     PPM     3.0
## 53     PPM     4.0
## 54     PPM    21.0
## 55     PPM    10.0
## 56   Other    20.0
## 57    MISM     0.0

# Data from "Program" and "Editor"
survey[c("Program", "Editor")]

##    Program         Editor
## 1      PPM          Other
## 2    Other Microsoft Word
## 3     MISM Microsoft Word
## 4      PPM          Excel
## 5      PPM Microsoft Word
## 6      PPM Microsoft Word
## 7      PPM Microsoft Word
## 8      PPM          Other
## 9      PPM Microsoft Word
## 10   Other Microsoft Word
## 11     PPM Microsoft Word
## 12     PPM          LaTeX
## 13     PPM Microsoft Word
## 14     PPM Microsoft Word
## 15     PPM          Other
## 16     PPM          Other
## 17     PPM Microsoft Word
## 18    MISM          LaTeX
## 19    MISM Microsoft Word
## 20     PPM Microsoft Word
## 21     PPM Microsoft Word
## 22   Other Microsoft Word
## 23     PPM Microsoft Word
## 24     PPM          Other
## 25   Other          LaTeX
## 26   Other Microsoft Word
## 27   Other Microsoft Word
## 28    MISM Microsoft Word
## 29     PPM     R Markdown
## 30     PPM     R Markdown
## 31     PPM Microsoft Word
## 32    MISM Microsoft Word
## 33     PPM Microsoft Word
## 34     PPM Microsoft Word
## 35     PPM Microsoft Word
## 36     PPM Microsoft Word
## 37    MISM Microsoft Word
## 38    MISM     R Markdown
## 39   Other Microsoft Word
## 40     PPM Microsoft Word
## 41     PPM Microsoft Word
## 42   Other Microsoft Word
## 43     PPM Microsoft Word
## 44     PPM Microsoft Word
## 45     PPM          Other
## 46     PPM Microsoft Word
## 47     PPM          Other
## 48     PPM Microsoft Word
## 49    MISM          LaTeX
## 50     PPM Microsoft Word
## 51   Other Microsoft Word
## 52     PPM Microsoft Word
## 53     PPM Microsoft Word
## 54     PPM Microsoft Word
## 55     PPM          LaTeX
## 56   Other Microsoft Word
## 57    MISM          Other

tidy column selection: select()

It is preferable to use the select() function to select subsets of columns

select(survey, Program, Editor)

##    Program         Editor
## 1      PPM          Other
## 2    Other Microsoft Word
## 3     MISM Microsoft Word
## 4      PPM          Excel
## 5      PPM Microsoft Word
## 6      PPM Microsoft Word
## 7      PPM Microsoft Word
## 8      PPM          Other
## 9      PPM Microsoft Word
## 10   Other Microsoft Word
## 11     PPM Microsoft Word
## 12     PPM          LaTeX
## 13     PPM Microsoft Word
## 14     PPM Microsoft Word
## 15     PPM          Other
## 16     PPM          Other
## 17     PPM Microsoft Word
## 18    MISM          LaTeX
## 19    MISM Microsoft Word
## 20     PPM Microsoft Word
## 21     PPM Microsoft Word
## 22   Other Microsoft Word
## 23     PPM Microsoft Word
## 24     PPM          Other
## 25   Other          LaTeX
## 26   Other Microsoft Word
## 27   Other Microsoft Word
## 28    MISM Microsoft Word
## 29     PPM     R Markdown
## 30     PPM     R Markdown
## 31     PPM Microsoft Word
## 32    MISM Microsoft Word
## 33     PPM Microsoft Word
## 34     PPM Microsoft Word
## 35     PPM Microsoft Word
## 36     PPM Microsoft Word
## 37    MISM Microsoft Word
## 38    MISM     R Markdown
## 39   Other Microsoft Word
## 40     PPM Microsoft Word
## 41     PPM Microsoft Word
## 42   Other Microsoft Word
## 43     PPM Microsoft Word
## 44     PPM Microsoft Word
## 45     PPM          Other
## 46     PPM Microsoft Word
## 47     PPM          Other
## 48     PPM Microsoft Word
## 49    MISM          LaTeX
## 50     PPM Microsoft Word
## 51   Other Microsoft Word
## 52     PPM Microsoft Word
## 53     PPM Microsoft Word
## 54     PPM Microsoft Word
## 55     PPM          LaTeX
## 56   Other Microsoft Word
## 57    MISM          Other

Indexing rows and columns

Data frames have two dimensions to index across
You can use square bracket notation df[rows, cols] to extract specified rows and cols from a data frame df.
There are also other approaches, as illustrated below

survey[6, 5] # row 6, column 5

## [1] 0

survey[6, "Program"] # Program of 6th survey respondent

## [1] PPM
## Levels: MISM Other PPM

survey[["Program"]][6]  # Program of 6th survey respondent

## [1] PPM
## Levels: MISM Other PPM

More indexing rows and columns

If you leave the rows value blank in df[rows, cols], you get all of the rows for the specified cols
Leaving cols blank pulls all the columns for the specified rows

survey[6,] # 6th row

##   Program        PriorExp      Rexperience OperatingSystem TVhours
## 6     PPM Some experience Basic competence        Mac OS X       0
##           Editor
## 6 Microsoft Word

survey[,2] # 2nd column

##  [1] Some experience         Extensive experience   
##  [3] Never programmed before Never programmed before
##  [5] Never programmed before Some experience        
##  [7] Never programmed before Some experience        
##  [9] Some experience         Some experience        
## [11] Never programmed before Extensive experience   
## [13] Extensive experience    Some experience        
## [15] Some experience         Some experience        
## [17] Some experience         Some experience        
## [19] Some experience         Some experience        
## [21] Extensive experience    Extensive experience   
## [23] Some experience         Some experience        
## [25] Some experience         Some experience        
## [27] Some experience         Some experience        
## [29] Some experience         Some experience        
## [31] Some experience         Some experience        
## [33] Some experience         Some experience        
## [35] Some experience         Some experience        
## [37] Extensive experience    Some experience        
## [39] Never programmed before Some experience        
## [41] Some experience         Some experience        
## [43] Some experience         Some experience        
## [45] Extensive experience    Never programmed before
## [47] Some experience         Some experience        
## [49] Some experience         Some experience        
## [51] Never programmed before Some experience        
## [53] Some experience         Some experience        
## [55] Extensive experience    Some experience        
## [57] Some experience        
## 3 Levels: Extensive experience ... Some experience

More indexing

In Lab 1, you were introduced to the colon operator :

We can use this operator for indexing

survey[1:3,]  # equivalent to head(survey, 3)

##   Program                PriorExp      Rexperience OperatingSystem TVhours
## 1     PPM         Some experience       Never used         Windows    10.5
## 2   Other    Extensive experience Basic competence        Mac OS X     3.0
## 3    MISM Never programmed before Basic competence         Windows     0.0
##           Editor
## 1          Other
## 2 Microsoft Word
## 3 Microsoft Word

survey[3:5, c(1,5)]

##   Program TVhours
## 3    MISM       0
## 4     PPM      10
## 5     PPM       4

Subsets of data

We are often interested in learning something about a specific subset of the data

survey[survey$Program=="MISM", ] # Data from the MISM students
survey[which(survey$Program=="MISM"), ] # Same result here, since Program has no missing values

##    Program                PriorExp      Rexperience OperatingSystem
## 3     MISM Never programmed before Basic competence         Windows
## 18    MISM         Some experience      Experienced         Windows
## 19    MISM         Some experience Basic competence        Mac OS X
## 28    MISM         Some experience Basic competence        Mac OS X
## 32    MISM         Some experience      Experienced         Windows
## 37    MISM    Extensive experience      Experienced        Mac OS X
## 38    MISM         Some experience      Experienced         Windows
## 49    MISM         Some experience       Never used      Linux/Unix
## 57    MISM         Some experience Basic competence        Mac OS X
##    TVhours         Editor
## 3        0 Microsoft Word
## 18      10          LaTeX
## 19       4 Microsoft Word
## 28       3 Microsoft Word
## 32       0 Microsoft Word
## 37       7 Microsoft Word
## 38       8     R Markdown
## 49       8          LaTeX
## 57       0          Other
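The logical and which() versions agree on the survey data because Program has no missing values. A sketch with a made-up data frame that does contain an NA shows where they part ways:

```r
# Illustrative data with one missing Program value
toy <- data.frame(
  Program = c("MISM", NA, "PPM", "MISM"),
  TVhours = c(0, 4, 10, 8)
)

# Logical indexing: the NA comparison yields NA, which becomes an all-NA row
by_logical <- toy[toy$Program == "MISM", ]

# which() returns only the positions that are TRUE, so the NA is dropped
by_which <- toy[which(toy$Program == "MISM"), ]

nrow(by_logical)
nrow(by_which)
```

If your data may contain missing values, which() (or filter(), which also drops rows where the condition is NA) is usually the safer choice.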

More subset examples

Let’s pull all of the PPM students who have never used R before

survey[survey$Program=="PPM" & survey$Rexperience=="Never used", ]

##    Program                PriorExp Rexperience OperatingSystem TVhours
## 1      PPM         Some experience  Never used         Windows    10.5
## 4      PPM Never programmed before  Never used         Windows    10.0
## 5      PPM Never programmed before  Never used         Windows     4.0
## 7      PPM Never programmed before  Never used        Mac OS X     2.0
## 11     PPM Never programmed before  Never used         Windows    15.0
## 12     PPM    Extensive experience  Never used      Linux/Unix     5.0
## 15     PPM         Some experience  Never used         Windows     5.0
## 17     PPM         Some experience  Never used         Windows    14.0
## 24     PPM         Some experience  Never used         Windows     3.0
## 31     PPM         Some experience  Never used         Windows    20.0
## 43     PPM         Some experience  Never used         Windows     4.0
## 44     PPM         Some experience  Never used         Windows     0.0
## 45     PPM    Extensive experience  Never used        Mac OS X     1.0
## 46     PPM Never programmed before  Never used         Windows     7.0
## 53     PPM         Some experience  Never used         Windows     4.0
##            Editor
## 1           Other
## 4           Excel
## 5  Microsoft Word
## 7  Microsoft Word
## 11 Microsoft Word
## 12          LaTeX
## 15          Other
## 17 Microsoft Word
## 24          Other
## 31 Microsoft Word
## 43 Microsoft Word
## 44 Microsoft Word
## 45          Other
## 46 Microsoft Word
## 53 Microsoft Word

“tidy” subsetting with `filter()`

In general, it is preferable to use the filter() function
Here’s an example of selecting all responses from students who are either in PPM or Other and who listed their R experience as “Basic competence”.

filter(survey, 
       (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")

##    Program             PriorExp      Rexperience OperatingSystem TVhours
## 1    Other Extensive experience Basic competence        Mac OS X       3
## 2      PPM      Some experience Basic competence        Mac OS X       0
## 3      PPM      Some experience Basic competence         Windows       4
## 4    Other      Some experience Basic competence         Windows       0
## 5      PPM Extensive experience Basic competence        Mac OS X      20
## 6      PPM      Some experience Basic competence         Windows      10
## 7      PPM      Some experience Basic competence         Windows       3
## 8      PPM Extensive experience Basic competence         Windows       6
## 9    Other Extensive experience Basic competence        Mac OS X      10
## 10     PPM      Some experience Basic competence        Mac OS X       2
## 11   Other      Some experience Basic competence        Mac OS X       3
## 12     PPM      Some experience Basic competence        Mac OS X       5
## 13     PPM      Some experience Basic competence         Windows      10
## 14     PPM      Some experience Basic competence         Windows       9
## 15     PPM      Some experience Basic competence         Windows      10
## 16   Other      Some experience Basic competence         Windows      10
## 17     PPM      Some experience Basic competence        Mac OS X       2
## 18     PPM      Some experience Basic competence         Windows      15
## 19     PPM      Some experience Basic competence         Windows       3
## 20     PPM      Some experience Basic competence         Windows      21
##            Editor
## 1  Microsoft Word
## 2  Microsoft Word
## 3  Microsoft Word
## 4  Microsoft Word
## 5  Microsoft Word
## 6  Microsoft Word
## 7  Microsoft Word
## 8  Microsoft Word
## 9  Microsoft Word
## 10 Microsoft Word
## 11          LaTeX
## 12     R Markdown
## 13     R Markdown
## 14 Microsoft Word
## 15 Microsoft Word
## 16 Microsoft Word
## 17          Other
## 18 Microsoft Word
## 19 Microsoft Word
## 20 Microsoft Word

`filter()` allows you to split conditions across lines

On the previous slide we had

filter(survey, 
       (Program == "PPM" | Program == "Other") & Rexperience == "Basic competence")

This is equivalent to the easier to parse call:

filter(survey, 
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence")

##    Program             PriorExp      Rexperience OperatingSystem TVhours
## 1    Other Extensive experience Basic competence        Mac OS X       3
## 2      PPM      Some experience Basic competence        Mac OS X       0
## 3      PPM      Some experience Basic competence         Windows       4
## 4    Other      Some experience Basic competence         Windows       0
## 5      PPM Extensive experience Basic competence        Mac OS X      20
## 6      PPM      Some experience Basic competence         Windows      10
## 7      PPM      Some experience Basic competence         Windows       3
## 8      PPM Extensive experience Basic competence         Windows       6
## 9    Other Extensive experience Basic competence        Mac OS X      10
## 10     PPM      Some experience Basic competence        Mac OS X       2
## 11   Other      Some experience Basic competence        Mac OS X       3
## 12     PPM      Some experience Basic competence        Mac OS X       5
## 13     PPM      Some experience Basic competence         Windows      10
## 14     PPM      Some experience Basic competence         Windows       9
## 15     PPM      Some experience Basic competence         Windows      10
## 16   Other      Some experience Basic competence         Windows      10
## 17     PPM      Some experience Basic competence        Mac OS X       2
## 18     PPM      Some experience Basic competence         Windows      15
## 19     PPM      Some experience Basic competence         Windows       3
## 20     PPM      Some experience Basic competence         Windows      21
##            Editor
## 1  Microsoft Word
## 2  Microsoft Word
## 3  Microsoft Word
## 4  Microsoft Word
## 5  Microsoft Word
## 6  Microsoft Word
## 7  Microsoft Word
## 8  Microsoft Word
## 9  Microsoft Word
## 10 Microsoft Word
## 11          LaTeX
## 12     R Markdown
## 13     R Markdown
## 14 Microsoft Word
## 15 Microsoft Word
## 16 Microsoft Word
## 17          Other
## 18 Microsoft Word
## 19 Microsoft Word
## 20 Microsoft Word
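To check the equivalence yourself, here is a sketch on a small made-up data frame (toy is hypothetical; it assumes dplyr is installed). The last variant uses the base R %in% operator as a shorter way to write the "PPM or Other" condition.

```r
library(dplyr)

# Small illustrative data frame
toy <- data.frame(
  Program     = c("PPM", "Other", "MISM", "PPM"),
  Rexperience = c("Basic competence", "Never used",
                  "Basic competence", "Basic competence")
)

# One condition built with & ...
with_and <- filter(toy,
                   (Program == "PPM" | Program == "Other") &
                     Rexperience == "Basic competence")

# ... the same conditions passed as separate arguments ...
with_commas <- filter(toy,
                      Program == "PPM" | Program == "Other",
                      Rexperience == "Basic competence")

# ... and the "or" rewritten with %in%
with_in <- filter(toy,
                  Program %in% c("PPM", "Other"),
                  Rexperience == "Basic competence")

all.equal(with_and, with_commas)
all.equal(with_and, with_in)
```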

“tidy” selection of rows and columns

What if we wanted to select a subset of rows and columns with a combination of filter() and select()?
Here’s one strategy

# First, get the desired rows
row.subset <- filter(survey, 
                     Program == "PPM" | Program == "Other",
                     Rexperience == "Basic competence")
# Then, get the right columns
select(row.subset, TVhours, Editor)

##    TVhours         Editor
## 1        3 Microsoft Word
## 2        0 Microsoft Word
## 3        4 Microsoft Word
## 4        0 Microsoft Word
## 5       20 Microsoft Word
## 6       10 Microsoft Word
## 7        3 Microsoft Word
## 8        6 Microsoft Word
## 9       10 Microsoft Word
## 10       2 Microsoft Word
## 11       3          LaTeX
## 12       5     R Markdown
## 13      10     R Markdown
## 14       9 Microsoft Word
## 15      10 Microsoft Word
## 16      10 Microsoft Word
## 17       2          Other
## 18      15 Microsoft Word
## 19       3 Microsoft Word
## 20      21 Microsoft Word

Piping with `%>%`

Here’s a better strategy, which uses “piping” to supply the output of one computation as an argument into the next
Most of our data processing/summarization pipelines in R will involve lots of piping
The symbol %>% is pronounced “pipe”

filter(survey, 
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)

##    TVhours         Editor
## 1        3 Microsoft Word
## 2        0 Microsoft Word
## 3        4 Microsoft Word
## 4        0 Microsoft Word
## 5       20 Microsoft Word
## 6       10 Microsoft Word
## 7        3 Microsoft Word
## 8        6 Microsoft Word
## 9       10 Microsoft Word
## 10       2 Microsoft Word
## 11       3          LaTeX
## 12       5     R Markdown
## 13      10     R Markdown
## 14       9 Microsoft Word
## 15      10 Microsoft Word
## 16      10 Microsoft Word
## 17       2          Other
## 18      15 Microsoft Word
## 19       3 Microsoft Word
## 20      21 Microsoft Word

Piping: preferred style

When piping, it is best to pipe right from the start

# OK:
filter(survey, 
       Program == "PPM" | Program == "Other",
       Rexperience == "Basic competence") %>%
  select(TVhours, Editor)

# Better:
survey %>%
  filter(Program == "PPM" | Program == "Other",
         Rexperience == "Basic competence") %>%
  select(TVhours, Editor)
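Under the hood, x %>% f(y) is just another way of writing f(x, y). A sketch on a small made-up data frame (toy is hypothetical; it assumes dplyr is installed, which re-exports %>% from the magrittr package):

```r
library(dplyr)   # also makes the %>% operator available

# Small illustrative data frame
toy <- data.frame(
  Program = c("PPM", "Other", "MISM"),
  TVhours = c(10.5, 3, 0)
)

# Nested calls: read from the inside out
nested <- select(filter(toy, TVhours > 0), Program)

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(TVhours > 0) %>%
  select(Program)

all.equal(nested, piped)
```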

Splitting a long expression

As your function calls get longer and more complicated, you may find it useful to split them over multiple lines
Suppose you had something like this:

survey[(survey$Program == "PPM" | survey$Program == "Other") & survey$Rexperience == "Basic competence", ]

You can split this across multiple lines by putting a line break after an operator

survey[(survey$Program == "PPM" | survey$Program == "Other") & 
         survey$Rexperience == "Basic competence", ]

Note that the line break occurs after the & operator
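Why after and not before? At the top level, R stops reading as soon as a line forms a complete expression. Inside brackets or parentheses R keeps reading until they are closed, but breaking after the operator is safe everywhere. A small sketch with plain arithmetic:

```r
# Break AFTER the operator: R sees an unfinished expression and keeps reading
total_after <- 1 + 2 +
  3

# Break BEFORE the operator: the first line is already complete,
# so the second line is evaluated separately as the value +3
total_before <- 1 + 2
  + 3

total_after
total_before
```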

Some simple calculations

mean(survey$TVhours[survey$Program == "PPM"]) # Average time PPM's spent watching TV

## [1] 7.513158

mean(survey$TVhours[survey$Program == "MISM"]) # Average time MISM's spent watching TV

## [1] 4.444444

mean(survey$TVhours[survey$Program == "Other"]) # Average time "Others" spent watching TV

## [1] 6
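The three calls above repeat the same pattern once per program. For reference, base R can compute all of the group means in one call with tapply(). A sketch on a small made-up data frame (toy is a hypothetical stand-in for survey):

```r
# Small illustrative data frame
toy <- data.frame(
  Program = c("PPM", "Other", "MISM", "PPM", "MISM"),
  TVhours = c(10.5, 3, 0, 10, 4)
)

# Mean of TVhours within each value of Program
group_means <- tapply(toy$TVhours, toy$Program, mean)
group_means
```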

(Preview of) “tidy” data summaries with `group_by` and `summarize`

Here’s a much easier and cleaner way of getting the average TV hours watched by students in each program. We use group_by and summarize

survey %>%
  group_by(Program) %>%
  summarize(mean(TVhours))

## # A tibble: 3 x 2
##   Program `mean(TVhours)`
##   <fct>             <dbl>
## 1 MISM               4.44
## 2 Other              6   
## 3 PPM                7.51
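Notice that the summary column above ended up with the awkward name `mean(TVhours)`. You can supply your own name inside summarize(). A sketch on a small made-up data frame (toy and mean_tv are hypothetical names; it assumes dplyr is installed):

```r
library(dplyr)

# Small illustrative data frame
toy <- data.frame(
  Program = c("PPM", "Other", "MISM", "PPM", "MISM"),
  TVhours = c(10.5, 3, 0, 10, 4)
)

# name = expression gives the summary column a usable name
program_means <- toy %>%
  group_by(Program) %>%
  summarize(mean_tv = mean(TVhours))

program_means
```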

Defining variables

If we wanted to focus on a particular column of the data frame, we could always define it as a new variable (e.g., if you want to easily experiment on a column)

tv.hours <- survey$TVhours  # Vector of TVhours watched
mean(tv.hours)              # Average time spent watching TV

## [1] 6.763158

sd(tv.hours)                # Standard deviation of TV watching time

## [1] 5.778737

sum(tv.hours >= 5)   # How many people watched 5 or more hours of TV?

## [1] 29
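The last call works because a comparison returns a logical vector, and R treats TRUE as 1 and FALSE as 0 in arithmetic. A sketch with a few made-up values:

```r
tv.hours <- c(10.5, 3, 0, 10, 4)   # illustrative values, not the survey data

at_least_5 <- tv.hours >= 5   # one TRUE/FALSE per value
at_least_5

sum(at_least_5)    # counts the TRUEs
mean(at_least_5)   # proportion of values that are 5 or more
```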

R coding style

Coding style (and code commenting) will become increasingly important as we get into more advanced and involved programming tasks
Borrowing Hadley Wickham’s words:
“You don’t have to use my style, but you really should use a consistent style.”
This style guide is short and easy to follow
We’ll revisit the question of coding style several times over the course of the class

Enforced style: assignment operator

Assignment operator. Use <- not =

Style guides uniformly promote the use of <- instead of = as the assignment operator

student.names <- c("Eric", "Hao", "Jennifer")  # Good
student.names = c("Eric", "Hao", "Jennifer") # Bad

Use = when specifying function arguments

sort(tv.hours, decreasing=TRUE) # Good
sort(tv.hours, decreasing<-TRUE) # Works, but also creates a variable named decreasing
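Here is what goes on with <- inside a function call, sketched with a few made-up values. The exists() checks assume a fresh session in which no object called decreasing has been defined yet.

```r
tv.hours <- c(3, 10.5, 0)   # illustrative values, not the survey data

# `=` names the argument and leaves the workspace untouched
by_name <- sort(tv.hours, decreasing = TRUE)
exists("decreasing")   # in a fresh session, no such variable exists yet

# `<-` first assigns TRUE to a NEW variable called decreasing,
# then passes that value to sort() by position
by_arrow <- sort(tv.hours, decreasing <- TRUE)
exists("decreasing")   # the stray variable now exists

identical(by_name, by_arrow)
```

The sorted output is the same here only because decreasing happens to be the second argument of sort(). With an argument in a different position, the value would be matched to the wrong parameter.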

Enforced style: Spacing

Binary operators should have spaces around them
Commas should have a space after, but not before (just like in writing)

3 * 4 # Good
3*4 # Bad
which(student.names == "Eric") # Good
which(student.names=="Eric") # Bad

For specifying arguments, spacing around = is optional

sort(tv.hours, decreasing=TRUE) # Accepted
sort(tv.hours, decreasing = FALSE) # Accepted

Enforced style: Variable names

To make code easy to read, debug, and maintain, you should use concise but descriptive variable names
Terms in variable names should be separated by _ or .

# Accepted
day_one   day.one   day_1   day.1   day1

# Bad
d1   DayOne   dayone   

# Can be made more concise:
first.day.of.the.month

Avoid using variable names that are already pre-defined variables or functions in R

# EXTREMELY bad:
c   T   pi   sum   mean
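Why is this so bad? R lets you overwrite these names without complaint, and later code silently picks up your version. A small sketch (run it in a scratch session):

```r
# T is an ordinary variable that happens to hold TRUE, so it can be reassigned
T <- 0
isTRUE(T)   # the shortcut no longer means TRUE

rm(T)       # remove our copy; the built-in value is visible again
isTRUE(T)

# Masking a function name is legal too, and makes later code confusing
mean <- c(1, 2, 3)
mean(mean)  # still runs: R looks for a function when a name is called
rm(mean)    # clean up so that mean refers only to the function again
```

This is also a good reason to write TRUE and FALSE in full: unlike T and F, they are reserved words and cannot be reassigned.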

Assignments

Homework 1 will be posted today
- Due: Wednesday, November 4, 1:30PM ET
- Submit your .Rmd and .html files on Canvas (do not zip them together)
- Course website contains grading rubric
Lab 2 is now available
- Remember: To earn a participation point for today’s class, you must submit Lab 1 and Lab 2 by Friday evening (ET)
