# 2019 Semester 2 Statistics Tutor R Demo
## https://kevinwang09.github.io/tutor_demo/
### [Kevin Y. X. Wang](https://kevinwang09.github.io/)
### Initiated on 2019 Feb 17, compiled on 2019 Jul 29

---

### Disclaimer: these materials were written based on my teaching experiences.

Your experiences may differ.

If "symptoms" persist, consult your lecturers or other senior tutors.

---

2018 tutor training notes

https://github.com/kevinwang09/tutor_demo/blob/master/2018_TutorTraining.pdf

2019 Semester 1 tutor training notes

https://github.com/kevinwang09/tutor_demo/2019S1.html

R Guide

http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html

---

# Submit your answer

http://bit.ly/2019_tutor_training

---

## Outline
.large[
+ Course changes

+ Tutorial format changes

+ Demo on folder structure

+ Demo for data import and cleaning

+ Two examples from my own tutorials
]
---

## More semantic learning & real applied skills

1.  Students will learn how to solve problems with **real** data, with strong emphasis on statistical thinking and computational skills.
1.  They will also develop essential soft skills of collaboration and communication.

> Formulas and code should only be there to reinforce the understanding of statistical concepts.

+ We want our students to learn the concepts of statistics (semantics) and the practice of data science.

+ We do not want them to drown in formulas (syntax) they don't understand.

.footnote[Associate Professor David Easdown has a great [paper](http://www.maths.usyd.edu.au/u/pubs/publist/preprints/2009/easdown-12.pdf) on syntactic and semantic reasoning in mathematical learning. It took me a great deal of time to appreciate the ideas.]
---

### These courses will be different to your 1st year statistics class

---

## Kevin: why do these courses seem 'softer'?

Example 1: what is the definition of the sample correlation coefficient between `$x$` and `$y$`?

Submit your answer at http://bit.ly/2019_tutor_training.

+ When I did 1st year statistics:

`$$r_{xy} =  \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} = \frac{n \sum x_i y_i - \sum x_i \sum y_i }{\sqrt{n \sum x_i^2 - (\sum x_i)^2} \sqrt{n \sum y_i^2 - (\sum y_i)^2}}.$$`

+ Linear algebra: it is the .brand-blue[inner product] between the .brand-blue[centered vectors] `$x$` and `$y$`, divided by the product of their `$L_2$` .brand-blue[norms].

+ Geometry: it is the .brand-blue[cosine of the angle] between the .brand-blue[centered vectors] `$x$` and `$y$`.

> Correlation is a **measure of similarity** between `$x$` and `$y$` which tells us something about the data. But the intuitive geometric interpretation takes a bit more mathematical training.
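
The linear algebra and geometry views can be checked numerically. The following is a minimal sketch in base R; using two columns of the built-in `iris` data as `x` and `y` is an assumption made purely for illustration.

```r
# Illustrative data (an assumption): two numeric columns from the built-in iris dataset
x = iris$Sepal.Length
y = iris$Sepal.Width

# Centre both vectors by subtracting their means
xc = x - mean(x)
yc = y - mean(y)

# Inner product of the centred vectors divided by the product of their L2 norms,
# which is the cosine of the angle between the centred vectors
cosine = sum(xc * yc) / (sqrt(sum(xc^2)) * sqrt(sum(yc^2)))

# Agrees with the built-in Pearson correlation up to floating point error
all.equal(cosine, cor(x, y))
```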

---

## More intuitions and more inter-linked contents

+ Visual: How **tightly clustered** are `$x$` and `$y$` around the trend line? 
+ Numerical: How do we measure the similarity between `$x$` and `$y$` **using a number**?

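+ .brand-red[Population] correlation is the mean of the product of the .brand-blue[z-scores] of `$x$` and `$y$`. The .brand-red[sample] version divides by `$n-1$` instead of `$n$` and uses the sample standard deviations `$s_x$` and `$s_y$`: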

`$$r_{xy}=\frac{1}{n-1}\sum_{i=1}^{n} \left(\frac {x_{i} - {\bar {x}}}{s_{x}}\right) \left(\frac {y_{i} - {\bar {y}}}{s_{y}}\right).$$`

+ .brand-blue[z-score] was already introduced in the course. The difference between .brand-red[population] and .brand-red[sample] is emphasised again.
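
As a rough check of the z-score formulation, the sketch below computes the correlation both ways in base R. The `iris` columns and the helper `pop_sd()` are assumptions introduced only for this illustration.

```r
# Illustrative data (an assumption): two numeric columns from the built-in iris dataset
x = iris$Sepal.Length
y = iris$Sepal.Width
n = length(x)

# Sample version: z-scores use sd(), which divides by n - 1,
# and the products are summed and divided by n - 1
zx = (x - mean(x)) / sd(x)
zy = (y - mean(y)) / sd(y)
r_sample = sum(zx * zy) / (n - 1)

# Population version: z-scores use the population SD (divide by n),
# and the products are averaged with a plain mean.
# pop_sd() is a hypothetical helper, not a base R function.
pop_sd = function(v) sqrt(mean((v - mean(v))^2))
r_pop = mean(((x - mean(x)) / pop_sd(x)) * ((y - mean(y)) / pop_sd(y)))

# Both conventions give the same number as cor()
all.equal(r_sample, cor(x, y))
all.equal(r_pop, cor(x, y))
```

The two versions agree because the factors of `n` or `n - 1` cancel, as long as the same convention is used for both the standard deviations and the averaging.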

---

## Format of DATA1001 tutorials

+ .brand-blue[Old tutorials:] Kevin walks into the room and explains the key calculation for the first 5-10 minutes. Students do some questions with pen and paper to make sure they know how to do this in the assignment and the final exam.

+ .brand-red[New tutorials:] Students get a quiz for the first 10 minutes. The tutorial is motivated by the data and the thinking questions in the textbook. Students then complete the rest of the worksheet using R Markdown.

---

## Tutorial preparation

> You can't solve all questions by declaring maths formulas.

+ Please read the [R Guide](http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html), the tutorial worksheet and the data before class.
+ **Discuss** the solutions (not the numerical answer) with the students, so they can see how you make **statistical arguments**.

.footnote[
.font150[You have all demonstrated abilities in making good statistical arguments during your interview.]]

<center>
.font90[.brand-blue[Having more children causes higher blood pressure]]
<img src="figures/histBloodPressure.png" width="80%">
</center>


---
class: segue

We aim to build statistical intuitions and applied skills in 1st year. 
We will formalise concepts under a rigorous framework in 2nd year.

---
class: segue

### What was the hardest thing about using `R` in your undergraduate studies?

---

## By the end of this course, students should:

+ Use RStudio and RMarkdown.
+ Read in an Excel spreadsheet.
+ Produce basic numerical and graphical summaries.
+ Write up a basic exploratory analysis report that .brand-blue[addresses a research question].

1. Where are my **files and folders**?
1. How do I read in this **Excel sheet**? How can I **clean** my data?
1. How do I do this **graph**? How do I calculate this **number**?

> Notice that the priority is often reversed.

> Students rarely ask the question "What is the **purpose** of this graph/number?"

---

# Demo 1: setting up a folder system: download files into a folder with subfolders, `setwd()`.

---

## What I am supposed to teach you

+ [R Guide](http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html), section 1.5 to 1.8. We are aiming for **one single folder** for the entire course.

+ Every week, students will download a `.Rmd` file and data files into the folder, and work on those questions.

+ Everyone should have the same folder structure.

+ My experience tells me that no one ever does this. Students will:
  - create new folders every week,
  - create subfolders within the main folder,
  - misspell folder names.

+ Windows uses backslashes in file paths, which doesn't help either.
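
One way to sidestep the slash problem is to build paths inside R instead of typing them. This is only a sketch; the folder name `data` and the file name `dirtyIris.csv` are borrowed from the later demo and stand in for whatever the course folder actually contains.

```r
# Build a relative path from folder and file names; file.path() joins them with "/"
csv_path = file.path("data", "dirtyIris.csv")
csv_path

# Forward slashes also work on Windows, so this style is portable.
# A path copied from Windows needs every backslash doubled inside an R string:
windows_style = "data\\dirtyIris.csv"

# Converting backslashes to forward slashes gives the same path as file.path()
gsub("\\", "/", windows_style, fixed = TRUE) == csv_path
```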

---

## An alternative using RStudio Project

> The working directory of a `.Rmd` file is not necessarily the same as the working directory of the RStudio session.

> Not knowing which folder the Rmd/Excel files are in will be the biggest obstacle for our students in the first 4 weeks.

My solution to this is to go to RStudio - New Project:

+ Click `New Directory` if they haven't set up a folder yet. 
+ Click `Existing Directory` if they have already set up a folder.

(I am not supposed to teach you this as it is not in the R Guide, but if they follow this procedure correctly, they should never ask this question again.)
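
When a student cannot find a file, a few base R calls usually locate the problem. The sketch below assumes the `data/dirtyIris.csv` layout used in the later demo; the path is illustrative only.

```r
# Where is R currently looking for files?
# In an RStudio Project this starts at the project folder.
getwd()

# Which files and folders can R see from here?
list.files()

# Check that a file exists before trying to read it.
# The path below is an assumed example; adjust it to the actual folder layout.
file.exists(file.path("data", "dirtyIris.csv"))
```

If the last line returns `FALSE`, either the file is in a different folder or the working directory is not what the student thinks it is.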

---

## Some tips for RStudio + RMarkdown

+ The biggest source of confusion is the **chunks**. Reinforce the idea that anything in a chunk is R code, and everything else is commentary on the R output.

+ Sloooooooooooooooooooowwwwwwww down for your students.

+ Use the `tab` key!!

+ The R Guide is the basic assessable material, so there is no need to do complex coding in class.

+ But in projects, students are encouraged to explore new coding options, e.g. `tidyverse`.

---

# Demo 2: reading and cleaning data

---

## Two simple functions

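+ `readr::read_csv()` has some advantages over `utils::read.csv()`, especially with factors: it keeps character columns as characters, whereas `read.csv()` converts them to factors by default in R versions before 4.0.0. I strongly recommend against RStudio's point-and-click option.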

+ `janitor::clean_names()` is perhaps the simplest function to reduce student frustrations when completing projects.

```r
dirtyIris = readr::read_csv("data/dirtyIris.csv")
```

```
## Parsed with column specification:
## cols(
##   Sepal....Length = col_double(),
##   `Sepal.?    Width` = col_double(),
##   `petal.Length(*&^` = col_double(),
##   `petal.$^&Width` = col_double(),
##   `SPECIES^` = col_character(),
##   allEmpty = col_logical()
## )
```

```r
colnames(dirtyIris)
```

```
## [1] "Sepal....Length"  "Sepal.?    Width" "petal.Length(*&^"
## [4] "petal.$^&Width"   "SPECIES^"         "allEmpty"
```

```r
janitor::clean_names(dirtyIris)
```

```
## # A tibble: 650 x 6
##    sepal_length sepal_width petal_length petal_width species    all_empty
##           <dbl>       <dbl>        <dbl>       <dbl> <chr>      <lgl>    
##  1       7.7          3.8          6.7         2.2   virginica  NA       
##  2      -0.184       NA           NA           1.10  setosa     NA       
##  3       7.2          3.6          6.1         2.5   virginica  NA       
##  4       6.3          2.3          4.4         1.3   versicolor NA       
##  5       5.6          2.9          3.6         1.3   versicolor NA       
##  6      NA           -0.906        0.793      NA     setosa     NA       
##  7       0.0187      -0.906        0.793      NA     virginica  NA       
##  8       6.3          2.5          4.9         1.5   versicolor NA       
##  9      NA           NA            1.69       NA     setosa     NA       
## 10      -0.184       -1.46         1.69        0.320 virginica  NA       
## # … with 640 more rows
```
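
Note that `clean_names()` returns a new data frame, so the result has to be assigned to be kept. The sketch below uses a small made-up data frame (not `dirtyIris.csv`) and also shows `janitor::remove_empty()`, which is an addition to the demo above, for dropping a column that is entirely `NA`.

```r
library(janitor)

# A tiny made-up data frame with messy column names (hypothetical data)
messy = data.frame(
  "Sepal Length" = c(5.1, 4.9, NA),
  "SPECIES" = c("setosa", "setosa", "virginica"),
  "allEmpty" = c(NA, NA, NA),
  check.names = FALSE
)

# Assign the cleaned result; the original object is left unchanged
cleaned = clean_names(messy)
names(cleaned)

# Drop columns in which every value is NA
cleaned = remove_empty(cleaned, which = "cols")
names(cleaned)
```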

---

## Tips on reading and cleaning data

+ Don't try to be (too) perfect. Students are gentle creatures; they do not like the sight of error/warning messages in sharp red text.

```r
ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))
```

<pre style="color: red;"><code>## Error in ggplot(iris): could not find function "ggplot"
</code></pre>

---

## Tips on reading and cleaning data

+ Showing your programming mistakes and then correcting them can send a positive signal to your students.

```r
library(ggplot2)
ggplot(iris) +
  geom_point(aes(x = Sepal.Length, y = Sepal.Width))
```

![](index_files/figure-html/unnamed-chunk-5-1.png)

.footnote[Garth would do live Google searches and copy StackExchange code even though he knew the answer.]

---
class: segue

# Additional content for advanced R users

---
## Students struggle with subsetting data (1)

`R` is partly to blame for this. This [base R cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/05/base-r.pdf) helps.
.scroll-box-20[

```r
x = 1:5; names(x) = letters[1:5]
x[c(2)] ## Index subsetting
```

```
## b 
## 2
```

```r
x[c(FALSE, TRUE, FALSE, FALSE, FALSE)] ## Logical subsetting
```

```
## b 
## 2
```

```r
x[c("b")] ## Names subsetting 
```

```
## b 
## 2
```

```r
x[-c(1, 3, 4, 5)] ## Elimination using negative indexing
```

```
## b 
## 2
```

```r
head(iris$Species) ## Unique only to data.frame column subsetting
```

```
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica
```
]
---

## Students struggle with subsetting data (2)

+ A logical vector used for subsetting should be the same length as the `vector`, or as the number of rows/columns of the `data.frame`.
+ `$` doesn't work if the column name contains spaces (unless the name is wrapped in backticks), for example `data$Sepal Length`.
+ It is hard, and I haven't worked out the best way to teach this. Sarah suggested [this](http://adv-r.had.co.nz/Subsetting.html).
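
The two points above can be illustrated with the built-in `iris` data. This is a minimal sketch; the renamed copy `spaced` is a hypothetical object created only to show a column name with a space.

```r
# Logical subsetting of rows: the logical vector has one entry per row of iris
is_setosa = iris$Species == "setosa"
length(is_setosa) == nrow(iris)
setosa = iris[is_setosa, ]
dim(setosa)

# A copy of iris with a space in the first column name (hypothetical example)
spaced = iris
names(spaced)[1] = "Sepal Length"

# spaced$Sepal Length is a syntax error;
# wrap the name in backticks or use [[ ]] with a quoted name instead
head(spaced$`Sepal Length`)
head(spaced[["Sepal Length"]])
```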

---
class: segue

# Demo: two real examples from my tutorial

---

## Example 1

Student asks: how can I do a **dot plot** or a **line plot** for the median of this data for the `species` categories?

1. The question is not well-formulated. Be patient with them.

2. Students always ask "how". Don't be trapped by that question. Ask them why they want such a plot.

> First, could you show me the data please?

The `iris` dataset.

> .brand-red[Why] do you want to see the median?

So I can say that there is an *increasing trend* across the three categories for my report.

---

> This data has four continuous variables. When you say increasing trend, which one of the four are you referring to?

Ummmm.... the `Sepal.Length`?

--
 
> So given one continuous variable (`Sepal.Length`) and a discrete variable (`Species`), how would you draw a line plot or dot plot?

I can't do a line plot. But maybe a dot plot like this?

![](index_files/figure-html/unnamed-chunk-8-1.png)
---
> Can you tell that the trend is actually "increasing" from this "dot plot", especially considering that the ranges of points overlap?

No, I probably can't.

> If you check your lecture notes, can you find the recommended visualisation for cases where we have one continuous variable and one discrete variable? And what is the function?

```r
boxplot(iris$Sepal.Length ~ iris$Species)
```

![](index_files/figure-html/unnamed-chunk-9-1.png)
---
## Example 1 extended

Most students will have trouble subsetting/summarising data. I strongly recommend the `dplyr` package for this: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html.

```r
library(dplyr) # provides %>%, group_by() and summarise_all()
iris %>% 
  group_by(Species) %>% 
  summarise_all(median)
```

```
## # A tibble: 3 x 5
##   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
##   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
## 1 setosa              5           3.4         1.5          0.2
## 2 versicolor          5.9         2.8         4.35         1.3
## 3 virginica           6.5         3           5.55         2
```

---

## Example 2: Simpson's paradox

True or false?

> In the `iris` dataset, we know the variable `Sepal.Length` is negatively correlated with `Sepal.Width`. Therefore, these two variables must also be negatively correlated within any arbitrarily chosen `Species`.

```r
cor(iris$Sepal.Length, iris$Sepal.Width)
```

```
## [1] -0.1175698
```

Yes, because the overall trend is an aggregation of the individual trends, and therefore each individual trend must itself be negative.

---

## Example 2: calculating the correlation

> Why don't you try to find out by actually calculating the correlation values?

```r
cor(iris[iris$Species == "setosa",]$Sepal.Length, iris[iris$Species == "setosa",]$Sepal.Width)
```

```
## [1] 0.7425467
```

```r
cor(iris[iris$Species == "versicolor",]$Sepal.Length, iris[iris$Species == "versicolor",]$Sepal.Width)
```

```
## [1] 0.5259107
```

```r
cor(iris[iris$Species == "virginica",]$Sepal.Length, iris[iris$Species == "virginica",]$Sepal.Width)
```

```
## [1] 0.4572278
```

---

## Example 2: calculating the correlation

> Good! Let me teach you a much easier way.

```r
iris %>% 
  group_by(Species) %>% 
  summarise(cor = cor(Sepal.Length, Sepal.Width))
```

```
## # A tibble: 3 x 2
##   Species      cor
##   <fct>      <dbl>
## 1 setosa     0.743
## 2 versicolor 0.526
## 3 virginica  0.457
```
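
For students who have not installed `dplyr`, the same grouped calculation can be sketched in base R. The object names below are assumptions chosen for this illustration.

```r
# Split the data frame into one data frame per species
by_species = split(iris, iris$Species)

# Correlation within each species
within_cor = sapply(by_species, function(d) cor(d$Sepal.Length, d$Sepal.Width))
within_cor

# Correlation ignoring species
overall_cor = cor(iris$Sepal.Length, iris$Sepal.Width)

# Simpson's paradox: every within-species correlation is positive,
# while the pooled correlation is negative
all(within_cor > 0) && overall_cor < 0
```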

---

## Example 2: plotting

```r
ggplot(iris,
       aes(x = Sepal.Length, y = Sepal.Width)) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_grid(~Species, margins = TRUE)
```

![](index_files/figure-html/unnamed-chunk-14-1.png)

---
## Summary on the course itself
.large[

+ There is a stronger emphasis on statistical thinking and coding.

+ You don't have to impress your students by showing you know more maths symbols. Impress them by showing more insight into the concepts.

+ Ask leading questions. Don't give out free answers.

+ The courses are different, so do the necessary preparation before each tutorial. The [R Guide](http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html) is your friend.
]

---

## Summary on the coding component
.large[
+ There is an endless number of functions/packages that I could go through with you, but given the time we have I can only recommend that you start with `readr::read_csv`, `janitor::clean_names`, `dplyr` and `ggplot2`.

+ Use the R [cheatsheets](https://www.rstudio.com/resources/cheatsheets/).

+ Be prepared for non-standard questions in their projects.

+ Approach your lecturers and other senior tutors for advice; we are all in this together.

]
---
## Reference
+ Easdown, D., 2009. Syntactic and semantic reasoning in mathematics teaching and learning. Int. J. Math. Educ. Sci. Technol. 40, 941–949. https://doi.org/10.1080/00207390903205488

+ Freedman, D., Pisani, R., Purves, R. Statistics, 4th ed. Norton, New York.

+ Menzies, A.M., Haydu, L.E., Visintin, L., Carlino, M.S., Howle, J.R., Thompson, J.F., Kefford, R.F., Scolyer, R.A., Long, G. V., 2012. Distinguishing clinicopathologic features of patients with V600E and V600K BRAF-mutant metastatic melanoma. Clin. Cancer Res. 18, 3242–3249. https://doi.org/10.1158/1078-0432.CCR-12-0052