class: center, middle, inverse, title-slide # 2019 Semester 2 Statistics Tutor R Demo ##
https://kevinwang09.github.io/tutor_demo/
###
Kevin Y. X. Wang
### Initiated on 2019 Feb 17, compiled on 2019 Jul 29 --- class: segue ### Disclaimer: these materials were written based on my teaching experiences. Your experiences may differ. If "symptoms" persist, consult your lecturers or other senior tutors. --- class: segue-yellow 2018 tutor training notes https://github.com/kevinwang09/tutor_demo/blob/master/2018_TutorTraining.pdf 2019 Semester 1 tutor training notes https://github.com/kevinwang09/tutor_demo/2019S1.html R Guide http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html --- class: segue-yellow # Submit your answer http://bit.ly/2019_tutor_training --- ## Outline .large[ + Course changes + Tutorial format changes + Demo on folder structure + Demo for data import and cleaning + Two examples from my own tutorials ] --- ## More semantic learning & real applied skills 1. Students will learn how to solve problems with **real** data, with strong emphasis on statistical thinking and computational skills. 1. They will also develop essential soft skills of collaboration and communication. > Formula and coding should only be there to reinforce the understanding of statistical concepts. + We want our students to learn the concepts of statistics (semantics) and the practice of data science. + We do not want them to be drowned by formulas (syntax) they don't understand. .footnote[Associate Professor David Easdown has a great [paper](http://www.maths.usyd.edu.au/u/pubs/publist/preprints/2009/easdown-12.pdf) on syntactic and semantic reasoning in mathematical learning. It took me a great deal of time to appreciate the ideas.] --- class: segue ### These courses will be different to your 1st year statistics class --- ## Kevin: why do these courses seem 'softer'? Example 1: what is the definition of sample correlation coefficient between `\(x\)` and `\(y\)`? Submit your answer at http://bit.ly/2019_tutor_training. -- + When I did 1st year statistics: `$$r_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2} \sqrt{\sum (y_i - \bar{y})^2}} = \frac{n \sum x_i y_i - \sum x_i \sum y_i }{\sqrt{n \sum x_i^2 - (\sum x_i)^2} \sqrt{n \sum y_i^2 - (\sum y_i)^2}}.$$` -- + Linear algebra: it is the .brand-blue[inner product] between the .brand-blue[centered vectors] `\(x\)` and `\(y\)`, divided by the products of their `\(L_2\)` .brand-blue[norms]. -- + Geometry: it is the .brand-blue[cosine of the angle] between the .brand-blue[centered vectors] `\(x\)` and `\(y\)`. > Correlation is a **measure of similarity** between `\(x\)` and `\(y\)` which informs us something about the data. But the intuitive geometric interpretation takes a bit more mathematical training. --- ## More intuitions and more inter-linked contents + Visual: How **tightly clustered** are `\(x\)` and `\(y\)` around the trend line? + Numerical: How to measure similarity `\(x\)` and `\(y\)` **using a number**? <center> <img src="figures/two_scatter_plots.png", width="40%"> </center> <!-- + We look at the *centre and spread* of each paired observation in `\(x\)` and `\(y\)` using z-scores. Then, we get a *joint summary* of the two z-scores using multiplication and averaging. --> + .brand-red[Populations] correlation is the mean of the product of the .brand-blue[z-scores] of `\(x\)` and `\(y\)`. -- `$$r_{xy}=\frac{1}{n-1}\sum_{i=1}^{n} \left(\frac {x_{i} - {\bar {x}}}{s_{x}}\right) \left(\frac {y_{i} - {\bar {y}}}{s_{y}}\right).$$` + .brand-blue[z-score] was already introduced in the course. The difference between .brand-red[population] and .brand-red[sample] is emphasised again. --- ## Format of DATA1001 tutorials + .brand-blue[Old tutorials:] Kevin walks into the room, explains the key calculation for the first 5-10 minutes. Students do some questions on pen and paper to make sure they will know how to do this in the assignment and the final exam. -- + .brand-red[New tutorials:] Students actually get a quiz for the first 10 minutes. The tutorial motivations are the data and the thinking questions in textbook. Students then complete the rest of the worksheet using R Markdown. <center> <img src="figures/rmsError.png", width="60%"> </center> --- ## Tutorial preparation > You can't solve all questions by declaring maths formulas. + Please read the [R Guide](http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html) and the tutorial worksheet & the data before the class. + **Discuss** the solutions (not the numerical answer) with the students, so they can see how you make **statistical arguments**. .footnote[ .font150[You have all demonstrated abilities in making good statistical arguments during your interview.]] -- .pull-left[ <center> .font90[.brand-blue[Having more children causes higher blood pressure]] <img src="figures/histBloodPressure.png", width="80%"> </center> ] -- .pull-right[ <center> <img src="figures/melanomaSurvival2.png", width="60%"> </center> ] --- class: segue We aim to build statistical intuitions and applied skills in 1st year. We will formalise concepts under a rigorous framework in 2nd year. --- class: segue ### What was the hardest thing about using `R` in your undergraduate studies? --- ## By the end of this course, students should: + Use RStudio and RMarkdown. + Read in an Excel spreadsheet. + Produce basic numerical and graphical summaries. + Write up a basic exploratory analysis report that .brand-blue[addresses a research question]. -- .brand-red[.font170[Common questions]] 1. Where are my **files and folders**? 1. How do I read in this **Excel sheet**? How can I **clean** my data? 1. How do I do this **graph**? How do I calculate this **number**? > Notice that the priority is often reversed. > Students rarely ask the question "What is the **purpose** of this graph/number". --- class: segue # Demo 1: setting up a folder system: download files into a folder with subfolders, `setwd()`. <center> <img src="figures/folderStr.jpeg", width="70%"> </center> --- ## What I suppose to teach you + [R Guide](http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html), section 1.5 to 1.8. We are aiming for **one single folder** for the entire course. + Every week, students will download a `.Rmd` file and data files into the folder, and work on those questions. + Everyone should have the same folder structure. + My experience tells me that no one ever does this. - They will create new folders every week - Create subfolders within the main folder - Mis-spell folder names - Windows uses backward-slashes which doesn't help either. --- ## An alternative using RStudio Project > The working directory of a `.Rmd` file is not the same as the working directory of the RStudio. > That inability to recognise where the Rmd/Excel files are located in which folder will be the biggest obstacle for our student in the first 4 weeks. My solution to this is to go to RStudio - New Project: + Click `New Directory` if they haven't set up a folder yet. + Click `Existing Directory` if they already set up a folder. (I am not suppose to teach you this as it is not on the R Guide, but if they follow this procedure correctly, they should never ask this question ever again.) --- ## Some tips for RStudio + RMarkdown + The biggest source of confusion is the **chunks**. Reinforce this idea that anything in a chunk is R code, everything else are comments on the R outputs. + Sloooooooooooooooooooowwwwwwww down for your students. + Use the `tab` key!! + The R Guide is the basic assessable material, there is no need to do complex coding in class. + But in projects, students are encouraged to explore new coding options, e.g. `tidyverse`. <center> <img src="figures/base_tidyverse.jpg", width="40%"> </center> .footnote[https://twitter.com/thinkR_fr/status/882604987381055489] --- class: segue # Demo 2: reading and cleaning data --- ## Two simple functions + `readr::read_csv()` has some advantages over `utils::read.csv()`, especially with factors. I strongly recommend against RStudio's point-and-click option. + `janitor::clean_names()` is perhaps the simplest function to reduce student frustrations when completing projects. .scroll-box-20[ ```r dirtyIris = readr::read_csv("data/dirtyIris.csv") ``` ``` ## Parsed with column specification: ## cols( ## Sepal....Length = col_double(), ## `Sepal.? Width` = col_double(), ## `petal.Length(*&^` = col_double(), ## `petal.$^&Width` = col_double(), ## `SPECIES^` = col_character(), ## allEmpty = col_logical() ## ) ``` ```r colnames(dirtyIris) ``` ``` ## [1] "Sepal....Length" "Sepal.? Width" "petal.Length(*&^" ## [4] "petal.$^&Width" "SPECIES^" "allEmpty" ``` ```r janitor::clean_names(dirtyIris) ``` ``` ## # A tibble: 650 x 6 ## sepal_length sepal_width petal_length petal_width species all_empty ## <dbl> <dbl> <dbl> <dbl> <chr> <lgl> ## 1 7.7 3.8 6.7 2.2 virginica NA ## 2 -0.184 NA NA 1.10 setosa NA ## 3 7.2 3.6 6.1 2.5 virginica NA ## 4 6.3 2.3 4.4 1.3 versicolor NA ## 5 5.6 2.9 3.6 1.3 versicolor NA ## 6 NA -0.906 0.793 NA setosa NA ## 7 0.0187 -0.906 0.793 NA virginica NA ## 8 6.3 2.5 4.9 1.5 versicolor NA ## 9 NA NA 1.69 NA setosa NA ## 10 -0.184 -1.46 1.69 0.320 virginica NA ## # … with 640 more rows ``` ] --- ## Tips on reading and cleaning data + Don't try to be (too) perfect. Students are gentle creatures, they do not like the sight of error/warnings messages in sharp red text. ```r ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width)) ``` <pre style="color: red;"><code>## Error in ggplot(iris): could not find function "ggplot" </code></pre> <center> <img src="figures/deer.jpg", width="40%"> </center> --- ## Tips on reading and cleaning data + Showing your programming mistakes and then correcting them can send a positive signal to your students. ```r library(ggplot2) ggplot(iris) + geom_point(aes(x = Sepal.Length, y = Sepal.Width)) ``` <!-- --> .footnote[Garth would do live Google searches and copy StackExchange codes even though he knew the answer.] --- class: segue # Additional content for advanced R users --- ## Students struggle with subsetting data (1) `R` is partially blamed for this. This [base R cheatsheet](https://www.rstudio.com/wp-content/uploads/2016/05/base-r.pdf) helps. .scroll-box-20[ ```r x = 1:5; names(x) = letters[1:5] x[c(2)] ## Index subsetting ``` ``` ## b ## 2 ``` ```r x[c(FALSE, TRUE, FALSE, FALSE, FALSE)] ## Logical subsetting ``` ``` ## b ## 2 ``` ```r x[c("b")] ## Names subsetting ``` ``` ## b ## 2 ``` ```r x[-c(1, 3, 4, 5)] ## Eliminiation using negative indexing ``` ``` ## b ## 2 ``` ```r head(iris$Species) ## Unique only to data.frame column subsetting ``` ``` ## [1] setosa setosa setosa setosa setosa setosa ## Levels: setosa versicolor virginica ``` ] --- ## Students struggle with subsetting data (2) | | vector | data.frame (row) | data.frame (column) | |:----------------------:|:----------------------------------------:|:---------------------------------------------:|:---------------------------------------------:| | Index | `x[c(2)]` | `data[c(2), ]` | `data[, c(2)]` | | Names | `x[c("b")]` | `data[c("b"), ]` | `data[, c("b")]` | | Logical | `x[c(FALSE, TRUE, FALSE, FALSE, FALSE)]` | `data[c(FALSE, TRUE, FALSE, FALSE, FALSE), ]` | `data[, c(FALSE, TRUE, FALSE, FALSE, FALSE)]` | | | | | | | Elimination of index | `x[-c(1, 3, 4, 5)]` | `data[-c(1, 3, 4, 5), ]` | `data[, -c(1, 3, 4, 5)]` | | $ | - | - | `data$b` | + Logical vector must be the same length as the `vector`, or number of rows/columns as the `data.frame`. + `$` doesn't work if the column name contain spaces*, for example `data$Sepal Length`. + It is hard, and I haven't worked out the best way to teach this. Sarah suggested [this](http://adv-r.had.co.nz/Subsetting.html). .footnote[.font150[ *This is the reason I recommend `janitor::clean_names()`. ] ] --- class: segue # Demo: two real examples from my tutorial --- ## Example 1 Student asks: how can I do a **dot plot** or the **line plot** for the median of this data for the `species` categories? -- 1. The question is not well-formulated. Be patient with them. 2. Students always asks "how". Don't be trapped by that question. Ask them why they want to do such a plot? -- > First, could you show me the data please? -- The `iris` dataset. -- > .brand-red[Why] do you want to see the median? -- So I can say that there is an *increasing trend* across the three categories for my report. --- > This data has four continuous variables. When you say increasing trend, which one of the four are you referring to? -- Ummmm.... the `Sepal.Length`? -- > So given one coninuous variable (`Sepal.Length`), and a discrete variable (`Species`), how would you draw a line plot or dot plot? -- I can't do a line plot. But maybe a dot plot like this? <!-- --> --- > Can you tell the trends is actually "increasing" from this "dot plot"? Especially considering there are ranges of points that overlaps? -- No, I probably can't. > If you check you lecture notes, can you find out what is the recommended visualisation for cases where we have one coninuous variable and a discrete variable? And what the function? -- ```r boxplot(iris$Sepal.Length ~ iris$Species) ``` <!-- --> --- ## Example 1 extended Most students will have trouble subsetting/summarising data. I strongly recommend the `dplyr` package for this: https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html. ```r # library(dplyr) iris %>% group_by(Species) %>% summarise_all(median) ``` ``` ## # A tibble: 3 x 5 ## Species Sepal.Length Sepal.Width Petal.Length Petal.Width ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 setosa 5 3.4 1.5 0.2 ## 2 versicolor 5.9 2.8 4.35 1.3 ## 3 virginica 6.5 3 5.55 2 ``` --- ## Example 2: Simpson's paradox True or false? > In the `iris` dataset, we know the variable `Sepal.Length` is negatively correlated with `Sepal.Width`. Therefore, it is necessary that these two variables must also be negatively correlated in any arbitrarily chosen `Species`. ```r cor(iris$Sepal.Length, iris$Sepal.Width) ``` ``` ## [1] -0.1175698 ``` Yes, because the overall trend is an aggregation of the individual trends. And therefore each trends themselves must be negative. --- ## Example 2: calculating the correlation > Why don't you try to find out by actually calculating the correlation values? ```r cor(iris[iris$Species == "setosa",]$Sepal.Length, iris[iris$Species == "setosa",]$Sepal.Width) ``` ``` ## [1] 0.7425467 ``` ```r cor(iris[iris$Species == "versicolor",]$Sepal.Length, iris[iris$Species == "versicolor",]$Sepal.Width) ``` ``` ## [1] 0.5259107 ``` ```r cor(iris[iris$Species == "virginica",]$Sepal.Length, iris[iris$Species == "virginica",]$Sepal.Width) ``` ``` ## [1] 0.4572278 ``` --- ## Example 2: calculating the correlation > Good! Let me teach you a much easier way ```r iris %>% group_by(Species) %>% summarise(cor = cor(Sepal.Length, Sepal.Width)) ``` ``` ## # A tibble: 3 x 2 ## Species cor ## <fct> <dbl> ## 1 setosa 0.743 ## 2 versicolor 0.526 ## 3 virginica 0.457 ``` --- ## Example 2: plotting ```r ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point() + geom_smooth(method = "lm") + facet_grid(~Species, margins = TRUE) ``` <!-- --> --- ## Summary on the course itself .large[ + There is a stronger emphasis on statistical thinking and coding. + You don't have to impress your students by showing you know more maths symbols. Impress them by showing more insights of the concept. + Ask leading questions. Don't give out free answers. + The courses are different, so do the necessary preparations before tutorial. The [R Guide](http://www.maths.usyd.edu.au/u/UG/JM/DATA1001/r/current/guides/RGuide.html) is your friend. ] --- ## Summary on the coding component .large[ + There is an infinite number of functions/packages that I can go through with you, but given the time we have I can only recommend that you start with `readr::read_csv`, `janitor::clean_names`, `dplyr` and `ggplot2`. + Use the R [cheatsheets](https://www.rstudio.com/resources/cheatsheets/). + Be prepared for non-standard questions in their projects. + Approach your lecturers and other senior tutors for advice, we are all in this together. ] --- ## Reference + Easdown, D., 2009. Syntactic and semantic reasoning in mathematics teaching and learning. Int. J. Math. Educ. Sci. Technol. 40, 941–949. https://doi.org/10.1080/00207390903205488 + Freedman, David & Pisani, Robert & Purves, Roger. Statistics (4th ed). Norton, New York. + Menzies, A.M., Haydu, L.E., Visintin, L., Carlino, M.S., Howle, J.R., Thompson, J.F., Kefford, R.F., Scolyer, R.A., Long, G. V., 2012. Distinguishing clinicopathologic features of patients with V600E and V600K BRAF-mutant metastatic melanoma. Clin. Cancer Res. 18, 3242–3249. https://doi.org/10.1158/1078-0432.CCR-12-0052