class: center, middle, inverse, title-slide

# mcvis: multicollinearity visualisation

## https://kevinwang09.github.io/pres/mcvis_talk
### Kevin Y. X. Wang

### 5 December 2019, Adelaide

---

## Acknowledgement

This is joint work with Chen Lin (Fudan University) and Prof Samuel Mueller (University of Sydney).

<br>

.pull-left[
<center>
<img src="figures/chen3.jpg" width="40%">
</center>
<center>
<img src="figures/fudan.png" width="30%">
</center>
]

.pull-right[
<center>
<img src="figures/samuel.jpg" width="40%">
</center>
<center>
<img src="figures/usyd.png" width="40%">
</center>
]

---

## Cricketers' career batting statistics

+ Cricket is a bat-and-ball game.

+ The aim of a batsman is to score as many **runs** as possible before getting **out**.


```r
glimpse(X)
```

```
## Observations: 810
## Variables: 8
## $ log_runs  <dbl> 2.20, 1.56, 2.84, 2.68, 2.01, 3.21, 2.03, 2.65, 3.13, 2.68,…
## $ log_outs  <dbl> 1.040, 0.778, 1.410, 1.320, 1.230, 1.610, 1.000, 1.490, 1.7…
## $ log_ave   <dbl> 1.160, 0.778, 1.430, 1.360, 0.778, 1.600, 1.030, 1.160, 1.4…
## $ log_fours <dbl> 1.280, 0.301, 1.830, 1.830, 1.040, 2.160, 1.110, 1.650, 2.0…
## $ log_sixes <dbl> 0.000, 0.000, 0.477, 0.845, 0.301, 0.602, 0.000, 0.477, 0.6…
## $ log_ducks <dbl> 0.699, 0.477, 0.602, 0.602, 1.040, 0.778, 0.602, 0.845, 0.6…
## $ log_hs    <dbl> 2.07, 1.26, 2.02, 2.00, 1.41, 2.10, 1.48, 1.52, 2.10, 1.95,…
## $ log_100   <dbl> 0.301, 0.000, 0.301, 0.301, 0.000, 0.699, 0.000, 0.000, 0.4…
```

---

## Interesting feature in this data

Three of the variables are linked by an exact relationship, by definition:

`$$\text{batting ave} = \frac{\text{runs}}{\text{no. of outs}}, \qquad \text{or equivalently, } \qquad \texttt{log_runs} = \texttt{log_ave} + \texttt{log_outs}.$$`

<br>

.center[
<img src="index_files/figure-html/unnamed-chunk-3-1.png" width="360" />
]
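A quick sanity check of this identity (a sketch, using the data frame `X` printed above; the difference need not be exactly zero because the stored variables appear to be rounded):


```r
# Difference between log_runs and log_ave + log_outs: close to zero, up to rounding
summary(X$log_runs - (X$log_ave + X$log_outs))
```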
---

## What is multi-collinearity (MC)?

MC occurs when the columns of `\(X\)` are linearly dependent (exactly or approximately).


```r
M1 = lm(log_100 ~ ., data = X)
broom::tidy(M1)
```

```
## # A tibble: 8 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) -0.365      0.0902   -4.05   5.67e- 5
## 2 log_runs    -1.92       1.95     -0.984  3.25e- 1
## 3 log_outs     1.61       1.96      0.826  4.09e- 1
## 4 log_ave      1.84       1.96      0.943  3.46e- 1
## 5 log_fours    0.647      0.0969    6.68   4.58e-11
## 6 log_sixes    0.131      0.0264    4.96   8.57e- 7
## 7 log_ducks    0.00357    0.0497    0.0718 9.43e- 1
## 8 log_hs      -0.0187     0.0753   -0.248  8.04e- 1
```
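The "approximately" part matters here: if the dependence were exact, `lm()` would simply alias one of the columns rather than inflate the standard errors. A small illustration (an assumption on my part, enforcing the exact identity by hand):


```r
# Overwrite log_runs with the exact sum, then refit: lm() detects the exact
# linear dependence and typically returns NA for the aliased column (log_ave here).
X_exact = transform(X, log_runs = log_ave + log_outs)
coef(lm(log_100 ~ ., data = X_exact))[c("log_runs", "log_outs", "log_ave")]
```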
---

## Consequence of multi-collinearity

+ We proceed with all variables rounded to 3 significant figures (so the identity between `log_runs`, `log_ave` and `log_outs` holds only approximately).

The same model is fitted three times: with all predictors ("Include all"), without `log_runs`, and without `log_ave`. Each block of three columns gives the estimates, standard errors and p-values for one fit.

| Predictors  | Est.  | SE   | p          | Est.  | SE   | p          | Est.  | SE   | p          |
|:------------|:-----:|:----:|:----------:|:-----:|:----:|:----------:|:-----:|:----:|:----------:|
| (Intercept) | -0.37 | 0.09 | **<0.001** | -0.37 | 0.09 | **<0.001** | -0.36 | 0.09 | **<0.001** |
| log_runs    | -1.92 | 1.95 | 0.325      |       |      |            | -0.08 | 0.12 | 0.491      |
| log_outs    | 1.61  | 1.96 | 0.409      | -0.31 | 0.11 | **0.004**  | -0.23 | 0.10 | **0.019**  |
| log_ave     | 1.84  | 1.96 | 0.346      | -0.08 | 0.12 | 0.530      |       |      |            |
| log_fours   | 0.65  | 0.10 | **<0.001** | 0.64  | 0.10 | **<0.001** | 0.65  | 0.10 | **<0.001** |
| log_sixes   | 0.13  | 0.03 | **<0.001** | 0.13  | 0.03 | **<0.001** | 0.13  | 0.03 | **<0.001** |
| log_ducks   | 0.00  | 0.05 | 0.943      | 0.00  | 0.05 | 0.922      | 0.00  | 0.05 | 0.934      |
| log_hs      | -0.02 | 0.08 | 0.804      | -0.02 | 0.08 | 0.811      | -0.02 | 0.08 | 0.837      |
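The three fits in this table can be reproduced along these lines (a sketch; the object names are mine, and the original table was presumably produced with a table-formatting package):


```r
# Fit the three models compared above: all predictors, then dropping one of the
# variables involved in the near-exact identity.
M_all     = lm(log_100 ~ .,            data = X)
M_no_runs = lm(log_100 ~ . - log_runs, data = X)
M_no_ave  = lm(log_100 ~ . - log_ave,  data = X)
lapply(list(M_all, M_no_runs, M_no_ave), broom::tidy)
```

Note how the estimate and standard error of `log_outs` change dramatically once one of the collinear variables is removed.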
---

## High correlation `\(\neq\)` multicollinearity

.pull-left[
<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="648" />
]

<br>
<br>

.pull-right[
+ By definition, it is a linear combination of variables that causes MC.

+ The variables causing MC are not necessarily the most highly correlated ones.

+ Thus, identifying high correlations does not always identify the sources of MC.

.blockquote[
Diagnosis of multicollinearity requires specialised statistics.
]
]

---
class: segue

# Existing methods

---

## 1. Variance inflation factors (VIFs)

Introduced in Marquaridt (1970) and elsewhere:

`$$VIF_j = \frac{1}{1 - R^2_j}, \qquad j = 1, \dots, p,$$`

where `\(R^2_j\)` is the coefficient of determination when the independent variable `\(\boldsymbol{x}_j\)` is regressed on the remaining `\(p-1\)` independent variables.

A **larger** value of `\(VIF_j\)` implies that `\(\boldsymbol{x}_j\)` is well predicted by the other variables, and hence that it contributes more strongly to MC.

--


```r
M1 = lm(log_100 ~ ., data = X)
M1 %>% car::vif() %>% round(2)
```

```
##  log_runs  log_outs   log_ave log_fours log_sixes log_ducks    log_hs 
##  23995.96  11410.15   4666.15     55.60      2.53      3.99     12.17
```

+ Using a threshold of 5, as suggested by Sheather (2009), 5 MC-causing variables are identified.

<!-- The top four variables for causing multicollinearity are: -->
<!-- ✅ `log_runs` -->
<!-- ✅ `log_outs` -->
<!-- ❌ `log_fours` -->
<!-- ✅ `log_ave` -->
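To connect the output above back to the formula, the VIF of a single predictor can also be computed by hand (a sketch, assuming `X` as before):


```r
# Regress log_runs on the remaining predictors (the response log_100 is excluded),
# then apply VIF = 1 / (1 - R^2); this should roughly match car::vif() above.
r2_runs = summary(lm(log_runs ~ . - log_100, data = X))$r.squared
1 / (1 - r2_runs)
```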
---

## 2. Eigenvalues of `\(X^\top X\)`

The eigenvalues of the "uncentered covariance matrix", `\(\lambda_{1}\geq\lambda_{2}\geq{\ldots}\geq{\lambda_{p}}\geq 0\)`, offer a more linear-algebraic view of MC.

A **smaller** value of `\(\lambda_{p}\)` means the determinant of `\(X^\top X\)` is closer to 0, which implies (near) linear dependence in `\(X\)` and thus MC (Stewart 1987).


```r
Xmat = X %>% as.data.frame() %>% as.matrix() %>% scale()
eigen = svd(t(Xmat) %*% Xmat)
round(eigen$d, 3)
```

```
## [1] 4839.921  928.325  303.818  252.626   91.953   45.354    9.982    0.020
```

Note: this only indicates the existence of MC, not which variables cause it.

---

## Relationships between the two measures

Suppose that `\(X\)` is standardised to have mean 0 and variance 1, and that we decompose `\((X^\top X)^{-1}\)` as `\(G\operatorname{diag}(1/\lambda_{1},\dots,1/\lambda_{p}){G^\top}\)`. Then:

.center[
`\(\left(\begin{array}{c} VIF_1 \\ \vdots \\ VIF_p \end{array}\right)=\left(\begin{array}{ccc}g_{11}^2 & \cdots & g_{1p}^2 \\ \vdots & \ddots & \vdots \\ g_{p1}^2 & \cdots & g_{pp}^2 \end{array} \right) \left(\begin{array}{c} \tau_{1} \\ \vdots \\ \tau_{p} \end{array} \right) = (G \circ G) \boldsymbol{\tau}\)`,
]

where `\(\tau_{j}=1/\lambda_{j}, \quad j=1,\ldots,p\)`.

.blockquote[
.center[
A larger `\(\tau_p\)` value indicates stronger MC.
]
]

--

+ It would be great to have a formula of the form `\(\tau_p = f(VIF_1, \dots, VIF_p)\)` that reveals the relationship between each variable `\(\boldsymbol{x}_j\)` and the main source of MC, `\(\tau_p\)`.

+ However, inverting `\((G \circ G)\)` directly is numerically unstable:


```r
solve(eigen$u * eigen$u)[1:2,1:5]
```

```
##               [,1]         [,2]         [,3]          [,4]         [,5]
## [1,] -3.549883e+14 3.346813e+14 1.009966e+15 -6.696969e+13 1.492518e+14
## [2,] -1.050852e+13 9.907388e+12 2.989748e+13 -1.982468e+12 4.418220e+12
```

---
class: segue

# The mcvis method

---

## mcvis

<br>
<br>
<br>

.blockquote[
.center[
We perform linear regression between `\(\tau_p\)` and every VIF.
]
]

+ By quantifying the linearity between `\(\tau_p\)` and the VIFs, we can diagnose the MC-causing variables.

+ How can we generate multiple "observations" of both `\(\tau_p\)` and the VIFs?

+ Sampling! A minimal sketch of this idea is given below.
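The sketch below illustrates the resampling idea only; it is not the `mcvis` implementation, and the details (a plain bootstrap of the rows of `X`, `B = 100` resamples, VIFs taken as the diagonal of the inverse correlation matrix) are my own assumptions for illustration. The actual procedure is shown schematically on the following slides.


```r
# For each bootstrap resample of the rows: recompute the eigenvalues of the
# standardised cross-product matrix (giving tau_p = 1/lambda_p) and the VIFs,
# then regress tau_p on the VIFs across resamples.
set.seed(13)
n = nrow(X)
B = 100
boot_stats = t(replicate(B, {
  Xb   = scale(as.matrix(X[sample(n, replace = TRUE), ]))
  lam  = svd(crossprod(Xb))$d                          # eigenvalues of Xb' Xb
  vifs = setNames(diag(solve(cor(Xb))), colnames(Xb))  # VIF_j for each column
  c(tau_p = 1 / min(lam), vifs)
}))
boot_df = as.data.frame(boot_stats)
coef(summary(lm(tau_p ~ ., data = boot_df)))  # which VIFs explain tau_p?
```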
---

<center>
<img src="figures/mcvis_figures/mcvis_figures.001.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.002.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.003.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.004.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.005.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.006.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.007.png" width="100%">
</center>

---

<center>
<img src="figures/mcvis_figures/mcvis_figures.008.png" width="100%">
</center>

---
class: segue

# The `mcvis` package

---

## 1. MC-index


```r
library(mcvis)
set.seed(13)
p = ncol(X)
mcvis_result = mcvis(X[,-p])
round(mcvis_result$MC[p-1,], 2)
```

```
##  log_runs  log_outs   log_ave log_fours log_sixes log_ducks    log_hs 
##      0.69      0.16      0.14      0.00      0.00      0.00      0.00
```

---

## 2. MC visualisation

.center[

```r
ggplot_mcvis(mcvis_result)
```

<img src="index_files/figure-html/unnamed-chunk-12-1.png" width="720" />
]

---

## 3. Shiny app for interactive data exploration

<center>
<img src="figures/shiny.png" width="60%">
</center>

---

## Extension work: Multiple `\(\tau\)`'s

.center[

```r
ggplot_mcvis(mcvis_result, eig_max = 7)
```

<img src="index_files/figure-html/unnamed-chunk-13-1.png" width="720" />
]

---

## Simulated data from `mplot`

+ The `R` package `mplot` (Tarr et al. 2018) provides a simulated dataset with correlated variables in the context of variable selection.

+ `mcvis` clearly identifies the correct cause of collinearity (.sydney-red[x8]), whereas the largest VIF points to .sydney-blue[x6].

|          | `\(x_1\)` | `\(x_2\)` | `\(x_3\)` | `\(x_4\)` | `\(x_5\)` | `\(x_6\)` | `\(x_7\)` | `\(x_8\)` | `\(x_9\)` |
|:--------:|:-------:|:------:|:------:|:-----:|:-----:|:------:|:-----:|:------:|:-----:|
| MC index | 0.00885 | 0.0113 | 0.282  | 0.002 | 0.023 | 0.276  | 0.015 | .sydney-red[0.362] | 0.020 |
| VIF      | 23.84   | 23.36  | 109.76 | 7.82  | 41.52 | .sydney-blue[167.07] | 31.37 | 145.89 | 41.12 |

.center[
<img src="index_files/figure-html/unnamed-chunk-14-1.png" width="720" />
]

---

## Final remarks

+ mcvis provides a new MC-index and a visualisation of multicollinearity in linear regression.

+ mcvis builds on classical statistics within a resampling framework, uncovering the sources of collinearity together with an understanding of their variability.

+ Learn more from:

.pull-left[
- [leaffur/mcvis](https://github.com/leaffur/mcvis)

- [kevinwang09/mcvispy](https://github.com/kevinwang09/mcvispy)

- [samuel.mueller@sydney.edu.au](mailto:samuel.mueller@sydney.edu.au)

- [@KevinWang009](https://twitter.com/KevinWang009) and [@SamuelMuller74](https://twitter.com/SamuelMuller74)
]

.pull-right[
<center>
<img src="https://raw.githubusercontent.com/kevinwang09/mcvis/master/inst/mcvis_logo.png" width="60%">
</center>
]

---

## Bibliography

<p><cite>Marquaridt, D. W. (1970). “Generalized Inverses, Ridge Regression, Biased Linear Estimation, and Nonlinear Estimation”. In: <em>Technometrics</em> 12.3, pp. 591–612.</cite></p>

<p><cite>Belsley, D. A. (1984). “Demeaning Conditioning Diagnostics through Centering”. In: <em>The American Statistician</em> 38.2, pp. 73–77.</cite></p>

<p><cite>Stewart, G. W. (1987). “Collinearity and Least Squares Regression”. In: <em>Statistical Science</em> 2.1, pp. 68–84.</cite></p>

<p><cite>O'Brien, R. M. (2007). “A Caution Regarding Rules of Thumb for Variance Inflation Factors”. In: <em>Quality &amp; Quantity</em> 41.5, pp. 673–690.</cite></p>

<p><cite>Sheather, S. (2009). <em>A Modern Approach to Regression with R</em>. Springer Texts in Statistics. New York, NY: Springer New York.</cite></p>

<p><cite>Friendly, M. and E. Kwan (2009). “Where's Waldo? Visualizing Collinearity Diagnostics”. In: <em>The American Statistician</em> 63.1, pp. 56–65.</cite></p>