For this comparison, I will compare the scikit-learn
library in Python
and the newly developed tidymodels
meta-package in R
. The main reason that I have chosen these two is because they share a lot of similarities and imposed strict frameworks in data pre-processing, modelling and evaluations.
The data that I will use is the penguins
data from the R
package palmerpenguins
, which you can learn more about here. The response variable is a factor
variable, species
, indicating the species of a penguin. The other predictor variables are a mix of both numeric and factor variables. For convenience, I have reduced the number of species to two and extracted the data below in a CSV format so that Python can also use this data through pd.read_csv
.
library(palmerpenguins)
library(tidyverse)
readr::write_csv(penguins %>% na.omit, path = "data/penguins_complete.csv")
## python: /home/runner/.virtualenvs/r-reticulate/bin/python
## libpython: /home/runner/.local/share/r-miniconda/envs/r-reticulate/lib/libpython3.6m.so
## pythonhome: /home/runner/.virtualenvs/r-reticulate:/home/runner/.virtualenvs/r-reticulate
## version: 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) [GCC 9.3.0]
## numpy: /home/runner/.virtualenvs/r-reticulate/lib/python3.6/site-packages/numpy
## numpy_version: 1.19.5
##
## NOTE: Python version was forced by use_python function
## [1] TRUE
## [1] FALSE
R
: tidyverse
It is hard for me to remember how to use R
without tidyverse
. And by tidyverse
, I am referring to both the package and the whole framework that it has established and keeps proliferating through many community-built R
packages.
Here, we will use the readr
and dplyr
packages to import the data and quickly get a summary of the data.
library(tidyverse)
penguins = readr::read_csv(file = "data/penguins_complete.csv")
penguins %>% glimpse()
## Rows: 214
## Columns: 8
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
## $ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
## $ body_mass_g <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
## $ sex <chr> "male", "female", "female", "female", "male", "femal…
## $ year <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Python
: pandas
import pandas as pd
penguins = pd.read_csv("data/penguins_complete.csv")
penguins.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 214 entries, 0 to 213
## Data columns (total 8 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 species 214 non-null object
## 1 island 214 non-null object
## 2 bill_length_mm 214 non-null float64
## 3 bill_depth_mm 214 non-null float64
## 4 flipper_length_mm 214 non-null int64
## 5 body_mass_g 214 non-null int64
## 6 sex 214 non-null object
## 7 year 214 non-null int64
## dtypes: float64(2), int64(3), object(3)
## memory usage: 13.5+ KB
penguins.describe()
## bill_length_mm bill_depth_mm ... body_mass_g year
## count 214.000000 214.000000 ... 214.000000 214.000000
## mean 42.004673 18.370561 ... 3714.719626 2008.028037
## std 5.491545 1.191134 ... 435.667063 0.827440
## min 32.100000 15.500000 ... 2700.000000 2007.000000
## 25% 37.800000 17.500000 ... 3400.000000 2007.000000
## 50% 40.600000 18.400000 ... 3700.000000 2008.000000
## 75% 46.000000 19.100000 ... 3993.750000 2009.000000
## max 58.000000 21.500000 ... 4800.000000 2009.000000
##
## [8 rows x 5 columns]
R
: rsample
library(tidymodels)
set.seed(2020)
penguins = penguins %>%
dplyr::mutate(across(.cols = is.character, .fns = as.factor))
## Warning: Predicate functions must be wrapped in `where()`.
##
## # Bad
## data %>% select(is.character)
##
## # Good
## data %>% select(where(is.character))
##
## ℹ Please update your code.
## This message is displayed once per session.
penguins_split = rsample::initial_split(data = penguins, prop = 0.1)
penguins_split
## <Analysis/Assess/Total>
## <21/193/214>
penguins_train = penguins_split %>% training()
penguins_test = penguins_split %>% testing()
Python
: train_test_split
from sklearn.model_selection
from sklearn.model_selection import train_test_split
from pandas import get_dummies
y = penguins["species"].values
penguins_x = pd.get_dummies(penguins.drop("species", axis = 1))
x = penguins_x.values
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.1, random_state = 123)
R
: recipes
penguins_recipe = recipes::recipe(species ~ ., data = penguins_train) %>%
step_knnimpute(all_predictors()) %>%
prep()
## Warning: `step_knnimpute()` was deprecated in recipes 0.1.16.
## Please use `step_impute_knn()` instead.
penguins_recipe
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 7
##
## Training data contained 21 data points and no missing data.
##
## Operations:
##
## K-nearest neighbor imputation for bill_length_mm, ... [trained]
Python
: NoneR
: parsnip
rf_model = rand_forest() %>%
set_args(trees = 100) %>%
set_engine("randomForest") %>%
set_mode("classification")
Python
: sklearn.ensemble
from sklearn.ensemble import RandomForestClassifier
steps = [('rf', RandomForestClassifier(n_estimators = 100))]
R
: workflowr
rf_workflow = workflow() %>%
add_recipe(penguins_recipe) %>%
add_model(rf_model)
rf_pred_tbl = predict(rf_workflow %>% fit(data = penguins_train),
new_data = penguins_test)
rf_pred_tbl = rf_pred_tbl %>%
dplyr::mutate(true = penguins_test$species %>% factor)
rf_pred_tbl
## # A tibble: 193 x 2
## .pred_class true
## <fct> <fct>
## 1 Adelie Adelie
## 2 Adelie Adelie
## 3 Adelie Adelie
## 4 Adelie Adelie
## 5 Adelie Adelie
## 6 Adelie Adelie
## 7 Adelie Adelie
## 8 Adelie Adelie
## 9 Adelie Adelie
## 10 Adelie Adelie
## # … with 183 more rows
Python
: Pipeline
from sklearn.pipeline
from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps = steps)
rf_pipeline_model = pipeline.fit(x_train, y_train)
rf_pred_tbl = pd.DataFrame(
{"true": y_test,
"pred": rf_pipeline_model.predict(x_test)})
rf_pred_tbl
## true pred
## 0 Chinstrap Adelie
## 1 Chinstrap Chinstrap
## 2 Adelie Adelie
## 3 Adelie Adelie
## 4 Adelie Adelie
## .. ... ...
## 188 Adelie Adelie
## 189 Adelie Adelie
## 190 Adelie Adelie
## 191 Adelie Adelie
## 192 Chinstrap Chinstrap
##
## [193 rows x 2 columns]
R
: yardstick
multi_metric = metric_set(accuracy, precision, recall, f_meas)
multi_metric(data = rf_pred_tbl, truth = true, estimate = ".pred_class")
## # A tibble: 4 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.943
## 2 precision binary 0.923
## 3 recall binary 1
## 4 f_meas binary 0.960
Python
: sklearn.metrics
from sklearn.metrics import classification_report
cr = classification_report(
y_true = rf_pred_tbl.true,
y_pred = rf_pred_tbl.pred)
print(cr)
## precision recall f1-score support
##
## Adelie 0.96 1.00 0.98 130
## Chinstrap 1.00 0.90 0.95 63
##
## accuracy 0.97 193
## macro avg 0.98 0.95 0.96 193
## weighted avg 0.97 0.97 0.97 193