For this comparison, I will look at the scikit-learn library in Python and the newly developed tidymodels meta-package in R. The main reason I have chosen these two is that they share many similarities and both impose strict frameworks for data pre-processing, modelling and evaluation.

The data that I will use is the penguins data from the R package palmerpenguins, which you can learn more about here. The response variable is a factor, species, indicating the species of a penguin. The predictors are a mix of numeric and factor variables. For convenience, I have reduced the number of species to two and exported the data below as a CSV file so that Python can also read it through pd.read_csv.

library(palmerpenguins)
library(tidyverse)
penguins %>% 
  na.omit() %>% 
  dplyr::filter(species != "Gentoo") %>%   # keep only the Adelie and Chinstrap penguins
  readr::write_csv(file = "data/penguins_complete.csv")
(The Python chunks in this post are run through the reticulate package, using Python 3.6 and numpy 1.19 from the r-reticulate virtual environment.)

Importing data and getting a summary

R: tidyverse

It is hard for me to remember how to use R without the tidyverse. And by tidyverse, I am referring to both the package itself and the whole framework it has established, which keeps proliferating through many community-built R packages.

Here, we will use the readr and dplyr packages to import the data and quickly get a summary of the data.

library(tidyverse)
penguins = readr::read_csv(file = "data/penguins_complete.csv")
penguins %>% glimpse()
## Rows: 214
## Columns: 8
## $ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
## $ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
## $ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
## $ sex               <chr> "male", "female", "female", "female", "male", "femal…
## $ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Python: pandas

import pandas as pd
penguins = pd.read_csv("data/penguins_complete.csv")
penguins.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 214 entries, 0 to 213
## Data columns (total 8 columns):
##  #   Column             Non-Null Count  Dtype  
## ---  ------             --------------  -----  
##  0   species            214 non-null    object 
##  1   island             214 non-null    object 
##  2   bill_length_mm     214 non-null    float64
##  3   bill_depth_mm      214 non-null    float64
##  4   flipper_length_mm  214 non-null    int64  
##  5   body_mass_g        214 non-null    int64  
##  6   sex                214 non-null    object 
##  7   year               214 non-null    int64  
## dtypes: float64(2), int64(3), object(3)
## memory usage: 13.5+ KB
penguins.describe()
##        bill_length_mm  bill_depth_mm  ...  body_mass_g         year
## count      214.000000     214.000000  ...   214.000000   214.000000
## mean        42.004673      18.370561  ...  3714.719626  2008.028037
## std          5.491545       1.191134  ...   435.667063     0.827440
## min         32.100000      15.500000  ...  2700.000000  2007.000000
## 25%         37.800000      17.500000  ...  3400.000000  2007.000000
## 50%         40.600000      18.400000  ...  3700.000000  2008.000000
## 75%         46.000000      19.100000  ...  3993.750000  2009.000000
## max         58.000000      21.500000  ...  4800.000000  2009.000000
## 
## [8 rows x 5 columns]
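
Note that describe() summarizes only the numeric columns by default, which is why just five of the eight columns appear above. Asking for all columns gets a little closer to the overview that glimpse() gives on the R side:

# Also summarize the object (string) columns: count, unique, top and freq.
penguins.describe(include = "all")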

Splitting the data into training and test sets

R: rsample

library(tidymodels)
set.seed(2020)
penguins = penguins %>% 
  dplyr::mutate(across(.cols = where(is.character), .fns = as.factor))
penguins_split = rsample::initial_split(data = penguins, prop = 0.1)
penguins_split
## <Analysis/Assess/Total>
## <21/193/214>
penguins_train = penguins_split %>% training()
penguins_test = penguins_split %>% testing()

Python: train_test_split from sklearn.model_selection

from sklearn.model_selection import train_test_split

y = penguins["species"].values
# One-hot encode the character columns: scikit-learn estimators expect
# numeric arrays, whereas randomForest in R handles factors directly.
penguins_x = pd.get_dummies(penguins.drop("species", axis = 1))
x = penguins_x.values

x_train, x_test, y_train, y_test = train_test_split(x, y, train_size = 0.1, random_state = 123)

Setting up a recipe

R: recipes

penguins_recipe = recipes::recipe(species ~ ., data = penguins_train) %>% 
  step_impute_knn(all_predictors()) %>% 
  prep()
penguins_recipe
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          7
## 
## Training data contained 21 data points and no missing data.
## 
## Operations:
## 
## K-nearest neighbor imputation for bill_length_mm, ... [trained]

Python: None
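
There is no separate recipe object on the Python side; any pre-processing goes straight into the Pipeline set up below. For what it is worth, scikit-learn's KNNImputer covers similar ground to step_impute_knn(). A minimal sketch (assuming scikit-learn >= 0.22) that could be prepended to the steps list defined in the next section:

from sklearn.impute import KNNImputer

# Fill missing values using the k nearest rows, roughly analogous to
# step_impute_knn(). Prepend this tuple to the `steps` list below.
impute_step = ('impute', KNNImputer(n_neighbors = 5))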

Defining a (random forest) model

R: parsnip

rf_model = rand_forest() %>%
  set_args(trees = 100) %>% 
  set_engine("randomForest") %>% 
  set_mode("classification")

Python: sklearn.ensemble

from sklearn.ensemble import RandomForestClassifier

steps = [('rf', RandomForestClassifier(n_estimators = 100))]

Setting up a workflow/pipeline

R: workflows

rf_workflow = workflow() %>% 
  add_recipe(penguins_recipe) %>% 
  add_model(rf_model)

rf_pred_tbl = predict(rf_workflow %>% fit(data = penguins_train), 
                      new_data = penguins_test)

rf_pred_tbl = rf_pred_tbl %>% 
  dplyr::mutate(true = penguins_test$species %>% factor)

rf_pred_tbl
## # A tibble: 193 x 2
##    .pred_class true  
##    <fct>       <fct> 
##  1 Adelie      Adelie
##  2 Adelie      Adelie
##  3 Adelie      Adelie
##  4 Adelie      Adelie
##  5 Adelie      Adelie
##  6 Adelie      Adelie
##  7 Adelie      Adelie
##  8 Adelie      Adelie
##  9 Adelie      Adelie
## 10 Adelie      Adelie
## # … with 183 more rows

Python: Pipeline from sklearn.pipeline

from sklearn.pipeline import Pipeline
pipeline = Pipeline(steps = steps)

rf_pipeline_model = pipeline.fit(x_train, y_train)

rf_pred_tbl = pd.DataFrame(
    {"true": y_test,
     "pred": rf_pipeline_model.predict(x_test)})

rf_pred_tbl
##           true       pred
## 0    Chinstrap     Adelie
## 1    Chinstrap  Chinstrap
## 2       Adelie     Adelie
## 3       Adelie     Adelie
## 4       Adelie     Adelie
## ..         ...        ...
## 188     Adelie     Adelie
## 189     Adelie     Adelie
## 190     Adelie     Adelie
## 191     Adelie     Adelie
## 192  Chinstrap  Chinstrap
## 
## [193 rows x 2 columns]

Evaluation

R: yardstick

multi_metric = metric_set(accuracy, precision, recall, f_meas)
multi_metric(data = rf_pred_tbl, truth = true, estimate = .pred_class)
## # A tibble: 4 x 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.943
## 2 precision binary         0.923
## 3 recall    binary         1    
## 4 f_meas    binary         0.960

Python: sklearn.metrics

from sklearn.metrics import classification_report

cr = classification_report(
    y_true = rf_pred_tbl.true,
    y_pred = rf_pred_tbl.pred)

print(cr)
##               precision    recall  f1-score   support
## 
##       Adelie       0.96      1.00      0.98       130
##    Chinstrap       1.00      0.90      0.95        63
## 
##     accuracy                           0.97       193
##    macro avg       0.98      0.95      0.96       193
## weighted avg       0.97      0.97      0.97       193
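
The report above aggregates over both classes. To pull out the same four metrics that yardstick reported, the individual metric functions can be used; a small sketch, where pos_label = "Adelie" mirrors yardstick's default of treating the first factor level as the event:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Accuracy, precision, recall and F1 with "Adelie" as the positive class.
print(accuracy_score(rf_pred_tbl.true, rf_pred_tbl.pred))
print(precision_score(rf_pred_tbl.true, rf_pred_tbl.pred, pos_label = "Adelie"))
print(recall_score(rf_pred_tbl.true, rf_pred_tbl.pred, pos_label = "Adelie"))
print(f1_score(rf_pred_tbl.true, rf_pred_tbl.pred, pos_label = "Adelie"))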