In this post, I compare the scikit-learn library in Python with the newly developed tidymodels meta-package in R. I chose these two because they have a lot in common: both impose a strict framework on data pre-processing, modelling and evaluation.

The data I will use is the penguins data from the R package palmerpenguins, which you can learn more about in the package's documentation. The response variable, species, is a factor indicating the species of a penguin. The predictor variables are a mix of numeric and factor variables. For convenience, I have reduced the number of species to two and written the data to a CSV file below so that Python can also read it through pd.read_csv.

library(palmerpenguins)
library(tidyverse)

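penguins %>% 
  na.omit() %>%   # drop rows with any missing value
  dplyr::filter(species %in% c("Adelie", "Chinstrap")) %>%   # keep two species only
  readr::write_csv(file = "data/penguins_complete.csv")   # `path` is deprecated in readr >= 1.4; use `file`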

Importing data and getting a summary

`R`: `tidyverse`

library(tidyverse)
penguins = readr::read_csv(file = "data/penguins_complete.csv")
penguins %>% colnames()

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

`Python`: `pandas`

import pandas as pd
penguins = pd.read_csv("data/penguins_complete.csv")
penguins.columns

## Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
##        'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
##       dtype='object')

feature_set = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[feature_set]
y = penguins.species
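
The heading promises a summary, so here is a minimal sketch of one way to get it in pandas. To keep the snippet runnable without the CSV file, it uses a tiny placeholder data frame called `demo` (a hypothetical name) whose values are made up for illustration and are not real penguin measurements; on the real data you would call the same methods on `penguins`.

```python
import pandas as pd

# Placeholder rows with the same column names as the CSV. The values are
# made up for illustration and are NOT real penguin measurements.
demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap"],
    "bill_length_mm": [39.0, 40.5, 48.0, 50.5],
    "bill_depth_mm": [18.5, 18.0, 17.5, 19.0],
    "flipper_length_mm": [181, 190, 195, 200],
    "body_mass_g": [3700, 3800, 3600, 3900],
})

feature_set = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]

# Count, mean, std, min, quartiles and max for each numeric predictor
numeric_summary = demo[feature_set].describe()

# Number of rows per class of the response variable
class_counts = demo["species"].value_counts()

print(numeric_summary)
print(class_counts)
```

Checking `value_counts()` on the response before modelling is a cheap way to spot class imbalance, which matters when accuracy is the tuning metric.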

Grid search

R

Python

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np

param_grid = {'max_depth': np.arange(1, 10)} 
dtc_model = DecisionTreeClassifier(random_state = 0)
dtc_cv = GridSearchCV(dtc_model, param_grid, cv = 5) 
dtc_cv.fit(X, y)

## GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
##              param_grid={'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9])})

dtc_cv.best_params_

## {'max_depth': 3}
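
Beyond `best_params_`, the fitted `GridSearchCV` object also keeps the cross-validated score of every candidate. The sketch below shows one way to inspect them. As an assumption made purely so the snippet runs without the CSV file, it swaps in scikit-learn's bundled iris data for the penguins data; the names `X_demo`, `y_demo`, `search` and `results` are hypothetical, and the scores it prints say nothing about the penguins.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris replaces the penguins CSV so this runs anywhere
X_demo, y_demo = load_iris(return_X_y=True)

param_grid = {"max_depth": np.arange(1, 10)}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X_demo, y_demo)

# One row per candidate depth, with mean and std of the 5-fold accuracy
results = pd.DataFrame(search.cv_results_)[
    ["param_max_depth", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
print(search.best_params_, round(search.best_score_, 3))

# With the default refit=True, best_estimator_ has been refit on all the data
best_tree = search.best_estimator_
print(best_tree.get_depth())
```

Looking at the whole table rather than only the winner shows whether neighbouring depths score almost as well, in which case a shallower tree may be the more sensible choice.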

Compare R and Python: model tuning
