For this comparison, I will look at the scikit-learn library in Python and the newly developed tidymodels meta-package in R. I chose these two mainly because they have a lot in common: both impose a strict framework on data pre-processing, modelling and evaluation.

The data that I will use is the penguins data from the R package palmerpenguins, which you can learn more about here. The response variable is a factor variable, species, indicating the species of a penguin. The predictor variables are a mix of numeric and factor variables. For convenience, I have reduced the number of species to two, dropped the rows with missing values, and exported the data below in CSV format so that Python can read the same data through pd.read_csv.


library(palmerpenguins)  # provides the penguins data
library(dplyr)           # provides %>% and filter()

penguins %>% 
  na.omit() %>% 
  dplyr::filter(species %in% c("Adelie", "Chinstrap")) %>% 
  readr::write_csv("data/penguins_complete.csv")
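For readers who live on the Python side, the same two preparation steps (drop incomplete rows, keep two species) can be sketched in pandas. This is only an illustration: the `raw` data frame below holds a few made-up rows standing in for the real palmerpenguins data, and the names `raw` and `subset` are my own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for the raw penguins data;
# the values are invented for illustration only.
raw = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo", "Adelie"],
    "bill_length_mm": [39.1, 46.5, 47.3, np.nan],
    "sex": ["male", "female", "male", None],
})

# Mirror the R pipeline: drop rows with any missing value,
# then keep only the two species of interest.
subset = (
    raw.dropna()
       .loc[lambda d: d["species"].isin(["Adelie", "Chinstrap"])]
       .reset_index(drop=True)
)

print(subset)
```

Calling `subset.to_csv(..., index=False)` on the result would then play the role of `readr::write_csv`.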

Importing data and getting a summary

R: tidyverse

penguins = readr::read_csv(file = "data/penguins_complete.csv")
penguins %>% colnames()
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Python: pandas

import pandas as pd
penguins = pd.read_csv("data/penguins_complete.csv")
penguins.columns
## Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
##        'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
##       dtype='object')
# Numeric predictors used as features; species is the response
feature_set = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[feature_set]
y = penguins.species
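To get a quick summary of the features and the response on the Python side, something along these lines should work. It is a minimal sketch, not the only way to do it: the helper `summarise` and the `toy` data frame are hypothetical names of mine, and the toy rows are made up so that the snippet runs without the CSV file.

```python
import pandas as pd

def summarise(df, features, target="species"):
    """Return summary statistics of the features and counts per class."""
    stats = df[features].describe()      # count, mean, std, quartiles
    counts = df[target].value_counts()   # number of rows per species
    return stats, counts

# Hypothetical toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.0, 41.0, 46.0],
    "body_mass_g": [3700.0, 3900.0, 3800.0],
})

stats, counts = summarise(toy, ["bill_length_mm", "body_mass_g"])
print(stats)
print(counts)
```

With the real data, the call would be `summarise(penguins, feature_set)`, using the `penguins` and `feature_set` objects defined above.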