For this comparison, I will compare the scikit-learn
library in Python
and the newly developed tidymodels
meta-package in R
. The main reason that I have chosen these two is because they share a lot of similarities and imposed strict frameworks in data pre-processing, modelling and evaluations.
The data that I will use is the penguins
data from the R
package palmerpenguins
, which you can learn more about here. The response variable is a factor
variable, species
, indicating the species of a penguin. The other predictor variables are a mix of both numeric and factor variables. For convenience, I have reduced the number of species to two and extracted the data below in a CSV format so that Python can also use this data through pd.read_csv
.
library(palmerpenguins)
library(tidyverse)
penguins %>%
na.omit %>%
dplyr::filter(species %in% c("Adelie", "Chinstrap")) %>%
readr::write_csv(path = "data/penguins_complete.csv")
R
: tidyverse
library(tidyverse)
penguins = readr::read_csv(file = "data/penguins_complete.csv")
penguins %>% colnames()
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
Python
: pandas
import pandas as pd
penguins = pd.read_csv("data/penguins_complete.csv")
penguins.columns
## Index(['species', 'island', 'bill_length_mm', 'bill_depth_mm',
## 'flipper_length_mm', 'body_mass_g', 'sex', 'year'],
## dtype='object')
feature_set = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
X = penguins[feature_set]
y = penguins.species
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
import numpy as np
param_grid = {'max_depth': np.arange(1, 10)}
dtc_model = DecisionTreeClassifier(random_state = 0)
dtc_cv = GridSearchCV(dtc_model, param_grid, cv = 5)
dtc_cv.fit(X, y)
## GridSearchCV(cv=5, estimator=DecisionTreeClassifier(random_state=0),
## param_grid={'max_depth': array([1, 2, 3, 4, 5, 6, 7, 8, 9])})
dtc_cv.best_params_
## {'max_depth': 3}