Evaluated only once to create a CSV file.
library(palmerpenguins)
library(tidyverse)
readr::write_csv(x = penguins, file = "data/penguins.csv")
R)This chunk of code will take the in-memory data and allow this data to be queried exactly like a SQL database.
The print methods for R is different to that of
SQL.
library(dplyr)
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
penguins = readr::read_csv(file = "data/penguins.csv")
dbListTables(con)
## character(0)
dbWriteTable(con, "penguins", penguins)
dbListTables(con)
## [1] "penguins"
dbListFields(con, "penguins")
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
Both languages use select to subset columns.
Rpenguins %>%
select(species, island)
SQLSELECT species, island
FROM penguins
| species | island |
|---|---|
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
| Adelie | Torgersen |
Rpenguins %>%
select_all()
SQLSELECT *
FROM penguins;
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
| Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
| Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | NA | 2007 |
| Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | NA | 2007 |
Both languages use a version of distinct to extract
unique categorical variables, or a combination of several categorical
variables.
Rpenguins %>%
distinct(species)
penguins %>%
distinct(species, island)
SQLSELECT DISTINCT species
FROM penguins;
| species |
|---|
| Adelie |
| Gentoo |
| Chinstrap |
SELECT DISTINCT species, island
FROM penguins;
| species | island |
|---|---|
| Adelie | Torgersen |
| Adelie | Biscoe |
| Adelie | Dream |
| Gentoo | Biscoe |
| Chinstrap | Dream |
This is a frequency table.
R requires a tally function, which is smart enough to
know that the grouping variable must always be kept.
SQL requires manual selection of both the grouping
variable and the numerical frequencies.
Rpenguins %>%
group_by(species) %>%
tally()
SQLSELECT species,
COUNT(species) AS n
FROM penguins
GROUP BY species;
| species | n |
|---|---|
| Adelie | 152 |
| Chinstrap | 68 |
| Gentoo | 124 |
Rpenguins %>%
slice(1:10)
penguins %>%
top_n(10, body_mass_g)
SQLSELECT *
FROM penguins
LIMIT 10
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
| Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
| Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
| Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
| Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
| Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
| Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
| Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
| Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | NA | 2007 |
| Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | NA | 2007 |
SELECT *
FROM penguins
ORDER BY body_mass_g DESC
LIMIT 10
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Gentoo | Biscoe | 49.2 | 15.2 | 221 | 6300 | male | 2007 |
| Gentoo | Biscoe | 59.6 | 17.0 | 230 | 6050 | male | 2007 |
| Gentoo | Biscoe | 51.1 | 16.3 | 220 | 6000 | male | 2008 |
| Gentoo | Biscoe | 48.8 | 16.2 | 222 | 6000 | male | 2009 |
| Gentoo | Biscoe | 45.2 | 16.4 | 223 | 5950 | male | 2008 |
| Gentoo | Biscoe | 49.8 | 15.9 | 229 | 5950 | male | 2009 |
| Gentoo | Biscoe | 48.4 | 14.6 | 213 | 5850 | male | 2007 |
| Gentoo | Biscoe | 49.3 | 15.7 | 217 | 5850 | male | 2007 |
| Gentoo | Biscoe | 55.1 | 16.0 | 230 | 5850 | male | 2009 |
| Gentoo | Biscoe | 49.5 | 16.2 | 229 | 5800 | male | 2008 |
The task is to find the species frequency distribution of penguins with body mass greater than 4000 grams.
The ordering of the dplyr verbs matter quite a bit here.
The filtering operation must be done first, before the tallying. This is
because the tallying operation (which is fundamentally a
summarise operation) is an aggregation of the raw data. And
thus, if the order is reversed, information regarding body mass is
lost.
In SQL, the structure of the query is that
SELECT must be written upfront, and so, even though the
grouping variable can be written either in the fourth or fifth row of
the code, it is somewhat disconnected from where the COUNT
operation when written out.
For two-sided bounds, SQL has an extra function,
BETWEEN.
Rpenguins %>%
filter(body_mass_g > 4000) %>%
group_by(species) %>%
tally()
penguins %>%
filter(body_mass_g< 4500, body_mass_g > 4000) %>%
group_by(species) %>%
tally()
SQLSELECT species,
COUNT(species) AS n
FROM penguins
WHERE body_mass_g > 4000
GROUP BY species
| species | n |
|---|---|
| Adelie | 35 |
| Chinstrap | 15 |
| Gentoo | 122 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE body_mass_g
BETWEEN 4000 AND 4500
GROUP BY species
| species | n |
|---|---|
| Adelie | 32 |
| Chinstrap | 14 |
| Gentoo | 16 |
The major difference is that R uses double equal sign
== whereas SQL uses a single equal sign
= when subsetting categorical variables.
For multiple values, R uses %in% and a vector created
using c(), whereas SQL uses IN
and () respectively.
Rpenguins %>%
filter(island == "Biscoe") %>%
group_by(species) %>%
tally()
penguins %>%
filter(island %in% c("Biscoe", "Dream")) %>%
group_by(species) %>%
tally()
SQLSELECT species,
COUNT(species) AS n
FROM penguins
WHERE island = 'Biscoe'
GROUP BY species
| species | n |
|---|---|
| Adelie | 44 |
| Gentoo | 124 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE island IN ("Biscoe", "Dream")
GROUP BY species
| species | n |
|---|---|
| Adelie | 100 |
| Chinstrap | 68 |
| Gentoo | 124 |
R uses NA to encode missing values.
SQL uses NULL to encode missing values,
which is not to be confused with NULL in R.
NULL in R is mainly used to represent the
lists with zero length.
One can use is.na in R to return a logical
vector to subset on the data, whereas SQL uses
IS NULL. The opposite operation could be performed uing
complete.cases and IS NOT NULL, respectively
for the two languages.
Below examines the entries with missing/complete values in the
sex column of the data and tabulate the frequency of
species.
Rpenguins %>%
filter(is.na(sex)) %>%
group_by(species) %>%
tally()
penguins %>%
filter(complete.cases(sex)) %>%
group_by(species) %>%
tally()
SQLSELECT species,
COUNT(species) AS n
FROM penguins
WHERE sex IS NULL
GROUP BY species
| species | n |
|---|---|
| Adelie | 6 |
| Gentoo | 5 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE sex IS NOT NULL
GROUP BY species
| species | n |
|---|---|
| Adelie | 146 |
| Chinstrap | 68 |
| Gentoo | 119 |
In R, the stringr package provides a
consistent framework to perform various string operations. In
SQL, there are some similar functions too.
The first task aims to tally the species of penguins, for entries with island begin with the characters “Torg”.
The second task aims to tally the species of penguins, for entries with island’s second character equal to “r”.
Rlibrary(stringr)
penguins %>%
filter(str_starts(island, "Torg")) %>%
group_by(species) %>%
tally()
penguins %>%
filter(island %>% str_sub(2, 2) == "r") %>%
group_by(species) %>%
tally()
SQLSELECT species,
COUNT(species) AS n
FROM penguins
WHERE island LIKE "Torg%"
GROUP BY species
| species | n |
|---|---|
| Adelie | 52 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE island LIKE "_r%"
GROUP BY species
| species | n |
|---|---|
| Adelie | 56 |
| Chinstrap | 68 |
Under the dplyr framework, summarisation is done using
the summarise function. This provides a very consistent
framwork to perform summarisation, compare to base R.
Similar operation is done using a range of aggregation functions in
SQL.
Interestingly, in SQL, the aggregation functions
automatically ignores the NULL values. Whereas in
R, if there are any NA values, then any
summary will return NA as a warning, unless
na.rm = TRUE is specified.
Rpenguins %>%
group_by(species) %>%
filter(complete.cases(body_mass_g)) %>%
summarise(
min_mass = min(body_mass_g),
mean_mass = mean(body_mass_g),
max_mass = max(body_mass_g),
sd_mass = sd(body_mass_g),
mean_mass_kg = mean(body_mass_g/1000))
SQLSELECT species,
MIN(body_mass_g) AS min_mass,
AVG(body_mass_g) AS mean_mass,
MAX(body_mass_g) AS max_mass,
STDEV(body_mass_g) AS sd_mass,
AVG(body_mass_g/1000.0) AS mass_kg
FROM penguins
GROUP BY species
| species | min_mass | mean_mass | max_mass | sd_mass | mass_kg |
|---|---|---|---|---|---|
| Adelie | 2850 | 3700.662 | 4775 | 458.5661 | 3.700662 |
| Chinstrap | 2700 | 3733.088 | 4800 | 384.3351 | 3.733088 |
| Gentoo | 3950 | 5076.016 | 6300 | 504.1162 | 5.076016 |
R uses arrange and desc to
control the ordering. Whereas SQL uses
ORDER BY and DESC respectively.
Rpenguins %>%
group_by(species) %>%
tally() %>%
arrange(desc(n))
SQLSELECT species,
COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species
ORDER BY proportion DESC
| species | proportion |
|---|---|
| Adelie | 0.4418605 |
| Gentoo | 0.3604651 |
| Chinstrap | 0.1976744 |
This is where the computations get a bit complicated and
dplyr has an edge in my opinion.
In the code below, we are calculating the frequencies of penguins and the percentages. Notice that this percentage is calculated with respect to the entire penguin population, i.e. the total is needed.
For R, the total of all the penguins can be computed
naturally. But in SQL, the output will be of same variable
type as the input, so the multiplication of 1.0 is
necessary to make the outputs come out as decimals rather than being
rounded to an integer.
Rpenguins %>%
group_by(species) %>%
tally() %>%
mutate(proportion = n/sum(n)) %>%
arrange(desc(proportion))
SQLSELECT species,
COUNT(*)*1.0/SUM(COUNT(*)) OVER () AS proportion
FROM penguins
GROUP BY species
ORDER BY proportion
| species | proportion |
|---|---|
| Chinstrap | 0.1976744 |
| Gentoo | 0.3604651 |
| Adelie | 0.4418605 |
Alternatively:
SELECT species,
COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species
ORDER BY proportion DESC
| species | proportion |
|---|---|
| Adelie | 0.4418605 |
| Gentoo | 0.3604651 |
| Chinstrap | 0.1976744 |
In SQL, the WHERE function cannot operate
on summary statistics, this is a job for HAVING. However,
in R, this is not a restriction.
In the task below, we will group by the species and sex of the penguins and filter for penguins species/sex combo that has an average mass of greater than 4700 grams. The result should be Gentoo male penguins.
Rpenguins %>%
filter(complete.cases(body_mass_g)) %>%
group_by(species, sex) %>%
filter(mean(body_mass_g) >= 4700)
# summarise(mean(body_mass_g))
SQLSELECT *
FROM penguins
WHERE body_mass_g IS NOT NULL
GROUP BY species, sex
HAVING AVG(body_mass_g) >= 4700;
| species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
|---|---|---|---|---|---|---|---|
| Gentoo | Biscoe | 50 | 16.3 | 230 | 5700 | male | 2007 |