Evaluated only once to create a CSV file.
library(palmerpenguins)
library(tidyverse)
readr::write_csv(x = penguins, file = "data/penguins.csv")
R
)This chunk of code will take the in-memory data and allow this data to be queried exactly like a SQL database.
The print methods for R
is different to that of
SQL
.
library(dplyr)
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
penguins = readr::read_csv(file = "data/penguins.csv")
dbListTables(con)
## character(0)
dbWriteTable(con, "penguins", penguins)
dbListTables(con)
## [1] "penguins"
dbListFields(con, "penguins")
## [1] "species" "island" "bill_length_mm"
## [4] "bill_depth_mm" "flipper_length_mm" "body_mass_g"
## [7] "sex" "year"
Both languages use select
to subset columns.
R
penguins %>%
select(species, island)
SQL
SELECT species, island
FROM penguins
species | island |
---|---|
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
Adelie | Torgersen |
R
penguins %>%
select_all()
SQL
SELECT *
FROM penguins;
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | NA | 2007 |
Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | NA | 2007 |
Both languages use a version of distinct
to extract
unique categorical variables, or a combination of several categorical
variables.
R
penguins %>%
distinct(species)
penguins %>%
distinct(species, island)
SQL
SELECT DISTINCT species
FROM penguins;
species |
---|
Adelie |
Gentoo |
Chinstrap |
SELECT DISTINCT species, island
FROM penguins;
species | island |
---|---|
Adelie | Torgersen |
Adelie | Biscoe |
Adelie | Dream |
Gentoo | Biscoe |
Chinstrap | Dream |
This is a frequency table.
R
requires a tally function, which is smart enough to
know that the grouping variable must always be kept.
SQL
requires manual selection of both the grouping
variable and the numerical frequencies.
R
penguins %>%
group_by(species) %>%
tally()
SQL
SELECT species,
COUNT(species) AS n
FROM penguins
GROUP BY species;
species | n |
---|---|
Adelie | 152 |
Chinstrap | 68 |
Gentoo | 124 |
R
penguins %>%
slice(1:10)
penguins %>%
top_n(10, body_mass_g)
SQL
SELECT *
FROM penguins
LIMIT 10
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Adelie | Torgersen | 39.1 | 18.7 | 181 | 3750 | male | 2007 |
Adelie | Torgersen | 39.5 | 17.4 | 186 | 3800 | female | 2007 |
Adelie | Torgersen | 40.3 | 18.0 | 195 | 3250 | female | 2007 |
Adelie | Torgersen | NA | NA | NA | NA | NA | 2007 |
Adelie | Torgersen | 36.7 | 19.3 | 193 | 3450 | female | 2007 |
Adelie | Torgersen | 39.3 | 20.6 | 190 | 3650 | male | 2007 |
Adelie | Torgersen | 38.9 | 17.8 | 181 | 3625 | female | 2007 |
Adelie | Torgersen | 39.2 | 19.6 | 195 | 4675 | male | 2007 |
Adelie | Torgersen | 34.1 | 18.1 | 193 | 3475 | NA | 2007 |
Adelie | Torgersen | 42.0 | 20.2 | 190 | 4250 | NA | 2007 |
SELECT *
FROM penguins
ORDER BY body_mass_g DESC
LIMIT 10
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Gentoo | Biscoe | 49.2 | 15.2 | 221 | 6300 | male | 2007 |
Gentoo | Biscoe | 59.6 | 17.0 | 230 | 6050 | male | 2007 |
Gentoo | Biscoe | 51.1 | 16.3 | 220 | 6000 | male | 2008 |
Gentoo | Biscoe | 48.8 | 16.2 | 222 | 6000 | male | 2009 |
Gentoo | Biscoe | 45.2 | 16.4 | 223 | 5950 | male | 2008 |
Gentoo | Biscoe | 49.8 | 15.9 | 229 | 5950 | male | 2009 |
Gentoo | Biscoe | 48.4 | 14.6 | 213 | 5850 | male | 2007 |
Gentoo | Biscoe | 49.3 | 15.7 | 217 | 5850 | male | 2007 |
Gentoo | Biscoe | 55.1 | 16.0 | 230 | 5850 | male | 2009 |
Gentoo | Biscoe | 49.5 | 16.2 | 229 | 5800 | male | 2008 |
The task is to find the species frequency distribution of penguins with body mass greater than 4000 grams.
The ordering of the dplyr
verbs matter quite a bit here.
The filtering operation must be done first, before the tallying. This is
because the tallying operation (which is fundamentally a
summarise
operation) is an aggregation of the raw data. And
thus, if the order is reversed, information regarding body mass is
lost.
In SQL
, the structure of the query is that
SELECT
must be written upfront, and so, even though the
grouping variable can be written either in the fourth or fifth row of
the code, it is somewhat disconnected from where the COUNT
operation when written out.
For two-sided bounds, SQL
has an extra function,
BETWEEN
.
R
penguins %>%
filter(body_mass_g > 4000) %>%
group_by(species) %>%
tally()
penguins %>%
filter(body_mass_g< 4500, body_mass_g > 4000) %>%
group_by(species) %>%
tally()
SQL
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE body_mass_g > 4000
GROUP BY species
species | n |
---|---|
Adelie | 35 |
Chinstrap | 15 |
Gentoo | 122 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE body_mass_g
BETWEEN 4000 AND 4500
GROUP BY species
species | n |
---|---|
Adelie | 32 |
Chinstrap | 14 |
Gentoo | 16 |
The major difference is that R
uses double equal sign
==
whereas SQL
uses a single equal sign
=
when subsetting categorical variables.
For multiple values, R
uses %in% and a vector created
using c()
, whereas SQL
uses IN
and ()
respectively.
R
penguins %>%
filter(island == "Biscoe") %>%
group_by(species) %>%
tally()
penguins %>%
filter(island %in% c("Biscoe", "Dream")) %>%
group_by(species) %>%
tally()
SQL
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE island = 'Biscoe'
GROUP BY species
species | n |
---|---|
Adelie | 44 |
Gentoo | 124 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE island IN ("Biscoe", "Dream")
GROUP BY species
species | n |
---|---|
Adelie | 100 |
Chinstrap | 68 |
Gentoo | 124 |
R
uses NA
to encode missing values.
SQL
uses NULL
to encode missing values,
which is not to be confused with NULL
in R
.
NULL
in R
is mainly used to represent the
lists with zero length.
One can use is.na
in R
to return a logical
vector to subset on the data, whereas SQL
uses
IS NULL
. The opposite operation could be performed uing
complete.cases
and IS NOT NULL
, respectively
for the two languages.
Below examines the entries with missing/complete values in the
sex
column of the data and tabulate the frequency of
species.
R
penguins %>%
filter(is.na(sex)) %>%
group_by(species) %>%
tally()
penguins %>%
filter(complete.cases(sex)) %>%
group_by(species) %>%
tally()
SQL
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE sex IS NULL
GROUP BY species
species | n |
---|---|
Adelie | 6 |
Gentoo | 5 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE sex IS NOT NULL
GROUP BY species
species | n |
---|---|
Adelie | 146 |
Chinstrap | 68 |
Gentoo | 119 |
In R
, the stringr
package provides a
consistent framework to perform various string operations. In
SQL
, there are some similar functions too.
The first task aims to tally the species of penguins, for entries with island begin with the characters “Torg”.
The second task aims to tally the species of penguins, for entries with island’s second character equal to “r”.
R
library(stringr)
penguins %>%
filter(str_starts(island, "Torg")) %>%
group_by(species) %>%
tally()
penguins %>%
filter(island %>% str_sub(2, 2) == "r") %>%
group_by(species) %>%
tally()
SQL
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE island LIKE "Torg%"
GROUP BY species
species | n |
---|---|
Adelie | 52 |
SELECT species,
COUNT(species) AS n
FROM penguins
WHERE island LIKE "_r%"
GROUP BY species
species | n |
---|---|
Adelie | 56 |
Chinstrap | 68 |
Under the dplyr
framework, summarisation is done using
the summarise
function. This provides a very consistent
framwork to perform summarisation, compare to base R
.
Similar operation is done using a range of aggregation functions in
SQL
.
Interestingly, in SQL
, the aggregation functions
automatically ignores the NULL
values. Whereas in
R
, if there are any NA
values, then any
summary will return NA
as a warning, unless
na.rm = TRUE
is specified.
R
penguins %>%
group_by(species) %>%
filter(complete.cases(body_mass_g)) %>%
summarise(
min_mass = min(body_mass_g),
mean_mass = mean(body_mass_g),
max_mass = max(body_mass_g),
sd_mass = sd(body_mass_g),
mean_mass_kg = mean(body_mass_g/1000))
SQL
SELECT species,
MIN(body_mass_g) AS min_mass,
AVG(body_mass_g) AS mean_mass,
MAX(body_mass_g) AS max_mass,
STDEV(body_mass_g) AS sd_mass,
AVG(body_mass_g/1000.0) AS mass_kg
FROM penguins
GROUP BY species
species | min_mass | mean_mass | max_mass | sd_mass | mass_kg |
---|---|---|---|---|---|
Adelie | 2850 | 3700.662 | 4775 | 458.5661 | 3.700662 |
Chinstrap | 2700 | 3733.088 | 4800 | 384.3351 | 3.733088 |
Gentoo | 3950 | 5076.016 | 6300 | 504.1162 | 5.076016 |
R
uses arrange
and desc
to
control the ordering. Whereas SQL
uses
ORDER BY
and DESC
respectively.
R
penguins %>%
group_by(species) %>%
tally() %>%
arrange(desc(n))
SQL
SELECT species,
COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species
ORDER BY proportion DESC
species | proportion |
---|---|
Adelie | 0.4418605 |
Gentoo | 0.3604651 |
Chinstrap | 0.1976744 |
This is where the computations get a bit complicated and
dplyr
has an edge in my opinion.
In the code below, we are calculating the frequencies of penguins and the percentages. Notice that this percentage is calculated with respect to the entire penguin population, i.e. the total is needed.
For R
, the total of all the penguins can be computed
naturally. But in SQL
, the output will be of same variable
type as the input, so the multiplication of 1.0
is
necessary to make the outputs come out as decimals rather than being
rounded to an integer.
R
penguins %>%
group_by(species) %>%
tally() %>%
mutate(proportion = n/sum(n)) %>%
arrange(desc(proportion))
SQL
SELECT species,
COUNT(*)*1.0/SUM(COUNT(*)) OVER () AS proportion
FROM penguins
GROUP BY species
ORDER BY proportion
species | proportion |
---|---|
Chinstrap | 0.1976744 |
Gentoo | 0.3604651 |
Adelie | 0.4418605 |
Alternatively:
SELECT species,
COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species
ORDER BY proportion DESC
species | proportion |
---|---|
Adelie | 0.4418605 |
Gentoo | 0.3604651 |
Chinstrap | 0.1976744 |
In SQL
, the WHERE
function cannot operate
on summary statistics, this is a job for HAVING
. However,
in R
, this is not a restriction.
In the task below, we will group by the species and sex of the penguins and filter for penguins species/sex combo that has an average mass of greater than 4700 grams. The result should be Gentoo male penguins.
R
penguins %>%
filter(complete.cases(body_mass_g)) %>%
group_by(species, sex) %>%
filter(mean(body_mass_g) >= 4700)
# summarise(mean(body_mass_g))
SQL
SELECT *
FROM penguins
WHERE body_mass_g IS NOT NULL
GROUP BY species, sex
HAVING AVG(body_mass_g) >= 4700;
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex | year |
---|---|---|---|---|---|---|---|
Gentoo | Biscoe | 50 | 16.3 | 230 | 5700 | male | 2007 |