Data preparation

Evaluated only once to create a CSV file.

library(palmerpenguins)
library(tidyverse)
readr::write_csv(x = penguins, file = "data/penguins.csv")

Setting up connection (in R)

This chunk of code will take the in-memory data and allow this data to be queried exactly like a SQL database.

The print methods for R is different to that of SQL.

library(dplyr)
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
penguins = readr::read_csv(file = "data/penguins.csv")
dbListTables(con)
## character(0)
dbWriteTable(con, "penguins", penguins)
dbListTables(con)
## [1] "penguins"
dbListFields(con, "penguins")
## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Selecting columns

Both languages use select to subset columns.

R

penguins %>% 
  select(species, island)

SQL

SELECT species, island 
FROM penguins
Displaying records 1 - 10
species island
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen
Adelie Torgersen

Select all columns

R

penguins %>% 
  select_all()

SQL

SELECT * 
FROM penguins;
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Torgersen 39.2 19.6 195 4675 male 2007
Adelie Torgersen 34.1 18.1 193 3475 NA 2007
Adelie Torgersen 42.0 20.2 190 4250 NA 2007

Selecting distict entries from

Both languages use a version of distinct to extract unique categorical variables, or a combination of several categorical variables.

R

penguins %>% 
  distinct(species)
penguins %>% 
  distinct(species, island)

SQL

SELECT DISTINCT species 
FROM penguins;
3 records
species
Adelie
Gentoo
Chinstrap
SELECT DISTINCT species, island 
FROM penguins;
5 records
species island
Adelie Torgersen
Adelie Biscoe
Adelie Dream
Gentoo Biscoe
Chinstrap Dream

Counting number of distinct entries

This is a frequency table.

R requires a tally function, which is smart enough to know that the grouping variable must always be kept.

SQL requires manual selection of both the grouping variable and the numerical frequencies.

R

penguins %>%
  group_by(species) %>% 
  tally()

SQL

SELECT species, 
       COUNT(species) AS n
FROM penguins
GROUP BY species;
3 records
species n
Adelie 152
Chinstrap 68
Gentoo 124

Filtering

Top rows

R

penguins %>% 
  slice(1:10)
penguins %>% 
  top_n(10, body_mass_g)

SQL

SELECT *
FROM penguins
LIMIT 10
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Adelie Torgersen 39.1 18.7 181 3750 male 2007
Adelie Torgersen 39.5 17.4 186 3800 female 2007
Adelie Torgersen 40.3 18.0 195 3250 female 2007
Adelie Torgersen NA NA NA NA NA 2007
Adelie Torgersen 36.7 19.3 193 3450 female 2007
Adelie Torgersen 39.3 20.6 190 3650 male 2007
Adelie Torgersen 38.9 17.8 181 3625 female 2007
Adelie Torgersen 39.2 19.6 195 4675 male 2007
Adelie Torgersen 34.1 18.1 193 3475 NA 2007
Adelie Torgersen 42.0 20.2 190 4250 NA 2007
SELECT * 
FROM penguins
ORDER BY body_mass_g DESC
LIMIT 10
Displaying records 1 - 10
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Gentoo Biscoe 49.2 15.2 221 6300 male 2007
Gentoo Biscoe 59.6 17.0 230 6050 male 2007
Gentoo Biscoe 51.1 16.3 220 6000 male 2008
Gentoo Biscoe 48.8 16.2 222 6000 male 2009
Gentoo Biscoe 45.2 16.4 223 5950 male 2008
Gentoo Biscoe 49.8 15.9 229 5950 male 2009
Gentoo Biscoe 48.4 14.6 213 5850 male 2007
Gentoo Biscoe 49.3 15.7 217 5850 male 2007
Gentoo Biscoe 55.1 16.0 230 5850 male 2009
Gentoo Biscoe 49.5 16.2 229 5800 male 2008

Numeric columns

The task is to find the species frequency distribution of penguins with body mass greater than 4000 grams.

The ordering of the dplyr verbs matter quite a bit here. The filtering operation must be done first, before the tallying. This is because the tallying operation (which is fundamentally a summarise operation) is an aggregation of the raw data. And thus, if the order is reversed, information regarding body mass is lost.

In SQL, the structure of the query is that SELECT must be written upfront, and so, even though the grouping variable can be written either in the fourth or fifth row of the code, it is somewhat disconnected from where the COUNT operation when written out.

For two-sided bounds, SQL has an extra function, BETWEEN.

R

penguins %>% 
  filter(body_mass_g > 4000) %>% 
  group_by(species) %>% 
  tally()
penguins %>% 
  filter(body_mass_g< 4500, body_mass_g > 4000) %>% 
  group_by(species) %>% 
  tally()

SQL

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE body_mass_g > 4000
GROUP BY species
3 records
species n
Adelie 35
Chinstrap 15
Gentoo 122
SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE body_mass_g 
BETWEEN 4000 AND 4500
GROUP BY species
3 records
species n
Adelie 32
Chinstrap 14
Gentoo 16

Categorical columns

The major difference is that R uses double equal sign == whereas SQL uses a single equal sign = when subsetting categorical variables.

For multiple values, R uses %in% and a vector created using c(), whereas SQL uses IN and () respectively.

R

penguins %>% 
  filter(island == "Biscoe") %>% 
  group_by(species) %>% 
  tally()
penguins %>% 
  filter(island %in% c("Biscoe", "Dream")) %>% 
  group_by(species) %>% 
  tally()

SQL

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island = 'Biscoe'
GROUP BY species
2 records
species n
Adelie 44
Gentoo 124
SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island IN ("Biscoe", "Dream")
GROUP BY species
3 records
species n
Adelie 100
Chinstrap 68
Gentoo 124

Missing values

R uses NA to encode missing values.

SQL uses NULL to encode missing values, which is not to be confused with NULL in R. NULL in R is mainly used to represent the lists with zero length.

One can use is.na in R to return a logical vector to subset on the data, whereas SQL uses IS NULL. The opposite operation could be performed uing complete.cases and IS NOT NULL, respectively for the two languages.

Below examines the entries with missing/complete values in the sex column of the data and tabulate the frequency of species.

R

penguins %>% 
  filter(is.na(sex)) %>% 
  group_by(species) %>% 
  tally()
penguins %>% 
  filter(complete.cases(sex)) %>% 
  group_by(species) %>% 
  tally()

SQL

SELECT species,
       COUNT(species) AS n
FROM penguins
WHERE sex IS NULL
GROUP BY species
2 records
species n
Adelie 6
Gentoo 5
SELECT species,
       COUNT(species) AS n
FROM penguins
WHERE sex IS NOT NULL
GROUP BY species
3 records
species n
Adelie 146
Chinstrap 68
Gentoo 119

String operations

In R, the stringr package provides a consistent framework to perform various string operations. In SQL, there are some similar functions too.

The first task aims to tally the species of penguins, for entries with island begin with the characters “Torg”.

The second task aims to tally the species of penguins, for entries with island’s second character equal to “r”.

R

library(stringr)

penguins %>%
  filter(str_starts(island, "Torg")) %>% 
  group_by(species) %>% 
  tally()
penguins %>%
  filter(island %>% str_sub(2, 2) == "r") %>% 
  group_by(species) %>% 
  tally()

SQL

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island LIKE "Torg%"
GROUP BY species
1 records
species n
Adelie 52
SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island LIKE "_r%"
GROUP BY species
2 records
species n
Adelie 56
Chinstrap 68

Summarisation

Under the dplyr framework, summarisation is done using the summarise function. This provides a very consistent framwork to perform summarisation, compare to base R.

Similar operation is done using a range of aggregation functions in SQL.

Interestingly, in SQL, the aggregation functions automatically ignores the NULL values. Whereas in R, if there are any NA values, then any summary will return NA as a warning, unless na.rm = TRUE is specified.

R

penguins %>% 
  group_by(species) %>% 
  filter(complete.cases(body_mass_g)) %>% 
  summarise(
    min_mass = min(body_mass_g),
    mean_mass = mean(body_mass_g),
    max_mass = max(body_mass_g),
    sd_mass = sd(body_mass_g),
    mean_mass_kg = mean(body_mass_g/1000))

SQL

SELECT species, 
       MIN(body_mass_g) AS min_mass,
       AVG(body_mass_g) AS mean_mass,
       MAX(body_mass_g) AS max_mass,
       STDEV(body_mass_g) AS sd_mass,
       AVG(body_mass_g/1000.0) AS mass_kg
FROM penguins
GROUP BY species
3 records
species min_mass mean_mass max_mass sd_mass mass_kg
Adelie 2850 3700.662 4775 458.5661 3.700662
Chinstrap 2700 3733.088 4800 384.3351 3.733088
Gentoo 3950 5076.016 6300 504.1162 5.076016

Sorting

R uses arrange and desc to control the ordering. Whereas SQL uses ORDER BY and DESC respectively.

R

penguins %>% 
  group_by(species) %>% 
  tally() %>%
  arrange(desc(n))

SQL

SELECT species, 
       COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species 
ORDER BY proportion DESC
3 records
species proportion
Adelie 0.4418605
Gentoo 0.3604651
Chinstrap 0.1976744

Proportion calculation

This is where the computations get a bit complicated and dplyr has an edge in my opinion.

In the code below, we are calculating the frequencies of penguins and the percentages. Notice that this percentage is calculated with respect to the entire penguin population, i.e. the total is needed.

For R, the total of all the penguins can be computed naturally. But in SQL, the output will be of same variable type as the input, so the multiplication of 1.0 is necessary to make the outputs come out as decimals rather than being rounded to an integer.

R

penguins %>% 
  group_by(species) %>% 
  tally() %>% 
  mutate(proportion = n/sum(n)) %>% 
  arrange(desc(proportion))

SQL

SELECT species, 
       COUNT(*)*1.0/SUM(COUNT(*)) OVER () AS proportion
FROM penguins
GROUP BY species 
ORDER BY proportion
3 records
species proportion
Chinstrap 0.1976744
Gentoo 0.3604651
Adelie 0.4418605

Alternatively:

SELECT species, 
       COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species 
ORDER BY proportion DESC
3 records
species proportion
Adelie 0.4418605
Gentoo 0.3604651
Chinstrap 0.1976744

Filtering using summary statistics

In SQL, the WHERE function cannot operate on summary statistics, this is a job for HAVING. However, in R, this is not a restriction.

In the task below, we will group by the species and sex of the penguins and filter for penguins species/sex combo that has an average mass of greater than 4700 grams. The result should be Gentoo male penguins.

R

penguins %>% 
  filter(complete.cases(body_mass_g)) %>% 
  group_by(species, sex) %>%
  filter(mean(body_mass_g) >= 4700)
  # summarise(mean(body_mass_g))

SQL

SELECT *
FROM penguins
WHERE body_mass_g IS NOT NULL
GROUP BY species, sex
HAVING AVG(body_mass_g) >= 4700;
1 records
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
Gentoo Biscoe 50 16.3 230 5700 male 2007