Data preparation
Setting up connection (in R)
Selecting columns
Counting number of distinct entries
Filtering
Missing values
String operations
Summarisation
Sorting
Proportion calculation
Filtering using summary statistics

Data preparation

Evaluated only once to create a CSV file.

library(palmerpenguins)
library(tidyverse)
readr::write_csv(x = penguins, file = "data/penguins.csv")

Setting up connection (in `R`)

This chunk of code will take the in-memory data and allow this data to be queried exactly like a SQL database.

The print methods for R is different to that of SQL.

library(dplyr)
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
penguins = readr::read_csv(file = "data/penguins.csv")
dbListTables(con)

## character(0)

dbWriteTable(con, "penguins", penguins)
dbListTables(con)

## [1] "penguins"

dbListFields(con, "penguins")

## [1] "species"           "island"            "bill_length_mm"   
## [4] "bill_depth_mm"     "flipper_length_mm" "body_mass_g"      
## [7] "sex"               "year"

Selecting columns

Both languages use select to subset columns.

`R`

penguins %>% 
  select(species, island)

`SQL`

SELECT species, island 
FROM penguins

Displaying records 1 - 10
species	island
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen
Adelie	Torgersen

Select all columns

`R`

penguins %>% 
  select_all()

`SQL`

SELECT * 
FROM penguins;

Displaying records 1 - 10
species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007
Adelie	Torgersen	38.9	17.8	181	3625	female	2007
Adelie	Torgersen	39.2	19.6	195	4675	male	2007
Adelie	Torgersen	34.1	18.1	193	3475	NA	2007
Adelie	Torgersen	42.0	20.2	190	4250	NA	2007

Selecting distict entries from

Both languages use a version of distinct to extract unique categorical variables, or a combination of several categorical variables.

`R`

penguins %>% 
  distinct(species)

penguins %>% 
  distinct(species, island)

`SQL`

SELECT DISTINCT species 
FROM penguins;

3 records
species
Adelie
Gentoo
Chinstrap

SELECT DISTINCT species, island 
FROM penguins;

5 records
species	island
Adelie	Torgersen
Adelie	Biscoe
Adelie	Dream
Gentoo	Biscoe
Chinstrap	Dream

Counting number of distinct entries

This is a frequency table.

R requires a tally function, which is smart enough to know that the grouping variable must always be kept.

SQL requires manual selection of both the grouping variable and the numerical frequencies.

`R`

penguins %>%
  group_by(species) %>% 
  tally()

`SQL`

SELECT species, 
       COUNT(species) AS n
FROM penguins
GROUP BY species;

3 records
species	n
Adelie	152
Chinstrap	68
Gentoo	124

Filtering

Top rows

`R`

penguins %>% 
  slice(1:10)

penguins %>% 
  top_n(10, body_mass_g)

`SQL`

SELECT *
FROM penguins
LIMIT 10

Displaying records 1 - 10
species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Adelie	Torgersen	39.1	18.7	181	3750	male	2007
Adelie	Torgersen	39.5	17.4	186	3800	female	2007
Adelie	Torgersen	40.3	18.0	195	3250	female	2007
Adelie	Torgersen	NA	NA	NA	NA	NA	2007
Adelie	Torgersen	36.7	19.3	193	3450	female	2007
Adelie	Torgersen	39.3	20.6	190	3650	male	2007
Adelie	Torgersen	38.9	17.8	181	3625	female	2007
Adelie	Torgersen	39.2	19.6	195	4675	male	2007
Adelie	Torgersen	34.1	18.1	193	3475	NA	2007
Adelie	Torgersen	42.0	20.2	190	4250	NA	2007

SELECT * 
FROM penguins
ORDER BY body_mass_g DESC
LIMIT 10

Displaying records 1 - 10
species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Gentoo	Biscoe	49.2	15.2	221	6300	male	2007
Gentoo	Biscoe	59.6	17.0	230	6050	male	2007
Gentoo	Biscoe	51.1	16.3	220	6000	male	2008
Gentoo	Biscoe	48.8	16.2	222	6000	male	2009
Gentoo	Biscoe	45.2	16.4	223	5950	male	2008
Gentoo	Biscoe	49.8	15.9	229	5950	male	2009
Gentoo	Biscoe	48.4	14.6	213	5850	male	2007
Gentoo	Biscoe	49.3	15.7	217	5850	male	2007
Gentoo	Biscoe	55.1	16.0	230	5850	male	2009
Gentoo	Biscoe	49.5	16.2	229	5800	male	2008

Numeric columns

The task is to find the species frequency distribution of penguins with body mass greater than 4000 grams.

The ordering of the dplyr verbs matter quite a bit here. The filtering operation must be done first, before the tallying. This is because the tallying operation (which is fundamentally a summarise operation) is an aggregation of the raw data. And thus, if the order is reversed, information regarding body mass is lost.

In SQL, the structure of the query is that SELECT must be written upfront, and so, even though the grouping variable can be written either in the fourth or fifth row of the code, it is somewhat disconnected from where the COUNT operation when written out.

For two-sided bounds, SQL has an extra function, BETWEEN.

`R`

penguins %>% 
  filter(body_mass_g > 4000) %>% 
  group_by(species) %>% 
  tally()

penguins %>% 
  filter(body_mass_g< 4500, body_mass_g > 4000) %>% 
  group_by(species) %>% 
  tally()

`SQL`

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE body_mass_g > 4000
GROUP BY species

3 records
species	n
Adelie	35
Chinstrap	15
Gentoo	122

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE body_mass_g 
BETWEEN 4000 AND 4500
GROUP BY species

3 records
species	n
Adelie	32
Chinstrap	14
Gentoo	16

Categorical columns

The major difference is that R uses double equal sign == whereas SQL uses a single equal sign = when subsetting categorical variables.

For multiple values, R uses %in% and a vector created using c(), whereas SQL uses IN and () respectively.

`R`

penguins %>% 
  filter(island == "Biscoe") %>% 
  group_by(species) %>% 
  tally()

penguins %>% 
  filter(island %in% c("Biscoe", "Dream")) %>% 
  group_by(species) %>% 
  tally()

`SQL`

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island = 'Biscoe'
GROUP BY species

2 records
species	n
Adelie	44
Gentoo	124

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island IN ("Biscoe", "Dream")
GROUP BY species

3 records
species	n
Adelie	100
Chinstrap	68
Gentoo	124

Missing values

R uses NA to encode missing values.

SQL uses NULL to encode missing values, which is not to be confused with NULL in R. NULL in R is mainly used to represent the lists with zero length.

One can use is.na in R to return a logical vector to subset on the data, whereas SQL uses IS NULL. The opposite operation could be performed uing complete.cases and IS NOT NULL, respectively for the two languages.

Below examines the entries with missing/complete values in the sex column of the data and tabulate the frequency of species.

`R`

penguins %>% 
  filter(is.na(sex)) %>% 
  group_by(species) %>% 
  tally()

penguins %>% 
  filter(complete.cases(sex)) %>% 
  group_by(species) %>% 
  tally()

`SQL`

SELECT species,
       COUNT(species) AS n
FROM penguins
WHERE sex IS NULL
GROUP BY species

2 records
species	n
Adelie	6
Gentoo	5

SELECT species,
       COUNT(species) AS n
FROM penguins
WHERE sex IS NOT NULL
GROUP BY species

3 records
species	n
Adelie	146
Chinstrap	68
Gentoo	119

String operations

In R, the stringr package provides a consistent framework to perform various string operations. In SQL, there are some similar functions too.

The first task aims to tally the species of penguins, for entries with island begin with the characters “Torg”.

The second task aims to tally the species of penguins, for entries with island’s second character equal to “r”.

`R`

library(stringr)

penguins %>%
  filter(str_starts(island, "Torg")) %>% 
  group_by(species) %>% 
  tally()

penguins %>%
  filter(island %>% str_sub(2, 2) == "r") %>% 
  group_by(species) %>% 
  tally()

`SQL`

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island LIKE "Torg%"
GROUP BY species

1 records
species	n
Adelie	52

SELECT species, 
       COUNT(species) AS n
FROM penguins
WHERE island LIKE "_r%"
GROUP BY species

2 records
species	n
Adelie	56
Chinstrap	68

Summarisation

Under the dplyr framework, summarisation is done using the summarise function. This provides a very consistent framwork to perform summarisation, compare to base R.

Similar operation is done using a range of aggregation functions in SQL.

Interestingly, in SQL, the aggregation functions automatically ignores the NULL values. Whereas in R, if there are any NA values, then any summary will return NA as a warning, unless na.rm = TRUE is specified.

`R`

penguins %>% 
  group_by(species) %>% 
  filter(complete.cases(body_mass_g)) %>% 
  summarise(
    min_mass = min(body_mass_g),
    mean_mass = mean(body_mass_g),
    max_mass = max(body_mass_g),
    sd_mass = sd(body_mass_g),
    mean_mass_kg = mean(body_mass_g/1000))

`SQL`

SELECT species, 
       MIN(body_mass_g) AS min_mass,
       AVG(body_mass_g) AS mean_mass,
       MAX(body_mass_g) AS max_mass,
       STDEV(body_mass_g) AS sd_mass,
       AVG(body_mass_g/1000.0) AS mass_kg
FROM penguins
GROUP BY species

3 records
species	min_mass	mean_mass	max_mass	sd_mass	mass_kg
Adelie	2850	3700.662	4775	458.5661	3.700662
Chinstrap	2700	3733.088	4800	384.3351	3.733088
Gentoo	3950	5076.016	6300	504.1162	5.076016

Sorting

R uses arrange and desc to control the ordering. Whereas SQL uses ORDER BY and DESC respectively.

`R`

penguins %>% 
  group_by(species) %>% 
  tally() %>%
  arrange(desc(n))

`SQL`

SELECT species, 
       COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species 
ORDER BY proportion DESC

3 records
species	proportion
Adelie	0.4418605
Gentoo	0.3604651
Chinstrap	0.1976744

Proportion calculation

This is where the computations get a bit complicated and dplyr has an edge in my opinion.

In the code below, we are calculating the frequencies of penguins and the percentages. Notice that this percentage is calculated with respect to the entire penguin population, i.e. the total is needed.

For R, the total of all the penguins can be computed naturally. But in SQL, the output will be of same variable type as the input, so the multiplication of 1.0 is necessary to make the outputs come out as decimals rather than being rounded to an integer.

`R`

penguins %>% 
  group_by(species) %>% 
  tally() %>% 
  mutate(proportion = n/sum(n)) %>% 
  arrange(desc(proportion))

`SQL`

SELECT species, 
       COUNT(*)*1.0/SUM(COUNT(*)) OVER () AS proportion
FROM penguins
GROUP BY species 
ORDER BY proportion

3 records
species	proportion
Chinstrap	0.1976744
Gentoo	0.3604651
Adelie	0.4418605

Alternatively:

SELECT species, 
       COUNT(*)*1.0/(SELECT COUNT(*) AS total FROM penguins) AS proportion
FROM penguins
GROUP BY species 
ORDER BY proportion DESC

3 records
species	proportion
Adelie	0.4418605
Gentoo	0.3604651
Chinstrap	0.1976744

Filtering using summary statistics

In SQL, the WHERE function cannot operate on summary statistics, this is a job for HAVING. However, in R, this is not a restriction.

In the task below, we will group by the species and sex of the penguins and filter for penguins species/sex combo that has an average mass of greater than 4700 grams. The result should be Gentoo male penguins.

`R`

penguins %>% 
  filter(complete.cases(body_mass_g)) %>% 
  group_by(species, sex) %>%
  filter(mean(body_mass_g) >= 4700)

  # summarise(mean(body_mass_g))

`SQL`

SELECT *
FROM penguins
WHERE body_mass_g IS NOT NULL
GROUP BY species, sex
HAVING AVG(body_mass_g) >= 4700;

1 records
species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex	year
Gentoo	Biscoe	50	16.3	230	5700	male	2007

Basic operations

Kevin Wang

4 June 2021

Data preparation

Setting up connection (in R)

Selecting columns

R

SQL

Select all columns

R

SQL

Selecting distict entries from

R

SQL

Counting number of distinct entries

R

SQL

Filtering

Top rows

R

SQL

Numeric columns

R

SQL

Categorical columns

R

SQL

Missing values

R

SQL

String operations

R

SQL

Summarisation

R

SQL

Sorting

R

SQL

Proportion calculation

R

SQL

Filtering using summary statistics

R

SQL

Setting up connection (in `R`)

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`

`R`

`SQL`