Data preparation

Evaluated only once to create a CSV file.

library(nycflights13)
library(tidyverse)
set.seed(123)
readr::write_csv(x = nycflights13::airports %>% 
                   dplyr::select(faa, name), 
                 file = "data/airports.csv")
readr::write_csv(x = nycflights13::airlines, file = "data/airlines.csv")
readr::write_csv(x = nycflights13::flights %>% 
                   dplyr::slice(1:100) %>% 
                   dplyr::select(carrier, flight, tailnum, origin, dest), 
                 file = "data/flights.csv")

Setting up connection (in R)

This chunk of code will take the in-memory data and allow this data to be queried exactly like a SQL database.

The print methods for R is different to that of SQL.

library(dplyr)
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
airports = readr::read_csv(file = "data/airports.csv")
airlines = readr::read_csv(file = "data/airlines.csv")
flights = readr::read_csv(file = "data/flights.csv")
dbListTables(con)
## character(0)
dbWriteTable(con, "airports", airports)
dbWriteTable(con, "airlines", airlines)
dbWriteTable(con, "flights", flights)
dbListTables(con)
## [1] "airlines" "airports" "flights"
dbListFields(con, "airports")
## [1] "faa"  "name"
dbListFields(con, "airlines")
## [1] "carrier" "name"
dbListFields(con, "flights")
## [1] "carrier" "flight"  "tailnum" "origin"  "dest"

Inner join

I remembered at one point, joining data frames in R to be quite painful. Things changed a lot after dplyr gained popularity.

Inner join in both language behave pretty similarly. The only difference is that the use of by in R and the use of ON in SQL. R allows a single specification of a common column when joining, whereas SQL needs explicitly specify the common column by name on both data. The way to avoid this is to use the function USING (common_column_name).

R

flights %>% 
  inner_join(airlines, by = "carrier")

SQL

SELECT *
FROM flights
INNER JOIN airlines
  ON flights.carrier = airlines.carrier
Displaying records 1 - 10
carrier flight tailnum origin dest carrier name
UA 1545 N14228 EWR IAH UA United Air Lines Inc.
UA 1714 N24211 LGA IAH UA United Air Lines Inc.
AA 1141 N619AA JFK MIA AA American Airlines Inc.
B6 725 N804JB JFK BQN B6 JetBlue Airways
DL 461 N668DN LGA ATL DL Delta Air Lines Inc.
UA 1696 N39463 EWR ORD UA United Air Lines Inc.
B6 507 N516JB EWR FLL B6 JetBlue Airways
EV 5708 N829AS LGA IAD EV ExpressJet Airlines Inc.
B6 79 N593JB JFK MCO B6 JetBlue Airways
AA 301 N3ALAA LGA ORD AA American Airlines Inc.

Aliasing

This is unique to SQL, as we can freely rename the naming of variables in R.

SELECT *
FROM flights as f
INNER JOIN airlines AS a
  ON f.carrier = a.carrier
Displaying records 1 - 10
carrier flight tailnum origin dest carrier name
UA 1545 N14228 EWR IAH UA United Air Lines Inc.
UA 1714 N24211 LGA IAH UA United Air Lines Inc.
AA 1141 N619AA JFK MIA AA American Airlines Inc.
B6 725 N804JB JFK BQN B6 JetBlue Airways
DL 461 N668DN LGA ATL DL Delta Air Lines Inc.
UA 1696 N39463 EWR ORD UA United Air Lines Inc.
B6 507 N516JB EWR FLL B6 JetBlue Airways
EV 5708 N829AS LGA IAD EV ExpressJet Airlines Inc.
B6 79 N593JB JFK MCO B6 JetBlue Airways
AA 301 N3ALAA LGA ORD AA American Airlines Inc.

Multiple inner joins

Multiple joins in both languages are straight forward.

One of the annoying thing is that SQL allows columns with the same name, so some manual renaming is necessary to avoid confusion in the final output. We will do this in the next section.

R

flights %>% 
  inner_join(airlines, by = "carrier") %>% 
  inner_join(airports, by = c("origin" = "faa"))

SQL

SELECT *
FROM flights
INNER JOIN airlines
  ON flights.carrier = airlines.carrier
INNER JOIN airports
  ON flights.origin = airports.faa
Displaying records 1 - 10
carrier flight tailnum origin dest carrier name faa name
UA 1545 N14228 EWR IAH UA United Air Lines Inc. EWR Newark Liberty Intl
UA 1714 N24211 LGA IAH UA United Air Lines Inc. LGA La Guardia
AA 1141 N619AA JFK MIA AA American Airlines Inc. JFK John F Kennedy Intl
B6 725 N804JB JFK BQN B6 JetBlue Airways JFK John F Kennedy Intl
DL 461 N668DN LGA ATL DL Delta Air Lines Inc. LGA La Guardia
UA 1696 N39463 EWR ORD UA United Air Lines Inc. EWR Newark Liberty Intl
B6 507 N516JB EWR FLL B6 JetBlue Airways EWR Newark Liberty Intl
EV 5708 N829AS LGA IAD EV ExpressJet Airlines Inc. LGA La Guardia
B6 79 N593JB JFK MCO B6 JetBlue Airways JFK John F Kennedy Intl
AA 301 N3ALAA LGA ORD AA American Airlines Inc. LGA La Guardia

Renaming/subquery during join

This is where I think R can be slightly more flexible, as one can easily insert an extra row of code for renaming of columns rather than querying the data to rename. Though this is down to personal preference, as one could do inner_join(data %>% rename(...)) too in R for the same cumbersome effect.

R

flights %>%
  inner_join(airlines, by = "carrier") %>%
  dplyr::rename(carrier_name = name) %>%
  inner_join(airports, by = c("origin" = "faa")) %>%
  dplyr::rename(origin_name = name)

SQL

SELECT *
FROM flights
INNER JOIN (SELECT carrier, name AS carrier_name FROM airlines) as rap
  ON flights.carrier = rap.carrier
INNER JOIN (SELECT faa, name AS airport_name FROM airports) as ral
  ON flights.origin = ral.faa
Displaying records 1 - 10
carrier flight tailnum origin dest carrier carrier_name faa airport_name
UA 1545 N14228 EWR IAH UA United Air Lines Inc. EWR Newark Liberty Intl
UA 1714 N24211 LGA IAH UA United Air Lines Inc. LGA La Guardia
AA 1141 N619AA JFK MIA AA American Airlines Inc. JFK John F Kennedy Intl
B6 725 N804JB JFK BQN B6 JetBlue Airways JFK John F Kennedy Intl
DL 461 N668DN LGA ATL DL Delta Air Lines Inc. LGA La Guardia
UA 1696 N39463 EWR ORD UA United Air Lines Inc. EWR Newark Liberty Intl
B6 507 N516JB EWR FLL B6 JetBlue Airways EWR Newark Liberty Intl
EV 5708 N829AS LGA IAD EV ExpressJet Airlines Inc. LGA La Guardia
B6 79 N593JB JFK MCO B6 JetBlue Airways JFK John F Kennedy Intl
AA 301 N3ALAA LGA ORD AA American Airlines Inc. LGA La Guardia

Case when

Both languages behave similarly.

R

flights %>% 
  mutate(
    runways = case_when(
      origin == "JFK" ~ 4,
      origin == "EWR" ~ 3,
      origin == "LGA" ~ 2,
      TRUE ~ 0))

SQL

SELECT *, 
  CASE WHEN origin = "JFK" THEN 4
       WHEN origin = "EWR" THEN 3
       WHEN origin = "LGA" THEN 2
       ELSE 0 END
       AS runways
FROM flights;
Displaying records 1 - 10
carrier flight tailnum origin dest runways
UA 1545 N14228 EWR IAH 3
UA 1714 N24211 LGA IAH 2
AA 1141 N619AA JFK MIA 4
B6 725 N804JB JFK BQN 4
DL 461 N668DN LGA ATL 2
UA 1696 N39463 EWR ORD 3
B6 507 N516JB EWR FLL 3
EV 5708 N829AS LGA IAD 2
B6 79 N593JB JFK MCO 4
AA 301 N3ALAA LGA ORD 2

SELECT INTO emporary table

Creating a new object in R is relatively easy compared to SQL. Unfortunately, the in-memory SQLite set-up for this worksheet does not allow me to execute this SQL command.

R

freq_carrier = flights %>% 
  group_by(carrier) %>% 
  tally() %>% 
  arrange(desc(n))

freq_carrier %>% 
  inner_join(airlines, by = "carrier")

SQL

-- Creating frequency of carrier summary table
SELECT carrier,
       COUNT(carrier) AS n
INTO freq_carrier
FROM flights
GROUP BY carrier
ORDER BY n DESC;
-- Joining that with the airline data
INNER JOIN airlines ON freq_carrier.carrier = airlines.carriers