Evaluated only once to create a CSV file.
library(nycflights13)
library(tidyverse)
set.seed(123)
readr::write_csv(x = nycflights13::airports %>%
dplyr::select(faa, name),
file = "data/airports.csv")
readr::write_csv(x = nycflights13::airlines, file = "data/airlines.csv")
readr::write_csv(x = nycflights13::flights %>%
dplyr::slice(1:100) %>%
dplyr::select(carrier, flight, tailnum, origin, dest),
file = "data/flights.csv")
R
)This chunk of code will take the in-memory data and allow this data to be queried exactly like a SQL database.
The print methods for R
is different to that of
SQL
.
library(dplyr)
library(DBI)
# Create an ephemeral in-memory RSQLite database
con <- dbConnect(RSQLite::SQLite(), ":memory:")
airports = readr::read_csv(file = "data/airports.csv")
airlines = readr::read_csv(file = "data/airlines.csv")
flights = readr::read_csv(file = "data/flights.csv")
dbListTables(con)
## character(0)
dbWriteTable(con, "airports", airports)
dbWriteTable(con, "airlines", airlines)
dbWriteTable(con, "flights", flights)
dbListTables(con)
## [1] "airlines" "airports" "flights"
dbListFields(con, "airports")
## [1] "faa" "name"
dbListFields(con, "airlines")
## [1] "carrier" "name"
dbListFields(con, "flights")
## [1] "carrier" "flight" "tailnum" "origin" "dest"
I remembered at one point, joining data frames in R
to
be quite painful. Things changed a lot after dplyr
gained
popularity.
Inner join in both language behave pretty similarly. The only
difference is that the use of by
in R
and the
use of ON
in SQL
. R
allows a
single specification of a common column when joining, whereas
SQL
needs explicitly specify the common column by name on
both data. The way to avoid this is to use the function
USING (common_column_name)
.
R
flights %>%
inner_join(airlines, by = "carrier")
SQL
SELECT *
FROM flights
INNER JOIN airlines
ON flights.carrier = airlines.carrier
carrier | flight | tailnum | origin | dest | carrier | name |
---|---|---|---|---|---|---|
UA | 1545 | N14228 | EWR | IAH | UA | United Air Lines Inc. |
UA | 1714 | N24211 | LGA | IAH | UA | United Air Lines Inc. |
AA | 1141 | N619AA | JFK | MIA | AA | American Airlines Inc. |
B6 | 725 | N804JB | JFK | BQN | B6 | JetBlue Airways |
DL | 461 | N668DN | LGA | ATL | DL | Delta Air Lines Inc. |
UA | 1696 | N39463 | EWR | ORD | UA | United Air Lines Inc. |
B6 | 507 | N516JB | EWR | FLL | B6 | JetBlue Airways |
EV | 5708 | N829AS | LGA | IAD | EV | ExpressJet Airlines Inc. |
B6 | 79 | N593JB | JFK | MCO | B6 | JetBlue Airways |
AA | 301 | N3ALAA | LGA | ORD | AA | American Airlines Inc. |
This is unique to SQL
, as we can freely rename the
naming of variables in R
.
SELECT *
FROM flights as f
INNER JOIN airlines AS a
ON f.carrier = a.carrier
carrier | flight | tailnum | origin | dest | carrier | name |
---|---|---|---|---|---|---|
UA | 1545 | N14228 | EWR | IAH | UA | United Air Lines Inc. |
UA | 1714 | N24211 | LGA | IAH | UA | United Air Lines Inc. |
AA | 1141 | N619AA | JFK | MIA | AA | American Airlines Inc. |
B6 | 725 | N804JB | JFK | BQN | B6 | JetBlue Airways |
DL | 461 | N668DN | LGA | ATL | DL | Delta Air Lines Inc. |
UA | 1696 | N39463 | EWR | ORD | UA | United Air Lines Inc. |
B6 | 507 | N516JB | EWR | FLL | B6 | JetBlue Airways |
EV | 5708 | N829AS | LGA | IAD | EV | ExpressJet Airlines Inc. |
B6 | 79 | N593JB | JFK | MCO | B6 | JetBlue Airways |
AA | 301 | N3ALAA | LGA | ORD | AA | American Airlines Inc. |
Multiple joins in both languages are straight forward.
One of the annoying thing is that SQL
allows columns
with the same name, so some manual renaming is necessary to avoid
confusion in the final output. We will do this in the next section.
R
flights %>%
inner_join(airlines, by = "carrier") %>%
inner_join(airports, by = c("origin" = "faa"))
SQL
SELECT *
FROM flights
INNER JOIN airlines
ON flights.carrier = airlines.carrier
INNER JOIN airports
ON flights.origin = airports.faa
carrier | flight | tailnum | origin | dest | carrier | name | faa | name |
---|---|---|---|---|---|---|---|---|
UA | 1545 | N14228 | EWR | IAH | UA | United Air Lines Inc. | EWR | Newark Liberty Intl |
UA | 1714 | N24211 | LGA | IAH | UA | United Air Lines Inc. | LGA | La Guardia |
AA | 1141 | N619AA | JFK | MIA | AA | American Airlines Inc. | JFK | John F Kennedy Intl |
B6 | 725 | N804JB | JFK | BQN | B6 | JetBlue Airways | JFK | John F Kennedy Intl |
DL | 461 | N668DN | LGA | ATL | DL | Delta Air Lines Inc. | LGA | La Guardia |
UA | 1696 | N39463 | EWR | ORD | UA | United Air Lines Inc. | EWR | Newark Liberty Intl |
B6 | 507 | N516JB | EWR | FLL | B6 | JetBlue Airways | EWR | Newark Liberty Intl |
EV | 5708 | N829AS | LGA | IAD | EV | ExpressJet Airlines Inc. | LGA | La Guardia |
B6 | 79 | N593JB | JFK | MCO | B6 | JetBlue Airways | JFK | John F Kennedy Intl |
AA | 301 | N3ALAA | LGA | ORD | AA | American Airlines Inc. | LGA | La Guardia |
This is where I think R
can be slightly more flexible,
as one can easily insert an extra row of code for renaming of columns
rather than querying the data to rename. Though this is down to personal
preference, as one could do
inner_join(data %>% rename(...))
too in R
for the same cumbersome effect.
R
flights %>%
inner_join(airlines, by = "carrier") %>%
dplyr::rename(carrier_name = name) %>%
inner_join(airports, by = c("origin" = "faa")) %>%
dplyr::rename(origin_name = name)
SQL
SELECT *
FROM flights
INNER JOIN (SELECT carrier, name AS carrier_name FROM airlines) as rap
ON flights.carrier = rap.carrier
INNER JOIN (SELECT faa, name AS airport_name FROM airports) as ral
ON flights.origin = ral.faa
carrier | flight | tailnum | origin | dest | carrier | carrier_name | faa | airport_name |
---|---|---|---|---|---|---|---|---|
UA | 1545 | N14228 | EWR | IAH | UA | United Air Lines Inc. | EWR | Newark Liberty Intl |
UA | 1714 | N24211 | LGA | IAH | UA | United Air Lines Inc. | LGA | La Guardia |
AA | 1141 | N619AA | JFK | MIA | AA | American Airlines Inc. | JFK | John F Kennedy Intl |
B6 | 725 | N804JB | JFK | BQN | B6 | JetBlue Airways | JFK | John F Kennedy Intl |
DL | 461 | N668DN | LGA | ATL | DL | Delta Air Lines Inc. | LGA | La Guardia |
UA | 1696 | N39463 | EWR | ORD | UA | United Air Lines Inc. | EWR | Newark Liberty Intl |
B6 | 507 | N516JB | EWR | FLL | B6 | JetBlue Airways | EWR | Newark Liberty Intl |
EV | 5708 | N829AS | LGA | IAD | EV | ExpressJet Airlines Inc. | LGA | La Guardia |
B6 | 79 | N593JB | JFK | MCO | B6 | JetBlue Airways | JFK | John F Kennedy Intl |
AA | 301 | N3ALAA | LGA | ORD | AA | American Airlines Inc. | LGA | La Guardia |
Both languages behave similarly.
R
flights %>%
mutate(
runways = case_when(
origin == "JFK" ~ 4,
origin == "EWR" ~ 3,
origin == "LGA" ~ 2,
TRUE ~ 0))
SQL
SELECT *,
CASE WHEN origin = "JFK" THEN 4
WHEN origin = "EWR" THEN 3
WHEN origin = "LGA" THEN 2
ELSE 0 END
AS runways
FROM flights;
carrier | flight | tailnum | origin | dest | runways |
---|---|---|---|---|---|
UA | 1545 | N14228 | EWR | IAH | 3 |
UA | 1714 | N24211 | LGA | IAH | 2 |
AA | 1141 | N619AA | JFK | MIA | 4 |
B6 | 725 | N804JB | JFK | BQN | 4 |
DL | 461 | N668DN | LGA | ATL | 2 |
UA | 1696 | N39463 | EWR | ORD | 3 |
B6 | 507 | N516JB | EWR | FLL | 3 |
EV | 5708 | N829AS | LGA | IAD | 2 |
B6 | 79 | N593JB | JFK | MCO | 4 |
AA | 301 | N3ALAA | LGA | ORD | 2 |
Creating a new object in R
is relatively easy compared
to SQL
. Unfortunately, the in-memory SQLite
set-up for this worksheet does not allow me to execute this SQL
command.
R
freq_carrier = flights %>%
group_by(carrier) %>%
tally() %>%
arrange(desc(n))
freq_carrier %>%
inner_join(airlines, by = "carrier")
SQL
-- Creating frequency of carrier summary table
SELECT carrier,
COUNT(carrier) AS n
INTO freq_carrier
FROM flights
GROUP BY carrier
ORDER BY n DESC;
-- Joining that with the airline data
INNER JOIN airlines ON freq_carrier.carrier = airlines.carriers