BigQuery in R
Motivation
Stranded at home due to COVID-19, I am trying to pick up SQL and Google’s BigQuery as new skills. Even though the learning curve has been okay, I can’t help but think how easy (and fast!) it was for me to use dplyr to do exactly the same operations.
The purpose of this post is to document some SQL queries alongside their dplyr equivalents, or more precisely, dbplyr equivalents (dplyr verbs on a database; don’t ask, it is magic!), querying COVID-19 data from BigQuery.
If you want to skip over using BigQuery in the command line, the main part of the article starts here.
Overall, I think dplyr offers a more modern syntax. But then, of course, I am very biased towards anything that comes out of the tidyverse. Keep in mind that dplyr arrived about 40 years after SQL and probably borrowed a lot of its strengths from it.
What you need
- A Google Cloud account. See here for some resources and set it up with an appropriate project
- R packages: dplyr, bigrquery, DBI (see the install snippet after this list)
- Command line tools from Google Cloud
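If you don’t already have the R packages, all three are on CRAN and can be installed in one go:

install.packages(c("dplyr", "bigrquery", "DBI")) ## all three packages are on CRAN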
Why GCP/BigQuery?
Well, Google gave me lots of free credits. I am easily bought.
Jokes aside, I find the GCP documentation to be quite well-written and always up-to-date. Plus, the documentation contains lots of examples of how to link the queried data to Data Studio and machine learning algorithms. Both of these capabilities are things that I want to eventually learn without paying for visualisation software like Tableau (sadly, I am no longer a student, so I don’t qualify for their free accounts).
(Optional) Running a BigQuery query in the command line
(Note: to use BigQuery, you will need to have a Google Cloud account and enable billing for a project, this might cost you, but luckily Google Cloud offers free credits when you sign up.)
The two most common ways of making queries are through the BigQuery web UI and through the bq command line tool.
I will show you a quick demo of how to run a simple query in the command line. But I do find the web UI option incredibly useful. I can see myself using the web UI on the go if I have some urgent data questions while riding on a bus to work (totally normal behaviour based on my past experiences).
BigQuery has a set of public datasets, including one for COVID-19. The Johns Hopkins University dataset is also part of this collection, and at the time of writing this article, this dataset is small enough for demonstration purposes.
In the command line, typing
bq show bigquery-public-data:covid19_jhu_csse.summary
gives the output
Table bigquery-public-data:covid19_jhu_csse.summary
Last modified Schema Total Rows Total Bytes Expiration Time Partitioning Clustered Fields Labels
----------------- ----------------------------- ------------ ------------- ------------ ------------------- ------------------ --------------
02 Apr 19:06:19 |- province_state: string 41129 5564578 freebqcovid:
|- country_region: string
|- date: date
|- latitude: float
|- longitude: float
|- location_geom: geography
|- confirmed: integer
|- deaths: integer
|- recovered: integer
|- active: integer
|- fips: string
|- admin2: string
|- combined_key: string
Here, bigquery-public-data is the name of the BigQuery project that hosts the COVID-19 data, covid19_jhu_csse is the dataset stored in that project, and summary is the table we just inspected. You can see that the output of bq show gives the specifications of all the columns in this table.
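As an aside, if you want to see which other tables live in the same dataset, bq ls follows the same project:dataset convention (a quick sketch using the dataset above):

bq ls bigquery-public-data:covid19_jhu_csse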
To run a standard SQL query, you can use
bq query --use_legacy_sql=false \
'SELECT *
FROM `bigquery-public-data.covid19_jhu_csse.summary`
LIMIT 10;'
which gives the output
Waiting on bqjob_r28ad9889c4311223_0000017139f7f30c_1 ... (0s) Current status: DONE
+----------------+----------------------+------------+----------+-----------+---------------+-----------+--------+-----------+--------+------+--------+--------------+
| province_state | country_region | date | latitude | longitude | location_geom | confirmed | deaths | recovered | active | fips | admin2 | combined_key |
+----------------+----------------------+------------+----------+-----------+---------------+-----------+--------+-----------+--------+------+--------+--------------+
| NULL | United Arab Emirates | 2020-02-13 | NULL | NULL | NULL | 8 | 0 | 1 | NULL | NULL | NULL | NULL |
| NULL | Thailand | 2020-02-22 | NULL | NULL | NULL | 35 | 0 | 17 | NULL | NULL | NULL | NULL |
| NULL | Vietnam | 2020-01-24 | NULL | NULL | NULL | 2 | NULL | NULL | NULL | NULL | NULL | NULL |
| NULL | Malaysia | 2020-02-17 | NULL | NULL | NULL | 22 | 0 | 7 | NULL | NULL | NULL | NULL |
| NULL | Finland | 2020-02-25 | NULL | NULL | NULL | 1 | 0 | 1 | NULL | NULL | NULL | NULL |
| NULL | Vietnam | 2020-02-21 | NULL | NULL | NULL | 16 | 0 | 14 | NULL | NULL | NULL | NULL |
| NULL | UK | 2020-02-17 | NULL | NULL | NULL | 9 | 0 | 8 | NULL | NULL | NULL | NULL |
| NULL | Nepal | 2020-02-21 | NULL | NULL | NULL | 1 | 0 | 1 | NULL | NULL | NULL | NULL |
| NULL | San Marino | 2020-02-27 | NULL | NULL | NULL | 1 | 0 | 0 | NULL | NULL | NULL | NULL |
| NULL | Thailand | 2020-01-31 | NULL | NULL | NULL | 19 | NULL | 5 | NULL | NULL | NULL | NULL |
+----------------+----------------------+------------+----------+-----------+---------------+-----------+--------+-----------+--------+------+--------+--------------+
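If you would rather save the result to a file than read a pretty-printed table, the bq tool can switch output formats. For example, here is a sketch that writes the same query out as CSV (results.csv is just a hypothetical file name):

bq query --use_legacy_sql=false --format=csv \
'SELECT *
FROM `bigquery-public-data.covid19_jhu_csse.summary`
LIMIT 10;' > results.csv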
Not bad! But let’s try to do this in R.
Connecting to BigQuery using DBI
The DBI (DataBase Interface) package provides an important function, dbConnect, which facilitates the connection to various databases. The most important argument of dbConnect() is drv, which specifies the driver necessary to connect to a database. The extra arguments in this case are related to Google’s way of setting up Google Cloud projects and billing:
+ project = "bigquery-public-data" sets which project the data lives in; in this case, it is the project managed by Google called “bigquery-public-data”.
+ dataset = "covid19_jhu_csse" refers to the dataset (consisting of multiple tables) stored in that project.
+ billing = "your_project_name" refers to the billing account that you have with Google. Since making queries to a database takes up computational resources, which are not free, you will need a Google Cloud account and your own project so Google can charge you. Mine is called “scpworkshop”.
library(DBI)
library(bigrquery)
con <- dbConnect(
drv = bigrquery::bigquery(),
project = "bigquery-public-data",
dataset = "covid19_jhu_csse",
billing = "scpworkshop"
)
We can list all the tables in the dataset using the dbListTables function.
dbListTables(con) ## List all tables in this connection
## Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## https://gargle.r-lib.org/articles/non-interactive-auth.html
## The bigrquery package is using a cached token for kevin.wang09@gmail.com.
## [1] "confirmed_cases" "deaths" "recovered_cases" "summary"
Once we decide which table we wish to work on, in this case the summary table, we can check the column types of this table before making the query. I prefer to check the column types first because I kept finding a parsing error with the latest version of the bigrquery package, which I have fixed here. If you want to follow the example below, please install my updated package using devtools::install_github("kevinwang09/bigrquery", ref = "geography").
bigrquery::bq_table_fields("bigquery-public-data.covid19_jhu_csse.summary") ## List field types
## <bq_fields>
## province_state <STRING>
## country_region <STRING>
## date <DATE>
## latitude <FLOAT>
## longitude <FLOAT>
## location_geom <GEOGRAPHY>
## confirmed <INTEGER>
## deaths <INTEGER>
## recovered <INTEGER>
## active <INTEGER>
## fips <STRING>
## admin2 <STRING>
## combined_key <STRING>
DBI::dbGetQuery(con,
"SELECT *
FROM `bigquery-public-data.covid19_jhu_csse.summary`
LIMIT 10;")
## # A tibble: 10 x 13
## province_state country_region date latitude longitude location_geom
## <chr> <chr> <date> <dbl> <dbl> <chr>
## 1 Hubei Mainland China 2020-01-26 NA NA <NA>
## 2 Guangdong Mainland China 2020-01-26 NA NA <NA>
## 3 Zhejiang Mainland China 2020-01-26 NA NA <NA>
## 4 Henan Mainland China 2020-01-26 NA NA <NA>
## 5 Chongqing Mainland China 2020-01-26 NA NA <NA>
## 6 Hunan Mainland China 2020-01-26 NA NA <NA>
## 7 Beijing Mainland China 2020-01-26 NA NA <NA>
## 8 Anhui Mainland China 2020-01-26 NA NA <NA>
## 9 Shandong Mainland China 2020-01-26 NA NA <NA>
## 10 Sichuan Mainland China 2020-01-26 NA NA <NA>
## # … with 7 more variables: confirmed <int>, deaths <int>, recovered <int>,
## # active <int>, fips <chr>, admin2 <chr>, combined_key <chr>
Comparing the syntax style of dplyr and SQL
Suppose we want to check the total confirmed COVID-19 cases for Italy and Spain (the confirmed column already holds cumulative counts). In SQL, we can make the following query:
sql_tbl = DBI::dbGetQuery(con,
"SELECT country_region, confirmed
FROM `bigquery-public-data.covid19_jhu_csse.summary`
WHERE country_region IN ('Italy', 'Spain') AND date = '2020-04-02'
")
sql_tbl
## # A tibble: 2 x 2
## country_region confirmed
## <chr> <int>
## 1 Italy 115242
## 2 Spain 112065
Looks good!
But what about dplyr?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
covid_data = tbl(con, "summary")
dplyr_tbl = covid_data %>%
dplyr::filter(country_region %in% c('Italy', 'Spain'), date == '2020-04-02') %>%
dplyr::select(country_region, confirmed) %>%
collect() ## collect() pulls the entire table into memory
dplyr_tbl
## # A tibble: 2 x 2
## country_region confirmed
## <chr> <int>
## 1 Italy 115242
## 2 Spain 112065
But are the results the same?
all.equal(sql_tbl, dplyr_tbl)
## [1] TRUE
Awesome!
We can see that the dplyr verbs and the SQL in this query are very similar, which makes it easy for data scientists to translate between the two.
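As a bonus, if you are curious about the exact SQL that dbplyr writes for you, swap collect() for show_query(): it prints the translated query without executing it. A minimal sketch, reusing the pipeline above:

covid_data %>%
  dplyr::filter(country_region %in% c('Italy', 'Spain'), date == '2020-04-02') %>%
  dplyr::select(country_region, confirmed) %>%
  dplyr::show_query() ## print the SQL translation instead of running the query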
A more complicated query
You might be asking why I didn’t put the US and China in the query. This is because both countries have state/province level data, so the total confirmed cases need to be summed across the state/province level first before being compared with data from countries reported at the country level.
Combining this aggregation with the previous query, we get:
sql_tbl = DBI::dbGetQuery(con,
"SELECT country_region, SUM(confirmed) AS country_confirmed
FROM `bigquery-public-data.covid19_jhu_csse.summary`
WHERE country_region IN ('Italy', 'Spain', 'US', 'China') AND date = '2020-04-02'
GROUP BY country_region
ORDER BY country_confirmed;
")
sql_tbl
## # A tibble: 4 x 2
## country_region country_confirmed
## <chr> <int>
## 1 China 82432
## 2 Spain 112065
## 3 Italy 115242
## 4 US 243453
Not bad! The GROUP BY and SUM took care of a lot of the hard work!
In dplyr, the job is (I would argue) simpler:
dplyr_tbl = covid_data %>%
filter(country_region %in% c('Italy', 'Spain', 'US', 'China'), date == '2020-04-02') %>%
group_by(country_region) %>%
summarise(country_confirmed = sum(confirmed)) %>%
arrange(country_confirmed) %>%
collect()
## Warning: Missing values are always removed in SQL.
## Use `SUM(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
dplyr_tbl
## # A tibble: 4 x 2
## country_region country_confirmed
## <chr> <int>
## 1 China 82432
## 2 Spain 112065
## 3 Italy 115242
## 4 US 243453
all.equal(sql_tbl, dplyr_tbl)
## [1] TRUE
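A side note on the warning earlier: dbplyr is reminding us that SQL’s SUM always drops missing values, unlike R’s sum(). Spelling out na.rm = TRUE makes that behaviour explicit and silences the warning; a minimal sketch of the same pipeline:

dplyr_tbl = covid_data %>%
  filter(country_region %in% c('Italy', 'Spain', 'US', 'China'), date == '2020-04-02') %>%
  group_by(country_region) %>%
  summarise(country_confirmed = sum(confirmed, na.rm = TRUE)) %>% ## explicit NA handling, no warning
  arrange(country_confirmed) %>%
  collect()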
Which is better?
Very much up to your personal taste!
I personally prefer dplyr because I think its design is more intuitive:
- the grouping variable is automatically included in the final output, without extra selection of columns
- selecting columns and summarising columns are two distinct verbs in dplyr, whereas in SQL both jobs fall to SELECT
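Finally, whichever syntax you prefer, it is good practice to close the connection once you are done:

DBI::dbDisconnect(con) ## close the BigQuery connection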