As well as working with local in-memory data like data frames and data tables, dplyr also works with remote on-disk data stored in databases. Generally, if your data fits in memory there is no advantage to putting it in a database: it will only be slower and more hassle. The reason you'd want to use dplyr with a database is that either your data is already in a database (and you don't want to work with static CSV files that someone else has dumped out for you), or you have so much data that it does not fit in memory and you have to use a database. Currently dplyr supports the three most popular open source databases (SQLite, MySQL and PostgreSQL), and Google's BigQuery.
Since R almost exclusively works with in-memory data, if you do have a lot of data in a database, you can't just dump it into R. Instead, you'll have to work with subsets or aggregates, and dplyr aims to make that as easy as possible. If you're working with large data, it's also likely that you'll need support to get the data into the database and to ensure you have the right indices for good performance. dplyr provides some simple tools to help with these tasks but they are no substitute for a local expert.
The motivation for supporting databases in dplyr is that you never pull down the right subset or aggregate from the database the first time, and usually you have to iterate between R and SQL many times before you get the perfect dataset. Switching between languages is cognitively challenging (especially because R and SQL are so perilously similar), so dplyr allows you to write R code that is automatically translated to SQL. The goal of dplyr is not to replace every SQL function with an R function: that would be difficult and error prone. Instead, dplyr only generates SELECT statements, the SQL you write most often as an analyst.
To get the most out of this chapter, you'll need to be familiar with querying SQL databases using the SELECT statement. If you have some familiarity with SQL and you'd like to learn more, I found "How indexes work in SQLite" and "10 easy steps to a complete understanding of SQL" to be particularly helpful.
To experiment with databases, it's easiest to get started with SQLite because everything you need is included in the R package. You don't need to install anything else, and you don't need to deal with the hassle of setting up a database server. Using a SQLite database in dplyr is really easy: just give it a path and the OK to create it.
my_db <- src_sqlite("my_db.sqlite3", create = T)
The main new concept here is the src, which is a collection of tables. Use src_sqlite(), src_mysql(), src_postgres() and src_bigquery() to connect to the different databases supported by dplyr.
my_db currently has no data in it, so we'll load it up with the hflights data using the convenient copy_to() function. This is a quick and dirty way of getting data into a database, but it's not suitable for very large datasets because all the data has to flow through R.
data("hflights", package = "hflights")
hflights_sqlite <- copy_to(my_db, hflights, temporary = FALSE, indexes = list(
c("Year", "Month", "DayofMonth"), "UniqueCarrier", "TailNum"))
As you can see, the copy_to() operation has an additional argument that allows you to supply indexes for the table. Here we set up indexes that will allow us to quickly process the data by day, by carrier and by plane. copy_to() also executes the SQL ANALYZE command: this ensures that the database has up-to-date table statistics and can pick appropriate query optimisations.
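If you want to double-check that the indexes were created, you can query SQLite's catalogue table directly. This is a minimal sketch using DBI on the underlying connection; note that the con component of the src is an internal detail, so it may differ between dplyr versions.
library(DBI)
# List the indexes that copy_to() created on the hflights table
dbGetQuery(my_db$con,
  "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = 'hflights'")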
For this particular dataset, there's a built-in src that will cache hflights in a standard location:
hflights_sqlite <- tbl(hflights_sqlite(), "hflights")
#> Loading required package: RSQLite
#> Loading required package: DBI
#> Loading required package: RSQLite.extfuns
hflights_sqlite
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 21]
#>
#> Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1 2011 1 1 6 1400 1500 AA 428
#> 2 2011 1 2 7 1401 1501 AA 428
#> 3 2011 1 3 1 1352 1502 AA 428
#> 4 2011 1 4 2 1403 1513 AA 428
#> .. ... ... ... ... ... ... ... ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#> (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#> Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#> CancellationCode (chr), Diverted (int)
You can also create a tbl from arbitrary SQL:
tbl(my_db, sql("SELECT * FROM hflights"))
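A tbl created from raw SQL behaves like any other remote source, so you can keep layering verbs on top of it. A quick sketch (the columns picked here are just illustrative):
delays <- tbl(my_db, sql("SELECT UniqueCarrier, ArrDelay FROM hflights"))
summarise(group_by(delays, UniqueCarrier), mean_delay = mean(ArrDelay))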
Remote data sources use exactly the same five verbs as local data sources:
select(hflights_sqlite, Year:DayofMonth, DepDelay, ArrDelay)
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 5]
#>
#> Year Month DayofMonth DepDelay ArrDelay
#> 1 2011 1 1 0 -10
#> 2 2011 1 2 1 -9
#> 3 2011 1 3 -8 -8
#> 4 2011 1 4 3 3
#> .. ... ... ... ... ...
filter(hflights_sqlite, DepDelay > 240)
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: hflights [389 x 21]
#> Filter: DepDelay > 240
#>
#> Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1 2011 1 28 5 1516 1916 CO 1
#> 2 2011 1 27 4 2137 2254 CO 150
#> 3 2011 1 20 4 635 807 CO 59
#> 4 2011 1 17 1 1838 2109 CO 746
#> .. ... ... ... ... ... ... ... ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#> (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#> Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#> CancellationCode (chr), Diverted (int)
arrange(hflights_sqlite, Year, Month, DayofMonth)
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 21]
#> Arrange: Year, Month, DayofMonth
#>
#> Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1 2011 1 1 6 1400 1500 AA 428
#> 2 2011 1 1 6 728 840 AA 460
#> 3 2011 1 1 6 1631 1736 AA 1121
#> 4 2011 1 1 6 1756 2112 AA 1294
#> .. ... ... ... ... ... ... ... ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#> (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#> Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#> CancellationCode (chr), Diverted (int)
mutate(hflights_sqlite, speed = AirTime / Distance)
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: hflights [227,496 x 22]
#>
#> Year Month DayofMonth DayOfWeek DepTime ArrTime UniqueCarrier FlightNum
#> 1 2011 1 1 6 1400 1500 AA 428
#> 2 2011 1 2 7 1401 1501 AA 428
#> 3 2011 1 3 1 1352 1502 AA 428
#> 4 2011 1 4 2 1403 1513 AA 428
#> .. ... ... ... ... ... ... ... ...
#> Variables not shown: TailNum (chr), ActualElapsedTime (int), AirTime
#> (int), ArrDelay (int), DepDelay (int), Origin (chr), Dest (chr),
#> Distance (int), TaxiIn (int), TaxiOut (int), Cancelled (int),
#> CancellationCode (chr), Diverted (int), speed (int)
summarise(hflights_sqlite, delay = mean(DepTime))
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: <derived table> [?? x 1]
#>
#> delay
#> 1 1396
#> .. ...
The most important difference is that the expressions in select(), filter(), arrange(), mutate(), and summarise() are translated into SQL so they can be run on the database. This translation is almost perfect for the most common operations, but there are some limitations, which you'll learn about later.
When working with databases, dplyr tries to be as lazy as possible. It's lazy in two ways:
It never pulls data back to R unless you explicitly ask for it.
It delays doing any work until the last possible minute, collecting together everything you want to do and then sending it to the database in one step.
For example, take the following code:
c1 <- filter(hflights_sqlite, DepDelay > 0)
c2 <- select(c1, Year, Month, DayofMonth, UniqueCarrier, DepDelay, AirTime, Distance)
c3 <- mutate(c2, Speed = Distance / AirTime * 60)
c4 <- arrange(c3, Year, Month, DayofMonth, UniqueCarrier)
Surprisingly, this sequence of operations never actually touches the database. It's not until you ask for the data (e.g. by printing c4) that dplyr generates the SQL and requests the results from the database, and even then it only pulls down 10 rows.
c4
#> Source: sqlite 3.7.17 [/private/tmp/RtmpWtcdn7/Rinsta96a7d76b104/dplyr/db/hflights.sqlite]
#> From: hflights [109,996 x 8]
#> Filter: DepDelay > 0
#> Arrange: Year, Month, DayofMonth, UniqueCarrier
#>
#> Year Month DayofMonth UniqueCarrier DepDelay AirTime Distance Speed
#> 1 2011 1 1 AA 8 41 224 300
#> 2 2011 1 1 AA 1 37 224 360
#> 3 2011 1 1 AA 1 113 964 480
#> 4 2011 1 1 AA 6 39 224 300
#> .. ... ... ... ... ... ... ... ...
To pull down all the results use collect(), which returns a tbl_df():
collect(c4)
#> Source: local data frame [109,996 x 8]
#>
#> Year Month DayofMonth UniqueCarrier DepDelay AirTime Distance Speed
#> 1 2011 1 1 AA 8 41 224 300
#> 2 2011 1 1 AA 1 37 224 360
#> 3 2011 1 1 AA 1 113 964 480
#> 4 2011 1 1 AA 6 39 224 300
#> .. ... ... ... ... ... ... ... ...
You can see the query dplyr has generated by looking at the query component of the object:
c4$query
#> <Query> SELECT "Year" AS "Year", "Month" AS "Month", "DayofMonth" AS "DayofMonth", "UniqueCarrier" AS "UniqueCarrier", "DepDelay" AS "DepDelay", "AirTime" AS "AirTime", "Distance" AS "Distance", "Distance" / "AirTime" * 60.0 AS "Speed"
#> FROM "hflights"
#> WHERE "DepDelay" > 0.0
#> ORDER BY "Year", "Month", "DayofMonth", "UniqueCarrier"
#> <SQLiteConnection: DBI CON (43422, 0)>
You can also ask the database how it plans to execute the query with explain(). The output for SQLite is explained in more detail on the SQLite website and is helpful if you're trying to figure out what indexes are being used.
explain(c4)
#> <SQL>
#> SELECT "Year" AS "Year", "Month" AS "Month", "DayofMonth" AS "DayofMonth", "UniqueCarrier" AS "UniqueCarrier", "DepDelay" AS "DepDelay", "AirTime" AS "AirTime", "Distance" AS "Distance", "Distance" / "AirTime" * 60.0 AS "Speed"
#> FROM "hflights"
#> WHERE "DepDelay" > 0.0
#> ORDER BY "Year", "Month", "DayofMonth", "UniqueCarrier"
#>
#> <PLAN>
#> selectid order from detail
#> 1 0 0 0 SCAN TABLE hflights (~75832 rows)
#> 2 0 0 0 USE TEMP B-TREE FOR ORDER BY
There are three ways to force the computation of a query:
collect() executes the query and returns the results to R.
compute() executes the query and stores the results in a temporary table in the database.
collapse() turns the query into a table expression.
You are most likely to use collect(): once you have interactively converged on the right set of operations, use collect() to pull down the data into a local tbl_df(). If you have some knowledge of SQL, you can use compute() and collapse() to optimise performance; the sketch below shows the three side by side.
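Here's a minimal sketch of the three in action, starting from a fresh filter on the same data (the comments describe the behaviour outlined above):
remote <- filter(hflights_sqlite, DepDelay > 0)
tmp    <- compute(remote)   # runs the query now and stores the result in a temporary table
local  <- collect(remote)   # runs the query now and returns the rows to R as a tbl_df
nested <- collapse(remote)  # doesn't run anything; wraps the current query as a subquery for later verbs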
dplyr tries to prevent you from accidentally performing expensive query operations; a short sketch after this list shows these guards in action:
nrow() is always NA: in general, there's no way to determine how many rows a query will return unless you actually run it.
Printing a tbl only runs the query enough to get the first 10 rows.
You can't use tail() on database tbls: you can't find the last rows without executing the whole query.
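A small sketch of these guards, using the c4 query from above (the row count matches what collect() returned earlier):
nrow(c4)            # NA: the number of rows isn't known until the query runs
nrow(collect(c4))   # 109996: once the result is local, the count is known
# tail(c4)          # errors: the last rows can't be found without running the whole query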
When doing simple mathematical operations of the form you normally use when filtering, mutating and summarising, it's relatively straightforward to translate R code to SQL (or indeed to any programming language). To experiment with the translation, use translate_sql(). The following examples work through some basic differences between R and SQL.
# In SQLite variable names are escaped by double quotes:
translate_sql(x)
#> <SQL> "x"
# And strings are escaped by single quotes
translate_sql("x")
#> <SQL> 'x'
# Many functions have slightly different names
translate_sql(x == 1 && (y < 2 || z > 3))
#> <SQL> "x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)
translate_sql(x ^ 2 < 10)
#> <SQL> POWER("x", 2.0) < 10.0
translate_sql(x %% 2 == 10)
#> <SQL> "x" % 2.0 = 10.0
# R and SQL have different defaults for integers vs reals.
# In R, 1 is a real, and 1L is an integer
# In SQL, 1 is an integer, and 1.0 is a real
translate_sql(1)
#> <SQL> 1.0
translate_sql(1L)
#> <SQL> 1
dplyr knows how to convert the following R functions to SQL; a short translation example follows the list:
basic math operators: +, -, *, /, %%, ^
math functions: abs, acos, acosh, asin, asinh, atan, atan2, atanh, ceiling, cos, cosh, cot, coth, exp, floor, log, log10, round, sign, sin, sinh, sqrt, tan, tanh
logical comparisons: <, <=, !=, >=, >, ==, %in%
boolean operations: &, &&, |, ||, !, xor
basic aggregations: mean, sum, min, max, sd, var
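For example, the aggregate functions in the last group map onto their SQL equivalents (a sketch; the exact output can vary with the dplyr version and the database you're translating for):
translate_sql(mean(x))
#> <SQL> AVG("x")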
The basic techniques underlying the implementation of translate_sql() are described in the Advanced R book. translate_sql() is built on top of R's parsing engine and has been carefully designed to generate correct SQL. It also protects you against SQL injection attacks by correctly escaping strings and variable names as needed by the database that you're connecting to.
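For example, a single quote inside a string literal is doubled rather than being allowed to terminate the string, which is what defeats the classic injection trick (a sketch; the exact quoting can vary by backend):
translate_sql(name == "O'Brien")
#> <SQL> "name" = 'O''Brien'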
It's not possible to provide a perfect translation because databases don't have all the functions that R does. The goal of dplyr is to provide a semantic translation: to translate what you mean, not the precise details. Even for functions that exist in both databases and R, you shouldn't expect results to be exactly the same; database programmers have different priorities to R core developers. For example, in R, mean() loops through the data twice in order to get a higher level of numerical accuracy at the cost of being twice as slow. R's mean() also provides a trim option for computing trimmed means, which databases do not provide. Databases automatically drop NULLs (their equivalent of missing values), whereas in R you have to ask nicely. This means the essence of simple calls like mean(x) will be translated accurately, but more complicated calls like mean(x, trim = 0.5, na.rm = TRUE) will raise an error:
translate_sql(mean(x, trim = T))
# Error: Invalid number of args to SQL AVG. Expecting 1
Any function that dplyr doesn't know how to convert is left as is - that means you can use any other function your database provides, even if dplyr doesn't know about it. Here are a couple of examples that will work with SQLite:
translate_sql(glob(x, y))
#> <SQL> GLOB("x", "y")
translate_sql(x %like% "ab*")
#> <SQL> "x" LIKE 'ab*'
SQLite lacks window functions, which are needed for grouped mutation and filtering. This means that the only really useful operation for grouped SQLite tables is summarise(). The grouped summarise from the introduction translates well - the only difference is that databases always drop NULLs (their equivalent of missing values), so we don't supply na.rm = TRUE.
planes <- group_by(hflights_sqlite, TailNum)
delay <- summarise(planes,
count = n(),
dist = mean(Distance),
delay = mean(ArrDelay)
)
delay <- filter(delay, count > 20, dist < 2000)
delay_local <- collect(delay)
Other databases do support window functions and you can learn about them in the corresponding vignette. It's sometimes possible to simulate grouped filters and mutates using self joins, where you join the original table with a summarised version, but that topic is beyond the scope of this intro.
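To give a flavour of the self-join idea, here's a rough sketch: summarise per plane, join the summary back onto the original table, then filter against the summary column. It assumes your version of dplyr supports joins on remote tables; the column names come from hflights.
per_plane <- summarise(group_by(hflights_sqlite, TailNum), avg_delay = mean(ArrDelay))
joined    <- inner_join(hflights_sqlite, per_plane, by = "TailNum")
worse     <- filter(joined, ArrDelay > avg_delay)  # flights later than their plane's average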
Using other databases instead of SQLite works similarly: the overall workflow is the same regardless of what database you're connecting to. The following sections go into more detail on the peculiarities of each database engine. All of these databases follow a client-server model - as well as your computer, which is connecting to the database, there is another computer actually running it (that might be your computer, but usually isn't). Getting one of these databases set up is beyond the scope of this article, but there are plenty of tutorials available on the web.
src_postgres() has five arguments: dbname, host, port, user and password. If you are running a local PostgreSQL database with the default settings you'll only need dbname, but in most cases you'll need all five. dplyr uses the RPostgreSQL package to connect to Postgres databases, which means you can't currently connect to remote databases that require an SSL connection (e.g. Heroku).
For example, the following code allows me to connect to a local PostgreSQL database that contains a copy of the hflights data:
if (has_lahman("postgres")) {
hflights_postgres <- tbl(src_postgres("hflights"), "hflights")
}
#> Loading required package: RPostgreSQL
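If the database lives on another machine you'll typically need all five arguments. A hypothetical sketch with placeholder values (swap in your own host and credentials):
my_pg <- src_postgres(dbname = "hflights", host = "db.example.com", port = 5432,
  user = "analyst", password = "secret")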
Postgres is a considerably more powerful database than SQLite. It has:
a much wider range of functions built into the database
support for window functions, which allow grouped filters and mutates to work.
The following example shows the grouped filter and mutate operations that are possible with PostgreSQL. The SQL generated from the grouped filter is quite complex because you can't filter on window functions directly; instead they have to go in a subquery.
if (has_lahman("postgres")) {
daily <- group_by(hflights_postgres, Year, Month, DayofMonth)
# Find the most and least delayed flight each day
bestworst <- filter(daily, ArrDelay == min(ArrDelay) ||
ArrDelay == max(ArrDelay))
bestworst$query
# Rank each flight within a day
ranked <- mutate(daily, rank = rank(desc(ArrDelay)))
ranked$query
}
You can connect to MySQL and MariaDB (a recent fork of MySQL) through src_mysql(), mediated by the RMySQL package. Like PostgreSQL, you'll need to provide a dbname, username, password, host, and port.
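A hypothetical connection sketch with placeholder values (check ?src_mysql for the exact argument names in your version; here they mirror src_postgres()):
my_mysql <- src_mysql(dbname = "hflights", host = "db.example.com", port = 3306,
  user = "analyst", password = "secret")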
In terms of functionality, MySQL lies somewhere between SQLite and PostgreSQL. It provides a wider range of built-in functions, but it does not support window functions (so you can't do grouped mutates and filters).
BigQuery is a hosted database server provided by Google. To connect, you need to provide your project, dataset and optionally a project for billing (if billing for project isn't enabled). After you create the src, your web browser will open and ask you to authenticate. Your credentials are stored in a local cache, so you should only need to do this once.
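A hypothetical connection sketch with placeholder project and dataset names (the billing argument only matters if billing isn't enabled for your project):
my_bq <- src_bigquery(project = "my-project-id", dataset = "hflights",
  billing = "my-billing-project")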
BigQuery supports only a single SQL statement: SELECT. Fortunately this is all you need for data analysis, and within SELECT, BigQuery provides comprehensive coverage at a similar level to PostgreSQL.
If you don't already have a database, here's some advice from my experience setting up and running all of them. SQLite is by far the easiest to get started with, but the lack of window functions makes it limited for data analysis. PostgreSQL is not too much harder to use and has a wide range of built-in functions. Don't bother with MySQL/MariaDB: it's a pain to set up and the documentation is subpar. Google BigQuery might be a good fit if you have very large data, or if you're willing to pay (a small amount of) money for someone else to look after your database.