The Arrow C++ library includes a generic filesystem interface and specific
implementations for some cloud storage systems. This setup allows various
parts of the project to read and write data with different storage backends. In the arrow R package, support has been enabled for AWS S3.
This vignette provides an overview of working with S3 data using Arrow.
In Windows and macOS binary packages, S3 support is included. On Linux, when installing from source, S3 support is not enabled by default, and it has additional system requirements. See vignette("install", package = "arrow") for details.
File readers and writers (write_feather(), et al.) accept an S3 URI as the source or destination file. An S3 URI looks like:

s3://[access_key:secret_key@]bucket/path[?region=]
For example, one of the NYC taxi data files used in vignette("dataset", package = "arrow") is found at

s3://ursa-labs-taxi-data/2019/06/data.parquet
Given this URI, we can pass it to
read_parquet() just as if it were a local file path:
df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet")
Note that this will be slower to read than if the file were local, though if you're running on a machine in the same AWS region as the file in S3, the cost of reading the data over the network should be much lower.
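Writing works the same way. As a sketch, assuming you have write access to a hypothetical bucket named my-bucket:

write_parquet(df, "s3://my-bucket/nyc-taxi/2019/06/data.parquet")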
Another way to connect to S3 is to create a
FileSystem object once and pass
that to the read/write functions.
S3FileSystem objects can be created with the
s3_bucket() function, which
automatically detects the bucket's AWS region. Additionally, the resulting
FileSystem will consider paths relative to the bucket's path (so for example
you don't need to prefix the bucket path when listing a directory).
This may be convenient when dealing with
long URIs, and it's necessary for some options and authentication methods
that aren't supported in the URI format.
Using a FileSystem object, we can point to specific files in it with the $path() method. In the previous example, this would look like:
bucket <- s3_bucket("ursa-labs-taxi-data")
df <- read_parquet(bucket$path("2019/06/data.parquet"))
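Because the resulting FileSystem treats paths as relative to the bucket, you can also list its contents without the bucket prefix. A brief sketch using the filesystem's $ls() method:

bucket$ls("2019")

This should return the paths under the 2019/ prefix, such as the monthly subdirectories.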
See the help for FileSystem for a list of options that s3_bucket() and S3FileSystem$create() can take. region, scheme, and endpoint_override can be encoded as query parameters in the URI (though region will be auto-detected in s3_bucket() or from the URI if omitted). access_key and secret_key can also be included, but other options are not supported in the URI.
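For example, if region auto-detection does not work in your environment, the region can be supplied as a query parameter (the region shown here is an assumption for illustration):

df <- read_parquet("s3://ursa-labs-taxi-data/2019/06/data.parquet?region=us-east-2")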
The object that
s3_bucket() returns is technically a
SubTreeFileSystem, which holds a path and a file system to which it corresponds.
SubTreeFileSystems can be useful for holding a reference to a subdirectory somewhere, on S3 or elsewhere.
One way to get a subtree is to call the $cd() method on a FileSystem:
june2019 <- bucket$cd("2019/06")
df <- read_parquet(june2019$path("data.parquet"))
A SubTreeFileSystem can also be made from a URI:
june2019 <- SubTreeFileSystem$create("s3://ursa-labs-taxi-data/2019/06")
To access private S3 buckets, you typically need two secret parameters: access_key, which is like a user id, and secret_key, which is like a token.
There are a few options for passing these credentials:

1. Include them in the URI, like s3://access_key:secret_key@bucket-name/path/to/file. Be sure to URL-encode your secrets if they contain special characters like “/” (e.g., URLencode("123/456", reserved = TRUE)).

2. Pass them as access_key and secret_key to S3FileSystem$create() or s3_bucket(), as in the sketch after this list.

3. Set them as environment variables named AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, respectively.

4. Define them in a ~/.aws/credentials file, according to the AWS documentation.
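As a sketch of the second option, using a hypothetical bucket name and reading the credentials from environment variables:

bucket <- s3_bucket(
  "my-private-bucket",  # hypothetical bucket name
  access_key = Sys.getenv("AWS_ACCESS_KEY_ID"),
  secret_key = Sys.getenv("AWS_SECRET_ACCESS_KEY")
)
df <- read_parquet(bucket$path("path/to/file.parquet"))  # hypothetical path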
You can also use an AccessRole for temporary access by passing the role_arn identifier to S3FileSystem$create() or s3_bucket().
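For example, a sketch with a hypothetical role ARN:

bucket <- s3_bucket(
  "my-private-bucket",                                   # hypothetical bucket name
  role_arn = "arn:aws:iam::123456789012:role/my-role"    # hypothetical role ARN
)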
The S3FileSystem machinery enables you to work with any file system that provides an S3-compatible interface. For example, MinIO is an object-storage server that emulates the S3 API. If you were to run a MinIO server locally with its default settings, you could connect to it with S3FileSystem like this:
minio <- S3FileSystem$create(
  access_key = "minioadmin",
  secret_key = "minioadmin",
  scheme = "http",
  endpoint_override = "localhost:9000"
)
or, as a URI, it would be

s3://minioadmin:minioadmin@?scheme=http&endpoint_override=localhost%3A9000

(note the URL escaping of the ":" in endpoint_override).
Among other applications, this can be useful for testing out code locally before running on a remote S3 bucket.
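For instance, here is a minimal round-trip sketch against the local MinIO server above (the bucket name is hypothetical, and creating a top-level directory is assumed to map to bucket creation on MinIO):

# create a bucket on the local server
minio$CreateDir("test-bucket")
# write a sample dataset and read it back
write_parquet(mtcars, minio$path("test-bucket/mtcars.parquet"))
df <- read_parquet(minio$path("test-bucket/mtcars.parquet"))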