The censusxy package is designed to provide easy access to the U.S. Census Bureau Geocoding Tools in R.
Few packages offer free or reproducible geocoding in the R environment. The Census Bureau Geocoding Tools, however, allow for both unlimited free geocoding and an added level of reproducibility compared to commercial geocoders. Many geospatial workflows involve a large quantity of addresses, so our core focus is on batch geocoding.
The U.S. Census Bureau makes its geocoding API available without any API key, and this package allows for virtually unlimited batch geocoding. Please use this package responsibly, as others also rely on this API for their research.
If you plan on using the single address geocoding tools, your data do not need to be in any specific class. To use the batch geocoder, your data must be in a data.frame (or equivalent class). The package includes homicides in St. Louis City from 2008 to 2018 as example data.
To use the batch API, your address data must be structured, meaning that it contains separate columns for street address, city, state, and ZIP code. You may find the postmastr package useful for this task. Only the street address is mandatory, but omitting city, state, or ZIP code drastically lowers the speed and accuracy of the batch geocoder.
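As a rough illustration of what "structured" means here, the sketch below builds a small data.frame in base R with one column per address component. The addresses and column names are illustrative choices, not requirements of the package.

```r
# Illustrative addresses; the column names are arbitrary and chosen for clarity.
addresses <- data.frame(
  id     = 1:2,
  street = c("20 N Grand Blvd", "3700 Lindell Blvd"),
  city   = c("St Louis", "St Louis"),
  state  = c("MO", "MO"),
  zip    = c("63108", "63108"),
  stringsAsFactors = FALSE
)

# Only the street column is mandatory, so confirm it has no missing or empty values.
stopifnot(!anyNA(addresses$street), all(nzchar(addresses$street)))

# Keep ZIP codes as character so that leading zeros (e.g. "02139") are preserved.
stopifnot(is.character(addresses$zip))
```

Storing ZIP codes as character rather than numeric is a general precaution when handling U.S. addresses.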
The Census Geocoder contains 4 primary functions: 3 for single address geocoding and 1 for batch geocoding. For interactive use cases, such as a Shiny application, the single line geocoder is recommended. For large quantities of addresses, the batch endpoint is preferable.
If your use case is locating coordinates within census geometries, only a single-coordinate function is available for this task; there is no batch equivalent.
If you are interested in census geometries (composed of FIPS codes for state, county, tract and block), you should specify "geographies" in the return argument. This also necessitates the use of a vintage.
Vintage is only important to consider when geocoding census geographies. It has no impact on geocoding coordinates (location). You can obtain a data.frame of valid benchmarks and vintages with their respective functions. For vintages, you must supply the name or ID of the benchmark you have chosen.
If you are on a UNIX platform (macOS, Linux, etc.), you may take advantage of multiple threads to greatly increase performance. This functionality is not currently supported on Windows Operating Systems, however. All you have to do is specify the number of cores you would like to use and the function will automatically distribute the workload in the most efficient manner. The function will not allow you to specify more cores than are available, and will instead default to the maximum number of available cores.
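The core-limiting behavior described above can be pictured with a small base R sketch. This is an illustration of the idea only, not the package's actual implementation, and `resolve_cores()` is a hypothetical helper name.

```r
# Illustration only: clamp a requested core count to what the machine reports.
# resolve_cores() is a hypothetical helper, not a censusxy function.
resolve_cores <- function(requested) {
  # The text above says multi-threading is unsupported on Windows,
  # so fall back to a single core there.
  if (.Platform$OS.type == "windows") {
    return(1L)
  }
  available <- parallel::detectCores()
  if (is.na(available)) {
    available <- 1L  # detectCores() may return NA when the count is unknown
  }
  # Never use fewer than 1 core or more than the number available.
  max(1L, min(as.integer(requested), available))
}

resolve_cores(2)
resolve_cores(100000)  # capped at the number of available cores
```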
When using the batch function, you may set class to "sf", which returns the results as an sf object, allowing for quick preview or export of the spatial data. However, doing this returns only the addresses that the geocoder successfully matched. A message stating how many rows were removed will print in the console.
You may also set output to "simple" or "full". Simple returns only coordinates (and a GEOID if return = "geographies"), which is suitable for most use cases. If you want all of the raw output from the geocoder, specify "full" instead.
The function has a timeout argument, which specifies how many minutes may elapse before the API query ends with an error. In this implementation, the timeout applies per 1,000 addresses, not to the whole batch. It defaults to 30 minutes, which should be appropriate for most internet speeds.
If a batch times out, the function will terminate, and you will lose any geocoding progress.
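One possible way to limit how much work a timeout can cost is to geocode in smaller pieces and keep whatever finished before a failure. The sketch below is a workaround written in base R, not a feature of the package: `geocode_in_chunks()` and `fake_geocoder()` are hypothetical names, and `geocode_fn` stands in for whatever batch geocoding call you use.

```r
# Hypothetical helper: run a geocoding function chunk by chunk and return
# the chunks that completed before the first error.
geocode_in_chunks <- function(data, geocode_fn, chunk_size = 1000) {
  rows <- seq_len(nrow(data))
  groups <- split(rows, ceiling(rows / chunk_size))
  done <- list()
  for (i in seq_along(groups)) {
    chunk <- data[groups[[i]], , drop = FALSE]
    result <- tryCatch(geocode_fn(chunk), error = function(e) e)
    if (inherits(result, "error")) {
      warning("Chunk ", i, " failed: ", conditionMessage(result),
              "; returning the chunks completed so far")
      break
    }
    done[[i]] <- result
  }
  do.call(rbind, done)  # NULL if no chunk completed
}

# Stand-in geocoder used only to demonstrate the helper offline.
fake_geocoder <- function(chunk) {
  chunk$lon <- NA_real_
  chunk$lat <- NA_real_
  chunk
}

toy <- data.frame(street = paste(1:5, "Main St"), stringsAsFactors = FALSE)
res <- geocode_in_chunks(toy, fake_geocoder, chunk_size = 2)
nrow(res)
```

In real use, `geocode_fn` would wrap your batch geocoding call. Results with class "dataframe" combine cleanly with `rbind()`.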
Be aware that long-running batches may give your computer time to go to sleep, which may cause a batch to never return. macOS users may find the app caffeine useful (also available as a command-line tool).
If you would like to append census geographies, or control the benchmark in order to reproduce geocoding results, you will find it convenient to use the built-in functions for doing so. If you are not concerned about reproducibility or geographies, the functions will default to the latest benchmark, and you may ignore this section.
First, get the currently valid benchmarks. These are used both to geocode and to look up the available vintages.
Once you have selected a benchmark, and only if you intend to append geographies, choose a vintage based on that benchmark (either by name or by ID).
Both of these should be supplied as arguments to your geocoding function.
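A sketch of these two lookups follows. It assumes the helper functions are named `cxy_benchmarks()` and `cxy_vintages()`, which is recalled from the package's documentation rather than stated in this text, and it uses "Public_AR_Current" and "Current_Current" as assumed example names. Because both calls query the Census API, they are guarded to run only in an interactive session with censusxy installed.

```r
# Assumed example names; check the output of the lookups for the valid values.
benchmark <- "Public_AR_Current"
vintage   <- "Current_Current"

# Both lookups need internet access, so only run them interactively.
if (interactive() && requireNamespace("censusxy", quietly = TRUE)) {
  # data.frame of valid benchmarks (function name is an assumption)
  print(censusxy::cxy_benchmarks())
  # data.frame of vintages available for the chosen benchmark
  print(censusxy::cxy_vintages(benchmark))
}
```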
In this example, we will use the included stl_homicides data to show the full process for batch geocoding.
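A minimal sketch of that workflow might look like the following. The function name `cxy_geocode()`, its argument names, and the column names of `stl_homicides` are assumptions recalled from the package's documentation, not confirmed by this text. Requesting class "sf" additionally assumes the sf package is installed.

```r
# Assumed column names in the example data; adjust to match your own data.
address_cols <- c(
  street = "street_address",
  city   = "city",
  state  = "state",
  zip    = "postal_code"
)

# The call queries the Census API, so only run it interactively.
if (interactive() && requireNamespace("censusxy", quietly = TRUE)) {
  homicide_sf <- censusxy::cxy_geocode(
    censusxy::stl_homicides,
    street = address_cols[["street"]],
    city   = address_cols[["city"]],
    state  = address_cols[["state"]],
    zip    = address_cols[["zip"]],
    class  = "sf"   # return an sf object rather than a data.frame
  )
}
```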
Note, however, that class = "sf" returns only matched addresses, including those approximated by street length. Any unmatched addresses are dropped from the output. Use class = "dataframe" to return all addresses, including those that are unmatched.
Output returned as an sf object can be previewed with a mapping package.
We’ll investigate a few other use cases, specifically those involving fewer or single addresses.
You would like to geocode a single structured address:
You would like to geocode a single unstructured address and append census geographies:
You would like to append census geographies to a given coordinate:
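The three cases above could be sketched as follows. The function names `cxy_single()`, `cxy_oneline()`, and `cxy_geography()` and their arguments are assumptions recalled from the package's documentation, and the address, coordinates, and vintage name are illustrative values. Each call queries the Census API, so the calls are guarded to run only interactively.

```r
# Illustrative inputs
street  <- "20 N Grand Blvd"
city    <- "St Louis"
state   <- "MO"
zip     <- "63108"
oneline <- "3700 Lindell Blvd, St. Louis, MO 63108"
lon     <- -90.23324
lat     <- 38.63593

if (interactive() && requireNamespace("censusxy", quietly = TRUE)) {
  # Single structured address
  print(censusxy::cxy_single(street, city, state, zip))

  # Single unstructured address, with census geographies appended
  print(censusxy::cxy_oneline(oneline, return = "geographies",
                              vintage = "Current_Current"))

  # Census geographies for a given coordinate
  print(censusxy::cxy_geography(lon = lon, lat = lat))
}
```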
For a handful of addresses, you may want to iterate using these functions. Two examples using base R are provided here.
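As a sketch of what such iteration might look like, the two hypothetical helpers below apply a single-address geocoding function to each element of a character vector, once with `lapply()` and once with a `for` loop. The helper names are illustrative, and `cxy_oneline()` is an assumed function name from the package's documentation.

```r
addresses <- c(
  "20 N Grand Blvd, St Louis, MO 63108",
  "3700 Lindell Blvd, St. Louis, MO 63108"
)

# Version 1: lapply() over the addresses, then bind the rows together.
geocode_each <- function(addresses, geocode_fn) {
  results <- lapply(addresses, geocode_fn)
  do.call(rbind, results)
}

# Version 2: an explicit for loop that fills a preallocated list.
geocode_each_loop <- function(addresses, geocode_fn) {
  results <- vector("list", length(addresses))
  for (i in seq_along(addresses)) {
    # Wrapping in list() keeps the slot even if the result is NULL.
    results[i] <- list(geocode_fn(addresses[i]))
  }
  do.call(rbind, results)
}

# Real calls need internet access, so only run them interactively.
if (interactive() && requireNamespace("censusxy", quietly = TRUE)) {
  print(geocode_each(addresses, censusxy::cxy_oneline))
}
```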
If you are new to R itself, welcome! Hadley Wickham’s R for Data Science is an excellent way to get started with data manipulation in the tidyverse, which censusxy is designed to integrate seamlessly with.
If you are new to spatial analysis in R, we strongly encourage you to check out the excellent new Geocomputation with R by Robin Lovelace, Jakub Nowosad, and Jannes Muenchow.
If you have questions about using censusxy, you are encouraged to use the RStudio Community forums. Please create a reprex before posting. Feel free to tag Chris (@chris.prener) in any posts about censusxy.
If you think you have found a bug, please create a reprex and then open an issue on GitHub.