`representr`

Record linkage (also called entity resolution or de-duplication) is used to join multiple databases and remove duplicate entities. While record linkage removes the duplicate entities from the data, many researchers are interested in performing inference, prediction, or post-linkage analysis on the linked data (e.g., regression or capture-recapture), which we call the *downstream task*. Depending on the downstream task, one may wish to find the most representative record before performing the post-linkage analysis. For example, when the values of features used in a downstream task differ across linked records, which values should be used? This is where `representr` comes in. Before introducing our new package `representr`, from the paper by Kaplan, Betancourt, and Steorts, we first provide an introduction to record linkage.

Throughout this vignette, we will use data that is available in the `representr` package, `rl_reg1` (rl = record linkage, reg = regression, 1 = amount of noisiness).

```
# load libraries
library(representr)
library(stringdist)
# load data
data("rl_reg1") # data for record linkage and regression
data("identity.rl_reg1") # true identity of each record
```

fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|
jasmine | sirotic | 3 | 31 | 1972 | F | High school graduates, no college | 32 | 127 |
hugo | white | 6 | 9 | 1958 | M | Bachelor’s degree only | 72 | 134 |
madeline | burgemeifter | 12 | 9 | 1967 | F | Some college or associate degree | 30 | 130 |
kyle | clarke | 4 | 9 | 1952 | M | Advanced degree | 90 | 125 |
livia | braciak | 11 | 6 | 1950 | F | High school graduates, no college | 27 | 134 |
phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |

This is simulated data, which consists of 500 records with 30% duplication and the following attributes:

- `fname`: First name
- `lname`: Last name
- `bm`: Birth month (numeric)
- `bd`: Birth day
- `by`: Birth year
- `sex`: Sex (“M” or “F”)
- `education`: Education level (“Less than a high school diploma”, “High school graduates, no college”, “Some college or associate degree”, “Bachelor’s degree only”, or “Advanced degree”)
- `income`: Yearly income (in $1000s)
- `bp`: Systolic blood pressure

Before we perform prototyping to get a representative data set using `representr`, we must first perform record linkage to remove duplication in the data set. In the absence of a unique identifier (such as a social security number), we can use probabilistic methods to perform record linkage. We recommend clustering records to a latent entity, known in the literature as graphical entity resolution. See Binette and Steorts for a review.

For the examples in this vignette, we will use `blink` (Steorts), a Bayesian record linkage model (Steorts 2015). For this example, we will run only a small number of iterations, so the sampler will certainly not have converged. For a real example, we would want to let this model run longer. For a faster, more scalable alternative, we recommend the recent work of Marchant et al. (2020), using either dblink or dblinkR.

```
# load blink
library(blink)
# params for running record linkage
a <- 1; b <- 99 # distortion hyperparams
c <- 1 # string density hyperparams
d <- function(string1, string2) { # Jaro-Winkler string distance
  n1 <- length(string1)
  n2 <- length(string2)
  res <- matrix(NA, n1, n2)
  for (i in seq_len(n1)) {
    for (j in seq_len(n2)) {
      res[i, j] <- stringdist(string1[i], string2[j], method = "jw")
    }
  }
  res
} # vector string distance function
num.gs <- 10 # number of iterations
M <- nrow(rl_reg1) # upper bound on number of entities
str_idx <- c(1, 2) # string columns
cat_idx <- c(3, 4, 5) # categorical columns
# data prep
# X.c contains the categorical variables
# X.s contains the string variables
X.c <- apply(as.matrix(rl_reg1[, cat_idx]), 2, as.character)
X.s <- as.matrix(rl_reg1[, str_idx])
# X.c and X.s include all files "stacked" on top of each other.
# The vector below keeps track of which rows of X.c and X.s are in which files.
file.num <- rep(1, nrow(rl_reg1))
# perform record linkage
linkage.rl <- rl.gibbs(file.num, X.s, X.c, num.gs = num.gs, a = a, b = b, c = c, d = d, M = M)
```

In fact, we can load the results of running this model for \(100,000\) iterations, which have been stored in the package as a data object called `linkage.rl`.

After record linkage is complete, one may want to perform analyses of the linked data. This is what we call the “downstream task”. As motivation, consider modeling blood pressure (`bp`) using two features (covariates), income and sex, in our example data `rl_reg1`. We want to fit this model after performing record linkage using the following features: first name, last name, and full date of birth. Here is an example of four records that represent the same individual (based on the results from record linkage) using data that is in the `representr` package.

row | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
370 | elenys | reit | 11 | 11 | 1954 | F | High school graduates, no college | 28 | 109 |
371 | elen7 | reicl | 11 | 22 | 1954 | F | Advanced degree | 52 | 118 |
372 | dleny | rejd | 11 | 11 | 1954 | M | Bachelor’s degree only | 63 | 109 |
373 | eleni | reid | 11 | 11 | 1954 | F | Bachelor’s degree only | 52 | 109 |

Examination of this table raises important questions that need to be addressed before performing a particular downstream task, such as which values of bp, income, and sex should be used as the representative features (or covariates) in a regression model? In this vignette, we will provide multiple solutions to this question using a prototyping approach.

We have four methods to choose or create the representative record from linked data included in `representr`. This process is a function of the data and the linkage structure, and we present both probabilistic and deterministic functions. The result in all cases is a representative data set to be passed on to the downstream task. The prototyping is completed using the `represent()` function.

Our first proposal to choose a representative record (*prototype*) for a cluster is the simplest and serves as a baseline or benchmark. One simply chooses the representative record uniformly at random or using a more informed distribution.
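The uniform random selection described above can be sketched in a few lines of base R. This is a minimal illustration on a hypothetical cluster assignment (the `cluster` vector is made up), not the code used inside `represent()`.

```r
# Hypothetical linkage: the cluster label of each of 7 records
cluster <- c(1, 1, 2, 3, 3, 3, 4)

set.seed(42)  # make the random draw reproducible

# For each cluster, draw one record index uniformly at random.
# Indexing with sample.int() avoids the surprise of sample(idx, 1)
# when a cluster holds a single record.
random_proto <- vapply(
  split(seq_along(cluster), cluster),
  function(idx) idx[sample.int(length(idx), 1)],
  integer(1)
)
random_proto
```

A more informed distribution could be plugged in through the `prob` argument of `sample.int()`.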

For demonstration purposes, we can create a representative dataset using the last iteration of the results from running the record linkage model using `blink`. This is accomplished using the `represent()` function, passing `proto_random` as the type of prototyping.

```
# ids for representative records (random)
random_id <- represent(rl_reg1, linkage.rl[nrow(linkage.rl),], "proto_random", parallel = FALSE)
rep_random <- rl_reg1[random_id,] # representative records (random)
```

We can have a look at a few records chosen as representative in this way.

row | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
240 | callum | lecke | 8 | 5 | 1981 | M | Advanced degree | 86 | 126 |
353 | lachlan | ebert | 5 | 15 | 1954 | M | Some college or associate degree | 42 | 150 |
162 | francesco | petito | 6 | 16 | 1972 | M | High school graduates, no college | 41 | 151 |
6 | phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
315 | peter | chittleborough | 7 | 11 | 1980 | M | High school graduates, no college | 36 | 153 |

Our second proposal to choose a representative record is to select the record that “most closely captures” that of the latent entity. Of course, this is quite subjective. We propose selecting the record whose farthest neighbor within the cluster is closest, where closeness is measured by a record distance function, \(d_r(\cdot, \cdot)\). We can write this as the record \(r = (i, j)\) within each cluster \(\Lambda_{j'}\) such that \[ r = \arg\min\limits_{(i, j) \in \Lambda_{j'}} \max\limits_{(i^*, j^*) \in \Lambda_{j'}} d_r((i, j), (i^*, j^*)). \] The result is a set of representative records, one for each latent individual, each of which is closest to the other records in its cluster. When there is a tie within the cluster, we select a record uniformly at random.
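As a small illustration of the minimax rule, the sketch below picks the prototype from a pairwise distance matrix for a single cluster. The matrix `D` and the helper `minimax_prototype()` are hypothetical and only mirror the formula above; they are not part of `representr`.

```r
# Pairwise distances among three records in one cluster (hypothetical values)
D <- matrix(c(0.0, 0.2, 0.9,
              0.2, 0.0, 0.5,
              0.9, 0.5, 0.0), nrow = 3, byrow = TRUE)

minimax_prototype <- function(D) {
  worst <- apply(D, 1, max)           # distance to the farthest neighbor, per record
  best <- which(worst == min(worst))  # records attaining the smallest such distance
  # break ties uniformly at random
  if (length(best) > 1) best <- best[sample.int(length(best), 1)]
  best
}

minimax_prototype(D)
```

Here the farthest-neighbor distances are 0.9, 0.5, and 0.9, so the second record is selected.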

There are many distance functions that can be used for \(d_r(\cdot, \cdot)\). Given two records, \((i, j)\) and \((i^*, j^*)\), we use a weighted average of column-wise distances (which depend on the column type) to produce the following single distance metric: \[ d_r((i, j), (i^*, j^*)) = \sum\limits_{\ell = 1}^p w_\ell d_{r\ell}((i, j), (i^*, j^*)), \] where \(\sum\limits_{\ell = 1}^p w_\ell = 1\). The column-wise distance functions \(d_{r\ell}(\cdot, \cdot)\) we use are presented below.

Column | \(d_{r\ell}(\cdot, \cdot)\) |
---|---|
String | Any string distance function, e.g. Jaro-Winkler string distance |
Numeric | Absolute distance, \(d_{r\ell}((i, j), (i^*, j^*)) = \mid x_{ij\ell} - x_{i^*j^*\ell} \mid\) |
Categorical | Binary distance, \(d_{r\ell}((i, j), (i^*, j^*)) = \mathbb{I}(x_{ij\ell} \neq x_{i^*j^*\ell})\) |
Ordinal | Absolute distance between levels. Let \(\gamma(x_{ij\ell})\) be the order of the value \(x_{ij\ell}\), then \(d_{r\ell}((i, j), (i^*, j^*)) = \mid \gamma(x_{ij\ell}) - \gamma(x_{i^*j^*\ell}) \mid\) |

The weighting of variable distances is used to place importance on individual features according to prior knowledge of the data set and to scale the feature distances to a common range. In this vignette, we scale all column-wise distances to be values between \(0\) and \(1\).
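To make the weighted distance concrete, the following sketch computes \(d_r\) by hand for rows 370 and 371 of the linked cluster shown earlier, using four of the columns. The scaling ranges (`income_range`, `bp_range`) and the equal weights are assumptions chosen for illustration; the package's `dist_col_type` may scale differently.

```r
# Ordered education levels, lowest to highest
edu_levels <- c("Less than a high school diploma",
                "High school graduates, no college",
                "Some college or associate degree",
                "Bachelor's degree only",
                "Advanced degree")

# Two linked records from the example cluster (rows 370 and 371)
r1 <- list(sex = "F", education = "High school graduates, no college",
           income = 28, bp = 109)
r2 <- list(sex = "F", education = "Advanced degree",
           income = 52, bp = 118)

# Hypothetical column ranges used to scale numeric distances to [0, 1]
income_range <- 100
bp_range <- 60

d_cols <- c(
  sex = as.numeric(r1$sex != r2$sex),                      # categorical: binary
  education = abs(match(r1$education, edu_levels) -
                  match(r2$education, edu_levels)) /
              (length(edu_levels) - 1),                    # ordinal: level gap
  income = abs(r1$income - r2$income) / income_range,      # numeric: absolute
  bp = abs(r1$bp - r2$bp) / bp_range                       # numeric: absolute
)

# Hypothetical weights that sum to 1
w <- c(sex = 0.25, education = 0.25, income = 0.25, bp = 0.25)
d_r <- sum(w * d_cols)
d_r
```

With these assumptions the column distances are 0, 0.75, 0.24, and 0.15, giving \(d_r = 0.285\).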

Again, we can create a representative dataset using the last iteration of the results from running the record linkage model using `blink`. This time we need to specify some more parameters, such as the types of the columns. This is accomplished using the `represent()` function, passing `proto_minimax` as the type of prototyping.

```
# additional parameters for minimax prototyping
# need column types, the order levels for any ordinal variables, and column weights
col_type <- c("string", "string", "numeric", "numeric", "numeric", "categorical", "ordinal", "numeric", "numeric")
orders <- list(education = c("Less than a high school diploma", "High school graduates, no college", "Some college or associate degree", "Bachelor's degree only", "Advanced degree"))
weights <- c(.25, .25, .05, .05, .1, .15, .05, .05, .05)
# ids for representative records (minimax)
minimax_id <- represent(rl_reg1, linkage.rl[nrow(linkage.rl),], "proto_minimax",
                        distance = dist_col_type, col_type = col_type,
                        weights = weights, orders = orders, scale = TRUE, parallel = FALSE)
rep_minimax <- rl_reg1[minimax_id,] # representative records (minimax)
```

We can have a look at some of the representative records chosen via minimax prototyping.

row | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
240 | callum | lecke | 8 | 5 | 1981 | M | Advanced degree | 86 | 126 |
353 | lachlan | ebert | 5 | 15 | 1954 | M | Some college or associate degree | 42 | 150 |
162 | francesco | petito | 6 | 16 | 1972 | M | High school graduates, no college | 41 | 151 |
6 | phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
315 | peter | chittleborough | 7 | 11 | 1980 | M | High school graduates, no college | 36 | 153 |

Our third proposal to choose a representative record is by aggregating the records (in each cluster) to form a composite record that includes information from each linked record. The form of aggregation can depend on the column type, and the aggregation itself can be weighted by some prior knowledge of the data sources or use the posterior information from the record linkage model. For quantitative variables, we use a weighted arithmetic mean to combine linked values, whereas for categorical variables, a weighted majority vote is used. For string variables, we use a weighted majority vote for each character, which allows for noisy strings to differ on a continuum. This is accomplished using the `represent()` function, passing `composite` as the type of prototyping.

```
# representative records (composite)
rep_composite <- represent(rl_reg1, linkage.rl[nrow(linkage.rl),], "composite", col_type = col_type, parallel = FALSE)
```
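The aggregation rules can be tried by hand on the four linked records shown earlier (rows 370 to 373). The sketch below is a simplified stand-in for the composite method, not the package's implementation: the helpers `vote()` and `vote_string()` are hypothetical, the weights are assumed equal, shorter strings are padded with empty characters, and ties go to the value that appears first.

```r
# The four linked records from the example cluster (rows 370 to 373)
cl <- data.frame(
  fname  = c("elenys", "elen7", "dleny", "eleni"),
  lname  = c("reit", "reicl", "rejd", "reid"),
  sex    = c("F", "F", "M", "F"),
  income = c(28, 52, 63, 52),
  bp     = c(109, 118, 109, 109),
  stringsAsFactors = FALSE
)
w <- rep(1 / nrow(cl), nrow(cl))  # equal weights (an assumption)

# Weighted majority vote; ties go to the value that appears first
vote <- function(x, w) {
  u <- unique(x)
  totals <- vapply(u, function(v) sum(w[x == v]), numeric(1))
  u[which.max(totals)]
}

# Character-wise weighted vote; substr() returns "" past the end of a string
vote_string <- function(s, w) {
  n <- max(nchar(s))
  chars <- vapply(seq_len(n), function(k) vote(substr(s, k, k), w), character(1))
  paste(chars, collapse = "")
}

composite <- data.frame(
  fname  = vote_string(cl$fname, w),
  lname  = vote_string(cl$lname, w),
  sex    = vote(cl$sex, w),
  income = weighted.mean(cl$income, w),
  bp     = weighted.mean(cl$bp, w)
)
composite
```

Under these assumptions the composite record is `eleny`, `reid`, `F`, with income 48.75 and blood pressure 111.25, values that no single record in the cluster carries together.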

We can have a look at some of the representative records.

row | fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|---|
240 | callum | lecke | 8 | 5.0 | 1981 | M | Advanced degree | 86.0 | 126 |
353 | lachlan | ebert | 5 | 15.0 | 1954 | M | Some college or associate degree | 42.0 | 150 |
158 | francesco | petito | 6 | 16.4 | 1972 | M | High school graduates, no college | 52.6 | 150 |
6 | phoebe | green | 8 | 8.0 | 1957 | F | High school graduates, no college | 32.0 | 128 |
315 | peter | chittleborough | 7 | 11.0 | 1980 | M | High school graduates, no college | 36.0 | 153 |

Our fourth proposal to choose a representative record utilizes the minimax prototyping method in a fully Bayesian setting. This is desirable as the posterior distribution of the linkage is used to weight the downstream tasks, which allows the error from the record linkage task to be naturally propagated into the downstream task.

We propose two methods for utilizing the posterior prototyping (PP) weights — a weighted downstream task and a thresholded representative data set based on the weights. As already mentioned, PP weights naturally propagate the linkage error into the downstream task, which we now explain. For each MCMC iteration from the Bayesian record linkage model, we obtain the most representative records using minimax prototyping and then compute the probability of each record being selected over all MCMC iterations. The posterior prototyping (PP) probabilities can then either be used as weights for each record in the regression or as a thresholded variant where we only include records whose PP weights are above \(0.5\). Note that a record with PP weight above 0.5 has a posterior probability greater than 0.5 of being chosen as a prototype and should be included in the final data set.

```
# Posterior prototyping weights
pp_weights <- pp_weights(rl_reg1, linkage.rl[seq(80000, 100000, by = 100), ],
                         "proto_minimax", distance = dist_col_type,
                         col_type = col_type, weights = weights, orders = orders,
                         scale = TRUE, parallel = FALSE)
```
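The idea behind the PP probabilities can be sketched on a toy example: given the prototypes selected at each MCMC iteration, the PP probability of a record is the share of iterations in which it was selected. The list `protos_by_iter` below is hypothetical, and this is only a schematic of the computation described above, not the internals of `pp_weights()`.

```r
n_records <- 5

# Record ids chosen as prototypes at each of 4 MCMC iterations (hypothetical)
protos_by_iter <- list(c(1, 3, 5), c(1, 3, 5), c(2, 3, 5), c(1, 3, 4, 5))

# One column per iteration: TRUE where the record was selected as a prototype
selected <- vapply(protos_by_iter,
                   function(ids) seq_len(n_records) %in% ids,
                   logical(n_records))

# PP probability: proportion of iterations in which each record is a prototype
pp <- rowMeans(selected)
pp
```

In this toy example records 3 and 5 are selected in every iteration, record 1 in three of four, and records 2 and 4 in one of four.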

We can look at the minimax PP weights distribution for the true and duplicated records in the data set as an example. Note that the true records consistently have higher PP weights and the proportion of duplicated records with high weights is relatively low.

We can make a representative dataset with these weights by using the cutoff of \(0.5\), and look at some of the records.

fname | lname | bm | bd | by | sex | education | income | bp |
---|---|---|---|---|---|---|---|---|
jasmine | sirotic | 3 | 31 | 1972 | F | High school graduates, no college | 32 | 127 |
hugo | white | 6 | 9 | 1958 | M | Bachelor’s degree only | 72 | 134 |
madeline | burgemeifter | 12 | 9 | 1967 | F | Some college or associate degree | 30 | 130 |
kyle | clarke | 4 | 9 | 1952 | M | Advanced degree | 90 | 125 |
livia | braciak | 11 | 6 | 1950 | F | High school graduates, no college | 27 | 134 |
phoebe | green | 8 | 8 | 1957 | F | High school graduates, no college | 32 | 128 |
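The two uses of PP weights, thresholding and weighting, can be contrasted on a small toy data frame. All values below are hypothetical. In the vignette's own objects, the thresholded data set would presumably be built along the lines of `rl_reg1[pp_weights > 0.5, ]`; that is an assumption about code not shown here, and it presumes `pp_weights` holds one weight per record.

```r
# Toy records with hypothetical PP weights
toy <- data.frame(
  income = c(32, 72, 30, 90, 27, 45),
  bp     = c(127, 134, 130, 125, 134, 129),
  pp     = c(0.95, 0.80, 0.10, 0.60, 0.45, 1.00)
)

# Thresholded variant: keep only records with PP weight above 0.5
toy_thresh <- toy[toy$pp > 0.5, ]

# Weighted variant: keep every record and weight it by its PP weight
fit_weighted <- lm(bp ~ income, data = toy, weights = pp)

# Regression on the thresholded data set, for comparison
fit_thresh <- lm(bp ~ income, data = toy_thresh)

nrow(toy_thresh)
coef(fit_weighted)
coef(fit_thresh)
```

The weighted fit lets low-weight records contribute a little, while the thresholded fit drops them entirely.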

These four proposed methods each have potential benefits. The goal of prototyping is to select the correct representations of latent entities as often as possible; however, uniform random selection has no means to achieve this goal. Turning to minimax selection, if a distance function can accurately reflect the distance between pairs of records in the data set, then this method may perform well. Alternatively, composite records necessarily alter the data for all entities with multiple copies in the data, affecting some downstream tasks (like linear regression) heavily. The ability of posterior prototyping to propagate record linkage error to the downstream task is an attractive feature and a great strength of the Bayesian paradigm. In addition, the ability to use the entire posterior distribution of the linkage structure also poses the potential for superior downstream performance.

We can evaluate the performance of our methods by assessing the distributional closeness of the representative dataset to the true records. The distributional closeness of the representative datasets to the true records is useful because one of the benefits of using a two-stage approach to record linkage and downstream analyses is the ability to perform multiple analyses with the same data set. As such, downstream performance of representative records may be dependent on the type of downstream task that is being performed. In order to assess the distributional closeness of the representative data sets to the truth, we use an empirical Kullback-Leibler (KL) divergence metric. Let \(\hat{F}_{rep}(\boldsymbol x)\) and \(\hat{F}_{true}(\boldsymbol x)\) be the empirical distribution functions for the representative data set and true data set, respectively (with continuous variables transformed to categorical using a histogram approach with statistically equivalent data-dependent bins). The empirical KL divergence metric we use is then defined as \[ \hat{D}_{KL}(\hat{F}_{rep} || \hat{F}_{true}) = \sum_{\boldsymbol x} \hat{F}_{rep}(\boldsymbol x) \log\left(\frac{\hat{F}_{rep}(\boldsymbol x)}{\hat{F}_{true}(\boldsymbol x)}\right). \]
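For a single categorical variable, the empirical KL divergence above reduces to a short computation. The helper `emp_kl()` below is a hypothetical base R sketch of the formula, with made-up inputs; it does not handle the binning of continuous variables.

```r
emp_kl <- function(rep_x, true_x) {
  lev <- union(rep_x, true_x)
  # empirical probabilities over a common set of levels
  p <- as.numeric(table(factor(rep_x, levels = lev))) / length(rep_x)
  q <- as.numeric(table(factor(true_x, levels = lev))) / length(true_x)
  keep <- p > 0  # terms with p = 0 contribute 0 by convention
  # the result is Inf if the representative data has a level absent from the truth
  sum(p[keep] * log(p[keep] / q[keep]))
}

rep_sex  <- c("F", "F", "M", "M", "M")  # hypothetical representative data
true_sex <- c("F", "F", "M", "M")       # hypothetical true data

emp_kl(rep_sex, true_sex)
emp_kl(true_sex, true_sex)  # identical distributions give 0
```

Smaller values indicate that the representative data set is distributionally closer to the true records.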

This metric is accessed in `representr` using the `emp_kl_div()` command.

```
true_dat <- rl_reg1[unique(identity.rl_reg1),] # true records
emp_kl_div(true_dat, rep_random, c("sex"), c("income", "bp"))
#> [1] 0.0204802
emp_kl_div(true_dat, rep_minimax, c("sex"), c("income", "bp"))
#> [1] 0.005069378
emp_kl_div(true_dat, rep_composite, c("sex"), c("income", "bp"))
#> [1] 0.0563834
emp_kl_div(true_dat, rep_pp_thresh, c("sex"), c("income", "bp"))
#> [1] 0.004226951
```

The representative dataset based on the posterior prototyping weights is the closest to the truth using the three variables we might be interested in using for regression. This might indicate that we should use this representation in a downstream model, like linear regression.

Binette, Olivier, and Rebecca C Steorts. “(Almost) All of Entity Resolution.” *arXiv Preprint arXiv:2008.04443*. https://arxiv.org/abs/2008.04443.

Kaplan, Andee, Brenda Betancourt, and Rebecca C. Steorts. “Posterior Prototyping: Bridging the Gap Between Record Linkage and Regression.” *arXiv Preprint arXiv:1810.01538*. https://arxiv.org/abs/1810.01538.

Marchant, Neil G, Andee Kaplan, Daniel N Elazar, Benjamin IP Rubinstein, and Rebecca C Steorts. 2020. “D-Blink: Distributed End-to-End Bayesian Entity Resolution.” *Journal of Computational and Graphical Statistics* just-accepted. Taylor & Francis: 1–42.

Steorts, Rebecca C. 2015. “Entity Resolution with Empirically Motivated Priors.” *Bayesian Analysis* 10 (4). International Society for Bayesian Analysis: 849–75.

———. “Blink: Record Linkage for Empirically Motivated Priors.” https://CRAN.R-project.org/package=blink.