
Fast-ER: GPU-Accelerated Record Linkage and Deduplication in Python

Open Access | April 2026


(1) Overview

Introduction

Record linkage, also known as “entity resolution,” involves identifying records that refer to the same unit of observation across different datasets when common identifiers are absent. Deduplication, a closely related task, involves detecting duplicate or near-duplicate entries within a single dataset lacking consistent or unique identifiers. In both cases, inconsistencies in data entry, variations in naming conventions, and missing information can lead to fragmented and inaccurate representations of entities. Record linkage and deduplication address these challenges and are essential for ensuring data quality and enabling research in fields such as the social and health sciences.

Record linkage and deduplication generally involve calculating string similarity metrics for all pairs of values within or between datasets. Although these calculations are conceptually simple, the number of comparisons grows quadratically with dataset size. For instance, when linking two datasets of 1,000,000 observations each, adding just one more observation to either dataset adds 1,000,000 comparisons. This makes record linkage and deduplication prohibitively expensive and slow, even for moderately sized datasets.
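To make this scaling concrete, the following illustrative snippet (not part of the package's API) computes the number of cross-dataset pairs:

```python
def n_comparisons(n_a: int, n_b: int) -> int:
    """Number of record pairs to compare when linking two datasets."""
    return n_a * n_b

base = n_comparisons(1_000_000, 1_000_000)      # 10**12 candidate pairs
plus_one = n_comparisons(1_000_001, 1_000_000)  # one extra record in dataset A

print(plus_one - base)  # adding one record adds 1,000,000 comparisons
```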

The Fast-ER package harnesses the computational power of GPUs to accelerate these calculations dramatically. It estimates the widely used Fellegi-Sunter probabilistic model and performs the computationally intensive preprocessing steps, such as calculating string similarity metrics, rapidly and accurately on CUDA-enabled GPUs.
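For intuition about the Fellegi-Sunter model, the sketch below computes the conditional match probability for a single agreement pattern under a simplified model with binary agreement indicators and conditional independence across fields. The m-probabilities, u-probabilities, and prior `lam` are made-up values for illustration; the package itself estimates these quantities from the data and works with discrete similarity levels rather than binary indicators.

```python
import math

def match_posterior(pattern, m_probs, u_probs, lam):
    """Conditional match probability for one agreement pattern.

    pattern: tuple of 0/1 agreement indicators, one per field
    m_probs[k]: P(field k agrees | records match)
    u_probs[k]: P(field k agrees | records do not match)
    lam: prior probability that a random pair is a match
    """
    log_m = sum(math.log(m if a else 1 - m) for a, m in zip(pattern, m_probs))
    log_u = sum(math.log(u if a else 1 - u) for a, u in zip(pattern, u_probs))
    num = lam * math.exp(log_m)
    return num / (num + (1 - lam) * math.exp(log_u))

# Two fields agree, one disagrees, under illustrative parameter values:
p = match_posterior((1, 1, 0), m_probs=[0.95, 0.9, 0.8],
                    u_probs=[0.05, 0.1, 0.2], lam=0.01)
```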

Implementation and architecture

Fast-ER provides a suite of classes engineered to serve as modules within record linkage and deduplication pipelines, customizable to the specific needs and requirements of users. Each class accepts user-defined inputs and produces relevant outputs, some of which can be passed to other classes for further processing.

Below are two figures illustrating record linkage and deduplication pipelines constructed using Fast-ER’s modular classes.

The record linkage process depicted in Figure 1 begins by providing the following inputs to the Comparison class: (i) two datasets (df_A and df_B), (ii) variables designated for fuzzy matching (Vars_Fuzzy_A and Vars_Fuzzy_B), and (iii) variables designated for exact matching (Vars_Exact_A and Vars_Exact_B). The class evaluates all record pairs across the two datasets, producing an array that records the frequency of each combination of discrete similarity levels across all variables, stored in the Counts attribute. This array serves as the primary input to the Estimation class, which treats it as a sufficient statistic to estimate the conditional match probability for each similarity pattern. Finally, together with the list of indices corresponding to each pattern provided by the Comparison class, these conditional match probabilities are passed to the Linkage class. This class generates a dataset containing all observation pairs with a conditional match probability exceeding a user-defined threshold.
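The role of the Counts array as a sufficient statistic can be illustrated with a toy stand-in for the Comparison step (this is not the package's API; the crude prefix test merely stands in for a real string similarity metric discretized into levels):

```python
from collections import Counter

# Two toy datasets: (name, birth_year) per record.
df_a = [("ann", 1980), ("bob", 1975)]
df_b = [("anne", 1980), ("rob", 1990)]

def pattern(rec_a, rec_b):
    """Discrete similarity levels for one record pair (0 = disagree, 1 = agree)."""
    name_sim = 1 if rec_a[0][:3] == rec_b[0][:3] else 0  # crude fuzzy proxy
    year_sim = 1 if rec_a[1] == rec_b[1] else 0          # exact comparison
    return (name_sim, year_sim)

# Frequency of each combination of similarity levels across all pairs:
counts = Counter(pattern(a, b) for a in df_a for b in df_b)
```

Estimation then needs only these pattern frequencies, not the individual pairs, to fit the model, which is why the Counts array alone suffices as its input.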

Figure 1

Illustration of Record Linkage Pipeline. The boxes represent the package’s classes, the labels indicate user inputs and computed outputs, and the arrows illustrate the relationships between them.

The deduplication pipeline illustrated in Figure 2 begins by providing the following inputs to the Deduplication class: (i) a dataset (df), (ii) variables designated for fuzzy matching (Vars_Fuzzy), and (iii) variables designated for exact matching (Vars_Exact). The class compares the values of all record pairs in the dataset, producing an array with the frequency of each combination of discrete similarity levels across all variables, stored in the Counts attribute. The remainder of the process mirrors the record linkage pipeline. The Counts array serves as the primary input to the Estimation class, which uses it as a sufficient statistic to estimate the conditional match probability for each similarity pattern. Together with the list of indices corresponding to each pattern from the Deduplication class, the conditional match probabilities are the main inputs for the Linkage class, which generates a dataset containing all observation pairs with a conditional match probability exceeding a user-defined threshold.
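The key structural difference from record linkage is the candidate-pair space: deduplication compares every unordered pair within one dataset, yielding n(n − 1)/2 pairs for n records. A minimal illustration (not the package's API):

```python
from itertools import combinations

records = ["ann", "anne", "bob", "ann"]

# Every unordered pair of record indices within the single dataset:
pairs = list(combinations(range(len(records)), 2))

# n records yield n * (n - 1) / 2 candidate pairs.
assert len(pairs) == len(records) * (len(records) - 1) // 2
```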

Figure 2

Illustration of Deduplication Pipeline. The boxes represent the package’s classes, the labels indicate user inputs and computed outputs, and the arrows illustrate the relationships between them.

Quality control

To evaluate the execution time of our GPU-accelerated record linkage library, we compare it to three existing libraries: fastLink [2, 3], RecordLinkage [4], and splink [5]. From the outset, note that all packages implement identical operations, yield the same results, and achieve the same accuracy. Our experiment consists of linking two extracts of North Carolina voter registration rolls of varying sizes (from 1,000 to 100,000 records), performing fuzzy matching on first names, last names, house numbers, and street names, and exact matching on birth years. The datasets share 50% overlapping records. To simulate real-world data imperfections, we introduced noise into 5% of the records using various transformations: character addition, deletion, substitution, adjacent-character swapping, and random value shuffling.
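The character-level transformations used to inject noise can be sketched as follows. This is an illustrative reimplementation, not the script used in our experiments; the alphabet, seed, and 5% rate are stand-in choices.

```python
import random

def perturb(s: str, rng: random.Random) -> str:
    """Apply one random character-level transformation of the kinds
    described above: addition, deletion, substitution, or adjacent swap."""
    if not s:
        return s
    op = rng.choice(["add", "delete", "substitute", "swap"])
    i = rng.randrange(len(s))
    c = rng.choice("abcdefghijklmnopqrstuvwxyz")
    if op == "add":
        return s[:i] + c + s[i:]
    if op == "delete":
        return s[:i] + s[i + 1:]
    if op == "substitute":
        return s[:i] + c + s[i + 1:]
    j = min(i + 1, len(s) - 1)  # adjacent-character swap
    return s[:i] + s[j] + s[i] + s[j + 1:] if j > i else s

rng = random.Random(0)
noisy = [perturb(name, rng) if rng.random() < 0.05 else name
         for name in ["smith"] * 100]  # noise hits roughly 5% of records
```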

We benchmarked Fast-ER on a Google Colab instance equipped with a dual-core CPU, 16 GB of RAM, and a Tesla T4 GPU with 8 GB of GPU memory. The other libraries were benchmarked on a MacBook Pro with a 10-core Apple M1 Pro processor and 32 GB of unified memory. Since this setup offers more CPU cores and system memory than the Google Colab environment, fastLink, RecordLinkage, and splink are expected to perform better on it, and our experiments confirm this. The comparison therefore favors fastLink, RecordLinkage, and splink in terms of available CPU and memory resources; it also more accurately reflects the computational resources typically available to researchers. Fast-ER itself could not be benchmarked on the MacBook Pro because Apple Silicon does not support CUDA. The splink library was executed with its default database backend, DuckDB. For greater comparability, we also report results from a benchmarking experiment in which all packages were executed on the same Google Colab instance described above.

Figure 3 illustrates the execution times of the benchmarked libraries across different dataset sizes. Only comparisons that were completed before running out of memory are included. This figure shows that our GPU-accelerated implementation achieves speeds over 35 times faster than the leading CPU-powered implementation, fastLink, and outperforms all other libraries tested.

Figure 3

Record Linkage Performance Comparison: Fast-ER Runs Over 35 Times Faster Than fastLink.

We also assessed Fast-ER’s performance in deduplicating voter registration roll extracts identical to those described above. As illustrated in Figure 4, our GPU-accelerated implementation runs more than 60 times faster than fastLink.

Figure 4

Deduplication Performance Comparison: Fast-ER Runs Over 60 Times Faster Than fastLink.

Finally, as illustrated in Figures 5 and 6, when all packages were executed on a single Google Colab instance, our package performed even more favorably: it was up to 200 times faster than fastLink and completed execution across all sample sizes without exhausting memory, whereas the other packages failed to do so for the full range of sample sizes.

Figure 5

Record Linkage Performance Comparison: Fast-ER Runs Over 200 Times Faster than fastLink When Both Are Executed on a Google Colab Instance.

Figure 6

Deduplication Performance Comparison: Fast-ER Runs Over 200 Times Faster than fastLink When Both Are Executed on a Google Colab Instance.

(2) Availability

Operating system

Windows and Linux

Programming language

Python

Additional system requirements

A discrete NVIDIA CUDA GPU with a Compute Capability of 3.0 or higher, along with the necessary drivers, is needed to install and use this library. CUDA-enabled GPUs are freely available through Google Colab.

Dependencies

CuPy, Matplotlib, NumPy, and Pandas.

List of contributors

Jacob Morrier led the development of the software library and its accompanying documentation, with contributions and support from Sulekha Kishore. Project management was overseen by R. Michael Alvarez.

Software location

Archive

Code repository

Language

English

(3) Reuse potential

“Hard” merges and deduplication are inherently fragile: even minor variations can invalidate a potential match between two records. This sensitivity poses a critical issue when linking data from diverse sources, each adhering to distinct database management standards and lacking consistent identifiers. Similarly, inconsistencies in database management or human errors can cause “hard” deduplication methods to fail. Probabilistic record linkage and deduplication provide robust solutions to these challenges.

Although record linkage and deduplication methods have existed for some time, their high computational costs and long processing times have made them impractical for even moderately sized datasets. Fast-ER addresses this challenge by harnessing the computational power of GPUs to dramatically accelerate these calculations. While applicable to any scenario requiring probabilistic record linkage and deduplication, this solution is particularly valuable for handling moderate to large datasets in both academic research and industry. For example, it can be used to evaluate the quality of voter registration data, a critical factor for election integrity [6]. In the health sciences, record linkage can be used to integrate patient data from multiple providers, pharmacies, and laboratories to gain a comprehensive view of patient history, disease progression, and treatment outcomes.

Fast-ER can be easily integrated as a module within ETL pipelines and orchestrated using workflow management tools such as Apache Airflow or Snakemake, provided that appropriate GPU resource constraints are defined. For example, a record linkage operation can be implemented as a task within an Airflow directed acyclic graph (DAG) [7]. Users with access to multiple GPUs can further enhance performance by partitioning record linkage or deduplication workloads into smaller subtasks and executing them concurrently across devices. This parallelization can be efficiently managed through CuPy's device and memory management features. As GPUs become more affordable and widely available, they will increasingly provide a powerful, low-barrier solution for large-scale entity resolution. In contrast, other solutions (e.g., splink) have sought to address scalability issues by leveraging big data backends like AWS Athena or Apache Spark [5]. While these tools are widely used in industry, they are less commonly adopted in academic research, where access to such infrastructure is often limited.
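One way to partition such a workload is to split the rows of one dataset into contiguous blocks, one per device, each compared against the full second dataset independently. The sketch below shows only the partitioning arithmetic in plain Python; in an actual multi-GPU setup, the per-block work would run under CuPy's device context manager (`with cupy.cuda.Device(d):`).

```python
def partition_rows(n_rows: int, n_devices: int):
    """Split row indices into contiguous, nearly equal blocks, one per
    device; each block can be processed concurrently and independently."""
    base, extra = divmod(n_rows, n_devices)
    blocks, start = [], 0
    for d in range(n_devices):
        size = base + (1 if d < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

blocks = partition_rows(10, 3)  # three blocks of sizes 4, 3, and 3
```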

Users are encouraged to contribute to the library by submitting pull requests or opening issues on GitHub for feature suggestions, bug reports, or support requests.

Author Contributions

J.M. and S.K. contributed to this article while serving as a Ph.D. candidate and an undergraduate student, respectively, at the California Institute of Technology.

DOI: https://doi.org/10.5334/jors.556 | Journal eISSN: 2049-9647
Language: English
Submitted on: Jan 28, 2025
Accepted on: Apr 7, 2026
Published on: Apr 21, 2026
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2026 Jacob Morrier, Sulekha Kishore, R. Michael Alvarez, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.