Improved suffix blocking for record linkage and entity resolution

Type
Article

Authors
Allam, Amin
Skiadopoulos, Spiros
Kalnis, Panos

KAUST Department
Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division
Computer Science Program

Online Publication Date
2018-07-30

Print Publication Date
2018-09

Date
2018-07-30

Abstract
Record linkage is the problem that identifies the different records that represent the same real-world object. Entity resolution is the problem that ensures that a real-world object is represented by a single record. The incremental versions of record linkage and entity resolution address the respective problems after the insertion of a new record in the dataset. Record linkage, entity resolution and their incremental versions are of paramount importance and arise in several contexts such as data warehouses, heterogeneous databases and data analysis. Blocking techniques are usually utilized to address these problems in order to avoid comparing all record pairs. Suffix blocking is one of the most efficient and accurate blocking techniques. In this paper, we consider the non-incremental variation of record linkage and present a method that is more than five times faster and achieves similar accuracy to the current state-of-the-art suffix-based blocking method. Then, we consider the incremental variation of record linkage and propose a novel incremental suffix-based blocking mechanism that outperforms existing incremental blocking methods in terms of blocking accuracy and efficiency. Finally, we consider incremental entity resolution and present two novel techniques based on suffix blocking that are able to handle the tested dataset in a few seconds (while a current state-of-the-art technique requires more than eight hours). Our second technique proposes a novel method that keeps a history of the deleted records and the merging process. Thus, we are able to discover alternative matches for the inserted record that are not possible for existing methods and improve the accuracy of the algorithm. We have implemented and extensively experimentally evaluated all our methods. We offer two implementations of our proposals. The first one is memory-based and offers the best efficiency while the second one is disk-based and scales seamlessly to very large datasets.

Citation
Allam A, Skiadopoulos S, Kalnis P (2018) Improved suffix blocking for record linkage and entity resolution. Data & Knowledge Engineering 117: 98–113. Available: http://dx.doi.org/10.1016/j.datak.2018.07.005.

Acknowledgements
This work was supported by King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.

Publisher
Elsevier BV

Journal
Data & Knowledge Engineering

DOI
10.1016/j.datak.2018.07.005

Additional Links
https://www.sciencedirect.com/science/article/pii/S0169023X16300714

Permanent link to this record