TU Berlin

Big Data Management Group


CLRL: Feature Engineering for Cross-Language Record Linkage


Record linkage aims at identifying duplicate records across datasets. Most existing record linkage techniques have been designed for monolingual datasets.

In this project, we propose CLRL, a novel approach that links records in a cross-language setting, where each input dataset is in a different language. CLRL combines monolingual similarity measures with cross-language word embedding similarities to identify corresponding records across datasets. In our experiments, CLRL outperforms baseline approaches in cross-language data integration settings.
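
To make the feature-combination idea concrete, here is a minimal sketch of how a feature vector for a candidate record pair might pair a monolingual string similarity with a cross-language embedding similarity. It is not CLRL's actual implementation: the toy embedding table, the function names (`string_similarity`, `embedding_similarity`, `pair_features`), and the choice of exactly two features are assumptions made purely for illustration. A real system would presumably load pretrained, aligned cross-lingual embeddings instead of a hand-written table.

```python
import math
from difflib import SequenceMatcher

# Hypothetical toy embeddings: English and German words are assumed to share
# one vector space. These values are invented for illustration only.
TOY_EMBEDDINGS = {
    "house": [0.90, 0.10, 0.00],
    "haus": [0.88, 0.12, 0.02],
    "red": [0.10, 0.90, 0.00],
    "rot": [0.12, 0.85, 0.05],
    "car": [0.00, 0.10, 0.90],
}


def string_similarity(a, b):
    """Monolingual-style feature: character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def mean_vector(text):
    """Average the embeddings of all known tokens; None if no token is known."""
    vectors = [TOY_EMBEDDINGS[t] for t in text.lower().split() if t in TOY_EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def embedding_similarity(a, b):
    """Cross-language feature: cosine of the averaged token embeddings."""
    u, v = mean_vector(a), mean_vector(b)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)


def pair_features(record_a, record_b):
    """Feature vector for one candidate pair, e.g. as input to a classifier."""
    return [
        string_similarity(record_a, record_b),
        embedding_similarity(record_a, record_b),
    ]


if __name__ == "__main__":
    print(pair_features("red house", "rot haus"))
    print(pair_features("red house", "car"))
```

In this toy setup, the translated pair "red house" / "rot haus" scores much higher on the embedding feature than the unrelated pair "red house" / "car". That is the kind of signal a purely monolingual string measure can miss when the two languages share few characters.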

Check out the project repository and contact the author.

