REDS: Estimating the Performance of Error Detection Strategies Based on Dirtiness Profiles
- © Mohammad Mahdavi
Datasets usually suffer from various data quality problems, or data errors. At the same time, there are many error detection strategies, each designed to detect different kinds of data errors. To detect the errors in a dataset effectively, the user has to deploy and test multiple error detection strategies. However, evaluating each error detection strategy on a new dataset requires tedious manual evaluation. Therefore, estimating the performance of each strategy upfront is desirable for more effective strategy selection.
In this project, we propose a new approach to estimate the performance of error detection strategies. The intuition is that error detection strategies will perform similarly on similarly dirty datasets. Therefore, we introduce the novel concept of dirtiness profiles, which make datasets comparable with respect to their dirtiness. Based on the similarity of dirtiness profiles, we estimate the expected performance of the available error detection strategies on the new dataset.
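To make the intuition concrete, the following is a minimal sketch of similarity-based performance estimation, not the project's actual implementation. It assumes that a dirtiness profile can be represented as a numeric feature vector, that profiles are compared with cosine similarity, and that the estimate is a similarity-weighted average of scores observed on previously evaluated datasets. The feature values, strategy names, and scores are hypothetical and serve only as illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def estimate_performance(new_profile, historical):
    """Estimate per-strategy scores for a new dataset.

    `historical` is a list of (profile, scores) pairs, where `scores` maps a
    strategy name to the score measured on that already-evaluated dataset.
    Each historical score is weighted by the similarity between that dataset's
    dirtiness profile and the new dataset's profile.
    """
    weighted_sum = {}
    weight_total = {}
    for profile, scores in historical:
        # Negative similarities are clipped so they cannot produce negative weights.
        sim = max(cosine_similarity(new_profile, profile), 0.0)
        for strategy, score in scores.items():
            weighted_sum[strategy] = weighted_sum.get(strategy, 0.0) + sim * score
            weight_total[strategy] = weight_total.get(strategy, 0.0) + sim
    return {
        strategy: weighted_sum[strategy] / weight_total[strategy]
        for strategy in weighted_sum
        if weight_total[strategy] > 0
    }


# Hypothetical profiles: [missing-value rate, outlier rate, pattern-violation rate].
historical_datasets = [
    ([0.30, 0.02, 0.05], {"outlier_detector": 0.35, "pattern_checker": 0.60}),
    ([0.01, 0.25, 0.03], {"outlier_detector": 0.80, "pattern_checker": 0.20}),
]
new_dataset_profile = [0.02, 0.20, 0.04]

estimates = estimate_performance(new_dataset_profile, historical_datasets)
best_strategy = max(estimates, key=estimates.get)
```

Because the hypothetical new profile resembles the second historical dataset more closely, that dataset's scores dominate the estimate. A weighted average is only one possible choice here; a regression model trained on profile features would be another way to turn profile similarity into a performance estimate.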
Check out the project repository and contact the author.