Ensembling Curation Strategies
High-quality data is crucial in data science, machine learning, and AI. As the amount of digitally collected data grows, data scientists spend the majority of their time on data curation. Many factors can degrade data quality, including the data generation process, incorrect information extraction, erroneous entries made by humans, missing entries, and incorrect inference. Detecting data errors is therefore a crucial and expensive first step in selecting and processing data.
In this project, we address the following questions: How can we effectively combine different error detection strategies? How can the characteristics of the data support data curation?
We propose a holistic error detection method that relies on the output of different data cleaning systems and on automatically extracted metadata. We treat the aggregation of the data cleaning systems' outputs as a classification task and augment the system features with automatically generated metadata, which improves error prediction.
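The idea of treating aggregation as classification can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the project itself: the two detector flags per cell, the two metadata features (empty cell, contains a digit), the toy labeled sample, and the choice of a simple perceptron as the classifier. It only shows the general shape of the approach: per-cell detector outputs and metadata are concatenated into one feature vector, and a classifier learns from a small labeled sample how to combine them.

```python
from typing import List, Sequence, Tuple


def extract_metadata(value: str) -> List[float]:
    """Hypothetical metadata features for a single cell value."""
    stripped = value.strip()
    return [
        1.0 if stripped == "" else 0.0,                         # cell is empty
        1.0 if any(ch.isdigit() for ch in stripped) else 0.0,   # cell contains a digit
    ]


def build_features(detector_flags: Sequence[int], value: str) -> List[float]:
    """Concatenate detector outputs (1 = flagged as error) with metadata features."""
    return [float(flag) for flag in detector_flags] + extract_metadata(value)


def predict(weights: Sequence[float], bias: float, features: Sequence[float]) -> int:
    """Return 1 (error) if the linear score is positive, else 0 (clean)."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if score > 0 else 0


def train_perceptron(
    X: Sequence[Sequence[float]], y: Sequence[int], epochs: int = 1000
) -> Tuple[List[float], float]:
    """Plain perceptron; stops early once a full pass makes no mistakes."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for features, label in zip(X, y):
            step = label - predict(weights, bias, features)  # +1, 0, or -1
            if step != 0:
                weights = [w + step * x for w, x in zip(weights, features)]
                bias += step
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


# Toy labeled sample for a hypothetical "city" column:
# (flags from two assumed detectors, cell value, 1 = truly an error).
labeled_cells = [
    ((1, 1), "Berlin1", 1),
    ((1, 0), "Berlin", 0),   # false positive of the first detector
    ((0, 1), "Paris", 0),    # false positive of the second detector
    ((0, 0), "", 1),         # missed by both detectors, caught via metadata
    ((0, 0), "Rome", 0),
    ((1, 0), "R0me", 1),
    ((0, 1), "", 1),
    ((1, 1), "Berlni", 1),   # both detectors agree, no metadata signal
]

X = [build_features(flags, value) for flags, value, _ in labeled_cells]
y = [label for _, _, label in labeled_cells]
weights, bias = train_perceptron(X, y)

# Classify an unseen empty cell that neither detector flagged.
print(predict(weights, bias, build_features((0, 0), "  ")))
```

In this toy sample a plain majority vote over the two detectors would miss the empty cells that neither detector flags, whereas the learned combination can use the metadata features to catch them. A real system would use richer metadata and a stronger classifier; the perceptron is used here only because it fits in a few lines.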
Generally, we consider three different aspects that are relevant to data cleaning: data, existing algorithms and augmentation information. In this project, we investigate the impact of different combinations of these aspects in order to provide effective data curation.
Copyright TU Berlin 2008