Ensembling Curation Strategies
High-quality data is crucial in data
science, machine learning and AI. As the amount of digitally
collected data grows, data scientists spend the majority of
their time on data curation. Several factors can degrade data
quality, including data generation, incorrect information extraction,
erroneous entries made by humans, missing entries and incorrect
inference. Detecting data errors is therefore a crucial
and expensive first step when selecting and processing
data.
In this project, we address the following questions: How can
we effectively combine different error detection strategies? How can
the characteristics of the data support data curation?
We propose a holistic error detection method that relies on
the output of different data cleaning systems and on automatically
extracted metadata. We treat the aggregation of the systems'
output as a classification task and augment the system features with
automatically generated metadata, which improves error
prediction.
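The idea of treating aggregation as a classification task can be illustrated with a minimal sketch. It is not the project's implementation: the three detector flags, the `is_empty` metadata feature, the toy cells and the choice of a plain logistic regression are all assumptions made for illustration. Each cell becomes a feature vector of detector outputs plus metadata, and a classifier learns which combinations indicate an error. A simple majority vote over the detectors serves as a baseline.

```python
import math

# Hypothetical toy data: one row per table cell.
# Features: [flag_a, flag_b, flag_c, is_empty]
#   flag_a..flag_c: 1 if an (assumed) cleaning system marked the cell as erroneous
#   is_empty: an (assumed) metadata feature, 1 if the cell holds no value
cells = [
    [1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 1, 0],
    [0, 0, 0, 1], [1, 0, 0, 1],
    [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0],
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # 1 = cell is truly an error


def majority_vote(row):
    """Baseline: flag a cell if at least two of the three detectors flag it."""
    return 1 if sum(row[:3]) >= 2 else 0


def train_logistic(X, y, lr=1.0, epochs=10000):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, row):
    """Classify a cell as erroneous (1) if the linear score is positive."""
    z = sum(wj * xj for wj, xj in zip(w, row)) + b
    return 1 if z > 0 else 0


weights, bias = train_logistic(cells, labels)
preds = [predict(weights, bias, row) for row in cells]
votes = [majority_vote(row) for row in cells]

print("learned aggregation:", preds)
print("majority vote:      ", votes)
```

On this toy data the majority vote misses the two empty cells, because fewer than two detectors flag them, whereas the learned aggregator can use the metadata feature as additional evidence. A real evaluation would of course use held-out cells rather than the training rows.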
Generally, we consider three different aspects that are
relevant to data cleaning: data, existing algorithms and augmentation
information. In this project, we investigate the impact of different
combinations of these aspects in order to provide effective data
curation.
Copyright TU Berlin 2008