Data integration is the process of reconciling data from various
and often heterogeneous sources into a common schema and
representation. Usually, several transformation steps have to be
performed before data from different sources can be presented in a
unified form.
The prominent steps of data integration are the following:
- Data curation
- Duplicate detection
- Error detection
- Mapping to alternative representations (e.g., DE for Deutschland)
- Schema matching
- Data fusion
The challenge is to find general approaches that automate each individual task and to build a framework that supports the generation of data integration workflows.
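Three of the steps above can be illustrated with a minimal sketch: mapping alternative representations to a canonical one (as in DE for Deutschland), detecting duplicates on the normalized records, and fusing them. The mapping table, record layout, and fusion rule here are illustrative assumptions, not a prescribed method:

```python
# Assumed mapping table from alternative country representations
# to a canonical code (cf. "DE for Deutschland").
COUNTRY_MAP = {"Deutschland": "DE", "Germany": "DE", "DE": "DE"}

# Hypothetical records from two heterogeneous sources.
records = [
    {"name": "Alice Smith", "country": "Germany"},
    {"name": "alice smith", "country": "Deutschland"},
    {"name": "Bob Jones", "country": "DE"},
]

def normalize(record):
    # Mapping step: bring values into a common representation.
    return {
        "name": record["name"].strip().lower(),
        "country": COUNTRY_MAP.get(record["country"], record["country"]),
    }

def fuse(records):
    # Duplicate detection via exact match on the normalized record;
    # fusion here simply keeps the first representative per key.
    fused = {}
    for r in map(normalize, records):
        key = (r["name"], r["country"])
        fused.setdefault(key, r)
    return list(fused.values())

result = fuse(records)
print(result)  # the two Alice entries are recognized as duplicates and fused
```

Real workflows of course replace the exact-match key with similarity-based duplicate detection and a conflict-resolving fusion function; the sketch only shows how the steps compose.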
Data profiling is the generation of meta-data that is
human-perceivable and descriptive of the data and its structure. Such
meta-data ranges from statistics that are easy to obtain, such
as the number of records and attributes, over more complex
measures, such as the number of unique values per attribute, to
meta-data that is algorithmically very difficult to compute, such as unique
column combinations and functional dependencies.
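On a toy table, the spectrum of such meta-data can be sketched as follows: simple counts, unique values per attribute, and brute-force checks of a unique column combination and a functional dependency (the table and column names are illustrative; real profiling algorithms avoid this naive enumeration):

```python
# Hypothetical toy table as a list of dictionaries.
rows = [
    {"id": 1, "city": "Berlin",  "zip": "10115"},
    {"id": 2, "city": "Potsdam", "zip": "14467"},
    {"id": 3, "city": "Berlin",  "zip": "10115"},
]

# Easy-to-obtain meta-data: number of records and attributes.
num_records = len(rows)
attributes = list(rows[0])

# More complex meta-data: number of unique values per attribute.
uniques = {a: len({r[a] for r in rows}) for a in attributes}

def is_unique(cols):
    # A column combination is unique if no two rows agree on all its values.
    projections = {tuple(r[c] for c in cols) for r in rows}
    return len(projections) == num_records

def holds_fd(lhs, rhs):
    # Functional dependency lhs -> rhs: rows with equal lhs values
    # must also have equal rhs values.
    mapping = {}
    for r in rows:
        key = tuple(r[c] for c in lhs)
        if mapping.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True

print(num_records, uniques)        # 3 {'id': 3, 'city': 2, 'zip': 2}
print(is_unique(["id"]))           # True
print(is_unique(["city"]))         # False
print(holds_fd(["zip"], "city"))   # True: zip -> city holds on this table
```

The hardness mentioned above comes from the search space: with n columns there are exponentially many candidate column combinations and dependencies to test, which is why dedicated discovery algorithms are a research topic of their own.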
Research in data profiling also covers the development of data summarization and data visualization techniques that improve the manual profiling of a dataset.
Data warehouses, data lakes, and other federated databases of a company grow over time and become large and confusing, especially when subsets of the data become obsolete but are kept for fear of losing information. In fact, most companies avoid even re-organizing a database and its schema, as they fear incompatibilities with the applications and departments that use it.
This circumstance forces data consumers to perform tedious discovery and curation steps before they can apply their analytics to the data subset they are interested in.
In our current collaboration with MIT and QCRI, we are working on data-driven solutions that support users during data discovery.