TU Berlin

Big Data Management Group: Research


Data Integration


Data integration is the process of reconciling data from various, often heterogeneous sources into a common schema and representation. Usually, several transformation steps have to be performed before data from different sources can be presented in a unified schema.
The main steps of data integration are the following:

  • Data curation

    • Duplicate detection
    • Error detection

  • Data transformation

    • Formatting
    • Mapping to alternative representations (e.g., DE for Deutschland)

  • Schema matching
  • Data fusion

The challenge is to find general approaches that automate each individual task and to build a framework that supports the generation of data integration workflows.
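As an illustration of how some of the steps listed above chain together, here is a minimal Python sketch covering formatting, mapping to an alternative representation, and exact-match duplicate detection. The lookup table `COUNTRY_CODES`, the functions `transform` and `deduplicate`, and the sample records are hypothetical and chosen only for this example; practical duplicate detection usually relies on similarity measures rather than exact key equality.

```python
from typing import Dict, List

# Hypothetical lookup table for alternative representations (assumption).
COUNTRY_CODES = {"deutschland": "DE", "germany": "DE", "france": "FR"}


def transform(record: Dict[str, str]) -> Dict[str, str]:
    """Formatting and mapping: normalize whitespace and case, map country names to codes."""
    name = " ".join(record["name"].split()).title()
    country = record["country"].strip()
    # Unknown values are kept, only upper-cased.
    code = COUNTRY_CODES.get(country.lower(), country.upper())
    return {"name": name, "country": code}


def deduplicate(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Naive duplicate detection: records with the same normalized key are duplicates."""
    seen = set()
    unique = []
    for record in records:
        key = (record["name"].lower(), record["country"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Two hypothetical sources that describe overlapping entities differently.
source_a = [{"name": "  anna  schmidt", "country": "Deutschland"}]
source_b = [
    {"name": "Anna Schmidt", "country": "DE"},
    {"name": "Luc Martin", "country": "France"},
]

integrated = deduplicate([transform(r) for r in source_a + source_b])
print(integrated)
```

In this sketch the transformation runs before duplicate detection, so that "Deutschland" and "DE" are already mapped to the same value when records are compared.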

Data Profiling

Data profiling is the generation of metadata that is human-readable and describes the data and its structure. Such metadata ranges from easy-to-obtain statistics, such as the number of records and attributes, through more informative statistics, such as the number of unique values, to metadata that is algorithmically very difficult to obtain, such as unique column combinations and functional dependencies.
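The range of metadata described above can be made concrete with a small Python sketch. It is a naive, brute-force illustration on a hypothetical four-row table, not a scalable profiling algorithm; the helper names `is_unique`, `holds_fd`, and `minimal_uccs` are assumptions introduced for this example.

```python
from itertools import combinations

# Hypothetical sample table (assumption, for illustration only).
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10115", "city": "Berlin"},
    {"id": 3, "zip": "80331", "city": "Munich"},
    {"id": 4, "zip": "20095", "city": "Hamburg"},
]
columns = list(rows[0])

# Easy metadata: number of records and attributes.
num_records = len(rows)
num_attributes = len(columns)

# Slightly harder: number of distinct values per column.
distinct_counts = {c: len({r[c] for r in rows}) for c in columns}


def is_unique(rows, cols):
    """A column combination is unique if no two rows agree on all its columns."""
    values = [tuple(r[c] for c in cols) for r in rows]
    return len(set(values)) == len(values)


def holds_fd(rows, lhs, rhs):
    """The functional dependency lhs -> rhs holds if equal lhs values imply equal rhs values."""
    seen = {}
    for r in rows:
        key = tuple(r[c] for c in lhs)
        if seen.setdefault(key, r[rhs]) != r[rhs]:
            return False
    return True


def minimal_uccs(rows, columns):
    """Enumerate minimal unique column combinations by increasing size (exponential)."""
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of an already found unique combination.
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


print(num_records, num_attributes, distinct_counts)
print(minimal_uccs(rows, columns))
print(holds_fd(rows, ["zip"], "city"))
```

The first three statistics need a single pass over the data, whereas the search in `minimal_uccs` has to consider a number of column combinations that grows exponentially with the number of columns, which is why this kind of metadata is hard to obtain.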

Research in data profiling also covers the development of data summarization and data visualization techniques that support the manual profiling of a dataset.



Data Discovery

A company's data warehouses, data lakes, and other federated databases grow over time and become unwieldy and hard to navigate, especially when subsets of the data become obsolete but are kept for fear of losing information. In fact, most companies avoid even reorganizing a database and its schema, as they fear incompatibilities with the applications and departments that use it.

As a result, data consumers have to perform laborious discovery and curation steps before they can run their analytics on the subset of the data they are interested in.

In our current collaboration with MIT and QCRI, we are working on data-driven solutions that support users during their discovery tasks.

