Outlier explanation on data streams
- © Mahdi Esmailoghli
Current sophisticated machine learning algorithms, such as Deep Neural Networks (DNNs), give data scientists high predictive power: they can train very accurate models and apply them to new, unseen data points.
The problem most machine learning algorithms face is their lack of explainability. For instance, a deep neural network cannot explain why a specific point was classified into a particular class. Because current machine learning algorithms cannot explain the models they train, researchers have developed new systems dedicated to explanation.
Explanations are needed in many domains. Outlier explanation is one of the most important applications, in which the end user expects reasonable explanations. Currently, many algorithms and systems are built to describe outlier points using raw feature values.
However, sometimes the machine learning model has high accuracy, yet explanation systems cannot provide information that distinguishes outliers from inliers; this happens when there is a lack of correlation between the feature set and the target.
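The problem can be illustrated with a toy example. Below is a minimal sketch (all data and names are hypothetical) in which the outlier label is driven by a hidden external factor rather than by the raw features, so per-feature correlations with the target stay near zero and raw-value explanations have nothing to work with:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 points with 2 raw features.
X = rng.normal(size=(200, 2))

# The outlier label depends on a hidden external factor, not on X.
hidden = rng.normal(size=200)
y = (hidden > 1.5).astype(float)

# Pearson correlation between each raw feature and the outlier label
# is weak, so explanations over raw features cannot separate outliers.
for j in range(X.shape[1]):
    r = np.corrcoef(X[:, j], y)[0, 1]
    print(f"feature {j}: corr with outlier label = {r:.3f}")

# If the hidden factor were integrated as an extra feature,
# it would correlate much more strongly with the target.
print(f"hidden factor: corr = {np.corrcoef(hidden, y)[0, 1]:.3f}")
```

This is exactly the situation data integration addresses: the informative signal exists, but outside the original feature set.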
In this research project, we aim to solve the lack-of-correlation problem from the perspective of data integration. We add external information to the main dataset that introduces correlation between the feature set and the target value.
We built a prototype of our system for the BTW Data Science Challenge 2019. This prototype won the first prize in the Data Science Challenge, and a workshop paper was published at BTW 2019, which is available here.
The dataset used in the BTW Data Science Challenge is pollution sensor data whose most important features are time and location. In order to explain the most polluted areas in the city (Berlin in our case), we integrate the main dataset with weather, air traffic, public events, and OpenStreetMap data. These external data sources add correlation to the pollution dataset and enrich the data, which enables explanation methods to produce better explanations.
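Since time and location are the key features, the enrichment step amounts to joining external tables on those keys. The sketch below shows the idea with pandas; the table and column names are illustrative assumptions, not the actual challenge schema:

```python
import pandas as pd

# Hypothetical pollution readings keyed by timestamp and district.
pollution = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-03-01 08:00", "2019-03-01 09:00"]),
    "district": ["Mitte", "Mitte"],
    "pm10": [48.0, 73.5],
})

# Hypothetical external weather table sharing the same keys.
weather = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-03-01 08:00", "2019-03-01 09:00"]),
    "district": ["Mitte", "Mitte"],
    "wind_speed": [5.2, 0.8],
    "humidity": [0.61, 0.77],
})

# A left join on time and location enriches each pollution reading
# with external features that may correlate with the target.
enriched = pd.merge(pollution, weather,
                    on=["timestamp", "district"], how="left")
print(enriched.columns.tolist())
# ['timestamp', 'district', 'pm10', 'wind_speed', 'humidity']
```

A left join keeps every pollution reading even when an external source has no matching row, which matters when sensors report more frequently than the external feeds.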
Currently, we are working on automatically adding external information to any dataset. We use web tables as the external source: we search for columns in web tables that are similar to columns in the main dataset, since tables with similar columns may contain extra information for us. The goal is to increase the accuracy of machine learning models by adding this external information. The added information should introduce correlations that did not exist before, and it should be comprehensible to humans.
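One common way to measure column similarity, shown here only as an illustrative sketch (the project's actual measure may differ), is the Jaccard overlap of the columns' value sets:

```python
def jaccard(a, b):
    """Jaccard similarity between two columns given as iterables of values."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

# Hypothetical example: a location column from the main dataset
# compared against a column found in a web table.
query_column = ["Mitte", "Pankow", "Spandau", "Neukölln"]
web_table_column = ["Mitte", "Pankow", "Spandau", "Treptow"]

score = jaccard(query_column, web_table_column)
print(f"overlap = {score:.2f}")  # 3 shared of 5 distinct values -> 0.60
```

Web tables whose columns score above some threshold would then be candidate sources of extra features to join into the main dataset.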