How to Solve Impact Analysis with MANTA
Our customers use MANTA for all kinds of projects, and impact analysis is one of them. If you have a really complex BI environment and still want a reliable impact analysis, using predicates with MANTA is one way to get it. Keep reading to learn how.
During our pilots and deployments, we often find data warehouse environments that use very general physical models including several big tables like PARTY, BALANCE, ORDER, and others.
These tables contain data obtained from various source systems, and there are a lot of data marts and reports built on top of them. These tables make things difficult during impact analysis because data lineage from almost every report goes through them and into all the sources, making the results hard to use, or even worthless.
Impact Analysis Is Easier Than Ever Before
Let’s take a closer look at an example to understand exactly what happens when you use MANTA for your impact analysis. The table PARTY contains all information about individuals and companies that are somehow related to the organization. Thus, a single table can hold records for clients, employees, suppliers, and the organization’s branch network. Each type of entity is identified by a unique attribute or by the source system from which the data is obtained. For example, clients are managed in a different system than employees.
Now, let’s assume that the data from the PARTY table goes into two separate reports – a report EMPL_REPORT that displays information about employees and another report BRANCH_REPORT that displays information about the branch network. If we use the standard data lineage analysis, we can get this picture:
Although only data from the EMPLOYEE source table is relevant for the report EMPL_REPORT, the impact analysis from that report also includes the CLIENT, BRANCH, and SUPPLIER source tables because every path runs through the PARTY table. The problem is the same for the report BRANCH_REPORT. In the other direction, the impact analysis from the EMPLOYEE source table includes both EMPL_REPORT and BRANCH_REPORT, which is confusing because BRANCH_REPORT does not use any employee data.
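To make the problem concrete, here is a minimal sketch of a standard lineage traversal over the example above. It is not MANTA code; the edge list and the function name `upstream` are assumptions made purely for illustration.

```python
from collections import defaultdict

# Hypothetical lineage edges, each pointing from a data source to its consumer.
EDGES = [
    ("CLIENT", "PARTY"),
    ("EMPLOYEE", "PARTY"),
    ("SUPPLIER", "PARTY"),
    ("BRANCH", "PARTY"),
    ("PARTY", "EMPL_REPORT"),
    ("PARTY", "BRANCH_REPORT"),
]


def upstream(node, edges=EDGES):
    """Return every object that can feed `node`, ignoring any filters."""
    parents = defaultdict(set)
    for src, dst in edges:
        parents[dst].add(src)

    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent in parents[current]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Every source table shows up, although only EMPLOYEE is relevant.
print(sorted(upstream("EMPL_REPORT")))
```

Because the traversal knows nothing about which rows each report actually reads, PARTY acts as a hub that connects every report to every source.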
The Advantage of Using Data Lineage
Luckily, there is a solution. When data is inserted into the PARTY table from different source systems, there is often a column like PARTY.source_system_id where the identification of the source system is stored as a constant. Similarly, when a report is created that consumes data only from specific source systems, there is a condition in the statement filtering data based on the PARTY.source_system_id column. Thus, it is possible to automatically analyze both the insertion and selection to/from the PARTY table and create predicates such as PARTY.source_system_id = 20 that are then stored together with data lineage in the metadata repository. Therefore, it is possible to include them in the computation during the impact analysis.
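The idea of deriving predicates from both sides can be sketched roughly as follows. This is only a toy illustration that uses regular expressions on two very simple statements; a real analyzer would rely on a full SQL parser, and nothing here reflects how MANTA is implemented. The statements, the column names `party_id`, `name`, `emp_id`, and `emp_name`, and both function names are hypothetical.

```python
import re

# Hypothetical load statement: the source system is written as the constant 20.
LOAD_EMPLOYEES = (
    "INSERT INTO PARTY (party_id, name, source_system_id) "
    "SELECT emp_id, emp_name, 20 FROM EMPLOYEE"
)
# Hypothetical report query: the filter restricts rows to one source system.
EMPL_REPORT_QUERY = "SELECT name FROM PARTY WHERE PARTY.source_system_id = 20"

INSERT_RE = re.compile(
    r"INSERT\s+INTO\s+(\w+)\s*\((.*?)\)\s*SELECT\s+(.*?)\s+FROM\s+(\w+)",
    re.IGNORECASE | re.DOTALL,
)
FILTER_RE = re.compile(r"(\w+\.source_system_id)\s*=\s*(\d+)", re.IGNORECASE)


def predicate_from_insert(sql):
    """Find the constant written into source_system_id by a simple INSERT ... SELECT."""
    match = INSERT_RE.search(sql)
    if not match:
        return None
    target, cols, exprs, _source = match.groups()
    columns = [c.strip() for c in cols.split(",")]
    values = [e.strip() for e in exprs.split(",")]
    for column, value in zip(columns, values):
        if column.lower() == "source_system_id" and value.isdigit():
            return f"{target}.{column} = {value}"
    return None


def predicate_from_select(sql):
    """Find an equality filter on source_system_id in a simple query."""
    match = FILTER_RE.search(sql)
    return f"{match.group(1)} = {match.group(2)}" if match else None


print(predicate_from_insert(LOAD_EMPLOYEES))
print(predicate_from_select(EMPL_REPORT_QUERY))
```

Both calls yield the same predicate text, which is what lets the load path and the report path be matched against each other later.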
Thanks to that, if we perform an impact analysis from the report EMPL_REPORT, the predicate PARTY.source_system_id = 20 is gathered before the analysis reaches the PARTY table. When the analysis continues towards the source tables, the predicate for each path is selected and compared to what has already been gathered. Therefore, when the path to the source table CLIENT with the predicate PARTY.source_system_id = 10 is tested, the result is that both predicates cannot hold at once, so data for this report cannot come from this source table. Conversely, when the path to the source table EMPLOYEE with the predicate PARTY.source_system_id = 20 is tested, the result is that data for this report can come from this source table, so it is included in the results of the impact analysis. We get similar results if we perform an impact analysis for the BRANCH_REPORT, and also if we start from a source such as the EMPLOYEE table.
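The comparison described above can be sketched as a traversal that carries the gathered predicate along each path. This is a simplified illustration and not MANTA's algorithm: it assumes every predicate is an equality on PARTY.source_system_id, so a predicate is stored as a plain integer. The values 10 and 20 come from the example; 30 for SUPPLIER and 40 for BRANCH are made-up values, and the names `compatible` and `lineage` are hypothetical.

```python
# Each edge is (source, target, required source_system_id or None).
EDGES = [
    ("CLIENT", "PARTY", 10),
    ("EMPLOYEE", "PARTY", 20),
    ("SUPPLIER", "PARTY", 30),       # 30 is an assumed value
    ("BRANCH", "PARTY", 40),         # 40 is an assumed value
    ("PARTY", "EMPL_REPORT", 20),
    ("PARTY", "BRANCH_REPORT", 40),  # 40 is an assumed value
]


def compatible(gathered, predicate):
    """Two equality predicates can hold at once only if they agree."""
    return gathered is None or predicate is None or gathered == predicate


def lineage(start, downstream=False):
    """Walk the lineage from `start`, pruning paths whose predicates conflict."""
    found = set()
    visited = {(start, None)}
    stack = [(start, None)]
    while stack:
        node, gathered = stack.pop()
        for src, dst, pred in EDGES:
            here, there = (src, dst) if downstream else (dst, src)
            if here != node or not compatible(gathered, pred):
                continue
            # Carry the predicate forward so later edges are checked against it.
            state = (there, pred if pred is not None else gathered)
            if state not in visited:
                visited.add(state)
                found.add(there)
                stack.append(state)
    return found


print(sorted(lineage("EMPL_REPORT")))             # sources of the report
print(sorted(lineage("EMPLOYEE", downstream=True)))  # impact of the source
```

In this sketch, the path from EMPL_REPORT to CLIENT is dropped because 20 and 10 cannot both hold, while the path to EMPLOYEE is kept. Real predicates can involve ranges, lists, or several columns, which would need a proper satisfiability check instead of a simple equality comparison.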
The results of the advanced data lineage analysis can look like this (in reality, if we perform the impact analysis from the EMPL_REPORT, we will only see the EMPLOYEE and PARTY tables):
Of course, the situation can be far more complex. For example, the data from the PARTY table can first be pre-computed for several source systems at once, and then several reports can be built on top of that pre-computed data, each for only one specific source system, like in this picture: