How to Handle Impact Analyses in Complex DWHs with Predicates
During our pilots and deployments, we often find data warehouse environments that use very general physical models built around a few big tables such as PARTY, BALANCE and ORDER. These tables contain data obtained from various source systems, and a lot of data marts and reports are built on top of them. They make impact analysis difficult: data lineage from almost every report passes through them to all sources, which makes the result worthless.
Impact Analyses Do Not Have to Be THIS BIG
Let’s take a look at an example to understand exactly what happens. The table PARTY contains all individuals and companies that are somehow related to the organization. Thus, a single table can hold records for clients, employees, suppliers and the branch network. Each type of entity is identified by a unique attribute or by the source system from which its data is obtained – for example, clients are managed in a different system than employees.
Now, let’s assume we have two reports based on data from the PARTY table – a report EMPL_REPORT that displays information about employees and another report BRANCH_REPORT that displays information about the branch network. If we use the standard data lineage analysis, we can get this picture:
Although only data from the EMPLOYEE source table is relevant for the report EMPL_REPORT, the impact analysis from that report also includes the CLIENT, BRANCH and SUPPLIER source tables because they all feed the PARTY table. The problem is the same for the report BRANCH_REPORT. Conversely, the impact analysis from the EMPLOYEE source table includes both EMPL_REPORT and BRANCH_REPORT, which is confusing.
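To make the problem concrete, here is a minimal Python sketch of a standard (predicate-free) lineage traversal over the example above. The graph structure and the function name `upstream_sources` are assumptions made for illustration only; they are not how any particular lineage tool stores or walks its metadata.

```python
from collections import deque

# Hypothetical lineage graph for the example: each key is a data object,
# each value lists the objects it reads from (its direct upstream sources).
UPSTREAM = {
    "EMPL_REPORT": ["PARTY"],
    "BRANCH_REPORT": ["PARTY"],
    "PARTY": ["CLIENT", "EMPLOYEE", "SUPPLIER", "BRANCH"],
}


def upstream_sources(start):
    """Return every object reachable upstream of `start` (breadth-first)."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for parent in UPSTREAM.get(node, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Every source feeding PARTY shows up, although only EMPLOYEE is relevant.
print(sorted(upstream_sources("EMPL_REPORT")))
# ['BRANCH', 'CLIENT', 'EMPLOYEE', 'PARTY', 'SUPPLIER']
```

Because the traversal knows nothing about which rows of PARTY each report reads, both reports end up with exactly the same set of sources.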
In a real environment, there are dozens of source systems and hundreds of reports, which makes the standard data lineage analysis worthless.
The Advanced Data Lineage Analysis
Fortunately, there is a solution. When data is inserted into the PARTY table from different source systems, there is often a column like PARTY.source_system_id where the identification of the source system is stored as a constant. Similarly, when a report is created that consumes data only from specific source systems, there is a condition in the statement filtering data based on the PARTY.source_system_id column. Thus, it is possible to automatically analyze both the insertion and selection to/from the PARTY table and create predicates such as PARTY.source_system_id = 20 that are then stored together with data lineage in the metadata repository. Therefore, it is possible to include them in the computation during the impact analysis.
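As a rough illustration of where such predicates come from, the Python sketch below pulls the source system constant out of a load statement and out of a report query. The two SQL statements, their column names and the helper functions are all hypothetical, and the regular expressions only cope with these very simple statement shapes; a real analyzer would need a full SQL parser.

```python
import re

# Hypothetical load statement: employees are written with the constant 20.
LOAD_EMPLOYEES = """
INSERT INTO PARTY (party_id, party_name, source_system_id)
SELECT emp_id, full_name, 20 FROM EMPLOYEE
"""

# Hypothetical report query: only rows from source system 20 are read.
EMPL_REPORT_SQL = """
SELECT party_id, party_name FROM PARTY
WHERE PARTY.source_system_id = 20
"""


def predicate_from_insert(sql, column="source_system_id"):
    """Return the integer constant an INSERT ... SELECT writes into `column`."""
    match = re.search(
        r"INSERT\s+INTO\s+\w+\s*\((?P<cols>[^)]*)\)\s*"
        r"SELECT\s+(?P<exprs>.*?)\s+FROM\s",
        sql,
        flags=re.IGNORECASE | re.DOTALL,
    )
    if not match:
        return None
    cols = [c.strip().lower() for c in match.group("cols").split(",")]
    exprs = [e.strip() for e in match.group("exprs").split(",")]
    if column not in cols or len(cols) != len(exprs):
        return None
    value = exprs[cols.index(column)]
    return int(value) if value.isdigit() else None


def predicate_from_select(sql, column="source_system_id"):
    """Return the integer constant `column` is compared to with '=', if any."""
    match = re.search(rf"\b{re.escape(column)}\s*=\s*(\d+)", sql, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None


print(predicate_from_insert(LOAD_EMPLOYEES))   # 20
print(predicate_from_select(EMPL_REPORT_SQL))  # 20
```

Both values would then be stored next to the corresponding lineage edges, so that the load side and the read side of PARTY can later be compared.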
Thanks to that, if we perform an impact analysis from the report EMPL_REPORT, the predicate PARTY.source_system_id = 20 is gathered before the analysis reaches the table PARTY. When the analysis continues towards source tables, the predicate for each path is selected and compared to what has already been gathered. Therefore, when the path to the source table CLIENT with the predicate PARTY.source_system_id = 10 is tested, the result is that both predicates cannot hold at once, so data for this report cannot come from this source table. Conversely, when the path to the source table EMPLOYEE with the predicate PARTY.source_system_id = 20 is tested, the result is that data for this report can come from this source table, so it is included in the result of the impact analysis. We get similar results if we perform an impact analysis for BRANCH_REPORT, and also when we start from sources like the EMPLOYEE table.
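The comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the edge list, the function names, and the IDs 30 and 40 for SUPPLIER and BRANCH are invented for the example (only 10 for CLIENT and 20 for EMPLOYEE come from the text), predicates are limited to simple equalities, and the graph is assumed to contain no cycles.

```python
SSID = "PARTY.source_system_id"

# Hypothetical lineage edges: (consumer, producer, predicate on that path).
# IDs 30 and 40 are assumptions; 10 and 20 follow the example in the text.
EDGES = [
    ("EMPL_REPORT", "PARTY", {SSID: 20}),
    ("BRANCH_REPORT", "PARTY", {SSID: 40}),
    ("PARTY", "CLIENT", {SSID: 10}),
    ("PARTY", "EMPLOYEE", {SSID: 20}),
    ("PARTY", "SUPPLIER", {SSID: 30}),
    ("PARTY", "BRANCH", {SSID: 40}),
]


def compatible(gathered, predicate):
    """Equality predicates can hold together unless a column gets two values."""
    return all(gathered.get(col, val) == val for col, val in predicate.items())


def trace(start, downstream=False, gathered=None):
    """Follow lineage from `start`, skipping paths whose predicate conflicts."""
    gathered = gathered or {}
    result = set()
    for consumer, producer, predicate in EDGES:
        here, nxt = (producer, consumer) if downstream else (consumer, producer)
        if here != start or not compatible(gathered, predicate):
            continue
        result.add(nxt)
        # Carry the predicates gathered so far along this path.
        result |= trace(nxt, downstream, {**gathered, **predicate})
    return result


print(sorted(trace("EMPL_REPORT")))                # ['EMPLOYEE', 'PARTY']
print(sorted(trace("EMPLOYEE", downstream=True)))  # ['EMPL_REPORT', 'PARTY']
```

The same function serves both directions: from a report towards its sources, and from a source table towards the reports it can actually affect.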
The result of the advanced data lineage analysis can look like this (in reality, if we perform the impact analysis from the EMPL_REPORT, we will only see the EMPLOYEE and PARTY tables):
Surely, the situation can be far more complex. For example, the data from the PARTY table can be pre-computed for more source systems first, and then several reports can be created on top of them for only a specific source system, like in this picture:
This is also something that can be handled and, as you may have expected, even this is a part of the Manta Flow product analysis.
If you have any questions or comments, feel free to contact Lukas at firstname.lastname@example.org. You can try these predicate-based impact analyses in our free trial – just request it using the form on the right.