Column-Level Lineage: Tracing Your Data Back to Its Roots
Data lineage creates a comprehensive map of your data pipelines, showing how data moves through your application systems. But not all data lineage solutions offer the same level of granularity, and understanding those levels can help you set the right expectations for your projects.
In this post, we’ll dive into column-level lineage, also known as attribute-level lineage, and how to use it to achieve various results. This type of lineage is the most granular level of data lineage and supports the ability to trace data from an enterprise-wide perspective, as well as from table-to-table and column-to-column.
What is Column-Level Lineage?
Column-level lineage traces how individual columns flow and are transformed from their points of origin (e.g., a data warehouse) to their end points (e.g., an analytics report). While higher levels of lineage, such as system or application-level lineage, have their own valuable use cases, they lack much-needed context when it comes to getting to the root of data flow problems.
With column-level lineage, you can not only determine that a dependency exists, but also understand exactly how and where a column is calculated. This means you’ll know which input columns produced which output columns and whether a given column has any special characteristics (such as reading or writing personally identifiable information, or PII). You’ll also understand how that column is being used, which means you can answer questions like “how was this profitability value derived?”
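To make the "how was this profitability value derived?" question concrete, here is a minimal Python sketch of column-level lineage records. Everything in it is an assumption made for illustration: the column names, the expressions, and the `Derivation` and `explain` helpers are hypothetical and do not represent any particular product's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Derivation:
    output: str              # fully qualified output column
    inputs: tuple[str, ...]  # columns read to compute the output
    expression: str          # how the output is calculated


# Hypothetical lineage records, keyed by the column they produce.
LINEAGE = {
    d.output: d
    for d in [
        Derivation("report.profitability",
                   ("mart.revenue", "mart.cost"),
                   "(revenue - cost) / revenue"),
        Derivation("mart.revenue", ("warehouse.orders.amount",), "SUM(amount)"),
        Derivation("mart.cost", ("warehouse.expenses.amount",), "SUM(amount)"),
    ]
}


def explain(column: str, depth: int = 0) -> list[str]:
    """Return an indented, human-readable derivation of `column`."""
    indent = "  " * depth
    record = LINEAGE.get(column)
    if record is None:
        # No recorded derivation: treat the column as a point of origin.
        return [f"{indent}{column} (source column)"]
    lines = [f"{indent}{column} = {record.expression}"]
    for parent in record.inputs:
        lines.extend(explain(parent, depth + 1))
    return lines
```

Calling `"\n".join(explain("report.profitability"))` walks from the report column back to the two warehouse columns it ultimately depends on, listing the expression applied at each step.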
With column-level lineage, you can unlock powerful insights into how your data moves to enable fine-tuned root cause and impact analysis, reactive troubleshooting, increased visibility into PII data, and more.
Column-level lineage is useful in a variety of industries and for a variety of reasons. A few examples include:
Data Quality Assurance
Column-level data lineage helps focus attention on the data quality issues in your environment. Imagine you suspect that a table somewhere in your system is questionable from a data quality perspective. Column-level lineage helps you identify where to apply data quality (DQ) checks, or quickly zoom in on the location of DQ failures for rules you have already implemented, without spending hours searching. Whether you’re looking to determine the root cause of a data quality incident or conduct impact analysis to see the downstream effects of a data quality issue, the ability to see data lineage on a column level will get you the information you need.
Regulations such as GDPR, HIPAA, and CCPA have implemented strict requirements around data integrity and privacy in recent years. With column-level lineage, you have complete visibility into the columns that have regulated data and personally identifiable information (PII) and how that data connects to user-facing dashboards or other places where you are sending data downstream. With that information, you can take precautions to ensure that this data remains secure and in compliance with regulatory requirements.
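One way to picture this kind of PII visibility is as a graph traversal: start from the columns known to hold PII and follow every column-to-column edge downstream. The Python sketch below assumes a hypothetical edge list and hypothetical column names. It is deliberately conservative, flagging anything derived from a PII column, even a hash that may no longer count as PII under your own policies.

```python
from collections import defaultdict, deque

# Hypothetical column-to-column edges: (upstream, downstream).
EDGES = [
    ("crm.customers.email", "staging.customers.email_hash"),
    ("crm.customers.email", "mart.contacts.email"),
    ("mart.contacts.email", "dashboard.support.contact_email"),
    ("crm.orders.amount", "mart.sales.revenue"),
]

# Hypothetical set of columns already classified as PII.
PII_SOURCES = {"crm.customers.email"}


def pii_exposure(edges, pii_sources):
    """Return every column that holds PII or is derived from a PII column."""
    downstream = defaultdict(list)
    for src, dst in edges:
        downstream[src].append(dst)

    tainted = set(pii_sources)
    queue = deque(pii_sources)
    while queue:  # breadth-first walk over downstream edges
        column = queue.popleft()
        for nxt in downstream[column]:
            if nxt not in tainted:
                tainted.add(nxt)
                queue.append(nxt)
    return tainted
```

In this example, `pii_exposure(EDGES, PII_SOURCES)` includes the support dashboard column but not `mart.sales.revenue`, because no edge connects the revenue column to the email column.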
Detailed Root Cause Analysis
With column-level lineage, root cause analysis doesn’t need to feel like finding a needle in a haystack. In the past, data teams had to find the source of a data anomaly by manually combing through datasets. That could take days, weeks, or even months, depending on the size of the data environment. With column-level lineage, your data engineers will be empowered to quickly trace the root cause of data incidents back to their source.
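As a rough illustration of tracing an incident back to its source, the following Python sketch collects every column upstream of a suspicious report column and narrows the list to those with a failing data quality check. The `PARENTS` mapping, the column names, and both helper functions are hypothetical.

```python
# Hypothetical mapping: column -> columns it is computed from.
PARENTS = {
    "report.monthly_revenue": ["mart.sales.revenue"],
    "mart.sales.revenue": ["staging.orders.amount", "staging.orders.fx_rate"],
    "staging.orders.amount": ["erp.orders.amount"],
    "staging.orders.fx_rate": ["finance.fx.rate"],
}


def upstream_columns(column, parents):
    """Every column that directly or indirectly feeds `column`."""
    seen = set()
    stack = [column]
    while stack:  # depth-first walk over parent links
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def root_cause_candidates(column, parents, failed_checks):
    """Upstream columns that also have a failing data quality check."""
    return upstream_columns(column, parents) & set(failed_checks)
```

If checks are failing on `finance.fx.rate` and on an unrelated `hr.employees.salary` column, `root_cause_candidates("report.monthly_revenue", PARENTS, {"finance.fx.rate", "hr.employees.salary"})` keeps only `finance.fx.rate`, since the salary column is not upstream of the report.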
Proactive Impact Analysis
When a database column is superseded by newer constructs, it is important to mark it as deprecated so it isn’t used to generate future reports. Column-level lineage makes it possible to identify whether that deprecated column is linked to a downstream report before you make changes to the system that could have a detrimental impact. This is a critical piece of change management, especially for large enterprises.
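The deprecation check described here can be sketched as a downstream walk that records the full path from the deprecated column to each end point it reaches. The Python below is only an illustration: the `CHILDREN` mapping and column names are hypothetical, and the sketch assumes the last column on each path is a report.

```python
# Hypothetical mapping: column -> columns computed from it.
CHILDREN = {
    "warehouse.customers.region_code": [
        "mart.customers.region",
        "mart.sales.region",
    ],
    "mart.customers.region": ["report.churn_by_region.region"],
    "mart.sales.region": ["report.sales_by_region.region"],
}


def impact_paths(column, children, path=None):
    """List every downstream path that starts at `column`."""
    path = (path or []) + [column]
    # Skip columns already on the path so a cyclic graph cannot loop forever.
    consumers = [c for c in children.get(column, []) if c not in path]
    if not consumers:
        return [path]  # reached an end point
    paths = []
    for consumer in consumers:
        paths.extend(impact_paths(consumer, children, path))
    return paths
```

For `warehouse.customers.region_code`, this returns two paths, one ending at the churn report and one at the sales report, which is the list you would want to review before removing the column.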
Troubleshooting Data Anomalies
The granularity of column-level lineage means that you can pinpoint the individual columns or fields that are affecting the numbers on your reports. That eliminates the time it would otherwise take you to hunt down errors and troubleshoot where things went wrong.
Column-level lineage is the most difficult level of lineage to achieve due to code complexity and sheer scale.
Think about it this way: If every system has 100 datasets, and each dataset has 100 columns, column-level lineage needs to track 10,000 columns per system, along with all of their dependencies. Often, this is seen as too detailed to achieve. Still, only column-level lineage enables you to fully take advantage of data lineage to answer questions pertaining to specific columns or attributes.
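The arithmetic behind that estimate is simple, but it shows how quickly the numbers grow once you count several systems and the possible dependencies between columns. In the Python sketch below, the 100-by-100 figures come from the example above; the 20-system figure and the pairwise upper bound are assumptions added for illustration, and real environments will have far fewer actual dependencies than the bound suggests.

```python
def column_count(systems, datasets_per_system=100, columns_per_dataset=100):
    """Total columns to track across all systems."""
    return systems * datasets_per_system * columns_per_dataset


def max_direct_dependencies(columns):
    """Upper bound on direct column-to-column dependencies:
    every ordered pair of distinct columns."""
    return columns * (columns - 1)


one_system = column_count(1)        # 10,000 columns
twenty_systems = column_count(20)   # 200,000 columns (hypothetical estate)
bound = max_direct_dependencies(one_system)  # 99,990,000 possible edges
```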
Given the complexity involved, the best way to implement column-level lineage is to turn to an automated data lineage solution that supports it. Trying to manually parse or document lineage on a column level is error-prone and requires extensive time, money, and effort from your data engineers and beyond.
What is Manta’s Approach to Column-Level Lineage?
By choosing an automated data lineage solution with column-level lineage, you save yourself the heavy lift of trying to build a data lineage solution on your own.
Here’s how Manta enables column-level lineage:
- Manta enables lineage at a large scale by taking metadata from multiple sources and catalogs across enterprises.
- Manta automates column-level lineage, tracking it at table and column levels to go as deep as possible.
- Manta uses column-level lineage for both design and run-time lineage.
Manta’s customers have already had success thanks to this approach. For example, the nonprofit health system CHRISTUS Health came to Manta because they were experiencing major downstream complications and system outages during required quarterly electronic health records (EHR) system upgrades.
Without visibility into how those upgrades would impact their data environment, the CHRISTUS team was forced to take a reactive rather than a proactive approach to system changes — end users would inform them of outages as they occurred, rather than CHRISTUS flagging upgrade changes and impacts in advance and addressing them before they posed problems. Then, CHRISTUS data engineers needed to backtrack and troubleshoot what had already gone wrong. With Manta, they were able to create a new workflow that proactively approached EHR upgrades through Manta’s revision comparison and impact analysis capabilities.
Today, insights that once required days of tedious work and left the CHRISTUS team in a constant state of reaction now take minutes or hours thanks to Manta.
Column-level lineage enables organizations to trace their data’s journey at the most granular level. With column-level lineage, data teams can quickly trace the source of data quality issues back to their origin, eliminating the need for time-consuming manual investigations. Column-level lineage also enables impact analysis, which can be valuable when identifying the downstream impact of data quality problems or deprecating columns.
While achieving column-level lineage can be challenging due to its highly granular nature, implementing an automated data lineage solution like Manta can help you save time, effort, and resources while providing unparalleled insights into your data flows and dependencies.
P.S. This blog was written by a human.