Data Lineage Granularity
What are the levels or layers of lineage?
It's critical to understand how deep your lineage can go and where to dive in.
Breadth of End-to-End Lineage
We often hear the term end-to-end lineage in the world of data management, but what does it actually mean? The concept is not well defined: it can mean slightly different things in different cases, and it is often used as a marketing term. Keep this in mind so that you are comparing apples to apples. The overall idea of end-to-end is breadth of coverage: seeing lineage from source to target. Let’s discuss what that means at each level:
Organization or Enterprise-Wide Lineage
This is the most holistic view: it covers the organization with lineage from the original data sources to the end consumers of the data. Organization-wide lineage means covering lineage across multiple departments and technologies, for all the data that you have. This is the ideal breadth of coverage, and Manta is uniquely equipped to deliver it.
Domain or Department-Wide Lineage
Domain or departmental lineage has two primary use cases. The first is column-level detail that the department uses to maintain its environment. The second is conceptual, system-level, or dataset-level detail that is used by other departments. Domain or department-wide lineage is often applied to the concepts of Data Mesh and Data Products: you cover lineage within your domain, for your data products only. As part of the distributed concept of Data Mesh, your inputs are either data that your domain produces or data products produced by other domains that you use. Your outputs are the data products that you publish, and along with them you also publish the metadata, including the lineage.
This lineage can be limiting: you do not necessarily trace back to the original source of the data or identify all the end users of the data that you produce. You simply define your inputs and outputs and trace what happens to the data in between. Essentially, you are covering what happens in your department. This breadth of lineage is very useful for optimizing processes and increasing efficiency within the department or team.
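To make the domain boundary concrete, here is a minimal sketch of how a domain might split lineage into inputs, internal flows, and outputs. The edge list, the "domain.dataset.column" naming scheme, and the helper names are all assumptions made for illustration; they are not taken from any particular lineage tool.

```python
# Hypothetical column-level lineage edges as (source, target) pairs.
# Node names follow an assumed "domain.dataset.column" convention.
EDGES = [
    ("crm.customers.email", "sales.leads.email"),
    ("sales.leads.email", "sales.lead_scores.contact"),
    ("sales.lead_scores.contact", "sales.pipeline_report.contact"),
    ("sales.pipeline_report.contact", "finance.forecast.contact"),
]


def domain_of(node):
    """Return the domain prefix of a node name (the part before the first dot)."""
    return node.split(".", 1)[0]


def domain_view(edges, domain):
    """Classify each edge as an input to, internal to, or output of a domain.

    Edges that do not touch the domain at all are ignored.
    """
    view = {"inputs": [], "internal": [], "outputs": []}
    for src, dst in edges:
        src_in = domain_of(src) == domain
        dst_in = domain_of(dst) == domain
        if src_in and dst_in:
            view["internal"].append((src, dst))
        elif dst_in:
            view["inputs"].append((src, dst))
        elif src_in:
            view["outputs"].append((src, dst))
    return view


sales_view = domain_view(EDGES, "sales")
```

In this sketch the sales domain sees one input edge, two internal edges, and one output edge. It does not see where the CRM data originally came from or what finance does with the forecast, which is exactly the limitation described above.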
Use Case Specific Lineage
When lineage is needed for a specific use case, that use case usually defines the scope, or breadth, that needs to be covered: whether it is true end-to-end, or whatever the use case considers to be the “source” and “end point” for data pipeline visibility.
We often see that this is a slice of organization-wide lineage for a specific data domain, a data set, or just a few common data elements (CDEs). When you get down to specific CDEs that span the whole organization, it gets tricky to pick the right approach. Selecting full automation that captures everything may be like using a sledgehammer to crack a nut, especially if you never plan to scale up and cover more. On the other hand, starting with a manual approach may get you results quickly for a few CDEs but will not scale if you plan to cover much more in the future.
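As an illustration of what a "slice" for a single data element might look like, the sketch below walks a lineage graph upstream and downstream from one element and keeps only the edges among the nodes it reaches. The edge list, the node names, and the functions `reachable` and `cde_slice` are hypothetical and exist only for this example.

```python
from collections import deque

# Hypothetical organization-wide, column-level lineage edges (source, target).
EDGES = [
    ("src.orders.amount", "stg.orders.amount"),
    ("stg.orders.amount", "dwh.revenue.total"),
    ("dwh.revenue.total", "bi.revenue_dashboard.total"),
    ("src.customers.name", "dwh.customers.name"),
]


def reachable(edges, start, upstream=False):
    """Breadth-first walk from `start`; follows edges backwards if upstream."""
    adjacency = {}
    for src, dst in edges:
        key, value = (dst, src) if upstream else (src, dst)
        adjacency.setdefault(key, []).append(value)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def cde_slice(edges, cde):
    """Keep only edges whose endpoints are the element or its up/downstream nodes."""
    keep = {cde} | reachable(edges, cde, upstream=True) | reachable(edges, cde)
    return [(src, dst) for src, dst in edges if src in keep and dst in keep]


revenue_slice = cde_slice(EDGES, "dwh.revenue.total")
```

Here the slice for the revenue element keeps the three edges on its path and drops the unrelated customer-name edge. The traversal itself is trivial; the hard part in practice is obtaining the edges, which is where the choice between automated and manual approaches matters.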
Application-Wide Lineage (DWH, BI)
Application-wide lineage is usually what the individual teams who manage an application need in order to optimize their work and reduce the risk of introducing unexpected errors.
For example, many organizations have data warehouses that have accumulated a tremendous amount of business logic for various reporting needs over more than 30 years. Changing, optimizing, modernizing, or migrating such applications efficiently and in a controlled, low-risk manner requires untangling those 30 years of complexity. Depending on the need, this often requires very detailed and accurate lineage.
Looking to the future while learning from the past: when building a new data warehouse, it is good to keep application transparency and manageability in mind and to embed metadata and lineage into the process from the very beginning. The requirements for more transparency are not going away; they are only getting more important. A metadata-driven approach can be one way to achieve this.
Run-Time Lineage Isn't Always Enough on Its Own
Design-time and run-time lineage are good for different uses. However, there’s a large set of use cases where run-time lineage is not good enough. In particular, you can’t do any sort of end-to-end or application-to-application impact analysis with run-time lineage alone.
The challenge is that you can’t observe something that isn’t running. But that doesn’t mean that something that isn’t running can’t, or shouldn’t, have lineage. If the data is connected to something that doesn’t run yet but will run in the future, you’ll need to include it in your lineage discussions and planning. You may think you can’t, simply because it’s not running yet. In run-time lineage, you’re observing, but you don’t see the syntax of the tool that is doing the writing.
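The gap between the two kinds of lineage can be sketched with two small edge sets: one derived from definitions (design time) and one observed from executions (run time). Everything here is hypothetical, including the dataset names and the `downstream` helper; it only illustrates why an impact analysis based on observed runs alone can miss a consumer that has been defined but has not run yet.

```python
# Hypothetical dataset-level edges (source, target).
# DESIGN_TIME: derived from code or job definitions, whether or not they have run.
# RUN_TIME: observed from executions that have actually happened.
DESIGN_TIME = {
    ("stg.orders", "dwh.orders"),
    ("dwh.orders", "bi.sales_report"),
    ("dwh.orders", "ml.churn_features"),  # defined, but not yet deployed
}
RUN_TIME = {
    ("stg.orders", "dwh.orders"),
    ("dwh.orders", "bi.sales_report"),
}


def downstream(edges, start):
    """Return every node reachable from `start` by following edges forward."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


# Impact of changing stg.orders, as seen by each kind of lineage.
observed_impact = downstream(RUN_TIME, "stg.orders")
full_impact = downstream(DESIGN_TIME | RUN_TIME, "stg.orders")

# Consumers that exist in the design but have never been observed running.
not_yet_running = full_impact - observed_impact
```

In this example `not_yet_running` contains only `ml.churn_features`: a change to `stg.orders` would affect it, but lineage built purely from observed runs would not show that.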
Find the Right Level of Lineage for Your Organization
Linking the depth, breadth, and approach of the various granularities of lineage is not an easy task.
Think of it as creating a map of the US road network as it exists today: now that all the roads are built, how do you actually create the map? Do you start with coast-to-coast highways, but then get lost on the “last-mile” problem? Or do you start by building a detailed map of a town, but miss the big picture of the country? Or do you do it completely differently, for example by monitoring traffic and speed using cellphone positions? Different approaches provide different coverage and different levels of accuracy and detail. Data lineage is similar: different approaches suit different use cases and address different needs.