How Active Metadata Makes Your Data Lineage Insights Actionable

Written by Tomáš Krátký | Jan 20, 2023 6:05:00 PM

Data is, without question, one of the most critical assets for every organization. Every company consists of a set of processes implemented via manual tasks or software applications. One way or another, for everything we do, we use data as input, and we also create data as a result of our actions. Imagine a software system automating critical tasks in finance or marketing that would actually produce and use no data!

While data gets more and more attention on all levels in every organization, there is something we, unfortunately, do not value enough yet: metadata. It’s also referred to as data about data – or context, if you wish. Metadata is everything. It can help you answer questions like:

What business meaning does the number I see in my report have?
Is it PII / classified data?
In which databases, tables, and columns is it stored? Using what data type?
When was the number last updated? How was it calculated?
What is the quality score associated with the number?
What data sources did it come from?
Which of my colleagues are using the same metric / feature? How?

And we can continue with more and more examples. Metadata is essentially every piece of information about our data. But why is metadata so important? One obvious reason is search. Imagine a library with thousands or millions of books, an e-shop that sells millions of items, or even the internet! Whenever you need to find something, metadata is the key to doing so quickly and efficiently.

Actually, for anyone who is interested in search and metadata, I highly recommend the book The Enterprise Data Catalog by Ole Olesen-Bagneux. Many great books were written about the importance of metadata, information science, taxonomies, data semantics, and knowledge management, and our goal today is not to compete with them.

Can Data Be Active?

Our focus in this article is active metadata, a concept whose definition is still evolving. To help understand it, let's compare and contrast the concept with how we use data. Metadata, after all, is also just "data". Metadata management is a technology market that has existed for decades, going through various phases. The most recent phase started with the rise of data catalogs. There are more than 30 different tools out there, and new data catalogs are created almost every month. Yet Gartner, in their Market Guide for Active Metadata Management, stated that "[t]raditional metadata practices are insufficient." So what is wrong with metadata and how can activation help?

Our ultimate issue is that we are focused too much on metadata collection, which has resulted in silos of metadata. As each catalog has its specific strengths, it is not uncommon to see multiple tools implemented by one company in different business units, which then leads to a catalog of catalogs. This is funny… and useless. Like with data, just collecting it adds no value to the organization.

Using the data analogy, we typically use data in the following ways.

We query data to get answers to our questions. That is the very basic use case we usually think of first. It can be good old pre-built reports, ad-hoc queries, or smart AI/ML algorithms digging insights from the data we have. By performing queries, we turn data into information and eventually knowledge.
We embed data into places where people or machines naturally need it. We do not force a sales representative to log into our reporting platform and write ad-hoc SQL queries or use pre-built reports. No. Rather, we prepare all the data they may need about a prospect or a customer, turn it into information, and deliver it to their workspace (as a dashboard in an application like the CRM they use daily). On top of that, we also enrich internal data with valuable external data to provide an even more complete view of the customer.
We also use data to automate tasks and processes. Instead of waiting for the sales representative to open their workspace and search for customers who may be a good fit for a new product offering, we have an algorithm running in the background that scores existing customers and sends proactive notifications that suggest who to call and what (or even how) to offer. Or, for an even simpler example, we automatically send a reminder to a sales representative when they have taken no action even though they should have.

Obviously, there are more ways that we interact with data (like in data governance in healthcare); the above are the most traditional examples. The first case, querying, represents a very "static" experience. Everything is sitting in a silo (e.g., a data warehouse or data lake), and we expect people to come, find what they need, and ask the questions they need to ask. Do not get me wrong - it is awesome for some use cases and, when compared to a case with no data available, a huge jump forward.

However, we see that data is put to much better use in the other two examples, actively supporting users with limited data engineering skills and dramatically increasing their productivity. Compared to the first example, the latter two are more "active" and thus more useful and accessible to a broader audience. And that is what we want to achieve with active metadata too.

What the Heck Is Active Metadata?

That leads us back to the very first question: what is active metadata? Gartner’s definition in their most recent Market Guide for Active Metadata Management is a bit vague but touches on several key aspects.

Continuous access - metadata is continuously collected. It is not something you do once per month or once per year, as we want to collect every change and every signal and respond to it.
Connecting dots - metadata is not just collected; it is constantly processed to distill information (and knowledge) from all the signals and noise. And with the right feedback loop, your system gets smarter over time, collecting and learning.
Actionable - all the intelligence and insights derived from metadata are not locked into a silo, but rather delivered in the form of recommendations, warnings, and notifications to humans and systems/applications that may need it.
Embedded - actionable information / knowledge is integrated into processes humans and machines perform, embedded into their workspace. People are not forced to go in and look for the insights. Instead, active metadata comes to them – when and where they need it.

How to Activate Your Data Lineage

As mentioned at the beginning of this text, there are various types of metadata. Metadata is almost everything. One obvious question is how to map ALL metadata, and whether there is even a strong business case to do so. We strongly believe that the key to unlocking the true potential of metadata is an intelligent and open standard for metadata exchange and integration. There is a lot to discuss about that topic and I encourage you to start with this article on OpenLineage.

But we at Manta are experts in data lineage and that is what I would like to talk about. For anyone who wants to learn more about data lineage basics, I recommend The Ultimate Guide to Data Lineage. Now the question stands - how can you activate data lineage, and what does it even mean for users? Let's take a look at several examples.

Protect key business / regulatory metrics. Every company has a set of essential metrics they use to make decisions and manage their business. They are usually well-curated and carefully watched. Thanks to data lineage, we can understand how each metric is calculated and where its data comes from. Activated data lineage evaluates every change in the environment to assess its impact on key metrics. For example, if there is a breaking change in an upstream data source or if a quality indicator for one of the sources drops, warnings are immediately sent to those responsible for fixing the issue and to those using the affected metrics. This helps prevent business decisions from being made on wrong numbers.
Obtain contextual information when writing ETL code. We spend a lot of time moving data, transforming data, and running calculations and smart algorithms, just to get better insights. Data pipelines can be built partially in an automated way (this article is another great example of how to use metadata in an active way), but at least some parts are manually built by data engineers, or when using low-code/no-code platforms, by a variety of data users. When building a pipeline, you typically write SQL scripts and/or you drag-and-drop components in your ETL/ELT tool, link them together, and connect them to tables and columns. You have questions like - Where is this column sourced from? Is any PII data used to calculate it? What is the most recent data quality score of the associated data element? And many more. Now, imagine you have all that information as part of your workspace - all the critical context. You will certainly work much faster, and the same can be done for BI or AI/ML tools.
Prevent changes from breaking pipelines. Understanding the impact of changes is a very powerful capability of data lineage. We have used that power for decades as software engineers in our IDEs. Yet, in data, it was nearly impossible. In the ETL use case above, imagine that the developer implementing a change is warned by the ETL development studio that they have implemented a breaking change - a change that will break something downstream. In addition, we can also integrate data lineage into our CI/CD pipeline and make sure that when a developer tries to commit a piece of code, an automated impact analysis is triggered to determine if it is a breaking change and stop the commit or trigger a notification to the right people.
Decommission unused objects. Our environment has many assets, such as tables and columns, reports, APIs, data exports, and more. But do we truly use all of them? If not, they only consume our expensive resources like space or money. They can even contain sensitive data! It is a very frequent issue, especially in the case of M&A projects or migrations. It is best to delete such assets. Unfortunately, measuring "usage" is quite difficult. For example, a column can be accessed by a human writing an SQL query or by a program reading something from it and processing it in some way. Activated data lineage constantly evaluates all data pipelines, and if a "lost object" is detected, the right people are notified.
Clean and simplify our pipelines. Decommissioning unused assets, as in the example above, is still only part of the problem, because once assets are deleted, a whole pipeline or some of its parts may become unnecessary. Think of complex SQL queries, dbt models, stored procedures, or ETL jobs, for example. How often do we actually clean and simplify them because some branches are no longer needed? Activated data lineage recommends which parts of the pipeline can be removed because they do not do anything truly useful.
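To make the impact-analysis and decommissioning examples above a bit more concrete, here is a minimal sketch in Python. It is not how Manta or any other lineage tool is implemented. It assumes lineage is already available as a plain list of (upstream, downstream) edges, and every asset name, the KEY_METRICS and CONSUMED sets, and the ci_gate helper are hypothetical, invented only for illustration.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges (upstream, downstream); a real tool would
# harvest these from SQL, ETL jobs, and BI reports.
EDGES = [
    ("crm.customers", "staging.customers"),
    ("erp.orders", "staging.orders"),
    ("staging.customers", "mart.revenue_by_customer"),
    ("staging.orders", "mart.revenue_by_customer"),
    ("mart.revenue_by_customer", "report.quarterly_revenue"),
    ("erp.orders", "staging.orders_archive"),
]

# Assumption: curated key metrics and assets known to be read by people or apps.
KEY_METRICS = {"report.quarterly_revenue"}
CONSUMED = {"report.quarterly_revenue"}


def build_graph(edges):
    """Return (downstream, upstream) adjacency maps for the lineage edges."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start_nodes, adjacency):
    """Breadth-first search: every node reachable from start_nodes."""
    seen, queue = set(), deque(start_nodes)
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impacted_key_metrics(changed, edges, key_metrics):
    """Key metrics that are changed themselves or sit downstream of a change."""
    downstream, _ = build_graph(edges)
    affected = set(changed) | reachable(changed, downstream)
    return sorted(affected & set(key_metrics))


def unused_assets(edges, consumed):
    """Assets that feed nothing anyone consumes: decommissioning candidates."""
    _, upstream = build_graph(edges)
    all_assets = {node for edge in edges for node in edge}
    needed = set(consumed) | reachable(consumed, upstream)
    return sorted(all_assets - needed)


def ci_gate(changed, edges, key_metrics):
    """Hypothetical CI step: non-zero exit code if a key metric is affected."""
    hits = impacted_key_metrics(changed, edges, key_metrics)
    return (1 if hits else 0), hits


if __name__ == "__main__":
    print(ci_gate({"erp.orders"}, EDGES, KEY_METRICS))
    print(unused_assets(EDGES, CONSUMED))
```

In this toy graph, a change to erp.orders reaches report.quarterly_revenue, so the gate would block the commit or notify the metric owners, while staging.orders_archive feeds nothing that is consumed and is flagged as a candidate for removal. Real usage tracking is harder than a static CONSUMED set, as noted above, so treat the second function as a starting point rather than a verdict.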

These are just a few examples among many that show how data lineage can be powerful when it is activated rather than sitting somewhere in your metadata repository.

Integrating Metadata Is The Key

Using the examples above, it is clear that a huge driver of success when it comes to active metadata is the ability to embed and integrate it into other tools. However, a lack of universal standards for metadata exchange and no universal API that vendors can use to embed metadata make it difficult to achieve full integration. In this way, it is similar to Apple’s CarPlay and Google’s Android Auto.

Think of it like this. Today, many new cars come with CarPlay compatibility. At the same time, CarPlay is not truly integrated with the car itself – rather, it projects whatever your phone displays onto the car’s screen, similar to how browser widgets or iframes work. If there were a universal behind-the-scenes API integration for CarPlay, no matter the car type, it would be capable of so much more.

Similarly, while metadata browser plug-ins and widgets are useful, they can’t compete with the value of true integration.

At Manta, we have always been pioneers in the space. We integrated actionable metadata years before the term active metadata was coined, and we are big proponents of an open ecosystem with standards. We are part of OpenLineage and Egeria, but those efforts are still evolving. Until they mature, metadata vendors must negotiate and implement point integrations with every single data solution out there, which will clearly never scale. There is still a lot of work to be done, but thinking about the opportunity, we could not be more excited.

What Is Next On Our Data Journey?

Okay, so what next? We have a lot of data and we do a lot with data, and that is not going to change. Building, maintaining, processing, and using data is, however, getting harder every minute, and metadata can save us. The caveat is that it must not be metadata simply sitting in a silo somewhere, but rather metadata actively used by people and machines in everything they do.

Let’s put this even more bluntly. For decades, organizations have collected metadata, forcing or begging users to use their enterprise metadata repository, data catalog (or another industry buzzword) – and epically failed with all metadata-related projects. Now, we may finally start to understand that for metadata to succeed, it must be invisible, intelligent, and smoothly integrated into the lives of people and machines that benefit from its power. That is the true promise of active metadata.
