For the past decade, we’ve been turning data processing upside down. From collecting historical data to build reports and insights about the past, we have progressed to the big-data era, where we collect and process every data record we can get our hands on without actually knowing how we will use it. All of this is done in the hope that future generations of smarter algorithms will help us drive our businesses forward. But all those efforts are overshadowed by the ethical question we should be asking ourselves: how do we make sure we’re using this data for good?
Walking down the path of data infrastructure evolution, you will notice that it gets more complex with every step. It is not the data that causes problems for us; it’s how we handle and process it. Over time we have built massive data pipelines combining everything one can imagine: batch, real-time, streams, microservices, cloud, NoSQL, AI/ML. All of this to get more valuable insights that drive our business while making sure, and being able to prove, that we do it in an ethical and responsible way, both when building new pipelines and when reviewing and vetting existing ones.
And yet, for all these years, we have had no tools that could help us fully understand and control this complexity. It feels like there’s always something we are missing, something we can’t put our finger on. Our blind spots. And they come at a price: missed opportunities (failing to derive more business value from data), wasted investments (billions spent building uncontrollable data infrastructures), and frustration on both the business and technology sides.
Key to our evolution with data is, no surprise, metadata. But not just any metadata. We need to be selective about what metadata we use, and we need to be smart about how we use it. Metadata has been with us for ages but has never really delivered on its promise. Part of the problem is that we simply decided to collect it without thinking about whether and how it might be useful, much as we did with data at the beginning of the big-data hype.
To maintain an overview of our data, we look at basic metadata such as data profiles (type of data, business classification, quality score, etc.) or operational characteristics (who accesses the data, how often, which data sets are popular, etc.). While these pieces of information are interesting and useful, they are static and don’t cast any light on the blind spots of complex data pipelines. True power comes with a detailed understanding of data lineage. True data lineage, not what is often mistaken for it.
Thanks to vendors offering only very rudimentary data lineage capabilities and trying to hide their deficiencies, most people see data lineage as information about the source of data and its journey from one table to another, up to reports or dashboards. But true data lineage is far more than that. Data lineage represents a detailed map of all direct and indirect dependencies among data entities in the environment. Why is that so important?
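To make the distinction concrete, here is a minimal sketch of lineage as a dependency graph rather than a simple source-to-report journey. The table names and edges are hypothetical; the point is that a traversal surfaces *indirect* dependencies that a linear view hides.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: each pair says "target depends on source".
# Only direct dependencies are recorded; indirect ones emerge by traversal.
EDGES = [
    ("orders_raw", "orders_clean"),
    ("customers_raw", "customers_clean"),
    ("orders_clean", "revenue_by_customer"),
    ("customers_clean", "revenue_by_customer"),
    ("revenue_by_customer", "quarterly_dashboard"),
]

def build_graph(edges):
    """Adjacency list: source -> entities that consume it directly."""
    graph = defaultdict(list)
    for source, target in edges:
        graph[source].append(target)
    return graph

def downstream(graph, entity):
    """All entities directly or indirectly affected by a change to `entity`."""
    seen, queue = set(), deque([entity])
    while queue:
        node = queue.popleft()
        for consumer in graph[node]:
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

graph = build_graph(EDGES)
print(sorted(downstream(graph, "orders_raw")))
# → ['orders_clean', 'quarterly_dashboard', 'revenue_by_customer']
```

A change to `orders_raw` ripples through to the dashboard two hops away; those transitive dependencies are exactly the blind spots a table-to-table view misses.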
A detailed dependency map like this changes the whole “complexity game” for you and provides a panoramic overview of your data landscape. And yes, data lineage also gives you information about the journey of data, helps you with regulatory compliance, and allows you to chase down and fix data incidents in seconds.
But even data lineage must be activated to become truly beneficial for an organization that wants to stay relevant in today’s fast-paced environment. What is activation? In layman’s terms, it means having automated programs take the data lineage information that is just sitting somewhere and turn it into actions. Examples?
The underlying theme here is that to beat the complexity we deal with, we need machines, automation, and intelligence to help us. We have customers complaining about how complex their data lineage is even when they visualize only a low level of detail. And I am not surprised: with such massive data systems as they have built, there is a limit to what a human brain can process. So, if you ask me what 2022 (and probably also 2023, 2024, 2025, etc.) will be about in data, I would bet on smart activation of carefully selected types of metadata to make life easier for data professionals, ensure we work in an ethically responsible way, and make data more valuable and useful to organizations across the globe. Cheers to that!