Return of the Metadata Bubble
The bubble around metadata in BI is back – with all its previous sins and even more just around the corner. [LONG READ]
In my view, 2016 and 2017 are definitely the years for metadata management and data lineage specifically. After the first bubble 15 years ago, people were disappointed with metadata. A lot of money was spent on solutions and projects, but expectations were never met (usually because they were not established realistically, as with any other buzzword at its start). Metadata fell into damnation for many years.
But if you look around today, visit a few BI events, and read some blog posts and comments on social networks, you will see metadata everywhere. How is it possible? Simply because metadata has been reborn through the bubble of data governance associated with big data and analytics hype. Could you imagine any bigger enterprise today without a data governance program running (or at least in its planning phase)? No! Everyone is talking about a business glossary to track their Critical Data Elements, end-to-end data lineage is once again the holy grail (but this time including the Big Data environment), and we get several metadata-related RFPs every few weeks.
Don’t get me wrong, I’m happy about it. I see proper metadata management practice as a critical success factor for any initiative around data. With huge investments flowing into big data today, it is even more important to have proper governance in place. Without it, chaos and lost money – not additional revenue – will be the only outcome of big (and small) data analytics. My point is that even if everything looks promising on the surface, I feel a lot of enterprises have taken the wrong approach. Why?
A) No Numbers Approach
I have heard so often that you can’t demonstrate with numbers how metadata helps an organisation. I couldn’t disagree more. Always start to measure efficiency before you start a data governance/metadata project. How many days does it take, on average, to do an impact analysis? How long does it take, on average, to do an ad-hoc analysis? How long does it take to get a new person onboard – a data analyst, data scientist, developer, architect, etc.? How much time do your senior people spend analysing incidents and errors from testing or production and correcting them? My advice is to focus on one or two important teams and gather data for at least several weeks, or better yet, months. If you aren’t doing it already, you should start immediately.
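The baseline measurement described above can be sketched very simply. The task names and hour figures below are made up for illustration – the point is only that a plain log of durations, kept for a few weeks, already yields the “before” numbers for a business case.

```python
from statistics import mean

# Hypothetical log of task durations in hours, gathered over several weeks.
# All task names and numbers are illustrative, not real measurements.
task_log = {
    "impact_analysis": [16, 24, 40, 12, 32],
    "ad_hoc_analysis": [4, 6, 3, 8, 5],
    "incident_triage": [10, 20, 14, 8],
}

def baseline_report(log):
    """Average duration per task type - the 'before' numbers for a business case."""
    return {task: round(mean(hours), 1) for task, hours in log.items()}

print(baseline_report(task_log))
# → {'impact_analysis': 24.8, 'ad_hoc_analysis': 5.2, 'incident_triage': 13.0}
```

Rerunning the same report after the project gives you the before/after comparison the business case needs.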
You should also collect as many “crisis” stories as you can. Such as when a junior employee at a bank mistyped an amount in a source system and a bad $1 000 000 transaction went through. A team of three then spent three weeks tracking it from its source to all its targets and making corrections. Or when a finance company refused to give a customer a big loan and he came to complain five months later. What a surprise when they ran simulations and found out that they were ready to approve his application. A team of two spent another five weeks trying to figure out what exactly had happened, only to discover that the risk algorithm in use had been changed several times over the previous few months. When you factor in the bad publicity related to this incident, your story is more than solid.
Why all this? Because the numbers you gather will let you build a business case and, compared with post-project numbers, demonstrate efficiency improvements; and those well-known, terrifying stories that caused your organisation so much trouble will be your “never want it to happen again” memento.
B) Big Bang Approach
I saw several companies last year that started too broad and expected too much in a very short time. When it comes to metadata and data governance, your vision must be complex and broad, but your execution should be “sliced” – the best approach is simply to move step by step. Data governance usually needs some time to demonstrate its value in reduced chaos and better understanding between people in a company. It is tempting to spend a budget quickly, to implement as much functionality as possible, and to hope for great success. In most cases, however, it becomes a huge failure. Many good resources are available online on this topic, so I recommend investing your time to read and learn from others’ mistakes first.
I believe that starting with the several critical data elements used most often is the best strategy. Define their business meaning first, then map your business terms to the real world and use an automated approach to track your data elements at both a business and technical level. When the first small set of your data elements is mapped, do your best to show their value to others (see the previous section about how to measure efficiency improvements). With that success behind you, your experience with other data sets will be much smoother and easier.
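The term-to-structure mapping described above can be modelled with nothing more than a small glossary structure. All term names, definitions, and column paths below are hypothetical, chosen only to show the shape of the mapping:

```python
# A minimal sketch of a business glossary mapping a few critical data elements
# to their physical locations. All names here are made up for illustration.
glossary = {
    "Customer ID": {
        "definition": "Unique identifier of a customer across all systems.",
        "mapped_to": ["crm.customers.cust_id", "dwh.dim_customer.customer_key"],
    },
    "Loan Amount": {
        "definition": "Approved principal of a loan contract.",
        "mapped_to": ["loans.contracts.principal", "dwh.fact_loan.amount"],
    },
}

def where_is(term):
    """Return every physical column a business term is mapped to."""
    return glossary.get(term, {}).get("mapped_to", [])

print(where_is("Customer ID"))
# → ['crm.customers.cust_id', 'dwh.dim_customer.customer_key']
```

Even this toy structure answers the two questions a glossary exists for: what does a term mean, and where does it physically live.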
C) Monolithic Approach
You collect all your metadata and data governance related requirements from both business and technical teams, include your management and other key stakeholders, prepare a wonderful RFP, and share it with all vendors from the top right of the Gartner Data Governance quadrant (or the Forrester Wave if you like it more). You meet well-dressed salespeople and pre-sales consultants, see amazing demonstrations and marketing papers, hear a lot of promises about how all your requirements will be met, pick a solution you like, implement it, and earn your credit. Prrrrr! Wake up! Marketing papers lie most of the time (see my other post on this subject).
Your environment is probably very complex, with hundreds of different and sometimes very old technologies. Metadata and data governance is primarily an integration initiative. To succeed, business and IT have to be brought together – people, systems, processes, technologies. You can see how hard it is, and you may already know it! To be blunt, there is no single product or vendor covering all your needs. Great tools are out there for business users with compliance perspectives, such as Collibra or Data3Sixty; more big data friendly information catalogs, such as Alation, Cloudera Navigator, or Waterline Data; and technical metadata managers, such as IBM Governance Catalog, Informatica Metadata Manager, Adaptive, or ASG. Each one of them, of course, overlaps with the others. Smaller vendors then also focus on specific areas not covered well by the other players, such as MANTA, with its unique ability to turn your programming code into both technical and business data lineage and integrate it with other solutions.
Metadata is not an easy beast to tame. Don’t make it worse by falling into the “one-size-fits-all” trap.
D) Manual Approach
I meet a lot of large companies ignoring automation when it comes to metadata and data governance. Especially with big data. Almost everyone builds a metadata portal today, but in most cases it is only a very nice information catalog (the same sort you can buy from Collibra, Data3Sixty, or IBM) without proper support for automated metadata harvesting. The “how to get metadata in” problem is solved in a different way. How? Simply by setting up a manual procedure – whoever wants to load a piece of logic into the DWH or data lake has to provide associated metadata describing meaning, structures, logic, data lineage, etc. Do you see how tricky this is? On the surface, you will have a lot of metadata collected, but every bit of information is not reality – it is a perception of reality, and only as good as the information input by a person. What is worse, it will cost you a lot of money to keep it synchronised with the real logic through all updates, upgrades, etc. The history of engineering tells us one fact clearly – any documentation created and maintained manually, especially documentation that is not an integral part of your code/logic, is out of date the moment it is created.
Sometimes there is a different reason for harvesting metadata manually – typically when you choose a promising DG solution, but it turns out that a lot is missing. Such as when your solution of choice cannot extract metadata from programming code and you end up with an expensive tool without the important pieces of your business and transformation logic inside. Your only chance is to analyse everything remaining by hand, and that means a lot of expense and a slow, error-prone process.
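To make the idea of automated harvesting concrete, here is a deliberately naive sketch that pulls source and target tables out of an `INSERT ... SELECT` statement. Real lineage tools parse full SQL grammars, resolve aliases, and handle column-level lineage; this regex handles only the simplest case, and the table names are invented for the example:

```python
import re

# A deliberately naive sketch of automated lineage harvesting: pull the target
# and source tables out of an INSERT ... SELECT statement with regular
# expressions. Production tools parse the full SQL grammar instead.
def extract_lineage(sql):
    target = re.search(r"INSERT\s+INTO\s+([\w.]+)", sql, re.IGNORECASE)
    sources = re.findall(r"(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE)
    return {"target": target.group(1) if target else None, "sources": sources}

sql = """
INSERT INTO dwh.fact_loan
SELECT c.contract_id, c.principal
FROM loans.contracts c
JOIN crm.customers cu ON cu.cust_id = c.cust_id
"""
print(extract_lineage(sql))
# → {'target': 'dwh.fact_loan', 'sources': ['loans.contracts', 'crm.customers']}
```

Even this toy version shows why automation beats manual documentation: it is rerun on every code change, so the lineage can never drift out of date the way hand-written documentation does.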
Most of the time I see a combination of a), c), and d), and in rare cases also b). Why is that? I do not know. I have plenty of opinions, but none of them have been substantiated. One thing is for sure: we are doing our best to kill metadata, yet again. This is something I am not ready to accept. Metadata is about understanding, about context, about meaning. Companies like Google and Apple have known it for a long time, which is why they win. The rest of the world is still catching up, with compliance and regulations being the most important reasons why large companies implement data governance programs.
I am asking every single professional out there to fight for metadata: to explain that measuring is necessary and easy to implement, that small steps are much safer and easier to manage than a big bang, that an ecosystem of integrated tools provides greater coverage of requirements than a huge monolith, and that automation is possible.
Tomas Kratky is the CEO of MANTA and this article was originally published on his LinkedIn Pulse. Let him know what you think on firstname.lastname@example.org.