There is a lot of talk about data these days: structured, unstructured, governance, security, lineage, silos, integration. Whatever the area, data is a hot topic, and billions of dollars are spent each year on data-related projects. But the problem I keep seeing across the firms I work with is that they focus on one particular area of “data” and do not address the elephant in the room: their current data flows are a mess, and they neither understand nor appreciate the full extent of the problem. If a data lineage exists, it is often out of date the moment it is documented.
It’s a common problem, and for a lot of good, legacy reasons, not the least of which is a persistent lack of consistency in interface design and development across companies, groups, projects, and developers. Fortunately, this is a problem that can be solved with modern technologies.
Large companies can have hundreds, even thousands, of interfaces that move data throughout their organization. The effort to document data flows and create an initial data lineage can be massive and intimidating. Companies, if they are even willing to undertake the task, often spend months piecing together the applications, databases, interfaces, and reports that use their data. Eventually, they produce a documented flow of their data, which gives them their data lineage. Often, the VERY NEXT DAY, that data lineage is out of date. So the cycle of updating the data lineage documentation begins, and it never ends. In any organization this constant change is normal, and it is often a sign of good health, since change can be a sign of adaptation and competitiveness.
Companies often ask themselves how they can best keep their data lineages accurate and evergreen without the high overhead and manual effort usually associated with doing so. It is the right question to ask. As data-as-an-asset becomes more prevalent in company thinking, data lineage and data understanding become even more critical. A company’s data, and its understanding of what that data is, how it flows, and what it means, have become linked to the company’s survival. Companies that do not manage their data well are at a higher risk of being overtaken by competitors that do.
The answer, of course, is to eliminate the manual upkeep of data lineage documentation and to put computers in charge of seeing, understanding, and documenting data lineage. As with any repetitive task that requires little judgement, computers can do it better, faster, and cheaper. Enter Automated Data Lineage.
Automated lineage generation is a real and attainable goal using today’s technologies. There are different flavors of data lineage and different ways that technologies have approached it. Metadata-driven lineage and data-level lineage are common. Many recent, advanced data lineage applications crawl across an organization’s infrastructure to piece together metadata, often by querying the databases or data stores where that metadata lives. The problems with that approach are threefold: 1. it rarely captures all interconnections between systems, 2. it doesn’t solve the mess of interfaces and integration points that exist within a company, and 3. its results, too, are out of date shortly after the crawl is run.
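To make the first and third problems concrete, here is a minimal sketch of metadata-driven crawling. It is deliberately simplified: the hypothetical `crawl_metadata` function reads only declared foreign keys from an in-memory SQLite database, whereas real lineage crawlers inspect far more than that. The table names and the ad hoc load are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id)
);
CREATE TABLE churn_report (customer_id INTEGER, score REAL);
""")


def crawl_metadata(connection):
    """Return (upstream_table, downstream_table) pairs found in the catalog.

    Simplification: only declared foreign keys are treated as relationships.
    """
    tables = [
        row[0]
        for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]
    found = set()
    for table in tables:
        for fk in connection.execute(f"PRAGMA foreign_key_list('{table}')"):
            # Column 2 of each result row is the referenced (parent) table.
            found.add((fk[2], table))
    return found


# An ad hoc load moves data from orders into churn_report. Nothing in the
# catalog records that this flow happened.
conn.execute("INSERT INTO orders (id, customer_id) VALUES (1, NULL)")
conn.execute("INSERT INTO churn_report SELECT customer_id, 0.5 FROM orders")

edges = crawl_metadata(conn)
print(edges)
```

The crawl finds the declared link between `customers` and `orders`, but the flow from `orders` into `churn_report` leaves no trace in the metadata, so it never appears in the result.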
The most effective route to true automated data lineage is to tie data lineage to data integration. Data lineage, by definition, is the ownership and flow of data across an organization. Sourcing data lineage primarily from databases and data stores does not give you data lineage; it gives you a picture of data-at-rest and a starting point for understanding what the data lineage MIGHT look like. The moment someone develops a new interface, a new report, or a user-developed system in MS Excel or Access, the data lineage is no longer accurate, even within these “automated” tools.
Data lineage, done correctly, should be created and tracked through data integration tools, and there is no excuse for true data integration tools not to provide integrated, automated lineage capabilities in their products. These tools are the brains of data flows throughout companies, and they are in the best position to see and understand a company’s data lineage. Other tools are just outsiders trying to peer through the data window to see and understand what is going on inside. But they are not on the inside. The data integration tools are on the inside: they have intimate knowledge of how data is moving and how and where it is being used. That is where automated data lineage is best handled.
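As a rough illustration of lineage captured from the inside, the sketch below models a hypothetical integration hub that records a lineage edge every time it moves data. The class names (`IntegrationHub`, `LineageEdge`), the interface names, and the system names are all assumptions made up for this example; no particular product works exactly this way.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class LineageEdge:
    source: str
    target: str
    interface: str
    recorded_at: datetime


@dataclass
class IntegrationHub:
    """Hypothetical integration tool: every data movement goes through run()."""

    edges: list[LineageEdge] = field(default_factory=list)

    def run(
        self,
        interface: str,
        source: str,
        target: str,
        rows: list[dict],
        transform: Callable[[dict], dict],
    ) -> list[dict]:
        moved = [transform(row) for row in rows]
        # Lineage is recorded as a side effect of the movement itself,
        # so it cannot drift away from what actually ran.
        self.edges.append(
            LineageEdge(source, target, interface, datetime.now(timezone.utc))
        )
        return moved

    def upstream_of(self, target: str) -> set[str]:
        """All systems that feed `target`, directly or indirectly."""
        found: set[str] = set()
        stack = [target]
        while stack:
            node = stack.pop()
            for edge in self.edges:
                if edge.target == node and edge.source not in found:
                    found.add(edge.source)
                    stack.append(edge.source)
        return found


hub = IntegrationHub()
staged = hub.run(
    "crm_to_dw", "crm.customers", "dw.dim_customer",
    [{"id": 1, "name": "Acme"}], lambda row: {**row, "loaded": True},
)
hub.run(
    "dw_to_report", "dw.dim_customer", "reports.churn",
    staged, lambda row: {"id": row["id"], "score": 0.5},
)
print(hub.upstream_of("reports.churn"))
```

Because the edge is written at the moment the interface runs, a new interface shows up in the lineage the first time it executes, with no separate crawl or documentation step.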
Automated data lineage is an essential component of any modern data strategy. It eliminates manual documentation of lineage, helps reduce business risk, and stays evergreen and current for when you need it. If you don’t have your data lineage automated, get to it!