Need Data Fast or Need the Truth?: Google v Wikipedia

You may have heard people say that data is the new currency, but they’re wrong. Information has always existed and has always been valuable; currency is relatively new. So really, currency is the new data.

What has changed is that data is now created, captured, and retained as our society operates; seemingly every action creates data. But what should we do with this data? How do we derive information and value rather than just get deafened by data noise?

One great example of how to make use of this sea of data is Google. Google took a very important subset of data, that which is on the World Wide Web, and made it searchable. Most importantly, Google allowed any person to search and find a set of sites that might be able to answer their questions. Over time, search capabilities have improved dramatically based on people’s interactions, and the volume of accessible data has grown exponentially. In Figure 1 you can see the diversity of products and links for “semantic graphs”.

Figure 1: Google Search

In the context of your organization, this translates to enabling users to find and query as much of the existing data as possible. This is where capabilities like data lakes (physical or virtual), data warehouses, and business intelligence come in. Coupled with data catalogs and data glossaries, these let users query, join, and explore data freely. Establishing this foundational capability is an important step and well worth the effort.

Another difficult problem is that data itself is often only an approximation of the “ground truth.” In other words, sometimes data doesn’t capture everything, or it has errors and omissions that, if not fixed, inhibit understanding. On the World Wide Web, a great example of overcoming this challenge is Wikipedia. Wikipedia has created a framework, process, and tools that let people curate data on a subject, turning web information into a more complete understanding of the topic. A set of volunteers oversees the process using this framework. Like any data management process, it is neither perfect nor ever “done,” but it is a great improvement over the alternative of sifting through reams of raw data. In Figure 2 you can see the depth of data for “semantic graph”. Importantly, it also includes the references used, allowing the user to traverse the full lineage of the data.

Figure 2: Wikipedia

The parallel in your organization is applying data governance and data curation programs to data management processes. Capabilities like data glossaries, data cleansing, and data governance become crucial. But to deliver value, the core objective is ensuring that people understand “what is good data” and actively working to improve that data. Some data will be crucial or valuable enough that it will be worthwhile to add manual curation processes to bring it closer to truth; other data will be sufficiently improvable through a framework of automation. An important feature of this approach is that all the data is still searchable. Some data consumer use cases require the unimproved (unchanged) data, while other data consumers will want to wait for the improvements. On the web, the analog is that sometimes Google will give you a good enough answer to your question, and sometimes you really need Wikipedia to locate curated content.
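To make the idea concrete, here is a minimal sketch in Python of a store that keeps the unchanged data next to an improved version and lets each consumer choose which one to read. Everything in it is hypothetical and chosen only for illustration: the names CuratedStore, Record, and auto_clean, the cleansing rule itself, and the sample record do not come from any particular data lake or governance product.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Record:
    raw: dict                       # data exactly as captured; never modified
    curated: Optional[dict] = None  # improved version, once one exists


def auto_clean(raw: dict) -> dict:
    """Hypothetical automated rule: trim and title-case every string field."""
    return {k: v.strip().title() if isinstance(v, str) else v
            for k, v in raw.items()}


class CuratedStore:
    """Toy store in which raw and curated data are both retrievable."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}

    def ingest(self, key: str, raw: dict) -> None:
        # Copy the input so later changes by the caller cannot alter the raw data.
        self._records[key] = Record(raw=dict(raw))

    def curate(self, key: str,
               improve: Callable[[dict], dict] = auto_clean) -> None:
        # `improve` can be an automated rule or a manual, human-supplied fix.
        record = self._records[key]
        record.curated = improve(record.raw)

    def get(self, key: str, prefer_curated: bool = True) -> dict:
        # Fall back to raw data when no curated version exists yet.
        record = self._records[key]
        if prefer_curated and record.curated is not None:
            return record.curated
        return record.raw


store = CuratedStore()
store.ingest("customer-1", {"name": "  ada lovelace ", "visits": 3})
store.curate("customer-1")                              # automated improvement
print(store.get("customer-1"))                          # curated view
print(store.get("customer-1", prefer_curated=False))    # unchanged view
```

The design choice worth noting is that curation never overwrites the raw record. A consumer who needs the unimproved data can still get it, and a consumer who asks for curated data before any exists simply receives the raw version.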

This article’s title is misleading: there simply is no Google “v” Wikipedia, and the best approach is to use the two together. This is why, as in the example in Figure 1, search engines present Wikipedia links prominently. Unfortunately, organizations often lose sight of this and think of data management and data access projects as separate or, worse, in conflict with each other. The ability to distribute data easily and the work of making data closer to truth are each important, and they are incredibly powerful when done together. Fortunately, in today’s data architectures it is possible either to use managed data as part of the data lake or, even more powerfully, to integrate data management directly into the layers of a data lake. Given the flexibility of modern lakes with tools like virtualization, it is even possible to leverage existing data management tools. Today there are really no technical blockers to improving data while making it available.