Data Terminology Context
I write this because I have seen too many conversations between business and technical stakeholders confused by the ever-evolving terminology around data. This unnecessarily leads to annoyance, eye rolling, and mistrust in data. One caveat: people often talk about these as “industry standard” terms, but every enterprise team I have worked with draws slightly different boundaries around the items described. This is simply because every organization has different needs and strengths, emphasizing different capabilities. So what I am capturing here is a description of the intent behind each term, to help technical teams and business stakeholders build a more common vocabulary.
There is so much terminology around data that a novel or two could be written; here I’ll focus on a few key terms used to describe data being stored. For example:
- Data Warehouse
- As organizations set up business applications, they ended up with many data sources but no easy way to create reports across all of them.
- To enable cross-application reporting, IT departments built a “data warehouse” (i.e. a database) where enterprise data is modelled and stored to support reporting.
- Data Marts
- As warehouses grew, having a single model for all data turned out not to be sufficient. So warehouses were extended by adding databases to support specific domains (e.g. sales, customers, incidents, …), making analysis in these specific areas more effective.
- Data Lake
- As new types of data sources appeared, including “unstructured data” (e.g. PDFs, social media, …), and data science advanced, it became important to explore data prior to adding it to warehouses. To support this exploration, “Data Lakes” were created as a place to hold data from many sources along with rich metadata.
- To better manage data, lakes developed zones for cleaning and deriving data, which replaced the functionality of the staging area and the data warehouse.
- Data Fabric
- As data lakes became a source not only of business insights but of operational business data, fabrics have evolved to allow consistent use of data via APIs and data sets.
- A goal of data fabrics is to utilize the same data everywhere (i.e. create once, use everywhere). This sometimes involves creating virtual lakes, but the functionality of the lake / warehouse / mart is still important.
Better enterprise-wide reporting
Warehouses were created to allow reporting across data sources. The processes for ingesting and updating data were typically built around a specific reporting cadence, daily being very common. Staging areas were built to manage loading data into the warehouse, handling incremental data loads and connectivity issues. The staging areas would also hold data in case changes needed to be rolled back in the warehouse. Reporting was often done directly from the warehouse. Figure 1: Data Warehouse v1 shows the typical pieces of an early data warehouse.
Figure 1: Data Warehouse v1
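To make the staging-to-warehouse flow concrete, here is a minimal sketch using SQLite. The table names, columns, and daily cadence are illustrative assumptions, not from any particular product:

```python
import sqlite3

# Hypothetical sketch: rows land in a staging table first, then are
# merged into the warehouse table on the reporting cadence.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, amount REAL, load_date TEXT);
    CREATE TABLE warehouse_orders (order_id INTEGER PRIMARY KEY, amount REAL, load_date TEXT);
""")

# A day's incremental extract arrives in staging...
conn.executemany(
    "INSERT INTO staging_orders VALUES (?, ?, ?)",
    [(1, 100.0, "2024-01-01"), (2, 250.0, "2024-01-01")],
)

# ...and is merged into the warehouse. Keeping the staging rows around
# makes it possible to re-run or roll back a bad load.
conn.execute("""
    INSERT OR REPLACE INTO warehouse_orders
    SELECT order_id, amount, load_date FROM staging_orders
""")
count = conn.execute("SELECT COUNT(*) FROM warehouse_orders").fetchone()[0]
```

The key design point is that staging and the warehouse are separate stores, so a failed or incorrect load never corrupts what reports see.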
As the value of the warehouse data became clear, it also became clear that adding specialized versions of the data, known as data marts, would be very helpful. For example, data organized by sales, by region, by product, etc. This made deeper analytics and data science easier to accomplish. Data marts also became storage for low-latency API requests for data (i.e. “real-time” consumption of data). Figure 2: Data Warehouse v2 – Marts shows where the data marts are added.
Figure 2: Data Warehouse v2 – Marts
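A data mart is essentially a specialized, pre-shaped view of warehouse data. This sketch derives a hypothetical “sales by region” mart from warehouse fact rows; the fields and values are made up for illustration:

```python
from collections import defaultdict

# Hypothetical warehouse fact rows (region and product are illustrative).
warehouse_sales = [
    {"region": "East", "product": "Widget", "amount": 100.0},
    {"region": "East", "product": "Gadget", "amount": 50.0},
    {"region": "West", "product": "Widget", "amount": 75.0},
]

# The mart pre-aggregates by region so domain questions are answered
# quickly, without scanning the full warehouse each time.
sales_mart = defaultdict(float)
for row in warehouse_sales:
    sales_mart[row["region"]] += row["amount"]

print(dict(sales_mart))  # {'East': 150.0, 'West': 75.0}
```

Because the mart is already shaped for one domain, it can also sit behind a low-latency API for the “real-time” consumption mentioned above.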
Enter the Data Lakes
Supporting Advanced Analytics
As data science evolved, data scientists found that exploring data in the staging area provided value. Because of this, staging data started to be retained after loads, covering longer time frames. Also, as new source types became useful, it was important for users to be able to explore the data before even loading it into the warehouse, so data started to be “staged” even when not needed in the warehouse or marts. At this point a “pre-staging area” called the “Data Lake” was created.
Finally, since data being explored may not immediately be added to the warehouse, it is vital to actively collect metadata so future users can understand what’s in the lake. Figure 3: Data Lake v1 – Raw Data Retention shows a typical set of components for an early data lake implementation.
Figure 3: Data Lake v1 – Raw Data Retention
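The habit of collecting metadata at ingestion time can be sketched as a tiny function. The function name, fields, and source name here are hypothetical; real lakes use object stores and catalog services, but the principle is the same, never land data without metadata:

```python
import hashlib
from datetime import date

def ingest_to_raw_zone(source_name: str, payload: bytes) -> dict:
    """Store data in the raw zone and capture catalog metadata.

    A hypothetical sketch: the catalog entry is what lets a future
    user discover and trust this data long after it landed.
    """
    entry = {
        "source": source_name,
        "ingested_on": date.today().isoformat(),
        "size_bytes": len(payload),
        "checksum": hashlib.sha256(payload).hexdigest(),
    }
    # raw_zone.write(payload) would go here in a real implementation
    return entry

meta = ingest_to_raw_zone("crm_export", b'{"customer": "Acme"}')
```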
Data lakes have now evolved into a set of data zones holding persisted data in specific states. This enables managing / curating data and reconciling it at any point in time.
Each zone is persisted, but the zones can be virtual (e.g. not necessarily a single data store).
All data is moved along the zones of the data lake via the data movement layer. During movement, a single action can persist data in multiple zones / areas so the work can be done in parallel, enabling real-time use cases.
All data within the data lake should be cataloged to make it more actionable for business users. At this point the “warehouse” and “data marts” seem to disappear, but really their capability is captured in the zones of the data lake. Also, as lakes evolved, master data and reference data management were added to ensure connections between multiple sources could best be understood.
Figure 4: Data Lake v2 – Zones of Data shows a modern approach to the data lake.
Figure 4: Data Lake v2 – Zones of Data
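The flow through zones described above can be sketched in a few lines. The zone names and cleaning steps here are illustrative assumptions; organizations pick their own names and rules:

```python
# Hypothetical zone pipeline: each zone persists its own copy of the
# record so any state can be reconciled later. "raw", "cleaned", and
# "curated" are common zone names, but they vary by organization.
zones = {"raw": [], "cleaned": [], "curated": []}

def move_through_zones(record: dict) -> dict:
    zones["raw"].append(record)  # land data exactly as received
    cleaned = {k: v.strip() if isinstance(v, str) else v
               for k, v in record.items()}  # fix obvious data issues
    zones["cleaned"].append(cleaned)
    # the curated zone derives business-ready fields
    curated = {**cleaned, "full_name": f"{cleaned['first']} {cleaned['last']}"}
    zones["curated"].append(curated)
    return curated

result = move_through_zones({"first": " Ada ", "last": "Lovelace"})
```

Because every zone keeps its copy, a curated value can always be traced back to the raw record it came from.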
Beyond Data Lakes
Empowering the data driven enterprise
As data lakes are stood up and start creating insights, even greater value is found by making that data more readily available both operationally and analytically. This is where data fabrics come into play.
Data lakes are still an important part. Fabrics rely on strong governance and consistent data interfaces for both operational and analytical use of data. The goal of a fabric is to create data once and use it consistently everywhere. To this end, even more importance falls on data mastering, catalogs, glossaries, and governance tools. Adding more data virtualization often becomes desirable to limit copies of data that can fall out of sync.
Figure 5: From Data Lake to Data Fabric shows how the lake is becoming part of a more flexible data architecture.
Figure 5: From Data Lake to Data Fabric
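One way to picture “create once, use everywhere” is a single governed dataset exposed through consistent interfaces for both kinds of consumers. Everything in this sketch (function names, fields, customer records) is hypothetical:

```python
# Hypothetical fabric-style access layer: one curated, governed dataset,
# served consistently to operational (single-record) and analytical
# (bulk) consumers, instead of each team keeping its own copy.
_curated_customers = {
    "C001": {"id": "C001", "name": "Acme Corporation", "tier": "gold"},
    "C002": {"id": "C002", "name": "Globex Inc", "tier": "silver"},
}

def get_customer(customer_id: str) -> dict:
    """Operational, low-latency lookup (e.g. behind a REST endpoint)."""
    return _curated_customers[customer_id]

def customer_dataset() -> list[dict]:
    """Analytical, bulk access to the same governed data."""
    return list(_curated_customers.values())
```

Because both paths read the same store, there is no second copy to drift out of sync, which is the core promise of the fabric.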
What is “Mastered”?
Master Data Management (MDM) is a core capability affecting data usage. At its core, it is about identifying the people, products, organizations, etc. that appear in multiple sources. It then enables finding the overlaps and duplicates as data is brought together. It is an important piece of getting the most value out of data.
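The matching at the heart of MDM can be sketched with simple string similarity. The source names, records, and threshold below are hypothetical; real MDM tools use far richer matching rules, but the idea of scoring likely duplicates across sources is the same:

```python
from difflib import SequenceMatcher

# Hypothetical customer names from two source systems.
crm_customers = ["Acme Corporation", "Globex Inc"]
billing_customers = ["ACME Corp.", "Initech LLC"]

def likely_matches(a_list, b_list, threshold=0.6):
    """Score every cross-source pair; keep the ones that look like
    the same real-world entity."""
    matches = []
    for a in a_list:
        for b in b_list:
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                matches.append((a, b, round(score, 2)))
    return matches

pairs = likely_matches(crm_customers, billing_customers)
```

Here only “Acme Corporation” / “ACME Corp.” clears the threshold, which is exactly the overlap an MDM process would flag for stewardship review.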
Table 1: Master Data Management Uses by Component lists how MDM is used by each of the components described earlier.
Table 1: Master Data Management Uses by Component
What’s the impact of all of this?
As business processes went online, warehouses enabled finding new insights, but they took time to set up and were difficult to change. Moving to marts sped up many reporting insights.
Then, as less structured data and cloud connections became important, data lakes became important for providing flexibility: both to explore potential sources of data and to provide a place to keep unstructured data while determining the best way to extract information. Also, with data lakes, collecting metadata to create catalogs and data glossaries became even more advantageous. However, cleaning and managing data still takes time.
As more systems became hungry for data, a more proactive approach to governing data has become necessary; this is leading to data fabrics, which allow data to be used while it is still being integrated and make the contents of mastered data lakes more readily available. However, this flexibility makes it critical that business users directly participate in curation / management to ensure data is fit for purpose.
I do hope this short history and definition of terms helps. Please feel free to reach out if you have questions or have other subjects around data you think would help others.
Thank you and all the best,
your friendly Data Pundit