As data volumes grow and the technologies needed to process them multiply, organisations are turning to more advanced data architectures to meet new needs. One of these is the Medallion architecture, a layered design pattern that fits naturally with the data lakehouse approach and is intended to improve data quality step by step.
The amount of data continues to grow every year. According to figures cited by Forbes (2023), the total volume of data worldwide is expected to grow from 64.2 zettabytes in 2020 to 181 zettabytes in 2025.
This rapid growth is putting the focus on disciplines such as data governance and data quality. The more data we have, the harder it becomes to manage and exploit. At the same time, turning data into business insights depends less on the quantity of data than on its quality. In a context of information overload, it is understandable that data quality policies are becoming more relevant.
Companies are trying to solve this puzzle with flexible data architectures that let them adopt new technologies and approaches to data management as needs arise, which is essential to keep up with a changing environment. This flexibility also makes it possible to adapt more quickly to market transformations and new customer demands.
In line with this, a new approach has recently become popular: the Medallion architecture, which not only fits in with flexible data architectures but is also designed to improve the quality of the data as it is processed.
Before explaining what a Medallion data architecture is and how it works, it is useful to introduce two related concepts: data mesh and data lakehouse.
What is a data mesh?
Data mesh is an approach to data architecture designed to bring flexibility to data management.
The main premise of the data mesh approach is to treat data as products, assigning responsibility for particular data domains to specific teams. This decentralises ownership and ensures that teams have a better understanding of the data they produce. Data is delivered as data products, while the supporting infrastructure is provided through a shared platform.
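As a rough illustration of domain-owned data products, the sketch below models a product descriptor and a shared catalogue in plain Python. The names `DataProduct` and `Catalog`, their fields and the example domains are all hypothetical; data mesh is an organisational approach and does not prescribe any particular API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor for a data product owned by one domain team."""
    name: str
    domain: str
    owner_team: str
    schema: dict  # column name -> type name


class Catalog:
    """Hypothetical shared registry where domain teams publish their products."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        key = (product.domain, product.name)
        if key in self._products:
            raise ValueError(f"{key} is already registered")
        self._products[key] = product

    def by_domain(self, domain):
        # Names of all products published by one domain, in alphabetical order.
        return sorted(p.name for p in self._products.values() if p.domain == domain)

    def owner_of(self, domain, name):
        return self._products[(domain, name)].owner_team


catalog = Catalog()
catalog.register(DataProduct("orders_daily", "sales", "sales-analytics",
                             {"order_id": "int", "amount": "float"}))
catalog.register(DataProduct("stock_levels", "logistics", "warehouse-data",
                             {"sku": "str", "units": "int"}))

print(catalog.by_domain("sales"))
print(catalog.owner_of("logistics", "stock_levels"))
```

Each product carries its own owner and schema, so responsibility stays with the domain team while consumers find everything through one shared entry point.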
This approach promotes collaboration, data quality and ease of access in complex business environments.
What is a Data Lakehouse?
A data lakehouse is a data architecture that combines the flexibility of a data lake (for storing raw data, whether structured or unstructured) with the analytical capabilities of a data warehouse (for structured analytics). It enables a variety of data to be stored, processed and analysed in one place, which facilitates advanced analytics and provides valuable insights for organisations, with security and governance measures applied throughout.
In short, it is the combination of a data lake and a data warehouse.
What is Medallion architecture?
The Medallion architecture, also known as multi-hop architecture, is a data design pattern that encourages the logical organisation of data within a data lakehouse.
The Medallion architecture structures data in three tiers (bronze, silver and gold), with data quality improving as the data moves through the transformation process, from raw data to valuable business insights.
This approach was proposed by Databricks, an authority in the field of data management, which advocates Data as a Product (DaaP) and multi-layered approaches to build a single source of truth in an organisation.
The Medallion architecture protects data integrity by passing data through several stages of validation and transformation that preserve atomicity, consistency, isolation and durability (ACID). Once the data has passed through these stages, it is stored in a layout optimised for efficient analysis, ready to be used for strategic decision making.
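To make the atomicity guarantee more concrete, here is a minimal sketch in which a batch of records is either written in full or not at all. It uses Python's built-in `sqlite3` module purely as a stand-in for a transactional table; the table name `silver_orders`, the function `promote_batch` and the validation rule are assumptions made for illustration, not part of any lakehouse product.

```python
import sqlite3


def promote_batch(conn, rows):
    """Write a whole batch into the hypothetical silver_orders table, or nothing."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for order_id, amount in rows:
                if amount < 0:  # hypothetical validation rule
                    raise ValueError(f"negative amount for order {order_id}")
                conn.execute(
                    "INSERT INTO silver_orders (order_id, amount) VALUES (?, ?)",
                    (order_id, amount),
                )
    except ValueError:
        return False
    return True


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM silver_orders").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE silver_orders (order_id INTEGER PRIMARY KEY, amount REAL)")

first_ok = promote_batch(conn, [(1, 10.0), (2, 5.0)])    # valid batch
second_ok = promote_batch(conn, [(3, 1.0), (4, -2.0)])   # one invalid row

print(first_ok, second_ok, count_rows(conn))
```

The second batch contains one invalid row, so the valid row that preceded it is rolled back as well and the table keeps only the first batch.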
How is Medallion architecture structured?
Layered Medallion Architecture: Bronze, Silver and Gold
As explained above, the most distinctive feature of the Medallion architecture is that it structures the data in layers: the bronze layer, the silver layer and the gold layer.
Bronze layer: This phase marks the entry point for raw data, which is stored exactly as it is collected, usually from a variety of sources and in formats such as CSV or JSON. Its quality and structure vary from source to source.
Silver layer: At this point, the data is processed and transformed to make it cleaner and more structured. Tasks such as filtering, validation and normalisation are carried out, and the results are stored in efficient formats. This phase may include defined schemas and additional metadata.
Gold layer: This stage contains data already prepared for analysis and business use. In the Gold layer, advanced transformations and aggregations are performed to create rich data sets. The data is structured, optimised for fast queries and can be enriched with additional information or merged with other data sources for deeper insights.
In short, in a Medallion architecture, the quality and structure of data improve as it passes through each layer. The bronze layer contains raw data, the silver layer contains cleansed and enriched data, and the gold layer contains data that is aggregated and ready to be analysed and integrated into business applications.
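The three layers can be sketched with nothing more than the Python standard library. This is a toy illustration under stated assumptions: the sample CSV, the column names and the functions `load_bronze`, `to_silver` and `to_gold` are invented for the example, and a real lakehouse would persist each layer as tables rather than in-memory lists.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw export containing a missing value, a duplicate and a bad value.
RAW_CSV = """order_id,country,amount
1,es,10.50
2,ES,
3,fr,7.25
3,fr,7.25
4,FR,abc
5,es,4.50
"""


def load_bronze(raw_text):
    """Bronze: keep every record exactly as received (all values are strings)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def to_silver(bronze_rows):
    """Silver: validate, normalise types and casing, and drop duplicates."""
    seen = set()
    silver = []
    for row in bronze_rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop rows whose amount is missing or not numeric
        order_id = int(row["order_id"])
        if order_id in seen:
            continue  # drop repeated orders
        seen.add(order_id)
        silver.append({"order_id": order_id,
                       "country": row["country"].upper(),
                       "amount": amount})
    return silver


def to_gold(silver_rows):
    """Gold: aggregate into a business-ready view (total amount per country)."""
    totals = defaultdict(float)
    for row in silver_rows:
        totals[row["country"]] += row["amount"]
    return dict(totals)


bronze = load_bronze(RAW_CSV)
silver = to_silver(bronze)
gold = to_gold(silver)
print(len(bronze), len(silver), gold)
```

Because the bronze rows are never modified, the silver and gold layers can be rebuilt at any time if the cleansing or aggregation rules change.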
This modular architecture facilitates large-scale data management and allows for agile adaptation to changing needs.
Medallion Architecture, Data Lakehouse and ELT
In a Medallion architecture built on a data lakehouse, it is common to use ELT (extract, load, transform) rather than ETL (extract, transform, load). Raw data is loaded first, and only minimal transformations and data cleansing rules are applied when it is loaded into the Silver layer, which prioritises speed and agility in ingesting and delivering data into the data lake. Complex transformations and specific business rules are applied later, when the data moves from the Silver layer to the Gold layer.
This allows for greater flexibility to tailor the data to the specific needs of each project and business, making it easier to implement complex business rules and transformations later in the process.
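The ordering of the ELT steps can be sketched as follows. Everything here is hypothetical: the `lake` dictionary stands in for lakehouse storage, and the payloads, field names and `PRICE_PER_SEAT` business rule are invented for illustration only.

```python
import json

# Hypothetical business rule, applied only on the way to the gold layer.
PRICE_PER_SEAT = {"pro": 20, "basic": 5}


def extract():
    """E: fetch raw payloads from a (hypothetical) source system."""
    return [
        '{"customer": " Ana ", "plan": "PRO", "seats": "3"}',
        '{"customer": "Luis", "plan": "basic", "seats": "10"}',
        '{"customer": "Mei", "plan": "pro", "seats": "2"}',
    ]


def load(lake, payloads):
    """L: land the payloads untouched in the bronze layer."""
    lake["bronze"].extend(payloads)


def transform_to_silver(lake):
    """T (light): parse, trim and cast; no business logic yet."""
    lake["silver"] = [
        {"customer": rec["customer"].strip(),
         "plan": rec["plan"].lower(),
         "seats": int(rec["seats"])}
        for rec in map(json.loads, lake["bronze"])
    ]


def transform_to_gold(lake):
    """T (heavy): apply business rules and aggregate revenue per plan."""
    revenue = {}
    for rec in lake["silver"]:
        price = PRICE_PER_SEAT[rec["plan"]]
        revenue[rec["plan"]] = revenue.get(rec["plan"], 0) + rec["seats"] * price
    lake["gold"] = revenue


lake = {"bronze": [], "silver": [], "gold": {}}
payloads = extract()
load(lake, payloads)        # data lands before any transformation
transform_to_silver(lake)
transform_to_gold(lake)
print(lake["gold"])
```

Since the pricing rule lives only in the last step, it can change without re-ingesting anything: the gold layer is simply recomputed from silver.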
In conclusion, the Medallion architecture offers organisations a structured way to handle large volumes of data. By combining the benefits of the data lakehouse approach with the multi-tier structure of bronze, silver and gold, it promotes data quality and facilitates the transformation of data into valuable business insights. It also supports flexible data management that adapts to changing market demands while providing a single source of truth in an organisation. If you would like to learn more about the Medallion architecture and how it can benefit your business, we invite you to explore this topic further and consider this approach in your data management strategy.