🗓️ 07032025 1618
📎

medallion_architecture

A data design pattern used to logically organize data in a data_lakehouse

Medallion architectures are sometimes also referred to as "multi-hop" architectures

Goal​

Incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables)

Bronze layer (raw data)​

Purpose: Quick change data capture (CDC), plus a historical archive of the source that supports data lineage, auditability and reprocessing if needed

  • Where data from external source systems lands
  • The table structures in this layer correspond to the source system table structures "as-is", along with any additional metadata columns that capture the load date/time, process ID, etc.
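As a rough illustration of the Bronze bullets above, the sketch below copies source rows unchanged and appends load metadata. It uses plain Python dictionaries instead of a real lakehouse table, and the function name `land_in_bronze`, the `_source_system`, `_load_ts` and `_process_id` column names, and the sample rows are all assumptions made for this example.

```python
from datetime import datetime, timezone
import uuid


def land_in_bronze(source_rows, source_system, process_id=None, load_time=None):
    """Copy source rows unchanged and append load metadata columns."""
    process_id = process_id or str(uuid.uuid4())
    load_time = load_time or datetime.now(timezone.utc)
    bronze_rows = []
    for row in source_rows:
        bronze_row = dict(row)  # source columns are kept "as-is", no cleansing
        bronze_row["_source_system"] = source_system  # hypothetical metadata column
        bronze_row["_load_ts"] = load_time.isoformat()  # load date/time
        bronze_row["_process_id"] = process_id  # ID of the ingesting process
        bronze_rows.append(bronze_row)
    return bronze_rows


# Hypothetical rows from a source system; the duplicate and the stray
# whitespace are deliberately left alone at this layer.
source_rows = [
    {"cust_id": "17", "name": " Ada Lovelace ", "country": "uk"},
    {"cust_id": "17", "name": " Ada Lovelace ", "country": "uk"},
]
bronze = land_in_bronze(source_rows, source_system="crm", process_id="run-001")
```

Keeping the rows untouched is what makes the layer usable as an archive: anything downstream can be rebuilt from it.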

Silver layer (cleansed and conformed data)​

Data from Bronze is conformed and cleansed "just-enough"

  • Provides an "Enterprise view" of all its key business entities, concepts and transactions. (e.g. master customers, stores, non-duplicated transactions and cross-reference tables)

  • The Silver layer brings data from different sources into an Enterprise view and enables self-service analytics for ad-hoc reporting, advanced analytics and ML. It serves as a source for Departmental Analysts, Data Engineers and Data Scientists, who build further projects and analyses on it to answer business problems via enterprise and departmental data projects in the Gold layer.

In the lakehouse data engineering paradigm, the ELT methodology is typically followed rather than ETL, which means only minimal or "just-enough" transformations and data cleansing rules are applied while loading the Silver layer. Speed and agility to ingest and deliver the data in the data lake are prioritized, and many project-specific complex transformations and business rules are applied while loading the data from the Silver to the Gold layer. From a data modeling perspective, the Silver layer has more 3rd-Normal-Form-like data models. Data Vault-like, write-performant data models can be used in this layer.
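A minimal sketch of what "just-enough" cleansing might look like, assuming Bronze rows shaped like the hypothetical ones below. The function name `to_silver`, the column names, and the rule of keeping the latest row per key are illustrative choices, not a prescribed method.

```python
def to_silver(bronze_rows):
    """Apply just-enough cleansing: trim, cast, standardize, de-duplicate."""
    latest_by_key = {}
    # Process oldest first so the newest load wins for each customer.
    for row in sorted(bronze_rows, key=lambda r: r["_load_ts"]):
        try:
            customer_id = int(row["cust_id"])  # conform the key to an integer
        except (TypeError, ValueError):
            # Unparseable key: skipped here; a real pipeline would likely
            # route such rows to a quarantine table instead.
            continue
        latest_by_key[customer_id] = {
            "customer_id": customer_id,
            "name": row["name"].strip(),
            "country": row["country"].strip().upper(),
            "_load_ts": row["_load_ts"],
        }
    return [latest_by_key[key] for key in sorted(latest_by_key)]


# Hypothetical Bronze rows with duplicates, stray whitespace and a bad key.
bronze_rows = [
    {"cust_id": "17", "name": " Ada Lovelace ", "country": "uk",
     "_load_ts": "2025-03-07T16:18:00+00:00"},
    {"cust_id": "17", "name": "Ada Lovelace", "country": "UK ",
     "_load_ts": "2025-03-08T09:00:00+00:00"},
    {"cust_id": "42", "name": "Grace Hopper", "country": "us",
     "_load_ts": "2025-03-07T16:18:00+00:00"},
    {"cust_id": "n/a", "name": "Unknown", "country": "",
     "_load_ts": "2025-03-07T16:18:00+00:00"},
]
silver = to_silver(bronze_rows)
```

Note that no business rules or aggregations appear here; in the ELT approach described above, those are deferred to the Silver-to-Gold step.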

Gold layer (curated business-level tables)​

Data in the Gold layer of the lakehouse is typically organized in consumption-ready "project-specific" databases. The Gold layer is for reporting and uses more de-normalized and read-optimized data models with fewer joins. The final layer of data transformations and data quality rules is applied here. The final presentation layers of projects such as Customer Analytics, Product Quality Analytics, Inventory Analytics, Customer Segmentation, Product Recommendations, Marketing/Sales Analytics etc. fit in this layer. We see a lot of Kimball-style star schema-based data models or Inmon-style Data Marts fit in this Gold layer of the lakehouse.
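To make the "de-normalized, read-optimized" idea concrete, here is a small sketch that joins two hypothetical Silver tables once and stores the result as a flat, report-ready Gold table. The function name `build_gold_sales_by_country`, the table shapes and the sample values are assumptions for illustration only.

```python
from collections import defaultdict


def build_gold_sales_by_country(silver_customers, silver_transactions):
    """Join and aggregate Silver tables into one flat table for reporting."""
    country_by_customer = {c["customer_id"]: c["country"] for c in silver_customers}
    totals = defaultdict(lambda: {"order_count": 0, "revenue": 0.0})
    for txn in silver_transactions:
        # The join happens once here, so reports need no joins later.
        country = country_by_customer.get(txn["customer_id"], "UNKNOWN")
        totals[country]["order_count"] += 1
        totals[country]["revenue"] += txn["amount"]
    return [{"country": c, **totals[c]} for c in sorted(totals)]


# Hypothetical normalized Silver tables.
silver_customers = [
    {"customer_id": 17, "country": "UK"},
    {"customer_id": 42, "country": "US"},
]
silver_transactions = [
    {"txn_id": 1, "customer_id": 17, "amount": 10.0},
    {"txn_id": 2, "customer_id": 17, "amount": 5.5},
    {"txn_id": 3, "customer_id": 42, "amount": 20.0},
]
gold = build_gold_sales_by_country(silver_customers, silver_transactions)
```

A dashboard reading `gold` only scans one small table, which is the trade-off the Gold layer makes: extra work at load time in exchange for cheap reads.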

So you can see that the data is curated as it moves through the different layers of a lakehouse. In some cases, we also see that a lot of Data Marts and EDWs from the traditional RDBMS technology stack are ingested into the lakehouse, so that for the first time Enterprises can do "pan-EDW" advanced analytics and ML, which was just not possible or too cost prohibitive to do on a traditional stack. (e.g. IoT/Manufacturing data is tied with Sales and Marketing data for defect analysis, or health care genomics and EMR/HL7 clinical data marts are tied with financial claims data to create a Healthcare Data Lake for timely and improved patient care analytics.)

Benefits of a lakehouse architecture​

  • Simple data model
  • Easy to understand and implement
  • Enables incremental ETL
  • Can recreate your tables from raw data at any time
  • ACID transactions, time travel
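Two of the benefits above, incremental ETL and the ability to recreate tables from raw data, can be sketched together. The example below assumes Bronze rows carry a `_load_ts` column and uses a simple watermark; the `upsert` helper and the sample rows are hypothetical. ACID transactions and time travel depend on the underlying table format and are not modeled here.

```python
def upsert(silver, bronze_rows, watermark=""):
    """Merge Bronze rows newer than the watermark into Silver, keyed by id."""
    new_rows = [r for r in bronze_rows if r["_load_ts"] > watermark]
    for row in sorted(new_rows, key=lambda r: r["_load_ts"]):
        silver[row["id"]] = {"id": row["id"], "status": row["status"]}
    # Assumes each load uses a strictly later timestamp than the previous one.
    new_watermark = max((r["_load_ts"] for r in new_rows), default=watermark)
    return silver, new_watermark


bronze = [
    {"id": 1, "status": "new", "_load_ts": "2025-01-01T00:00:00"},
    {"id": 2, "status": "new", "_load_ts": "2025-01-01T00:00:00"},
]

# First incremental run processes everything that has landed so far.
silver_incremental, watermark = upsert({}, bronze)

# A later load appends a change; the next run touches only the new row.
bronze.append({"id": 1, "status": "shipped", "_load_ts": "2025-01-02T00:00:00"})
silver_incremental, watermark = upsert(silver_incremental, bronze, watermark)

# Because Bronze keeps the full history, Silver can also be rebuilt from scratch.
silver_rebuilt, _ = upsert({}, bronze)
```

In this sketch the incremental result and the full rebuild agree, which is the property that makes reprocessing from Bronze a safe fallback.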
