What is a data warehouse and how does it work, how does it differ from a data lake and what is its architecture like? | Data Warehouse
The concepts of data warehouse and data lake are often confused. However, these two types of data storage are much more different than they may seem at first glance. In fact, the only thing they have in common is that they contain large amounts of data.
What is a data warehouse?
A data warehouse is a data storage infrastructure used to store, organize, and analyze large volumes of structured data —sometimes also semi-structured and unstructured data—. This data can come from a variety of sources within an organisation, such as operational databases, customer relationship management (CRM) systems, enterprise resource management (ERP) systems and other sources.
The main objective of a data warehouse is to provide a centralised environment for business decision making by enabling organisations to consolidate data from different sources and analyse it for meaningful insights and business opportunities. To achieve this, data is transformed and stored in a format optimised for analysis, which facilitates querying and reporting.
The concept of a data warehouse is closely linked to data-driven decision making and the exploitation of corporate information at its source. This type of data warehouse is specifically designed to meet the needs of business intelligence and data analytics, and is often referred to as an Enterprise Data Warehouse (EDW) for this reason.
To be more specific, a data warehouse can be defined as a data storage and integration architecture that facilitates the organisation, transformation, understanding and management of data. In addition, it allows to leverage this data to make more informed business decisions. The creation and development of this architecture, along with the associated operations, is known as data warehousing. This term refers to the process of collecting, integrating and organising data in a data warehouse.
Unlike other databases, the primary goal of a data warehouse is to transform raw data into business insights and to facilitate access to this data by business users quickly and efficiently.
A data warehouse is typically complemented with tools and technologies for extracting, transforming and loading (ETL) data, as well as platforms for data analysis and visualisation. It uses techniques such as indexing, partitioning and query optimisation to ensure that analytical queries are fast and efficient, even when working with large volumes of data.
In short, a data warehouse is a fundamental component of business intelligence, helping organisations to make informed decisions based on accurate and relevant data analysis.
- If your company does not yet have a well-thought-out data strategy, download our free e-book to learn the essential steps and requirements to consolidate a data strategy that allows you to leverage the business value of data.
How does a data warehouse work?
A data warehouse works through a systematic process involving data extraction, transformation, loading (ETL) and analysis.
A data warehouse is usually the central data store of an organisation. After data is extracted from its original sources and integrated into the data warehouse, it is processed, transformed and organised into views and tables of dimensions or facts. The method commonly used for this purpose is the ETL (Extract, Transform and Load) or, more recently, ELT (Extract, Load, and Transform) process.
Once the data has been transformed and organised, users can access it through SQL, business intelligence tools such as Power BI, customer management platforms such as a CRM, among other options.
Below, we detail how a data warehouse works:
Extract: Data is extracted from various sources within an organisation, such as operational databases, CRM, ERP and other information systems. This data can be of different types, such as transactional data, customer data, sales data, etc.
Transformation: Once the data has been extracted, it goes through a transformation process. During this stage, the data is cleaned and transformed into a standard format. This involves correcting errors, removing duplicates and converting the data to a consistent format. In addition, certain business rules and calculations can be applied to create aggregated and derived data that is useful for analysis.
Load: The transformed data is loaded into the data warehouse, where it is stored in structures optimised for analysis. The data is often organised into tables and columns to facilitate queries. This process may involve creating indexes to improve query speed and partitioning data for more efficient management.
Analysis: Once data is in the data warehouse, users can perform complex analysis and generate reports. This is done using analysis and visualisation tools that access the data in the data warehouse and allow users to ask complex questions and discover patterns and trends in the data.
Presentation: The results of the analysis are presented to users in the form of reports, dashboards and interactive visualisations. These reports help organisations make informed decisions based on the analysed data.
It is important to note that the data warehouse process is continuous and iterative. Data is updated regularly to ensure that the information analysed is always up-to-date and relevant to business decision-making.
On the other hand, recently, some corporations are approaching the data extraction, transformation and loading (ETL) process from another perspective, ELT, to gain speed and flexibility.
What are the differences between a data warehouse and a data lake?
Data lake and data warehouse are widely used for big data storage, but, although both are data storehouses, these are not interchangeable terms. A data lake is a large set of raw data, which does not yet have a defined purpose. Instead, a data warehouse is a repository of data that is already structured and filtered and has been processed for a specific purpose.
It is important to make the distinction, as data lakes and data warehousing serve different purposes and therefore require a different approach to be properly optimised.
It is important to make the distinction, as data lake and data warehosue serve different purposes and they require a different approach to be properly optimized.
Both tools are fundamental parts of a data integration process and are often used in ETL processes. Data integration is the foundation of any data strategy. If data is not properly integrated, transforming it into business value will be highly complex.
Some of the main differences between a data warehouse and a data lake are the structure of data, the processing methods, the area they are used at, and what the purpose of the data is.
Thus, a data lake stores raw data that does not yet have a specific purpose. Its end users are data scientists and it has high accessibility. Moreover, in a data lake, because of its easy accessibility, the data can be updated quickly.
On the other hand, a data warehouse has processed data that is already in use and therefore has a specific purpose. The end users of a DWH are usually entrepreneurs and business people and it is somewhat more complicated to make changes in the data it contains.
Data warehouse vs data lake: pros and cons
The main difference between the concepts is surely the variable structure of the raw data versus the processed data. Since data lakes usually store raw data, their storage capacity must be higher than that of data warehouses. Having this raw data has many benefits, such as being able to analyze it quickly and for any purpose. However, without proper data quality and data governance measures, data lakes can become an untreatable data container from which little value can be extracted.
The benefits of a data warehouse are also interesting: since they only store processed data, they save a lot of storage space, which translates into money savings. Furthermore, because the data is processed, it is much more comprehensible and accessible to a less technical public.
Beyond their storage purpose, these two concepts are quite different. Data lakes, because of their unstructured content, can be complex to navigate and require a data scientist, while data warehouses are more suitable for an enterprise and to be handled by less technical users. Due to all these differences, each company must evaluate with experts which of the types is more convenient for them in relation to what they are going to use them for.
When to use a data warehouse instead of a data lake?
The decision to use a data warehouse or a data lake depends on the specific requirements and use cases of the organization. Both data warehouses and data lakes have distinct strengths and weaknesses, so understanding the characteristics of each and considering the organization's data management needs is essential. Here are some scenarios when you might choose a data warehouse over a data lake:
Structured Data and Business Intelligence: If your organization deals primarily with structured data (e.g., transactional data, sales records, financial data) and requires business intelligence and reporting capabilities, a data warehouse is a suitable choice. Data warehouses are optimized for handling structured data and are designed to support complex analytical queries and generate accurate, consistent reports.
Historical Data Analysis: Data warehouses are well-suited for storing historical data and maintaining a record of business transactions over time. They provide a time-variant view of data, allowing historical trend analysis and performance tracking.
Well-Defined and Stable Schema: If your data has a well-defined and stable schema, meaning the structure of data doesn't change frequently, a data warehouse is advantageous. Data warehouses rely on predefined schemas to organize data efficiently, making them less flexible to accommodate schema changes on the fly.
Aggregated and Summarized Data: If your reporting and analytical requirements involve aggregated and summarized data (e.g., sales totals, quarterly revenues, yearly averages), a data warehouse can efficiently store and manage pre-aggregated data for faster query performance.
Integration with Traditional BI Tools: Data warehouses are compatible with traditional Business Intelligence (BI) tools and reporting platforms. If your organization already uses popular BI tools like Tableau, Power BI, or Qlik, integrating a data warehouse into your existing infrastructure is relatively straightforward.
On the other hand, a data lake might be a better choice in the following scenarios:
Variety of Data Types: If your organization deals with diverse and unstructured data types, such as text, images, videos, log files, sensor data, etc., a data lake can accommodate the raw, unprocessed data without the need for a predefined schema.
Exploratory Analytics and Data Science: Data lakes are suitable for data exploration and advanced analytics, as they allow data scientists and analysts to access raw data and apply different data models, algorithms, and machine learning techniques.
Flexibility and Agility: If your data requirements change frequently, and you need to accommodate evolving data sources and formats, a data lake's schema-on-read approach offers more flexibility and agility compared to the rigid schema of a data warehouse.
Big Data Processing: Data lakes can handle massive volumes of data, making them ideal for big data processing and storage.
Ultimately, the best approach may involve a combination of both data warehouse and data lake solutions, known as a "hybrid data architecture." In this setup, the data warehouse serves as a structured repository for curated and refined data, while the data lake acts as a raw data repository for exploratory analysis and data processing before moving relevant data into the data warehouse for business reporting and intelligence purposes.
The architecture of a Data Warehouse
As we have already mentioned, the great differentiating aspect of a data warehouse is its architecture, which is structured in different layers that interact with each other and, in turn, with the data.
The classic architecture of a data warehouse is made up of 3 layers:
Bronze: In the Bronze layer, also known as the Staging layer, data is extracted from its original sources, usually by means of SQL scripts.
Silver: In the Silver layer, also called Core, data from various original sources is integrated into the data warehouse. After storage, the data is transformed and modelled, often using star or snowflake schemas. It is then transferred to an online analytical processing (OLAP) server. Once the transfer is complete, the data is transformed again and loaded into the data warehouse, where it is available for further analysis and decision support.
As mentioned above, these two initial layers are typically implemented through an ETL or ELT process.
Gold: At the Gold layer, data is refined so that it is accessible to users. This involves organising it so that it is ready to be used and exported in various business intelligence, reporting and data visualisation platforms, such as Power BI and other front-end interfaces.
In summary, both the data warehouse and the data lake are fundamental tools in the field of data warehousing, but they differ in their structure, purpose and focus. While the data warehouse focuses on processed and structured data for business decision making, the data lake stores raw data without a defined purpose. Each of these tools has its advantages and disadvantages, so it is important to choose the option that best suits the specific needs of each organisation. While the data warehouse is ideal for working with structured data and performing historical analysis, the data lake offers greater flexibility and accessibility for data scientists. Ultimately, it is essential to have a well thought-out data strategy and experts who can guide in the choice and optimisation of data warehousing.
Don't miss our e-book with the keys to designing and building a business data strategy.