What is a data lakehouse?

15 May 2025

Authors

Alexandra Jonker

Editorial Content Lead

Alice Gomstyn

IBM Content Contributor

A data lakehouse is a data platform that combines the flexible data storage of data lakes with the high-performance analytics capabilities of data warehouses.

Data lakes and data warehouses are typically used in tandem. Data lakes act as a catch-all system for new data, and data warehouses apply downstream structure to the data.

However, coordinating these systems to provide reliable data can be costly in both time and resources. Long processing times contribute to data staleness and additional layers of ETL (extract, transform, load) introduce data quality risks. 

Data lakehouses compensate for the flaws of both systems, pairing the data structures and management features of data warehouses with the low-cost storage and flexibility of data lakes.

Data lakehouses empower data teams to unify their disparate data systems, accelerate data processing for more advanced analytics (such as machine learning (ML)), access big data more efficiently and improve data quality.

The emergence of data lakehouses

Data lakehouses exist to resolve the challenges of data warehouses and data lakes and to bring their benefits under one data architecture.

For instance, data warehouses are more performant than data lakes because they both store and transform enterprise data. However, data warehousing requires strict schemas (typically the star schema and the snowflake schema).

Therefore, data warehouses don’t work well with unstructured or semi-structured data, which are critical for artificial intelligence (AI) and ML use cases. They are also limited in their ability to scale.

Data lakes, on the other hand, allow organizations to aggregate all data types—structured data, unstructured data and semi-structured data—from diverse data sources in one location. They enable more scalable and affordable data storage, but do not have built-in data processing tools.

Data lakehouses merge aspects of data warehouses and data lakes. They use cloud object storage to store data in any format at a low cost. And, on top of that cloud storage sits a warehouse-style analytics infrastructure, which supports high-performance queries, near real-time analytics and business intelligence (BI) efforts.
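
To make this concrete, here is a minimal sketch of the lakehouse pattern in PySpark: warehouse-style SQL running directly over Parquet files in object storage. The bucket path, table and columns are hypothetical, and the sketch assumes a Spark session with an S3-compatible connector configured.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakehouse-query").getOrCreate()

    # Register Parquet files sitting in object storage as a queryable table
    sales = spark.read.parquet("s3a://example-bucket/sales/")
    sales.createOrReplaceTempView("sales")

    # Run a warehouse-style analytical query directly against lake storage
    spark.sql("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY region
        ORDER BY total_revenue DESC
    """).show()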

Data warehouse vs. data lake vs. data lakehouse

Data warehouses, data lakes and data lakehouses are all data repositories, but with key differences. They are often used together to support an integrated data architecture for a variety of use cases.

Data warehouse

A data warehouse gathers raw data from multiple sources into a central repository and organizes it into a relational database infrastructure. This data management system primarily supports data analytics and business intelligence applications, such as enterprise reporting.

The system uses ETL processes to extract, transform and load data to its destination. However, ETL pipelines can become inefficient and costly, particularly as the number of data sources and the quantity of data grow.
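
As a rough illustration (not a production pipeline), the following Python sketch walks through the three ETL stages against stand-in SQLite databases. The table names, columns and currency rates are invented for the example.

    import sqlite3

    source = sqlite3.connect("source.db")        # stand-in for an operational system
    warehouse = sqlite3.connect("warehouse.db")  # stand-in for the warehouse

    # Extract: pull raw order rows from the source system
    rows = source.execute("SELECT id, amount, currency FROM orders").fetchall()

    # Transform: normalize all amounts to one currency (simplified fixed rates)
    rates = {"USD": 1.0, "EUR": 1.1}
    clean = [(oid, amount * rates.get(currency, 1.0)) for oid, amount, currency in rows]

    # Load: write the conformed rows into the warehouse schema
    warehouse.execute("CREATE TABLE IF NOT EXISTS fact_orders (id INTEGER, amount_usd REAL)")
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
    warehouse.commit()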

While data warehouses were traditionally hosted on-premises on mainframes, today many data warehouses are hosted in the cloud and delivered as cloud services.

Data lake

Data lakes were initially built on big data platforms such as Apache Hadoop. But the core of modern data lakes is a cloud object storage service, which allows them to store all types of data. Common services include Amazon Simple Storage Service (Amazon S3), Microsoft Azure Blob Storage, Google Cloud Storage and IBM Cloud Object Storage. 
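
As a simple illustration of this "store anything" property, the sketch below uploads files of different types to object storage with the AWS SDK for Python (boto3). The bucket and object keys are hypothetical, and the same pattern applies to the other services listed.

    import boto3

    s3 = boto3.client("s3")

    # Structured, semi-structured and unstructured data all land as objects
    s3.upload_file("daily_sales.parquet", "example-lake", "raw/sales/daily_sales.parquet")
    s3.upload_file("clickstream.json", "example-lake", "raw/events/clickstream.json")
    s3.upload_file("support_call.mp3", "example-lake", "raw/audio/support_call.mp3")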

Since enterprises largely generate unstructured data, this storage capability is an important distinction. It enables more data science and AI projects, which in turn drive more novel insights and better decision-making across the organization.

However, the size and complexity of data lakes can require the expertise of more technical users, such as data scientists and data engineers. And, because data governance occurs downstream in these systems, data lakes can be prone to data silos, and subsequently evolve into data swamps (where good data is inaccessible due to poor management).

Data lakehouse

Data lakehouses can resolve the core challenges of both data warehouses and data lakes to yield a better data management solution. They leverage cloud object storage for fast, low-cost storage across a broad range of data types, while also delivering high-performance analytics capabilities. Organizations can use data lakehouses alongside their existing data lakes and data warehouses without a full teardown and rebuild.

Benefits of a data lakehouse

Data lakehouses yield several key benefits to users. They can help:

  • Reduce data redundancy
  • Lower costs
  • Support a variety of workloads
  • Improve data governance
  • Enhance scalability
  • Enable real-time streaming

Reduce data redundancy

A single data storage system creates a streamlined platform to meet all business data demands, reducing data duplication. Data lakehouses also simplify end-to-end data observability by reducing the amount of data moving through data pipelines into various systems.

Lower costs

Data lakehouses capitalize on the lower costs of cloud object storage, so they are more cost-effective than data warehouses. Additionally, the hybrid architecture of a data lakehouse eliminates the need to maintain multiple data storage systems, making it less expensive to operate.

Support a variety of workloads

Data lakehouses can address different use cases across the data management lifecycle. They support workloads ranging from business intelligence and data-driven visualization to more complex data science work.

Improve data governance

The data lakehouse architecture mitigates the governance issues of data lakes. For example, as data is ingested and uploaded, the lakehouse can ensure it meets the defined schema requirements, reducing downstream data quality issues.
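
A minimal sketch of this idea, using PyArrow to validate incoming records against a declared schema before they are written. The schema and records are invented for the example; real lakehouse engines enforce schemas natively at write time.

    import pyarrow as pa

    # The schema every ingested batch must conform to
    expected_schema = pa.schema([
        ("order_id", pa.int64()),
        ("amount", pa.float64()),
        ("region", pa.string()),
    ])

    # An incoming batch of records (invented for the example)
    incoming = pa.table({
        "order_id": [1, 2],
        "amount": [19.99, 5.50],
        "region": ["EMEA", "APAC"],
    })

    # cast() raises an error if the batch cannot conform to the schema,
    # stopping bad data before it reaches downstream consumers
    validated = incoming.cast(expected_schema)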

Enhance scalability

In traditional data warehouses, compute and storage are coupled. Data lakehouses separate storage and compute, allowing data teams to access the same data storage while using different compute nodes for different applications. This decoupling results in more scalability and flexibility.

Enable real-time streaming

The data lakehouse is built for today's businesses and technology. Many data sources deliver real-time streaming data, such as that from Internet of Things (IoT) devices. The lakehouse system supports these sources through real-time data ingestion.
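
As a hedged sketch of streaming ingestion (the article does not prescribe a specific stack), here is Spark Structured Streaming reading IoT events from a hypothetical Kafka topic and continuously appending them to lake storage. It assumes the Kafka connector package is on the Spark classpath; broker, topic and paths are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

    # Read a continuous stream of device events from a Kafka topic
    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "device-telemetry")
        .load())

    # Continuously append the events to object storage as Parquet files
    (events.writeStream
        .format("parquet")
        .option("path", "s3a://example-lake/telemetry/")
        .option("checkpointLocation", "s3a://example-lake/_checkpoints/telemetry/")
        .start())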

What is Delta Lake?

Developed by Databricks in 2016, Delta Lake is an open source data storage format that combines Apache Parquet data files with a robust metadata log. This format adds key data management functions to data lakes, such as schema enforcement, time travel and ACID transactions. (ACID stands for “atomicity, consistency, isolation and durability,” which are key properties that define a transaction to ensure data integrity.)

These functions help make data lakes more reliable and intuitive. They also allow users to run structured query language (SQL) queries, analytics workloads and other activities on a data lake, streamlining business intelligence, data intelligence (DI), AI and ML.

Delta Lake was open sourced in 2019. Since then, data lakehouses have typically been created by building a Delta Lake storage layer on top of a data lake, then integrating it with a data processing engine such as Apache Spark or Hive.
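
The sketch below shows this pattern with the delta-spark package: each write is an ACID transaction recorded in the metadata log, and the "versionAsOf" read option provides time travel. The configuration follows the standard Delta Lake setup for Spark; paths and data are illustrative.

    from pyspark.sql import SparkSession

    # Assumes the delta-spark package is installed and on the Spark classpath
    spark = (SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())

    df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "product"])

    # Each write is an ACID transaction recorded in the Delta metadata log
    df.write.format("delta").save("/tmp/lake/products")                  # version 0
    df.write.format("delta").mode("append").save("/tmp/lake/products")   # version 1

    # Time travel: query the table as it existed at an earlier version
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/products").show()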

Open source-enabled data lakehouses are often referred to as open data lakehouses. Other open table formats include Apache Iceberg (a high-performance format for massive analytic tables) and Apache Hudi (designed for incremental data processing).

Layers of the data lakehouse architecture

The architecture of a data lakehouse typically consists of five layers:

  • Ingestion layer
  • Storage layer
  • Metadata layer
  • API layer
  • Consumption layer

Ingestion layer

This first layer gathers data from a range of sources and transforms it into a data format that a lakehouse can store and analyze. The ingestion layer can use protocols to connect with internal and external sources such as database management systems, NoSQL databases and social media. 

Storage layer

In this layer, structured, unstructured and semi-structured datasets are stored in open-source file formats, such as Parquet or Optimized Row Columnar (ORC). This layer provides a major benefit of the data lakehouse—its ability to accept all data types at an affordable cost.
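
For example, a small dataset can be persisted in Parquet with PyArrow, making it readable by many different engines. The file name and data here are illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "customer_id": [101, 102, 103],
        "plan": ["basic", "pro", "pro"],
    })

    # Columnar and compressed, readable by many engines (Spark, Presto and others)
    pq.write_table(table, "customers.parquet", compression="snappy")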

Metadata layer

The metadata layer is a unified catalog that delivers metadata for every object in the lake storage, helping organize and provide information about data in the system. This layer also offers ACID transactions, file caching and indexing for faster queries. Users can implement predefined schemas here, which enable data governance and auditing capabilities.

API layer

A data lakehouse uses application programming interfaces (APIs) to speed up task processing and enable more advanced analytics. Specifically, this layer lets consumers and developers use a range of languages and libraries, such as TensorFlow, at an abstract level. The APIs are optimized for data asset consumption.

Consumption layer

The final layer of the data lakehouse architecture hosts apps and tools, with access to all the metadata and data stored in the lake. This opens data access to users across an organization, who can use the lakehouse for tasks such as business intelligence dashboards, data visualization and machine learning jobs.
