What is a data lakehouse?

15 May 2025

Authors

Alexandra Jonker

Editorial Content Lead

Alice Gomstyn

IBM Content Contributor

A data lakehouse is a data platform that combines the flexible data storage of data lakes with the high-performance analytics capabilities of data warehouses.

Data lakes and data warehouses are typically used in tandem. Data lakes act as a catch-all system for new data, and data warehouses apply downstream structure to the data.

However, coordinating these systems to provide reliable data can be costly in both time and resources. Long processing times contribute to data staleness and additional layers of ETL (extract, transform, load) introduce data quality risks. 

Data lakehouses compensate for the flaws of both systems, pairing the data structures and management features of data warehouses with the low-cost storage and flexibility of data lakes.

Data lakehouses empower data teams to unify their disparate data systems, accelerate data processing for more advanced analytics (such as machine learning (ML)), access big data more efficiently and improve data quality.

The emergence of data lakehouses

Data lakehouses exist to resolve the challenges of data warehouses and data lakes and to bring their benefits under one data architecture.

For instance, data warehouses are more performant than data lakes because they both store and transform enterprise data. However, data warehousing requires strict schemas (typically the star schema and the snowflake schema).

Therefore, data warehouses don’t work well with unstructured or semi-structured data, which are critical for artificial intelligence (AI) and ML use cases. They are also limited in their ability to scale.

Data lakes, on the other hand, allow organizations to aggregate all data types—structured data, unstructured data and semi-structured data—from diverse data sources in one location. They enable more scalable and affordable data storage, but do not have built-in data processing tools.

Data lakehouses merge aspects of data warehouses and data lakes. They use cloud object storage to store data in any format at a low cost. And, on top of that cloud storage sits a warehouse-style analytics infrastructure, which supports high-performance queries, near real-time analytics and business intelligence (BI) efforts.
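
To make this concrete, here is a minimal sketch of the lakehouse pattern in PySpark: warehouse-style SQL running directly over Parquet files in object storage. The bucket path, table and columns are hypothetical, and the sketch assumes a Spark session with an S3-compatible connector configured.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lakehouse-query").getOrCreate()

    # Register Parquet files sitting in object storage as a queryable table
    sales = spark.read.parquet("s3a://example-bucket/sales/")
    sales.createOrReplaceTempView("sales")

    # Run a warehouse-style analytical query directly against lake storage
    spark.sql("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY region
        ORDER BY total_revenue DESC
    """).show()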

Data warehouse vs. data lake vs. data lakehouse

Data warehouses, data lakes and data lakehouses are all data repositories, but with key differences. They are often used together to support an integrated data architecture for a variety of use cases.

Data warehouse

A data warehouse gathers raw data from multiple sources into a central repository and organizes it into a relational database infrastructure. This data management system primarily supports data analytics and business intelligence applications, such as enterprise reporting.

The system uses ETL processes to extract, transform and load data to its destination. However, ETL pipelines can become inefficient and costly, particularly as the number of data sources and the quantity of data grow.
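
As a rough illustration (not a production pipeline), the following Python sketch walks through the three ETL stages against stand-in SQLite databases. The table names, columns and currency rates are invented for the example.

    import sqlite3

    source = sqlite3.connect("source.db")        # stand-in for an operational system
    warehouse = sqlite3.connect("warehouse.db")  # stand-in for the warehouse

    # Extract: pull raw order rows from the source system
    rows = source.execute("SELECT id, amount, currency FROM orders").fetchall()

    # Transform: normalize all amounts to one currency (simplified fixed rates)
    rates = {"USD": 1.0, "EUR": 1.1}
    clean = [(oid, amount * rates.get(currency, 1.0)) for oid, amount, currency in rows]

    # Load: write the conformed rows into the warehouse schema
    warehouse.execute("CREATE TABLE IF NOT EXISTS fact_orders (id INTEGER, amount_usd REAL)")
    warehouse.executemany("INSERT INTO fact_orders VALUES (?, ?)", clean)
    warehouse.commit()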

While data warehouses were traditionally hosted on-premises on mainframes, today many data warehouses are hosted in the cloud and delivered as cloud services.

Data lake

Data lakes were initially built on big data platforms such as Apache Hadoop. But the core of modern data lakes is a cloud object storage service, which allows them to store all types of data. Common services include Amazon Simple Storage Service (Amazon S3), Microsoft Azure Blob Storage, Google Cloud Storage and IBM Cloud Object Storage. 
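
As a simple illustration of this "store anything" property, the sketch below uploads files of different types to object storage with the AWS SDK for Python (boto3). The bucket and object keys are hypothetical, and the same pattern applies to the other services listed.

    import boto3

    s3 = boto3.client("s3")

    # Structured, semi-structured and unstructured data all land as objects
    s3.upload_file("daily_sales.parquet", "example-lake", "raw/sales/daily_sales.parquet")
    s3.upload_file("clickstream.json", "example-lake", "raw/events/clickstream.json")
    s3.upload_file("support_call.mp3", "example-lake", "raw/audio/support_call.mp3")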

Since enterprises largely generate unstructured data, this storage capability is an important distinction. It enables more data science and AI projects, which in turn drive more novel insights and better decision-making across the organization.

However, the size and complexity of data lakes can require the expertise of more technical users, such as data scientists and data engineers. And, because data governance occurs downstream in these systems, data lakes can be prone to data silos, and subsequently evolve into data swamps (where good data is inaccessible due to poor management).

Data lakehouse

Data lakehouses can resolve the core challenges of both data warehouses and data lakes to yield a better data management solution. They leverage cloud object storage for fast, low-cost storage across a broad range of data types, while also delivering high-performance analytics capabilities. Organizations can use data lakehouses alongside their existing data lakes and data warehouses without a full teardown and rebuild.

Benefits of a data lakehouse

Data lakehouses yield several key benefits to users. They can help:

  • Reduce data redundancy
  • Lower costs
  • Support a variety of workloads
  • Improve data governance
  • Enhance scalability
  • Enable real-time streaming

Reduce data redundancy

A single data storage system creates a streamlined platform to meet all business data demands, reducing data duplication. Data lakehouses also simplify end-to-end data observability by reducing the amount of data moving through data pipelines into various systems.

Lower costs

Data lakehouses capitalize on the lower costs of cloud object storage, so they are more cost-effective than data warehouses. Additionally, the hybrid architecture of a data lakehouse eliminates the need to maintain multiple data storage systems, making it less expensive to operate.

Support a variety of workloads

Data lakehouses can address different use cases across the data management lifecycle. They support workloads ranging from business intelligence and data-driven visualization to more complex data science work.

Improve data governance

The data lakehouse architecture mitigates the governance issues of data lakes. For example, as data is ingested and uploaded, the lakehouse can ensure it meets the defined schema requirements, reducing downstream data quality issues.
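
A minimal sketch of this idea, using PyArrow to validate incoming records against a declared schema before they are written. The schema and records are invented for the example; real lakehouse engines enforce schemas natively at write time.

    import pyarrow as pa

    # The schema every ingested batch must conform to
    expected_schema = pa.schema([
        ("order_id", pa.int64()),
        ("amount", pa.float64()),
        ("region", pa.string()),
    ])

    # An incoming batch of records (invented for the example)
    incoming = pa.table({
        "order_id": [1, 2],
        "amount": [19.99, 5.50],
        "region": ["EMEA", "APAC"],
    })

    # cast() raises an error if the batch cannot conform to the schema,
    # stopping bad data before it reaches downstream consumers
    validated = incoming.cast(expected_schema)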

Enhance scalability

In traditional data warehouses, compute and storage are coupled. Data lakehouses separate storage and compute, allowing data teams to access the same data storage while using different compute nodes for different applications. This decoupling results in more scalability and flexibility.

Enable real-time streaming

The data lakehouse is built for today's businesses and technology. Many data sources deliver real-time streaming data, such as that from Internet of Things (IoT) devices. The lakehouse system supports these sources through real-time data ingestion.
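
As a hedged sketch of streaming ingestion (the article does not prescribe a specific stack), here is Spark Structured Streaming reading IoT events from a hypothetical Kafka topic and continuously appending them to lake storage. It assumes the Kafka connector package is on the Spark classpath; broker, topic and paths are illustrative.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("iot-ingest").getOrCreate()

    # Read a continuous stream of device events from a Kafka topic
    events = (spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "device-telemetry")
        .load())

    # Continuously append the events to object storage as Parquet files
    (events.writeStream
        .format("parquet")
        .option("path", "s3a://example-lake/telemetry/")
        .option("checkpointLocation", "s3a://example-lake/_checkpoints/telemetry/")
        .start())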

What is Delta Lake?

Developed by Databricks in 2016, Delta Lake is an open source data storage format that combines Apache Parquet data files with a robust metadata log. This format adds key data management functions to data lakes, such as schema enforcement, time travel and ACID transactions. (ACID stands for “atomicity, consistency, isolation and durability,” which are key properties that define a transaction to ensure data integrity.)

These functions help make data lakes more reliable and intuitive. They also allow users to run structured query language (SQL) queries, analytics workloads and other activities on a data lake, streamlining business intelligence, data intelligence (DI), AI and ML.

Delta Lake was open sourced in 2019. Since then, data lakehouses have typically been created by building a Delta Lake storage layer on top of a data lake, then integrating it with a data processing engine such as Apache Spark or Hive.
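
The sketch below shows this pattern with the delta-spark package: each write is an ACID transaction recorded in the metadata log, and the "versionAsOf" read option provides time travel. The configuration follows the standard Delta Lake setup for Spark; paths and data are illustrative.

    from pyspark.sql import SparkSession

    # Assumes the delta-spark package is installed and on the Spark classpath
    spark = (SparkSession.builder.appName("delta-demo")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())

    df = spark.createDataFrame([(1, "widget"), (2, "gadget")], ["id", "product"])

    # Each write is an ACID transaction recorded in the Delta metadata log
    df.write.format("delta").save("/tmp/lake/products")                  # version 0
    df.write.format("delta").mode("append").save("/tmp/lake/products")   # version 1

    # Time travel: query the table as it existed at an earlier version
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/lake/products").show()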

Open source-enabled data lakehouses are often referred to as open data lakehouses. Other open table formats include Apache Iceberg (a high-performance format for massive analytic tables) and Apache Hudi (designed for incremental data processing).

Layers of the data lakehouse architecture

The architecture of a data lakehouse typically consists of five layers:

  • Ingestion layer
  • Storage layer
  • Metadata layer
  • API layer
  • Consumption layer

Ingestion layer

This first layer gathers data from a range of sources and transforms it into a data format that a lakehouse can store and analyze. The ingestion layer can use protocols to connect with internal and external sources such as database management systems, NoSQL databases and social media. 

Storage layer

In this layer, structured, unstructured and semi-structured datasets are stored in open-source file formats, such as Parquet or Optimized Row Columnar (ORC). This layer provides a major benefit of the data lakehouse—its ability to accept all data types at an affordable cost.
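
For example, a small dataset can be persisted in Parquet with PyArrow, making it readable by many different engines. The file name and data here are illustrative.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "customer_id": [101, 102, 103],
        "plan": ["basic", "pro", "pro"],
    })

    # Columnar and compressed, readable by many engines (Spark, Presto and others)
    pq.write_table(table, "customers.parquet", compression="snappy")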

Metadata layer

The metadata layer is a unified catalog that delivers metadata for every object in the lake storage, helping organize and provide information about data in the system. This layer also offers ACID transactions, file caching and indexing for faster queries. Users can implement predefined schemas here, which enable data governance and auditing capabilities.

API layer

A data lakehouse uses application programming interfaces (APIs) to speed up task processing and enable more advanced analytics. Specifically, this layer lets consumers and developers use a range of languages and libraries, such as TensorFlow, at an abstract level. The APIs are optimized for data asset consumption.

Consumption layer

The final layer of the data lakehouse architecture hosts apps and tools, with access to all the metadata and data stored in the lake. This opens data access to users across an organization, who can use the lakehouse for tasks such as business intelligence dashboards, data visualization and machine learning jobs.
