Delta Lake

Author:

Armbrust Michael (1), Das Tathagata (1), Sun Liwen (1), Yavuz Burak (1), Zhu Shixiong (1), Murthy Mukul (1), Torres Joseph (1), van Hovell Herman (1), Ionescu Adrian (1), Łuszczak Alicja (1), Świtakowski Michał (1), Szafrański Michał (1), Li Xiao (1), Ueshin Takuya (1), Mokhtar Mostafa (1), Boncz Peter (2), Ghodsi Ali (3), Paranjpye Sameer (1), Senster Pieter (1), Xin Reynold (1), Zaharia Matei (4)

Affiliation:

1. Databricks

2. CWI

3. UC Berkeley

4. Stanford University

Abstract

Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: metadata operations such as listing objects are expensive, and consistency guarantees are limited. In this paper, we present Delta Lake, an open source ACID table storage layer over cloud object stores initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.
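The transaction-log idea in the abstract can be illustrated with a toy model. The sketch below is not Delta Lake's actual log format or API: the in-memory `ToyObjectStore`, the `_log/` key naming, and the `add`/`remove` action records are all assumptions made for illustration. It shows only how an ordered log of commits, each claimed with a put-if-absent write, can give atomic table updates and time travel. It leaves out the Parquet compaction of the log that the abstract mentions.

```python
import json


class ToyObjectStore:
    """In-memory stand-in for a cloud object store (hypothetical, for illustration)."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, value):
        # Return False if the key already exists, so only one writer can claim it.
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

    def get(self, key):
        return self._objects[key]


def log_key(version):
    # Zero-padded so keys sort in version order (the naming is an assumption).
    return f"_log/{version:020d}.json"


def commit(store, version, actions):
    """Try to write log record `version`; fails if another writer got there first."""
    return store.put_if_absent(log_key(version), json.dumps(actions))


def snapshot(store, version):
    """Replay log records 0..version and return the set of live data files."""
    files = set()
    for v in range(version + 1):
        for action in json.loads(store.get(log_key(v))):
            if action["op"] == "add":
                files.add(action["path"])
            elif action["op"] == "remove":
                files.discard(action["path"])
    return files


store = ToyObjectStore()
commit(store, 0, [{"op": "add", "path": "part-0.parquet"}])
commit(store, 1, [{"op": "add", "path": "part-1.parquet"}])
# A rewrite swaps old files for a new one in a single log record.
commit(store, 2, [
    {"op": "remove", "path": "part-0.parquet"},
    {"op": "remove", "path": "part-1.parquet"},
    {"op": "add", "path": "part-2.parquet"},
])
# A second writer racing for version 2 loses and would have to retry at version 3.
conflict = commit(store, 2, [{"op": "add", "path": "part-x.parquet"}])

print(sorted(snapshot(store, 1)))  # table as of version 1 (time travel)
print(sorted(snapshot(store, 2)))  # latest table state
print(conflict)
```

In this model, readers never see a half-applied rewrite, because the set of live files changes only when a whole log record becomes visible. Reading an older version is a matter of replaying fewer records.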

Publisher

Association for Computing Machinery (ACM)

References: 51 articles.

1. Amazon Athena. https://aws.amazon.com/athena/.

2. Amazon Kinesis. https://aws.amazon.com/kinesis/.

3. Amazon Redshift. https://aws.amazon.com/redshift/.

4. Amazon S3. https://aws.amazon.com/s3/.

5. Apache Hadoop. https://hadoop.apache.org.

Cited by 139 articles.

1. On the compressibility of large-scale source code datasets; Journal of Systems and Software; 2025-09

2. Unity Catalog: Open and Universal Governance for the Lakehouse and Beyond; Companion of the 2025 International Conference on Management of Data; 2025-06-22

3. SAP HANA Cloud: Data Management for Modern Enterprise Applications; Companion of the 2025 International Conference on Management of Data; 2025-06-22

4. Databricks Lakeguard: Supporting Fine-grained Access Control and Multi-user Capabilities for Apache Spark Workloads; Companion of the 2025 International Conference on Management of Data; 2025-06-22

5. A Big Data Approach for Efficient Processing of Machine Operational Data; Proceedings of the 37th International Conference on Scalable Scientific Data Management; 2025-06-22
