Delta Lake

Published: 2020-08 Issue: 12 Volume: 13 Page: 3411-3424
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Armbrust Michael¹, Das Tathagata¹, Sun Liwen¹, Yavuz Burak¹, Zhu Shixiong¹, Murthy Mukul¹, Torres Joseph¹, van Hovell Herman¹, Ionescu Adrian¹, Łuszczak Alicja¹, Świtakowski Michał¹, Szafrański Michał¹, Li Xiao¹, Ueshin Takuya¹, Mokhtar Mostafa¹, Boncz Peter², Ghodsi Ali³, Paranjpye Sameer¹, Senster Pieter¹, Xin Reynold¹, Zaharia Matei⁴

Affiliation:

1. Databricks

2. CWI

3. UC Berkeley

4. Stanford University

Abstract

Cloud object stores such as Amazon S3 are some of the largest and most cost-effective storage systems on the planet, making them an attractive target to store large data warehouses and data lakes. Unfortunately, their implementation as key-value stores makes it difficult to achieve ACID transactions and high performance: metadata operations such as listing objects are expensive, and consistency guarantees are limited. In this paper, we present Delta Lake, an open source ACID table storage layer over cloud object stores initially developed at Databricks. Delta Lake uses a transaction log that is compacted into Apache Parquet format to provide ACID properties, time travel, and significantly faster metadata operations for large tabular datasets (e.g., the ability to quickly search billions of table partitions for those relevant to a query). It also leverages this design to provide high-level features such as automatic data layout optimization, upserts, caching, and audit logs. Delta Lake tables can be accessed from Apache Spark, Hive, Presto, Redshift and other systems. Delta Lake is deployed at thousands of Databricks customers that process exabytes of data per day, with the largest instances managing exabyte-scale datasets and billions of objects.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3415478.3415560

References: 51 articles.

1. Amazon Athena. https://aws.amazon.com/athena/.

2. Amazon Kinesis. https://aws.amazon.com/kinesis/.

3. Amazon Redshift. https://aws.amazon.com/redshift/.

4. Amazon S3. https://aws.amazon.com/s3/.

5. Apache Hadoop. https://hadoop.apache.org.

Cited by 139 articles.

1. On the compressibility of large-scale source code datasets;Journal of Systems and Software;2025-09

2. Unity Catalog: Open and Universal Governance for the Lakehouse and Beyond;Companion of the 2025 International Conference on Management of Data;2025-06-22

3. SAP HANA Cloud: Data Management for Modern Enterprise Applications;Companion of the 2025 International Conference on Management of Data;2025-06-22

4. Databricks Lakeguard: Supporting Fine-grained Access Control and Multi-user Capabilities for Apache Spark Workloads;Companion of the 2025 International Conference on Management of Data;2025-06-22

5. A Big Data Approach for Efficient Processing of Machine Operational Data;Proceedings of the 37th International Conference on Scalable Scientific Data Management;2025-06-22