Separation Is for Better Reunion: Data Lake Storage at Huawei

Xin Tang, Chengliang Chai*, Dawei Zhao, Haohai Ma, Yong Zheng, Zhenyong Fan, Xin Wu, Jiaquan Zhang, Rui Zhang, Duanshun Li, Yi He, Keji Huang, Guangbin Meng, Yidong Wang, Yuefeng Zhou, Tao Tao, Lirong Jian, Jiwu Shu, Yuping Wang*, Ye Yuan, Guoren Wang, Guoliang Li

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Huawei collaborates with several large Chinese enterprises to store and process exabytes of nationwide operational data in data lake storage to provide business insights. Specifically, our customers ask to store and process massive log message data to support their real-time and decision-making applications. Thus, we need computation and storage components in the analytic platform to process and store these data cost-efficiently. To meet these user requirements, we have designed StreamLake, a storage system for the data lake, which introduces a novel design to serve log message streaming and batch data processing in distributed storage with high scalability, efficiency, reliability, and low cost. Specifically, we introduce a stream (storage) object as a storage abstraction for message streaming data to achieve a storage-disaggregated architecture with high scalability and reliability. Moreover, we use erasure coding and tiered storage to reduce storage cost; furthermore, a stream object can be automatically converted to a table object so that cost-effective stream and batch data processing can be achieved. For tabular data, we implement lakehouse functionality to support ACID via the table object, with metadata acceleration to improve the efficiency of data access between the compute and storage engines. We also design a LakeBrain optimizer on the storage side to optimize query performance and resource utilization under the storage-disaggregated architecture. Finally, we have deployed StreamLake at China Mobile, the world's largest mobile network operator, to serve over 20 PB of production data, and the results demonstrate improvements of 30% to 4x in query performance and over 37% in cost savings.

Original language: English
Title of host publication: Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
Publisher: IEEE Computer Society
Pages: 5142-5155
Number of pages: 14
ISBN (Electronic): 9798350317152
Publication status: Published - 2024
Event: 40th IEEE International Conference on Data Engineering, ICDE 2024 - Utrecht, Netherlands
Duration: 13 May 2024 to 17 May 2024

Publication series

Name: Proceedings - International Conference on Data Engineering
ISSN (Print): 1084-4627
ISSN (Electronic): 2375-0286

Conference

Conference: 40th IEEE International Conference on Data Engineering, ICDE 2024
Country/Territory: Netherlands
City: Utrecht
Period: 13/05/24 to 17/05/24
