Separation Is for Better Reunion: Data Lake Storage at Huawei

Xin Tang; Chengliang Chai; Dawei Zhao; Haohai Ma; Yong Zheng; Zhenyong Fan; Xin Wu; Jiaquan Zhang; Rui Zhang; Duanshun Li; Yi He; Keji Huang; Guangbin Meng; Yidong Wang; Yuefeng Zhou; Tao Tao; Lirong Jian; Jiwu Shu; Yuping Wang; Ye Yuan; Guoren Wang; Guoliang Li

doi:10.1109/ICDE60146.2024.00386

Xin Tang, Chengliang Chai^*, Dawei Zhao, Haohai Ma, Yong Zheng, Zhenyong Fan, Xin Wu, Jiaquan Zhang, Rui Zhang, Duanshun Li, Yi He, Keji Huang, Guangbin Meng, Yidong Wang, Yuefeng Zhou, Tao Tao, Lirong Jian, Jiwu Shu, Yuping Wang^*, Ye Yuan, Guoren Wang, Guoliang Li

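^*Corresponding author for this work

School of Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract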

Huawei collaborates with some Chinese large business companies to store and process exabytes of nationwide operational data in data lake storage to provide business insights. Specifically, our customers will ask to store and process massive log message data to support their real-time and decision-making applications. Thus, we need computation and storage components in the analytic platform to process and store these data cost-efficiently. To meet these user requirements, we have designed a storage system in data lake, StreamLake, which introduces a novel design to serve log message streaming and batch data processing in distributed storage, with high scalability, efficiency, reliability and low cost. Specifically, we introduce a stream (storage) object as a storage abstraction for message streaming data to achieve the storage-disaggregated architecture with high scalability and reliability. Moreover, we utilize the erasure coding and tiered storage to save the storage cost, and furthermore, the stream object can be automatically converted to a table object such that cost-effective stream and batch data processing can be achieved. For tabular data, we implement the lakehouse functionality to support ACID via the table object, with a metadata acceleration to improve the efficiency of data access between the compute and storage engines. Also, we design a LakeBrain optimizer at the storage side to optimize the query performance and resource utilization under the storage-disaggregated architecture. Finally, we have also deployed StreamLake in China Mobile, the world's largest mobile network operator to serve over 20PB production data, and the results demonstrate improvements of 30% to 4x in terms of query performance and over 37% in terms of cost saving.

Original language	English
Title of host publication	Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024
Publisher	IEEE Computer Society
Pages	5142-5155
Number of pages	14
ISBN (Electronic)	9798350317152
DOI	https://doi.org/10.1109/ICDE60146.2024.00386
Publication status	Published - 2024
Event	40th IEEE International Conference on Data Engineering, ICDE 2024 - Utrecht, Netherlands. Duration: 13 May 2024 → 17 May 2024

Publication series

Name	Proceedings - International Conference on Data Engineering
ISSN (Print)	1084-4627
ISSN (Electronic)	2375-0286

Conference

Conference	40th IEEE International Conference on Data Engineering, ICDE 2024
Country/Territory	Netherlands
City	Utrecht
Period	13/05/24 → 17/05/24

Access to Document

10.1109/ICDE60146.2024.00386

Other files and links

Link to publication in Scopus

Cite this

Tang, X., Chai, C., Zhao, D., Ma, H., Zheng, Y., Fan, Z., Wu, X., Zhang, J., Zhang, R., Li, D., He, Y., Huang, K., Meng, G., Wang, Y., Zhou, Y., Tao, T., Jian, L., Shu, J., Wang, Y., ... Li, G. (2024). Separation Is for Better Reunion: Data Lake Storage at Huawei. In Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024 (pp. 5142-5155). (Proceedings - International Conference on Data Engineering). IEEE Computer Society. https://doi.org/10.1109/ICDE60146.2024.00386

@inproceedings{646a7bfcead74ad9bf9c7741624f1d6b,

title = "Separation Is for Better Reunion: Data Lake Storage at Huawei",

abstract = "Huawei collaborates with some Chinese large business companies to store and process exabytes of nationwide operational data in data lake storage to provide business insights. Specifically, our customers will ask to store and process massive log message data to support their real-time and decision-making applications. Thus, we need computation and storage components in the analytic platform to process and store these data cost-efficiently. To meet these user requirements, we have designed a storage system in data lake, StreamLake, which introduces a novel design to serve log message streaming and batch data processing in distributed storage, with high scalability, efficiency, reliability and low cost. Specifically, we introduce a stream (storage) object as a storage abstraction for message streaming data to achieve the storage-disaggregated architecture with high scalability and reliability. Moreover, we utilize the erasure coding and tiered storage to save the storage cost, and furthermore, the stream object can be automatically converted to a table object such that cost-effective stream and batch data processing can be achieved. For tabular data, we implement the lakehouse functionality to support ACID via the table object, with a metadata acceleration to improve the efficiency of data access between the compute and storage engines. Also, we design a LakeBrain optimizer at the storage side to optimize the query performance and resource utilization under the storage-disaggregated architecture. Finally, we have also deployed StreamLake in China Mobile, the world's largest mobile network operator to serve over 20PB production data, and the results demonstrate improvements of 30% to 4x in terms of query performance and over 37% in terms of cost saving.",

author = "Xin Tang and Chengliang Chai and Dawei Zhao and Haohai Ma and Yong Zheng and Zhenyong Fan and Xin Wu and Jiaquan Zhang and Rui Zhang and Duanshun Li and Yi He and Keji Huang and Guangbin Meng and Yidong Wang and Yuefeng Zhou and Tao Tao and Lirong Jian and Jiwu Shu and Yuping Wang and Ye Yuan and Guoren Wang and Guoliang Li",

note = "Publisher Copyright: {\textcopyright} 2024 IEEE.; 40th IEEE International Conference on Data Engineering, ICDE 2024 ; Conference date: 13-05-2024 Through 17-05-2024",

year = "2024",

doi = "10.1109/ICDE60146.2024.00386",

language = "English",

series = "Proceedings - International Conference on Data Engineering",

publisher = "IEEE Computer Society",

pages = "5142--5155",

booktitle = "Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024",

address = "United States",

}

Tang, X, Chai, C, Zhao, D, Ma, H, Zheng, Y, Fan, Z, Wu, X, Zhang, J, Zhang, R, Li, D, He, Y, Huang, K, Meng, G, Wang, Y, Zhou, Y, Tao, T, Jian, L, Shu, J, Wang, Y, Yuan, Y, Wang, G & Li, G 2024, Separation Is for Better Reunion: Data Lake Storage at Huawei. in Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024. Proceedings - International Conference on Data Engineering, IEEE Computer Society, pp. 5142-5155, 40th IEEE International Conference on Data Engineering, ICDE 2024, Utrecht, Netherlands, 13/05/24. https://doi.org/10.1109/ICDE60146.2024.00386

Separation Is for Better Reunion: Data Lake Storage at Huawei. / Tang, Xin; Chai, Chengliang; Zhao, Dawei et al.
Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024. IEEE Computer Society, 2024. p. 5142-5155 (Proceedings - International Conference on Data Engineering).

TY - GEN

T1 - Separation Is for Better Reunion: Data Lake Storage at Huawei

T2 - 40th IEEE International Conference on Data Engineering, ICDE 2024

AU - Tang, Xin

AU - Chai, Chengliang

AU - Zhao, Dawei

AU - Ma, Haohai

AU - Zheng, Yong

AU - Fan, Zhenyong

AU - Wu, Xin

AU - Zhang, Jiaquan

AU - Zhang, Rui

AU - Li, Duanshun

AU - He, Yi

AU - Huang, Keji

AU - Meng, Guangbin

AU - Wang, Yidong

AU - Zhou, Yuefeng

AU - Tao, Tao

AU - Jian, Lirong

AU - Shu, Jiwu

AU - Wang, Yuping

AU - Yuan, Ye

AU - Wang, Guoren

AU - Li, Guoliang

PY - 2024

Y1 - 2024

N2 - Huawei collaborates with some Chinese large business companies to store and process exabytes of nationwide operational data in data lake storage to provide business insights. Specifically, our customers will ask to store and process massive log message data to support their real-time and decision-making applications. Thus, we need computation and storage components in the analytic platform to process and store these data cost-efficiently. To meet these user requirements, we have designed a storage system in data lake, StreamLake, which introduces a novel design to serve log message streaming and batch data processing in distributed storage, with high scalability, efficiency, reliability and low cost. Specifically, we introduce a stream (storage) object as a storage abstraction for message streaming data to achieve the storage-disaggregated architecture with high scalability and reliability. Moreover, we utilize the erasure coding and tiered storage to save the storage cost, and furthermore, the stream object can be automatically converted to a table object such that cost-effective stream and batch data processing can be achieved. For tabular data, we implement the lakehouse functionality to support ACID via the table object, with a metadata acceleration to improve the efficiency of data access between the compute and storage engines. Also, we design a LakeBrain optimizer at the storage side to optimize the query performance and resource utilization under the storage-disaggregated architecture. Finally, we have also deployed StreamLake in China Mobile, the world's largest mobile network operator to serve over 20PB production data, and the results demonstrate improvements of 30% to 4x in terms of query performance and over 37% in terms of cost saving.

AB - Huawei collaborates with some Chinese large business companies to store and process exabytes of nationwide operational data in data lake storage to provide business insights. Specifically, our customers will ask to store and process massive log message data to support their real-time and decision-making applications. Thus, we need computation and storage components in the analytic platform to process and store these data cost-efficiently. To meet these user requirements, we have designed a storage system in data lake, StreamLake, which introduces a novel design to serve log message streaming and batch data processing in distributed storage, with high scalability, efficiency, reliability and low cost. Specifically, we introduce a stream (storage) object as a storage abstraction for message streaming data to achieve the storage-disaggregated architecture with high scalability and reliability. Moreover, we utilize the erasure coding and tiered storage to save the storage cost, and furthermore, the stream object can be automatically converted to a table object such that cost-effective stream and batch data processing can be achieved. For tabular data, we implement the lakehouse functionality to support ACID via the table object, with a metadata acceleration to improve the efficiency of data access between the compute and storage engines. Also, we design a LakeBrain optimizer at the storage side to optimize the query performance and resource utilization under the storage-disaggregated architecture. Finally, we have also deployed StreamLake in China Mobile, the world's largest mobile network operator to serve over 20PB production data, and the results demonstrate improvements of 30% to 4x in terms of query performance and over 37% in terms of cost saving.

UR - http://www.scopus.com/inward/record.url?scp=85200473692&partnerID=8YFLogxK

U2 - 10.1109/ICDE60146.2024.00386

DO - 10.1109/ICDE60146.2024.00386

M3 - Conference contribution

AN - SCOPUS:85200473692

T3 - Proceedings - International Conference on Data Engineering

SP - 5142

EP - 5155

BT - Proceedings - 2024 IEEE 40th International Conference on Data Engineering, ICDE 2024

PB - IEEE Computer Society

Y2 - 13 May 2024 through 17 May 2024

ER -
