LHF: A new archive based approach to accelerate massive small files access performance in HDFS

Wenjun Tao, Yanlong Zhai, Jude Tchaye-Kondi

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

11 Citations (Scopus)

Abstract

As one of the most popular open source projects, Hadoop is now considered the de facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components of the Hadoop framework and stores big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines. However, performance degrades severely when dealing with massive numbers of small files. Although some effort has been spent investigating this well-known issue, existing approaches, such as HAR, SequenceFile, and MapFile, are limited in their ability to reduce the memory consumption of the NameNode and optimize access performance at the same time. In this paper, we present LHF, a solution for handling massive small files in HDFS by merging small files into big files and building a linear-hashing-based extendable index to speed up the process of locating a small file. The advantages of our approach are that (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files at the client side, (3) it supports appending more small files to the merged file afterwards, and (4) it achieves good access performance. A series of experiments was performed to demonstrate the effectiveness and efficiency of LHF, which takes less time to access files than the other methods.
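
The idea summarized in the abstract (merge small files into one large file, then locate each one through a linear hashing index) can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the class `LinearHashIndex`, the helpers `merge_small_files` and `read_small_file`, the MD5-based hash, and the load-factor threshold are all assumptions chosen for illustration, and a Python `bytearray` stands in for the merged file in HDFS.

```python
import hashlib


class LinearHashIndex:
    """Toy linear hashing index: file name -> (offset, length). Hypothetical."""

    def __init__(self, initial_buckets=2, max_load=2.0):
        self.n0 = initial_buckets   # bucket count at level 0
        self.level = 0              # completed doubling rounds
        self.split = 0              # next bucket to split in this round
        self.max_load = max_load    # average entries per bucket before a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    @staticmethod
    def _hash(name):
        # Stable hash; Python's built-in hash() of str is salted per process.
        digest = hashlib.md5(name.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def _bucket_of(self, name):
        h = self._hash(name)
        b = h % (self.n0 << self.level)
        if b < self.split:          # this bucket was already split this round
            b = h % (self.n0 << (self.level + 1))
        return b

    def put(self, name, offset, length):
        bucket = self.buckets[self._bucket_of(name)]
        for i, (key, _) in enumerate(bucket):
            if key == name:         # overwrite an existing entry
                bucket[i] = (name, (offset, length))
                return
        bucket.append((name, (offset, length)))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def get(self, name):
        for key, loc in self.buckets[self._bucket_of(name)]:
            if key == name:
                return loc
        return None

    def _split_one(self):
        # Grow by exactly one bucket: redistribute only the bucket at `split`.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:   # round finished
            self.level += 1
            self.split = 0
        for key, loc in old:
            self.buckets[self._bucket_of(key)].append((key, loc))


def merge_small_files(files, blob=None, index=None):
    """Append (name, bytes) pairs to one blob and index each by name."""
    blob = bytearray() if blob is None else blob
    index = LinearHashIndex() if index is None else index
    for name, data in files:
        index.put(name, len(blob), len(data))
        blob.extend(data)
    return blob, index


def read_small_file(blob, index, name):
    """Look up a small file in the index and slice it out of the blob."""
    loc = index.get(name)
    if loc is None:
        raise KeyError(name)
    offset, length = loc
    return bytes(blob[offset:offset + length])
```

In this sketch the files are indexed in arrival order with no sorting step, and calling `merge_small_files` again with the existing `blob` and `index` appends further files. Because the table grows one bucket at a time, each insert redistributes at most one bucket rather than rehashing the whole index.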

Original language: English
Title of host publication: Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 40-48
Number of pages: 9
ISBN (Electronic): 9781728100593
DOI
Publication status: Published - Apr 2019
Event: 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019 - Newark, United States
Duration: 4 Apr 2019 to 9 Apr 2019

Publication series

Name: Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies

Conference

Conference: 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019
Country/Territory: United States
City: Newark
Period: 4/04/19 to 9/04/19
