TY - GEN
T1 - LHF
T2 - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019
AU - Tao, Wenjun
AU - Zhai, Yanlong
AU - Tchaye-Kondi, Jude
N1 - Publisher Copyright:
© 2019 IEEE.
PY - 2019/4
Y1 - 2019/4
N2 - As one of the most popular open source projects, Hadoop is considered nowadays as the de-facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components in Hadoop framework to store big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines. But the performance will be severely degraded while dealing with massive small files. Although some effort was spent to investigate this well-known issue, existing approaches, such as HAR, SequenceFile, and MapFile, are limited in their ability to reduce the memory consumption of the NameNode and optimize the access performance in the meantime. In this paper, we presented LHF, a solution to handle massive small files in HDFS by merging small files into big files and building a linear hashing based extendable index to speed up the process of locating a small file. The advantages of our approach are (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files at the client side, (3) it supports appending more small files to the merged file afterwards and (4) it achieves good access performance. A series of experiments were performed to demonstrate the effectiveness and efficiency of LHF as well, which takes less time while accessing files compared with other methods.
AB - As one of the most popular open source projects, Hadoop is considered nowadays as the de-facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components in Hadoop framework to store big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines. But the performance will be severely degraded while dealing with massive small files. Although some effort was spent to investigate this well-known issue, existing approaches, such as HAR, SequenceFile, and MapFile, are limited in their ability to reduce the memory consumption of the NameNode and optimize the access performance in the meantime. In this paper, we presented LHF, a solution to handle massive small files in HDFS by merging small files into big files and building a linear hashing based extendable index to speed up the process of locating a small file. The advantages of our approach are (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files at the client side, (3) it supports appending more small files to the merged file afterwards and (4) it achieves good access performance. A series of experiments were performed to demonstrate the effectiveness and efficiency of LHF as well, which takes less time while accessing files compared with other methods.
KW - HDFS
KW - Linear hashing
KW - Massive small files
UR - http://www.scopus.com/inward/record.url?scp=85073245602&partnerID=8YFLogxK
U2 - 10.1109/BigDataService.2019.00012
DO - 10.1109/BigDataService.2019.00012
M3 - Conference contribution
AN - SCOPUS:85073245602
T3 - Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies
SP - 40
EP - 48
BT - Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 4 April 2019 through 9 April 2019
ER -