LHF: A new archive based approach to accelerate massive small files access performance in HDFS

Wenjun Tao; Yanlong Zhai; Jude Tchaye-Kondi

doi:10.1109/BigDataService.2019.00012

School of Cyberspace Science and Technology

Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

11 Citations (Scopus)

Abstract

As one of the most popular open-source projects, Hadoop is now considered the de facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components of the Hadoop framework and is used to store big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines, but its performance degrades severely when dealing with massive numbers of small files. Although some effort has been spent investigating this well-known issue, existing approaches such as HAR, SequenceFile, and MapFile are limited in their ability to reduce the memory consumption of the NameNode while also optimizing access performance. In this paper, we present LHF, a solution for handling massive numbers of small files in HDFS that merges small files into big files and builds a linear-hashing-based extendable index to speed up locating a small file. The advantages of our approach are that (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files at the client side, (3) it supports appending more small files to the merged file afterwards, and (4) it achieves good access performance. A series of experiments demonstrates the effectiveness and efficiency of LHF, which takes less time to access files than the other methods.
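
The general idea the abstract describes, merging small files into one large file and locating each one through a linear hashing index, can be illustrated with a minimal in-memory sketch. This is not the authors' implementation: the names `LinearHashIndex`, `merge_files`, and `read_file` are hypothetical, the CRC32 hash and the bucket capacity are arbitrary assumptions, and a plain `bytes` object stands in for the merged file that LHF would store in HDFS.

```python
import zlib


class LinearHashIndex:
    """Toy linear hashing index: file name -> (offset, length) in a merged file."""

    def __init__(self, initial_buckets=4, bucket_capacity=4):
        self.n0 = initial_buckets      # bucket count at level 0
        self.level = 0                 # completed doubling rounds
        self.split = 0                 # next bucket to split in this round
        self.capacity = bucket_capacity
        self.count = 0                 # number of indexed files
        self.buckets = [[] for _ in range(initial_buckets)]

    def _bucket(self, name):
        h = zlib.crc32(name.encode("utf-8"))   # assumed hash function
        b = h % (self.n0 << self.level)
        if b < self.split:             # bucket was already split this round
            b = h % (self.n0 << (self.level + 1))
        return b

    def put(self, name, offset, length):
        bucket = self.buckets[self._bucket(name)]
        for i, (key, _) in enumerate(bucket):
            if key == name:            # overwrite an existing entry
                bucket[i] = (name, (offset, length))
                return
        bucket.append((name, (offset, length)))
        self.count += 1
        # Grow by exactly one bucket when the average load is exceeded.
        if self.count > self.capacity * len(self.buckets):
            self._split_one()

    def _split_one(self):
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])        # image bucket for the one being split
        self.split += 1
        if self.split == self.n0 << self.level:
            self.level += 1            # round finished: table has doubled
            self.split = 0
        for name, loc in old:          # redistribute only the split bucket
            self.buckets[self._bucket(name)].append((name, loc))

    def get(self, name):
        for key, loc in self.buckets[self._bucket(name)]:
            if key == name:
                return loc
        return None


def merge_files(files, index, merged=b""):
    """Append (name, data) pairs to `merged` and record each position."""
    out = bytearray(merged)
    for name, data in files:
        index.put(name, len(out), len(data))
        out += data
    return bytes(out)


def read_file(merged, index, name):
    """Return the bytes of one small file from the merged blob."""
    loc = index.get(name)
    if loc is None:
        raise KeyError(name)
    offset, length = loc
    return merged[offset:offset + length]
```

In this sketch, files are indexed in arrival order with no sorting step, and calling `merge_files` again with the existing blob appends new files while the index grows one bucket at a time. These properties of the toy code mirror advantages (2) and (3) listed in the abstract, but the sketch says nothing about how LHF itself lays out its index or data.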

Original language	English
Title of host publication	Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	40-48
Number of pages	9
ISBN (Electronic)	9781728100593
DOIs	https://doi.org/10.1109/BigDataService.2019.00012
Publication status	Published - Apr 2019
Event	5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019 - Newark, United States; Duration: 4 Apr 2019 → 9 Apr 2019

Publication series

Name	Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies

Conference

Conference	5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019
Country/Territory	United States
City	Newark
Period	4 Apr 2019 → 9 Apr 2019

Keywords

HDFS
Linear hashing
Massive small files

Cite this

Tao, W., Zhai, Y., & Tchaye-Kondi, J. (2019). LHF: A new archive based approach to accelerate massive small files access performance in HDFS. In Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies (pp. 40-48). Article 8848237 (Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigDataService.2019.00012

@inproceedings{aacad2b3752f44d18185650735cdb41d,

title = "LHF: A new archive based approach to accelerate massive small files access performance in HDFS",

keywords = "HDFS, Linear hashing, Massive small files",

author = "Wenjun Tao and Yanlong Zhai and Jude Tchaye-Kondi",

note = "Publisher Copyright: {\textcopyright} 2019 IEEE.; 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019 ; Conference date: 04-04-2019 Through 09-04-2019",

year = "2019",

month = apr,

doi = "10.1109/BigDataService.2019.00012",

language = "English",

series = "Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "40--48",

booktitle = "Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies",

address = "United States",

}

TY - GEN

T1 - LHF

T2 - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019

AU - Tao, Wenjun

AU - Zhai, Yanlong

AU - Tchaye-Kondi, Jude

PY - 2019/4

Y1 - 2019/4

KW - HDFS

KW - Linear hashing

KW - Massive small files

UR - http://www.scopus.com/inward/record.url?scp=85073245602&partnerID=8YFLogxK

U2 - 10.1109/BigDataService.2019.00012

DO - 10.1109/BigDataService.2019.00012

M3 - Conference contribution

AN - SCOPUS:85073245602

T3 - Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies

SP - 40

EP - 48

BT - Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies

PB - Institute of Electrical and Electronics Engineers Inc.

Y2 - 4 April 2019 through 9 April 2019

ER -
