LHF: A new archive based approach to accelerate massive small files access performance in HDFS

Wenjun Tao, Yanlong Zhai, Jude Tchaye-Kondi

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

10 Citations (Scopus)

Abstract

As one of the most popular open source projects, Hadoop is now considered the de facto framework for managing and analyzing huge amounts of data. HDFS (Hadoop Distributed File System) is one of the core components of the Hadoop framework for storing big data, especially semi-structured and unstructured data. HDFS provides high scalability and reliability when handling large files across thousands of machines, but its performance degrades severely when dealing with massive numbers of small files. Although some effort has been spent investigating this well-known issue, existing approaches, such as HAR, SequenceFile, and MapFile, are limited in their ability to reduce the memory consumption of the NameNode and optimize access performance at the same time. In this paper, we present LHF, a solution for handling massive small files in HDFS by merging small files into big files and building a linear-hashing-based extendable index to speed up the process of locating a small file. The advantages of our approach are that (1) it significantly reduces the size of the metadata, (2) it does not require sorting the files at the client side, (3) it supports appending more small files to the merged file afterwards, and (4) it achieves good access performance. A series of experiments was performed to demonstrate the effectiveness and efficiency of LHF, which takes less time to access files than the other methods.
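
The idea of merging small files and locating them through a linear hashing index can be illustrated with a minimal in-memory sketch. This is not the authors' implementation and does not touch HDFS: the names `LinearHashIndex`, `merge_small_files`, and `read_small_file`, the CRC32 hash, the bucket capacity, and the split policy are all assumptions chosen for illustration. It shows only how an index that grows one bucket at a time can map a file name to an (offset, length) pair inside a merged file, without sorting the names and while still allowing later appends.

```python
import zlib


class LinearHashIndex:
    """Toy linear-hashing index: file name -> (offset, length) in a merged file."""

    def __init__(self, initial_buckets=4, bucket_capacity=4):
        self.n0 = initial_buckets        # bucket count at level 0 (assumed value)
        self.level = 0                   # current doubling round
        self.split = 0                   # next bucket to be split
        self.capacity = bucket_capacity  # average load that triggers a split
        self.buckets = [[] for _ in range(initial_buckets)]
        self.count = 0

    def _bucket_of(self, name):
        h = zlib.crc32(name.encode("utf-8"))
        b = h % (self.n0 << self.level)
        if b < self.split:
            # This bucket was already split in the current round,
            # so address it with the next-level hash function.
            b = h % (self.n0 << (self.level + 1))
        return b

    def put(self, name, offset, length):
        bucket = self.buckets[self._bucket_of(name)]
        for i, entry in enumerate(bucket):
            if entry[0] == name:
                bucket[i] = (name, offset, length)
                return
        bucket.append((name, offset, length))
        self.count += 1
        if self.count > self.capacity * len(self.buckets):
            self._split_one()

    def _split_one(self):
        # Grow by exactly one bucket and redistribute only the split bucket.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        if self.split == self.n0 << self.level:
            self.level += 1
            self.split = 0
        for entry in old:
            self.buckets[self._bucket_of(entry[0])].append(entry)

    def get(self, name):
        for key, offset, length in self.buckets[self._bucket_of(name)]:
            if key == name:
                return offset, length
        return None


def merge_small_files(files, index, merged=None):
    """Append each small file to the merged buffer and record its location."""
    merged = bytearray() if merged is None else merged
    for name, data in files.items():
        index.put(name, len(merged), len(data))
        merged.extend(data)
    return merged


def read_small_file(merged, index, name):
    """Look up a small file in the index and slice it out of the merged buffer."""
    location = index.get(name)
    if location is None:
        raise KeyError(name)
    offset, length = location
    return bytes(merged[offset:offset + length])


# Usage: merge an unsorted batch, then append a second batch later.
index = LinearHashIndex()
batch1 = {f"img_{i}.dat": f"payload-{i}".encode() for i in range(100)}
merged = merge_small_files(batch1, index)
merged = merge_small_files({"late.txt": b"appended later"}, index, merged)
```

In this sketch a lookup hashes the name once and scans a single short bucket, and each split moves the entries of only one bucket, which is the property that makes linear hashing suitable for an index that must keep growing as files are appended.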

Original language: English
Title of host publication: Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 40-48
Number of pages: 9
ISBN (Electronic): 9781728100593
Publication status: Published - Apr 2019
Event: 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019 - Newark, United States
Duration: 4 Apr 2019 to 9 Apr 2019

Publication series

Name: Proceedings - 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019, Workshop on Big Data in Water Resources, Environment, and Hydraulic Engineering and Workshop on Medical, Healthcare, Using Big Data Technologies

Conference

Conference: 5th IEEE International Conference on Big Data Service and Applications, BigDataService 2019
Country/Territory: United States
City: Newark
Period: 4/04/19 to 9/04/19

Keywords

  • HDFS
  • Linear hashing
  • Massive small files
