Hadoop Perfect File: A fast and memory-efficient metadata access archive file to face small files problem in HDFS

Yanlong Zhai*, Jude Tchaye-Kondi, Kwei Jay Lin, Liehuang Zhu, Wenjun Tao, Xiaojiang Du, Mohsen Guizani

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

20 Citations (Scopus)

Abstract

HDFS faces several issues when handling a large number of small files. These issues are well addressed by archive systems, which combine small files into larger ones and use index files to hold the information needed to retrieve a small file's content from the big archive file. However, existing archive-based solutions incur significant overhead when retrieving a file's content, since additional processing and I/Os are needed to acquire the retrieval information before the actual content can be accessed, which deteriorates access efficiency. This paper presents a new archive file named Hadoop Perfect File (HPF). HPF minimizes access overhead by reading a file's metadata directly from the part of the index file that contains it, thereby reducing the additional processing and I/Os needed and improving access efficiency for archive files. Our index system uses two hash functions: a dynamic hash function distributes metadata records across index files, and an order-preserving perfect hash function memorizes the position of a small file's metadata record within its index file.
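To make the two-level lookup in the abstract concrete, the following is a minimal sketch, not the authors' implementation: a simple string hash stands in for the dynamic hash that picks the index file, and a PerfectHash interface stands in for the order-preserving perfect hash that gives the record's slot, so a single positioned read can fetch the metadata. Names such as PerfectHash, numIndexFiles, and recordSize are assumptions for illustration only.

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative sketch of the two-level metadata lookup described in the
 * abstract (not the authors' code): step 1 hashes the file name to an
 * index file, step 2 uses a perfect hash to locate the record's slot,
 * turning the lookup into one seek plus one fixed-size read.
 */
public class HpfLookupSketch {

    /** Placeholder for an order-preserving minimal perfect hash over one index file's keys. */
    interface PerfectHash {
        int slotOf(String fileName); // slot (record position) inside that index file
    }

    private final int numIndexFiles;        // number of index files the metadata is spread over (assumed)
    private final int recordSize;            // fixed size of one metadata record in bytes (assumed)
    private final PerfectHash[] perfectHashes; // one perfect hash per index file

    HpfLookupSketch(int numIndexFiles, int recordSize, PerfectHash[] perfectHashes) {
        this.numIndexFiles = numIndexFiles;
        this.recordSize = recordSize;
        this.perfectHashes = perfectHashes;
    }

    /** Step 1: a dynamic hash chooses which index file holds the record. */
    int indexFileOf(String fileName) {
        byte[] key = fileName.getBytes(StandardCharsets.UTF_8);
        int h = 0;
        for (byte b : key) h = 31 * h + (b & 0xff);
        return Math.floorMod(h, numIndexFiles);
    }

    /** Step 2: the perfect hash yields the byte offset of the record within that index file. */
    long recordOffset(String fileName) {
        int indexFile = indexFileOf(fileName);
        int slot = perfectHashes[indexFile].slotOf(fileName);
        return (long) slot * recordSize;
    }
}
```

With such a layout, reading a small file's metadata needs no scan of the index: the client opens the chosen index file and issues one positioned read of recordSize bytes at recordOffset, which is the access-overhead reduction the abstract claims.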

Original language: English
Pages (from-to): 119-130
Number of pages: 12
Journal: Journal of Parallel and Distributed Computing
Volume: 156
DOI
Publication status: Published - Oct 2021
