Design of RDD Persistence Method in Spark for SSDs

Kezhong Lu; Jinbin Zhu; Zhengmin Li; Xiufeng Sui

doi:10.7544/issn1000-1239.2017.20170108

Kezhong Lu, Jinbin Zhu, Zhengmin Li^*, Xiufeng Sui

^*Corresponding author for this work

Research output: Contribution to journal › Article › Peer-reviewed

1 Citation (Scopus)

Abstract

Hybrid storage systems that combine SSDs (solid-state drives) and HDDs (hard disk drives) are widely used in big data computing datacenters. Workloads should be able to persist data with different characteristics to SSD or HDD on demand to improve overall system performance. Spark is an efficient data computing framework used across the industry, especially for applications with multiple iterations, because it can persist data in memory or on hard disk, and persisting data to disk removes the limit that insufficient memory places on the size of the data set. However, the current Spark implementation does not provide an explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media based on configuration information, users cannot specify an RDD's persistence location according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This has not only become a bottleneck to further improving Spark's performance, but also seriously limits the performance that hybrid storage systems can deliver. To the best of our knowledge, this paper presents the first data persistence strategy for SSDs. We explore the data persistence principle in Spark and optimize the architecture for hybrid storage systems. As a result, users can specify an RDD's storage medium explicitly and flexibly through the persistence API we provide. Experimental results based on SparkBench show that performance improves by 14.02% on average.

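Original language	English
Pages (from-to)	1381-1390
Number of pages	10
Journal	Jisuanji Yanjiu yu Fazhan/Computer Research and Development
Volume	54
Issue number	6
DOI	https://doi.org/10.7544/issn1000-1239.2017.20170108
Publication status	Published - 1 Jun 2017
Externally published	Yes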

Cite this

@article{34698306b1a3433788fea0c89c4922fa,

title = "Design of RDD Persistence Method in Spark for SSDs",

abstract = "Hybrid storage systems that combine SSDs (solid-state drives) and HDDs (hard disk drives) are widely used in big data computing datacenters. Workloads should be able to persist data with different characteristics to SSD or HDD on demand to improve overall system performance. Spark is an efficient data computing framework used across the industry, especially for applications with multiple iterations, because it can persist data in memory or on hard disk, and persisting data to disk removes the limit that insufficient memory places on the size of the data set. However, the current Spark implementation does not provide an explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media based on configuration information, users cannot specify an RDD's persistence location according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This has not only become a bottleneck to further improving Spark's performance, but also seriously limits the performance that hybrid storage systems can deliver. To the best of our knowledge, this paper presents the first data persistence strategy for SSDs. We explore the data persistence principle in Spark and optimize the architecture for hybrid storage systems. As a result, users can specify an RDD's storage medium explicitly and flexibly through the persistence API we provide. Experimental results based on SparkBench show that performance improves by 14.02% on average.",

keywords = "Big data, Hybrid storage, Persistence, Solid-state drive (SSD), Spark",

author = "Kezhong Lu and Jinbin Zhu and Zhengmin Li and Xiufeng Sui",

year = "2017",

month = jun,

day = "1",

doi = "10.7544/issn1000-1239.2017.20170108",

language = "English",

volume = "54",

pages = "1381--1390",

journal = "Jisuanji Yanjiu yu Fazhan/Computer Research and Development",

issn = "1000-1239",

publisher = "Science China Press",

number = "6",

}

TY - JOUR

T1 - Design of RDD Persistence Method in Spark for SSDs

AU - Lu, Kezhong

AU - Zhu, Jinbin

AU - Li, Zhengmin

AU - Sui, Xiufeng

PY - 2017/6/1

Y1 - 2017/6/1

N2 - Hybrid storage systems that combine SSDs (solid-state drives) and HDDs (hard disk drives) are widely used in big data computing datacenters. Workloads should be able to persist data with different characteristics to SSD or HDD on demand to improve overall system performance. Spark is an efficient data computing framework used across the industry, especially for applications with multiple iterations, because it can persist data in memory or on hard disk, and persisting data to disk removes the limit that insufficient memory places on the size of the data set. However, the current Spark implementation does not provide an explicit SSD-oriented persistence interface. Although data can be distributed proportionally across different storage media based on configuration information, users cannot specify an RDD's persistence location according to the characteristics of its data, so the mechanism lacks specificity and flexibility. This has not only become a bottleneck to further improving Spark's performance, but also seriously limits the performance that hybrid storage systems can deliver. To the best of our knowledge, this paper presents the first data persistence strategy for SSDs. We explore the data persistence principle in Spark and optimize the architecture for hybrid storage systems. As a result, users can specify an RDD's storage medium explicitly and flexibly through the persistence API we provide. Experimental results based on SparkBench show that performance improves by 14.02% on average.

KW - Big data

KW - Hybrid storage

KW - Persistence

KW - Solid-state drive (SSD)

KW - Spark

UR - http://www.scopus.com/inward/record.url?scp=85029564167&partnerID=8YFLogxK

U2 - 10.7544/issn1000-1239.2017.20170108

DO - 10.7544/issn1000-1239.2017.20170108

M3 - Article

AN - SCOPUS:85029564167

SN - 1000-1239

VL - 54

SP - 1381

EP - 1390

JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development

JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development

IS - 6

ER -
