Characterizing and subsetting big data workloads

Zhen Jia; Jianfeng Zhan; Lei Wang; Rui Han; Sally A. McKee; Qiang Yang; Chunjie Luo; Jingwei Li

doi:10.1109/IISWC.2014.6983058

Zhen Jia, Jianfeng Zhan^*, Lei Wang, Rui Han, Sally A. McKee, Qiang Yang, Chunjie Luo, Jingwei Li

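^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

59 Citations (Scopus)

Abstract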

Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
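The three steps in the abstract (PCA over per-workload metrics, clustering in the reduced space, keeping one workload per cluster) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the clustering technique, so k-means is an assumption here, the PCA uses a simple power iteration so the example needs only the standard library, and the workload names and metric values are invented for illustration. The paper itself works with 45 metrics and selects seven workloads.

```python
import math

def standardize(rows):
    """Scale every metric (column) to zero mean and unit variance."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(m)]
    return [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]

def principal_components(z, k, iters=200):
    """Approximate the top-k covariance eigenvectors (power iteration + deflation)."""
    n, m = len(z), len(z[0])
    cov = [[sum(r[a] * r[b] for r in z) / n for b in range(m)] for a in range(m)]
    comps = []
    for _component in range(k):
        v = [1.0 / math.sqrt(m)] * m
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(m)) for a in range(m))
        if lam < 1e-12:
            break  # no variance left to explain
        comps.append(v)
        # Deflate: remove the variance captured by this component.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(m)] for a in range(m)]
    return comps

def dist2(p, q):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """k-means with deterministic farthest-point initialisation (an assumed choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _step in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def select_representatives(names, metrics, n_components, n_clusters):
    """Return one workload name per cluster: the one closest to its centroid."""
    z = standardize(metrics)
    comps = principal_components(z, n_components)
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in comps] for row in z]
    labels, centroids = kmeans(scores, n_clusters)
    reps = []
    for c in range(n_clusters):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: dist2(scores[i], centroids[c]))
            reps.append(names[best])
    return reps

if __name__ == "__main__":
    # Hypothetical workloads and metric values, invented for this example.
    names = ["a1", "a2", "a3", "b1", "b2", "b3"]
    metrics = [[1.0, 10.0, 0.10], [1.1, 11.0, 0.12], [0.9, 9.5, 0.09],
               [5.0, 2.0, 0.90], [5.2, 2.1, 0.95], [4.8, 1.9, 0.85]]
    print(select_representatives(names, metrics, n_components=2, n_clusters=2))
```

Picking the member nearest each centroid is one plausible way to "remove redundant" workloads; other selection rules would fit the same description.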

Original language	English
Title of host publication	IISWC 2014 - IEEE International Symposium on Workload Characterization
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	191-201
Number of pages	11
ISBN (Electronic)	9781479964536
DOI	https://doi.org/10.1109/IISWC.2014.6983058
Publication status	Published - 11 Dec 2014
Externally published	Yes
Event	2014 IEEE International Symposium on Workload Characterization, IISWC 2014 - Raleigh, United States; Duration: 26 Oct 2014 → 28 Oct 2014

Publication series

Name	IISWC 2014 - IEEE International Symposium on Workload Characterization

Conference

Conference	2014 IEEE International Symposium on Workload Characterization, IISWC 2014
Country/Territory	United States
City	Raleigh
Period	26/10/14 → 28/10/14

Access to Document

10.1109/IISWC.2014.6983058

Cite this

Jia, Z., Zhan, J., Wang, L., Han, R., McKee, S. A., Yang, Q., Luo, C., & Li, J. (2014). Characterizing and subsetting big data workloads. In IISWC 2014 - IEEE International Symposium on Workload Characterization (pp. 191-201). Article 6983058 (IISWC 2014 - IEEE International Symposium on Workload Characterization). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IISWC.2014.6983058

@inproceedings{bf87ab04189d47f7b1252fbf8fd71931,

title = "Characterizing and subsetting big data workloads",

abstract = "Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.",

author = "Zhen Jia and Jianfeng Zhan and Lei Wang and Rui Han and McKee, {Sally A.} and Qiang Yang and Chunjie Luo and Jingwei Li",

note = "Publisher Copyright: {\textcopyright} 2014 IEEE.; 2014 IEEE International Symposium on Workload Characterization, IISWC 2014 ; Conference date: 26-10-2014 Through 28-10-2014",

year = "2014",

month = dec,

day = "11",

doi = "10.1109/IISWC.2014.6983058",

language = "English",

series = "IISWC 2014 - IEEE International Symposium on Workload Characterization",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "191--201",

booktitle = "IISWC 2014 - IEEE International Symposium on Workload Characterization",

address = "United States",

}

Jia, Z, Zhan, J, Wang, L, Han, R, McKee, SA, Yang, Q, Luo, C & Li, J 2014, Characterizing and subsetting big data workloads. in IISWC 2014 - IEEE International Symposium on Workload Characterization., 6983058, IISWC 2014 - IEEE International Symposium on Workload Characterization, Institute of Electrical and Electronics Engineers Inc., pp. 191-201, 2014 IEEE International Symposium on Workload Characterization, IISWC 2014, Raleigh, United States, 26/10/14. https://doi.org/10.1109/IISWC.2014.6983058

Characterizing and subsetting big data workloads. / Jia, Zhen; Zhan, Jianfeng; Wang, Lei et al.
IISWC 2014 - IEEE International Symposium on Workload Characterization. Institute of Electrical and Electronics Engineers Inc., 2014. p. 191-201 6983058 (IISWC 2014 - IEEE International Symposium on Workload Characterization).

TY - GEN

T1 - Characterizing and subsetting big data workloads

AU - Jia, Zhen

AU - Zhan, Jianfeng

AU - Wang, Lei

AU - Han, Rui

AU - McKee, Sally A.

AU - Yang, Qiang

AU - Luo, Chunjie

AU - Li, Jingwei

PY - 2014/12/11

Y1 - 2014/12/11

AB - Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.

UR - http://www.scopus.com/inward/record.url?scp=84946024376&partnerID=8YFLogxK

U2 - 10.1109/IISWC.2014.6983058

DO - 10.1109/IISWC.2014.6983058

M3 - Conference contribution

AN - SCOPUS:84946024376

T3 - IISWC 2014 - IEEE International Symposium on Workload Characterization

SP - 191

EP - 201

BT - IISWC 2014 - IEEE International Symposium on Workload Characterization

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 2014 IEEE International Symposium on Workload Characterization, IISWC 2014

Y2 - 26 October 2014 through 28 October 2014

ER -
