TY - GEN
T1 - Characterizing and subsetting big data workloads
AU - Jia, Zhen
AU - Zhan, Jianfeng
AU - Wang, Lei
AU - Han, Rui
AU - McKee, Sally A.
AU - Yang, Qiang
AU - Luo, Chunjie
AU - Li, Jingwei
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/12/11
Y1 - 2014/12/11
N2 - Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
AB - Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
UR - http://www.scopus.com/inward/record.url?scp=84946024376&partnerID=8YFLogxK
U2 - 10.1109/IISWC.2014.6983058
DO - 10.1109/IISWC.2014.6983058
M3 - Conference contribution
AN - SCOPUS:84946024376
T3 - IISWC 2014 - IEEE International Symposium on Workload Characterization
SP - 191
EP - 201
BT - IISWC 2014 - IEEE International Symposium on Workload Characterization
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 IEEE International Symposium on Workload Characterization, IISWC 2014
Y2 - 26 October 2014 through 28 October 2014
ER -