Characterizing and subsetting big data workloads

Zhen Jia, Jianfeng Zhan*, Lei Wang, Rui Han, Sally A. McKee, Qiang Yang, Chunjie Luo, Jingwei Li

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

59 Citations (Scopus)

Abstract

Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates these challenges. In this paper, we first use Principal Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principal components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.
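The three steps in the abstract (PCA over per-workload metrics, clustering in the reduced space, and keeping one workload per cluster) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not name the clustering algorithm or the rule for choosing a representative, so k-means and "closest to the cluster centroid" are assumptions made here, as are the 90% variance threshold, the function names `pca`, `kmeans`, and `representatives`, and the randomly generated metric matrix that stands in for real measurements.

```python
import numpy as np


def pca(X, var_threshold=0.9):
    """Project rows of X (workloads x metrics) onto the leading principal components."""
    # Standardize each metric so that no single unit dominates the analysis.
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against constant metrics
    Z = (X - X.mean(axis=0)) / std
    # The right singular vectors of the standardized matrix are the principal axes.
    _, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    # Keep the fewest components whose cumulative variance reaches the threshold.
    k = int(np.searchsorted(np.cumsum(explained), var_threshold)) + 1
    k = min(k, len(S))
    return Z @ Vt[:k].T, explained[:k]


def assign(P, centers):
    """Return, for each row of P, the index of the nearest center."""
    dists = np.linalg.norm(P[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(P, k, iters=100, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for the paper's clustering step)."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), size=k, replace=False)]
    for _ in range(iters):
        labels = assign(P, centers)
        # Recompute each center; an empty cluster keeps its previous center.
        new_centers = np.array([
            P[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(P, centers), centers


def representatives(P, labels, centers):
    """Pick, per non-empty cluster, the workload closest to the centroid (assumed rule)."""
    reps = []
    for j in range(len(centers)):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        dist = np.linalg.norm(P[members] - centers[j], axis=1)
        reps.append(int(members[dist.argmin()]))
    return reps


# Synthetic stand-in data: 30 hypothetical workloads, 45 metrics each.
metrics = np.random.default_rng(42).normal(size=(30, 45))
projected, explained = pca(metrics, var_threshold=0.9)
labels, centers = kmeans(projected, k=7)
subset = representatives(projected, labels, centers)
print("components kept:", projected.shape[1])
print("representative workload indices:", subset)
```

With real data, each row of `metrics` would hold the 45 measured characteristics of one workload, and the indices in `subset` would identify the workloads retained for simulation.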

Original language: English
Title of host publication: IISWC 2014 - IEEE International Symposium on Workload Characterization
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 191-201
Number of pages: 11
ISBN (Electronic): 9781479964536
Publication status: Published - 11 Dec 2014
Externally published: Yes
Event: 2014 IEEE International Symposium on Workload Characterization, IISWC 2014 - Raleigh, United States
Duration: 26 Oct 2014 to 28 Oct 2014

Publication series

Name: IISWC 2014 - IEEE International Symposium on Workload Characterization

Conference

Conference: 2014 IEEE International Symposium on Workload Characterization, IISWC 2014
Country/Territory: United States
City: Raleigh
Period: 26/10/14 to 28/10/14
