TY - JOUR
T1 - ARP
T2 - An adaptive runtime mechanism to partition shared cache in SMT architecture
AU - Sui, Xiufeng
AU - Wu, Junmin
AU - Chen, Guoliang
PY - 2008/7
Y1 - 2008/7
N2 - Simultaneous multithreading (SMT) is a latency-tolerant architecture that typically employs a shared L2 cache. Because it can execute instructions from multiple threads each cycle, it increases pressure on the memory hierarchy. This paper studies the problem of partitioning a shared cache among concurrently executing threads in an SMT architecture, focusing on fairness in cache sharing and its relation to throughput. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, granting more cache resources to applications with higher demand. LRU therefore manages the cache unfairly and can lead to serious problems such as thread starvation and priority inversion. An adaptive runtime partitioning (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the metric of cache partitioning and uses a dynamic partitioning algorithm to optimize it. The algorithm is easy to implement and requires little or no profiling; it uses a classical monitor circuit to collect per-thread stack distance information and incurs less than 0.25% storage overhead. Evaluation shows that, on average, ARP improves the fairness of a 2-way SMT by a factor of 2.26 while increasing throughput by 14.75%, compared with LRU-based cache sharing.
AB - Simultaneous multithreading (SMT) is a latency-tolerant architecture that typically employs a shared L2 cache. Because it can execute instructions from multiple threads each cycle, it increases pressure on the memory hierarchy. This paper studies the problem of partitioning a shared cache among concurrently executing threads in an SMT architecture, focusing on fairness in cache sharing and its relation to throughput. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, granting more cache resources to applications with higher demand. LRU therefore manages the cache unfairly and can lead to serious problems such as thread starvation and priority inversion. An adaptive runtime partitioning (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the metric of cache partitioning and uses a dynamic partitioning algorithm to optimize it. The algorithm is easy to implement and requires little or no profiling; it uses a classical monitor circuit to collect per-thread stack distance information and incurs less than 0.25% storage overhead. Evaluation shows that, on average, ARP improves the fairness of a 2-way SMT by a factor of 2.26 while increasing throughput by 14.75%, compared with LRU-based cache sharing.
KW - Cache fairness
KW - Dynamic partitioning
KW - Dynamic set sampling
KW - Shared cache
KW - Simultaneous multithreading
UR - http://www.scopus.com/inward/record.url?scp=54749085890&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:54749085890
SN - 1000-1239
VL - 45
SP - 1269
EP - 1277
JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
IS - 7
ER -