TY - JOUR
T1 - ARP
T2 - An adaptive runtime mechanism to partition shared cache in SMT architecture
AU - Sui, Xiufeng
AU - Wu, Junmin
AU - Chen, Guoliang
PY - 2008/7
Y1 - 2008/7
N2 - Simultaneous multithreading (SMT) is a latency-tolerant architecture that typically employs a shared L2 cache. Because it can execute instructions from multiple threads each cycle, it increases pressure on the memory hierarchy. This paper studies the problem of partitioning a shared cache among concurrently executing threads in an SMT architecture, focusing on fairness in cache sharing and its relation to throughput. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, granting more cache resources to applications with higher demand. LRU therefore manages the cache unfairly and can lead to serious problems such as thread starvation and priority inversion. An adaptive runtime partitioning (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the metric of cache partitioning and uses a dynamic partitioning algorithm to optimize it. The algorithm is easy to implement and requires little or no profiling; it uses a classical monitor circuit to collect per-thread stack distance information and incurs less than 0.25% storage overhead. Evaluation shows that, on average, ARP improves the fairness of a 2-way SMT by a factor of 2.26 while increasing throughput by 14.75%, compared with LRU-based cache sharing.
AB - Simultaneous multithreading (SMT) is a latency-tolerant architecture that typically employs a shared L2 cache. Because it can execute instructions from multiple threads each cycle, it increases pressure on the memory hierarchy. This paper studies the problem of partitioning a shared cache among concurrently executing threads in an SMT architecture, focusing on fairness in cache sharing and its relation to throughput. The commonly used LRU policy implicitly partitions a shared cache on a demand basis, granting more cache resources to applications with higher demand. LRU therefore manages the cache unfairly and can lead to serious problems such as thread starvation and priority inversion. An adaptive runtime partitioning (ARP) mechanism is implemented to manage the shared cache. ARP takes fairness as the metric of cache partitioning and uses a dynamic partitioning algorithm to optimize it. The algorithm is easy to implement and requires little or no profiling; it uses a classical monitor circuit to collect per-thread stack distance information and incurs less than 0.25% storage overhead. Evaluation shows that, on average, ARP improves the fairness of a 2-way SMT by a factor of 2.26 while increasing throughput by 14.75%, compared with LRU-based cache sharing.
KW - Cache fairness
KW - Dynamic partitioning
KW - Dynamic set sampling
KW - Shared cache
KW - Simultaneous multithreading
UR - http://www.scopus.com/inward/record.url?scp=54749085890&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:54749085890
SN - 1000-1239
VL - 45
SP - 1269
EP - 1277
JO - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
JF - Jisuanji Yanjiu yu Fazhan/Computer Research and Development
IS - 7
ER -