TY - JOUR
T1 - Reducing cache and TLB power by exploiting memory region and privilege level semantics
AU - Fang, Zhen
AU - Zhao, Li
AU - Jiang, Xiaowei
AU - Lu, Shih Lien
AU - Iyer, Ravi
AU - Li, Tong
AU - Lee, Seung Eun
PY - 2013
Y1 - 2013
N2 - The L1 cache in today's high-performance processors accesses all ways of a selected set in parallel. This constitutes a major source of energy inefficiency: at most one of the N fetched blocks can be useful in an N-way set-associative cache. The other N-1 cache lines will all be tag mismatches and subsequently discarded. We propose to eliminate unnecessary associative fetches by exploiting certain software semantics in cache design, thus reducing dynamic power consumption. Specifically, we use memory region information to eliminate unnecessary fetches in the data cache, and ring level information to optimize fetches in the instruction cache. We present a design that is performance-neutral, transparent to applications, and incurs a space overhead of a mere 0.41% of the L1 cache. We show significantly reduced cache lookups with benchmarks including SPEC CPU, SPECjbb, SPECjAppServer, PARSEC, and Apache. For example, for SPEC CPU 2006, the proposed mechanism helps to reduce cache block fetches from the data and instruction caches by an average of 29% and 53%, respectively, resulting in power savings of 17% and 35% in the caches, compared to the aggressively clock-gated baselines.
AB - The L1 cache in today's high-performance processors accesses all ways of a selected set in parallel. This constitutes a major source of energy inefficiency: at most one of the N fetched blocks can be useful in an N-way set-associative cache. The other N-1 cache lines will all be tag mismatches and subsequently discarded. We propose to eliminate unnecessary associative fetches by exploiting certain software semantics in cache design, thus reducing dynamic power consumption. Specifically, we use memory region information to eliminate unnecessary fetches in the data cache, and ring level information to optimize fetches in the instruction cache. We present a design that is performance-neutral, transparent to applications, and incurs a space overhead of a mere 0.41% of the L1 cache. We show significantly reduced cache lookups with benchmarks including SPEC CPU, SPECjbb, SPECjAppServer, PARSEC, and Apache. For example, for SPEC CPU 2006, the proposed mechanism helps to reduce cache block fetches from the data and instruction caches by an average of 29% and 53%, respectively, resulting in power savings of 17% and 35% in the caches, compared to the aggressively clock-gated baselines.
KW - First-level cache
KW - Memory regions
KW - Ring level
KW - Simulation
KW - Translation lookaside buffer
UR - http://www.scopus.com/inward/record.url?scp=84880347279&partnerID=8YFLogxK
U2 - 10.1016/j.sysarc.2013.04.002
DO - 10.1016/j.sysarc.2013.04.002
M3 - Article
AN - SCOPUS:84880347279
SN - 1383-7621
VL - 59
SP - 279
EP - 295
JO - Journal of Systems Architecture
JF - Journal of Systems Architecture
IS - 6
ER -