15 Cache
Goal: minimize the average memory access time by building a hierarchical memory system that has both low latency and high capacity.
Design choices: block size -> replacement policy -> write policy
Direct-mapped cache problem: conflict misses
Assuming a 1024-line DM cache with a block size of 1:
Consider running the 3-instruction LOOP A code, with the instructions located starting at word address 1024 and the data starting at word address 2048. The program makes alternating accesses to instructions and data, e.g., a loop of LD instructions.
Address conflict will cause the current resident of a cache line to be evicted in favor of the new request.
Conflict misses occur when two addresses issued by the CPU map to the same cache line.
If we use a fully-associative cache instead, there are no conflict misses, since a block can be placed in any cache line, but this comes at a high cost in hardware and access time.
Fixing the problem above requires a cache that can hold the two 3-word blocks at the same time; a FA cache needs only two of its cache lines to do so and achieves a 100% hit ratio.
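As a rough illustration (a minimal Python sketch I'm adding, not part of the original notes), the two word addresses in the example map to the same line of the 1024-line DM cache, so each access evicts the other and the loop never hits:

```python
# Minimal DM-cache sketch: 1024 lines, block size of 1 word.
NUM_LINES = 1024

tags = [None] * NUM_LINES        # tag stored in each cache line
hits = misses = 0
for _ in range(10):              # a few iterations of the loop
    for addr in (1024, 2048):    # instruction fetch, then data access
        index, tag = addr % NUM_LINES, addr // NUM_LINES
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag    # evict the current resident of the line
print(hits, misses)              # 0 hits, 20 misses: every access conflicts
```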
N-way set-associative cache = N DM caches
Improvement compared to a DM cache: fewer conflicting addresses.
N-way SA cache can accommodate up to N blocks whose addresses map to the same cache index.
Associativity: increase the number of cache locations checked on each access.
Similar to asking how big the block size must be to achieve a higher hit ratio, how many ways do we need to avoid the cache-line conflicts of a DM cache?
Here a "way" mainly refers to how many cache lines can have their tags checked simultaneously on each access.
The mapping from addresses to cache lines is designed to avoid conflicts between neighboring locations.
We only need to worry about conflicts between the different regions: code, stack, and data.
Associativity tradeoffs: there’s little additional impact on the miss ratio beyond 4 to 8 ways.
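A minimal sketch of the lookup path, assuming a hypothetical organization of 1024 lines as 256 sets of 4 ways (the notes don't fix these numbers): the index selects one set, and all ways of that set have their tags compared on each access.

```python
NUM_SETS, NUM_WAYS = 256, 4                          # assumed 4-way organization

def sa_lookup(sets, addr):
    """Return True on a hit in a set-associative cache."""
    index = addr % NUM_SETS                          # which set to search
    tag = addr // NUM_SETS                           # remaining high-order bits
    return any(way == tag for way in sets[index])    # all ways checked "in parallel"

sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
# Addresses 1024 and 2048 now map to the same set but can occupy
# different ways, so they no longer evict each other.
```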
Associativity implies choices
When a conflict miss occurs, which of the possible cache locations should store the new data fetched from main memory? This question leads to replacement policies, whose goal is to minimize the impact on the future hit ratio.
Idea: If a block has not been recently accessed, it’s less likely to be accessed in the near future. (Principle of locality)
LRU replacement policy: replace the block that was accessed furthest in the past.
Apart from the random and LRU policies, other strategies will occasionally cause a particular program to execute much more slowly than expected.
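A minimal sketch of LRU within a single set, using per-way age counters (one common implementation; the notes don't prescribe the bookkeeping):

```python
def lru_access(set_tags, set_ages, tag):
    """Return True on a hit; on a miss, replace the least-recently-used way."""
    if tag in set_tags:
        way, hit = set_tags.index(tag), True
    else:
        way, hit = set_ages.index(max(set_ages)), False   # victim = oldest way
        set_tags[way] = tag
    for w in range(len(set_ages)):
        set_ages[w] += 1             # every way ages by one access
    set_ages[way] = 0                # the accessed way becomes the youngest
    return hit

tags, ages = [None, None], [0, 0]    # one 2-way set
print([lru_access(tags, ages, t) for t in (1, 2, 1, 3, 2)])
# [False, False, True, False, False]: tag 2 was the LRU block when 3 arrived
```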
Write policy
How should the cache handle CPU memory writes, i.e., when should main memory be updated with the new data?
- Write-through: whenever the CPU sends a write request to the cache, the cache performs the same write to main memory.
    - Pro: memory locations always hold the up-to-date value.
    - Con: DRAM writes are slow, so they can become a performance bottleneck.
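A minimal sketch of the write-through rule (the dictionaries standing in for the cache and DRAM are my own simplification):

```python
def store_write_through(addr, value, cache, dram):
    cache[addr] = value     # the cached copy is updated...
    dram[addr] = value      # ...and main memory is updated on every store
```

Every store pays the DRAM write latency, which is the bottleneck noted above.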
What if the program is constantly writing a particular memory location, e.g., updating the value of a local variable in the current stack frame?
- Write-behind: let the CPU continue execution while the cache waits for the write to main memory to complete.
    - Con: if there's another cache miss while the write is still pending, everything has to wait at that point until both the pending write and the subsequent refill read finish, since the CPU can't proceed until the cache miss is resolved.
- Write-back (the best write policy): the contents of the cache are updated and the CPU continues execution immediately.
    - The updated cache value is only written to main memory when the cache line is chosen as the replacement line for a later cache miss.
    - Write-back minimizes the number of accesses to main memory, preserving memory bandwidth for other operations.
Implementation:
- When replacing a cache line, do we always need to write its contents back to main memory?
We can avoid unnecessary write-backs by adding another state bit to each cache line: the dirty bit. The dirty bit is set to 0 when a cache line is filled during a cache miss and set to 1 when the CPU writes to the line; on replacement, only dirty lines need to be written back to main memory.
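A minimal sketch of the dirty-bit bookkeeping in Python (field and function names are mine; the notes specify only the bit itself):

```python
class Line:
    def __init__(self):
        self.valid, self.dirty = False, False
        self.tag, self.data = None, None

def cpu_write(line, value):
    """CPU store hit: update only the cached copy and mark it dirty."""
    line.data = value
    line.dirty = True                        # main memory is now stale

def refill(line, new_tag, memory):
    """Cache miss: write the old contents back only if the line is dirty."""
    if line.valid and line.dirty:
        memory[line.tag] = line.data         # the single deferred write-back
    line.tag, line.data = new_tag, memory.get(new_tag)
    line.valid, line.dirty = True, False     # a freshly filled line starts clean
```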