Limoncello
1. Notes
- Hardware prefetchers are too coarse-grained
- They reduce cache misses
- But they increase memory latency when bandwidth utilization is high
1.1. Background
- Trend is that bandwidth is stagnant
- Hardware prefetchers waste bandwidth
- The paper obtains its experimental results by turning a set of hardware prefetchers off
- Software prefetching is more fine-grained: it wastes less bandwidth on wrong fetches and adds less latency from queueing delays
1.2. Motivation
- CPU becomes throttled while waiting for memory
- OS avoids scheduling extra processes to avoid performance cliff
- Hardware prefetching wastes 10% of bandwidth
- Prefetchers are tuned to reduce cache miss rates, but greatly increase bandwidth utilization
1.3. Solution
1.3.1. Hard Limoncello
- Turn hardware prefetchers off when memory bandwidth usage is high; turn them back on when it is low
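
The on/off rule above can be sketched as a two-threshold (hysteresis) controller. This is only an illustration: the type and function names, the threshold values, and the idea of feeding it utilization as a fraction are assumptions, not details from the notes. How the prefetchers are actually disabled is platform-specific and left out.

```c
#include <stdbool.h>

/* Hypothetical controller state; names and thresholds are illustrative. */
typedef struct {
    double high_threshold;  /* disable prefetchers at or above this utilization */
    double low_threshold;   /* re-enable prefetchers at or below this utilization */
    bool prefetchers_on;    /* current state of the hardware prefetchers */
} hard_ctl;

/* Feed one bandwidth-utilization sample (0.0 to 1.0) and return the new state.
   Between the two thresholds the state is left unchanged, which avoids
   flapping when utilization hovers around a single cut-off. */
bool hard_ctl_update(hard_ctl *c, double utilization) {
    if (c->prefetchers_on && utilization >= c->high_threshold) {
        c->prefetchers_on = false;
    } else if (!c->prefetchers_on && utilization <= c->low_threshold) {
        c->prefetchers_on = true;
    }
    return c->prefetchers_on;
}
```

Using two thresholds rather than one is a design choice of this sketch; a caller would sample utilization periodically and apply the returned state to the machine.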
1.3.2. Soft Limoncello
- Inserts software prefetches
- Can profile at the function granularity
- Hardware prefetch-friendly = targets for software prefetching
- Predictable / continuous access
- Can be more precise since we know how the data will be accessed
- Design parameters
- Prefetch address: which address to prefetch
- Prefetch distance: how far in advance data will be prefetched
- Longer = less accurate, but more timely
- Prefetch degree: how much data to prefetch at once (granularity)
- Determining where to add prefetches
- Prioritize large data accesses
- Look at metrics at the instruction level
- Use microbenchmarks and sweep for best parameters
- e.g. memcpy: choose large data accesses and sweep for the optimal prefetch distance/degree
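
A minimal sketch of how the three design parameters and the sweep might look in a memcpy-style loop. It uses `__builtin_prefetch`, a GCC/Clang builtin. The function names, the 64-byte cache line, and the candidate distances are assumptions for illustration, not the paper's implementation.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/* Copy n bytes. Before each batch, prefetch `degree` cache lines starting
   `distance` bytes ahead of the current read position. */
void copy_with_prefetch(char *dst, const char *src, size_t n,
                        size_t distance, size_t degree) {
    if (degree == 0) {
        degree = 1;  /* avoid a zero-sized step */
    }
    size_t step = degree * CACHE_LINE;  /* bytes copied per prefetch batch */
    for (size_t off = 0; off < n; off += step) {
        for (size_t d = 0; d < degree; d++) {
            /* Prefetch address = read position + distance + line offset */
            size_t ahead = off + distance + d * CACHE_LINE;
            if (ahead < n) {
                __builtin_prefetch(src + ahead, 0 /* read */, 0 /* no reuse */);
            }
        }
        size_t chunk = (n - off < step) ? n - off : step;
        memcpy(dst + off, src + off, chunk);
    }
}

/* Time one copy per candidate distance and return the fastest candidate.
   Requires count >= 1. A single clock() sample is noisy; a real sweep
   would repeat each measurement and use large buffers. */
size_t sweep_distance(char *dst, const char *src, size_t n, size_t degree,
                      const size_t *candidates, size_t count) {
    size_t best = candidates[0];
    clock_t best_time = 0;
    for (size_t i = 0; i < count; i++) {
        clock_t start = clock();
        copy_with_prefetch(dst, src, n, candidates[i], degree);
        clock_t elapsed = clock() - start;
        if (i == 0 || elapsed < best_time) {
            best_time = elapsed;
            best = candidates[i];
        }
    }
    return best;
}
```

For example, calling `sweep_distance` with candidates such as 64, 256, and 1024 bytes on a buffer much larger than the last-level cache would give a first estimate of a good distance; the same loop could be repeated over degrees.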
1.4. Results
- Throughput increased on CPUs with high utilization
- Due to less wasted memory bandwidth
- Lowers latency to memory
- More CPU utilization
- More precise prefetching
- No need for warmup period
1.5. Related work
1.5.1. Throttling
- Turns prefetching off for everything
- Coarse grained
- Hardware can be inaccurate
1.5.2. Software prefetching
- Compiler doesn’t have dynamic execution information
- Profiling: uses fleetwide profiling tools
1.5.3. HW-SW interaction
- Most require hardware modifications
Modified 2025-03-16 Sun 21:33