Limoncello

1. Notes

  • Hardware prefetchers are too coarse-grained
    • They reduce cache misses
    • But they raise memory latency when bandwidth utilization is high

1.1. Background

  • Trend: memory bandwidth is stagnant
  • Hardware prefetchers waste bandwidth
  • The paper gets its experimental results by turning a set of hardware prefetchers off
  • Software prefetching is finer-grained: it wastes less bandwidth on wrong fetches and adds less latency from queueing delays

1.2. Motivation

  • CPU becomes throttled while waiting for memory
  • OS holds back from scheduling extra processes to avoid a performance cliff
  • Hardware prefetching wastes bandwidth, on the order of 10%
  • Hardware prefetching is tuned to reduce cache miss rates, but it greatly increases bandwidth utilization

1.3. Solution

1.3.1. Hard Limoncello

  • Turn hardware prefetchers off when bandwidth usage is high, and back on when it is low
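
The on/off rule above can be sketched as a small controller with hysteresis. This is a minimal illustration, not the paper's implementation: the `hard_ctl` type, the function names, and the two-threshold scheme are assumptions made here, and the thresholds passed in are arbitrary. Actually toggling prefetchers on real hardware is platform-specific and is not shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical controller state; names and thresholds are illustrative. */
typedef struct {
    double lower;    /* re-enable prefetchers at or below this utilization */
    double upper;    /* disable prefetchers at or above this utilization */
    bool   enabled;  /* current hardware prefetcher state */
} hard_ctl;

hard_ctl hard_ctl_init(double lower, double upper) {
    /* Utilization is expressed as a fraction of peak bandwidth. */
    assert(0.0 <= lower && lower < upper && upper <= 1.0);
    hard_ctl c = { .lower = lower, .upper = upper, .enabled = true };
    return c;
}

/* Feed one bandwidth-utilization sample (0.0 to 1.0); returns the new state. */
bool hard_ctl_update(hard_ctl *c, double bw_util) {
    if (c->enabled && bw_util >= c->upper) {
        c->enabled = false;  /* bandwidth is scarce: stop speculative fetches */
    } else if (!c->enabled && bw_util <= c->lower) {
        c->enabled = true;   /* bandwidth is plentiful again */
    }
    /* Between the two thresholds the previous state is kept. */
    return c->enabled;
}
```

Using two thresholds instead of one keeps the controller from flapping when utilization hovers near a single cut-off.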

1.3.2. Soft Limoncello

  • Inserts software prefetches
  • Can profile at the function granularity
  • Code that is hardware-prefetch-friendly is a good target for software prefetching
    • Predictable / contiguous access
    • Software can be more precise since we know how the data will be accessed
  1. Design parameters
    • Prefetch address: which address to prefetch
    • Prefetch distance: how far ahead of its use the data is prefetched
      • Longer = more timely, but less accurate
    • Prefetch degree: how much data to prefetch at a time, i.e. how fine-grained it is
  2. Determining where to add prefetches
    • Prioritize large data accesses
    • Look at metrics at the instruction level
    • Use microbenchmarks and sweep for the best parameters
    • e.g. for memcpy, pick the large copies and sweep for the optimal prefetch distance/degree
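
The design parameters and the sweep can be sketched with a toy kernel. This is an illustration under stated assumptions, not code from the paper: `sum_with_prefetch` and `best_distance` are hypothetical names, a 64-byte cache line is assumed, and `__builtin_prefetch` is the GCC/Clang builtin (other compilers need a different intrinsic). Distance and degree are counted in cache lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define CACHE_LINE 64  /* assumed cache-line size in bytes */

/*
 * Sum an array while prefetching ahead of the read position.
 *   distance: how many cache lines ahead of the current one to prefetch
 *   degree:   how many consecutive lines to request each time
 */
uint64_t sum_with_prefetch(const uint64_t *data, size_t n,
                           size_t distance, size_t degree) {
    const size_t per_line = CACHE_LINE / sizeof(uint64_t);
    uint64_t total = 0;
    assert(degree >= 1);
    for (size_t i = 0; i < n; i++) {
        /* Issue one batch of prefetches every `degree` lines. */
        if (i % (per_line * degree) == 0) {
            for (size_t d = 0; d < degree; d++) {
                size_t target = i + (distance + d) * per_line;
                if (target < n)  /* stay inside the array */
                    __builtin_prefetch(&data[target], 0, 3);
            }
        }
        total += data[i];
    }
    return total;
}

/* Time each candidate distance once and return the fastest one.
 * A single clock() measurement is noisy; a real sweep would repeat runs. */
size_t best_distance(const uint64_t *data, size_t n,
                     const size_t *cands, size_t k, size_t degree) {
    assert(k >= 1);
    size_t best = cands[0];
    double best_t = -1.0;
    volatile uint64_t sink = 0;  /* keeps the compiler from dropping the work */
    for (size_t j = 0; j < k; j++) {
        clock_t t0 = clock();
        sink = sum_with_prefetch(data, n, cands[j], degree);
        double t = (double)(clock() - t0);
        if (best_t < 0.0 || t < best_t) {
            best_t = t;
            best = cands[j];
        }
    }
    (void)sink;
    return best;
}
```

On a small array that fits in cache the sweep result is essentially noise; the parameters only matter for accesses large enough to miss in cache, which is why large data accesses are the ones to prioritize.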

1.4. Results

  • On CPUs with high utilization, throughput increased
    • Due to less wasted memory bandwidth
    • Lower memory latency
  • Higher CPU utilization
  • More precise prefetching
  • No need for warmup period

1.5. Related work

1.5.1. Throttling

  • Turns prefetching off for everything
  • Coarse-grained
  • Hardware can be inaccurate

1.5.2. Software prefetching

  • Compiler doesn’t have dynamic execution information
  • Profiling: uses fleetwide profiling tools

1.5.3. HW-SW interaction

  • Most require hardware modifications
Modified 2025-03-16 Sun 21:33