TCMalloc


1. Notes

1.1. Organization

  • Per-CPU caches
    • Store header information and lists of pointers to free objects, one list per fixed size class
  • Transfer cache sits between the per-CPU caches and the central free list
    • Moves batches of these object pointers between caches
  • Central free list
    • Stores spans (contiguous, aligned regions of pages) on linked lists
    • Extracts objects from spans to satisfy requests
    • All objects carved from a span share the same size class
  • Pageheap
    • Manages page-aligned blocks of memory
    • Carves spans (aligned, contiguous memory) out of hugepages
  • These layers are sketched in the code below
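
A minimal C++ sketch of how these layers could fit together. The type names, fields, and constants below are illustrative assumptions, not TCMalloc's actual classes.

  #include <cstddef>
  #include <cstdint>
  #include <vector>

  // Hypothetical, simplified view of the layers described above; names and
  // layouts are assumptions for illustration only.
  constexpr int kNumSizeClasses = 80;  // assumed number of size classes

  // Per-CPU cache: per size class, a header plus a bounded run of pointers
  // to free objects of that size.
  struct PerCpuCache {
    struct Header {
      std::uint16_t begin, current, end;  // slot bounds for one size class
    };
    Header headers[kNumSizeClasses];
    void* slots[kNumSizeClasses * 64];    // pointers to free memory objects
  };

  // Span: a contiguous, aligned run of pages; every object carved out of a
  // span belongs to the same size class.
  struct Span {
    void* start;
    std::size_t num_pages;
    int size_class;
    Span* next;  // spans are kept on linked lists in the central free list
  };

  // Central free list: linked lists of spans, one list per size class;
  // objects are extracted from spans to satisfy refill requests.
  struct CentralFreeList {
    Span* spans[kNumSizeClasses];
  };

  // Transfer cache: sits between the per-CPU caches and the central free
  // list and moves batches of object pointers between them.
  struct TransferCache {
    std::vector<void*> batches[kNumSizeClasses];
  };

  // Pageheap: manages page-aligned blocks and carves spans out of hugepages.
  struct PageHeap {
    Span* NewSpan(std::size_t num_pages);  // assumed interface
  };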

1.2. Characterization Methodology

1.3. Characterization

  • Measures latency, CPU cycles, and fragmentation (mostly in the central free list and pageheap)
  • Allocated object size
    • Small objects don’t take a lot of space, but account for most of the objects allocated
    • Optimize for small objects
  • Smaller objects tend to have shorter lifetimes

1.4. Improvements on design

1.4.1. Per-CPU cache

  • With virtual CPU (vCPU) IDs, caches are allocated only for the CPUs that actually need them
  • Noticed that caches with higher vCPU IDs had fewer misses
    • Because they are not used often
  • Dynamically stretch the size of a CPU cache by stealing capacity (see the sketch below)
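
A hedged sketch of the capacity-stealing idea: a cache that keeps missing grows its limit by shrinking the limit of the cache with the fewest recent misses. The victim-selection policy, names, and constants are assumptions, not TCMalloc's actual logic.

  #include <array>
  #include <cstddef>

  // Illustrative per-CPU cache bookkeeping: each cache has a soft capacity
  // that can be grown by taking capacity away from a less busy cache.
  struct CpuCacheStats {
    std::size_t capacity = 0;  // current soft limit on cached bytes
    std::size_t misses = 0;    // recent miss count (proxy for how busy it is)
  };

  constexpr std::size_t kNumCpus = 256;
  constexpr std::size_t kStealStep = 4096;    // assumed bytes moved per steal
  constexpr std::size_t kMinCapacity = 8192;  // assumed floor left to victims

  // Grow the needy cache's capacity by shrinking the cache with the fewest
  // misses, i.e. the one that is used least.
  void StealCapacity(std::array<CpuCacheStats, kNumCpus>& caches,
                     std::size_t needy) {
    std::size_t victim = needy;
    for (std::size_t cpu = 0; cpu < caches.size(); ++cpu) {
      if (cpu == needy) continue;
      if (victim == needy || caches[cpu].misses < caches[victim].misses) {
        victim = cpu;
      }
    }
    if (victim == needy || caches[victim].capacity <= kMinCapacity) return;
    std::size_t steal = caches[victim].capacity - kMinCapacity;
    if (steal > kStealStep) steal = kStealStep;
    caches[victim].capacity -= steal;
    caches[needy].capacity += steal;
  }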

1.4.2. Transfer Cache

  • Data transfer is slow between CPUs on different chips
    • The CPU caches involved sit in different physical locations
  • As a result, the transfer cache is slow when it moves objects across chips
  • Solution: NUMA-aware transfer caches that only transfer within the same cache domain (see the sketch below)
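
A small sketch of that idea: one transfer-cache shard per NUMA domain, so batches of free objects never move across sockets. The shard layout and interface are assumptions.

  #include <cstddef>
  #include <vector>

  // Illustrative NUMA-partitioned transfer cache: per-CPU caches only push
  // to and pull from the shard of their own NUMA domain.
  struct TransferCacheShard {
    std::vector<void*> objects;  // batch of free objects for one size class
  };

  class NumaAwareTransferCache {
   public:
    explicit NumaAwareTransferCache(std::size_t num_domains)
        : shards_(num_domains) {}

    void Push(std::size_t numa_domain, void* object) {
      shards_[numa_domain].objects.push_back(object);
    }

    // Returns nullptr when the local shard is empty; the caller would then
    // fall back to the central free list instead of raiding remote shards.
    void* Pop(std::size_t numa_domain) {
      auto& batch = shards_[numa_domain].objects;
      if (batch.empty()) return nullptr;
      void* object = batch.back();
      batch.pop_back();
      return object;
    }

   private:
    std::vector<TransferCacheShard> shards_;
  };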

1.4.3. Central free list

  • Tries to allocate from spans that are likely to stay live for a long time
  • Distinguishes spans by how many live allocations they contain
  • Prefers allocating objects from spans that already have more live allocations (see the sketch below)
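
A sketch of that prioritization: spans are bucketed by how many live objects they hold, and allocation picks from the densest non-empty bucket so sparsely used spans can drain and be returned. The bucketing scheme is an assumption.

  #include <cstddef>
  #include <list>
  #include <vector>

  // Illustrative central free list for a single size class.
  struct Span {
    std::size_t live_objects;      // objects currently allocated from the span
    std::size_t objects_per_span;  // total objects the span can hold
    // free-object list omitted
  };

  class SizeClassFreeList {
   public:
    explicit SizeClassFreeList(std::size_t num_buckets)
        : buckets_(num_buckets) {}

    void Add(Span* span) { buckets_[BucketFor(span)].push_back(span); }

    // Prefer spans with more live allocations: scan buckets from densest to
    // sparsest and take the first span found.
    Span* PickSpanForAllocation() {
      for (std::size_t b = buckets_.size(); b-- > 0;) {
        if (!buckets_[b].empty()) return buckets_[b].front();
      }
      return nullptr;  // caller would request a fresh span from the pageheap
    }

   private:
    // Map live-allocation density to a bucket index (assumed linear scheme;
    // assumes objects_per_span > 0).
    std::size_t BucketFor(const Span* span) const {
      return span->live_objects * (buckets_.size() - 1) / span->objects_per_span;
    }

    std::vector<std::list<Span*>> buckets_;
  };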

1.4.4. Pageheap

  • Three different ways to allocate hugepages, depending on the size of the allocation request
  • The page filler (used for requests smaller than a hugepage) accounts for most of the fragmentation
  • Idea: place spans with similar lifetimes on the same hugepage
    • Noticed that spans that can hold more objects tend to live longer
    • So place spans that hold a similar number of objects together (see the sketch below)
    • The grouping sizes are statically determined
    • Reduces TLB misses because memory is more densely packed (hugepages full of small, high-count objects are accessed frequently, so the TLB keeps hitting the same hugepage)
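
A sketch of that grouping: a span is routed to a pool of hugepages based on how many objects it holds, so spans with similar object counts (and therefore similar expected lifetimes) share hugepages. The thresholds are made up, standing in for the statically determined sizes mentioned above.

  #include <cstddef>
  #include <vector>

  // Illustrative hugepage filler: partially filled hugepages are kept in
  // separate pools per object-count group.
  struct HugePage { /* bookkeeping omitted */ };

  constexpr std::size_t kObjectCountThresholds[] = {16, 256, 4096};  // assumed
  constexpr std::size_t kNumGroups = 4;

  // Sparse spans (few objects) go to low groups, dense spans to high groups.
  std::size_t GroupFor(std::size_t objects_in_span) {
    std::size_t group = 0;
    for (std::size_t threshold : kObjectCountThresholds) {
      if (objects_in_span >= threshold) ++group;
    }
    return group;
  }

  struct HugePageFiller {
    std::vector<HugePage*> groups[kNumGroups];

    // A new span is placed on a hugepage from the pool of its own group.
    std::vector<HugePage*>& PoolFor(std::size_t objects_in_span) {
      return groups[GroupFor(objects_in_span)];
    }
  };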

1.5. Discussion

  • Optimize for common datacenter libraries
  • Metrics from applications are useful, e.g. cache hit rate
  • NUMA awareness helps
  • Uses kernel features
    • Restartable sequences
    • Virtual CPU IDs
    • Hugepages (see the madvise example below)
  • Lifetime aware allocation helps a lot
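
As a concrete example of one of those kernel features, the snippet below asks Linux to back an anonymous mapping with transparent hugepages via madvise(MADV_HUGEPAGE). It is a generic Linux example, not code taken from TCMalloc.

  #include <sys/mman.h>

  #include <cstddef>
  #include <cstdio>

  int main() {
    constexpr std::size_t kSize = 4 << 20;  // 4 MiB, enough for 2 MiB hugepages
    void* region = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) {
      std::perror("mmap");
      return 1;
    }
    // Hint that this range should be backed by transparent hugepages; the
    // kernel may ignore it if THP is disabled.
    if (madvise(region, kSize, MADV_HUGEPAGE) != 0) {
      std::perror("madvise(MADV_HUGEPAGE)");
    }
    munmap(region, kSize);
    return 0;
  }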