TCMalloc
1. Notes
1.1. Organization
- Per-CPU caches
- Stores a header plus per-size-class lists of pointers to free objects
- Transfer cache (sits in the middle)
- Moves batches of these object pointers between the per-CPU caches and the central free list
- Central free list
- Stores spans (contiguous, aligned regions) in linked lists
- Extracts objects from spans to satisfy requests
- All objects within a span belong to the same size class
- Pageheap
- Manages hugepage-aligned memory blocks
- Carves spans (aligned, contiguous memory) out of hugepages
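A rough Python sketch of this three-tier layout (all class names, sizes, and the batch policy here are made up for illustration, not TCMalloc's real API): a per-CPU cache of free-object pointers per size class, refilled in batches from a central free list that carves objects out of spans.

```python
SPAN_BYTES = 8192  # assumed span size for this toy model

class Span:
    """A contiguous region holding objects of a single size class."""
    def __init__(self, size_class):
        self.size_class = size_class
        # Fake "addresses" of the objects carved out of this span.
        self.objects = list(range(0, SPAN_BYTES, size_class))

class CentralFreeList:
    """Per-size-class list of spans; hands out batches of objects."""
    def __init__(self, size_class):
        self.size_class = size_class
        self.spans = []

    def fetch_batch(self, n):
        batch = []
        while len(batch) < n:
            if not self.spans or not self.spans[-1].objects:
                # Stand-in for asking the pageheap for a new span.
                self.spans.append(Span(self.size_class))
            batch.append(self.spans[-1].objects.pop())
        return batch

class PerCpuCache:
    """Free-object pointers, one list per size class."""
    def __init__(self, central, batch=4):
        self.freelists = {}
        self.central = central
        self.batch = batch

    def alloc(self, size_class):
        fl = self.freelists.setdefault(size_class, [])
        if not fl:  # miss: refill a batch from the central free list
            fl.extend(self.central[size_class].fetch_batch(self.batch))
        return fl.pop()

central = {64: CentralFreeList(64)}
cache = PerCpuCache(central)
p = cache.alloc(64)  # first alloc misses, pulls a batch, returns one object
```

The real allocator additionally keeps the transfer cache between these two tiers to shuttle batches across CPUs; this sketch collapses that path into a direct refill.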
1.2. Characterization Methodology
1.3. Characterization
- Measures latency, CPU cycles, and fragmentation (mostly in the central free list and pageheap)
- Allocated object size
- Small objects don't take much space in total, but account for most of the objects allocated
- So optimize for small objects
- Smaller objects also tend to have shorter lifetimes
1.4. Improvements on design
1.4.1. PerCPU cache
- Using virtual CPU IDs, allocate caches only for the CPUs the job actually runs on
- Noticed that caches with higher vCPU IDs missed less often
- Dynamically stretch individual CPU cache sizes (by stealing capacity from other caches)
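A minimal sketch of the capacity-stealing idea (the policy and names are assumed for illustration, not the real implementation): under a fixed total budget, a cache that keeps missing grows by taking capacity from another cache.

```python
class CpuCacheSizer:
    """Tracks per-CPU cache capacity under a fixed total budget."""
    def __init__(self, n_cpus, per_cpu_capacity):
        self.capacity = [per_cpu_capacity] * n_cpus

    def steal(self, needy, victim, amount):
        """Move up to `amount` capacity from victim to needy.

        Returns how much was actually granted; the total across all
        CPUs is conserved.
        """
        grant = min(amount, self.capacity[victim])
        self.capacity[victim] -= grant
        self.capacity[needy] += grant
        return grant

sizer = CpuCacheSizer(n_cpus=4, per_cpu_capacity=100)
granted = sizer.steal(needy=0, victim=3, amount=30)
```

A real policy would pick the victim based on miss/usage statistics; this sketch only shows the conserved-budget transfer.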
1.4.2. Transfer Cache
- Data transfer is slow between CPUs on different chips
- Because their CPU caches live in different cache domains
- This makes the transfer cache slow when moving data across chips
- Solution: NUCA-aware transfer caches that only transfer within the same cache domain
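A sketch of the domain-restricted transfer cache (class and method names are illustrative): batches of free objects only move between CPUs that share a cache domain, so cross-chip transfers never happen through it.

```python
class DomainTransferCache:
    """Transfer cache partitioned by cache domain."""
    def __init__(self, cpu_to_domain):
        self.cpu_to_domain = cpu_to_domain
        # Per-domain stock of object-pointer batches.
        self.batches = {d: [] for d in set(cpu_to_domain.values())}

    def insert(self, cpu, batch):
        """Deposit a batch of freed objects into this CPU's domain."""
        self.batches[self.cpu_to_domain[cpu]].append(batch)

    def remove(self, cpu):
        """Return a batch only if one exists in this CPU's own domain."""
        stock = self.batches[self.cpu_to_domain[cpu]]
        return stock.pop() if stock else None

# CPUs 0-1 share domain 0; CPUs 2-3 share domain 1.
tc = DomainTransferCache({0: 0, 1: 0, 2: 1, 3: 1})
tc.insert(0, ["obj_a", "obj_b"])
same_domain = tc.remove(1)   # CPU 1 shares domain 0 with CPU 0
cross_domain = tc.remove(2)  # CPU 2 is in domain 1: nothing available
```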
1.4.3. Central free list
- Tries to prioritize allocating from spans that are likely to live long
- Differentiates spans by their number of live allocations
- Preferentially allocates objects from spans with more live allocations
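A sketch of that selection policy (the exact bucketing is assumed; this just picks the fullest span): allocating from the span with the most live objects lets sparsely used spans drain, so they can eventually be freed back whole.

```python
class SpanInfo:
    """Bookkeeping for one span: live objects and free slots."""
    def __init__(self, name, live, free):
        self.name, self.live, self.free = name, live, free

def pick_span(spans):
    """Fullest span (most live objects) that still has a free slot."""
    candidates = [s for s in spans if s.free > 0]
    return max(candidates, key=lambda s: s.live) if candidates else None

spans = [
    SpanInfo("a", live=2, free=30),   # nearly empty: let it drain
    SpanInfo("b", live=28, free=4),   # nearly full: allocate here
    SpanInfo("c", live=31, free=0),   # completely full: not a candidate
]
chosen = pick_span(spans)
```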
1.4.4. Pageheap
- Three different ways to allocate hugepages (depending on the size of the allocation request)
- The page filler (for allocation requests smaller than a hugepage) accounts for most of the fragmentation
- Idea: Place similar lifetime spans on the same hugepage
- Noticed that spans that can store more objects tend to live longer
- So place spans that store a similar number of objects together
- Statically determined size
- Reduces TLB misses because objects are packed more densely (small objects with high counts are accessed most frequently, so the TLB entry for that hugepage stays hot)
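A sketch of the packing idea (the bucket thresholds here are made up; the notes only say they are statically determined): spans are grouped onto hugepages by how many objects they hold, using object count as a proxy for lifetime.

```python
import bisect

# Assumed static object-count thresholds separating the buckets.
BUCKET_BOUNDS = [16, 64, 256]

def bucket_for(span_object_count):
    """Index of the hugepage bucket this span belongs to."""
    return bisect.bisect_right(BUCKET_BOUNDS, span_object_count)

# Spans with similar object counts land in the same bucket, so spans
# with similar expected lifetimes share hugepages.
placements = {}
for count in [4, 50, 50, 300]:
    placements.setdefault(bucket_for(count), []).append(count)
```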
1.5. Discussion
- Optimize for common datacenter libraries
- Metrics from applications are useful, e.g. cache hit rate
- NUMA awareness will help
- Uses kernel features
- Restartable sequences
- Virtual CPU ids
- Hugepages
- Lifetime aware allocation helps a lot