TCMalloc
1. Notes
1.1. Organization
- Per-CPU caches
- Stores a header plus per-size-class lists of pointers to free objects
- Transfer cache (sits in the middle)
- Moves batches of these object pointers between the per-CPU caches and the central free list
- Central free list
- Stores spans (contiguous, aligned regions) in linked lists
- Extracts objects from spans to satisfy requests
- All objects within a span belong to the same size class
- Pageheap
- Manages hugepage-aligned memory blocks
- Carves spans (aligned, contiguous memory) out of hugepages
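A rough Python sketch of this three-tier layout (all class names, sizes, and the batch policy here are made up for illustration, not TCMalloc's real API): a per-CPU cache of free-object pointers per size class, refilled in batches from a central free list that carves objects out of spans.

```python
SPAN_BYTES = 8192  # assumed span size for this toy model

class Span:
    """A contiguous region holding objects of a single size class."""
    def __init__(self, size_class):
        self.size_class = size_class
        # Fake "addresses" of the objects carved out of this span.
        self.objects = list(range(0, SPAN_BYTES, size_class))

class CentralFreeList:
    """Per-size-class list of spans; hands out batches of objects."""
    def __init__(self, size_class):
        self.size_class = size_class
        self.spans = []

    def fetch_batch(self, n):
        batch = []
        while len(batch) < n:
            if not self.spans or not self.spans[-1].objects:
                # Stand-in for asking the pageheap for a new span.
                self.spans.append(Span(self.size_class))
            batch.append(self.spans[-1].objects.pop())
        return batch

class PerCpuCache:
    """Free-object pointers, one list per size class."""
    def __init__(self, central, batch=4):
        self.freelists = {}
        self.central = central
        self.batch = batch

    def alloc(self, size_class):
        fl = self.freelists.setdefault(size_class, [])
        if not fl:  # miss: refill a batch from the central free list
            fl.extend(self.central[size_class].fetch_batch(self.batch))
        return fl.pop()

central = {64: CentralFreeList(64)}
cache = PerCpuCache(central)
p = cache.alloc(64)  # first alloc misses, pulls a batch, returns one object
```

The real allocator additionally keeps the transfer cache between these two tiers to shuttle batches across CPUs; this sketch collapses that path into a direct refill.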
1.2. Characterization Methodology
1.3. Characterization
- Measures latency, CPU cycles, and fragmentation (mostly in the central free list and pageheap)
- Allocated object size
- Small objects don't take much space in total, but account for most of the objects allocated
- So optimize for small objects
- Smaller objects also tend to have shorter lifetimes
1.4. Improvements on design
1.4.1. PerCPU cache
- Using virtual CPU IDs, allocate caches only for the CPUs the job actually runs on
- Noticed that caches with higher vCPU IDs missed less often
- Dynamically stretch individual CPU cache sizes (by stealing capacity from other caches)
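A minimal sketch of the capacity-stealing idea (the policy and names are assumed for illustration, not the real implementation): under a fixed total budget, a cache that keeps missing grows by taking capacity from another cache.

```python
class CpuCacheSizer:
    """Tracks per-CPU cache capacity under a fixed total budget."""
    def __init__(self, n_cpus, per_cpu_capacity):
        self.capacity = [per_cpu_capacity] * n_cpus

    def steal(self, needy, victim, amount):
        """Move up to `amount` capacity from victim to needy.

        Returns how much was actually granted; the total across all
        CPUs is conserved.
        """
        grant = min(amount, self.capacity[victim])
        self.capacity[victim] -= grant
        self.capacity[needy] += grant
        return grant

sizer = CpuCacheSizer(n_cpus=4, per_cpu_capacity=100)
granted = sizer.steal(needy=0, victim=3, amount=30)
```

A real policy would pick the victim based on miss/usage statistics; this sketch only shows the conserved-budget transfer.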
1.4.2. Transfer Cache
- Data transfer is slow between CPUs on different chips
- Because their CPU caches live in different cache domains
- This makes the transfer cache slow when moving data across chips
- Solution: NUCA-aware transfer caches that only transfer within the same cache domain
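A sketch of the domain-restricted transfer cache (class and method names are illustrative): batches of free objects only move between CPUs that share a cache domain, so cross-chip transfers never happen through it.

```python
class DomainTransferCache:
    """Transfer cache partitioned by cache domain."""
    def __init__(self, cpu_to_domain):
        self.cpu_to_domain = cpu_to_domain
        # Per-domain stock of object-pointer batches.
        self.batches = {d: [] for d in set(cpu_to_domain.values())}

    def insert(self, cpu, batch):
        """Deposit a batch of freed objects into this CPU's domain."""
        self.batches[self.cpu_to_domain[cpu]].append(batch)

    def remove(self, cpu):
        """Return a batch only if one exists in this CPU's own domain."""
        stock = self.batches[self.cpu_to_domain[cpu]]
        return stock.pop() if stock else None

# CPUs 0-1 share domain 0; CPUs 2-3 share domain 1.
tc = DomainTransferCache({0: 0, 1: 0, 2: 1, 3: 1})
tc.insert(0, ["obj_a", "obj_b"])
same_domain = tc.remove(1)   # CPU 1 shares domain 0 with CPU 0
cross_domain = tc.remove(2)  # CPU 2 is in domain 1: nothing available
```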
1.4.3. Central free list
- Tries to prioritize allocating from spans that are likely to live long
- Differentiates spans by their number of live allocations
- Preferentially allocates objects from spans with more live allocations
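A sketch of that selection policy (the exact bucketing is assumed; this just picks the fullest span): allocating from the span with the most live objects lets sparsely used spans drain, so they can eventually be freed back whole.

```python
class SpanInfo:
    """Bookkeeping for one span: live objects and free slots."""
    def __init__(self, name, live, free):
        self.name, self.live, self.free = name, live, free

def pick_span(spans):
    """Fullest span (most live objects) that still has a free slot."""
    candidates = [s for s in spans if s.free > 0]
    return max(candidates, key=lambda s: s.live) if candidates else None

spans = [
    SpanInfo("a", live=2, free=30),   # nearly empty: let it drain
    SpanInfo("b", live=28, free=4),   # nearly full: allocate here
    SpanInfo("c", live=31, free=0),   # completely full: not a candidate
]
chosen = pick_span(spans)
```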
1.4.4. Pageheap
- Three different ways to allocate hugepages (depending on the size of the allocation request)
- The page filler (for allocation requests smaller than a hugepage) accounts for most of the fragmentation
- Idea: Place similar lifetime spans on the same hugepage
- Noticed that spans that can store more objects tend to live longer
- So place spans that store a similar number of objects together
- Statically determined size
- Reduces TLB misses because objects are packed more densely (small objects with high counts are accessed most frequently, so the TLB entry for that hugepage stays hot)
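A sketch of the packing idea (the bucket thresholds here are made up; the notes only say they are statically determined): spans are grouped onto hugepages by how many objects they hold, using object count as a proxy for lifetime.

```python
import bisect

# Assumed static object-count thresholds separating the buckets.
BUCKET_BOUNDS = [16, 64, 256]

def bucket_for(span_object_count):
    """Index of the hugepage bucket this span belongs to."""
    return bisect.bisect_right(BUCKET_BOUNDS, span_object_count)

# Spans with similar object counts land in the same bucket, so spans
# with similar expected lifetimes share hugepages.
placements = {}
for count in [4, 50, 50, 300]:
    placements.setdefault(bucket_for(count), []).append(count)
```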
1.5. Discussion
- Optimize for common datacenter libraries
- Metrics from applications are useful, e.g. cache hit rate
- NUMA awareness will help
- Uses kernel features
- Restartable sequences
- Virtual CPU ids
- Hugepages
- Lifetime aware allocation helps a lot