## TL;DR
- Glibc’s allocator (`malloc`/`free`) manages memory as chunks and reuses freed chunks from per-thread caches and per-arena bins before asking the kernel for more. That’s why a process often doesn’t shrink immediately after `free()`.
- Fast path: a per-thread tcache (default: up to 7 chunks per size class) serves small allocations with zero locking; then come fastbins, the unsorted bin, and the size-segregated small/large bins.
- Large allocations (by default, >~128 KiB) use `mmap()` and are returned to the OS immediately with `munmap()` when freed. Regular heap memory goes back to the OS only when the top of the heap can be trimmed past a threshold (default ~128 KiB, often twice the current mmap threshold).
- Multi-threading: glibc uses multiple arenas (heaps) to reduce contention. On 64-bit systems the default hard limit is typically 8 × #CPU cores (tunable).
- You can tune behavior with `mallopt()` and glibc tunables (e.g., `MALLOC_ARENA_MAX`, `glibc.malloc.tcache_count`, `trim_threshold`, `mmap_threshold`). Use sparingly; the defaults are good for most apps.
## Detailed explanation of memory allocation and deallocation in glibc
This post explains how modern glibc (on Linux) handles dynamic memory: what happens on `malloc()`, `calloc()`, `realloc()`, and `free()`. It’s written for experienced Elixir developers with light-to-intermediate C familiarity—no historical detours, just what matters today.
Key mental model: glibc tries hard to satisfy allocations from memory it already has—via per-thread caches and per-arena free lists. It asks the kernel for more only when it must, and it returns memory to the kernel only when it’s practical (large `mmap()`ed blocks, or heap “trimming” at the top).
## The basic building block: chunks
Every allocation is a chunk: a small header (size + flags; extra pointers when free) followed by your user data. Glibc rounds your request up to alignment and internal size classes. Freed chunks aren’t immediately returned to the OS—they’re linked into internal lists/caches for reuse.
If there’s insufficient free space, glibc either splits the “top chunk” (free space at the end of a heap) or grows memory (via `brk()` for the main heap or `mmap()` for others). Conversely, if the top of the heap becomes big enough, glibc can trim it back to the OS.
## Arenas: concurrency without pain
To reduce lock contention, glibc uses multiple arenas—independent heaps with their own free lists. Threads are assigned arenas; work proceeds mostly in parallel. The hard limit of arenas is implementation-defined and generally proportional to CPU cores (commonly 8× cores on 64-bit systems). You can cap it with `M_ARENA_MAX` or `MALLOC_ARENA_MAX`.
Implication: memory freed in one arena isn’t instantly usable by another; long-lived multi-threaded apps can appear to “hold on” to memory across threads. Tuning the arena limit can reduce per-thread overhead (at some contention cost).
## Free lists and caches (fast to slower)
Glibc organizes freed chunks into several layers, searched roughly in this order:
- Tcache (thread-local cache) — per-thread, no locks, default: 7 chunks per size class; covers small allocations up to `tcache_max` (default often ~1032 bytes on 64-bit). This is the hottest path for small, frequent (de)allocations.
- Fastbins — per-arena singly-linked lists for very small sizes; fast insertion/removal; no immediate coalescing on `free()` to keep the path hot.
- Unsorted bin — freshly freed (non-fastbin) chunks land here; the next `malloc` can grab from it quickly, otherwise chunks get moved into size-segregated bins.
- Small / large bins — sorted free lists for precise size classes (small) or ranged classes (large). Glibc may split a too-large chunk and return the remainder to a bin.
This layered design keeps the common path lock-free and O(1) while still managing fragmentation for the general case.
## What happens on `malloc(size)`
- Normalize size: round up for alignment + metadata. If `size == 0`, glibc returns a unique pointer you can safely `free()`.
- Large request? If the (adjusted) size exceeds the mmap threshold (default ~128 KiB, dynamically adjusted), glibc `mmap()`s a dedicated region. These chunks skip the bins and go straight back to the OS on `free()` via `munmap()`.
- Pick an arena for this thread; lock if needed. Multiple arenas reduce contention.
- Try tcache (exact size class). If present, pop and return—no locks.
- Try fastbins (if in fastbin range). Pop and return.
- Try the unsorted bin; if a big enough chunk is found, possibly split it. Otherwise, the allocator promotes unsorted entries into small/large bins and continues.
- Search small/large bins for a best fit (splitting if needed).
- No fit? Grow memory by carving from the top chunk, extending the heap with `brk()` (or mapping a new heap region) when necessary.
## What happens on `free(ptr)`
- NULL? Do nothing.
- Identify the chunk (metadata sits just before your pointer).
- If it was `mmap()`ed, glibc `munmap()`s it immediately—memory returns to the OS right away.
- Otherwise, fast path: put small chunks into tcache (if not full) — O(1), no locks. If tcache is full or the size is out of range, consider fastbins.
- For non-fastbin sizes, glibc may coalesce with adjacent free neighbors to reduce fragmentation, then insert into the unsorted bin (later migrating to small/large bins).
- Top-of-heap trimming: if coalescing grows the top chunk beyond the trim threshold (default ~128 KiB; often set to 2× the current mmap threshold), glibc shrinks the heap with `brk()` to give memory back to the OS. Trimming only applies to contiguous free space at the end of the heap.
Note: `free()` does not zero memory. That’s by design for speed; `calloc()` is the zero-initializing API.
## How `calloc()` and `realloc()` fit in
- `calloc(n, sz)` allocates `n*sz` bytes and zero-fills the user area. Under the hood, it still draws from the same caches/bins; glibc may use optimizations (like `mmap` + kernel-zeroed pages) for larger blocks.
- `realloc(p, new)` tries to resize in place when possible; otherwise it allocates a new chunk (possibly from tcache/bins), copies data, and frees the old chunk (which then follows the normal free path).
## Returning memory to the OS
- Always returned: standalone `mmap()` allocations are freed with `munmap()` and leave the process immediately. Great for big, bursty allocations.
- Sometimes returned: the heap top can be trimmed when large contiguous free space accumulates at the end (over `trim_threshold`). Otherwise, freed memory stays in the process for future reuse—which is usually good for performance.
## Tuning knobs you might care about
Use `mallopt()` (C) or environment variables / glibc tunables to adjust behavior. Common ones:

- `MALLOC_ARENA_MAX` / `M_ARENA_MAX` — cap the number of arenas. Helpful for memory-sensitive, highly threaded services (e.g., set to a small number to reduce per-thread heap overhead). The default hard limit is proportional to CPU count (often 8× cores on 64-bit).
- `glibc.malloc.tcache_count` — max cached chunks per size class in a thread (default 7, upper limit 65535). Set to 0 to disable tcache.
- `glibc.malloc.tcache_max` — largest tcache-eligible allocation (default ~1032 bytes on 64-bit).
- `M_TRIM_THRESHOLD` / `trim_threshold` — size beyond which the top chunk is trimmed. Defaults to ~128 KiB and is often updated dynamically to 2× the current `mmap_threshold`. Lower it to return memory more aggressively.
- `M_MMAP_THRESHOLD` / `mmap_threshold` — sizes above this use `mmap()` directly (~128 KiB by default; dynamically adjusted upward after very large requests).
Observability tip: call `malloc_info()` to dump allocator state (arenas, bins) as XML; handy for debugging/tuning in production.
## Practical examples
### 1) Reuse of a freed small chunk (tcache/fast path)
```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    void *p1 = malloc(100);
    void *p2 = malloc(100);
    free(p1);                       /* p1's chunk lands in the tcache */
    void *p3 = malloc(100);         /* same size class: reuses p1's chunk */
    printf("p1=%p\np2=%p\np3=%p\n", p1, p2, p3);
    return 0;
}
```
Typical output shows `p3 == p1`: glibc reused the recently freed chunk from tcache/bins—no heap growth required.
### 2) Large allocations use `mmap()` and return fast
```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    size_t big = 512 * 1024;        /* above the default mmap threshold */
    void *p = malloc(big);
    free(p);                        /* munmap()ed immediately */
    printf("freed large block\n");
    return 0;
}
```
Because this crosses the default `mmap_threshold`, freeing returns the mapping to the OS immediately.
### 3) `calloc()` zero-init and `realloc()` growth
```c
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int *a = calloc(4, sizeof(int));            /* zero-initialized */
    if (a == NULL) return 1;
    int *tmp = realloc(a, 1024 * sizeof(int));  /* may move the block */
    if (tmp == NULL) { free(a); return 1; }     /* keep a valid on failure */
    a = tmp;
    printf("a[0]=%d\n", a[0]);
    free(a);
    return 0;
}
```
`calloc()` returns zeroed memory; `realloc()` may resize in place or allocate+copy+free, depending on fit.
## Tips for Elixir developers
- FFI/native code (NIFs/Ports/C-Nodes): if your NIFs or linked C libraries allocate via glibc, your BEAM process memory profile follows these rules. Expect plateaus after frees unless you free big `mmap` chunks or the heap top can be trimmed. Consider explicit `malloc_trim(0)` at quiescent points if memory return matters.
- Threading model: the BEAM runs schedulers and potentially many threads; unbounded arenas can waste memory. If you see high but flat RSS, try lowering `MALLOC_ARENA_MAX` cautiously and measure contention.
- Allocation patterns: batched allocations/frees of similar sizes benefit most from tcache. On memory-tight systems, consider reducing `tcache_count` or disabling tcache to shrink per-thread caches (trade-off: slower alloc/free).
- Measure before tuning: use `malloc_info()` and OS tools (`/proc/self/smaps`, `pmap`, `perf`) to confirm where memory sits—arenas vs. mmaps—before changing thresholds.
## Sources
All details refer to current glibc behavior and official documentation/manpages at the time of writing; specifics can vary slightly by distribution and glibc version.