We evaluated a few allocators for some of our Linux apps and found (modern) tcmalloc to consistently win in time and space. Our applications are primarily written in Rust and the allocators were linked in statically (except for glibc). Unfortunately I didn't capture much context on the allocation patterns. I think in general the apps allocate and deallocate at a higher rate than most Rust apps (or more than I'd like at least).
Our results from July 2025:
rows are <allocator>: <RSS>, <time spent for allocator operations>
app1:
glibc: 215,580 KB, 133 ms
mimalloc 2.1.7: 144,092 KB, 91 ms
mimalloc 2.2.4: 173,240 KB, 280 ms
tcmalloc: 138,496 KB, 96 ms
jemalloc: 147,408 KB, 92 ms
app2, bench1:
glibc: 1,165,000 KB, 1.4 s
mimalloc 2.1.7: 1,072,000 KB, 5.1 s
mimalloc 2.2.4:
tcmalloc: 1,023,000 KB, 530 ms
app2, bench2:
glibc: 1,190,224 KB, 1.5 s
mimalloc 2.1.7: 1,128,328 KB, 5.3 s
mimalloc 2.2.4: 1,657,600 KB, 3.7 s
tcmalloc: 1,045,968 KB, 640 ms
jemalloc: 1,210,000 KB, 1.1 s
app3:
glibc: 284,616 KB, 440 ms
mimalloc 2.1.7: 246,216 KB, 250 ms
mimalloc 2.2.4: 325,184 KB, 290 ms
tcmalloc: 178,688 KB, 200 ms
jemalloc: 264,688 KB, 230 ms
tcmalloc was from github.com/google/tcmalloc/tree/24b3f29.
I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).
tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).
Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: thread A allocates a buffer, the task goes async, the request wakes up on thread B, which frees the buffer and has to synchronize with A to give it back.
Modern tcmalloc uses per-CPU caches via rseq [0]. We use async Rust with multithreaded tokio executors (sometimes multiple in the same application), so relatively high thread counts.
On the OS scheduler side, I'd imagine there's some stickiness that keeps tasks from jumping wildly between cores; I'd expect migration to be modelled as a non-zero cost. Complete speculation though.
On the tokio scheduler side, the executor is thread-per-core, and work stealing of in-progress tasks shouldn't be happening too much.
For thread pool threads, or threads unaffiliated with the executor, see the earlier speculation on OS scheduler behavior.
1. tcmalloc is actually the only allocator I tested which was not using thread-local caches; even glibc malloc has tcache.
2. Async executors typically shouldn't have tasks jumping willy-nilly between threads. I see the issue you describe more often with the use of thread pools (like rayon or tokio's spawn_blocking). I'd argue that the use of thread pools isn't necessarily an inherent feature of async executors. Certainly tokio relies on its thread pool for fs operations, but io_uring (for example) makes that mostly unnecessary.
This is similar to what I experienced when I tested mimalloc many years ago. If it was faster at all, it wasn't by much, and it had pretty bad worst cases.
Disclaimer: I don't really use Zig (primarily a Rust dev) but I do think it's quite cool.
If you're willing to dive right into it, I'd first read a bit about the comptime system [0] then have a go at reading the source for `MultiArrayList` [1], a container which internally stores elements in SoA format.
I am going to ask a question that this is definitely not the place for, but I am not involved with Zig in any way and am curious, so I hope you'll indulge me.
I noticed the following comment was added to lib/std/multi_array_list.zig [0] with this change:
/// This pointer is always aligned to the boundary `sizes.big_align`; this is not specified
/// in the type to avoid `MultiArrayList(T)` depending on the alignment of `T` because this
/// can lead to dependency loops. See `allocatedBytes` which `@alignCast`s this pointer to
/// the correct type.
How could relying on `@alignOf(T)` in the definition of `MultiArrayList(T)` cause a loop? Even with `T` itself being a MultiArrayList, surely that is a fully distinct, monomorphized type? I expect I am missing something obvious.
I had to search for this, but managed to find the relevant mlugg@ comment[0] on the ZSF zulip:
> i had to change the bytes field from [*]align(@alignOf(T)) u8 to just [*]u8 (and cast the alignment back in the like one place that field is accessed). this wasn't necessary for MultiArrayList in and of itself, but it was necessary for embedding a MultiArrayList(T) inside of T without a dependency loop, like
const T = struct {
children: MultiArrayList(T),
};
// reproduced for completeness:
fn MultiArrayList(comptime T: type) type {
return struct {
bytes: [*]align(@alignOf(T)) u8,
// ...
};
}
I’m curious what issues people were running into with Swift’s built-in C++ interop? I haven’t had the chance to use it myself, but it seemed reasonable to me at a surface level.
Just yesterday someone was telling me xmake does a lot of what bazel can do (hermetic, deterministic, optionally remote builds) while being easier to use.
I took a look at the docs later and couldn’t find a direct comparison. But there does seem to be a remote build system. And there were a few mentions of sandboxing.
Can anyone provide a head-to-head comparison?
Does xmake strictly enforce declared dependencies? Do actions run in their own sandboxes?
Can you define a target whose dependency tree is multi language, multi toolchain, multi target platform and which is built across multiple remote execution servers?
I don't recall which jemalloc was tested.