Hacker News | skavi's comments

We evaluated a few allocators for some of our Linux apps and found (modern) tcmalloc to consistently win in time and space. Our applications are primarily written in Rust and the allocators were linked in statically (except for glibc). Unfortunately I didn't capture much context on the allocation patterns. I think in general the apps allocate and deallocate at a higher rate than most Rust apps (or more than I'd like at least).
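For context on how an allocator gets "linked in statically" in a Rust app: the whole program's allocator is swapped with a single attribute. A minimal sketch, using the stdlib's `System` allocator as a stand-in only so it compiles on its own; real runs would point this at a tcmalloc, jemalloc, or mimalloc wrapper crate instead.

```rust
use std::alloc::System;

// One attribute redirects every Box/Vec/String allocation in the
// program. Swapping the right-hand side is how the allocators in
// the benchmarks below would typically be selected at build time.
#[global_allocator]
static GLOBAL: System = System;

fn demo() -> usize {
    // This Vec's backing buffer is allocated through GLOBAL.
    let v: Vec<u64> = (0..1024).collect();
    v.len()
}

fn main() {
    assert_eq!(demo(), 1024);
}
```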

Our results from July 2025:

rows are <allocator>: <RSS>, <time spent for allocator operations>

  app1:
  glibc: 215,580 KB, 133 ms
  mimalloc 2.1.7: 144,092 KB, 91 ms
  mimalloc 2.2.4: 173,240 KB, 280 ms
  tcmalloc: 138,496 KB, 96 ms
  jemalloc: 147,408 KB, 92 ms

  app2, bench1
  glibc: 1,165,000 KB, 1.4 s
  mimalloc 2.1.7: 1,072,000 KB, 5.1 s
  mimalloc 2.2.4:
  tcmalloc: 1,023,000 KB, 530 ms

  app2, bench2
  glibc: 1,190,224 KB, 1.5 s
  mimalloc 2.1.7: 1,128,328 KB, 5.3 s
  mimalloc 2.2.4: 1,657,600 KB, 3.7 s
  tcmalloc: 1,045,968 KB, 640 ms
  jemalloc: 1,210,000 KB, 1.1 s

  app3
  glibc: 284,616 KB, 440 ms
  mimalloc 2.1.7: 246,216 KB, 250 ms
  mimalloc 2.2.4: 325,184 KB, 290 ms
  tcmalloc: 178,688 KB, 200 ms
  jemalloc: 264,688 KB, 230 ms
tcmalloc was from github.com/google/tcmalloc/tree/24b3f29.

i don't recall which jemalloc was tested.


I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).

Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: A allocated a buffer, went async, the request wakes up on thread B, which frees the buffer, and has to synchronize with A to give it back.
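The exception case above is easy to reproduce with plain threads. A minimal sketch (not tied to any particular allocator): the buffer is allocated on one thread and freed on another, which is exactly the pattern a purely thread-caching allocator has to synchronize for.

```rust
use std::thread;

// Allocate on the calling thread ("A"), free on a spawned thread ("B").
fn cross_thread_free() -> usize {
    // Thread A allocates the buffer...
    let buf: Vec<u8> = vec![0u8; 4096];

    // ...then the work migrates, and thread B performs the free. A
    // thread-caching allocator must return the block to A's cache or a
    // shared pool under synchronization; a per-CPU design mostly avoids
    // this cost.
    thread::spawn(move || {
        let len = buf.len();
        drop(buf); // cross-thread deallocation happens here
        len
    })
    .join()
    .unwrap()
}

fn main() {
    assert_eq!(cross_thread_free(), 4096);
}
```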

Are you using async rust, or sync rust?


modern tcmalloc uses per-CPU caches via rseq [0]. We use async Rust with multithreaded tokio executors (sometimes multiple in the same application), so thread counts are relatively high.

[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...


How do you control which CPU your task resumes on? If you don't then it's still the same problem described above, no?

on the OS scheduler side, i'd imagine there's some stickiness that keeps tasks from jumping wildly between cores. like i'd expect migration to be modelled as a non-zero cost. complete speculation though.

on the tokio scheduler side, the executor is thread-per-core and work stealing of in-progress tasks shouldn't be happening too much.

for all thread pool threads or threads unaffiliated with the executor, see earlier speculation on OS scheduler behavior.


Correct. The Linux scheduler has been NUMA aware + sticky for a while (which is more or less what this reduces to in common scenarios).

> I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).

Indeed, it's not the old gperftools version.

Blog: https://abseil.io/blog/20200212-tcmalloc

History / Diffs: https://google.github.io/tcmalloc/gperftools.html


also:

1. tcmalloc is actually the only allocator I tested which was not using thread-local caches. even glibc malloc has tcache.

2. async executors typically shouldn’t have tasks jumping willy-nilly between threads. i see the issue you describe more often with the use of thread pools (like rayon or tokio’s spawn_blocking). i’d argue that the use of thread pools isn’t necessarily an inherent feature of async executors. certainly tokio relies on its threadpool for fs operations, but io_uring (for example) makes that mostly unnecessary.


That’s a considerable regression for mimalloc between 2.1 and 2.2 – did you track it down or report it upstream?

Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.


nope.

This is similar to what I experienced when I tested mimalloc many years ago. If it was faster, it wasn't faster by much, and had pretty bad worst cases.

Disclaimer: I don't really use Zig (primarily a Rust dev) but I do think it's quite cool.

If you're willing to dive right into it, I'd first read a bit about the comptime system [0] then have a go at reading the source for `MultiArrayList` [1], a container which internally stores elements in SoA format.

At least, that was what got me interested.

[0]: https://ziglang.org/documentation/master/#comptime

[1]: https://codeberg.org/ziglang/zig/src/branch/master/lib/std/m...
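For a feel of what `MultiArrayList` buys you, here's a rough hand-rolled analogue in Rust (a hypothetical `ParticleSoa` type, not the Zig API). For simplicity it uses one Vec per field; the real `MultiArrayList` packs all field arrays into a single allocation, and comptime generates this per-field bookkeeping for any `T`.

```rust
// Array-of-structs element type.
struct Particle { pos: f32, vel: f32, alive: bool }

// Struct-of-arrays layout: each field lives in its own dense array, so
// a pass that only touches `pos` streams through contiguous f32s with
// no padding from the other fields.
#[derive(Default)]
struct ParticleSoa {
    pos: Vec<f32>,
    vel: Vec<f32>,
    alive: Vec<bool>,
}

impl ParticleSoa {
    fn push(&mut self, p: Particle) {
        self.pos.push(p.pos);
        self.vel.push(p.vel);
        self.alive.push(p.alive);
    }
    fn len(&self) -> usize { self.pos.len() }
}

fn main() {
    let mut soa = ParticleSoa::default();
    soa.push(Particle { pos: 1.0, vel: 0.5, alive: true });
    soa.push(Particle { pos: 2.0, vel: -0.5, alive: false });
    assert_eq!(soa.len(), 2);
    assert_eq!(soa.pos, vec![1.0, 2.0]);
}
```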


I am going to ask a question that this is definitely not the place for, but I am not involved with Zig in any way and am curious, so I hope you'll indulge me.

I noticed the following comment was added to lib/std/multi_array_list.zig [0] with this change:

        /// This pointer is always aligned to the boundary `sizes.big_align`; this is not specified
        /// in the type to avoid `MultiArrayList(T)` depending on the alignment of `T` because this
        /// can lead to dependency loops. See `allocatedBytes` which `@alignCast`s this pointer to
        /// the correct type.
How could relying on `@alignOf(T)` in the definition of `MultiArrayList(T)` cause a loop? Even with `T` itself being a MultiArrayList, surely that is a fully distinct, monomorphized type? I expect I am missing something obvious.

[0]: https://codeberg.org/ziglang/zig/pulls/31403/files#diff-a6fc...


I had to search for this, but managed to find the relevant mlugg@ comment[0] on the ZSF zulip:

> i had to change the bytes field from [*]align(@alignOf(T)) u8 to just [*]u8 (and cast the alignment back in the like one place that field is accessed). this wasn't necessary for MultiArrayList in and of itself, but it was necessary for embedding a MultiArrayList(T) inside of T without a dependency loop, like

    const T = struct {
        children: MultiArrayList(T),
    };
    // reproduced for completeness:
    fn MultiArrayList(comptime T: type) type {
        return struct {
            bytes: [*]align(@alignOf(T)) u8,
            // ...
        };
    }
[0]: https://zsf.zulipchat.com/#narrow/channel/454360-compiler/to...
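For contrast, the same self-referential shape is unproblematic in Rust, because `Vec<T>`'s own size and alignment are fixed (pointer, length, capacity) regardless of `T`, so embedding the container in its element type creates no layout dependency loop:

```rust
// A node whose container of children is embedded in the node itself,
// the same shape as `MultiArrayList(T)` inside `T` in the Zig snippet.
// This compiles because Vec<Node>'s layout never depends on Node's
// alignment at the type level.
struct Node {
    value: u32,
    children: Vec<Node>,
}

// Count the nodes in the tree recursively.
fn count(n: &Node) -> usize {
    1 + n.children.iter().map(count).sum::<usize>()
}

fn main() {
    let tree = Node {
        value: 1,
        children: vec![
            Node { value: 2, children: vec![] },
            Node { value: 3, children: vec![] },
        ],
    };
    assert_eq!(count(&tree), 3);
}
```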

Ah, that makes sense. Thanks for pulling this up!

He went into it a bit more over on Ziggit today too. I only noticed it way after I went digging: https://ziggit.dev/t/devlog-type-resolution-redesign-with-la...


Since Geekbench 5, the single threaded benchmark scores have aligned pretty well with those from the industry standard SPEC benchmark.

thank you. TFA is just an awful read.


I’m curious what issues people were running into with Swift’s built in C++ interop? I haven’t had the chance to use it myself, but it seemed reasonable to me at a surface level.


There's a list of unsolved problems in this Ladybird issue, now closed because they dropped Swift: https://github.com/LadybirdBrowser/ladybird/issues/933

for example: "Swift fails to import clang modules with #include <math.h> with libstdc++-15 installed. Workaround: None (!!)"


Just yesterday someone was telling me xmake does a lot of what bazel can do (hermetic, deterministic, optionally remote builds) while being easier to use.

I took a look at the docs later and couldn’t find a direct comparison. But there does seem to be a remote build system. And there were a few mentions of sandboxing.

Can anyone provide a head to head comparison?

Does xmake strictly enforce declared dependencies? Do actions run in their own sandboxes?

Can you define a target whose dependency tree is multi language, multi toolchain, multi target platform and which is built across multiple remote execution servers?


These guys aren’t going to use L4T. ACPI-compliant, standard GPU drivers. They’ve also upstreamed a lot of the L4T patches.

See DGX Spark.


In Nix (and, I’d assume, for Guix) you can go the other way around: https://mitchellh.com/writing/nix-with-dockerfiles.

As a side benefit, the generated docker image can be very tiny.


Yeah Guix has a Dockerfile export

