Hacker News | aegis_camera's comments

Thanks. We're not a large R&D lab, and we have limited resources. We were building a local-VLM-first, BYOD video security application, and our users requested an MLX backend benchmark comparison. We tried hard not to ship Python in the application bundle, so we searched for a pure-binary MLX implementation; the results showed we needed to build one. It took us two weeks to get it working, and we've been testing with multiple models. As a reference, you can see the results here: https://www.sharpai.org/benchmark/

Then we saw the announcement from Google about TurboQuant. It's so cool that we started integrating it (along with SSD/flash streaming). It's a non-trivial process, and thanks for your support and understanding. When we saw the mobile application come alive with the Qwen3 1.7B model, we thought it was worth it.

If we find anything similar that is well maintained, we will definitely adopt it, since our target is production delivery. If this one gets good support from the community, we will continue to support it.

I think all the posts here gave us a reason to continue.


Tried, but it was the wrong time to post, so it got zero attention. :)

One of my users requested an MLX comparison with GGUF; he wanted to run the benchmark. I was thinking about how to get MLX support without bundling Python code into SharpAI Aegis, a local or BYOK local security agent (https://www.sharpai.org). So I had to pick up Swift and create it.

The benchmark shows a benefit of the MLX engine, so it's the user's choice which engine to use; aegis-ai supports both : )


Here is a reference: https://www.sharpai.org/benchmark/ For specific tasks, a local model can reach a workable level.

I've run this on an iPhone 13 Pro (6 GB memory), and Qwen3 1.7B runs well. So local models will soon be intelligent enough for the task you want done, or already are.

Yes, I've run it on iOS, on an iPhone 13 Pro alongside the M5 Pro. I'll test it on my M2 Mini and M3 Air.

The Python mlx-metal trick is actually what's crashing it. The mlx.metallib from pip is built against a different version of MLX than your Swift binary. It gets past the startup error but then corrupts the GPU memory allocator at inference time → "freed pointer was not the last allocation".

Use the version-matched metallib that's already in the repo:

  cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
    .build/release/

  .build/release/SwiftLM \
    --model mlx-community/Qwen3.5-122B-A10B-4bit \
    --stream-experts \
    --port 5413

This is the exact metallib that was compiled alongside the Swift code, so there is no version mismatch. Future pre-built releases will bundle it automatically.


  git clone https://github.com/SharpAI/SwiftLM   # no --recursive needed
  cd SwiftLM
  swift build -c release

Please let me know if this fixes the issue:

  # Copy metallib next to the binary (one-time step)
  cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
    .build/release/


Yes, this is a reference project. The main difference is that we don't use OS swap (it introduces latency; will add https://github.com/danveloper/flash-moe to the original reference as well).

We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
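The Lloyd-Max idea behind codebook-based KV compression can be sketched in a few lines. This is an illustrative NumPy toy, not the actual TurboQuant V3 codebooks or the C++/Metal port; all function names here are hypothetical.

```python
import numpy as np

def build_lloyd_max_codebook(samples, n_levels=16, iters=20):
    """Fit codebook levels to the sample distribution
    (classic Lloyd-Max, i.e. 1-D k-means)."""
    levels = np.quantile(samples, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        # Assign each sample to its nearest level, then recenter levels.
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return np.sort(levels)

def quantize_kv(kv, levels):
    # 16 levels -> 4-bit indices; fp16 KV entries shrink roughly 4x.
    return np.abs(kv[..., None] - levels).argmin(axis=-1).astype(np.uint8)

def dequantize_kv(idx, levels):
    return levels[idx]

rng = np.random.default_rng(0)
kv = rng.normal(size=4096).astype(np.float32)       # fake KV cache slice
levels = build_lloyd_max_codebook(kv, n_levels=16)
idx = quantize_kv(kv, levels)
recon = dequantize_kv(idx, levels)
err = np.abs(kv - recon).mean()                     # small reconstruction error
```

In the real system the dequantize step would live inside the Metal attention kernel, so the compressed cache is expanded on the fly rather than materialized in memory.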

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
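The streaming mechanism (mmap the weight file, materialize only the experts the router picked, and let the OS page cache keep hot experts resident) can be sketched roughly like this. Toy sizes and the flat file layout are assumptions for illustration, not SwiftLM's actual on-disk format.

```python
import mmap
import numpy as np

# Hypothetical layout: a flat file holding N equal-size experts back to back.
N_EXPERTS, EXPERT_DIM = 8, 1024          # toy sizes
EXPERT_BYTES = EXPERT_DIM * 4            # float32

def write_toy_expert_file(path):
    with open(path, "wb") as f:
        for e in range(N_EXPERTS):
            f.write(np.full(EXPERT_DIM, e, dtype=np.float32).tobytes())

def stream_experts(path, active_ids):
    """Map the weight file and copy out only the requested experts.
    Untouched experts never leave the SSD; the OS page cache handles
    reuse of recently routed ('hot') experts automatically."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        out = {}
        for e in active_ids:
            off = e * EXPERT_BYTES
            out[e] = np.frombuffer(mm[off:off + EXPERT_BYTES],
                                   dtype=np.float32).copy()
        mm.close()
        return out

path = "experts.bin"
write_toy_expert_file(path)
active = stream_experts(path, [2, 5])    # pretend top-k routing picked 2 and 5
```

The real implementation streams pages to the GPU at NVMe speed per forward pass; the point of the sketch is just that resident memory scales with the active experts, not the full model.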

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.

Also tested Qwen 4B on an iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM


What tokens/s are you getting with a 122B MoE model in this setup? I didn't see any numbers in the benchmarks section of the README.

https://www.sharpai.org/benchmark/ The MLX part is what we've done with SwiftLM; the local results are still being verified, and more details are on the way.

I'll add more details. We just wired up the pipeline on both macOS and iOS.

Yeah, this is what I'd like to see added to the README.


Thanks. Pure Swift was the design idea, and since I found nothing I could use for my project (https://www.sharpai.org), I created the Swift version. Python is too heavy to ship with an application, and users mentioned they wanted to use MLX, which is why I've spent 1-2 weeks on bug fixing and testing. Then TurboQuant was suddenly proposed, and I did a quick integration. My 64 GB M5 Pro is already good for my local security task, and now it can run on an M1/M2 Mini with 8 GB of memory.

