Hacker News | aegis_camera's comments

Thanks. We're not a large R&D lab, and we have limited resources. We were building a local-VLM-first, BYOD video security application, and our users requested an MLX backend benchmark comparison. We tried hard not to ship Python in the application bundle, so we searched for a pure-binary MLX implementation; the results showed we needed to build one. It took us two weeks to get it working, and we've been testing with multiple models. As a reference, you can see the results here: https://www.sharpai.org/benchmark/

Then we saw the announcement from Google about TurboQuant. It's so cool that we started integrating it (along with SSD/flash streaming). It's a non-trivial process, and thanks for your support and understanding. When we saw the mobile application come alive with the Qwen3 1.7B model, we thought it was worth it.

If we find anything similar that is well maintained, we will definitely adopt it, since our target is production delivery. If this one gets good support from the community, we will continue to support it.

I think all the posts here gave us a reason to continue.


Tried, but it was the wrong time to post, so it got zero attention. :)

One of my users requested an MLX comparison with GGUF; he wanted to run the benchmark. I was thinking about how to get MLX support without bundling Python code into SharpAI Aegis, a local or BYOK local security agent (https://www.sharpai.org). So I had to pick up Swift and create it.

The benchmark shows a benefit of the MLX engine, so it's the user's choice which engine to use; aegis-ai supports both : )


Here is a reference: https://www.sharpai.org/benchmark/ For specific tasks, a local model can reach a workable level.

I've run this on an iPhone 13 Pro (6 GB memory), and Qwen3 1.7B runs well. So local models will soon be intelligent enough for the task you want done, or already are.

Yes, I've run it on iOS, on an iPhone 13 Pro alongside the M5 Pro. I'll test it on my M2 Mini and M3 Air.

The Python mlx-metal trick is actually what's crashing it. The mlx.metallib from pip is built against a different version of MLX than your Swift binary. It gets past the startup error but then corrupts the GPU memory allocator at inference time → "freed pointer was not the last allocation".

Use the version-matched metallib that's already in the repo:

  cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
    .build/release/

  .build/release/SwiftLM \
    --model mlx-community/Qwen3.5-122B-A10B-4bit \
    --stream-experts \
    --port 5413

This is the exact metallib that was compiled alongside the Swift code, so there is no version mismatch. Future pre-built releases will bundle it automatically.


  git clone https://github.com/SharpAI/SwiftLM   # no --recursive needed
  cd SwiftLM
  swift build -c release

Please let me know if this fixes the issue:

  # Copy metallib next to the binary (one-time step)
  cp LocalPackages/mlx-swift/Source/Cmlx/mlx/mlx/backend/metal/kernels/default.metallib \
    .build/release/


Yes, this is a reference project. The main difference is that we don't use OS swap (it introduces latency; will add https://github.com/danveloper/flash-moe to the original reference as well).

We implemented two techniques to run massive 100B+ parameter MoE models natively on the M5 Pro 64GB MacBook Pro:

TurboQuant KV compression: We ported the V3 Lloyd-Max codebooks from the TurboQuant paper (Zandieh et al., ICLR 2026) into native C++ and fused dequantization into Metal shaders. This achieves a measured 4.3× KV cache compression at runtime, completely eliminating Python overhead.
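The Lloyd-Max idea behind codebook-based KV compression can be sketched in a few lines. This is an illustrative NumPy toy, not the actual TurboQuant V3 codebooks or the C++/Metal port; all function names here are hypothetical.

```python
import numpy as np

def build_lloyd_max_codebook(samples, n_levels=16, iters=20):
    """Fit codebook levels to the sample distribution
    (classic Lloyd-Max, i.e. 1-D k-means)."""
    levels = np.quantile(samples, np.linspace(0.0, 1.0, n_levels))
    for _ in range(iters):
        # Assign each sample to its nearest level, then recenter levels.
        idx = np.abs(samples[:, None] - levels[None, :]).argmin(axis=1)
        for k in range(n_levels):
            members = samples[idx == k]
            if members.size:
                levels[k] = members.mean()
    return np.sort(levels)

def quantize_kv(kv, levels):
    # 16 levels -> 4-bit indices; fp16 KV entries shrink roughly 4x.
    return np.abs(kv[..., None] - levels).argmin(axis=-1).astype(np.uint8)

def dequantize_kv(idx, levels):
    return levels[idx]

rng = np.random.default_rng(0)
kv = rng.normal(size=4096).astype(np.float32)       # fake KV cache slice
levels = build_lloyd_max_codebook(kv, n_levels=16)
idx = quantize_kv(kv, levels)
recon = dequantize_kv(idx, levels)
err = np.abs(kv - recon).mean()                     # small reconstruction error
```

In the real system the dequantize step would live inside the Metal attention kernel, so the compressed cache is expanded on the fly rather than materialized in memory.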

SSD Expert Streaming: To fit a 122B parameter model (e.g., Qwen3.5-122B MoE) without triggering macOS VM swapping or Watchdog kernel kills, the full ~60 GB weight file remains on NVMe. Only the top-k active expert pages are streamed to the GPU per forward pass at ~9 GB/s. As a result, inference runs with only 2,694 MB of active GPU VRAM on the M5 Pro 64GB, while the OS page cache automatically handles hot-expert reuse.
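The streaming mechanism (mmap the weight file, materialize only the experts the router picked, and let the OS page cache keep hot experts resident) can be sketched roughly like this. Toy sizes and the flat file layout are assumptions for illustration, not SwiftLM's actual on-disk format.

```python
import mmap
import numpy as np

# Hypothetical layout: a flat file holding N equal-size experts back to back.
N_EXPERTS, EXPERT_DIM = 8, 1024          # toy sizes
EXPERT_BYTES = EXPERT_DIM * 4            # float32

def write_toy_expert_file(path):
    with open(path, "wb") as f:
        for e in range(N_EXPERTS):
            f.write(np.full(EXPERT_DIM, e, dtype=np.float32).tobytes())

def stream_experts(path, active_ids):
    """Map the weight file and copy out only the requested experts.
    Untouched experts never leave the SSD; the OS page cache handles
    reuse of recently routed ('hot') experts automatically."""
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        out = {}
        for e in active_ids:
            off = e * EXPERT_BYTES
            out[e] = np.frombuffer(mm[off:off + EXPERT_BYTES],
                                   dtype=np.float32).copy()
        mm.close()
        return out

path = "experts.bin"
write_toy_expert_file(path)
active = stream_experts(path, [2, 5])    # pretend top-k routing picked 2 and 5
```

The real implementation streams pages to the GPU at NVMe speed per forward pass; the point of the sketch is just that resident memory scales with the active experts, not the full model.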

By combining these two approaches, we can comfortably run massive models in memory-constrained environments on Apple Silicon.

Also tested Qwen 4B on an iPhone 13 Pro.

Code and implementation details: https://github.com/SharpAI/SwiftLM


What tokens/s are you getting with a 122B MoE model in this setup? I didn't see any numbers in the benchmarks section of the README.

https://www.sharpai.org/benchmark/ The MLX part is what we've done with SwiftLM; the local results are still being verified, and more details are on the way.

I'll add more details. We just wired up the pipeline on both macOS and iOS.

Yeah, this is what I'd like to see added to the README.


Thanks. Pure Swift was the design idea, and since I found nothing I could use for my project (https://www.sharpai.org), I created the Swift version. Python is too heavy to ship with an application, and users mentioned they wanted to use MLX, which is why I've spent 1-2 weeks on bug fixing and testing. Then TurboQuant was suddenly proposed, and I did a quick integration. My 64 GB M5 Pro is already good for my local security task, and now it can run on an M1/M2 Mini with 8 GB of memory.

