Home — MLX COMMUNITY


8 SCORE

Meta PINNED

Welcome. Read this once. 1. **on-topic only** — MLX, Apple Silicon, on-device AI, conversions, demos, help, hire. 2. **no AI-generated slop** — if your post i…

by krug · 2026-04-21 · last activity 2026-04-21 01:57

0 REPLIES

0 SCORE

Help

gemma-4-12b / omlx error

I get an error when running gemma-4-12B-it-bf16 on omlx v0.3.12: Error: {"error":{"message":"Internal server error","type":"server_error","param":n…

by i6mods_ikyq · 2026-06-04 · last activity 2026-06-14 15:38

1 REPLY

0 SCORE

Help

Best MLX setup for replacing Claude Sonnet 4.6 for coding on a MacBook Pro M4 Pro (48GB RAM)?

Hello everyone. I've been coding with Claude Sonnet and Opus for 6 months now and I really appreciate the quality of code they provide (need to review everything…

by danygiguere23_7ttg · 2026-05-08 · last activity 2026-06-10 02:56

1 REPLY

6 SCORE

LLM

Converting Llama-3.2-1B to MLX — my notes + gotchas

Ported L3.2-1B today. Dumping what actually worked vs what the docs imply. ### what worked - `mlx_lm.convert --hf-path meta-llama/Llama-3.2-1B-Instruct --mlx-…

by halee · 2026-04-21 · last activity 2026-04-21 01:59

2 REPLIES

7 SCORE

Audio

On-device Whisper Large-v3 @ 4.1x realtime on M2 Pro

Throughput numbers on M2 Pro (12c), 400MHz P-cores, ANE disabled (MLX on CPU/GPU only): | model | format | rt factor | mem | |--------------|--------|…

by tito · 2026-04-21 · last activity 2026-04-21 01:58

1 REPLY
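A note on reading the "rt factor" column: assuming it follows the usual convention of audio duration divided by wall-clock processing time (higher is faster), 4.1x realtime means a 10-minute recording takes about 146 seconds to transcribe. A small helper for that arithmetic; both function names are hypothetical:

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time (assumed convention)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def wall_time_for(audio_seconds: float, rt_factor: float) -> float:
    """Expected processing time for a clip at a given realtime factor."""
    if rt_factor <= 0:
        raise ValueError("rt_factor must be positive")
    return audio_seconds / rt_factor


# 10 minutes of audio at 4.1x realtime
print(round(wall_time_for(600.0, 4.1), 1))  # 146.3
```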

9 SCORE

LLM

KV cache grows unbounded when streaming via mlx_lm — what am I missing?

I'm streaming tokens via `stream_generate` and the KV cache keeps growing past the context window instead of evicting. Anyone hit this? Minimum repro: ```pyth…

by prism · 2026-04-21 · last activity 2026-04-21 01:56

1 REPLY
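For reference, this is what "evicting" means for a sliding-window cache: keep only the most recent N entries and drop the oldest as new tokens arrive. It is a toy, plain-Python sketch with a hypothetical class name; it stores placeholder (key, value) pairs instead of tensors and says nothing about how mlx_lm's own cache classes behave by default.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy cache that keeps only the most recent `max_size` (key, value) pairs.

    Hypothetical illustration: a real MLX cache holds per-layer mx.array tensors.
    """

    def __init__(self, max_size: int):
        if max_size <= 0:
            raise ValueError("max_size must be positive")
        self.max_size = max_size
        # A deque with maxlen silently drops the oldest entry when full.
        self._entries = deque(maxlen=max_size)
        self.offset = 0  # total tokens seen, including evicted ones

    def update(self, key, value):
        """Append one token's key/value and return the entries still in the window."""
        self._entries.append((key, value))
        self.offset += 1
        return list(self._entries)

    def __len__(self) -> int:
        return len(self._entries)


cache = SlidingWindowKVCache(max_size=4)
for t in range(10):
    window = cache.update(f"k{t}", f"v{t}")
print(len(cache), cache.offset, window[0])  # 4 10 ('k6', 'v6')
```

A pure sliding window discards the earliest tokens, including any system prompt, so output quality can shift once the window starts rotating.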

3 SCORE

Audio

Contract: port ~4 small audio models to MLX (paid)

Small studio. Need 4 proprietary audio models (each <500M params) ported from PyTorch → MLX with: - passing eval parity (±1e-3) - quantization recipe (q4 + q2 …

by gunmetal · 2026-04-21 · last activity 2026-04-21 01:55

0 REPLIES
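One way to state the ±1e-3 parity requirement as a check: compare reference (PyTorch) and candidate (MLX) outputs element by element and require the largest absolute difference to stay within tolerance. This sketch assumes "parity" means max absolute error on flattened outputs; the contract may define it differently (relative error, or eval scores within ±1e-3), and the helper names are hypothetical.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest elementwise |reference - candidate| over two flattened outputs."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max((abs(r - c) for r, c in zip(reference, candidate)), default=0.0)


def passes_parity(reference: Sequence[float], candidate: Sequence[float],
                  atol: float = 1e-3) -> bool:
    """True if every element of `candidate` is within `atol` of `reference`."""
    return max_abs_diff(reference, candidate) <= atol


# Hypothetical outputs: PyTorch reference vs. MLX port, flattened to lists.
print(passes_parity([0.10, 0.20, 0.30], [0.1004, 0.20, 0.2999]))  # True
```

With NumPy arrays the same absolute-only check is `np.allclose(ref, cand, rtol=0, atol=1e-3)`; setting `rtol=0` matters because `allclose` otherwise adds a relative term to the tolerance.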

11 SCORE

LLM

Mixtral 8x7B on M3 Max 64G — actually usable?

Short answer: yes, at q4. ~7-9 tok/s, ~38GB RAM. Full writeup with the convert commands, the router quirks at q2, and a comparison against llama.cpp Metal bac…

by halee · 2026-04-21 · last activity 2026-04-21 01:54

1 REPLY

12 SCORE

Tools

TUI chat app using MLX + raw Metal shaders for the bouncing logo

Wrote a TUI chat that runs MLX inference + renders a bouncing "MLX" logo via Metal compute shaders in an ANSI block-char grid. Because why not. ~400 lines of …

by tito · 2026-04-21 · last activity 2026-04-21 01:53

0 REPLIES
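The bouncing-logo part boils down to a position update that reflects the velocity when the sprite would leave the grid. The thread does this in Metal compute shaders; the sketch below is only the bounce logic on the CPU side, in Python, with hypothetical names and positions measured in character cells.

```python
from dataclasses import dataclass


@dataclass
class Sprite:
    x: int   # left column
    y: int   # top row
    dx: int  # horizontal velocity, cells per frame
    dy: int  # vertical velocity, cells per frame
    w: int   # sprite width in cells
    h: int   # sprite height in cells


def step(s: Sprite, cols: int, rows: int) -> Sprite:
    """Advance one frame; flip a velocity component when the move would leave the grid."""
    x, y, dx, dy = s.x + s.dx, s.y + s.dy, s.dx, s.dy
    if x < 0 or x + s.w > cols:
        dx = -dx
        x = s.x + dx  # bounce: move the other way from the old position
    if y < 0 or y + s.h > rows:
        dy = -dy
        y = s.y + dy
    return Sprite(x, y, dx, dy, s.w, s.h)


s = Sprite(x=7, y=0, dx=1, dy=1, w=3, h=1)
s = step(s, cols=10, rows=5)
print(s.x, s.y, s.dx, s.dy)  # 6 1 -1 1
```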

13 SCORE

VLM

Qwen2-VL-7B on MLX — notes from a weekend of fighting bbox scaling

Ported Qwen2-VL-7B. Quick TL;DR: - **weights:** q4 fits in ~5.2GB. fine on any 16GB M-chip. - **image preprocessing:** Qwen expects a very specific resize+cro…

by prism · 2026-04-21 · last activity 2026-04-21 01:52

0 REPLIES
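On the bbox scaling: assuming the model emits boxes on a normalized 0 to 1000 grid (which is how Qwen2-VL's grounding output is commonly described; verify against your checkpoint), the boxes have to be mapped back to the original image's pixel size, not the resized model input. A minimal sketch; the function name is hypothetical:

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def normalized_to_pixels(box: Box, width: int, height: int,
                         scale: float = 1000.0) -> Tuple[int, int, int, int]:
    """Map a box on a 0..`scale` grid to pixel coordinates of a width x height image."""
    x1, y1, x2, y2 = box
    # Multiply before dividing so integer inputs stay exact where possible.
    return (round(x1 * width / scale), round(y1 * height / scale),
            round(x2 * width / scale), round(y2 * height / scale))


print(normalized_to_pixels((100, 200, 500, 1000), 1920, 1080))  # (192, 216, 960, 1080)
```

If your checkpoint instead emits coordinates in the resized image's pixel space, scale by original size divided by resized size rather than by 1000.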

5 SCORE

Finetune

LoRA finetuning Gemma2-2B on M3 Max — full run writeup

30k-sample instruction tune of Gemma2-2B. ~3 hours on M3 Max 64G. ``` mlx_lm.lora --train --model mlx-community/gemma-2-2b-4bit \ --data ./corpus.jsonl --ba…

by halee · 2026-04-21 · last activity 2026-04-21 01:51

0 REPLIES
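For the `--data` side of that command, training records are stored as JSON Lines, one object per line. A minimal sketch that serializes (prompt, completion) pairs, assuming the `prompt`/`completion` record shape described in the mlx_lm LoRA docs (which also describe `text` and chat `messages` formats); check which file names and layout your mlx_lm version expects for `--data`. The helper name is hypothetical.

```python
import json
from typing import Iterable, Tuple


def to_jsonl(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (prompt, completion) pairs as JSON Lines: one object per line."""
    lines = [json.dumps({"prompt": p, "completion": c}, ensure_ascii=False)
             for p, c in pairs]
    return "\n".join(lines) + "\n" if lines else ""


# Usage (file name is an assumption):
#   open("train.jsonl", "w", encoding="utf-8").write(to_jsonl(pairs))
print(to_jsonl([("Hi", "Hello!")]), end="")  # {"prompt": "Hi", "completion": "Hello!"}
```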

6 SCORE

Quantize

Q2 vs Q4 vs Q8 on Llama-3.1-8B — actual numbers

Ran the same eval suite across three quantizations of Llama-3.1-8B on M2 Pro. | quant | mem | tok/s | HumanEval | MMLU-redux | |-------|------|-------|------…

by tito · 2026-04-21 · last activity 2026-04-21 01:50

0 REPLIES
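Weight storage at each quant level can be estimated before running anything. A rough sketch, assuming affine group quantization with one 16-bit scale and one 16-bit bias per group of 64 weights (the group size MLX uses by default; verify for your version), so each weight costs `bits + 32/64` bits. It ignores the KV cache, activations, and any layers left unquantized, so measured memory will be higher. A round 8e9 parameters stands in for Llama-3.1-8B:

```python
def quantized_weight_gb(n_params: float, bits: int, group_size: int = 64,
                        scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Approximate weight storage (GB, 1e9 bytes) under affine group quantization.

    Every group of `group_size` weights carries one scale and one bias, so the
    effective cost is bits + (scale_bits + bias_bits) / group_size bits per weight.
    """
    effective_bits = bits + (scale_bits + bias_bits) / group_size
    return n_params * effective_bits / 8 / 1e9


for b in (2, 4, 8):
    print(f"q{b}: {quantized_weight_gb(8e9, b):.1f} GB")
# q2: 2.5 GB
# q4: 4.5 GB
# q8: 8.5 GB
```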

7 SCORE

Audio

Whisper-Large-v3 streaming with word timestamps on MLX

Built a streaming wrapper around mlx-audio's whisper impl that emits word-level timestamps on a 200ms cadence. Runs at 3.8x realtime on M2 Pro. Useful for live…

by halee · 2026-04-21 · last activity 2026-04-21 01:49

0 REPLIES
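The 200ms cadence translates directly into sample counts: Whisper models take 16 kHz mono audio, so each step covers 3,200 new samples. A sketch of slicing a buffer on that cadence with a start timestamp per chunk; the function is hypothetical and is not mlx-audio's API. A real streaming wrapper would likely feed a longer rolling window to the model rather than each 200ms slice in isolation.

```python
from typing import Iterator, Sequence, Tuple

SAMPLE_RATE = 16_000  # Whisper models consume 16 kHz mono audio


def chunks(samples: Sequence[float], cadence_ms: int = 200,
           sample_rate: int = SAMPLE_RATE) -> Iterator[Tuple[float, Sequence[float]]]:
    """Yield (start_seconds, chunk) pairs; the final chunk may be shorter."""
    hop = sample_rate * cadence_ms // 1000  # 3200 samples at 16 kHz / 200 ms
    for start in range(0, len(samples), hop):
        yield start / sample_rate, samples[start:start + hop]


one_second = [0.0] * SAMPLE_RATE
print([t for t, _ in chunks(one_second)])  # [0.0, 0.2, 0.4, 0.6, 0.8]
```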

8 SCORE

Image

Flux.1-schnell on M3 Max: 5s for a 1024x1024 image

MLX port of flux-schnell is surprisingly fast. With q4 weights and CFG-free guidance: - 1024×1024, 4 steps: **5.2s** on M3 Max (64G) - 512×512, 4 steps: **1.4…

by prism · 2026-04-21 · last activity 2026-04-21 01:48

0 REPLIES

10 SCORE

Perf

Benchmark dump: tokens/sec across every M-chip I could borrow

llama-3.1-8B-q4 on: - M1 Air 16G → 19 tok/s - M1 Pro 16G → 27 tok/s - M2 Pro 32G → 38 tok/s - M3 Pro 36G → 42 tok/s - M3 Max 64G → 71 tok/s - M4 Max 128G → 84…

by tito · 2026-04-21 · last activity 2026-04-21 01:47

0 REPLIES
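To add another chip to this list, one simple way to measure is to time a token stream end to end. A sketch with a hypothetical `generate` callable standing in for whatever streaming call you use. The thread does not say whether its numbers include prompt processing; this helper does (timing starts before the first token), so results are only comparable when measured the same way.

```python
import time
from typing import Callable, Iterable


def tokens_per_second(generate: Callable[[], Iterable[object]]) -> float:
    """Consume a token stream and return tokens per wall-clock second."""
    start = time.perf_counter()
    n_tokens = sum(1 for _ in generate())  # count tokens as they are yielded
    elapsed = time.perf_counter() - start
    return n_tokens / elapsed if elapsed > 0 else float("inf")


def fake_stream():
    """Stand-in generator: 5 'tokens', at least 10 ms apart."""
    for _ in range(5):
        time.sleep(0.01)
        yield "tok"


print(f"{tokens_per_second(fake_stream):.0f} tok/s")
```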

10 SCORE

Tools

mlx-lm vs mlx-vlm vs mlx-audio — which CLI wins?

Short answer: use all three for their specific domains. They share a core, but their chat templates and tokenizer preprocessing differ. Longer answer: mlx-lm is the…

by krug · 2026-04-21 · last activity 2026-04-21 01:46

0 REPLIES

11 SCORE

Embeddings

Embeddings on Apple Silicon: BGE vs Nomic vs Jina on MLX

Ran 10k doc encode pass: - **BGE-M3** (q8): 3200 docs/sec - **nomic-embed-text-v1.5** (q8): 4100 docs/sec - **jina-embeddings-v3** (q8): 2700 docs/sec All on…

by gunmetal · 2026-04-21 · last activity 2026-04-21 01:45

0 REPLIES