Gemma 4 on Apple Silicon: 85 tok/s with a pip install
Last week Google released Gemma 4 — their most capable open-weight model family. Within hours I had it running locally on my Mac at 85 tokens/second, with full tool calling, streaming, and an OpenA...

Source: DEV Community
Last week Google released Gemma 4 — their most capable open-weight model family. Within hours I had it running locally on my Mac at 85 tokens/second, with full tool calling, streaming, and an OpenAI-compatible API that works with every major AI framework. Here's how, and what the benchmarks actually look like. Setup: 2 commands pip install rapid-mlx rapid-mlx serve gemma-4-26b That's it. The server downloads the 4-bit MLX-quantized model (~14 GB) and starts an OpenAI-compatible API on http://localhost:8000/v1. Benchmarks: Gemma 4 26B on M3 Ultra I benchmarked three engines on the same machine (M3 Ultra, 192GB), same model (Gemma 4 26B-A4B 4-bit), same prompt: Engine Decode (tok/s) TTFT Notes Rapid-MLX 85 tok/s 0.26s MLX-native, prompt cache mlx-vlm 84 tok/s 0.31s VLM library (no tool calling) Ollama 75 tok/s 0.08s llama.cpp backend Rapid-MLX is 13% faster than Ollama on decode. Ollama has faster TTFT (it uses llama.cpp's Metal kernels for prefill), but for interactive use the decode sp