NPU-first runtime
The fastest, most efficient LLM inference on NPUs
FastFlowLM (FLM) delivers an Ollama-style developer experience optimized for tile-structured NPU accelerators. Install in seconds, stream tokens instantly, and run context windows up to 256k — all with dramatically better efficiency than GPU-first stacks. Our GA release for AMD Ryzen™ AI NPUs is available today, with betas for Qualcomm Snapdragon and Intel Core Ultra coming soon.
-
Runtime size
~16 MB
-
Context
Up to 256k tokens
-
Supported chips
Ryzen™ AI (Strix, Strix Halo, Kraken)
GPT-OSS on NPU
GPT-OSS-20B streaming fully on the Ryzen™ AI NPU
Install
From download to first token in under a minute
FastFlowLM ships as a 16 MB runtime with an Ollama-style CLI and a server compatible with the OpenAI API. No drivers, no guesswork—just run the installer, pull a model, and start chatting.
-
Zero-conf installer
Signed installers for every Ryzen™ AI 300 laptop: download, run, done.
-
Drop-in APIs
OpenAI-compatible APIs: plug in your existing tools instantly (see the worked example after the quickstart).
-
Secure by default
On-device security: local tokens and full offline mode.
Quickstart
# Download and launch the signed installer
Invoke-WebRequest https://github.com/FastFlowLM/FastFlowLM/releases/latest/download/flm-setup.exe `
  -OutFile flm-setup.exe
Start-Process .\flm-setup.exe -Wait

# Pull a model, then chat with a 128k-token (131072) context window
flm pull llama3.2:3b
flm run llama3.2:3b --ctx-len 131072
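
To try the drop-in API, here is a minimal sketch of a non-streaming chat call in PowerShell. It assumes the server has been started (e.g. via a serve command such as flm serve llama3.2:3b; check flm --help for the exact form) and that it listens on port 11434 with the standard /v1 paths; verify both against what the server prints on startup.

# Minimal sketch: one chat completion over the OpenAI-compatible API.
# Port 11434 and the /v1/chat/completions path are assumptions.
$body = @{
    model    = "llama3.2:3b"
    messages = @(@{ role = "user"; content = "Summarize NPU tiling in one sentence." })
} | ConvertTo-Json -Depth 5

$resp = Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" `
    -Method Post -ContentType "application/json" -Body $body

$resp.choices[0].message.content

Because the wire format matches OpenAI's, pointing an existing SDK at the local base URL should be all the migration needed.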
Models
One runtime, every Ryzen-ready model
Pull curated FastFlowLM recipes. The runtime streams tokens via an OpenAI-compatible API, so existing apps work without rewrites.
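
As a hedged sketch of that streaming path: the snippet below requests server-sent events with the standard OpenAI stream flag, using curl.exe (bundled with Windows 10+; bare curl may alias Invoke-WebRequest, which buffers). The port and the flag's support are assumptions, not confirmed FLM specifics.

# Write the JSON body to a file to sidestep PowerShell's native-command
# quoting rules, then stream unbuffered with curl.exe.
@{ model = "llama3.2:3b"
   messages = @(@{ role = "user"; content = "Say hi" })
   stream = $true } | ConvertTo-Json -Depth 5 -Compress | Set-Content body.json
curl.exe -sN -X POST http://localhost:11434/v1/chat/completions `
  -H "Content-Type: application/json" -d '@body.json'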
Flagship reasoning
GPT-OSS · DeepSeek-R1 · Qwen3
Optimized kernels with smart context reuse.
Vision & speech
Gemma 3 · Qwen3-VL · Whisper
VLM and audio pipelines run on the NPU, enabling private multimodal assistants.
Local, private edge database
Retrieval-Augmented Generation (RAG) · Embedding Model
Build and run a complete RAG workflow fully on the NPU, without relying on the CPU or GPU.
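
As an illustration of the retrieval half of that workflow, the sketch below embeds documents through an OpenAI-style /v1/embeddings endpoint and ranks them by cosine similarity. The endpoint path, port, and the nomic-embed model tag are placeholders for illustration; substitute whatever embedding model you have pulled.

# Hypothetical retrieval sketch; endpoint, port, and model tag are
# assumptions, not confirmed FLM specifics.
function Get-Embedding($text) {
    $body = @{ model = "nomic-embed"; input = $text } | ConvertTo-Json
    $r = Invoke-RestMethod -Uri "http://localhost:11434/v1/embeddings" `
        -Method Post -ContentType "application/json" -Body $body
    return $r.data[0].embedding
}

# Plain cosine similarity over two equal-length vectors.
function Get-Cosine($a, $b) {
    $dot = 0.0; $na = 0.0; $nb = 0.0
    for ($i = 0; $i -lt $a.Count; $i++) {
        $dot += $a[$i] * $b[$i]
        $na  += $a[$i] * $a[$i]
        $nb  += $b[$i] * $b[$i]
    }
    return $dot / ([math]::Sqrt($na) * [math]::Sqrt($nb))
}

$docs = @("NPUs excel at tiled matmuls.", "Croissants are laminated dough.")
$q    = Get-Embedding "How do NPUs compute?"
$docs | Sort-Object { Get-Cosine (Get-Embedding $_) $q } -Descending | Select-Object -First 1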
Benchmarks
Proof on silicon, not slides
FastFlowLM is tuned on real Ryzen™ AI hardware against both synthetic and application-level workloads. Expect a steady 20–80 tok/s depending on model size, at under 2 W of combined CPU+NPU draw, plus deterministic latency for agentic chains. A quick way to reproduce a throughput number follows the stats below.
-
Full-stack telemetry
See exactly where compute goes with NPU, CPU, and memory counters.
-
Scenario-driven suites
Instruction-following, RAG, chat, and multimodal tests with real workloads.
Llama 3.2 1B @ Q4_1 (4-bit with bias)
66 tok/s
Ryzen™ AI 9 HX 350 · ms-level latency
Gemma 3 4B Vision
~3 sec
Image understanding on XDNA2 NPU
Power draw (CPU + NPU)
< 2 W
Full assistant stack vs ~25 W GPU baseline
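
To sanity-check a throughput number on your own machine, a rough sketch: time one completion and divide completion tokens by wall-clock seconds. This assumes the response carries the standard OpenAI usage block and reuses the port assumption from the quickstart example; prompt-processing time is included, so expect a slight undercount versus pure decode speed.

# Rough tok/s estimate: completion tokens / elapsed seconds.
$body = @{
    model    = "llama3.2:1b"
    messages = @(@{ role = "user"; content = "Write 200 words about NPUs." })
} | ConvertTo-Json -Depth 5

$sw = [System.Diagnostics.Stopwatch]::StartNew()
$resp = Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" `
    -Method Post -ContentType "application/json" -Body $body
$sw.Stop()
"{0:N1} tok/s" -f ($resp.usage.completion_tokens / $sw.Elapsed.TotalSeconds)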
Remote test drive
No Ryzen™ AI hardware yet? Launch the hosted FastFlowLM + Open WebUI sandbox and stream from a live AMD Ryzen™ AI box (Kraken Point).
-
Live hardware
Same builds we use internally, refreshed with every release.
-
Guest access
Instant login with rotating demo credentials.