NPU-first runtime

The fastest, most efficient LLM inference on NPUs

FastFlowLM (FLM) delivers an Ollama-style developer experience optimized for tile-structured NPU accelerators. Install in seconds, stream tokens instantly, and run context windows up to 256k — all with dramatically better efficiency than GPU-first stacks. Our GA release for AMD Ryzen™ AI NPUs is available today, with betas for Qualcomm Snapdragon and Intel Core Ultra coming soon.

  • Runtime size

    ~16 MB

  • Context

    Up to 256k tokens

  • Supported chips

    Ryzen™ AI (Strix, Strix Halo, Kraken)

GPT-OSS on NPU

GPT-OSS-20B streaming fully on the Ryzen™ AI NPU

Runs GPT-OSS-20B at 19 tokens per second (TPS) with 10× the efficiency of a GPU, the fastest MoE on any NPU.

Gemma3 (Vision) on NPU

Gemma3 (Vision) understands and describes images

Understand and describe images instantly — FastFlowLM runs Google Gemma3 fully on the NPU for fast, private, and efficient vision inference.

Whisper on-device

Transcribe and summarize long-form audio locally

Transcribe hours of audio locally. FLM runs OpenAI Whisper fully on the NPU, keeping transcription fast, private, and efficient.

Llama 3.2 on NPU

Interact with Llama 3.2-3B via Open WebUI

Runs Meta Llama 3.2-3B at 28 TPS with more than 10× the efficiency of a GPU, the fastest on any NPU.

Install

From download to first token in under a minute

FastFlowLM ships as a 16 MB runtime with an Ollama-style CLI and a server compatible with the OpenAI API. No drivers, no guesswork—just run the installer, pull a model, and start chatting.

  • Zero-conf installer

    Signed installers for every Ryzen™ AI 300 laptop: download, run, done.

  • Drop-in APIs

    OpenAI-compatible APIs — plug in your existing tools instantly.

  • Secure by default

    On-device security: local tokens and full offline mode.

Quickstart

CLI
# Download and run the signed installer
Invoke-WebRequest https://github.com/FastFlowLM/FastFlowLM/releases/latest/download/flm-setup.exe `
  -OutFile flm-setup.exe
Start-Process .\flm-setup.exe -Wait
# Pull a model, then start an interactive session with a 128k-token context window
flm pull llama3.2:3b
flm run llama3.2:3b --ctx-len 131072
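
Because the server speaks the OpenAI API, any existing client can point at it once a model is running. Here is a minimal PowerShell sketch, assuming the server is reachable locally and exposes the standard /v1/chat/completions route; the base URL, port, and model tag are placeholders, so adjust them to your setup:

$body = @{
    model    = "llama3.2:3b"
    messages = @(@{ role = "user"; content = "Summarize FastFlowLM in one sentence." })
    stream   = $false
} | ConvertTo-Json -Depth 5
# Assumed local endpoint; replace host and port with your FLM server address
Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" `
  -Method Post -ContentType "application/json" -Body $body |
  ForEach-Object { $_.choices[0].message.content }

The response follows the usual chat-completions schema, so OpenAI SDKs and front ends such as Open WebUI can target the same endpoint without changes.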

Models

One runtime, every Ryzen-ready model

Pull curated FastFlowLM recipes. The runtime streams tokens via OpenAI-compatible API, so existing apps work without rewrites.

Flagship reasoning

GPT-OSS · DeepSeek-R1 · Qwen3

Optimized kernels with smart context reuse.

Vision & speech

Gemma3 · Qwen3-VL · Whisper

VLM and audio pipelines run on the NPU, enabling private multimodal assistants.
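
If the endpoint accepts OpenAI-style multimodal messages, an image request could look like the sketch below; the content format, the gemma3:4b tag, and the local address are assumptions for illustration, not documented behavior:

# Sketch only: vision request over the OpenAI-compatible API (format assumed)
$imageB64 = [Convert]::ToBase64String([IO.File]::ReadAllBytes("photo.jpg"))
$body = @{
    model    = "gemma3:4b"   # hypothetical tag; use whatever model you pulled
    messages = @(@{
        role    = "user"
        content = @(
            @{ type = "text"; text = "Describe this image." },
            @{ type = "image_url"; image_url = @{ url = "data:image/jpeg;base64,$imageB64" } }
        )
    })
} | ConvertTo-Json -Depth 8
Invoke-RestMethod -Uri "http://localhost:11434/v1/chat/completions" `
  -Method Post -ContentType "application/json" -Body $body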

Local private edge database

Retrieval-Augmented Generation (RAG) · Embedding Model

Build and run a complete RAG workflow fully on the NPU, without relying on the CPU or GPU.
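
As a rough sketch of how the embedding side could be driven, assuming the server exposes the standard /v1/embeddings route and an embedding model has already been pulled (the model tag, host, and port below are hypothetical placeholders):

$body = @{
    model = "nomic-embed-text"   # hypothetical embedding-model tag
    input = @("FastFlowLM runs LLMs on the NPU.", "Whisper transcribes audio on-device.")
} | ConvertTo-Json -Depth 4
$resp = Invoke-RestMethod -Uri "http://localhost:11434/v1/embeddings" `
  -Method Post -ContentType "application/json" -Body $body
$resp.data | ForEach-Object { $_.embedding.Count }   # one vector per input string

Store the returned vectors in a local index, then feed retrieved passages back into the chat endpoint to close the RAG loop entirely on-device.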

Benchmarks

Proof on silicon, not slides

FastFlowLM is tuned on real Ryzen™ AI hardware with synthetic and application-level workloads. Expect a steady 20–80 tok/s across supported models at under 2 W (CPU + NPU), plus deterministic latency for agentic chains.

  • Full-stack telemetry

    See exactly where compute goes with NPU, CPU, and memory counters.

  • Scenario-driven suites

    Instruction tuning, RAG, chat, and multimodal tests with real workloads.

Llama 3.2 1B @ Q4_1 (4-bit with bias)

66 tok/s

Ryzen™ AI 9 HX 350 · ms-level latency

Gemma 3 4B Vision

~3 sec

Image understanding on XDNA2 NPU

Power draw (CPU + NPU)

< 2 W

Full assistant stack vs ~25 W GPU baseline

Remote test drive

No Ryzen™ AI hardware yet? Launch the hosted FastFlowLM + Open WebUI sandbox and stream from a live AMD Ryzen™ AI box (Kraken Point).

  • Live hardware

    Same builds we use internally, refreshed with every release.

  • Guest access

    Instant login with rotating demo credentials.