LLMChat: Running LLMs in Your Browser with WebGPU
What if your chat interface could run LLMs without any server at all? LLMChat supports in-browser inference via WebGPU and WASM using Transformers.js. No backend, no API calls, no data leaving your machine.

The Zero-Server Vision
In Part 1, we built a chat interface that connects to a self-hosted LLM server. In Parts 2 and 3, we added RAG, vision, and web search. All of this requires a FastAPI backend and, more importantly, a model-serving engine running somewhere.
But what if you don't have a GPU server running? What if you're on a laptop at a coffee shop? What if you want to chat with an LLM and have absolutely zero data leave your machine, not even to localhost?
That's the promise of in-browser LLM inference. And with WebGPU maturing in modern browsers and HuggingFace's Transformers.js library, it's finally practical for small models.
The Privacy Spectrum
Think of it as a spectrum of privacy vs. performance:
| Level | How it works | Data leaves your machine? |
|---|---|---|
| Cloud APIs (OpenAI, Claude) | Prompts sent to remote servers | Yes, to the provider |
| Self-hosted server (vLLM, Ollama) | Model runs on your hardware, API on localhost | No (stays on your machine) |
| Browser inference (WebGPU/WASM) | Model runs inside the browser tab | No (stays in the browser) |
Browser inference is the ultimate privacy play. The model weights are downloaded once from a CDN and cached in the browser. After that, everything happens inside your browser process. Inference, tokenization, generation, all of it. No network requests, no servers, no logs.
How Browser LLM Inference Works
WebGPU vs WASM
LLMChat supports two compute backends for in-browser inference:
| Backend | How it works | Speed | Compatibility |
|---|---|---|---|
| WebGPU | Uses your GPU directly from the browser | Fast | Chrome/Edge 113+, experimental in Firefox |
| WASM | Runs on CPU via WebAssembly | Slower | All modern browsers |
WebGPU is the newer, faster option. It gives browser JavaScript access to GPU compute, similar to how WebGL gives access to GPU rendering. WASM (WebAssembly) is the fallback for browsers that don't support WebGPU yet, running everything on the CPU.
LLMChat auto-detects which backend to use:
```javascript
async function isWebGPUAvailable() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```
Transformers.js: The Engine
The actual inference is powered by HuggingFace's Transformers.js, a JavaScript port of the Transformers library that runs ONNX-exported models in the browser:
```javascript
import { pipeline, env, TextStreamer } from
  'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.2.4';

// No local model files — everything comes from the CDN
env.allowLocalModels = false;

// Disable multi-threading to avoid SecurityError with CDN scripts
env.backends.onnx.wasm.numThreads = 1;
```
The library downloads model weights from HuggingFace's CDN on first use and caches them in the browser's storage. Subsequent loads use the cache; no re-download needed.
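You can inspect what has already been cached via the browser's Cache Storage API. A hedged sketch (the helper and parameter names are mine; the default cache name shown matches Transformers.js's `env.cacheName` at time of writing, but verify against your version):

```javascript
// List URLs already cached by Transformers.js (hypothetical helper).
// Assumption: the library's default Cache Storage bucket is named
// 'transformers-cache' — check env.cacheName in your version.
async function listCachedFiles(cacheStorage, cacheName = 'transformers-cache') {
  const cache = await cacheStorage.open(cacheName);
  const requests = await cache.keys();
  return requests.map((req) => req.url);
}

// In the browser: listCachedFiles(caches).then(console.log);
```

Taking the CacheStorage object as a parameter keeps the helper usable from both the window and the worker scope.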
The Models You Can Run
LLMChat comes pre-configured with 5 local models, split between WASM and WebGPU:
| Model | Backend | Icon | Notes |
|---|---|---|---|
| TinyLlama 1.1B | WASM | 💻 | Lightweight, fast, good for quick tests |
| Phi-3 mini 4K | WASM | 💻 | Better quality, larger download |
| Llama 3.2 1B | WebGPU (q4) | 🚀 | Best quality-to-speed ratio |
| SmolLM2 360M | WebGPU (q4) | 🚀 | Ultra-fast, tiny model |
| Gemma 3 1B | WebGPU (q4) | 🚀 | Google's latest compact model |
In the model dropdown, local models are clearly labeled with 💻 (WASM) or 🚀 (WebGPU) icons so you know what you're getting:
```javascript
// WASM models
const localOpt = document.createElement('option');
localOpt.value = "Local: Xenova/TinyLlama-1.1B-Chat-v1.0";
localOpt.textContent = "💻 Local: TinyLlama-1.1B-Chat (Browser)";

// WebGPU models
const localOptGPU = document.createElement('option');
localOptGPU.value = "Local: onnx-community/Llama-3.2-1B-Instruct-ONNX";
localOptGPU.textContent = "🚀 Local: Llama-3.2-1B (WebGPU)";
```
The WebGPU models use q4 quantization: 4-bit weights that are roughly 4x smaller than fp16. This is critical for browser inference because you're limited by browser memory and download bandwidth. A 1B-parameter model at q4 weighs around 700MB, versus roughly 2GB at fp16.
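The arithmetic is easy to sanity-check. A quick sketch (the helper is hypothetical, not from the LLMChat codebase; real q4 files run larger than the raw estimate because quantization scales and embedding tables add overhead):

```javascript
// Back-of-envelope estimate: raw weight storage in bytes for a given
// parameter count and bits per weight. Ignores quantization metadata.
function estimateWeightBytes(numParams, bitsPerWeight) {
  return Math.round(numParams * bitsPerWeight / 8);
}

const q4 = estimateWeightBytes(1e9, 4);    // 500,000,000 bytes ≈ 0.5 GB
const fp16 = estimateWeightBytes(1e9, 16); // 2,000,000,000 bytes ≈ 2 GB
console.log({ q4, fp16, ratio: fp16 / q4 }); // ratio: 4
```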

Models That Didn't Make It
Not everything works in the browser. I tried and removed:
- Qwen models: Incompatible ONNX file structure
- OpenELM-270M: Produces only empty tokens
- LFM2.5 (LiquidAI): "Unsupported model type: lfm2" error in Transformers.js
Browser inference is still early. The model zoo is growing, but compatibility varies.
The WebWorker Architecture
Running inference on the main thread would freeze the entire UI. LLMChat solves this with a WebWorker, a background thread that handles all model loading and generation:
```javascript
const worker = new Worker("/static/worker.js", { type: "module" });
```
The worker communicates with the main thread via message passing:
```
Main Thread                          WebWorker
     |                                   |
     |-- { type: 'load', data } -------> | Load model
     |                                   |
     |<-- { status: 'loading' } -------- | "Loading TinyLlama..."
     |<-- { status: 'progress' } ------- | "Downloading: 45%"
     |<-- { status: 'ready' } ---------- | Model loaded
     |                                   |
     |-- { type: 'generate' } ---------> | Start generation
     |                                   |
     |<-- { status: 'update' } --------- | Token: "Hello"
     |<-- { status: 'update' } --------- | Token: " world"
     |<-- { status: 'update' } --------- | Token: "!"
     |<-- { status: 'complete' } ------- | Done
     |                                   |
     |-- { type: 'interrupt' } --------> | Stop!
```
The PipelineSingleton
The worker uses a singleton pattern to cache loaded models. You don't want to re-initialize a 700MB model every time you send a message:
```javascript
class PipelineSingleton {
  static task = 'text-generation';
  static model = null;
  static instance = null;
  static loading = false;
  static webgpuAvailable = null;

  static async getInstance(progress_callback = null, model_id = null) {
    // Check WebGPU availability once
    if (this.webgpuAvailable === null) {
      this.webgpuAvailable = await isWebGPUAvailable();
    }

    // If switching models, reset the instance
    if (model_id && this.model !== model_id) {
      this.instance = null;
      this.model = model_id;
    }

    // Return cached instance if available
    if (this.instance && typeof this.instance.then !== 'function') {
      return this.instance;
    }

    // Determine backend based on model
    const useWebGPU = this.webgpuAvailable &&
      WEBGPU_MODELS.some(m => this.model.includes(m));
    const device = useWebGPU ? 'webgpu' : undefined;

    const pipelineOptions = { progress_callback };
    if (device) {
      pipelineOptions.device = device;
      pipelineOptions.dtype = 'q4'; // 4-bit quantization for WebGPU
    }

    this.instance = await pipeline(this.task, this.model, pipelineOptions);
    return this.instance;
  }
}
```
Key decisions:
- Lazy loading: Models are loaded on first use, not at page load
- Model switching: Switching to a different local model resets the instance and loads the new one
- Backend auto-selection: WebGPU models get `device: 'webgpu'` and `dtype: 'q4'`; WASM models get default settings
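That last decision can be isolated into a small pure function. A sketch under assumptions (the helper name and the exact list entries are mine; LLMChat's worker keeps a similar `WEBGPU_MODELS` substring list):

```javascript
// Hypothetical substring list of models that should run on WebGPU.
const WEBGPU_MODELS = ['Llama-3.2-1B', 'SmolLM2-360M', 'gemma-3'];

// Mirrors the backend-selection logic: WebGPU models get an explicit
// device and 4-bit dtype; WASM models fall back to the defaults.
function selectPipelineOptions(modelId, webgpuAvailable) {
  const useWebGPU = webgpuAvailable &&
    WEBGPU_MODELS.some(m => modelId.includes(m));
  return useWebGPU ? { device: 'webgpu', dtype: 'q4' } : {};
}

selectPipelineOptions('onnx-community/Llama-3.2-1B-Instruct-ONNX', true);
// → { device: 'webgpu', dtype: 'q4' }
selectPipelineOptions('Xenova/TinyLlama-1.1B-Chat-v1.0', true);
// → {}
```

Keeping the decision pure makes it easy to unit-test without touching the GPU or the network.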
Streaming Tokens
Just like the server-side streaming, local inference streams tokens one at a time using TextStreamer:
```javascript
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (token) => {
    if (stopping) {
      throw new Error("Generation interrupted");
    }
    self.postMessage({
      status: 'update',
      output: token,
      delta: true
    });
  }
});

const output = await generator(text, {
  max_new_tokens: 2048,
  temperature: 0.7,
  do_sample: true,
  top_k: 20,
  streamer: streamer,
});
```
Each token is sent to the main thread via postMessage, where it's appended to the chat bubble and rendered as markdown in real time. The experience feels the same as server-side streaming, just slower.
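On the receiving side, the main thread folds those messages into the visible assistant reply. A minimal sketch of that accumulation (the reducer shape and names are mine, not LLMChat's; the real handler also re-renders the accumulated text as markdown on each update):

```javascript
// Hypothetical reducer: fold one worker message into the reply state.
function applyWorkerMessage(state, msg) {
  switch (msg.status) {
    case 'update':   // one streamed token
      return { ...state, text: state.text + msg.output };
    case 'complete': // generation finished
      return { ...state, done: true };
    default:
      return state;
  }
}

let state = { text: '', done: false };
for (const msg of [
  { status: 'update', output: 'Hello', delta: true },
  { status: 'update', output: ' world', delta: true },
  { status: 'complete' },
]) {
  state = applyWorkerMessage(state, msg);
}
// state is now { text: 'Hello world', done: true }
```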
Interrupt Handling
Users might want to stop generation mid-stream. The worker supports this through a flag:
```javascript
let stopping = false;

self.addEventListener('message', async (event) => {
  const { type } = event.data;
  switch (type) {
    case 'interrupt':
      stopping = true;
      break;
  }
});
```
When the interrupt message arrives, the TextStreamer callback throws an error that cleanly exits the generation loop. The main thread switches the "Stop" button back to "Send".
Chat Template Handling
Different models expect different prompt formats. The frontend handles this with a buildPrompt() function that formats the user's message according to the model's expected template:
```javascript
// Special-token literals for each family's published chat template.
// (The original source defines these elsewhere; values shown for clarity.)
const systemTag = "<|system|>";               // TinyLlama / Zephyr format
const userTag = "<|user|>";
const assistantTag = "<|assistant|>";
const endTag = "</s>";
const bosToken = "<|begin_of_text|>";         // Llama 3.x tokens
const headerStart = "<|start_header_id|>";
const headerEnd = "<|end_header_id|>";
const eot = "<|eot_id|>";

function buildPrompt(modelId, userText) {
  if (modelId.includes("TinyLlama")) {
    // TinyLlama uses system/user/assistant tags
    return systemTag + "\nYou are a friendly assistant.\n" + endTag
      + "\n" + userTag + "\n" + userText + "\n" + endTag
      + "\n" + assistantTag + "\n";
  } else if (modelId.includes("Llama-3")) {
    // Llama 3.x uses begin_of_text and header tokens
    return bosToken + headerStart + "system" + headerEnd
      + "\n\nYou are a helpful assistant." + eot
      + headerStart + "user" + headerEnd
      + "\n\n" + userText + eot
      + headerStart + "assistant" + headerEnd + "\n\n";
  } else if (modelId.toLowerCase().includes("gemma")) {
    // Gemma uses start/end_of_turn markers
    return "<start_of_turn>user\n" + userText
      + "<end_of_turn>\n<start_of_turn>model\n";
  }
  // ... more templates for Qwen, Phi-3, SmolLM2, etc.
}
```
Getting the template wrong means garbled output. Each model family has its own special tokens, and the frontend needs to know which ones to use. This is one of the less glamorous parts of browser inference. Server-side LLM serving engines handle chat templates automatically, but in the browser, we're on our own.
Cross-Origin Isolation Headers
There's an annoying but necessary technical detail for browser inference: Cross-Origin Isolation headers.
WASM-based inference can use multi-threading via SharedArrayBuffer, but browsers require specific security headers for this:
```python
@app.middleware("http")
async def add_isolation_headers(request: Request, call_next):
    response = await call_next(request)
    response.headers["Cross-Origin-Opener-Policy"] = "same-origin"
    response.headers["Cross-Origin-Embedder-Policy"] = "require-corp"
    return response
```
These headers tell the browser: "This page is isolated. Don't share memory with other origins." Without them, SharedArrayBuffer is disabled, and WASM multi-threading falls back to single-threaded mode.
We disable multi-threading anyway (`env.backends.onnx.wasm.numThreads = 1`) because of `SecurityError` issues with CDN-hosted worker scripts. But the headers are still set for future compatibility when multi-threading matures.
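Whether the page actually got isolation can be feature-detected at runtime. A small sketch (the helper name is mine; `crossOriginIsolated` and `SharedArrayBuffer` are standard browser globals):

```javascript
// Returns true only when the page is cross-origin isolated AND the engine
// exposes SharedArrayBuffer — the two preconditions for WASM threads.
// Takes the global scope as a parameter so it works in window or worker.
function sharedMemoryAvailable(scope) {
  return scope.crossOriginIsolated === true &&
    typeof scope.SharedArrayBuffer === 'function';
}

// In the browser: sharedMemoryAvailable(self)
```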
Performance Reality Check
Let's be honest about what browser inference can and can't do today.
What Works
- SmolLM2 360M (WebGPU): Fast enough for simple questions. Responses in 5-15 seconds.
- TinyLlama 1.1B (WASM): Usable for quick tests. Slower but works on any browser.
- Gemma 3 1B (WebGPU): Decent quality for a browser model. Good for summarization tasks.
- Short responses: Anything under 100 tokens is practical for all models.
What Doesn't (Yet)
- Long-form generation: Generating 500+ tokens takes minutes, not seconds.
- Complex reasoning: 1B models can't do what 7B+ models do on a server.
- Mobile devices: Browser inference on phones is painfully slow and drains battery.
- First load: Downloading 700MB of model weights takes time (cached after first load).
- Memory pressure: Running a 1B model uses 1-2GB of browser memory. Tabs may crash on low-RAM devices.
When To Use Browser Inference
✅ Quick questions when your server is offline
✅ Privacy-sensitive prompts you don't want on any server
✅ Testing and experimentation
✅ Demonstrating LLM capabilities without infrastructure
❌ Production workloads
❌ Long conversations with history
❌ Anything requiring high-quality output
❌ Mobile or low-spec devices

The Full Picture
Step back and look at what LLMChat offers as a complete system:
Three tiers of inference, one interface. You pick the point on the privacy/performance tradeoff that works for your situation:
- Need the best quality? Point at GPT-4 or Claude.
- Want control and good quality? Run Qwen or Gemma on your vLLM server.
- Need absolute privacy? Use browser inference with SmolLM2 or Llama 3.2.
The interface stays the same. Markdown rendering, streaming, chat history, the Send/Stop button, all of it works identically regardless of where the model runs.
What's Next
In Part 5, we'll cover deployment: exposing LLMChat to the internet with Cloudflared, hardening it for production use, and tying the entire self-hosted AI ecosystem together.
P.S.: If you're testing WebGPU inference and it feels slow, try SmolLM2 360M first. It's the smallest and fastest model in the list. If even that feels too slow, your browser might not have WebGPU enabled. Check chrome://gpu in Chrome to verify. And remember: the first load downloads the model, but subsequent loads use the browser cache. Give it a second chance 🚀.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.