LLMChat: Running LLMs in Your Browser with WebGPU
What if your chat interface could run LLMs without any server at all? LLMChat supports in-browser inference via WebGPU and WASM using Transformers.js. No backend, no API calls, no data leaving your machine.

The Zero-Server Vision
In Part 1, we built a chat interface that connects to a self-hosted LLM server. In Parts 2 and 3, we added RAG, vision, and web search. All of this requires a FastAPI backend and, more importantly, a model-serving engine running somewhere.
But what if you don't have a GPU server running? What if you're on a laptop at a coffee shop? What if you want to chat with an LLM and have absolutely zero data leave your machine, not even to localhost?
That's the promise of in-browser LLM inference. And with WebGPU maturing in modern browsers and HuggingFace's Transformers.js library, it's finally practical for small models.
The Privacy Spectrum
Think of it as a spectrum of privacy vs. performance:
| Level | How it works | Data leaves your machine? |
|---|---|---|
| Cloud APIs (OpenAI, Claude) | Prompts sent to remote servers | Yes, to the provider |
| Self-hosted server (vLLM, Ollama) | Model runs on your hardware, API on localhost | No (stays on your machine) |
| Browser inference (WebGPU/WASM) | Model runs inside the browser tab | No (stays in the browser) |
Browser inference is the ultimate privacy play. The model weights are downloaded once from a CDN and cached in the browser. After that, everything happens inside your browser process. Inference, tokenization, generation, all of it. No network requests, no servers, no logs.
How Browser LLM Inference Works
WebGPU vs WASM
LLMChat supports two compute backends for in-browser inference:
| Backend | How it works | Speed | Compatibility |
|---|---|---|---|
| WebGPU | Uses your GPU directly from the browser | Fast | Chrome/Edge 113+, experimental in Firefox |
| WASM | Runs on CPU via WebAssembly | Slower | All modern browsers |
WebGPU is the newer, faster option. It gives browser JavaScript access to GPU compute, similar to how WebGL gives access to GPU rendering. WASM (WebAssembly) is the fallback for browsers that don't support WebGPU yet, running everything on the CPU.
LLMChat auto-detects which backend to use:
```javascript
async function isWebGPUAvailable() {
  if (typeof navigator === 'undefined' || !navigator.gpu) return false;
  try {
    const adapter = await navigator.gpu.requestAdapter();
    return adapter !== null;
  } catch {
    return false;
  }
}
```
Transformers.js: The Engine
The actual inference is powered by HuggingFace's Transformers.js, a JavaScript port of the Transformers library that runs ONNX-exported models in the browser:
```javascript
import { pipeline, env, TextStreamer } from
  'https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.2.4';

// No local model files — everything comes from the CDN
env.allowLocalModels = false;

// Disable multi-threading to avoid SecurityError with CDN scripts
env.backends.onnx.wasm.numThreads = 1;
```
The library downloads model weights from HuggingFace's CDN on first use and caches them in the browser's storage. Subsequent loads use the cache; no re-download needed.
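You can inspect what has already been cached via the browser's Cache Storage API. A hedged sketch (the helper and parameter names are mine; the default cache name shown matches Transformers.js's `env.cacheName` at time of writing, but verify against your version):

```javascript
// List URLs already cached by Transformers.js (hypothetical helper).
// Assumption: the library's default Cache Storage bucket is named
// 'transformers-cache' — check env.cacheName in your version.
async function listCachedFiles(cacheStorage, cacheName = 'transformers-cache') {
  const cache = await cacheStorage.open(cacheName);
  const requests = await cache.keys();
  return requests.map((req) => req.url);
}

// In the browser: listCachedFiles(caches).then(console.log);
```

Taking the CacheStorage object as a parameter keeps the helper usable from both the window and the worker scope.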
The Models You Can Run
LLMChat comes pre-configured with 5 local models, split between WASM and WebGPU:
| Model | Backend | Icon | Notes |
|---|---|---|---|
| TinyLlama 1.1B | WASM | 💻 | Lightweight, fast, good for quick tests |
| Phi-3 mini 4K | WASM | 💻 | Better quality, larger download |
| Llama 3.2 1B | WebGPU (q4) | 🚀 | Best quality-to-speed ratio |
| SmolLM2 360M | WebGPU (q4) | 🚀 | Ultra-fast, tiny model |
| Gemma 3 1B | WebGPU (q4) | 🚀 | Google's latest compact model |
In the model dropdown, local models are clearly labeled with 💻 (WASM) or 🚀 (WebGPU) icons so you know what you're getting:
```javascript
// WASM models
const localOpt = document.createElement('option');
localOpt.value = "Local: Xenova/TinyLlama-1.1B-Chat-v1.0";
localOpt.textContent = "💻 Local: TinyLlama-1.1B-Chat (Browser)";

// WebGPU models
const localOptGPU = document.createElement('option');
localOptGPU.value = "Local: onnx-community/Llama-3.2-1B-Instruct-ONNX";
localOptGPU.textContent = "🚀 Local: Llama-3.2-1B (WebGPU)";
```
The WebGPU models use q4 quantization: 4-bit weights that are roughly 4x smaller than fp16. This is critical for browser inference because you're limited by browser memory and download bandwidth. A 1B-parameter model at q4 weighs around 700MB, versus roughly 2GB at fp16.
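The arithmetic is easy to sanity-check. A quick sketch (the helper is hypothetical, not from the LLMChat codebase; real q4 files run larger than the raw estimate because quantization scales and embedding tables add overhead):

```javascript
// Back-of-envelope estimate: raw weight storage in bytes for a given
// parameter count and bits per weight. Ignores quantization metadata.
function estimateWeightBytes(numParams, bitsPerWeight) {
  return Math.round(numParams * bitsPerWeight / 8);
}

const q4 = estimateWeightBytes(1e9, 4);    // 500,000,000 bytes ≈ 0.5 GB
const fp16 = estimateWeightBytes(1e9, 16); // 2,000,000,000 bytes ≈ 2 GB
console.log({ q4, fp16, ratio: fp16 / q4 }); // ratio: 4
```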

Models That Didn't Make It
Not everything works in the browser. I tried and removed:
- Qwen models: Incompatible ONNX file structure
- OpenELM-270M: Produces only empty tokens
- LFM2.5 (LiquidAI): "Unsupported model type: lfm2" error in Transformers.js
Browser inference is still early. The model zoo is growing, but compatibility varies.
The WebWorker Architecture
Running inference on the main thread would freeze the entire UI. LLMChat solves this with a WebWorker, a background thread that handles all model loading and generation:
```javascript
const worker = new Worker("/static/worker.js", { type: "module" });
```
The worker communicates with the main thread via message passing:
```
Main Thread                          WebWorker
     |                                   |
     |-- { type: 'load', data } -------> | Load model
     |                                   |
     |<-- { status: 'loading' } -------- | "Loading TinyLlama..."
     |<-- { status: 'progress' } ------- | "Downloading: 45%"
     |<-- { status: 'ready' } ---------- | Model loaded
     |                                   |
     |-- { type: 'generate' } ---------> | Start generation
     |                                   |
     |<-- { status: 'update' } --------- | Token: "Hello"
     |<-- { status: 'update' } --------- | Token: " world"
     |<-- { status: 'update' } --------- | Token: "!"
     |<-- { status: 'complete' } ------- | Done
     |                                   |
     |-- { type: 'interrupt' } --------> | Stop!
```
The PipelineSingleton
The worker uses a singleton pattern to cache loaded models. You don't want to re-initialize a 700MB model every time you send a message:
```javascript
class PipelineSingleton {
  static task = 'text-generation';
  static model = null;
  static instance = null;
  static loading = false;
  static webgpuAvailable = null;

  static async getInstance(progress_callback = null, model_id = null) {
    // Check WebGPU availability once
    if (this.webgpuAvailable === null) {
      this.webgpuAvailable = await isWebGPUAvailable();
    }

    // If switching models, reset the instance
    if (model_id && this.model !== model_id) {
      this.instance = null;
      this.model = model_id;
    }

    // Return cached instance if available
    if (this.instance && typeof this.instance.then !== 'function') {
      return this.instance;
    }

    // Determine backend based on model
    const useWebGPU = this.webgpuAvailable &&
      WEBGPU_MODELS.some(m => this.model.includes(m));
    const device = useWebGPU ? 'webgpu' : undefined;

    const pipelineOptions = { progress_callback };
    if (device) {
      pipelineOptions.device = device;
      pipelineOptions.dtype = 'q4'; // 4-bit quantization for WebGPU
    }

    this.instance = await pipeline(this.task, this.model, pipelineOptions);
    return this.instance;
  }
}
```
Key decisions:
- Lazy loading: Models are loaded on first use, not at page load
- Model switching: Switching to a different local model resets the instance and loads the new one
- Backend auto-selection: WebGPU models get `device: 'webgpu'` and `dtype: 'q4'`; WASM models get default settings
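That last decision can be isolated into a small pure function. A sketch under assumptions (the helper name and the exact list entries are mine; LLMChat's worker keeps a similar `WEBGPU_MODELS` substring list):

```javascript
// Hypothetical substring list of models that should run on WebGPU.
const WEBGPU_MODELS = ['Llama-3.2-1B', 'SmolLM2-360M', 'gemma-3'];

// Mirrors the backend-selection logic: WebGPU models get an explicit
// device and 4-bit dtype; WASM models fall back to the defaults.
function selectPipelineOptions(modelId, webgpuAvailable) {
  const useWebGPU = webgpuAvailable &&
    WEBGPU_MODELS.some(m => modelId.includes(m));
  return useWebGPU ? { device: 'webgpu', dtype: 'q4' } : {};
}

selectPipelineOptions('onnx-community/Llama-3.2-1B-Instruct-ONNX', true);
// → { device: 'webgpu', dtype: 'q4' }
selectPipelineOptions('Xenova/TinyLlama-1.1B-Chat-v1.0', true);
// → {}
```

Keeping the decision pure makes it easy to unit-test without touching the GPU or the network.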
Streaming Tokens
Just like the server-side streaming, local inference streams tokens one at a time using TextStreamer:
```javascript
const streamer = new TextStreamer(generator.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
  callback_function: (token) => {
    if (stopping) {
      throw new Error("Generation interrupted");
    }
    self.postMessage({
      status: 'update',
      output: token,
      delta: true
    });
  }
});

const output = await generator(text, {
  max_new_tokens: 2048,
  temperature: 0.7,
  do_sample: true,
  top_k: 20,
  streamer: streamer,
});
```
Each token is sent to the main thread via postMessage, where it's appended to the chat bubble and rendered as markdown in real time. The experience feels the same as server-side streaming, just slower.
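On the receiving side, the main thread folds those messages into the visible assistant reply. A minimal sketch of that accumulation (the reducer shape and names are mine, not LLMChat's; the real handler also re-renders the accumulated text as markdown on each update):

```javascript
// Hypothetical reducer: fold one worker message into the reply state.
function applyWorkerMessage(state, msg) {
  switch (msg.status) {
    case 'update':   // one streamed token
      return { ...state, text: state.text + msg.output };
    case 'complete': // generation finished
      return { ...state, done: true };
    default:
      return state;
  }
}

let state = { text: '', done: false };
for (const msg of [
  { status: 'update', output: 'Hello', delta: true },
  { status: 'update', output: ' world', delta: true },
  { status: 'complete' },
]) {
  state = applyWorkerMessage(state, msg);
}
// state is now { text: 'Hello world', done: true }
```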
Interrupt Handling
Users might want to stop generation mid-stream. The worker supports this through a flag:
```javascript
let stopping = false;

self.addEventListener('message', async (event) => {
  const { type } = event.data;
  switch (type) {
    case 'interrupt':
      stopping = true;
      break;
  }
});
```
When the interrupt message arrives, the TextStreamer callback throws an error that cleanly exits the generation loop. The main thread switches the "Stop" button back to "Send".
Chat Template Handling
Different models expect different prompt formats. The frontend handles this with a buildPrompt() function that formats the user's message according to the model's expected template:
```javascript
// Special-token literals for each family's published chat template.
// (The original source defines these elsewhere; values shown for clarity.)
const systemTag = "<|system|>";               // TinyLlama / Zephyr format
const userTag = "<|user|>";
const assistantTag = "<|assistant|>";
const endTag = "</s>";
const bosToken = "<|begin_of_text|>";         // Llama 3.x tokens
const headerStart = "<|start_header_id|>";
const headerEnd = "<|end_header_id|>";
const eot = "<|eot_id|>";

function buildPrompt(modelId, userText) {
  if (modelId.includes("TinyLlama")) {
    // TinyLlama uses system/user/assistant tags
    return systemTag + "\nYou are a friendly assistant.\n" + endTag
      + "\n" + userTag + "\n" + userText + "\n" + endTag
      + "\n" + assistantTag + "\n";
  } else if (modelId.includes("Llama-3")) {
    // Llama 3.x uses begin_of_text and header tokens
    return bosToken + headerStart + "system" + headerEnd
      + "\n\nYou are a helpful assistant." + eot
      + headerStart + "user" + headerEnd
      + "\n\n" + userText + eot
      + headerStart + "assistant" + headerEnd + "\n\n";
  } else if (modelId.toLowerCase().includes("gemma")) {
    // Gemma uses start/end_of_turn markers
    return "<start_of_turn>user\n" + userText
      + "<end_of_turn>\n<start_of_turn>model\n";
  }
  // ... more templates for Qwen, Phi-3, SmolLM2, etc.
}
```
Getting the template wrong means garbled output. Each model family has its own special tokens, and the frontend needs to know which ones to use. This is one of the less glamorous parts of browser inference. Server-side LLM serving engines handle chat templates automatically, but in the browser, we're on our own.
Cross-Origin Isolation Headers
There's an annoying but necessary technical detail for browser inference: Cross-Origin Isolation headers.
WASM-based inference can use multi-threading via SharedArrayBuffer, but browsers require specific security headers for this:
```python
@app.middleware("http")
async def add_isolation_headers(request: Request, call_next):
    response = await call_next(request)
    response.headers["Cross-Origin-Opener-Policy"] = "same-origin"
    response.headers["Cross-Origin-Embedder-Policy"] = "require-corp"
    return response
```
These headers tell the browser: "This page is isolated. Don't share memory with other origins." Without them, SharedArrayBuffer is disabled, and WASM multi-threading falls back to single-threaded mode.
We disable multi-threading anyway (`env.backends.onnx.wasm.numThreads = 1`) because of `SecurityError` issues with CDN-hosted worker scripts. But the headers are still set for future compatibility when multi-threading matures.
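Whether the page actually got isolation can be feature-detected at runtime. A small sketch (the helper name is mine; `crossOriginIsolated` and `SharedArrayBuffer` are standard browser globals):

```javascript
// Returns true only when the page is cross-origin isolated AND the engine
// exposes SharedArrayBuffer — the two preconditions for WASM threads.
// Takes the global scope as a parameter so it works in window or worker.
function sharedMemoryAvailable(scope) {
  return scope.crossOriginIsolated === true &&
    typeof scope.SharedArrayBuffer === 'function';
}

// In the browser: sharedMemoryAvailable(self)
```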
Performance Reality Check
Let's be honest about what browser inference can and can't do today.
What Works
- SmolLM2 360M (WebGPU): Fast enough for simple questions. Responses in 5-15 seconds.
- TinyLlama 1.1B (WASM): Usable for quick tests. Slower but works on any browser.
- Gemma 3 1B (WebGPU): Decent quality for a browser model. Good for summarization tasks.
- Short responses: Anything under 100 tokens is practical for all models.
What Doesn't (Yet)
- Long-form generation: Generating 500+ tokens takes minutes, not seconds.
- Complex reasoning: 1B models can't do what 7B+ models do on a server.
- Mobile devices: Browser inference on phones is painfully slow and drains battery.
- First load: Downloading 700MB of model weights takes time (cached after first load).
- Memory pressure: Running a 1B model uses 1-2GB of browser memory. Tabs may crash on low-RAM devices.
When To Use Browser Inference
✅ Quick questions when your server is offline
✅ Privacy-sensitive prompts you don't want on any server
✅ Testing and experimentation
✅ Demonstrating LLM capabilities without infrastructure
❌ Production workloads
❌ Long conversations with history
❌ Anything requiring high-quality output
❌ Mobile or low-spec devices

The Full Picture
Step back and look at what LLMChat offers as a complete system:
Three tiers of inference, one interface. You pick the point on the privacy/performance tradeoff that works for your situation:
- Need the best quality? Point at GPT-4 or Claude.
- Want control and good quality? Run Qwen or Gemma on your vLLM server.
- Need absolute privacy? Use browser inference with SmolLM2 or Llama 3.2.
The interface stays the same. Markdown rendering, streaming, chat history, the Send/Stop button, all of it works identically regardless of where the model runs.
What's Next
In Part 5, we'll cover deployment: exposing LLMChat to the internet with Cloudflared, hardening it for production use, and tying the entire self-hosted AI ecosystem together.
P.S.: If you're testing WebGPU inference and it feels slow, try SmolLM2 360M first. It's the smallest and fastest model in the list. If even that feels too slow, your browser might not have WebGPU enabled. Check chrome://gpu in Chrome to verify. And remember: the first load downloads the model, but subsequent loads use the browser cache. Give it a second chance 🚀.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.