LLMChat: Building a Self-Hosted ChatGPT Alternative with FastAPI
New models drop on HuggingFace almost daily. LLMChat is a self-hosted chat interface — plug in any OpenAI-compatible endpoint, pick a model, and start chatting. No data leaves your infrastructure.

Why I Built LLMChat
For learning purposes, mostly. I wanted to understand how tools like ChatGPT actually work under the hood: the streaming, the context management, the way a chat interface stitches together API calls, history, and rendering into something that feels seamless. The best way to understand something is to build it from scratch.
So I started with a simple FastAPI backend, a vanilla JavaScript frontend, and a connection to a local LLM. One weekend project. But then interesting things started happening.
Once I had a working chat interface, I realized it was also a perfect wireframe for testing new models. Open-source models drop on HuggingFace almost daily. Falcon, Ministral, InternVL, Gemma, Nemotron, Qwen, each with different strengths. With LLMChat, I could just swap the endpoint, pick the model from a dropdown, and start chatting. No config files, no rebuilds, just plug and play.
Then the privacy angle became obvious. Every prompt through ChatGPT or Claude goes to someone else's server. With LLMChat pointing at a local vLLM instance, nothing leaves my machine. Not a single token.
And then I kept adding things. RAG for document Q&A, vision model support, web search, even in-browser inference via WebGPU. What started as a learning exercise turned into a genuinely useful tool.
This is Part 1 of a 5-part series where I'll walk through everything I built and why.
If you haven't set up a local LLM serving engine yet, check out my guide on Self-Hosting LLMs with vLLM, SGLang, and Llama.cpp. LLMChat is the frontend to that backend.

What LLMChat Does
Here's the feature set at a glance:
✅ Multi-Provider Support: Connects to any OpenAI-compatible endpoint. Works with vLLM, Ollama, OpenAI, LM Studio, or your own
✅ Streaming Responses: Real-time token streaming via Server-Sent Events (SSE)
✅ Model Dropdown: Switch between models on the fly, no restart needed
✅ Markdown Rendering: Syntax-highlighted code blocks with one-click copy
✅ Thinking Blocks: Parses <think> tags into collapsible "💭 Model's Reasoning" sections
✅ Chat History: Smart pruning with configurable turn limits per model type
✅ Token Stats: Real-time throughput metrics (tokens/sec, completion count, duration)
✅ RAG Pipeline: Upload PDFs and chat with your documents (Part 2)
✅ Web Search: Tavily-powered web search injected into context (Part 3)
✅ Vision Models: Auto-detecting image support with smart fallbacks (Part 3)
✅ Browser Inference: Run models locally in your browser via WebGPU (Part 4)
The key idea is plug and play. New model on HuggingFace? Serve it with vLLM, point LLMChat at the endpoint, and you're testing it in under a minute. No config files to edit, no rebuilds, no waiting.
Architecture
LLMChat has a clean two-layer architecture: a FastAPI backend that handles all the LLM communication, and a vanilla JavaScript frontend that renders the chat UI.
Why Vanilla JS?
You might ask: "Why not React? Why not Next.js?"
Because I didn't need them. The entire frontend is ~1,500 lines across 5 files. No build step, no node_modules, no webpack config, no dependency hell. Just .html, .js, and .css files served by FastAPI.
The JavaScript is organized using the IIFE (Immediately Invoked Function Expression) pattern to avoid polluting the global namespace:
```javascript
// Each module wraps itself in an IIFE
(function () {
  // Private logic here...

  // Expose public API
  window.Render = {
    renderMarkdownInto,
    updateStatsFromStream,
    showImagePreviews,
  };
})();
```
It's old school, it works, and it's fast. When I want to change something, I change a file and refresh the browser. No hot module replacement needed, because a cold refresh takes 200 ms.
Setup in 5 Minutes
Prerequisites
You need Python 3.9+ and an OpenAI-compatible endpoint running somewhere. If you don't have one yet, check out my self-hosting LLMs guide.
Step 1: Clone and Install
```shell
git clone https://github.com/kXborg/LLMChat.git
cd LLMChat
pip install -r app/requirements.txt
```
The dependencies are minimal:
| Package | Purpose |
|---|---|
| fastapi + uvicorn | Web framework and ASGI server |
| openai | OpenAI-compatible API client |
| Pillow | Image compression for uploads |
| python-dotenv | Environment variable loading |
| tavily-python | Web search integration |
| PyMuPDF | PDF text extraction for RAG |
| sentence-transformers | Embedding model for RAG |
| qdrant-client | Vector database for RAG |
Step 2: Configure
Copy the environment template and fill in your endpoint:
```shell
cp app/.env.example app/.env
```

```shell
# app/.env
OPENAI_API_KEY=EMPTY                      # Only needed if your endpoint requires auth
OPENAI_API_BASE=http://localhost:8000/v1  # Your vLLM/Ollama/OpenAI endpoint
TAVILY_API_KEY=your-tavily-api-key-here   # Optional: enables web search
```
That's it. Two required fields: OPENAI_API_KEY (set to EMPTY for local models) and OPENAI_API_BASE (your serving engine's URL).
Step 3: Run
```shell
uvicorn app.main:app --host 0.0.0.0 --port 3000
```
Open http://localhost:3000 in your browser. LLMChat will auto-detect all models available on your endpoint and populate the dropdown. Pick a model and start chatting.

Tested Models
I've been testing LLMChat with a variety of models on my RTX 3080 Ti (12 GB VRAM) using vLLM as the serving engine. Here's what works:
| Model | vLLM Command | Notes |
|---|---|---|
| Falcon3-7B-Instruct-GPTQ-Int4 | vllm serve tiiuae/Falcon3-7B-Instruct-GPTQ-Int4 --max-model-len 4096 --gpu-memory-utilization 0.85 | Solid general-purpose |
| Ministral-3-8B-Instruct-AWQ-4bit | vllm serve cyankiwi/Ministral-3-8B-Instruct-2512-AWQ-4bit --gpu-memory-utilization 0.85 --max-model-len 6144 --max-num-batched-tokens 1024 | Great instruction following |
| InternVL3-8B-AWQ | vllm serve OpenGVLab/InternVL3-8B-AWQ --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code --quantization awq | Vision-capable. Must set --quantization awq |
| InternVL3-2B | vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code | Smaller vision model |
| Nemotron Cascade 8B | vllm serve cyankiwi/Nemotron-Cascade-8B-AWQ-4bit --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code | NVIDIA's reasoning model |
| Qwen2-VL-2B-Instruct | vllm serve Qwen/Qwen2-VL-2B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 | Qwen's vision-language model |
| Gemma 3 4B Instruct (GPTQ) | vllm serve ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g --max-model-len 4096 --max-num-batched-tokens 1024 --gpu-memory-utilization 0.8 | Original OOMs. GPTQ quantized works. Max concurrency: 9 |
The beauty of LLMChat's design is that you don't need to configure any of these models in the frontend. Just serve them with vLLM (or Ollama, or any OpenAI-compatible engine), and LLMChat auto-discovers them via the /models endpoint. Plug and play.
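Under the hood, that discovery is just one call to the /models route. Here's a minimal sketch of the idea, assuming the OpenAI Python SDK's client.models.list() response shape; discover_models is an illustrative name, not LLMChat's actual helper:

```python
def discover_models(client) -> list[str]:
    """Return the sorted model IDs advertised by an OpenAI-compatible /models endpoint."""
    return sorted(model.id for model in client.models.list().data)

# Typical usage against a local vLLM server (assumes the openai package):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   discover_models(client)
```

Because the function only depends on the client's interface, the same call works unchanged whether the endpoint is vLLM, Ollama, or OpenAI itself.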
Confused about hosting LLMs locally? Check out Self-Hosting LLMs: A Guide to vLLM, SGLang, and Llama.cpp for a detailed walkthrough.
Under the Hood: Streaming
Nobody wants to stare at a loading spinner for 10 seconds waiting for a complete response. That's why LLMChat uses Server-Sent Events (SSE) for real-time token streaming.
Here's the core pattern — a Python generator that yields tokens as they arrive from the LLM:
```python
# Excerpt from app/main.py; client, build_messages, and effective_max_tokens
# are defined elsewhere in the module.
@app.post("/chat/stream")
def chat_stream(req: ChatRequest):
    model_name = (req.model or DEFAULT_MODEL).strip()
    messages = build_messages(req.user_id, req.message, req.attachments, model_name)

    def token_generator():
        buffer = []
        start_t = time.perf_counter()

        stream = client.chat.completions.create(
            model=model_name,
            messages=messages,
            max_tokens=effective_max_tokens,
            temperature=0.8,
            stream=True,  # This is the key flag
        )

        for chunk in stream:
            piece = chunk.choices[0].delta.content or ""
            if piece:
                buffer.append(piece)
                yield piece  # Token goes to the browser immediately

        # Append throughput metrics at the end
        final_answer = "".join(buffer)
        dur_ms = int((time.perf_counter() - start_t) * 1000)
        approx_tokens = max(1, len(final_answer) // 4)
        tps = approx_tokens / max(0.001, dur_ms / 1000)
        yield f"\n\n[throughput] duration_ms={dur_ms} tokens_per_sec={tps:.2f}\n"

    return StreamingResponse(token_generator(), media_type="text/plain")
```
On the frontend side, we read the stream chunk by chunk and render markdown in real time:
```javascript
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  renderMarkdownInto(bubble, buffer); // Re-render on every chunk
}
```
The result is a ChatGPT-like experience where text appears token by token. At the end of each response, the backend appends a throughput footer like [throughput] duration_ms=1234 tokens_per_sec=45.67. The frontend strips this out and displays it in a stats bar.

Thinking Block Support
Some models (like DeepSeek, Nemotron, QwQ) wrap their reasoning in <think> tags. LLMChat detects these and renders them as collapsible sections:
```javascript
function extractThinking(rawText) {
  let thinking = '';
  let content = rawText;

  // Pattern: <think>...</think> tags
  const rawThinkRegex = /<think>([\s\S]*?)<\/think>/gi;
  content = content.replace(rawThinkRegex, (match, p1) => {
    thinking += p1.trim();
    return '';
  });

  return { thinking: thinking.trim(), content: content.trim() };
}
```
When thinking content is found, it gets rendered as a <details> element. Collapsed by default, expandable on click:
```html
<details class="thinking-block">
  <summary>💭 Model's Reasoning</summary>
  <div class="thinking-content">...</div>
</details>
```
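If you ever need the same extraction server-side, a Python equivalent is just as short. A sketch mirroring the JS above, not code from the repo:

```python
import re

# Same pattern as the JS extractThinking: non-greedy, case-insensitive <think> tags
THINK_RE = re.compile(r"<think>([\s\S]*?)</think>", re.IGNORECASE)

def extract_thinking(raw_text: str) -> tuple[str, str]:
    """Split a raw response into (thinking, content)."""
    thinking = "".join(part.strip() for part in THINK_RE.findall(raw_text))
    content = THINK_RE.sub("", raw_text)
    return thinking, content.strip()
```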
Under the Hood: Smart Context Management
LLMs have finite context windows. Send too many tokens and you get an error. Most chat interfaces just crash or show a cryptic error message, but LLMChat handles this gracefully.
Auto-Retry with Reduced Tokens
When the backend catches a BadRequestError (context length exceeded), it parses the error message to figure out exactly how many tokens are available:
```python
import re

def parse_allowed_tokens_from_error(msg: str) -> int | None:
    """
    Parse: "This model's maximum context length is 1024 tokens
    and your request has 899 input tokens"
    """
    max_ctx_match = re.search(r"maximum context length is (\d+) tokens", msg)
    input_match = re.search(r"your request has (\d+) input tokens", msg)
    if not max_ctx_match or not input_match:
        return None
    max_ctx = int(max_ctx_match.group(1))
    input_tokens = int(input_match.group(1))
    context_margin = 16  # Safety headroom
    return max(0, max_ctx - input_tokens - context_margin)
```
If parsing succeeds, the request is retried with the exact remaining budget. If parsing fails (different error format), it halves the token count and tries again. Either way, the user gets a response instead of an error.
Truncation Suffix
If the model hits its max_tokens limit mid-response (the dreaded finish_reason: "length"), LLMChat appends a clear indicator:
```python
# TRUNCATION_SUFFIX is "… Would you like me to continue?"
if finish_reason == "length" and TRUNCATION_SUFFIX:
    answer = f"{answer}{TRUNCATION_SUFFIX}"
```
Instead of the response just... stopping mid-sentence, the user gets a clear signal that there's more to say.
History Pruning
Chat history grows with every message. Vision models are especially context-hungry because image tokens are expensive. LLMChat handles this with model-aware pruning:
```python
# Vision models get fewer history turns to manage context
MAX_HISTORY_TURNS_TEXT = 6    # 6 user+assistant pairs for text models
MAX_HISTORY_TURNS_VISION = 2  # Only 2 pairs for vision models
```
The system message is always preserved, and older messages are pruned from the front. The conversation stays coherent while keeping context usage manageable.
The Tech Stack
| Layer | Technology |
|---|---|
| Backend framework | FastAPI + Uvicorn |
| LLM client | openai Python SDK (v1+) |
| RAG embeddings | sentence-transformers (all-MiniLM-L6-v2, 384-dim) |
| RAG vector store | Qdrant (in-memory) |
| PDF extraction | PyMuPDF (fitz) |
| Web search | Tavily API |
| Image processing | Pillow |
| Frontend | Vanilla HTML/CSS/JS (no framework) |
| Markdown rendering | marked.js + DOMPurify + highlight.js |
| Local inference | HuggingFace Transformers.js (WebGPU + WASM) |
What's Next
This post covered the foundation: the core chat interface, streaming, and context management. But LLMChat does a lot more. Here's what's coming in the rest of this series:
- Part 2: Chat with Your PDFs — Adding a RAG Pipeline. Upload PDFs, chunk them, embed them, and chat with your documents using Qdrant and sentence-transformers.
- Part 3: Vision Models, Web Search, and Smart Fallbacks. Auto-detecting vision capability, Tavily web search integration, and graceful fallbacks when models go down.
- Part 4: Running LLMs in Your Browser with WebGPU. Zero-server inference using Transformers.js, WebGPU, and WASM. The privacy endgame.
- Part 5: Deploying Your Self-Hosted AI Stack. Cloudflared tunnels, production hardening, and tying the entire ecosystem together.
Try It Yourself
If you're tired of paying per-token for cloud APIs, or if you just want a fast way to test the latest models as they drop on HuggingFace, give LLMChat a try.
GitHub: https://github.com/kXborg/LLMChat
Quick Start:
```shell
git clone https://github.com/kXborg/LLMChat.git
cd LLMChat
pip install -r app/requirements.txt
# Configure your endpoint in app/.env
uvicorn app.main:app --host 0.0.0.0 --port 3000
```
Point it at your serving engine, pick a model, and start chatting. It's that simple.
Final Thoughts
The open-source LLM ecosystem is moving at an incredible pace: new models, new architectures, and new capabilities every single week. What's missing is a simple way to interact with all of them without being locked into a single provider's interface.
LLMChat is my answer to that gap. It's not trying to compete with ChatGPT's polish or Cursor's agentic features. It's a wireframe. Deliberately simple, deliberately flexible. Swap the model, swap the endpoint, keep the interface.
Your models, your data, your chat. 🚀
P.S.: If you're also interested in LLM-powered code completion (the non-agentic kind), check out CodeContinue. It's a Sublime Text plugin I built for intelligent inline suggestions. Same philosophy: your model, your rules.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.