LLMChat: Building a Self-Hosted ChatGPT Alternative with FastAPI
New models drop on HuggingFace almost daily. LLMChat is a self-hosted chat interface — plug in any OpenAI-compatible endpoint, pick a model, and start chatting. No data leaves your infrastructure.

Why I Built LLMChat
For learning purposes, mostly. I wanted to understand how tools like ChatGPT actually work under the hood: the streaming, the context management, the way a chat interface stitches together API calls, history, and rendering into something that feels seamless. The best way to understand something is to build it from scratch.
So I started with a simple FastAPI backend, a vanilla JavaScript frontend, and a connection to a local LLM. One weekend project. But then interesting things started happening.
Once I had a working chat interface, I realized it was also a perfect wireframe for testing new models. Open-source models drop on HuggingFace almost daily. Falcon, Ministral, InternVL, Gemma, Nemotron, Qwen, each with different strengths. With LLMChat, I could just swap the endpoint, pick the model from a dropdown, and start chatting. No config files, no rebuilds, just plug and play.
Then the privacy angle became obvious. Every prompt through ChatGPT or Claude goes to someone else's server. With LLMChat pointing at a local vLLM instance, nothing leaves my machine. Not a single token.
And then I kept adding things. RAG for document Q&A, vision model support, web search, even in-browser inference via WebGPU. What started as a learning exercise turned into a genuinely useful tool.
This is Part 1 of a 5-part series where I'll walk through everything I built and why.
If you haven't set up a local LLM serving engine yet, check out my guide on Self-Hosting LLMs with vLLM, SGLang, and Llama.cpp. LLMChat is the frontend to that backend.

What LLMChat Does
Here's the feature set at a glance:
✅ Multi-Provider Support: Connects to any OpenAI-compatible endpoint. Works with vLLM, Ollama, OpenAI, LM Studio, or your own
✅ Streaming Responses: Real-time token streaming via Server-Sent Events (SSE)
✅ Model Dropdown: Switch between models on the fly, no restart needed
✅ Markdown Rendering: Syntax-highlighted code blocks with one-click copy
✅ Thinking Blocks: Parses <think> tags into collapsible "💭 Model's Reasoning" sections
✅ Chat History: Smart pruning with configurable turn limits per model type
✅ Token Stats: Real-time throughput metrics (tokens/sec, completion count, duration)
✅ RAG Pipeline: Upload PDFs and chat with your documents (Part 2)
✅ Web Search: Tavily-powered web search injected into context (Part 3)
✅ Vision Models: Auto-detecting image support with smart fallbacks (Part 3)
✅ Browser Inference: Run models locally in your browser via WebGPU (Part 4)
The key idea is plug and play. New model on HuggingFace? Serve it with vLLM, point LLMChat at the endpoint, and you're testing it in under a minute. No config files to edit, no rebuilds, no waiting.
Architecture
LLMChat has a clean two-layer architecture: a FastAPI backend that handles all the LLM communication, and a vanilla JavaScript frontend that renders the chat UI.
Why Vanilla JS?
You might ask: "Why not React? Why not Next.js?"
Because I didn't need them. The entire frontend is ~1,500 lines across 5 files. No build step, no node_modules, no webpack config, no dependency hell. Just .html, .js, and .css files served by FastAPI.
The JavaScript is organized using the IIFE (Immediately Invoked Function Expression) pattern to avoid polluting the global namespace:
```javascript
// Each module wraps itself in an IIFE
(function () {
  // Private logic here...

  // Expose public API
  window.Render = {
    renderMarkdownInto,
    updateStatsFromStream,
    showImagePreviews,
  };
})();
```
It's old school, it works, and it's fast. When I want to change something, I change a file and refresh the browser. No hot module replacement needed, because a cold refresh takes 200 ms.
Setup in 5 Minutes
Prerequisites
You need Python 3.9+ and an OpenAI-compatible endpoint running somewhere. If you don't have one yet, check out my self-hosting LLMs guide.
Step 1: Clone and Install
```shell
git clone https://github.com/kXborg/LLMChat.git
cd LLMChat
pip install -r app/requirements.txt
```
The dependencies are minimal:
| Package | Purpose |
|---|---|
| fastapi + uvicorn | Web framework and ASGI server |
| openai | OpenAI-compatible API client |
| Pillow | Image compression for uploads |
| python-dotenv | Environment variable loading |
| tavily-python | Web search integration |
| PyMuPDF | PDF text extraction for RAG |
| sentence-transformers | Embedding model for RAG |
| qdrant-client | Vector database for RAG |
Step 2: Configure
Copy the environment template and fill in your endpoint:
```shell
cp app/.env.example app/.env
```

```shell
# app/.env
OPENAI_API_KEY=EMPTY                      # Only needed if your endpoint requires auth
OPENAI_API_BASE=http://localhost:8000/v1  # Your vLLM/Ollama/OpenAI endpoint
TAVILY_API_KEY=your-tavily-api-key-here   # Optional: enables web search
```
That's it. Two required fields: OPENAI_API_KEY (set to EMPTY for local models) and OPENAI_API_BASE (your serving engine's URL).
Step 3: Run
```shell
uvicorn app.main:app --host 0.0.0.0 --port 3000
```
Open http://localhost:3000 in your browser. LLMChat will auto-detect all models available on your endpoint and populate the dropdown. Pick a model and start chatting.

Tested Models
I've been testing LLMChat with a variety of models on my RTX 3080 Ti (12 GB VRAM) using vLLM as the serving engine. Here's what works:
| Model | vLLM Command | Notes |
|---|---|---|
| Falcon3-7B-Instruct-GPTQ-Int4 | vllm serve tiiuae/Falcon3-7B-Instruct-GPTQ-Int4 --max-model-len 4096 --gpu-memory-utilization 0.85 | Solid general-purpose |
| Ministral-3-8B-Instruct-AWQ-4bit | vllm serve cyankiwi/Ministral-3-8B-Instruct-2512-AWQ-4bit --gpu-memory-utilization 0.85 --max-model-len 6144 --max-num-batched-tokens 1024 | Great instruction following |
| InternVL3-8B-AWQ | vllm serve OpenGVLab/InternVL3-8B-AWQ --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code --quantization awq | Vision-capable. Must set --quantization awq |
| InternVL3-2B | vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code | Smaller vision model |
| Nemotron Cascade 8B | vllm serve cyankiwi/Nemotron-Cascade-8B-AWQ-4bit --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code | NVIDIA's reasoning model |
| Qwen2-VL-2B-Instruct | vllm serve Qwen/Qwen2-VL-2B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 | Qwen's vision-language model |
| Gemma 3 4B Instruct (GPTQ) | vllm serve ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g --max-model-len 4096 --max-num-batched-tokens 1024 --gpu-memory-utilization 0.8 | Original OOMs. GPTQ quantized works. Max concurrency: 9 |
The beauty of LLMChat's design is that you don't need to configure any of these models in the frontend. Just serve them with vLLM (or Ollama, or any OpenAI-compatible engine), and LLMChat auto-discovers them via the /models endpoint. Plug and play.
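Under the hood, that discovery is just one call to the /models route. Here's a minimal sketch of the idea, assuming the OpenAI Python SDK's client.models.list() response shape; discover_models is an illustrative name, not LLMChat's actual helper:

```python
def discover_models(client) -> list[str]:
    """Return the sorted model IDs advertised by an OpenAI-compatible /models endpoint."""
    return sorted(model.id for model in client.models.list().data)

# Typical usage against a local vLLM server (assumes the openai package):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   discover_models(client)
```

Because the function only depends on the client's interface, the same call works unchanged whether the endpoint is vLLM, Ollama, or OpenAI itself.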
Confused about hosting LLMs locally? Check out Self-Hosting LLMs: A Guide to vLLM, SGLang, and Llama.cpp for a detailed walkthrough.
Under the Hood: Streaming
Nobody wants to stare at a loading spinner for 10 seconds waiting for a complete response. That's why LLMChat uses Server-Sent Events (SSE) for real-time token streaming.
Here's the core pattern — a Python generator that yields tokens as they arrive from the LLM:
```python
# Excerpt from app/main.py; client, build_messages, and effective_max_tokens
# are defined elsewhere in the module.
@app.post("/chat/stream")
def chat_stream(req: ChatRequest):
    model_name = (req.model or DEFAULT_MODEL).strip()
    messages = build_messages(req.user_id, req.message, req.attachments, model_name)

    def token_generator():
        buffer = []
        start_t = time.perf_counter()

        stream = client.chat.completions.create(
            model=model_name,
            messages=messages,
            max_tokens=effective_max_tokens,
            temperature=0.8,
            stream=True,  # This is the key flag
        )

        for chunk in stream:
            piece = chunk.choices[0].delta.content or ""
            if piece:
                buffer.append(piece)
                yield piece  # Token goes to the browser immediately

        # Append throughput metrics at the end
        final_answer = "".join(buffer)
        dur_ms = int((time.perf_counter() - start_t) * 1000)
        approx_tokens = max(1, len(final_answer) // 4)
        tps = approx_tokens / max(0.001, dur_ms / 1000)
        yield f"\n\n[throughput] duration_ms={dur_ms} tokens_per_sec={tps:.2f}\n"

    return StreamingResponse(token_generator(), media_type="text/plain")
```
On the frontend side, we read the stream chunk by chunk and render markdown in real time:
```javascript
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  renderMarkdownInto(bubble, buffer); // Re-render on every chunk
}
```
The result is a ChatGPT-like experience where text appears token by token. At the end of each response, the backend appends a throughput footer like [throughput] duration_ms=1234 tokens_per_sec=45.67. The frontend strips this out and displays it in a stats bar.

Thinking Block Support
Some models (like DeepSeek, Nemotron, QwQ) wrap their reasoning in <think> tags. LLMChat detects these and renders them as collapsible sections:
```javascript
function extractThinking(rawText) {
  let thinking = '';
  let content = rawText;

  // Pattern: <think>...</think> tags
  const rawThinkRegex = /<think>([\s\S]*?)<\/think>/gi;
  content = content.replace(rawThinkRegex, (match, p1) => {
    thinking += p1.trim();
    return '';
  });

  return { thinking: thinking.trim(), content: content.trim() };
}
```
When thinking content is found, it gets rendered as a <details> element. Collapsed by default, expandable on click:
```html
<details class="thinking-block">
  <summary>💭 Model's Reasoning</summary>
  <div class="thinking-content">...</div>
</details>
```
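If you ever need the same extraction server-side, a Python equivalent is just as short. A sketch mirroring the JS above, not code from the repo:

```python
import re

# Same pattern as the JS extractThinking: non-greedy, case-insensitive <think> tags
THINK_RE = re.compile(r"<think>([\s\S]*?)</think>", re.IGNORECASE)

def extract_thinking(raw_text: str) -> tuple[str, str]:
    """Split a raw response into (thinking, content)."""
    thinking = "".join(part.strip() for part in THINK_RE.findall(raw_text))
    content = THINK_RE.sub("", raw_text)
    return thinking, content.strip()
```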
Under the Hood: Smart Context Management
LLMs have finite context windows. Send too many tokens and you get an error. Most chat interfaces just crash or show a cryptic error message, but LLMChat handles this gracefully.
Auto-Retry with Reduced Tokens
When the backend catches a BadRequestError (context length exceeded), it parses the error message to figure out exactly how many tokens are available:
```python
import re

def parse_allowed_tokens_from_error(msg: str) -> int | None:
    """
    Parse: "This model's maximum context length is 1024 tokens
    and your request has 899 input tokens"
    """
    max_ctx_match = re.search(r"maximum context length is (\d+) tokens", msg)
    input_match = re.search(r"your request has (\d+) input tokens", msg)
    if not max_ctx_match or not input_match:
        return None
    max_ctx = int(max_ctx_match.group(1))
    input_tokens = int(input_match.group(1))
    context_margin = 16  # Safety headroom
    return max(0, max_ctx - input_tokens - context_margin)
```
If parsing succeeds, the request is retried with the exact remaining budget. If parsing fails (different error format), it halves the token count and tries again. Either way, the user gets a response instead of an error.
Truncation Suffix
If the model hits its max_tokens limit mid-response (the dreaded finish_reason: "length"), LLMChat appends a clear indicator:
```python
# TRUNCATION_SUFFIX is "… Would you like me to continue?"
if finish_reason == "length" and TRUNCATION_SUFFIX:
    answer = f"{answer}{TRUNCATION_SUFFIX}"
```
Instead of the response just... stopping mid-sentence, the user gets a clear signal that there's more to say.
History Pruning
Chat history grows with every message. Vision models are especially context-hungry because image tokens are expensive. LLMChat handles this with model-aware pruning:
```python
# Vision models get fewer history turns to manage context
MAX_HISTORY_TURNS_TEXT = 6    # 6 user+assistant pairs for text models
MAX_HISTORY_TURNS_VISION = 2  # Only 2 pairs for vision models
```
The system message is always preserved, and older messages are pruned from the front. The conversation stays coherent while keeping context usage manageable.
The Tech Stack
| Layer | Technology |
|---|---|
| Backend framework | FastAPI + Uvicorn |
| LLM client | openai Python SDK (v1+) |
| RAG embeddings | sentence-transformers (all-MiniLM-L6-v2, 384-dim) |
| RAG vector store | Qdrant (in-memory) |
| PDF extraction | PyMuPDF (fitz) |
| Web search | Tavily API |
| Image processing | Pillow |
| Frontend | Vanilla HTML/CSS/JS (no framework) |
| Markdown rendering | marked.js + DOMPurify + highlight.js |
| Local inference | HuggingFace Transformers.js (WebGPU + WASM) |
What's Next
This post covered the foundation: the core chat interface, streaming, and context management. But LLMChat does a lot more. Here's what's coming in the rest of this series:
- Part 2: Chat with Your PDFs — Adding a RAG Pipeline. Upload PDFs, chunk them, embed them, and chat with your documents using Qdrant and sentence-transformers.
- Part 3: Vision Models, Web Search, and Smart Fallbacks. Auto-detecting vision capability, Tavily web search integration, and graceful fallbacks when models go down.
- Part 4: Running LLMs in Your Browser with WebGPU. Zero-server inference using Transformers.js, WebGPU, and WASM. The privacy endgame.
- Part 5: Deploying Your Self-Hosted AI Stack. Cloudflared tunnels, production hardening, and tying the entire ecosystem together.
Try It Yourself
If you're tired of paying per-token for cloud APIs, or if you just want a fast way to test the latest models as they drop on HuggingFace, give LLMChat a try.
GitHub: https://github.com/kXborg/LLMChat
Quick Start:
```shell
git clone https://github.com/kXborg/LLMChat.git
cd LLMChat
pip install -r app/requirements.txt
# Configure your endpoint in app/.env
uvicorn app.main:app --host 0.0.0.0 --port 3000
```
Point it at your serving engine, pick a model, and start chatting. It's that simple.
Final Thoughts
The open-source LLM ecosystem is moving at an incredible pace: new models, new architectures, and new capabilities every single week. What's missing is a simple way to interact with all of them without being locked into a single provider's interface.
LLMChat is my answer to that gap. It's not trying to compete with ChatGPT's polish or Cursor's agentic features. It's a wireframe. Deliberately simple, deliberately flexible. Swap the model, swap the endpoint, keep the interface.
Your models, your data, your chat. 🚀
P.S.: If you're also interested in LLM-powered code completion (the non-agentic kind), check out CodeContinue. It's a Sublime Text plugin I built for intelligent inline suggestions. Same philosophy: your model, your rules.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.