
LLMChat: Building a Self-Hosted ChatGPT Alternative with FastAPI

New models drop on HuggingFace almost daily. LLMChat is a self-hosted chat interface — plug in any OpenAI-compatible endpoint, pick a model, and start chatting. No data leaves your infrastructure.

AI/ML · 12 min read · Author: Kukil Kashyap Borgohain
LLMChat - Self-hosted ChatGPT alternative

Why I Built LLMChat

Mostly for learning purposes. I wanted to understand how tools like ChatGPT actually work under the hood: the streaming, the context management, the way a chat interface stitches together API calls, history, and rendering into something that feels seamless. The best way to understand something is to build it from scratch.

So I started with a simple FastAPI backend, a vanilla JavaScript frontend, and a connection to a local LLM. One weekend project. But then interesting things started happening.

Once I had a working chat interface, I realized it was also a perfect wireframe for testing new models. Open-source models drop on HuggingFace almost daily: Falcon, Ministral, InternVL, Gemma, Nemotron, Qwen, each with different strengths. With LLMChat, I could just swap the endpoint, pick the model from a dropdown, and start chatting. No config files, no rebuilds, just plug and play.

Then the privacy angle became obvious. Every prompt through ChatGPT or Claude goes to someone else's server. With LLMChat pointing at a local vLLM instance, nothing leaves my machine. Not a single token.

And then I kept adding things. RAG for document Q&A, vision model support, web search, even in-browser inference via WebGPU. What started as a learning exercise turned into a genuinely useful tool.

This is Part 1 of a 5-part series where I'll walk through everything I built and why.

If you haven't set up a local LLM serving engine yet, check out my guide on Self-Hosting LLMs with vLLM, SGLang, and Llama.cpp. LLMChat is the frontend to that backend.

LLMChat main interface

What LLMChat Does

Here's the feature set at a glance:

Multi-Provider Support: Connects to any OpenAI-compatible endpoint. Works with vLLM, Ollama, OpenAI, LM Studio, or your own
Streaming Responses: Real-time token streaming via Server-Sent Events (SSE)
Model Dropdown: Switch between models on the fly, no restart needed
Markdown Rendering: Syntax-highlighted code blocks with one-click copy
Thinking Blocks: Parses <think> tags into collapsible "💭 Model's Reasoning" sections
Chat History: Smart pruning with configurable turn limits per model type
Token Stats: Real-time throughput metrics (tokens/sec, completion count, duration)
RAG Pipeline: Upload PDFs and chat with your documents (Part 2)
Web Search: Tavily-powered web search injected into context (Part 3)
Vision Models: Auto-detecting image support with smart fallbacks (Part 3)
Browser Inference: Run models locally in your browser via WebGPU (Part 4)

The key idea is plug and play. New model on HuggingFace? Serve it with vLLM, point LLMChat at the endpoint, and you're testing it in under a minute. No config files to edit, no rebuilds, no waiting.

Architecture

LLMChat has a clean two-layer architecture: a FastAPI backend that handles all the LLM communication, and a vanilla JavaScript frontend that renders the chat UI.

LLMChat architecture diagram

Why Vanilla JS?

You might ask: "Why not React? Why not Next.js?"

Because I didn't need them. The entire frontend is ~1,500 lines across 5 files. No build step, no node_modules, no webpack config, no dependency hell. Just .html, .js, and .css files served by FastAPI.

The JavaScript is organized using the IIFE (Immediately Invoked Function Expression) pattern to avoid polluting the global namespace:

```javascript
// Each module wraps itself in an IIFE
(function () {
  // Private logic here...

  // Expose public API
  window.Render = {
    renderMarkdownInto,
    updateStatsFromStream,
    showImagePreviews,
  };
})();
```

It's old school, it works, and it's fast. When I want to change something, I change a file and refresh the browser. No hot module replacement needed because cold refresh takes 200ms.

Setup in 5 Minutes

Prerequisites

You need Python 3.9+ and an OpenAI-compatible endpoint running somewhere. If you don't have one yet, check out my self-hosting LLMs guide.

Step 1: Clone and Install

```bash
git clone https://github.com/kXborg/LLMChat.git
cd LLMChat
pip install -r app/requirements.txt
```

The dependencies are minimal:

| Package | Purpose |
| --- | --- |
| fastapi + uvicorn | Web framework and ASGI server |
| openai | OpenAI-compatible API client |
| Pillow | Image compression for uploads |
| python-dotenv | Environment variable loading |
| tavily-python | Web search integration |
| PyMuPDF | PDF text extraction for RAG |
| sentence-transformers | Embedding model for RAG |
| qdrant-client | Vector database for RAG |

Step 2: Configure

Copy the environment template and fill in your endpoint:

```bash
cp app/.env.example app/.env
```

```bash
# app/.env
OPENAI_API_KEY=EMPTY                         # Only needed if your endpoint requires auth
OPENAI_API_BASE=http://localhost:8000/v1     # Your vLLM/Ollama/OpenAI endpoint
TAVILY_API_KEY=your-tavily-api-key-here      # Optional: enables web search
```

That's it. Two required fields: OPENAI_API_KEY (set to EMPTY for local models) and OPENAI_API_BASE (your serving engine's URL).
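As a minimal sketch of how those two settings might be consumed (using only the standard library; `load_settings` is a hypothetical helper for illustration, not LLMChat's actual code — the app itself uses python-dotenv):

```python
import os

def load_settings() -> dict:
    # Hypothetical helper: read the two required settings plus the optional
    # Tavily key, with defaults mirroring the .env example above.
    return {
        "api_key": os.environ.get("OPENAI_API_KEY", "EMPTY"),
        "api_base": os.environ.get("OPENAI_API_BASE", "http://localhost:8000/v1"),
        "tavily_key": os.environ.get("TAVILY_API_KEY"),  # optional feature flag
    }

settings = load_settings()
print(settings["api_base"])
```

Setting the key to `EMPTY` is a vLLM convention: local engines don't check auth, but the OpenAI SDK still requires a non-empty string.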

Step 3: Run

```bash
uvicorn app.main:app --host 0.0.0.0 --port 3000
```

Open http://localhost:3000 in your browser. LLMChat will auto-detect all models available on your endpoint and populate the dropdown. Pick a model and start chatting.

LLMChat model dropdown

Tested Models

I've been testing LLMChat with a variety of models on my RTX 3080 Ti (12 GB VRAM) using vLLM as the serving engine. Here's what works:

| Model | vLLM Command | Notes |
| --- | --- | --- |
| Falcon3-7B-Instruct-GPTQ-Int4 | `vllm serve tiiuae/Falcon3-7B-Instruct-GPTQ-Int4 --max-model-len 4096 --gpu-memory-utilization 0.85` | Solid general-purpose |
| Ministral-3-8B-Instruct-AWQ-4bit | `vllm serve cyankiwi/Ministral-3-8B-Instruct-2512-AWQ-4bit --gpu-memory-utilization 0.85 --max-model-len 6144 --max-num-batched-tokens 1024` | Great instruction following |
| InternVL3-8B-AWQ | `vllm serve OpenGVLab/InternVL3-8B-AWQ --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code --quantization awq` | Vision-capable. Must set `--quantization awq` |
| InternVL3-2B | `vllm serve OpenGVLab/InternVL3-2B --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024 --trust-remote-code` | Smaller vision model |
| Nemotron Cascade 8B | `vllm serve cyankiwi/Nemotron-Cascade-8B-AWQ-4bit --max-model-len 4096 --gpu-memory-utilization 0.85 --max-num-batched-tokens 1024 --trust-remote-code` | NVIDIA's reasoning model |
| Qwen2-VL-2B-Instruct | `vllm serve Qwen/Qwen2-VL-2B-Instruct --max-model-len 4096 --gpu-memory-utilization 0.75 --max-num-batched-tokens 1024` | Qwen's vision-language model |
| Gemma 3 4B Instruct (GPTQ) | `vllm serve ISTA-DASLab/gemma-3-4b-it-GPTQ-4b-128g --max-model-len 4096 --max-num-batched-tokens 1024 --gpu-memory-utilization 0.8` | Original OOMs. GPTQ quantized works. Max concurrency: 9 |

The beauty of LLMChat's design is that you don't need to configure any of these models in the frontend. Just serve them with vLLM (or Ollama, or any OpenAI-compatible engine), and LLMChat auto-discovers them via the /models endpoint. Plug and play.
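Discovery is just a GET to the engine's /models route, which returns the standard OpenAI list object. A sketch of the extraction logic (the sample payload below is illustrative; the actual ids depend on what your engine is serving):

```python
# Sample /v1/models response in the OpenAI list format (illustrative ids).
sample_response = {
    "object": "list",
    "data": [
        {"id": "tiiuae/Falcon3-7B-Instruct-GPTQ-Int4", "object": "model"},
        {"id": "OpenGVLab/InternVL3-2B", "object": "model"},
    ],
}

def model_ids(payload: dict) -> list[str]:
    # Pull out the ids used to populate the model dropdown
    return [m["id"] for m in payload.get("data", [])]

print(model_ids(sample_response))
```

With the openai SDK, the equivalent call is `client.models.list()`; every OpenAI-compatible engine (vLLM, Ollama, LM Studio) serves this route.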

Confused about hosting LLMs locally? Check out Self-Hosting LLMs: A Guide to vLLM, SGLang, and Llama.cpp for a detailed walkthrough.

Under the Hood: Streaming

Nobody wants to stare at a loading spinner for 10 seconds waiting for a complete response. That's why LLMChat uses Server-Sent Events (SSE) for real-time token streaming.

Here's the core pattern — a Python generator that yields tokens as they arrive from the LLM:

```python
# Excerpt from the backend. Imports shown for context; client, ChatRequest,
# DEFAULT_MODEL, build_messages, and effective_max_tokens are defined
# elsewhere in the app.
import time

from fastapi.responses import StreamingResponse

@app.post("/chat/stream")
def chat_stream(req: ChatRequest):
    model_name = (req.model or DEFAULT_MODEL).strip()
    messages = build_messages(req.user_id, req.message, req.attachments, model_name)

    def token_generator():
        buffer = []
        start_t = time.perf_counter()

        stream = client.chat.completions.create(
            model=model_name,
            messages=messages,
            max_tokens=effective_max_tokens,
            temperature=0.8,
            stream=True,               # This is the key flag
        )

        for chunk in stream:
            piece = chunk.choices[0].delta.content or ""
            if piece:
                buffer.append(piece)
                yield piece             # Token goes to the browser immediately

        # Append throughput metrics at the end
        final_answer = "".join(buffer)
        dur_ms = int((time.perf_counter() - start_t) * 1000)
        approx_tokens = max(1, len(final_answer) // 4)  # ~4 chars per token heuristic
        tps = approx_tokens / max(0.001, dur_ms / 1000)
        yield f"\n\n[throughput] duration_ms={dur_ms} tokens_per_sec={tps:.2f}\n"

    return StreamingResponse(token_generator(), media_type="text/plain")
```

On the frontend side, we read the stream chunk by chunk and render markdown in real-time:

```javascript
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    renderMarkdownInto(bubble, buffer);  // Re-render on every chunk
}
```

The result is a ChatGPT-like experience where text appears token by token. At the end of each response, the backend appends a throughput footer like [throughput] duration_ms=1234 tokens_per_sec=45.67. The frontend strips this out and displays it in a stats bar.
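The frontend does that stripping in JavaScript; here is the same parsing logic sketched in Python for clarity, based on the footer format the backend emits above (`split_footer` is a hypothetical helper, not LLMChat's actual code):

```python
import re

# Matches the footer the backend appends after the final token.
FOOTER_RE = re.compile(
    r"\n\n\[throughput\] duration_ms=(\d+) tokens_per_sec=([\d.]+)\n?$"
)

def split_footer(text: str):
    """Strip the throughput footer; return (clean_text, stats or None)."""
    m = FOOTER_RE.search(text)
    if not m:
        return text, None
    stats = {
        "duration_ms": int(m.group(1)),
        "tokens_per_sec": float(m.group(2)),
    }
    return text[: m.start()], stats

body, stats = split_footer(
    "Hello!\n\n[throughput] duration_ms=1234 tokens_per_sec=45.67\n"
)
print(body, stats)
```

Anchoring the pattern to the end of the stream is what lets a response legitimately contain the literal text `[throughput]` without being mangled.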

Streaming response in action

Thinking Block Support

Some models (like DeepSeek, Nemotron, QwQ) wrap their reasoning in <think> tags. LLMChat detects these and renders them as collapsible sections:

```javascript
function extractThinking(rawText) {
    let thinking = '';
    let content = rawText;

    // Pattern: <think>...</think> tags
    const rawThinkRegex = /<think>([\s\S]*?)<\/think>/gi;
    content = content.replace(rawThinkRegex, (match, p1) => {
        thinking += p1.trim();
        return '';
    });

    return { thinking: thinking.trim(), content: content.trim() };
}
```

When thinking content is found, it gets rendered as a <details> element. Collapsed by default, expandable on click:

```html
<details class="thinking-block">
  <summary>💭 Model's Reasoning</summary>
  <div class="thinking-content">...</div>
</details>
```

Thinking block — collapsible reasoning

Under the Hood: Smart Context Management

LLMs have finite context windows. Send too many tokens and you get an error. Most chat interfaces just crash or show a cryptic error message, but LLMChat handles this gracefully.

Auto-Retry with Reduced Tokens

When the backend catches a BadRequestError (context length exceeded), it parses the error message to figure out exactly how many tokens are available:

```python
import re

def parse_allowed_tokens_from_error(msg: str) -> int | None:
    """
    Parse: "This model's maximum context length is 1024 tokens
    and your request has 899 input tokens"
    """
    max_ctx_match = re.search(r"maximum context length is (\d+) tokens", msg)
    input_match = re.search(r"your request has (\d+) input tokens", msg)
    if not max_ctx_match or not input_match:
        return None
    max_ctx = int(max_ctx_match.group(1))
    input_tokens = int(input_match.group(1))
    context_margin = 16  # Safety headroom
    return max(0, max_ctx - input_tokens - context_margin)
```

If parsing succeeds, the request is retried with the exact remaining budget. If parsing fails (different error format), it halves the token count and tries again. Either way, the user gets a response instead of an error.
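The two-step decision (exact budget if parseable, halve otherwise) can be sketched as a small pure function. This is a hypothetical `next_token_budget` helper for illustration, with the parser re-implemented inline so the block is self-contained:

```python
import re

def parse_allowed_tokens_from_error(msg: str):
    # Same parsing idea as shown above, with a 16-token safety margin
    max_ctx = re.search(r"maximum context length is (\d+) tokens", msg)
    inp = re.search(r"your request has (\d+) input tokens", msg)
    if not max_ctx or not inp:
        return None
    return max(0, int(max_ctx.group(1)) - int(inp.group(1)) - 16)

def next_token_budget(error_msg: str, current_max_tokens: int) -> int:
    """Hypothetical helper: pick the max_tokens for the retry."""
    allowed = parse_allowed_tokens_from_error(error_msg)
    if allowed is not None and allowed > 0:
        return allowed                       # exact remaining budget
    return max(1, current_max_tokens // 2)   # unknown format: halve and retry

msg = ("This model's maximum context length is 1024 tokens "
       "and your request has 899 input tokens")
print(next_token_budget(msg, 512))  # 1024 - 899 - 16 = 109
```

The halving fallback converges quickly: even from a large initial budget, a handful of retries reaches a request that fits.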

Truncation Suffix

If the model hits its max_tokens limit mid-response (the dreaded finish_reason: "length"), LLMChat appends a clear indicator:

```python
# TRUNCATION_SUFFIX is a configurable string,
# e.g. "… Would you like me to continue?"
if finish_reason == "length" and TRUNCATION_SUFFIX:
    answer = f"{answer}{TRUNCATION_SUFFIX}"
```

Instead of the response just... stopping mid-sentence, the user gets a clear signal that there's more to say.

History Pruning

Chat history grows with every message. Vision models are especially context-hungry because image tokens are expensive. LLMChat handles this with model-aware pruning:

```python
# Vision models get fewer history turns to manage context
MAX_HISTORY_TURNS_TEXT = 6      # 6 user+assistant pairs for text models
MAX_HISTORY_TURNS_VISION = 2    # Only 2 pairs for vision models
```

The system message is always preserved, and older messages are pruned from the front. The conversation stays coherent while keeping context usage manageable.
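A minimal sketch of that pruning strategy (a hypothetical `prune_history`, assuming OpenAI-style message dicts; LLMChat's real implementation may differ in detail):

```python
def prune_history(messages: list[dict], max_turns: int) -> list[dict]:
    """Keep the system message plus the last `max_turns` user+assistant pairs."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # One turn = user message + assistant reply = 2 messages
    return system + rest[-2 * max_turns:]

history = [{"role": "system", "content": "You are helpful."}]
for i in range(5):
    history.append({"role": "user", "content": f"q{i}"})
    history.append({"role": "assistant", "content": f"a{i}"})

pruned = prune_history(history, max_turns=2)
print([m["content"] for m in pruned])
# ['You are helpful.', 'q3', 'a3', 'q4', 'a4']
```

Pruning whole turns (rather than individual messages) matters: dropping a user message while keeping its assistant reply confuses many models.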

The Tech Stack

| Layer | Technology |
| --- | --- |
| Backend framework | FastAPI + Uvicorn |
| LLM client | openai Python SDK (v1+) |
| RAG embeddings | sentence-transformers (all-MiniLM-L6-v2, 384-dim) |
| RAG vector store | Qdrant (in-memory) |
| PDF extraction | PyMuPDF (fitz) |
| Web search | Tavily API |
| Image processing | Pillow |
| Frontend | Vanilla HTML/CSS/JS (no framework) |
| Markdown rendering | marked.js + DOMPurify + highlight.js |
| Local inference | HuggingFace Transformers.js (WebGPU + WASM) |

What's Next

This post covered the foundation: the core chat interface, streaming, and context management. But LLMChat does a lot more. Here's what's coming in the rest of this series:

Part 2: The RAG pipeline for uploading PDFs and chatting with your documents
Part 3: Tavily-powered web search and vision model support
Part 4: In-browser inference via WebGPU

Try It Yourself

If you're tired of paying per-token for cloud APIs, or if you just want a fast way to test the latest models as they drop on HuggingFace, give LLMChat a try.

GitHub: https://github.com/kXborg/LLMChat

Quick Start:

```bash
git clone https://github.com/kXborg/LLMChat.git
cd LLMChat
pip install -r app/requirements.txt
# Configure your endpoint in app/.env
uvicorn app.main:app --host 0.0.0.0 --port 3000
```

Point it at your serving engine, pick a model, and start chatting. It's that simple.

Final Thoughts

The open-source LLM ecosystem is moving at an incredible pace. New models, new architectures, new capabilities, every single week. What's missing is a simple way to interact with all of them without being locked into a single provider's interface.

LLMChat is my answer to that gap. It's not trying to compete with ChatGPT's polish or Cursor's agentic features. It's a wireframe. Deliberately simple, deliberately flexible. Swap the model, swap the endpoint, keep the interface.

Your models, your data, your chat. 🚀


P.S.: If you're also interested in LLM-powered code completion (the non-agentic kind), check out CodeContinue. It's a Sublime Text plugin I built for intelligent inline suggestions. Same philosophy: your model, your rules.

If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.

If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.