LLMChat: Deploying Your Self-Hosted AI Stack
You've built it. Now deploy it. From Cloudflared tunnels to production hardening, this post ties the entire self-hosted AI ecosystem together: serving engines, chat interface, and code completion.

From Local to Everywhere
Over the past four posts, we've built something substantial:
- Part 1: A self-hosted chat interface with streaming and smart context management
- Part 2: A RAG pipeline for chatting with your PDFs
- Part 3: Vision model support, web search integration, and smart fallbacks
- Part 4: In-browser LLM inference via WebGPU and WASM
All of this runs beautifully on localhost:3000. But what about sharing it with your team? Accessing it from your phone? Running it on a remote server?
This final post covers deployment. Getting LLMChat running beyond your local machine, hardening it for real-world use, and stepping back to see how all the pieces of a self-hosted AI stack fit together.
The Deployment Spectrum
Not every deployment needs a Kubernetes cluster. Here's how I think about it:
| Scenario | Setup | Complexity |
|---|---|---|
| Personal use | uvicorn on your machine | Trivial |
| Local network / Team | uvicorn + bind to 0.0.0.0 | Low |
| Internet-facing | uvicorn + Cloudflared tunnel | Medium |
| Production | Docker + reverse proxy + auth | Higher |
For most personal and small-team use cases, you don't need Docker, Nginx, or any of the production scaffolding. Let's start with the simplest path to sharing LLMChat outside localhost.
Exposing LLMChat with Cloudflared
Cloudflared is a free tool from Cloudflare that creates secure tunnels to your local services. No port forwarding, no static IPs, no DNS configuration needed.
Install Cloudflared
```bash
# Windows
winget install --id Cloudflare.cloudflared

# macOS
brew install cloudflared

# Linux (Debian/Ubuntu)
curl -L --output cloudflared.deb \
  https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb
```

Create a Quick Tunnel
With LLMChat running on port 3000, expose it with a single command:
```bash
cloudflared tunnel --url http://localhost:3000
```

Cloudflared generates a temporary public URL:
```
Your quick Tunnel has been created!
+--------------------------------------------------------------+
| Your free tunnel URL: https://random-words.trycloudflare.com |
+--------------------------------------------------------------+
```

Anyone with this URL can now access your LLMChat instance from anywhere in the world. The tunnel encrypts traffic with TLS and proxies requests through Cloudflare's network.
If you're also running a model server (vLLM on port 8000), you can tunnel that too:
```bash
# Terminal 1: LLMChat
cloudflared tunnel --url http://localhost:3000

# Terminal 2: Model server (if remote access needed)
cloudflared tunnel --url http://localhost:8000
```

I covered Cloudflared in detail in my Self-Hosting LLMs guide. Check that post for persistent tunnel setup with Cloudflare Zero Trust.

Security Considerations
Quick tunnels are publicly accessible. Anyone with the URL can use your LLMChat. For personal use, this is fine since the URL is random and temporary. But for anything more:
Use Cloudflare Zero Trust for:
- Email-based authentication (only your team can access)
- IP whitelisting
- Session management
- Audit logging
Or add your own auth layer (more on that in the hardening section below).
Hardening for Production
LLMChat was built for personal use and rapid prototyping. If you're deploying it for a team or exposing it to the internet, there are several areas to tighten.
1. CORS Lockdown
The current config allows everything:
```python
# Current (development)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # tighten in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

For production, restrict to your domain:
```python
# Production
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-domain.com"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "DELETE"],
    allow_headers=["Content-Type", "Authorization"],
)
```

2. Authentication
Currently, user_id is generated client-side with Math.random(). There's no verification:
```javascript
// Current: anyone can claim any user_id
let userId = "user-" + Math.random().toString(36).substring(2, 8);
```

For a shared deployment, add at minimum:
- API key authentication: Simple `Authorization: Bearer <key>` header check
- Session-based auth: Login page with username/password
- OAuth: Google/GitHub login via FastAPI middleware
3. Persistent Storage
Everything in LLMChat is in-memory: chat histories, RAG indexes, vision overrides, document metadata. Restart the server and it's all gone.
```python
# Current: all in-memory
user_histories = {}                       # Chat histories
vision_capability_overrides = {}          # Vision overrides
document_metadata = {}                    # RAG document metadata
qdrant_client = QdrantClient(":memory:")  # Vector database
```

For persistence, consider:
| Data | Current | Production Option |
|---|---|---|
| Chat histories | Python dict | SQLite via aiosqlite or PostgreSQL |
| RAG vectors | Qdrant in-memory | Qdrant persistent mode or hosted Qdrant |
| Document metadata | Python dict | SQLite or same DB as chat histories |
| Vision overrides | Python dict | Redis or DB-backed session store |
| Uploaded files | Local filesystem | S3-compatible storage or persistent volume |
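For the chat-history row, even the stdlib `sqlite3` module gets you durability with no new dependencies. A sketch under my own schema (table layout and function names are illustrative, not LLMChat's actual code):

```python
import json
import sqlite3

def init_db(path: str = "llmchat.db") -> sqlite3.Connection:
    """Open (or create) the SQLite database and the history table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chat_history (
               user_id TEXT,
               turn    INTEGER,
               message TEXT,           -- JSON-encoded message dict
               PRIMARY KEY (user_id, turn)
           )"""
    )
    return conn

def save_turn(conn: sqlite3.Connection, user_id: str, turn: int, message: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO chat_history VALUES (?, ?, ?)",
        (user_id, turn, json.dumps(message)),
    )
    conn.commit()

def load_history(conn: sqlite3.Connection, user_id: str) -> list[dict]:
    rows = conn.execute(
        "SELECT message FROM chat_history WHERE user_id = ? ORDER BY turn",
        (user_id,),
    )
    return [json.loads(m) for (m,) in rows]
```

Swap `sqlite3` for `aiosqlite` if you want the calls to be async like the rest of the FastAPI handlers.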
4. Environment Variable Management
Keep sensitive values out of code:
```
# Required
OPENAI_API_KEY=your-key-here
OPENAI_API_BASE=http://your-server:8000/v1

# Optional but recommended
TAVILY_API_KEY=tvly-xxx
MAX_TOKENS=2048
MAX_HISTORY_TURNS_TEXT=6
MAX_HISTORY_TURNS_VISION=2
IMAGE_MAX_DIMENSION=2048
IMAGE_QUALITY=85
```

Use `.env` for local development and proper secrets management (Docker secrets, cloud provider secret manager) for production.
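Reading those variables with typed defaults keeps the config surface explicit. A small helper sketch (`env_int` is my name for it, not something in LLMChat):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, with a fallback default."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

# Names mirror the variables listed above.
MAX_TOKENS = env_int("MAX_TOKENS", 2048)
MAX_HISTORY_TURNS_TEXT = env_int("MAX_HISTORY_TURNS_TEXT", 6)
MAX_HISTORY_TURNS_VISION = env_int("MAX_HISTORY_TURNS_VISION", 2)
IMAGE_QUALITY = env_int("IMAGE_QUALITY", 85)
```

Libraries like pydantic-settings do the same with validation built in, but for a handful of values the stdlib is plenty.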
5. Error Handling and Logging
The current codebase uses print() for logging. For production:
```python
import logging

logger = logging.getLogger("llmchat")
logger.setLevel(logging.INFO)

# Replace print() calls
logger.info(f"Loading embedding model: {model_name}")
logger.warning(f"Image compression failed for {file_path.name}: {e}")
logger.error(f"Vision probe failed for {model_id}: {e}")
```

Add structured logging with request IDs for debugging multi-user issues.
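One dependency-free way to get per-request IDs is a `contextvars`-backed logging filter: set the ID once at the start of each request, and every log line emitted while handling that request carries it automatically. A sketch with illustrative names:

```python
import contextvars
import logging
import uuid

# Holds the current request's ID; async-safe, unlike a plain global.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request ID."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("llmchat")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def new_request_id() -> str:
    """Call at the start of each request (e.g. in FastAPI middleware)."""
    rid = uuid.uuid4().hex[:8]
    request_id_var.set(rid)
    return rid
```

Because `ContextVar` values are scoped per async task, concurrent requests don't clobber each other's IDs.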
Known Limitations: An Honest Assessment
I've been honest throughout this series about what works and what doesn't. Here's the full picture:
Architecture Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| In-memory state | All data lost on restart | Migrate to persistent storage |
| No automated tests | tests/ directory = screenshots only | Add pytest suite for API endpoints |
| CORS wide open | Security risk in production | Restrict origins |
| No auth | Anyone can access, impersonate users | Add authentication layer |
| Single-process | Can't scale horizontally | Put behind load balancer with sticky sessions |
Feature Limitations
| Limitation | Impact | Possible Fix |
|---|---|---|
| PDF-only RAG | Can't index Word, HTML, or web pages | Add document format handlers |
| No re-ranking | RAG results ordered by raw similarity only | Add cross-encoder second pass |
| Vision probe delay | 1-2s per model on first load | Background probe on startup |
| Image compression edge cases | Very large images (5MB+) can fail | Better progressive resizing |
| Legacy duplicate code | Some functions exist in both app.js and modular files | Complete the restructuring |
I list these not to discourage you, but because knowing the limits is how you know where to improve. For a personal self-hosted tool, most of these are non-issues. For a team deployment, pick the ones that matter to you and address them.
The Full Self-Hosted AI Ecosystem
Let's zoom out and see how everything connects. Over the past year, I've built three pieces that together form a complete self-hosted AI stack:
The Three Pillars
1. Serving Engines >> Self-Hosting LLMs: A Guide to vLLM, SGLang, and Llama.cpp
The foundation. Models need to be loaded into memory and exposed via API. vLLM for high-throughput production, SGLang for structured generation, Llama.cpp for CPU/edge deployment.
2. Chat Interface >> LLMChat (this series)
The frontend. A flexible chat UI that connects to any of those serving engines, augments queries with RAG and web search, handles vision models, and even runs small models directly in the browser.
3. Code Completion >> CodeContinue: LLM-Powered Sublime Text Autocomplete
The developer tool. A Sublime Text plugin that uses the same local LLM infrastructure for intelligent 1-2 line code suggestions. Same philosophy: your model, your rules.
How They Connect
All three tools speak OpenAI-compatible API. This means:
- Any model served by vLLM, SGLang, or Llama.cpp works with both LLMChat and CodeContinue
- You can run one model server and point multiple tools at it
- Switching models is just changing a URL or model name, no reconfiguration
This is the power of open standards. The OpenAI API format has become the lingua franca of LLM tooling, and every piece in this stack speaks it natively.
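Concretely, the request shape is identical against every backend; only the base URL changes. A stdlib-only sketch that builds (but doesn't send) an OpenAI-compatible request; the API key value is a placeholder, since local servers typically ignore it:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request for any backend."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed-for-local",  # placeholder
        },
        method="POST",
    )

# The same call works for vLLM, SGLang, or Llama.cpp's server;
# only the base URL changes:
req = chat_request("http://localhost:8000/v1", "my-model", "Hello")
```

Send it with `urllib.request.urlopen(req)` (or any HTTP client) once a server is actually listening on that URL.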
What I'd Build Next
If I were spending another month on this project, here's my priority list:
1. Persistent storage: SQLite for chat histories and RAG metadata. This is the single biggest improvement for daily use.
2. Multi-user auth: Simple username/password with session tokens. Not enterprise SSO, just enough that two people can use the same deployment without interfering with each other.
3. Docker Compose: One-command deployment: `docker compose up`. Bundle LLMChat + vLLM + Qdrant in a single `docker-compose.yml`.
4. More RAG formats: `.docx`, `.html`, and web scraping. The pipeline is already modular enough to add new extractors.
5. Fine-tuned models: Domain-specific fine-tunes served alongside general models. The model dropdown already supports multiple models, so fine-tunes would just be another entry.
6. Mobile-responsive UI: The current CSS works on desktop but gets cramped on phones. A responsive redesign would make browser inference genuinely useful on mobile.
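For a rough idea of the Docker Compose item, here's an untested sketch: the service layout, model name, and environment wiring are my assumptions, though the `vllm/vllm-openai` and `qdrant/qdrant` images are real:

```yaml
services:
  llmchat:
    build: .                               # hypothetical Dockerfile for LLMChat
    ports: ["3000:3000"]
    environment:
      - OPENAI_API_BASE=http://vllm:8000/v1
      - OPENAI_API_KEY=not-needed-for-local
    depends_on: [vllm, qdrant]
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "your-model-of-choice"]   # placeholder model name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  qdrant:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant_data:/qdrant/storage        # persistent vectors across restarts
volumes:
  qdrant_data:
```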
Final Thoughts
When I started this project, I just wanted a simple interface to test new models. It grew into something bigger: a complete self-hosted AI chat system with RAG, vision, web search, and browser inference.
But the core philosophy never changed:
- Plug and play: Any model, any endpoint, no reconfiguration
- Privacy first: Your data stays on your infrastructure (or in your browser)
- Honest about limitations: In-memory storage, no auth, no tests, but functional and useful
- Building blocks, not monoliths: Each component (serving, chat, code completion) works independently
The open-source LLM ecosystem has given us incredible models. What's been missing is the connective tissue. The tools that let you use those models productively. LLMChat is my contribution to that layer.
GitHub: https://github.com/kXborg/LLMChat
Try it. Fork it. Break it. Improve it. And if you build something cool on top of it, I'd love to hear about it.
Your hardware. Your models. Your data. Your rules. 🚀
P.S.: This is the end of the LLMChat series, but not the end of the project. I have plans for edge deployment (Jetson, Raspberry Pi) that'll get their own dedicated post. Stay tuned! And if you've been following along, you now have everything you need to run, extend, and deploy your own self-hosted AI chat system. Happy hacking 🛠️.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.