LLMChat: Deploying Your Self-Hosted AI Stack
You've built it. Now deploy it. From Cloudflared tunnels to production hardening, this post ties the entire self-hosted AI ecosystem together: serving engines, chat interface, and code completion.

From Local to Everywhere
Over the past four posts, we've built something substantial:
- Part 1: A self-hosted chat interface with streaming and smart context management
- Part 2: A RAG pipeline for chatting with your PDFs
- Part 3: Vision model support, web search integration, and smart fallbacks
- Part 4: In-browser LLM inference via WebGPU and WASM
All of this runs beautifully on localhost:3000. But what about sharing it with your team? Accessing it from your phone? Running it on a remote server?
This final post covers deployment. Getting LLMChat running beyond your local machine, hardening it for real-world use, and stepping back to see how all the pieces of a self-hosted AI stack fit together.
The Deployment Spectrum
Not every deployment needs a Kubernetes cluster. Here's how I think about it:
| Scenario | Setup | Complexity |
|---|---|---|
| Personal use | uvicorn on your machine | Trivial |
| Local network / Team | uvicorn + bind to 0.0.0.0 | Low |
| Internet-facing | uvicorn + Cloudflared tunnel | Medium |
| Production | Docker + reverse proxy + auth | Higher |
For most personal and small-team use cases, you don't need Docker, Nginx, or any of the production scaffolding. Let's start with the simplest path to sharing LLMChat outside localhost.
Exposing LLMChat with Cloudflared
Cloudflared is a free tool from Cloudflare that creates secure tunnels to your local services. No port forwarding, no static IPs, no DNS configuration needed.
Install Cloudflared
```bash
# Windows
winget install --id Cloudflare.cloudflared

# macOS
brew install cloudflared

# Linux (Debian/Ubuntu)
curl -L --output cloudflared.deb \
  https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb
```

Create a Quick Tunnel
With LLMChat running on port 3000, expose it with a single command:
```bash
cloudflared tunnel --url http://localhost:3000
```

Cloudflared generates a temporary public URL:
```
Your quick Tunnel has been created!
+--------------------------------------------------------------+
| Your free tunnel URL: https://random-words.trycloudflare.com |
+--------------------------------------------------------------+
```

Anyone with this URL can now access your LLMChat instance from anywhere in the world. The tunnel encrypts traffic with TLS and proxies requests through Cloudflare's network.
If you're also running a model server (vLLM on port 8000), you can tunnel that too:
```bash
# Terminal 1: LLMChat
cloudflared tunnel --url http://localhost:3000

# Terminal 2: Model server (if remote access needed)
cloudflared tunnel --url http://localhost:8000
```

I covered Cloudflared in detail in my Self-Hosting LLMs guide. Check that post for persistent tunnel setup with Cloudflare Zero Trust.

Security Considerations
Quick tunnels are publicly accessible. Anyone with the URL can use your LLMChat. For personal use, this is fine since the URL is random and temporary. But for anything more:
Use Cloudflare Zero Trust for:
- Email-based authentication (only your team can access)
- IP whitelisting
- Session management
- Audit logging
Or add your own auth layer (more on that in the hardening section below).
Hardening for Production
LLMChat was built for personal use and rapid prototyping. If you're deploying it for a team or exposing it to the internet, there are several areas to tighten.
1. CORS Lockdown
The current config allows everything:
```python
# Current (development)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # tighten in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

For production, restrict to your domain:
```python
# Production
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-domain.com"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "DELETE"],
    allow_headers=["Content-Type", "Authorization"],
)
```

2. Authentication
Currently, user_id is generated client-side with Math.random(). There's no verification:
```javascript
// Current: anyone can claim any user_id
let userId = "user-" + Math.random().toString(36).substring(2, 8);
```

For a shared deployment, add at minimum:
- API key authentication: Simple `Authorization: Bearer <key>` header check
- Session-based auth: Login page with username/password
- OAuth: Google/GitHub login via FastAPI middleware
3. Persistent Storage
Everything in LLMChat is in-memory: chat histories, RAG indexes, vision overrides, document metadata. Restart the server and it's all gone.
```python
# Current: all in-memory
user_histories = {}                       # Chat histories
vision_capability_overrides = {}          # Vision overrides
document_metadata = {}                    # RAG document metadata
qdrant_client = QdrantClient(":memory:")  # Vector database
```

For persistence, consider:
| Data | Current | Production Option |
|---|---|---|
| Chat histories | Python dict | SQLite via aiosqlite or PostgreSQL |
| RAG vectors | Qdrant in-memory | Qdrant persistent mode or hosted Qdrant |
| Document metadata | Python dict | SQLite or same DB as chat histories |
| Vision overrides | Python dict | Redis or DB-backed session store |
| Uploaded files | Local filesystem | S3-compatible storage or persistent volume |
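For the chat-history row, even the stdlib `sqlite3` module gets you durability with no new dependencies. A sketch under my own schema (table layout and function names are illustrative, not LLMChat's actual code):

```python
import json
import sqlite3

def init_db(path: str = "llmchat.db") -> sqlite3.Connection:
    """Open (or create) the SQLite database and the history table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chat_history (
               user_id TEXT,
               turn    INTEGER,
               message TEXT,           -- JSON-encoded message dict
               PRIMARY KEY (user_id, turn)
           )"""
    )
    return conn

def save_turn(conn: sqlite3.Connection, user_id: str, turn: int, message: dict) -> None:
    conn.execute(
        "INSERT OR REPLACE INTO chat_history VALUES (?, ?, ?)",
        (user_id, turn, json.dumps(message)),
    )
    conn.commit()

def load_history(conn: sqlite3.Connection, user_id: str) -> list[dict]:
    rows = conn.execute(
        "SELECT message FROM chat_history WHERE user_id = ? ORDER BY turn",
        (user_id,),
    )
    return [json.loads(m) for (m,) in rows]
```

Swap `sqlite3` for `aiosqlite` if you want the calls to be async like the rest of the FastAPI handlers.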
4. Environment Variable Management
Keep sensitive values out of code:
```
# Required
OPENAI_API_KEY=your-key-here
OPENAI_API_BASE=http://your-server:8000/v1

# Optional but recommended
TAVILY_API_KEY=tvly-xxx
MAX_TOKENS=2048
MAX_HISTORY_TURNS_TEXT=6
MAX_HISTORY_TURNS_VISION=2
IMAGE_MAX_DIMENSION=2048
IMAGE_QUALITY=85
```

Use `.env` for local development and proper secrets management (Docker secrets, cloud provider secret manager) for production.
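Reading those variables with typed defaults keeps the config surface explicit. A small helper sketch (`env_int` is my name for it, not something in LLMChat):

```python
import os

def env_int(name: str, default: int) -> int:
    """Read an integer setting from the environment, with a fallback default."""
    raw = os.environ.get(name)
    return int(raw) if raw is not None else default

# Names mirror the variables listed above.
MAX_TOKENS = env_int("MAX_TOKENS", 2048)
MAX_HISTORY_TURNS_TEXT = env_int("MAX_HISTORY_TURNS_TEXT", 6)
MAX_HISTORY_TURNS_VISION = env_int("MAX_HISTORY_TURNS_VISION", 2)
IMAGE_QUALITY = env_int("IMAGE_QUALITY", 85)
```

Libraries like pydantic-settings do the same with validation built in, but for a handful of values the stdlib is plenty.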
5. Error Handling and Logging
The current codebase uses print() for logging. For production:
```python
import logging

logger = logging.getLogger("llmchat")
logger.setLevel(logging.INFO)

# Replace print() calls
logger.info(f"Loading embedding model: {model_name}")
logger.warning(f"Image compression failed for {file_path.name}: {e}")
logger.error(f"Vision probe failed for {model_id}: {e}")
```

Add structured logging with request IDs for debugging multi-user issues.
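One dependency-free way to get per-request IDs is a `contextvars`-backed logging filter: set the ID once at the start of each request, and every log line emitted while handling that request carries it automatically. A sketch with illustrative names:

```python
import contextvars
import logging
import uuid

# Holds the current request's ID; async-safe, unlike a plain global.
request_id_var = contextvars.ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Stamp every log record with the current request ID."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True

logger = logging.getLogger("llmchat")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

def new_request_id() -> str:
    """Call at the start of each request (e.g. in FastAPI middleware)."""
    rid = uuid.uuid4().hex[:8]
    request_id_var.set(rid)
    return rid
```

Because `ContextVar` values are scoped per async task, concurrent requests don't clobber each other's IDs.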
Known Limitations: An Honest Assessment
I've been honest throughout this series about what works and what doesn't. Here's the full picture:
Architecture Limitations
| Limitation | Impact | Mitigation |
|---|---|---|
| In-memory state | All data lost on restart | Migrate to persistent storage |
| No automated tests | tests/ directory = screenshots only | Add pytest suite for API endpoints |
| CORS wide open | Security risk in production | Restrict origins |
| No auth | Anyone can access, impersonate users | Add authentication layer |
| Single-process | Can't scale horizontally | Put behind load balancer with sticky sessions |
Feature Limitations
| Limitation | Impact | Possible Fix |
|---|---|---|
| PDF-only RAG | Can't index Word, HTML, or web pages | Add document format handlers |
| No re-ranking | RAG results ordered by raw similarity only | Add cross-encoder second pass |
| Vision probe delay | 1-2s per model on first load | Background probe on startup |
| Image compression edge cases | Very large images (5MB+) can fail | Better progressive resizing |
| Legacy duplicate code | Some functions exist in both app.js and modular files | Complete the restructuring |
I list these not to discourage you, but because knowing the limits is how you know where to improve. For a personal self-hosted tool, most of these are non-issues. For a team deployment, pick the ones that matter to you and address them.
The Full Self-Hosted AI Ecosystem
Let's zoom out and see how everything connects. Over the past year, I've built three pieces that together form a complete self-hosted AI stack:
The Three Pillars
1. Serving Engines >> Self-Hosting LLMs: A Guide to vLLM, SGLang, and Llama.cpp
The foundation. Models need to be loaded into memory and exposed via API. vLLM for high-throughput production, SGLang for structured generation, Llama.cpp for CPU/edge deployment.
2. Chat Interface >> LLMChat (this series)
The frontend. A flexible chat UI that connects to any of those serving engines, augments queries with RAG and web search, handles vision models, and even runs small models directly in the browser.
3. Code Completion >> CodeContinue: LLM-Powered Sublime Text Autocomplete
The developer tool. A Sublime Text plugin that uses the same local LLM infrastructure for intelligent 1-2 line code suggestions. Same philosophy: your model, your rules.
How They Connect
All three tools speak OpenAI-compatible API. This means:
- Any model served by vLLM, SGLang, or Llama.cpp works with both LLMChat and CodeContinue
- You can run one model server and point multiple tools at it
- Switching models is just changing a URL or model name, no reconfiguration
This is the power of open standards. The OpenAI API format has become the lingua franca of LLM tooling, and every piece in this stack speaks it natively.
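Concretely, the request shape is identical against every backend; only the base URL changes. A stdlib-only sketch that builds (but doesn't send) an OpenAI-compatible request; the API key value is a placeholder, since local servers typically ignore it:

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-compatible /chat/completions request for any backend."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed-for-local",  # placeholder
        },
        method="POST",
    )

# The same call works for vLLM, SGLang, or Llama.cpp's server;
# only the base URL changes:
req = chat_request("http://localhost:8000/v1", "my-model", "Hello")
```

Send it with `urllib.request.urlopen(req)` (or any HTTP client) once a server is actually listening on that URL.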
What I'd Build Next
If I were spending another month on this project, here's my priority list:
1. Persistent storage: SQLite for chat histories and RAG metadata. This is the single biggest improvement for daily use.
2. Multi-user auth: Simple username/password with session tokens. Not enterprise SSO, just enough that two people can use the same deployment without interfering with each other.
3. Docker Compose: One-command deployment: `docker compose up`. Bundle LLMChat + vLLM + Qdrant in a single `docker-compose.yml`.
4. More RAG formats: `.docx`, `.html`, and web scraping. The pipeline is already modular enough to add new extractors.
5. Fine-tuned models: Domain-specific fine-tunes served alongside general models. The model dropdown already supports multiple models, so fine-tunes would just be another entry.
6. Mobile-responsive UI: The current CSS works on desktop but gets cramped on phones. A responsive redesign would make browser inference genuinely useful on mobile.
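For a rough idea of the Docker Compose item, here's an untested sketch: the service layout, model name, and environment wiring are my assumptions, though the `vllm/vllm-openai` and `qdrant/qdrant` images are real:

```yaml
services:
  llmchat:
    build: .                               # hypothetical Dockerfile for LLMChat
    ports: ["3000:3000"]
    environment:
      - OPENAI_API_BASE=http://vllm:8000/v1
      - OPENAI_API_KEY=not-needed-for-local
    depends_on: [vllm, qdrant]
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "your-model-of-choice"]   # placeholder model name
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  qdrant:
    image: qdrant/qdrant:latest
    volumes:
      - qdrant_data:/qdrant/storage        # persistent vectors across restarts
volumes:
  qdrant_data:
```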
Final Thoughts
When I started this project, I just wanted a simple interface to test new models. It grew into something bigger: a complete self-hosted AI chat system with RAG, vision, web search, and browser inference.
But the core philosophy never changed:
- Plug and play: Any model, any endpoint, no reconfiguration
- Privacy first: Your data stays on your infrastructure (or in your browser)
- Honest about limitations: In-memory storage, no auth, no tests, but functional and useful
- Building blocks, not monoliths: Each component (serving, chat, code completion) works independently
The open-source LLM ecosystem has given us incredible models. What's been missing is the connective tissue. The tools that let you use those models productively. LLMChat is my contribution to that layer.
GitHub: https://github.com/kXborg/LLMChat
Try it. Fork it. Break it. Improve it. And if you build something cool on top of it, I'd love to hear about it.
Your hardware. Your models. Your data. Your rules. 🚀
P.S.: This is the end of the LLMChat series, but not the end of the project. I have plans for edge deployment (Jetson, Raspberry Pi) that'll get their own dedicated post. Stay tuned! And if you've been following along, you now have everything you need to run, extend, and deploy your own self-hosted AI chat system. Happy hacking 🛠️.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.