
LLMChat: Deploying Your Self-Hosted AI Stack

You've built it. Now deploy it. From Cloudflared tunnels to production hardening, this post ties the entire self-hosted AI ecosystem together: serving engines, chat interface, and code completion.

AI/ML · 9 min read · Author: Kukil Kashyap Borgohain

From Local to Everywhere

Over the past four posts, we've built something substantial:

  • Part 1: A self-hosted chat interface with streaming and smart context management
  • Part 2: A RAG pipeline for chatting with your PDFs
  • Part 3: Vision model support, web search integration, and smart fallbacks
  • Part 4: In-browser LLM inference via WebGPU and WASM

All of this runs beautifully on localhost:3000. But what about sharing it with your team? Accessing it from your phone? Running it on a remote server?

This final post covers deployment: getting LLMChat running beyond your local machine, hardening it for real-world use, and stepping back to see how all the pieces of a self-hosted AI stack fit together.

The Deployment Spectrum

Not every deployment needs a Kubernetes cluster. Here's how I think about it:

| Scenario | Setup | Complexity |
|---|---|---|
| Personal use | uvicorn on your machine | Trivial |
| Local network / team | uvicorn + bind to 0.0.0.0 | Low |
| Internet-facing | uvicorn + Cloudflared tunnel | Medium |
| Production | Docker + reverse proxy + auth | Higher |

For most personal and small-team use cases, you don't need Docker, Nginx, or any of the production scaffolding. Let's start with the simplest path to sharing LLMChat outside localhost.
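The second row of the table is the whole setup for LAN sharing. Assuming the FastAPI app lives in `app.py` as `app` (adjust the module path to your actual entry point):

```shell
# Bind to all interfaces so other devices on your network can reach it.
# "app:app" is an assumed module path, not necessarily the project's.
uvicorn app:app --host 0.0.0.0 --port 3000
```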

Exposing LLMChat with Cloudflared

Cloudflared is a free tool from Cloudflare that creates secure tunnels to your local services. No port forwarding, no static IPs, no DNS configuration needed.

Install Cloudflared

```bash
# Windows
winget install --id Cloudflare.cloudflared

# macOS
brew install cloudflared

# Linux (Debian/Ubuntu)
curl -L --output cloudflared.deb \
    https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-amd64.deb
sudo dpkg -i cloudflared.deb
```

Create a Quick Tunnel

With LLMChat running on port 3000, expose it with a single command:

```bash
cloudflared tunnel --url http://localhost:3000
```

Cloudflared generates a temporary public URL:

```text
Your quick Tunnel has been created!
+---------------------------------------------------------------+
|  Your free tunnel URL: https://random-words.trycloudflare.com |
+---------------------------------------------------------------+
```

Anyone with this URL can now access your LLMChat instance from anywhere in the world. The tunnel encrypts traffic with TLS and proxies requests through Cloudflare's network.

If you're also running a model server (vLLM on port 8000), you can tunnel that too:

```bash
# Terminal 1: LLMChat
cloudflared tunnel --url http://localhost:3000

# Terminal 2: Model server (if remote access needed)
cloudflared tunnel --url http://localhost:8000
```

I covered Cloudflared in detail in my Self-Hosting LLMs guide. Check that post for persistent tunnel setup with Cloudflare Zero Trust.

Cloudflared tunnel setup

Security Considerations

Quick tunnels are publicly accessible. Anyone with the URL can use your LLMChat. For personal use, this is fine since the URL is random and temporary. But for anything more:

Use Cloudflare Zero Trust for:

  • Email-based authentication (only your team can access)
  • IP whitelisting
  • Session management
  • Audit logging

Or add your own auth layer (more on that in the hardening section below).

Hardening for Production

LLMChat was built for personal use and rapid prototyping. If you're deploying it for a team or exposing it to the internet, there are several areas to tighten.

1. CORS Lockdown

The current config allows everything:

```python
# Current (development)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],        # tighten in production
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```

For production, restrict to your domain:

```python
# Production
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-domain.com"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "DELETE"],
    allow_headers=["Content-Type", "Authorization"],
)
```

2. Authentication

Currently, user_id is generated client-side with Math.random(). There's no verification:

```javascript
// Current: anyone can claim any user_id
let userId = "user-" + Math.random().toString(36).substring(2, 8);
```

For a shared deployment, add at minimum:

  • API key authentication: Simple Authorization: Bearer <key> header check
  • Session-based auth: Login page with username/password
  • OAuth: Google/GitHub login via FastAPI middleware

3. Persistent Storage

Everything in LLMChat is in-memory: chat histories, RAG indexes, vision overrides, document metadata. Restart the server and it's all gone.

```python
# Current: all in-memory
user_histories = {}                       # Chat histories
vision_capability_overrides = {}          # Vision overrides
document_metadata = {}                    # RAG document metadata
qdrant_client = QdrantClient(":memory:")  # Vector database
```

For persistence, consider:

| Data | Current | Production Option |
|---|---|---|
| Chat histories | Python dict | SQLite via aiosqlite or PostgreSQL |
| RAG vectors | Qdrant in-memory | Qdrant persistent mode or hosted Qdrant |
| Document metadata | Python dict | SQLite or same DB as chat histories |
| Vision overrides | Python dict | Redis or DB-backed session store |
| Uploaded files | Local filesystem | S3-compatible storage or persistent volume |
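For the first row, the migration is small. A sketch of a SQLite-backed replacement for the `user_histories` dict using the stdlib `sqlite3` module; the table and function names are assumptions, and the real code would likely use aiosqlite for async access:

```python
import sqlite3

def init_db(path: str = "llmchat.db") -> sqlite3.Connection:
    """Open (or create) the chat-history database."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS messages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_id TEXT NOT NULL,
               role TEXT NOT NULL,
               content TEXT NOT NULL
           )"""
    )
    return conn

def append_message(conn, user_id: str, role: str, content: str) -> None:
    # "with conn" commits on success, rolls back on error
    with conn:
        conn.execute(
            "INSERT INTO messages (user_id, role, content) VALUES (?, ?, ?)",
            (user_id, role, content),
        )

def load_history(conn, user_id: str) -> list[dict]:
    """Return the user's messages in insertion order, OpenAI-style."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE user_id = ? ORDER BY id",
        (user_id,),
    ).fetchall()
    return [{"role": r, "content": c} for r, c in rows]
```

Histories now survive restarts, and the rest of the app keeps consuming the same list-of-dicts shape it already uses.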

4. Environment Variable Management

Keep sensitive values out of code:

```bash
# Required
OPENAI_API_KEY=your-key-here
OPENAI_API_BASE=http://your-server:8000/v1

# Optional but recommended
TAVILY_API_KEY=tvly-xxx
MAX_TOKENS=2048
MAX_HISTORY_TURNS_TEXT=6
MAX_HISTORY_TURNS_VISION=2
IMAGE_MAX_DIMENSION=2048
IMAGE_QUALITY=85
```

Use .env for local development and proper secrets management (Docker secrets, cloud provider secret manager) for production.
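Reading those values in Python then becomes a handful of `os.getenv` calls; the defaults below are illustrative assumptions, not the project's actual values:

```python
import os

# Fall back to sane defaults when a variable is unset (defaults assumed)
OPENAI_API_BASE = os.getenv("OPENAI_API_BASE", "http://localhost:8000/v1")
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "2048"))
MAX_HISTORY_TURNS_TEXT = int(os.getenv("MAX_HISTORY_TURNS_TEXT", "6"))
```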

5. Error Handling and Logging

The current codebase uses print() for logging. For production:

```python
import logging

logging.basicConfig(level=logging.INFO)  # attach a handler; tune format as needed
logger = logging.getLogger("llmchat")

# Replace print() calls; lazy %-formatting avoids building unused strings
logger.info("Loading embedding model: %s", model_name)
logger.warning("Image compression failed for %s: %s", file_path.name, e)
logger.error("Vision probe failed for %s: %s", model_id, e)
```

Add structured logging with request IDs for debugging multi-user issues.
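One way to sketch that: a contextvars-backed logging filter that stamps every line with the current request's ID. The middleware that calls `request_id.set(...)` at the start of each request is assumed, not shown:

```python
import logging
from contextvars import ContextVar

# Holds the current request's ID; "-" when outside any request
request_id: ContextVar[str] = ContextVar("request_id", default="-")

class RequestIdFilter(logging.Filter):
    """Inject the contextvar value into every log record."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id.get()
        return True  # never drop records, only annotate them

logger = logging.getLogger("llmchat")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s %(request_id)s %(levelname)s %(message)s")
)
handler.addFilter(RequestIdFilter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# In a FastAPI middleware you would call request_id.set(uuid.uuid4().hex[:8])
# per incoming request; contextvars keep it isolated across async tasks.
```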

Known Limitations: An Honest Assessment

I've been honest throughout this series about what works and what doesn't. Here's the full picture:

Architecture Limitations

| Limitation | Impact | Mitigation |
|---|---|---|
| In-memory state | All data lost on restart | Migrate to persistent storage |
| No automated tests | tests/ directory = screenshots only | Add pytest suite for API endpoints |
| CORS wide open | Security risk in production | Restrict origins |
| No auth | Anyone can access, impersonate users | Add authentication layer |
| Single-process | Can't scale horizontally | Put behind load balancer with sticky sessions |

Feature Limitations

| Limitation | Impact | Possible Fix |
|---|---|---|
| PDF-only RAG | Can't index Word, HTML, or web pages | Add document format handlers |
| No re-ranking | RAG results ordered by raw similarity only | Add cross-encoder second pass |
| Vision probe delay | 1-2 s per model on first load | Background probe on startup |
| Image compression edge cases | Very large images (5 MB+) can fail | Better progressive resizing |
| Legacy duplicate code | Some functions exist in both app.js and modular files | Complete the restructuring |

I list these not to discourage, but because knowing the limits is how you know where to improve. For a personal self-hosted tool, most of these are non-issues. For a team deployment, pick the ones that matter to you and address them.

The Full Self-Hosted AI Ecosystem

Let's zoom out and see how everything connects. Over the past year, I've built three pieces that together form a complete self-hosted AI stack:


The Three Pillars

1. Serving Engines >> Self-Hosting LLMs: A Guide to vLLM, SGLang, and Llama.cpp

The foundation. Models need to be loaded into memory and exposed via API. vLLM for high-throughput production, SGLang for structured generation, Llama.cpp for CPU/edge deployment.

2. Chat Interface >> LLMChat (this series)

The frontend. A flexible chat UI that connects to any of those serving engines, augments queries with RAG and web search, handles vision models, and even runs small models directly in the browser.

3. Code Completion >> CodeContinue: LLM-Powered Sublime Text Autocomplete

The developer tool. A Sublime Text plugin that uses the same local LLM infrastructure for intelligent 1-2 line code suggestions. Same philosophy: your model, your rules.

How They Connect

All three tools speak OpenAI-compatible API. This means:

  • Any model served by vLLM, SGLang, or Llama.cpp works with both LLMChat and CodeContinue
  • You can run one model server and point multiple tools at it
  • Switching models is just changing a URL or model name, no reconfiguration

This is the power of open standards. The OpenAI API format has become the lingua franca of LLM tooling, and every piece in this stack speaks it natively.
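Concretely, every server in the stack accepts the same `POST /v1/chat/completions` payload, so one request builder works against any of them. A stdlib-only sketch; the base URL and model name below are placeholders:

```python
import json
from urllib.request import Request

def build_chat_request(base_url: str, model: str, messages: list[dict]) -> Request:
    """Build an OpenAI-style chat-completions request for any compatible server."""
    payload = json.dumps({"model": model, "messages": messages}).encode()
    return Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=payload,
        headers={
            "Content-Type": "application/json",
            # Local servers usually ignore the key, but the header must exist
            "Authorization": "Bearer not-needed-for-local",
        },
        method="POST",
    )

# Same call shape whether the backend is vLLM, SGLang, or Llama.cpp;
# only base_url and the model name change.
req = build_chat_request(
    "http://localhost:8000/v1",
    "Qwen2.5-7B-Instruct",  # placeholder model name
    [{"role": "user", "content": "Hello"}],
)
```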

What I'd Build Next

If I were spending another month on this project, here's my priority list:

  1. Persistent storage: SQLite for chat histories and RAG metadata. This is the single biggest improvement for daily use.

  2. Multi-user auth: Simple username/password with session tokens. Not enterprise SSO, just enough that two people can use the same deployment without interfering with each other.

  3. Docker Compose: One-command deployment with docker compose up. Bundle LLMChat + vLLM + Qdrant in a single docker-compose.yml.

  4. More RAG formats: .docx, .html, and web scraping. The pipeline is already modular enough to add new extractors.

  5. Fine-tuned models: Domain-specific fine-tunes served alongside general models. The model dropdown already supports multiple models, so fine-tunes would just be another entry.

  6. Mobile-responsive UI: The current CSS works on desktop but gets cramped on phones. A responsive redesign would make browser inference genuinely useful on mobile.

Final Thoughts

When I started this project, I just wanted a simple interface to test new models. It grew into something bigger: a complete self-hosted AI chat system with RAG, vision, web search, and browser inference.

But the core philosophy never changed:

  • Plug and play: Any model, any endpoint, no reconfiguration
  • Privacy first: Your data stays on your infrastructure (or in your browser)
  • Honest about limitations: In-memory storage, no auth, no tests, but functional and useful
  • Building blocks, not monoliths: Each component (serving, chat, code completion) works independently

The open-source LLM ecosystem has given us incredible models. What's been missing is the connective tissue: the tools that let you use those models productively. LLMChat is my contribution to that layer.

GitHub: https://github.com/kXborg/LLMChat

Try it. Fork it. Break it. Improve it. And if you build something cool on top of it, I'd love to hear about it.

Your hardware. Your models. Your data. Your rules. 🚀


P.S.: This is the end of the LLMChat series, but not the end of the project. I have plans for edge deployment (Jetson, Raspberry Pi) that'll get their own dedicated post. Stay tuned! And if you've been following along, you now have everything you need to run, extend, and deploy your own self-hosted AI chat system. Happy hacking 🛠️.

If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.

If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.