LLMChat: Vision Models, Web Search, and Smart Fallbacks
Text-only chat is limiting. LLMChat now auto-detects vision models, integrates Tavily web search, and gracefully falls back when things go wrong. Here's how it all works under the hood.

Beyond Text: Making LLMChat Multi-Modal
In Part 1, we built the core chat interface. In Part 2, we added the ability to chat with documents via RAG. But text-only interactions are limiting.
Sometimes you want to ask: "What's in this image?" or "Can you describe this diagram?" And sometimes the LLM's training data isn't enough. You need real-time information from the web.
This post covers two major upgrades to LLMChat:
- Vision model support: auto-detecting which models can handle images, with a smart three-tier override system
- Web search integration: grounding LLM responses in live web data via Tavily
Plus, the fallback strategy that keeps LLMChat useful even when your model server goes down.
The Vision Problem
Here's a problem that got annoying fast: not all models support images.
You serve Falcon3-7B on vLLM. Text only. You serve InternVL3-8B. Vision capable. You switch between them in the dropdown. But if you've attached an image and switch to a text-only model, what happens?
Some approaches:
- Crash: Send the image anyway and get a cryptic error. Bad UX.
- Manual config: Force users to tag each model as "vision" or "text only". Tedious.
- Auto-detect: Probe the model at startup to figure out what it supports. This is what LLMChat does.
Auto-Probing Vision Capability
When LLMChat fetches the model list from your endpoint, it probes each model to determine if it supports image inputs:
```python
import requests

# Cache probe results so each model is only probed once per session
VISION_PROBE_CACHE: dict[str, bool] = {}

def probe_vision_capability(base_url: str, model_id: str, timeout: int = 5) -> bool:
    """Send a test vision payload to check if the model supports images."""
    if model_id in VISION_PROBE_CACHE:
        return VISION_PROBE_CACHE[model_id]

    payload = {
        "model": model_id,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": "about:blank"},
                {"type": "text", "text": "ping"},
            ],
        }],
        "max_tokens": 1,
    }

    try:
        r = requests.post(
            f"{base_url}/v1/chat/completions",
            json=payload,
            timeout=timeout,
        )

        if r.status_code == 200:
            VISION_PROBE_CACHE[model_id] = True
            return True

        # Check the error message for vision-related keywords
        error_text = r.text.lower()
        if any(k in error_text for k in [
            "image", "vision", "multimodal", "image_url", "image token"
        ]):
            VISION_PROBE_CACHE[model_id] = True
            return True

        VISION_PROBE_CACHE[model_id] = False
        return False

    except requests.RequestException:
        VISION_PROBE_CACHE[model_id] = False
        return False
```

The trick is in the error handling. We send an about:blank image URL, which will always fail. But how it fails reveals whether the model understands vision:
- 200 OK: The model processed it (unlikely with about:blank, but we handle it)
- Error mentioning "image", "vision", or "multimodal": The model understands vision inputs but couldn't process this one. It's vision-capable
- Generic error with no mention of images: Text-only model

Results are cached in VISION_PROBE_CACHE so we only probe each model once.
The Three-Tier Priority System
But auto-detection isn't perfect. Some models may be misclassified. That's why LLMChat uses a three-tier priority system for vision capability:
Per-request override (highest) → Stored user override (medium) → Probed capability (lowest)
```python
def get_vision_capability_from_request(base_url, user_id, model_id,
                                       vision_capability_overrides,
                                       vision_enabled_override=None):
    # Tier 1: Per-request override (from the toggle in the UI)
    if vision_enabled_override is not None:
        return vision_enabled_override

    # Tier 2: Stored user override (persisted in session)
    override_key = f"{user_id}:{model_id}"
    override = vision_capability_overrides.get(override_key)
    if override is not None:
        return override

    # Tier 3: Auto-probed capability
    return probe_vision_capability(base_url, model_id)
```

In the UI, there's a vision toggle next to the model dropdown. Toggle it on, and your override takes priority over everything else. The model list shows "Vision-capable" or "Text-only" based on the probe, and "Vision-capable (override)" or "Text-only (override)" when you've toggled it manually.
This means you never get stuck. Auto-detection handles 95% of cases. The toggle handles the rest.
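To make the precedence concrete, here's a self-contained sketch of how the three tiers resolve. These are simplified stand-ins for the real functions, and the probe is stubbed out so the example runs offline:

```python
# Stub: pretend the network probe classified the model as text-only
def probe_vision_capability(base_url, model_id):
    return False

def resolve_vision_capability(base_url, user_id, model_id,
                              stored_overrides, per_request=None):
    if per_request is not None:                    # Tier 1: UI toggle
        return per_request
    override = stored_overrides.get(f"{user_id}:{model_id}")
    if override is not None:                       # Tier 2: stored override
        return override
    return probe_vision_capability(base_url, model_id)  # Tier 3: probe

overrides = {"alice:falcon3-7b": True}
url = "http://localhost:8000"

tier3 = resolve_vision_capability(url, "bob", "falcon3-7b", {})           # False: probe decides
tier2 = resolve_vision_capability(url, "alice", "falcon3-7b", overrides)  # True: stored override
tier1 = resolve_vision_capability(url, "alice", "falcon3-7b", overrides,
                                  per_request=False)                      # False: toggle beats all
```

The per-request toggle always wins, even against a stored override, which is what lets you recover instantly from a bad classification.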

Vision in Practice
Image Attachments and Compression
When you attach an image, LLMChat doesn't just forward it blindly. It runs a compression pipeline to keep things manageable:
```python
import io
from PIL import Image

def compress_image(file_path, mime_type, size_threshold=500*1024,
                   max_dimension=2048, quality=85):
    """Compress an image in place if it exceeds the size threshold."""
    file_size = file_path.stat().st_size
    if file_size < size_threshold:
        return  # Under 500 KB, no compression needed

    img = Image.open(file_path)

    # Convert RGBA to RGB for JPEG (no transparency support)
    if img.mode in ("RGBA", "LA", "P"):
        if img.mode == "P":
            img = img.convert("RGBA")  # palette images need an alpha band for the mask
        rgb_img = Image.new("RGB", img.size, (255, 255, 255))
        rgb_img.paste(img, mask=img.split()[-1])
        img = rgb_img

    # Resize if dimensions exceed 2048 px
    if img.width > max_dimension or img.height > max_dimension:
        img.thumbnail((max_dimension, max_dimension), Image.Resampling.LANCZOS)

    # Re-encode with quality tuning
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality, optimize=True)

    # Only write back if the result is smaller
    if buffer.tell() < file_size:
        file_path.write_bytes(buffer.getvalue())
```

The settings are configurable via environment variables:
| Variable | Default | Description |
|---|---|---|
| IMAGE_SIZE_THRESHOLD | 500 KB | Compression kicks in above this |
| IMAGE_MAX_SIZE_THRESHOLD | 1 MB | Hard limit for uploads |
| IMAGE_MAX_DIMENSION | 2048 px | Max width or height |
| IMAGE_QUALITY | 85 | JPEG/WebP quality (1-100) |
Known limitation: Very large images (5MB+, 4096x4096) can still cause issues. The compression pipeline handles most cases, but there's a TODO in the code for edge cases with extreme dimensions.
Smart History Management for Vision
Here's a subtle but important detail: vision models use way more context per message than text-only models, because image tokens are expensive. LLMChat adapts:
```python
# Text-only models: keep 6 conversation turns in history
MAX_HISTORY_TURNS_TEXT = 6

# Vision models: only 2 turns (because images eat context fast)
MAX_HISTORY_TURNS_VISION = 2
```

The system automatically adjusts based on the current model's capabilities. When you switch from a text model to a vision model, the history window shrinks to prevent context overflow.
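The pruning itself can be very small. Here's a minimal sketch (prune_history is a hypothetical helper, not lifted from LLMChat's source), assuming one conversation turn is a user/assistant message pair:

```python
MAX_HISTORY_TURNS_TEXT = 6
MAX_HISTORY_TURNS_VISION = 2

def prune_history(messages, vision_capable):
    """Keep only the last N turns; one turn = 2 messages (user + assistant)."""
    max_turns = MAX_HISTORY_TURNS_VISION if vision_capable else MAX_HISTORY_TURNS_TEXT
    return messages[-(max_turns * 2):]

# 10 turns of alternating user/assistant messages
history = [{"role": r, "content": f"msg {i}"}
           for i in range(10) for r in ("user", "assistant")]

text_window = prune_history(history, vision_capable=False)    # 12 messages (6 turns)
vision_window = prune_history(history, vision_capable=True)   # 4 messages (2 turns)
```

Slicing from the end keeps the most recent turns, which is what matters for conversational continuity.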
Building Multi-Modal Messages
For vision models, the message format changes from a simple string to a structured array:
```python
def build_user_content(user_message, attachments, model_id, ...):
    vision_capable = get_vision_capability(user_id, model_id)

    if vision_capable:
        parts = [{"type": "text", "text": user_message}]
        for att in attachments or []:
            if att.mime_type.startswith("image/"):
                # Inline as base64 so the model server doesn't need
                # to fetch from our FastAPI host
                raw = att.file_path.read_bytes()
                b64 = base64.b64encode(raw).decode("ascii")
                url = f"data:{att.mime_type};base64,{b64}"

                parts.append({
                    "type": "image_url",
                    "image_url": {"url": url}
                })
        return parts  # Structured content for vision models

    # Text-only: merge everything into a single string
    return user_message  # Simple string for text models
```

For text-only models, images are stripped and text attachments are merged into the message as plain text. No errors, no crashes. Just graceful degradation.
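The text-only merge path is elided above. Here's a plausible sketch of what "merged into the message as plain text" looks like; the attachment shape (dicts with mime_type, filename, text keys) is an assumption for illustration:

```python
def merge_text_attachments(user_message, attachments):
    """Fold text attachments into one string; silently drop images."""
    parts = [user_message]
    for att in attachments or []:
        if att["mime_type"].startswith("text/") and att.get("text"):
            parts.append(f"\n--- Attachment: {att['filename']} ---\n{att['text']}")
        # image/* attachments are skipped entirely for text-only models
    return "\n".join(parts)

msg = merge_text_attachments("Summarize this:", [
    {"mime_type": "text/plain", "filename": "notes.txt", "text": "hello world"},
    {"mime_type": "image/png", "filename": "pic.png"},
])
```

The image attachment simply vanishes from the prompt, which is exactly the graceful degradation described above.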
Tested Vision Models
Here are the vision-language models I've tested with LLMChat:
| Model | vLLM Command | Notes |
|---|---|---|
| InternVL3-8B-AWQ | vllm serve OpenGVLab/InternVL3-8B-AWQ --quantization awq --trust-remote-code | Must set --quantization awq flag |
| InternVL3-2B | vllm serve OpenGVLab/InternVL3-2B --trust-remote-code | Smaller, faster, decent quality |
| Qwen2-VL-2B-Instruct | vllm serve Qwen/Qwen2-VL-2B-Instruct | Qwen's vision model |
| Cosmos-Reason2-2B | vllm serve nvidia/Cosmos-Reason2-2B | NVIDIA's reasoning VLM |
| H2O-VL-Mississippi-2B | vllm serve h2oai/h2ovl-mississippi-2b | ⚠️ No system prompt support |
Web Search Integration
LLMs know a lot, but they don't know what happened today. When you ask "What's the latest PyTorch release?" or "Who won the election?", even the best model can only guess based on its training cutoff.
LLMChat solves this with Tavily web search integration. Real-time search results injected directly into the LLM's context.
Setup
Get a free API key from tavily.com and add it to your .env:
```
TAVILY_API_KEY=tvly-your-api-key-here
```

That's it. LLMChat checks if Tavily is configured on startup and enables the web search toggle accordingly.
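A sketch of what that startup check can look like, using the tavily-python client, with the import guarded in case the package isn't installed:

```python
import os

tavily_client = None
api_key = os.getenv("TAVILY_API_KEY")
if api_key:
    try:
        from tavily import TavilyClient  # pip install tavily-python
        tavily_client = TavilyClient(api_key=api_key)
    except ImportError:
        pass  # package missing: web search stays disabled

WEB_SEARCH_AVAILABLE = tavily_client is not None
```

Everything downstream (the toggle, the fallback path) just checks whether tavily_client is None.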

How It Works
When web search is enabled and you send a message, LLMChat queries Tavily before calling the LLM:
```python
def perform_web_search(tavily_client, query, max_results=5):
    if not tavily_client:
        return ""

    response = tavily_client.search(
        query=query,
        search_depth="basic",
        max_results=max_results,
        include_answer=True,  # Get a synthesized direct answer
    )

    parts = []

    # Include Tavily's direct answer if available
    if response.get("answer"):
        parts.append(f"**Direct Answer:** {response['answer']}")

    # Include individual search results
    results = response.get("results", [])
    if results:
        parts.append("\n**Web Search Results:**")
        for i, r in enumerate(results, 1):
            title = r.get("title", "")
            url = r.get("url", "")
            content = r.get("content", "")[:500]
            parts.append(f"\n{i}. **{title}**\n   URL: {url}\n   {content}")

    return "\n".join(parts)
```

The search results get injected into the user's message as context:
```python
if web_search:
    search_results = perform_web_search(user_message)
    if search_results:
        context_parts.append(f"**[Web Search Context]**\n{search_results}")

# Final message to the LLM
augmented_message = (
    f"{user_message}\n\n---\n{combined_context}\n---\n\n"
    "Please use the above context to help answer my question. "
    "Cite sources when relevant."
)
```

The LLM sees the user's question plus relevant web results, and generates a response grounded in current information. It's not perfect (the model can still hallucinate), but it's dramatically better than relying on training data alone.
Graceful Degradation
If Tavily isn't configured, the web search toggle is disabled in the UI with a tooltip:
```javascript
async function initWebSearch() {
    const available = await Net.checkWebSearchStatus();
    if (webSearchToggle) {
        webSearchToggle.disabled = !available;
        if (!available) {
            webSearchToggle.title = "Web search is not available";
            webSearchToggle.parentElement.style.opacity = "0.5";
        }
    }
}
```

The backend exposes a /search/status endpoint that returns {"available": true/false, "provider": "tavily"}. The frontend uses this to show or hide the toggle. No errors, no broken state.
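On the backend side, the status check barely needs any code. Here's a sketch of the response-building logic; the real route is a FastAPI endpoint, shown here as a plain function so the payload shape is clear:

```python
def search_status(tavily_client):
    """Payload for GET /search/status."""
    return {
        "available": tavily_client is not None,
        "provider": "tavily",
    }

offline = search_status(None)        # {"available": False, "provider": "tavily"}
online = search_status(object())     # any non-None client counts as available
```

Keeping the check this dumb is deliberate: the frontend shouldn't need to know anything about Tavily beyond "is it there".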

The Fallback Strategy
Here's my favorite part of LLMChat's resilience layer: when your model goes down, LLMChat doesn't just show an error. It falls back to web search.
Model Unavailable → Web Search Fallback
```python
except Exception as e:  # NotFoundError or any other failure
    # Model is down or unavailable; try web search as a fallback
    if tavily_client:
        search_results = perform_web_search(req.message)
        if search_results:
            fallback_answer = (
                "**Model is not available, results are from web:**\n\n"
                + search_results
            )
            return ChatResponse(reply=fallback_answer)

    # No web search available: surface the original error
    raise HTTPException(
        status_code=400,
        detail=f"Model '{model_name}' not found on server."
    )
```

In the streaming endpoint, this works similarly. The generator catches the error and yields web search results instead:
```python
except NotFoundError:
    if tavily_client:
        search_results = perform_web_search(req.message)
        if search_results:
            yield f"**Model is not available, results are from web:**\n\n{search_results}"
            return
    yield f"Model '{model_name}' not found on server."
    return
```

This means: if you're chatting and your vLLM server crashes (GPU OOM, SSH disconnects, etc.), LLMChat doesn't just die. It transparently falls back to web search and gives you something useful while you restart the model server.
Context Length Exceeded → Auto-Retry
We covered this in Part 1, but it's worth repeating in the context of the full fallback strategy:
Request fails →
- Is it a context length error?
  - Yes → Parse the exact allowed tokens → Retry with a smaller max_tokens
  - No → Is Tavily configured?
    - Yes → Return web search results with a disclaimer
    - No → Return the error message
The user almost never sees a raw error. There's always a fallback.
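The token-parsing step deserves a sketch. The exact wording varies by server, so treat the regex below as an assumption based on vLLM/OpenAI-style messages like "This model's maximum context length is 4096 tokens. However, you requested 5000 tokens (900 in the messages, 4100 in the completion)":

```python
import re

def compute_retry_max_tokens(error_text, margin=16):
    """Parse a context-length error and compute a max_tokens that fits.

    Returns None if the message doesn't match the expected format.
    """
    m = re.search(
        r"maximum context length is (\d+) tokens.*?(\d+) in the messages",
        error_text, re.DOTALL,
    )
    if not m:
        return None
    limit, prompt_tokens = int(m.group(1)), int(m.group(2))
    budget = limit - prompt_tokens - margin  # leave a small safety margin
    return budget if budget > 0 else None

err = ("This model's maximum context length is 4096 tokens. However, you "
       "requested 5000 tokens (900 in the messages, 4100 in the completion).")
budget = compute_retry_max_tokens(err)  # 4096 - 900 - 16 = 3180
```

If parsing fails, the code simply moves on to the next fallback tier rather than retrying blind.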

The File Upload System
Supporting all these features requires a robust file upload system. Here's what LLMChat handles:
Supported Formats
```python
ALLOWED_EXTENSIONS = {
    ".txt", ".md", ".markdown",  # Text files (preview extracted)
    ".pdf",                      # PDFs (routed to RAG pipeline)
    ".png", ".jpg", ".jpeg",     # Images (compressed, sent to vision models)
    ".gif", ".webp", ".svg",     # More image formats
}
```

Upload Flow
```python
@app.post("/upload")
async def upload(files: List[UploadFile] = File(...)):
    saved = []
    for uf in files:
        ext = os.path.splitext(uf.filename)[1].lower()

        # Unique filename to avoid collisions
        unique = f"{os.urandom(8).hex()}{ext}"
        dest = UPLOAD_DIR / unique

        # Write file to disk in 1 MB chunks
        with dest.open("wb") as out:
            while True:
                chunk = await uf.read(1024 * 1024)
                if not chunk:
                    break
                out.write(chunk)

        # Compress images that exceed the threshold
        if ext in {".png", ".jpg", ".jpeg", ".gif", ".webp"}:
            compress_image(dest, uf.content_type,
                           IMAGE_SIZE_THRESHOLD, IMAGE_MAX_DIMENSION, IMAGE_QUALITY)

        item = {
            "filename": uf.filename,
            "url": f"/uploads/{unique}",
            "mime_type": uf.content_type,
        }

        # Extract a text preview for text files
        if ext in {".txt", ".md", ".markdown"}:
            txt = dest.read_text(encoding="utf-8", errors="ignore")
            item["text"] = txt[:20000]  # First 20K chars

        saved.append(item)

    return {"files": saved}
```

Key design decisions:
- Random filenames: os.urandom(8).hex() prevents name collisions without a database
- Streaming writes: 1 MB chunks prevent memory issues with large files
- Compression on upload: Images are compressed before storage, so the model server gets reasonably sized images
- Text preview extraction: Text files get their content extracted inline so the LLM can read them without a separate fetch
Tying It All Together
Here's where it gets interesting. A single request to LLMChat can combine multiple context sources: chat history, RAG results, web search, and attachments.
The build_messages() function is the orchestrator. It:
- Retrieves chat history (pruned to model-appropriate length)
- Runs RAG search if enabled
- Runs web search if enabled
- Combines all context into the user message
- Formats attachments for vision or text-only models
- Normalizes the message format for vLLM compatibility
The result is a clean message array that any OpenAI-compatible endpoint can process, regardless of whether it's a vision model, a text model, or a local inference server.
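A compressed sketch of that orchestration follows. The function name matches the post, but the body is a simplified stand-in: pruning, RAG, and web search are reduced to their inputs so the assembly logic stays visible:

```python
def build_messages(user_message, history, *, system_prompt="You are a helpful assistant.",
                   rag_context="", web_context="", max_turns=6):
    """Assemble the final OpenAI-style message array (simplified sketch)."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history[-(max_turns * 2):]           # 1. pruned chat history

    context_parts = []
    if rag_context:                                  # 2. RAG results, if enabled
        context_parts.append(f"**[Document Context]**\n{rag_context}")
    if web_context:                                  # 3. web search, if enabled
        context_parts.append(f"**[Web Search Context]**\n{web_context}")

    content = user_message
    if context_parts:                                # 4. combine into the user message
        combined = "\n\n".join(context_parts)
        content = (f"{user_message}\n\n---\n{combined}\n---\n\n"
                   "Please use the above context to help answer my question.")

    messages.append({"role": "user", "content": content})  # 5-6. final user turn
    return messages

msgs = build_messages("What's the latest PyTorch release?", [],
                      web_context="PyTorch release notes snippet")
```

Steps 5 and 6 (attachment formatting and vLLM normalization) are folded into the final append here; in the real code they transform content before it's attached.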
What's Next
In Part 4, we'll take a completely different approach: running LLMs directly in your browser using WebGPU and Transformers.js. No server. No API. No data leaving your machine. The ultimate privacy play.
P.S.: The vision probe adds a small delay when loading models for the first time, about 1-2 seconds per model. After that, results are cached for the session. If you're impatient, you can always override with the toggle 😉.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.