LLMChat: Vision Models, Web Search, and Smart Fallbacks
Text-only chat is limiting. LLMChat now auto-detects vision models, integrates Tavily web search, and gracefully falls back when things go wrong. Here's how it all works under the hood.

Beyond Text: Making LLMChat Multi-Modal
In Part 1, we built the core chat interface. In Part 2, we added the ability to chat with documents via RAG. But text-only interactions are limiting.
Sometimes you want to ask: "What's in this image?" or "Can you describe this diagram?" And sometimes the LLM's training data isn't enough. You need real-time information from the web.
This post covers two major upgrades to LLMChat:
- Vision model support: auto-detecting which models can handle images, with a smart three-tier override system
- Web search integration: grounding LLM responses in live web data via Tavily
Plus, the fallback strategy that keeps LLMChat useful even when your model server goes down.
The Vision Problem
Here's a problem that got annoying fast: not all models support images.
You serve Falcon3-7B on vLLM. Text only. You serve InternVL3-8B. Vision capable. You switch between them in the dropdown. But if you've attached an image and switch to a text-only model, what happens?
Some approaches:
- Crash: Send the image anyway and get a cryptic error. Bad UX.
- Manual config: Force users to tag each model as "vision" or "text only". Tedious.
- Auto-detect: Probe the model at startup to figure out what it supports. This is what LLMChat does.
Auto-Probing Vision Capability
When LLMChat fetches the model list from your endpoint, it probes each model to determine if it supports image inputs:
```python
import requests

# Cache probe results so each model is only probed once per session
VISION_PROBE_CACHE: dict[str, bool] = {}

def probe_vision_capability(base_url: str, model_id: str, timeout: int = 5) -> bool:
    """Send a test vision payload to check if the model supports images."""
    if model_id in VISION_PROBE_CACHE:
        return VISION_PROBE_CACHE[model_id]

    payload = {
        "model": model_id,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": "about:blank"},
                {"type": "text", "text": "ping"},
            ],
        }],
        "max_tokens": 1,
    }

    try:
        r = requests.post(
            f"{base_url}/v1/chat/completions",
            json=payload,
            timeout=timeout,
        )

        if r.status_code == 200:
            VISION_PROBE_CACHE[model_id] = True
            return True

        # Check the error message for vision-related keywords
        error_text = r.text.lower()
        if any(k in error_text for k in [
            "image", "vision", "multimodal", "image_url", "image token"
        ]):
            VISION_PROBE_CACHE[model_id] = True
            return True

        VISION_PROBE_CACHE[model_id] = False
        return False

    except requests.RequestException:
        VISION_PROBE_CACHE[model_id] = False
        return False
```

The trick is in the error handling. We send an about:blank image URL, which will always fail. But how it fails reveals whether the model understands vision:
- 200 OK: The model processed it (unlikely with about:blank, but we handle it)
- Error mentioning "image", "vision", or "multimodal": The model understands vision inputs but couldn't process this one. It's vision-capable
- Generic error with no mention of images: Text-only model

Results are cached in VISION_PROBE_CACHE so we only probe each model once.
The Three-Tier Priority System
But auto-detection isn't perfect. Some models may be misclassified. That's why LLMChat uses a three-tier priority system for vision capability:
Per-request override (highest) → Stored user override (medium) → Probed capability (lowest)
```python
def get_vision_capability_from_request(base_url, user_id, model_id,
                                       vision_capability_overrides,
                                       vision_enabled_override=None):
    # Tier 1: Per-request override (from the toggle in the UI)
    if vision_enabled_override is not None:
        return vision_enabled_override

    # Tier 2: Stored user override (persisted in session)
    override_key = f"{user_id}:{model_id}"
    override = vision_capability_overrides.get(override_key)
    if override is not None:
        return override

    # Tier 3: Auto-probed capability
    return probe_vision_capability(base_url, model_id)
```

In the UI, there's a vision toggle next to the model dropdown. Toggle it on, and your override takes priority over everything else. The model list shows "Vision-capable" or "Text-only" based on the probe, and "Vision-capable (override)" or "Text-only (override)" when you've toggled it manually.
This means you never get stuck. Auto-detection handles 95% of cases. The toggle handles the rest.
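To make the precedence concrete, here's a self-contained sketch of how the three tiers resolve. These are simplified stand-ins for the real functions, and the probe is stubbed out so the example runs offline:

```python
# Stub: pretend the network probe classified the model as text-only
def probe_vision_capability(base_url, model_id):
    return False

def resolve_vision_capability(base_url, user_id, model_id,
                              stored_overrides, per_request=None):
    if per_request is not None:                    # Tier 1: UI toggle
        return per_request
    override = stored_overrides.get(f"{user_id}:{model_id}")
    if override is not None:                       # Tier 2: stored override
        return override
    return probe_vision_capability(base_url, model_id)  # Tier 3: probe

overrides = {"alice:falcon3-7b": True}
url = "http://localhost:8000"

tier3 = resolve_vision_capability(url, "bob", "falcon3-7b", {})           # False: probe decides
tier2 = resolve_vision_capability(url, "alice", "falcon3-7b", overrides)  # True: stored override
tier1 = resolve_vision_capability(url, "alice", "falcon3-7b", overrides,
                                  per_request=False)                      # False: toggle beats all
```

The per-request toggle always wins, even against a stored override, which is what lets you recover instantly from a bad classification.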

Vision in Practice
Image Attachments and Compression
When you attach an image, LLMChat doesn't just forward it blindly. It runs a compression pipeline to keep things manageable:
```python
import io
from PIL import Image

def compress_image(file_path, mime_type, size_threshold=500*1024,
                   max_dimension=2048, quality=85):
    """Compress an image in place if it exceeds the size threshold."""
    file_size = file_path.stat().st_size
    if file_size < size_threshold:
        return  # Under 500 KB, no compression needed

    img = Image.open(file_path)

    # Convert RGBA to RGB for JPEG (no transparency support)
    if img.mode in ("RGBA", "LA", "P"):
        if img.mode == "P":
            img = img.convert("RGBA")  # palette images need an alpha band for the mask
        rgb_img = Image.new("RGB", img.size, (255, 255, 255))
        rgb_img.paste(img, mask=img.split()[-1])
        img = rgb_img

    # Resize if dimensions exceed 2048 px
    if img.width > max_dimension or img.height > max_dimension:
        img.thumbnail((max_dimension, max_dimension), Image.Resampling.LANCZOS)

    # Re-encode with quality tuning
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=quality, optimize=True)

    # Only write back if the result is smaller
    if buffer.tell() < file_size:
        file_path.write_bytes(buffer.getvalue())
```

The settings are configurable via environment variables:
| Variable | Default | Description |
|---|---|---|
| IMAGE_SIZE_THRESHOLD | 500 KB | Compression kicks in above this |
| IMAGE_MAX_SIZE_THRESHOLD | 1 MB | Hard limit for uploads |
| IMAGE_MAX_DIMENSION | 2048 px | Max width or height |
| IMAGE_QUALITY | 85 | JPEG/WebP quality (1-100) |
Known limitation: Very large images (5MB+, 4096x4096) can still cause issues. The compression pipeline handles most cases, but there's a TODO in the code for edge cases with extreme dimensions.
Smart History Management for Vision
Here's a subtle but important detail: vision models use way more context per message than text-only models, because image tokens are expensive. LLMChat adapts:
```python
# Text-only models: keep 6 conversation turns in history
MAX_HISTORY_TURNS_TEXT = 6

# Vision models: only 2 turns (because images eat context fast)
MAX_HISTORY_TURNS_VISION = 2
```

The system automatically adjusts based on the current model's capabilities. When you switch from a text model to a vision model, the history window shrinks to prevent context overflow.
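The pruning itself can be very small. Here's a minimal sketch (prune_history is a hypothetical helper, not lifted from LLMChat's source), assuming one conversation turn is a user/assistant message pair:

```python
MAX_HISTORY_TURNS_TEXT = 6
MAX_HISTORY_TURNS_VISION = 2

def prune_history(messages, vision_capable):
    """Keep only the last N turns; one turn = 2 messages (user + assistant)."""
    max_turns = MAX_HISTORY_TURNS_VISION if vision_capable else MAX_HISTORY_TURNS_TEXT
    return messages[-(max_turns * 2):]

# 10 turns of alternating user/assistant messages
history = [{"role": r, "content": f"msg {i}"}
           for i in range(10) for r in ("user", "assistant")]

text_window = prune_history(history, vision_capable=False)    # 12 messages (6 turns)
vision_window = prune_history(history, vision_capable=True)   # 4 messages (2 turns)
```

Slicing from the end keeps the most recent turns, which is what matters for conversational continuity.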
Building Multi-Modal Messages
For vision models, the message format changes from a simple string to a structured array:
```python
def build_user_content(user_message, attachments, model_id, ...):
    vision_capable = get_vision_capability(user_id, model_id)

    if vision_capable:
        parts = [{"type": "text", "text": user_message}]
        for att in attachments or []:
            if att.mime_type.startswith("image/"):
                # Inline as base64 so the model server doesn't need
                # to fetch from our FastAPI host
                raw = att.file_path.read_bytes()
                b64 = base64.b64encode(raw).decode("ascii")
                url = f"data:{att.mime_type};base64,{b64}"

                parts.append({
                    "type": "image_url",
                    "image_url": {"url": url}
                })
        return parts  # Structured content for vision models

    # Text-only: merge everything into a single string
    return user_message  # Simple string for text models
```

For text-only models, images are stripped and text attachments are merged into the message as plain text. No errors, no crashes. Just graceful degradation.
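The text-only merge path is elided above. Here's a plausible sketch of what "merged into the message as plain text" looks like; the attachment shape (dicts with mime_type, filename, text keys) is an assumption for illustration:

```python
def merge_text_attachments(user_message, attachments):
    """Fold text attachments into one string; silently drop images."""
    parts = [user_message]
    for att in attachments or []:
        if att["mime_type"].startswith("text/") and att.get("text"):
            parts.append(f"\n--- Attachment: {att['filename']} ---\n{att['text']}")
        # image/* attachments are skipped entirely for text-only models
    return "\n".join(parts)

msg = merge_text_attachments("Summarize this:", [
    {"mime_type": "text/plain", "filename": "notes.txt", "text": "hello world"},
    {"mime_type": "image/png", "filename": "pic.png"},
])
```

The image attachment simply vanishes from the prompt, which is exactly the graceful degradation described above.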
Tested Vision Models
Here are the vision-language models I've tested with LLMChat:
| Model | vLLM Command | Notes |
|---|---|---|
| InternVL3-8B-AWQ | vllm serve OpenGVLab/InternVL3-8B-AWQ --quantization awq --trust-remote-code | Must set --quantization awq flag |
| InternVL3-2B | vllm serve OpenGVLab/InternVL3-2B --trust-remote-code | Smaller, faster, decent quality |
| Qwen2-VL-2B-Instruct | vllm serve Qwen/Qwen2-VL-2B-Instruct | Qwen's vision model |
| Cosmos-Reason2-2B | vllm serve nvidia/Cosmos-Reason2-2B | NVIDIA's reasoning VLM |
| H2O-VL-Mississippi-2B | vllm serve h2oai/h2ovl-mississippi-2b | ⚠️ No system prompt support |
Web Search Integration
LLMs know a lot, but they don't know what happened today. When you ask "What's the latest PyTorch release?" or "Who won the election?", even the best model can only guess based on its training cutoff.
LLMChat solves this with Tavily web search integration. Real-time search results injected directly into the LLM's context.
Setup
Get a free API key from tavily.com and add it to your .env:
```
TAVILY_API_KEY=tvly-your-api-key-here
```

That's it. LLMChat checks if Tavily is configured on startup and enables the web search toggle accordingly.
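A sketch of what that startup check can look like, using the tavily-python client, with the import guarded in case the package isn't installed:

```python
import os

tavily_client = None
api_key = os.getenv("TAVILY_API_KEY")
if api_key:
    try:
        from tavily import TavilyClient  # pip install tavily-python
        tavily_client = TavilyClient(api_key=api_key)
    except ImportError:
        pass  # package missing: web search stays disabled

WEB_SEARCH_AVAILABLE = tavily_client is not None
```

Everything downstream (the toggle, the fallback path) just checks whether tavily_client is None.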

How It Works
When web search is enabled and you send a message, LLMChat queries Tavily before calling the LLM:
```python
def perform_web_search(tavily_client, query, max_results=5):
    if not tavily_client:
        return ""

    response = tavily_client.search(
        query=query,
        search_depth="basic",
        max_results=max_results,
        include_answer=True,  # Get a synthesized direct answer
    )

    parts = []

    # Include Tavily's direct answer if available
    if response.get("answer"):
        parts.append(f"**Direct Answer:** {response['answer']}")

    # Include individual search results
    results = response.get("results", [])
    if results:
        parts.append("\n**Web Search Results:**")
        for i, r in enumerate(results, 1):
            title = r.get("title", "")
            url = r.get("url", "")
            content = r.get("content", "")[:500]
            parts.append(f"\n{i}. **{title}**\n   URL: {url}\n   {content}")

    return "\n".join(parts)
```

The search results get injected into the user's message as context:
```python
if web_search:
    search_results = perform_web_search(user_message)
    if search_results:
        context_parts.append(f"**[Web Search Context]**\n{search_results}")

# Final message to the LLM
augmented_message = (
    f"{user_message}\n\n---\n{combined_context}\n---\n\n"
    "Please use the above context to help answer my question. "
    "Cite sources when relevant."
)
```

The LLM sees the user's question plus relevant web results, and generates a response grounded in current information. It's not perfect (the model can still hallucinate), but it's dramatically better than relying on training data alone.
Graceful Degradation
If Tavily isn't configured, the web search toggle is disabled in the UI with a tooltip:
```javascript
async function initWebSearch() {
    const available = await Net.checkWebSearchStatus();
    if (webSearchToggle) {
        webSearchToggle.disabled = !available;
        if (!available) {
            webSearchToggle.title = "Web search is not available";
            webSearchToggle.parentElement.style.opacity = "0.5";
        }
    }
}
```

The backend exposes a /search/status endpoint that returns {"available": true/false, "provider": "tavily"}. The frontend uses this to show or hide the toggle. No errors, no broken state.
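On the backend side, the status check barely needs any code. Here's a sketch of the response-building logic; the real route is a FastAPI endpoint, shown here as a plain function so the payload shape is clear:

```python
def search_status(tavily_client):
    """Payload for GET /search/status."""
    return {
        "available": tavily_client is not None,
        "provider": "tavily",
    }

offline = search_status(None)        # {"available": False, "provider": "tavily"}
online = search_status(object())     # any non-None client counts as available
```

Keeping the check this dumb is deliberate: the frontend shouldn't need to know anything about Tavily beyond "is it there".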

The Fallback Strategy
Here's my favorite part of LLMChat's resilience layer: when your model goes down, LLMChat doesn't just show an error. It falls back to web search.
Model Unavailable → Web Search Fallback
```python
except Exception as e:  # NotFoundError or any other failure
    # Model is down or unavailable; try web search as a fallback
    if tavily_client:
        search_results = perform_web_search(req.message)
        if search_results:
            fallback_answer = (
                "**Model is not available, results are from web:**\n\n"
                + search_results
            )
            return ChatResponse(reply=fallback_answer)

    # No web search available: surface the original error
    raise HTTPException(
        status_code=400,
        detail=f"Model '{model_name}' not found on server."
    )
```

In the streaming endpoint, this works similarly. The generator catches the error and yields web search results instead:
```python
except NotFoundError:
    if tavily_client:
        search_results = perform_web_search(req.message)
        if search_results:
            yield f"**Model is not available, results are from web:**\n\n{search_results}"
            return
    yield f"Model '{model_name}' not found on server."
    return
```

This means: if you're chatting and your vLLM server crashes (GPU OOM, SSH disconnects, etc.), LLMChat doesn't just die. It transparently falls back to web search and gives you something useful while you restart the model server.
Context Length Exceeded → Auto-Retry
We covered this in Part 1, but it's worth repeating in the context of the full fallback strategy:
Request fails →
- Is it a context length error?
  - Yes → Parse the exact allowed tokens → Retry with a smaller max_tokens
  - No → Is Tavily configured?
    - Yes → Return web search results with a disclaimer
    - No → Return the error message
The user almost never sees a raw error. There's always a fallback.
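The token-parsing step deserves a sketch. The exact wording varies by server, so treat the regex below as an assumption based on vLLM/OpenAI-style messages like "This model's maximum context length is 4096 tokens. However, you requested 5000 tokens (900 in the messages, 4100 in the completion)":

```python
import re

def compute_retry_max_tokens(error_text, margin=16):
    """Parse a context-length error and compute a max_tokens that fits.

    Returns None if the message doesn't match the expected format.
    """
    m = re.search(
        r"maximum context length is (\d+) tokens.*?(\d+) in the messages",
        error_text, re.DOTALL,
    )
    if not m:
        return None
    limit, prompt_tokens = int(m.group(1)), int(m.group(2))
    budget = limit - prompt_tokens - margin  # leave a small safety margin
    return budget if budget > 0 else None

err = ("This model's maximum context length is 4096 tokens. However, you "
       "requested 5000 tokens (900 in the messages, 4100 in the completion).")
budget = compute_retry_max_tokens(err)  # 4096 - 900 - 16 = 3180
```

If parsing fails, the code simply moves on to the next fallback tier rather than retrying blind.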

The File Upload System
Supporting all these features requires a robust file upload system. Here's what LLMChat handles:
Supported Formats
```python
ALLOWED_EXTENSIONS = {
    ".txt", ".md", ".markdown",  # Text files (preview extracted)
    ".pdf",                      # PDFs (routed to RAG pipeline)
    ".png", ".jpg", ".jpeg",     # Images (compressed, sent to vision models)
    ".gif", ".webp", ".svg",     # More image formats
}
```

Upload Flow
```python
@app.post("/upload")
async def upload(files: List[UploadFile] = File(...)):
    saved = []
    for uf in files:
        ext = os.path.splitext(uf.filename)[1].lower()

        # Unique filename to avoid collisions
        unique = f"{os.urandom(8).hex()}{ext}"
        dest = UPLOAD_DIR / unique

        # Write file to disk in 1 MB chunks
        with dest.open("wb") as out:
            while True:
                chunk = await uf.read(1024 * 1024)
                if not chunk:
                    break
                out.write(chunk)

        # Compress images that exceed the threshold
        if ext in {".png", ".jpg", ".jpeg", ".gif", ".webp"}:
            compress_image(dest, uf.content_type,
                           IMAGE_SIZE_THRESHOLD, IMAGE_MAX_DIMENSION, IMAGE_QUALITY)

        item = {
            "filename": uf.filename,
            "url": f"/uploads/{unique}",
            "mime_type": uf.content_type,
        }

        # Extract a text preview for text files
        if ext in {".txt", ".md", ".markdown"}:
            txt = dest.read_text(encoding="utf-8", errors="ignore")
            item["text"] = txt[:20000]  # First 20K chars

        saved.append(item)

    return {"files": saved}
```

Key design decisions:
- Random filenames: os.urandom(8).hex() prevents name collisions without a database
- Streaming writes: 1 MB chunks prevent memory issues with large files
- Compression on upload: Images are compressed before storage, so the model server gets reasonably sized images
- Text preview extraction: Text files get their content extracted inline so the LLM can read them without a separate fetch
Tying It All Together
Here's where it gets interesting. A single request to LLMChat can combine multiple context sources: chat history, RAG results, web search, and attachments.
The build_messages() function is the orchestrator. It:
- Retrieves chat history (pruned to model-appropriate length)
- Runs RAG search if enabled
- Runs web search if enabled
- Combines all context into the user message
- Formats attachments for vision or text-only models
- Normalizes the message format for vLLM compatibility
The result is a clean message array that any OpenAI-compatible endpoint can process, regardless of whether it's a vision model, a text model, or a local inference server.
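A compressed sketch of that orchestration follows. The function name matches the post, but the body is a simplified stand-in: pruning, RAG, and web search are reduced to their inputs so the assembly logic stays visible:

```python
def build_messages(user_message, history, *, system_prompt="You are a helpful assistant.",
                   rag_context="", web_context="", max_turns=6):
    """Assemble the final OpenAI-style message array (simplified sketch)."""
    messages = [{"role": "system", "content": system_prompt}]
    messages += history[-(max_turns * 2):]           # 1. pruned chat history

    context_parts = []
    if rag_context:                                  # 2. RAG results, if enabled
        context_parts.append(f"**[Document Context]**\n{rag_context}")
    if web_context:                                  # 3. web search, if enabled
        context_parts.append(f"**[Web Search Context]**\n{web_context}")

    content = user_message
    if context_parts:                                # 4. combine into the user message
        combined = "\n\n".join(context_parts)
        content = (f"{user_message}\n\n---\n{combined}\n---\n\n"
                   "Please use the above context to help answer my question.")

    messages.append({"role": "user", "content": content})  # 5-6. final user turn
    return messages

msgs = build_messages("What's the latest PyTorch release?", [],
                      web_context="PyTorch release notes snippet")
```

Steps 5 and 6 (attachment formatting and vLLM normalization) are folded into the final append here; in the real code they transform content before it's attached.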
What's Next
In Part 4, we'll take a completely different approach: running LLMs directly in your browser using WebGPU and Transformers.js. No server. No API. No data leaving your machine. The ultimate privacy play.
P.S.: The vision probe adds a small delay when loading models for the first time, about 1-2 seconds per model. After that, results are cached for the session. If you're impatient, you can always override with the toggle 😉.
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.