The Landscape of Small LLMs and VLMs (12B and Under)

The foundation model ecosystem now offers capable options at or below 12 billion parameters. These compact language, vision, and audio models target edge deployment, mobile devices, and low-latency local inference. This catalog covers the top-performing small models available as of mid-2026, organized by provider.

1. OpenBMB (Tsinghua University / ModelBest)

MiniCPM Family

OpenBMB is a China-based open-source lab jointly founded in 2022 by Tsinghua University's NLP Lab and ModelBest Inc. Their MiniCPM series focuses on edge-side, on-device inference with strong performance relative to model size.

MiniCPM5-1B

Parameters: 1B (dense)
Modality: Text >> Text
Architecture: Standard LlamaForCausalLM (no custom kernels needed)
Highlights: Average score 42.57 across reasoning, knowledge, code, instruction-following, math, logic, and agentic benchmarks. Beats the previous 1B-class SOTA of 35.61. Built-in hybrid reasoning with <think> template. Supports RL and On-Policy Distillation (OPD) training.
Inference Support: SGLang (recommended for tool calling), vLLM, Transformers, llama.cpp, Ollama
License: Apache 2.0
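
Hybrid-reasoning models that use a <think> template generally emit their reasoning before the final answer, so downstream code usually needs to separate the two. The sketch below assumes the completion contains at most one <think>...</think> block followed by the answer; that layout and the helper name split_reasoning are assumptions for illustration, so check the model's chat template before relying on it.

```python
import re

# Assumed output format: reasoning wrapped in a single <think>...</think> block,
# followed by the final answer. This is an illustration, not the model's spec.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a raw completion string."""
    match = THINK_BLOCK.search(text)
    if match is None:
        # Non-thinking mode: the whole completion is the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the <think> block is treated as the answer.
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
reasoning, answer = split_reasoning(raw)
print(reasoning)
print(answer)
```

Keeping the reasoning separate makes it easy to log it for debugging while showing only the answer to users.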

MiniCPM-V 4.6 (1.3B)

Parameters: 1.3B total (SigLIP2-400M vision encoder and Qwen3.5-0.8B LLM backbone)
Modality: Text + Image + Video >> Text
Context: 262K tokens
Highlights: Scores 13 on the Artificial Analysis Intelligence Index, the highest for any open-weights model under 2B. Surpasses Gemma4-E2B-it. 1.5x the token throughput of Qwen3.5-0.8B. Mixed 4x/16x visual token compression. Deployable on iOS, Android, and HarmonyOS.
Inference Support: vLLM, SGLang, llama.cpp, Ollama, Transformers
License: Apache 2.0

MiniCPM-o 4.5 (9B)

Parameters: 9B total (end-to-end omnimodal)
Modality: Text + Image + Video + Audio >> Text + Speech (full-duplex live streaming)
Highlights: Approaches Gemini 2.5 Flash in vision, speech, and full-duplex multimodal live streaming. Supports simultaneous see/listen/speak in real-time conversation with proactive interactions.
Inference Support: vLLM, SGLang, Transformers
License: Apache 2.0

2. Liquid AI (MIT Spin-off)

LFM2 / LFM2.5 Family

Liquid AI builds models on a hybrid "liquid" architecture that alternates grouped-query attention (GQA) blocks with short convolutional layers. The models are optimized for fast on-device inference.
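
To make the "short convolutional layer" idea concrete, here is a minimal sketch of a causal 1D convolution with a small kernel, written in plain Python. It is not Liquid AI's implementation: the function name, kernel values, and single-channel input are assumptions for illustration, and the sketch leaves out any gating or projections that a production block may include.

```python
def causal_short_conv(x: list[float], kernel: list[float]) -> list[float]:
    """1D causal convolution: output[t] depends only on x[t-k+1 .. t]."""
    k = len(kernel)
    # Left-pad with zeros so no output position can see future inputs.
    padded = [0.0] * (k - 1) + x
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


signal = [1.0, 2.0, 3.0, 4.0]
# Hypothetical 2-tap kernel; the last tap applies to the current position.
print(causal_short_conv(signal, [0.5, 0.5]))  # [0.5, 1.5, 2.5, 3.5]
```

Because each output looks back only a few positions, the cost per token stays constant with sequence length, which is one reason such layers suit CPU and NPU inference.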

LFM2-350M / LFM2-700M / LFM2-1.2B / LFM2-2.6B

Parameters: 350M, 700M, 1.2B, 2.6B (dense)
Modality: Text >> Text
Highlights: LFM2-2.6B outperforms Llama 3.2-3B-Instruct, Gemma-3-4B-it, and SmolLM3-3B. Scores 82.41% on GSM8K, 79.56% on IFEval. Trained on 10T tokens. Designed for CPU, NPU, and GPU inference. Optimized for English and Japanese with strong multilingual support (French, Spanish, German, Italian, Portuguese, Arabic, Chinese, Korean).
Inference Support: Transformers, ExecuTorch, LEAP SDK
License: LFM Open License (Apache 2.0-based, commercial use permitted)

LFM2.5-1.2B-Instruct

Parameters: 1.2B
Modality: Text >> Text
Highlights: Backbone pretrained on 28T tokens (up from 10T). Reinforcement learning for instruction following, tool use, math, and knowledge reasoning. Also available in a Japanese variant (LFM2.5-1.2B-JP).
Inference Support: Transformers, ExecuTorch, LEAP SDK
License: LFM Open License

LFM2.5-VL-1.6B

Parameters: 1.6B (vision-language)
Modality: Text + Image >> Text
Highlights: Native resolution processing up to 512x512 without upscaling. Improved multilingual vision understanding. Supports real-time video stream captioning via WebGPU. Reliable multi-image and OCR performance.
Inference Support: Transformers (v5.1+), WebGPU, LEAP SDK
License: LFM Open License

LFM2.5-Audio-1.5B

Parameters: 1.5B
Modality: Audio >> Text / Audio
Highlights: Native audio-language model for edge voice agents. Covers speech understanding and audio language workloads.
Inference Support: Transformers, LEAP SDK
License: LFM Open License

3. Hugging Face (SmolLM / SmolVLM)

SmolLM Family

Hugging Face focuses on highly compact models with full training transparency.

SmolLM2 (135M / 360M / 1.7B)

Parameters: 135M, 360M, 1.7B
Modality: Text >> Text
Highlights: Pre-trained on curated high-quality datasets (Cosmopedia, FineWeb-Edu, Python-Edu). Designed as a baseline and research tool for understanding small model capabilities.
Inference Support: Transformers, Ollama, MLX, llama.cpp
License: Apache 2.0

SmolLM3-3B

Parameters: 3B
Modality: Text >> Text
Context: 128K (trained at 64K, extended to 128K with YaRN extrapolation)
Highlights: SOTA 3B model with dual reasoning (thinking/non-thinking modes). Supports 6 languages and long context with strong function calling. Multi-stage training recipe. Standout for long-document processing at 3B scale.
Inference Support: Transformers, Ollama, MLX, llama.cpp, vLLM
License: Apache 2.0

[!TIP] SmolLM3-3B is a strong choice for long-document processing at the 3B scale thanks to its 128K context window.

SmolVLM Family

SmolVLM-2.2B-Instruct / SmolVLM2-2.2B

Parameters: 2.2B (SigLIP-400M encoder and SmolLM2 decoder)
Modality: Text + Image + Video >> Text
Highlights: 81 visual tokens per 384x384 patch. Trained on The Cauldron and Docmatix datasets. The training mix emphasizes document understanding (25%) and image captioning (18%), alongside visual reasoning and chart comprehension.
Inference Support: Transformers, Ollama
License: Apache 2.0

SmolVLM-500M / SmolVLM-256M

Parameters: 500M, 256M
Modality: Text + Image >> Text
Highlights: Ultra-compact VLMs. The 500M uses a 93M SigLIP encoder (vs. 400M in the 2.2B) and 512x512 patches with 64 visual tokens. Designed for extreme edge deployment.
Inference Support: Transformers
License: Apache 2.0
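
The visual token counts quoted for the SmolVLM models (81 per 384x384 patch, 64 per 512x512 patch) can be reproduced with simple arithmetic. The sketch below assumes a ViT patch size of 14 with a 3x pixel-shuffle for the larger model, and a patch size of 16 with a 4x pixel-shuffle for the smaller ones. Those values were chosen because they reproduce the stated counts and should be treated as assumptions. The tiling rule in tokens_per_image is likewise hypothetical.

```python
import math


def tokens_per_patch(patch_px: int, vit_patch_px: int, shuffle_ratio: int) -> int:
    """Visual tokens left for one image patch after pixel-shuffle compression."""
    side = patch_px // vit_patch_px        # encoder grid is side x side
    return (side * side) // (shuffle_ratio ** 2)


def tokens_per_image(width: int, height: int, patch_px: int, per_patch: int) -> int:
    """Hypothetical tiling: ceil-divide the image into patches, plus one global view."""
    tiles = math.ceil(width / patch_px) * math.ceil(height / patch_px)
    return (tiles + 1) * per_patch


large = tokens_per_patch(384, vit_patch_px=14, shuffle_ratio=3)   # 27 * 27 / 9
small = tokens_per_patch(512, vit_patch_px=16, shuffle_ratio=4)   # 32 * 32 / 16
print(large, small)                                # 81 64
print(tokens_per_image(1536, 1152, 384, large))    # 13 patches * 81 = 1053
```

Estimates like this help budget context length before sending high-resolution images to a small VLM.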

4. Alibaba Cloud (Qwen)

Qwen3.5 Small Series

The Qwen3.5 small models (March 2026) are designed from scratch around a hybrid architecture that combines Gated DeltaNet linear attention with full attention. The sizes listed below are dense rather than sparse Mixture-of-Experts.

Qwen3.5-0.8B / 2B / 4B / 9B

Parameters: 0.8B, 2B, 4B, 9B (dense, with hybrid attention)
Modality: Text + Image + Video >> Text (native multimodal, all sizes)
Context: 262K tokens (all sizes)
Highlights:
- 9B: Intelligence Index score of 32, the highest among models under 10B. MMMU-Pro 69.2%.
- 4B: Intelligence Index score of 27, the highest among models under 5B. MMMU-Pro 65.4%. Supports 262K context natively, extensible beyond 1M.
- 2B: Runs fully offline on recent iPhones. Reported 100% zero-shot classification accuracy.
- 0.8B: Supports 262K context at sub-1B scale. Early fusion multimodal training with 3D convolution for video.
- The architecture uses a 3:1 ratio of linear-attention to full-attention layers, which limits KV-cache growth at long context.
Inference Support: Transformers, vLLM, SGLang, Ollama, MLX, llama.cpp, GGUF (Unsloth optimized)
License: Apache 2.0
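
The memory effect of a 3:1 linear-to-full attention ratio can be estimated with back-of-the-envelope arithmetic: only the full-attention layers keep a KV cache that grows with sequence length. The sketch below uses hypothetical layer and head dimensions (not the published Qwen3.5 configuration) and ignores the small fixed-size state of the linear-attention layers.

```python
def kv_cache_bytes(seq_len: int, attn_layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """KV-cache size: keys and values for every full-attention layer."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical dimensions for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 32, 4, 128
SEQ_LEN = 262_144  # the 262K context from the entry

all_full = kv_cache_bytes(SEQ_LEN, LAYERS, KV_HEADS, HEAD_DIM)
# With a 3:1 ratio, one layer in four is full attention.
hybrid = kv_cache_bytes(SEQ_LEN, LAYERS // 4, KV_HEADS, HEAD_DIM)

print(f"all full attention: {all_full / 2**30:.1f} GiB")   # 16.0 GiB
print(f"3:1 hybrid:         {hybrid / 2**30:.1f} GiB")     # 4.0 GiB
```

Under these assumptions the hybrid layout cuts the 16-bit KV cache at full context to a quarter, which is the kind of saving that makes 262K context plausible on small devices.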

Qwen2.5 Series

Qwen2.5-1.5B / 3B / 7B

Parameters: 1.5B, 3B, 7B
Modality: Text >> Text
Context: 32K tokens (1.5B, 3B); 128K tokens (7B)
Highlights: Pretrained on 18T tokens. Strong multilingual support. Qwen2.5-7B led the 7B class for much of 2025.
Inference Support: Transformers, vLLM, SGLang, Ollama, MLX, llama.cpp, TensorRT-LLM
License: Apache 2.0 (1.5B, 7B); Qwen Research License (3B)

Qwen2.5-VL-7B

Parameters: 7B
Modality: Text + Image + Video >> Text
Highlights: Scores 888 on OCRBench. Strong document and chart understanding. One of the top open-weight VLMs in the 7B class.
Inference Support: Transformers, vLLM
License: Apache 2.0

5. Google DeepMind (Gemma)

Gemma 4 Family (April 2026)

Gemma 4 E2B (Effective 2B)

Parameters: 2B (dense, edge-optimized)
Modality: Text + Image + Audio >> Text
Context: 128K tokens
Highlights: Native audio input. Designed for phones, browsers, and Pixel devices. Configurable thinking modes.
Inference Support: Transformers, MediaPipe, LiteRT, Ollama, llama.cpp, MLX
License: Apache 2.0 (Gemma terms)

Gemma 4 E4B (Effective 4B)

Parameters: 4B (dense, edge-optimized)
Modality: Text + Image + Audio >> Text
Context: 128K tokens
Highlights: Native audio input. Stronger reasoning than E2B. Achieves 60+ tok/s on edge hardware.
Inference Support: Transformers, MediaPipe, LiteRT, Ollama, llama.cpp, MLX
License: Apache 2.0 (Gemma terms)

Gemma 3 Family

Gemma 3 1B / 4B / 12B

Parameters: 1B, 4B, 12B
Modality: Text + Image >> Text (4B and 12B; 1B is text-only)
Context: 128K (4B/12B), 32K (1B)
Highlights: Gemma 3 4B scores 89.2% on GSM8K. Uses alternating local/global attention (5:1 ratio). Custom SigLIP vision encoder. Supports 140+ languages. Gemma 3 4B is highly competitive in the 3–4B class.
Inference Support: Transformers, Ollama, vLLM, MLX, llama.cpp, TensorRT-LLM, MediaPipe
License: Gemma Terms of Use (custom license; commercial use permitted, subject to Google's prohibited-use policy)
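
A 5:1 local/global attention layout can be sketched as a repeating layer pattern, where local layers cache only a sliding window of tokens and global layers cache the full sequence. The example below is illustrative: the 12-layer stack, the 1,024-token window, and the choice to place the global layer last in each group are assumptions, not values taken from the Gemma 3 configuration.

```python
def layer_pattern(num_layers: int, local_per_global: int = 5) -> list[str]:
    """Label each layer so every (local_per_global + 1)-th layer is global."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(num_layers)]


def cached_positions(pattern: list[str], seq_len: int, window: int) -> int:
    """Total token positions held in the KV cache across all layers."""
    return sum(seq_len if kind == "global" else min(seq_len, window) for kind in pattern)


pattern = layer_pattern(12)          # hypothetical 12-layer stack
print(pattern)
mixed = cached_positions(pattern, seq_len=32_768, window=1_024)
full = cached_positions(["global"] * 12, seq_len=32_768, window=1_024)
print(mixed, full)                   # 75776 393216
```

With these numbers the interleaved layout caches roughly a fifth of the positions that an all-global stack would, and the gap widens as the sequence grows.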

6. Microsoft (Phi)

Phi-4 Family

Phi-4-mini (3.8B)

Parameters: 3.8B (dense, decoder-only transformer)
Modality: Text >> Text
Context: 128K tokens
Highlights: MMLU 67.3% (5-shot), GSM8K 88.6%, ARC-C 83.7% (highest in its size class), BigBench-Hard 70.4% (0-shot CoT). Uses GQA, 200K vocabulary, and shared input-output embeddings. Trained on synthetic reasoning-rich data. Reasoning variant also available (Phi-4-mini-reasoning).
Inference Support: Transformers, Ollama, vLLM, ONNX Runtime, llama.cpp, MLX, TensorRT-LLM
License: MIT
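
Shared input-output embeddings matter more as the vocabulary grows, because the embedding matrix scales with vocabulary size times hidden size. The arithmetic below is a rough sketch: the 200K vocabulary and 3.8B total come from the entry above, while the hidden size of 3,072 is an assumption for illustration.

```python
VOCAB_SIZE = 200_000   # "200K vocabulary" from the entry, rounded
HIDDEN_SIZE = 3_072    # assumption for illustration
TOTAL_PARAMS = 3.8e9   # parameter count from the entry


def embedding_params(vocab: int, hidden: int, tied: bool) -> int:
    """Parameters in the input embedding plus the output projection."""
    matrices = 1 if tied else 2
    return matrices * vocab * hidden


tied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied=True)
untied = embedding_params(VOCAB_SIZE, HIDDEN_SIZE, tied=False)
saved = untied - tied
print(f"tied: {tied:,}  untied: {untied:,}")
print(f"saved: {saved:,} ({saved / TOTAL_PARAMS:.1%} of the model)")
```

Under these assumptions, tying the two matrices saves about 614M parameters, a meaningful share of a model this size.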

Phi-4-multimodal (5.6B)

Parameters: 5.6B
Modality: Text + Image + Audio >> Text
Highlights: First Phi model supporting audio, image, and text input simultaneously. Surpasses Qwen2-Audio on speech/audio/music understanding. Mixture-of-LoRAs architecture.
Inference Support: Transformers, ONNX Runtime, Intel Extension for PyTorch
License: MIT

7. Mistral AI

Ministral 3 Family (December 2025)

Ministral 3 3B

Parameters: 3B (dense)
Modality: Text + Image >> Text
Context: 131K tokens
Highlights: TTFT 0.51s (lowest among Mistral models). Costs $0.10/M tokens. Outperforms Mistral 7B on most benchmarks. Base, instruct, and reasoning variants. Image understanding included.
Inference Support: Transformers, vLLM, Ollama, llama.cpp, Mistral API
License: Apache 2.0

Ministral 3 8B

Parameters: 8B (dense, interleaved sliding-window attention)
Modality: Text + Image >> Text
Context: 262K tokens
Highlights: Outperforms the larger Gemma 12B on most evaluations. Interleaved sliding-window attention for memory-efficient inference. Base, instruct, and reasoning variants.
Inference Support: Transformers, vLLM, Ollama, llama.cpp, Mistral API
License: Apache 2.0

Voxtral TTS

Parameters: 3B (8 GB BF16, 3 GB quantized)
Modality: Text >> Speech
Highlights: Supports 9 languages. Zero-shot voice cloning with 3 seconds of reference audio. Cross-lingual generation. 70ms model latency on H200. Built on Ministral 3B backbone.
Inference Support: Transformers, Mistral API
License: Apache 2.0

Pixtral 12B

Parameters: 12B
Modality: Text + Image >> Text
Context: 128K tokens
Highlights: First multimodal (text + image) model from Mistral. Strong chart and image comprehension. Competitive with Llama 3.2 11B Vision.
Inference Support: Transformers, vLLM, Ollama, Mistral API
License: Apache 2.0

8. Meta (Llama)

Llama 3.2 Family

Llama 3.2 1B / 3B

Parameters: 1B, 3B
Modality: Text >> Text
Context: 128K tokens
Highlights: Optimized for on-device, mobile, and edge inference. Strong at tool routing for their size. 3B outperforms Gemma 2 2.6B and Phi 3.5-mini. Trained on 9T tokens. 8 officially supported languages.
Inference Support: Transformers, Ollama, vLLM, llama.cpp, MLX, ExecuTorch, TensorRT-LLM
License: Llama 3.2 Community License (700M MAU restriction)

Llama 3.2 11B-Vision

Parameters: 11B
Modality: Text + Image >> Text
Context: 128K tokens
Highlights: Beats Gemini 1.5 Flash 8B on DocVQA. Tops Claude 3 Haiku/Sonnet on AI2D, ChartQA, MathVista. Competitive with Pixtral 12B and Qwen2-VL 7B on VQAv2.
Inference Support: Transformers, vLLM, Ollama, llama.cpp, TensorRT-LLM
License: Llama 3.2 Community License

9. TII / Falcon

Falcon 3 Family (December 2024)

Falcon 3 1B / 3B / 7B / 10B

Parameters: 1B, 3B, 7B, 10B
Modality: Text >> Text
Context: Varies by model
Highlights: Trained on 14T tokens. 10B and 7B outperform Gemma 2-9B, Llama 3.1-8B, Mistral 7B, and Yi 1.5-9B. They surpass Qwen 2.5-7B on MUSR, MATH, GPQA, IFEval. Supports English, French, Spanish, Portuguese.
Inference Support: Transformers, Ollama, vLLM, llama.cpp
License: Apache 2.0

Falcon H1R 7B (January 2026)

Falcon H1R 7B

Parameters: 7B (hybrid Transformer + Mamba)
Modality: Text >> Text
Highlights: Scores 96.7% on AIME 2025 with Deep Think with Confidence (DeepConf). Uses 38% fewer tokens than DeepSeek-R1-0528-Qwen3-8B. Achieves 1,500 tok/s/GPU at batch size 64. The hybrid design combines Transformer attention with Mamba state-space layers to improve inference efficiency.
Inference Support: Transformers, vLLM
License: Falcon LLM License 1.0 (Apache 2.0-based, attribution required)

Falcon Mamba 7B

Falcon Mamba 7B

Parameters: 7B (State Space Model)
Modality: Text >> Text
Highlights: Top-ranked open-source state space language model (SSLM) at release. Low memory cost for arbitrary-length text generation. Outperforms Llama 3.1 8B and Mistral 7B.
Inference Support: Transformers, vLLM
License: Falcon Terms (permissive)

10. AI2 / Allen Institute (Molmo / OLMo)

Molmo-7B-D

Parameters: 7B (built on Qwen 2 7B)
Modality: Text + Image >> Text
Highlights: Outperforms Llama 3.2 Vision on some benchmarks. Fully open source, with all training data, code, and evaluations published.
Inference Support: Transformers, vLLM
License: Apache 2.0

MolmoE-1B

Parameters: 7B total, 1B active (OLMoE-based)
Modality: Text + Image >> Text
Highlights: Efficient MoE architecture with only 1B active parameters per forward pass.
Inference Support: Transformers
License: Apache 2.0
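
The gap between total and active parameters comes from expert routing: for each token, a router scores every expert and only the top few run. The following is a minimal sketch of top-k routing in plain Python. The eight-expert router, the logits, and k=2 are hypothetical, and real implementations differ on details such as whether softmax is applied before or after the top-k selection.

```python
import math


def top_k_route(logits: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their softmax weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)            # subtract max for numerical stability
    exp = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exp.values())
    return {i: w / total for i, w in exp.items()}


router_logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.3]  # hypothetical, 8 experts
weights = top_k_route(router_logits, k=2)
print(weights)                                    # experts 1 and 4 are selected
print(f"active share of a 7B-total, 1B-active model: {1 / 7:.0%}")
```

Only the selected experts' feed-forward weights are used for that token, which is why compute per token tracks the active count rather than the total.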

11. Shanghai AI Lab (InternVL)

InternVL3-1B / 2B / 4B / 8B

Parameters: 1B, 2B, 4B, 8B
Modality: Text + Image + Video >> Text
Highlights: Strong industrial and 3D reasoning capabilities. InternVL3-8B is competitive with much larger models on MMMU. Evaluated in zero-shot settings for construction safety, medical imaging, and document understanding.
Inference Support: Transformers, vLLM, LMDeploy
License: MIT

12. Speech & Audio Models

OpenAI Whisper (Tiny / Base / Small / Medium)

Parameters: 39M (Tiny), 74M (Base), 244M (Small), 769M (Medium)
Modality: Audio >> Text (ASR)
Highlights: Supports 99+ languages. Most widely used open-source ASR. Strong in noisy environments. Extensive community tooling.
Inference Support: Transformers, whisper.cpp, faster-whisper, MLX, ONNX, Ollama
License: MIT

Useful Sensors Moonshine (Tiny / Base)

Parameters: 27M (Tiny), 100M (Base)
Modality: Audio >> Text (ASR)
Highlights: Outperforms Whisper Tiny/Small despite being smaller. Moonshine v2 adds an Ergodic Streaming Encoder for latency-critical applications. Multi-language variants available.
Inference Support: ONNX, Custom C++ runtime
License: Permissive open-source

CosyVoice2-0.5B (Alibaba)

Parameters: 0.5B
Modality: Text >> Speech (TTS)
Highlights: Uses finite scalar quantization. Streaming latency as low as 150 ms. Frame-level control over emotion and dialect.
Inference Support: Custom framework
License: Apache 2.0

13. Other Notable Small Models

Mistral 7B v0.3

Parameters: 7B
Modality: Text >> Text
Highlights: Fastest wall-clock throughput in the 7B class. Practical pick when latency matters. Requires about 4.5 GB of VRAM when quantized to 4-bit.
Inference Support: Transformers, Ollama, vLLM, llama.cpp, MLX, GGUF
License: Apache 2.0
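
VRAM figures like the one above depend heavily on weight precision. A rough estimate for the weights alone is parameter count times bits per weight; the sketch below applies that to a 7B model. It deliberately ignores KV cache, activations, and runtime overhead, so real usage will be higher, and the helper name weight_gib is an assumption for illustration.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30


for label, bits in [("BF16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"7B @ {label}: {weight_gib(7, bits):.1f} GiB")
```

At 4 bits the weights of a 7B model come to roughly 3.3 GiB by this estimate, so a budget in the 4 to 5 GB range leaves some room for cache and overhead.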

DeepSeek-R1-Distill-Qwen-1.5B / 7B

Parameters: 1.5B, 7B (distilled from DeepSeek-R1)
Modality: Text >> Text
Highlights: Reasoning-focused distilled models. Both variants are derived from the Qwen2.5 series and fine-tuned on 800K curated samples generated with DeepSeek-R1.
Inference Support: Transformers, vLLM, Ollama, llama.cpp
License: MIT (the underlying Qwen2.5 base models are Apache 2.0)

TinyLlama 1.1B

Parameters: 1.1B
Modality: Text >> Text
Highlights: Compact model that reuses the Llama 2 architecture and tokenizer. Useful baseline for research and fine-tuning experiments.
Inference Support: Transformers, Ollama, llama.cpp
License: Apache 2.0