The Landscape of Small LLMs and VLMs (Under 12B)
A practical catalog of small language and vision models optimized for edge inference.

The foundation model ecosystem now offers highly capable options under 12 billion parameters. These compact language, vision, and audio models are built specifically for edge deployment, mobile devices, and low-latency local inference. This catalog breaks down the top-performing small models available as of mid-2026, organized by provider.
1. OpenBMB (Tsinghua University / ModelBest)

MiniCPM Family
OpenBMB is a China-based open-source lab jointly founded in 2022 by Tsinghua University's NLP Lab and ModelBest Inc. Their MiniCPM series focuses on edge-side, on-device inference with strong performance relative to model size.
MiniCPM5-1B
- Parameters: 1B (dense)
- Modality: Text >> Text
- Architecture: Standard LlamaForCausalLM (no custom kernels needed)
- Highlights: Average score 42.57 across reasoning, knowledge, code, instruction-following, math, logic, and agentic benchmarks. Beats the previous 1B-class SOTA of 35.61. Built-in hybrid reasoning with a <think> template. Supports RL and On-Policy Distillation (OPD) training.
- Inference Support: SGLang (recommended for tool calling), vLLM, Transformers, llama.cpp, Ollama
- License: Apache 2.0
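Hybrid reasoning models of this kind typically emit their reasoning trace inside a think block before the final answer. The sketch below shows one way to separate the two in post-processing. It assumes the trace is wrapped in literal `<think>` and `</think>` tags; the helper name `split_reasoning` is hypothetical and not part of any MiniCPM tooling.

```python
import re
from typing import NamedTuple


class ParsedReply(NamedTuple):
    reasoning: str  # text found inside <think>...</think>, "" if absent
    answer: str     # everything outside the think block


# Assumption: the reasoning span is delimited by literal <think> and </think> tags.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> ParsedReply:
    """Separate the reasoning trace from the final answer in a raw completion."""
    match = _THINK_RE.search(raw)
    if match is None:
        # Non-thinking mode: the whole completion is the answer.
        return ParsedReply(reasoning="", answer=raw.strip())
    reasoning = match.group(1).strip()
    # Keep whatever surrounds the think block as the user-facing answer.
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return ParsedReply(reasoning=reasoning, answer=answer)


reply = split_reasoning("<think>2 + 2 is 4.</think>\nThe answer is 4.")
```

Keeping the trace separate makes it easy to log the reasoning while showing only the answer to the user.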
MiniCPM-V 4.6 (1.3B)
- Parameters: 1.3B total (SigLIP2-400M vision encoder and Qwen3.5-0.8B LLM backbone)
- Modality: Text + Image + Video >> Text
- Context: 262K tokens
- Highlights: Scores 13 on the Artificial Analysis Intelligence Index, the highest for any open-weights model under 2B. Surpasses Gemma4-E2B-it. Delivers 1.5x the token throughput of Qwen3.5-0.8B. Mixed 4x/16x visual token compression. Deployable on iOS, Android, and HarmonyOS.
- Inference Support: vLLM, SGLang, llama.cpp, Ollama, Transformers
- License: Apache 2.0
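To get a feel for what 4x versus 16x visual token compression means for the language model's input length, the arithmetic can be sketched as follows. The pre-compression count of 1,024 encoder tokens per image is a hypothetical placeholder, not a published MiniCPM-V figure, and the function name is illustrative.

```python
def compressed_visual_tokens(raw_tokens: int, ratio: int) -> int:
    """Tokens left after merging every `ratio` encoder tokens into one (rounded up)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return -(-raw_tokens // ratio)  # ceiling division


# Hypothetical: 1,024 encoder tokens per image before compression.
RAW_TOKENS = 1024

for ratio in (4, 16):
    print(f"{ratio}x compression: {compressed_visual_tokens(RAW_TOKENS, ratio)} tokens")
```

Under that assumption, the 16x setting passes a quarter as many visual tokens to the LLM as the 4x setting, which matters most for multi-image and video inputs.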
MiniCPM-o 4.5 (9B)
- Parameters: 9B total (end-to-end omnimodal)
- Modality: Text + Image + Video + Audio >> Text + Speech (full-duplex live streaming)
- Highlights: Approaches Gemini 2.5 Flash in vision, speech, and full-duplex multimodal live streaming. Supports simultaneous see/listen/speak in real-time conversation with proactive interactions.
- Inference Support: vLLM, SGLang, Transformers
- License: Apache 2.0
2. Liquid AI (MIT Spin-off)

LFM2 / LFM2.5 Family
Liquid AI builds models on a novel "liquid" hybrid architecture that alternates Grouped Query Attention with short convolutional layers. The models are optimized for fast on-device processing.
LFM2-350M / LFM2-700M / LFM2-1.2B / LFM2-2.6B
- Parameters: 350M, 700M, 1.2B, 2.6B (dense)
- Modality: Text >> Text
- Highlights: LFM2-2.6B outperforms Llama 3.2-3B-Instruct, Gemma-3-4B-it, and SmolLM3-3B. Scores 82.41% on GSM8K, 79.56% on IFEval. Trained on 10T tokens. Designed for CPU, NPU, and GPU inference. Optimized for English and Japanese with strong multilingual support (French, Spanish, German, Italian, Portuguese, Arabic, Chinese, Korean).
- Inference Support: Transformers, ExecuTorch, LEAP SDK
- License: LFM Open License (Apache 2.0-based, commercial use permitted)
LFM2.5-1.2B-Instruct
- Parameters: 1.2B
- Modality: Text >> Text
- Highlights: Backbone pretrained on 28T tokens (up from 10T). Reinforcement learning for instruction following, tool use, math, and knowledge reasoning. Also available in Japanese variant (LFM2.5-1.2B-JP).
- Inference Support: Transformers, ExecuTorch, LEAP SDK
- License: LFM Open License
LFM2.5-VL-1.6B
- Parameters: 1.6B (vision-language)
- Modality: Text + Image >> Text
- Highlights: Native resolution processing up to 512x512 without upscaling. Improved multilingual vision understanding. Supports real-time video stream captioning via WebGPU. Reliable multi-image and OCR performance.
- Inference Support: Transformers (v5.1+), WebGPU, LEAP SDK
- License: LFM Open License
LFM2.5-Audio-1.5B
- Parameters: 1.5B
- Modality: Audio >> Text / Audio
- Highlights: Native audio-language model for edge voice agents. Covers speech understanding and audio language workloads.
- Inference Support: Transformers, LEAP SDK
- License: LFM Open License
3. Hugging Face (SmolLM / SmolVLM)

SmolLM Family
Hugging Face focuses on highly compact models with full training transparency.
SmolLM2 (135M / 360M / 1.7B)
- Parameters: 135M, 360M, 1.7B
- Modality: Text >> Text
- Highlights: Pre-trained on curated high-quality datasets (Cosmopedia, FineWeb-Edu, Python-Edu). Designed as a baseline and research tool for understanding small model capabilities.
- Inference Support: Transformers, Ollama, MLX, llama.cpp
- License: Apache 2.0
SmolLM3-3B
- Parameters: 3B
- Modality: Text >> Text
- Context: 128K (genuine, not interpolated)
- Highlights: SOTA 3B model with dual reasoning (thinking/non-thinking modes). Supports 6 languages and long context with strong function calling. Multi-stage training recipe. Standout for long-document processing at 3B scale.
- Inference Support: Transformers, Ollama, MLX, llama.cpp, vLLM
- License: Apache 2.0
Tip: Use SmolLM3-3B for processing long documents at the 3B scale; its 128K context window is native rather than interpolated.
SmolVLM Family
SmolVLM-2.2B-Instruct / SmolVLM2-2.2B
- Parameters: 2.2B (SigLIP-400M encoder and SmolLM2 decoder)
- Modality: Text + Image + Video >> Text
- Highlights: 81 visual tokens per 384x384 patch. Trained on The Cauldron and Docmatix datasets, with a training mix weighted toward document understanding (25%) and image captioning (18%), alongside visual reasoning and chart comprehension.
- Inference Support: Transformers, Ollama
- License: Apache 2.0
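The figure of 81 visual tokens per 384x384 patch gives a quick way to estimate how many tokens an image adds to the prompt. The sketch below assumes the image is tiled on a plain grid of 384x384 patches with partial tiles rounded up. It ignores any extra global thumbnail or resizing step the real preprocessor may apply, so treat the result as a rough estimate only.

```python
import math

PATCH_SIDE = 384        # from the text: 384x384 patches
TOKENS_PER_PATCH = 81   # from the text


def visual_tokens(width: int, height: int) -> int:
    """Estimate visual tokens for an image tiled into 384x384 patches."""
    cols = math.ceil(width / PATCH_SIDE)   # partial tiles count as full tiles
    rows = math.ceil(height / PATCH_SIDE)
    return cols * rows * TOKENS_PER_PATCH


print(visual_tokens(1536, 1152))  # a 4 x 3 grid of patches
```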
SmolVLM-500M / SmolVLM-256M
- Parameters: 500M, 256M
- Modality: Text + Image >> Text
- Highlights: Ultra-compact VLMs. The 500M uses a 93M SigLIP encoder (vs. 400M in the 2.2B) and 512x512 patches with 64 visual tokens. Designed for extreme edge deployment.
- Inference Support: Transformers
- License: Apache 2.0
4. Alibaba Cloud (Qwen)
Qwen3.5 Small Series
The Qwen3.5 small models (March 2026) are designed from scratch with a hybrid architecture combining Gated DeltaNet with sparse Mixture-of-Experts.
Qwen3.5-0.8B / 2B / 4B / 9B
- Parameters: 0.8B, 2B, 4B, 9B (dense, with hybrid attention)
- Modality: Text + Image + Video >> Text (native multimodal, all sizes)
- Context: 262K tokens (all sizes)
- Highlights:
  - 9B: Intelligence Index score 32 (most intelligent model under 10B). MMMU-Pro 69.2%.
  - 4B: Intelligence Index 27 (most intelligent under 5B). MMMU-Pro 65.4%. Supports 262K+ context, extensible to 1M+.
  - 2B: Runs on any recent iPhone in airplane mode. 100% zero-shot classification accuracy.
  - 0.8B: Supports 262K context at sub-1B scale. Early fusion multimodal training with 3D convolution for video.
  - Architecture uses a 3:1 ratio of linear-attention to full-attention layers, which limits memory growth as context lengthens.
- Inference Support: Transformers, vLLM, SGLang, Ollama, MLX, llama.cpp, GGUF (Unsloth optimized)
- License: Apache 2.0
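To see why a 3:1 ratio of linear to full attention layers limits memory growth, here is a rough sketch. It assumes only full-attention layers keep a key/value cache that grows with sequence length, while linear-attention layers hold a fixed-size state that is ignored here. The layer count, KV head count, and head dimension are hypothetical placeholders, not published Qwen3.5 values.

```python
def layer_schedule(n_layers: int, linear_per_full: int = 3) -> list[str]:
    """Repeat `linear_per_full` linear-attention layers followed by one full-attention layer."""
    period = linear_per_full + 1
    return ["full" if (i + 1) % period == 0 else "linear" for i in range(n_layers)]


def kv_cache_bytes(
    n_full_layers: int,
    seq_len: int,
    n_kv_heads: int = 4,       # hypothetical
    head_dim: int = 128,       # hypothetical
    bytes_per_value: int = 2,  # 16-bit cache entries
) -> int:
    """Size of the growing KV cache: one K and one V tensor per full-attention layer."""
    return 2 * n_full_layers * seq_len * n_kv_heads * head_dim * bytes_per_value


N_LAYERS = 32  # hypothetical depth
schedule = layer_schedule(N_LAYERS)
hybrid = kv_cache_bytes(schedule.count("full"), seq_len=262_144)
all_full = kv_cache_bytes(N_LAYERS, seq_len=262_144)
print(f"hybrid cache is {hybrid / all_full:.0%} of an all-full-attention cache")
```

With one full-attention layer in every four, the sequence-dependent cache is a quarter of what an all-full-attention stack of the same depth would need.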
Qwen2.5 Series
Qwen2.5-1.5B / 3B / 7B
- Parameters: 1.5B, 3B, 7B
- Modality: Text >> Text
- Context: 128K tokens
- Highlights: Pretrained on 18T tokens. Strong multilingual support. Qwen2.5-7B was the category leader in its class for much of 2025.
- Inference Support: Transformers, vLLM, SGLang, Ollama, MLX, llama.cpp, TensorRT-LLM
- License: Apache 2.0
Qwen2.5-VL-7B
- Parameters: 7B
- Modality: Text + Image + Video >> Text
- Highlights: Scores 888 on OCRBench. Strong document and chart understanding. One of the top open-weight VLMs in the 7B class.
- Inference Support: Transformers, vLLM
- License: Apache 2.0
5. Google DeepMind (Gemma)
Gemma 4 Family (April 2026)
Gemma 4 E2B (Effective 2B)
- Parameters: 2B (dense, edge-optimized)
- Modality: Text + Image + Audio >> Text
- Context: 128K tokens
- Highlights: Native audio input. Designed for phones, browsers, and Pixel devices. Configurable thinking modes.
- Inference Support: Transformers, MediaPipe, LiteRT, Ollama, llama.cpp, MLX
- License: Apache 2.0 (Gemma terms)
Gemma 4 E4B (Effective 4B)
- Parameters: 4B (dense, edge-optimized)
- Modality: Text + Image + Audio >> Text
- Context: 128K tokens
- Highlights: Native audio input. Stronger reasoning than E2B. Achieves 60+ tok/s on edge hardware.
- Inference Support: Transformers, MediaPipe, LiteRT, Ollama, llama.cpp, MLX
- License: Apache 2.0 (Gemma terms)
Gemma 3 Family
Gemma 3 1B / 4B / 12B
- Parameters: 1B, 4B, 12B
- Modality: Text + Image >> Text (4B and 12B; 1B is text-only)
- Context: 128K (4B/12B), 32K (1B)
- Highlights: Gemma 3 4B scores 89.2% on GSM8K. Uses alternating local/global attention (5:1 ratio). Custom SigLIP vision encoder. Supports 140+ languages. Gemma 3 4B is highly competitive in the 3–4B class.
- Inference Support: Transformers, Ollama, vLLM, MLX, llama.cpp, TensorRT-LLM, MediaPipe
- License: Gemma Terms of Use (permissive, similar to Apache 2.0)
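The 5:1 local/global pattern can be illustrated with a small sketch of which positions a token may attend to in each kind of layer. It assumes causal attention and a sliding window for local layers; the 1,024-token window used in the example is a hypothetical value chosen for illustration, and the function names are not from any Gemma codebase.

```python
def attended_positions(query_pos: int, layer_kind: str, window: int) -> range:
    """Key positions a causal query at `query_pos` can see in one layer."""
    if layer_kind == "global":
        return range(0, query_pos + 1)          # the full causal prefix
    start = max(0, query_pos - window + 1)      # only the most recent `window` tokens
    return range(start, query_pos + 1)


def cached_tokens_per_block(seq_len: int, window: int, local_per_global: int = 5) -> int:
    """Tokens held in cache across one block of 5 local layers plus 1 global layer."""
    local = local_per_global * min(seq_len, window)
    return local + seq_len


WINDOW = 1024  # hypothetical window size
print(len(attended_positions(5000, "local", WINDOW)))
print(len(attended_positions(5000, "global", WINDOW)))
print(cached_tokens_per_block(8192, WINDOW))
```

Because five of every six layers only look back over a bounded window, most of the cache stops growing once the sequence exceeds the window length.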
6. Microsoft (Phi)
Phi-4 Family
Phi-4-mini (3.8B)
- Parameters: 3.8B (dense, decoder-only transformer)
- Modality: Text >> Text
- Context: 128K tokens
- Highlights: MMLU 67.3% (5-shot), GSM8K 88.6%, ARC-C 83.7% (highest in its size class), BigBench-Hard 70.4% (0-shot CoT). Uses GQA, 200K vocabulary, and shared input-output embeddings. Trained on synthetic reasoning-rich data. Reasoning variant also available (Phi-4-mini-reasoning).
- Inference Support: Transformers, Ollama, vLLM, ONNX Runtime, llama.cpp, MLX, TensorRT-LLM
- License: MIT
Phi-4-multimodal (5.6B)
- Parameters: 5.6B
- Modality: Text + Image + Audio >> Text
- Highlights: First Phi model supporting audio, image, and text input simultaneously. Surpasses Qwen2-Audio on speech/audio/music understanding. Mixture-of-LoRAs architecture.
- Inference Support: Transformers, ONNX Runtime, Intel Extension for PyTorch
- License: MIT
7. Mistral AI

Ministral 3 Family (December 2025)
Ministral 3 3B
- Parameters: 3B (dense)
- Modality: Text + Image >> Text
- Context: 131K tokens
- Highlights: TTFT 0.51s (lowest among Mistral models). Costs $0.10/M tokens. Outperforms Mistral 7B on most benchmarks. Base, instruct, and reasoning variants. Image understanding included.
- Inference Support: Transformers, vLLM, Ollama, llama.cpp, Mistral API
- License: Apache 2.0
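The quoted price and time-to-first-token figures translate into back-of-the-envelope budget estimates. The sketch below assumes a flat rate for input and output tokens (the text does not say whether they are priced differently) and takes the decoding speed as a caller-supplied, hypothetical value.

```python
PRICE_PER_MILLION_USD = 0.10  # from the text
TTFT_S = 0.51                 # from the text


def cost_usd(total_tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Cost under a flat per-token rate (assumption: same rate for input and output)."""
    return total_tokens / 1_000_000 * price_per_million


def response_latency_s(
    output_tokens: int, decode_tok_per_s: float, ttft_s: float = TTFT_S
) -> float:
    """Time to first token plus decoding time for the output tokens."""
    return ttft_s + output_tokens / decode_tok_per_s


print(cost_usd(2_500_000))
print(response_latency_s(output_tokens=100, decode_tok_per_s=50.0))  # 50 tok/s is hypothetical
```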
Ministral 3 8B
- Parameters: 8B (dense, interleaved sliding-window attention)
- Modality: Text + Image >> Text
- Context: 262K tokens
- Highlights: Outperforms the larger Gemma 12B on most evaluations. Interleaved sliding-window attention for memory-efficient inference. Base, instruct, and reasoning variants.
- Inference Support: Transformers, vLLM, Ollama, llama.cpp, Mistral API
- License: Apache 2.0
Voxtral TTS
- Parameters: 3B (8 GB BF16, 3 GB quantized)
- Modality: Text >> Speech
- Highlights: Supports 9 languages. Zero-shot voice cloning with 3 seconds of reference audio. Cross-lingual generation. 70ms model latency on H200. Built on Ministral 3B backbone.
- Inference Support: Transformers, Mistral API
- License: Apache 2.0
Pixtral 12B
- Parameters: 12B
- Modality: Text + Image >> Text
- Context: 128K tokens
- Highlights: First multi-modal (text+image) model from Mistral. Strong chart and image comprehension. Competitive with Llama 3.2 11B Vision.
- Inference Support: Transformers, vLLM, Ollama, Mistral API
- License: Apache 2.0
8. Meta (Llama)

Llama 3.2 Family
Llama 3.2 1B / 3B
- Parameters: 1B, 3B
- Modality: Text >> Text
- Context: 128K tokens
- Highlights: Optimized for on-device, mobile, and edge inference. Strong at tool routing for their size. 3B outperforms Gemma 2 2.6B and Phi 3.5-mini. Trained on 9T tokens. 8 officially supported languages.
- Inference Support: Transformers, Ollama, vLLM, llama.cpp, MLX, ExecuTorch, TensorRT-LLM
- License: Llama 3.2 Community License (700M MAU restriction)
Llama 3.2 11B-Vision
- Parameters: 11B
- Modality: Text + Image >> Text
- Context: 128K tokens
- Highlights: Beats Gemini 1.5 Flash 8B on DocVQA. Tops Claude 3 Haiku/Sonnet on AI2D, ChartQA, MathVista. Competitive with Pixtral 12B and Qwen2-VL 7B on VQAv2.
- Inference Support: Transformers, vLLM, Ollama, llama.cpp, TensorRT-LLM
- License: Llama 3.2 Community License
9. TII / Falcon

Falcon 3 Family (December 2024)
Falcon 3 1B / 3B / 7B / 10B
- Parameters: 1B, 3B, 7B, 10B
- Modality: Text >> Text
- Context: Varies by model
- Highlights: Trained on 14T tokens. 10B and 7B outperform Gemma 2-9B, Llama 3.1-8B, Mistral 7B, and Yi 1.5-9B. They surpass Qwen 2.5-7B on MUSR, MATH, GPQA, IFEval. Supports English, French, Spanish, Portuguese.
- Inference Support: Transformers, Ollama, vLLM, llama.cpp
- License: Apache 2.0
Falcon H1R 7B (January 2026)
Falcon H1R 7B
- Parameters: 7B (hybrid Transformer + Mamba)
- Modality: Text >> Text
- Highlights: Scores 96.7% on AIME 2025 with Deep Think with Confidence (DeepConf). Uses 38% fewer tokens than DeepSeek-R1-0528-Qwen3-8B. Achieves 1,500 tok/s/GPU at batch size 64. Hybrid architecture offers best-of-both-worlds efficiency.
- Inference Support: Transformers, vLLM
- License: Falcon LLM License 1.0 (Apache 2.0-based, attribution required)
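Aggregate throughput and token-savings figures are easier to interpret once converted to a per-request view. The sketch below assumes the 1,500 tok/s is shared evenly across the 64 sequences in the batch, which is a simplification, and uses a hypothetical 10,000-token baseline trace to show what a 38% reduction looks like.

```python
def per_request_tok_per_s(aggregate_tok_per_s: float, batch_size: int) -> float:
    """Per-sequence decoding speed, assuming throughput is split evenly across the batch."""
    return aggregate_tok_per_s / batch_size


def tokens_after_reduction(baseline_tokens: int, reduction: float = 0.38) -> int:
    """Tokens used when a model needs `reduction` (as a fraction) fewer tokens than a baseline."""
    return round(baseline_tokens * (1 - reduction))


rate = per_request_tok_per_s(1500, 64)   # figures from the text
trace = tokens_after_reduction(10_000)   # 10,000-token baseline is hypothetical
print(f"{rate:.1f} tok/s per request, {trace} tokens, about {trace / rate:.0f} s per trace")
```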
Falcon Mamba 7B
- Parameters: 7B (State Space Model)
- Modality: Text >> Text
- Highlights: World's top open-source State Space Language Model (SSLM) at release. Low memory cost for arbitrary-length text generation. Outperforms Llama 3.1 8B and Mistral 7B.
- Inference Support: Transformers, vLLM
- License: Falcon Terms (permissive)
10. AI2 / Allen Institute (Molmo / OLMo)
Molmo-7B-D
- Parameters: 7B (built on Qwen 2 7B)
- Modality: Text + Image >> Text
- Highlights: Outperforms Llama 3.2 Vision in some benchmarks. Fully open source with all training data, code, and evaluation published.
- Inference Support: Transformers, vLLM
- License: Apache 2.0
Molmo-E
- Parameters: 7B total, 1B active (OLMoE-based)
- Modality: Text + Image >> Text
- Highlights: Efficient MoE architecture with only 1B active parameters per forward pass.
- Inference Support: Transformers
- License: Apache 2.0
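The gap between total and active parameters comes from expert routing: each token is sent to only a few experts, so most weights sit idle on any given forward pass. Below is a minimal, generic sketch of top-k routing. The number of experts and the value of k are hypothetical, since the text does not give Molmo-E's routing configuration, and this is not the model's actual implementation.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits: list[float], k: int) -> list[tuple[int, float]]:
    """Pick the k highest-probability experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def active_fraction(total_params: float, active_params: float) -> float:
    """Share of the model's weights used for a single token."""
    return active_params / total_params


# Hypothetical router output for one token over four experts, with k = 2.
picks = route_top_k([0.1, 2.0, -1.0, 1.5], k=2)
print(picks)
print(f"{active_fraction(7e9, 1e9):.0%} of parameters active per token")
```

Only the selected experts run for that token, so compute per token scales with the active parameter count, while memory must still hold all 7B weights.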
11. Shanghai AI Lab (InternVL)
InternVL3-1B / 2B / 4B / 8B
- Parameters: 1B, 2B, 4B, 8B
- Modality: Text + Image + Video >> Text
- Highlights: Strong industrial and 3D reasoning capabilities. InternVL3-8B is competitive with much larger models on MMMU. Evaluated in zero-shot settings for construction safety, medical imaging, and document understanding.
- Inference Support: Transformers, vLLM, LMDeploy
- License: MIT
12. Speech & Audio Models
OpenAI Whisper (Tiny / Base / Small / Medium)
- Parameters: 39M (Tiny), 74M (Base), 244M (Small), 769M (Medium)
- Modality: Audio >> Text (ASR)
- Highlights: Supports 99+ languages. Most widely used open-source ASR. Strong in noisy environments. Extensive community tooling.
- Inference Support: Transformers, whisper.cpp, faster-whisper, MLX, ONNX, Ollama
- License: MIT
Useful Sensors Moonshine (Tiny / Base)
- Parameters: 27M (Tiny), 100M (Base)
- Modality: Audio >> Text (ASR)
- Highlights: Outperforms Whisper Tiny/Small despite being smaller. Moonshine v2 uses an Ergodic Streaming Encoder for latency-critical applications. Multi-language variants available.
- Inference Support: ONNX, Custom C++ runtime
- License: Permissive open-source
CosyVoice2-0.5B (Alibaba)
- Parameters: 0.5B
- Modality: Text >> Speech (TTS)
- Highlights: Finite scalar quantization. Streams at 150ms. Frame-level control over emotion and dialect.
- Inference Support: Custom framework
- License: Apache 2.0
13. Other Notable Small Models
Mistral 7B v0.3
- Parameters: 7B
- Modality: Text >> Text
- Highlights: Fastest wall-clock throughput in the 7B class. Practical pick when latency matters. Requires about 4.5 GB of VRAM, a figure that implies roughly 4-bit quantized weights.
- Inference Support: Transformers, Ollama, vLLM, llama.cpp, MLX, GGUF
- License: Apache 2.0
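VRAM figures like the one above depend heavily on weight precision. A rough estimate of weight memory alone is parameters times bits per weight divided by eight. The sketch below uses a round 7 billion parameters and ignores the KV cache, activations, and runtime overhead, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 7e9  # round figure for a 7B model

for label, bits in (("16-bit", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

By this estimate, a 7B model fits in roughly 4 to 5 GB only when its weights are quantized to around 4 bits.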
DeepSeek-R1-Distill-Qwen-1.5B / 7B
- Parameters: 1.5B, 7B (distilled from DeepSeek-R1)
- Modality: Text >> Text
- Highlights: Reasoning-focused distilled models. The 7B variant is derived from Qwen2.5 series and fine-tuned with 800K curated samples.
- Inference Support: Transformers, vLLM, Ollama, llama.cpp
- License: Apache 2.0
TinyLlama 1.1B
- Parameters: 1.1B
- Modality: Text >> Text
- Highlights: Small Llama 2 variant. Useful baseline for research and fine-tuning experiments.
- Inference Support: Transformers, Ollama, llama.cpp
- License: Apache 2.0