LiquidAI Foundation Models (LFMs): A Master Guide to Efficiency by Design
LiquidAI's LFMs use a hybrid convolution+attention architecture to deliver faster inference and lower memory than pure transformers. Full model catalog, tier-by-tier benchmarks, and ecosystem trade-offs.

LiquidAI's Liquid Foundation Models use a hybrid architecture: gated short convolutions alternating with grouped query attention blocks. The result is lower KV cache usage, faster decode, and smaller memory footprint than equivalent pure-transformer models.
This post covers the full catalog from 350M to 8.3B parameters, tier-by-tier benchmark comparisons against Qwen, Phi, Gemma, and Llama, and the honest trade-offs.
1. Company & Background
Founded in 2023 by MIT computer scientists working on dynamical systems and neural network theory. The company spun out of Liquid Neural Network (LNN) research. LNNs are built on continuous-time differential equations, not the standard transformer self-attention mechanism.
Core argument: the dense transformer is over-parameterized for most edge workloads. Replacing most attention layers with lightweight gated convolutions gives comparable performance with lower memory and decode latency.
[!NOTE] "Liquid" refers to Liquid Neural Networks, a specific architecture class from MIT's CSAIL lab. Not a marketing term.
2. The Hybrid Architecture
LFMs alternate Grouped Query Attention (GQA) blocks with gated short convolutional layers (LIV layers). In LFM2.5-8B, the ratio is 18 convolution layers to 6 GQA layers.
Why the ratio matters:
- KV cache: Only the 6 GQA blocks generate KV pairs. Memory scales with 6 attention layers, not 24.
- Prefill speed: Convolutions parallelize across the sequence dimension. Long prompts are processed faster than pure self-attention.
- Constant memory at long context: Convolution layers use fixed memory regardless of sequence length, unlike transformers where KV cache grows linearly.
3. Model Catalog
Text (LFM2 / LFM2.5)
| Model | Params | Training | Key Scores | Inference | License |
|---|---|---|---|---|---|
| LFM2-350M | 350M | 10T tokens | Browser control fine-tuning | Transformers, LEAP SDK | LFM Open |
| LFM2-700M | 700M | 10T tokens | Edge text inference | Transformers, LEAP SDK | LFM Open |
| LFM2-1.2B | 1.2B | 10T tokens | Competitive with Llama 3.2-1B | Transformers, ExecuTorch, LEAP SDK | LFM Open |
| LFM2-2.6B | 2.6B | 10T tokens | 82.41% GSM8K, 79.56% IFEval | Transformers, ExecuTorch, LEAP SDK | LFM Open |
| LFM2.5-1.2B | 1.2B | 28T tokens | RL-tuned: instruction, tool use, math | Transformers, ExecuTorch, LEAP SDK | LFM Open |
[!TIP] LFM2.5-1.2B was trained on 28T tokens. Nearly 3x the LFM2 budget. One of the most heavily trained models at the 1B scale. Japanese variant available (LFM2.5-1.2B-JP).
Vision-Language
| Model | Params | Vision Encoder | Key Feature | License |
|---|---|---|---|---|
| LFM2.5-VL-1.6B | ~1.6B | LFM2.5 native | Native 512x512 resolution, WebGPU video captioning, multi-image OCR | LFM Open |
Audio
| Model | Params | Modality | Key Feature | License |
|---|---|---|---|---|
| LFM2.5-Audio-1.5B | ~1.5B | Audio >> Text/Audio | Edge voice agent, speech understanding | LFM Open |
Mixture-of-Experts
| Model | Total Params | Active Params | Context | Architecture |
|---|---|---|---|---|
| LFM2.5-8B-A1B | 8.3B | 1.5B | 128K tokens | MoE with hybrid conv+GQA |
8.3B total parameters, 1.5B active per token. 8B-class knowledge at 1.5B-class decode speed.
4. Head-to-Head by Parameter Tier
Tier 1: Under 500M
LiquidAI has no serious production-grade competition here. SmolLM and SmolVLM are research baselines.
| Model | Provider | Params | Modality | Use Case |
|---|---|---|---|---|
| LFM2-350M | Liquid AI | 350M | Text | Browser control, edge text |
| LFM2-VL-450M | Liquid AI | 450M | Image+Text | Hyper-efficient edge VLM |
| SmolVLM-256M | Hugging Face | 256M | Image+Text | Smallest practical VLM |
| SmolLM2-360M | Hugging Face | 360M | Text | Research baseline |
Tier 2: 1B to 2B
| Model | Params | Modality | Context | Key Score | License |
|---|---|---|---|---|---|
| LFM2.5-1.2B | 1.2B | Text | — | 28T token pretraining | LFM Open |
| LFM2.5-VL-1.6B | 1.6B | Image+Text | — | WebGPU video captioning | LFM Open |
| MiniCPM-V 4.6 | 1.3B | Image+Text+Video | 262K | Intelligence Index 13 (best under 2B) | Apache 2.0 |
| Qwen3.5-0.8B | 0.8B | Image+Text+Video | 262K | Intelligence Index 9 | Apache 2.0 |
| Llama 3.2 1B | 1B | Text | 128K | Tool routing, ExecuTorch | Llama Community |
~1.3B VLM comparison:
| Feature | MiniCPM-V 4.6 (1.3B) | LFM2.5-VL-1.6B | SmolVLM-2.2B |
|---|---|---|---|
| Vision encoder | SigLIP2-400M | LFM2.5 native | SigLIP-400M |
| Video understanding | Yes | Yes (real-time) | Yes |
| Mobile deployment | iOS, Android, HarmonyOS | LEAP SDK only | Limited |
| Visual token compression | Mixed 4x/16x | Native resolution | 81 tokens/patch |
| Intelligence Index | 13 | — | — |
| Context window | 262K | — | — |
[!WARNING] At the 1-2B VLM tier, MiniCPM-V 4.6 leads on every measurable benchmark. LFM2.5-VL-1.6B's strength is inference speed. It loses on context window, benchmark scores, and mobile SDK breadth.
Tier 3: 2.6B to 4B
| Model | Params | MMLU | GSM8K | HumanEval | Context | Multimodal |
|---|---|---|---|---|---|---|
| LFM2-2.6B | 2.6B | — | 82.41% | — | — | No |
| Phi-4-mini | 3.8B | 67.3% | 88.6% | 55% | 128K | No |
| Gemma 3 4B | 4B | 65% | 89.2% | 71.3% | 128K | Image |
| Qwen3.5-4B | 4B | 64% | — | — | 262K | Image+Video |
| SmolLM3-3B | 3B | 63% | 80% | — | 128K (genuine) | No |
| Llama 3.2 3B | 3B | 61.8% | — | — | 128K | No |
LFM2-2.6B scores 82.41% on GSM8K. Phi-4-mini (88.6%) and Gemma 3 4B (89.2%) both beat it on math. LiquidAI's advantage at this tier is decode latency and memory, not benchmark scores.
Inference speed (3-4B, Q4_K_M):
| Model | Tok/s (M2 MacBook) | Tok/s (RTX 4060) | VRAM |
|---|---|---|---|
| Phi-4-mini | 40-50 | 95 | 3 GB |
| Gemma 3 4B | 30-40 | 70 | 3.5 GB |
| SmolLM3-3B | 35-45 | 80 | 2.5 GB |
| Llama 3.2 3B | 35-45 | 80 | 2.5 GB |
| LFM2-2.6B | — | — | Lower (hybrid arch) |
[!NOTE] LiquidAI does not publish standardized tok/s numbers in the same format. Their edge is constant-memory decode at long context, which short-context benchmarks don't capture well.
5. Inference Engine Support
The biggest practical gap in the LFM ecosystem.
| Model Family | Transformers | Ollama | vLLM | llama.cpp | MLX | SGLang | ONNX | ExecuTorch |
|---|---|---|---|---|---|---|---|---|
| Liquid AI LFM2 | ✅ | — | — | — | — | — | — | ✅ |
| Qwen3.5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | — |
| MiniCPM | ✅ | ✅ | ✅ | ✅ | — | ✅ | — | — |
| Phi-4 | ✅ | ✅ | ✅ | ✅ | ✅ | — | ✅ | — |
| Gemma 3/4 | ✅ | ✅ | ✅ | ✅ | ✅ | — | ✅ | — |
| Llama 3.2 | ✅ | ✅ | ✅ | ✅ | ✅ | — | ✅ | ✅ |
LFMs run only through HuggingFace Transformers, ExecuTorch, and the proprietary LEAP SDK. No Ollama, no vLLM, no llama.cpp, no MLX.
Practical implications:
- Cannot use Ollama for local chat or API serving.
- Cannot deploy through vLLM in production.
- No GGUF quantization via llama.cpp.
- No Apple Silicon acceleration via MLX.
[!CAUTION] If your stack depends on Ollama, vLLM, or llama.cpp, LFMs are not drop-in replacements for Qwen, Llama, or Phi today. This is the primary adoption barrier.
6. Community Sentiment
From r/LocalLLaMA and GitHub:
- Speed holds up. The 1B and 1.2B variants consistently draw comments about generation feeling like streaming. The architecture's latency claims are real.
- Quality above class. LFM2.5-1.2B output quality is frequently compared to 2-3B transformer models for summarization, filtering, and categorization tasks.
- Reasoning is the weak spot. Multi-step logical reasoning is where LFMs lose to Qwen and Phi at equivalent sizes. The architecture is optimized for throughput, not chain-of-thought depth.
- Ecosystem friction. Lack of Ollama and llama.cpp support is the most common complaint. Users who want to test LFMs often find their existing local setup simply doesn't load the model.
7. Decision Guide
| Use Case | Best Pick | Reason |
|---|---|---|
| Ultra-fast edge text | LFM2.5-1.2B | 28T training, lowest latency at 1B scale |
| Edge VLM + real-time video | LFM2.5-VL-1.6B | WebGPU captioning, native resolution |
| Edge voice agent | LFM2.5-Audio-1.5B | No competing model at this size and modality |
| 8B knowledge at 1.5B speed | LFM2.5-8B-A1B | MoE, 1.5B active params per token |
| Best multimodal under 2B | MiniCPM-V 4.6 | Intelligence Index 13, 262K context, full mobile OS support |
| Best reasoning under 4B | Phi-4-mini | 88.6% GSM8K, 70.4% BBH |
| Best code under 4B | Gemma 3 4B | 71.3% HumanEval |
| Broadest inference support | Qwen3.5 (any size) | Ollama, vLLM, llama.cpp, MLX, SGLang all supported |
Conclusion
LiquidAI's hybrid architecture delivers on its core promise: lower memory and faster decode than pure transformers at equivalent parameter counts. The LFM2.5-Audio-1.5B has no direct competitor at its size and modality.
The blocker is ecosystem. Until LFMs get Ollama, vLLM, and llama.cpp support, they are a specialist choice. The architecture is production-ready. The tooling is not.
References
← Previous Post
The OpenBMB Ecosystem: A Master Guide to MiniCPM and Beyond
Next Post →
The Landscape of Small LLMs and VLMs (Under 12B)
If the article helped you in some way, consider giving it a like. This will mean a lot to me. You can download the code related to the post using the download button below.
If you see any bug, have a question for me, or would like to provide feedback, please drop a comment below.