LiquidAI Foundation Models (LFMs): A Master Guide to Efficiency by Design

LiquidAI's Liquid Foundation Models use a hybrid architecture: gated short convolutions alternating with grouped query attention blocks. The result is lower KV cache usage, faster decode, and smaller memory footprint than equivalent pure-transformer models.

This post covers the full catalog from 350M to 8.3B parameters, tier-by-tier benchmark comparisons against Qwen, Phi, Gemma, and Llama, and the honest trade-offs.

1. Company & Background

Founded in 2023 by MIT computer scientists working on dynamical systems and neural network theory. The company spun out of Liquid Neural Network (LNN) research. LNNs are built on continuous-time differential equations, not the standard transformer self-attention mechanism.

Core argument: the dense transformer is over-parameterized for most edge workloads. Replacing most attention layers with lightweight gated convolutions gives comparable performance with lower memory and decode latency.

[!NOTE] "Liquid" refers to Liquid Neural Networks, a specific architecture class from MIT's CSAIL lab. Not a marketing term.

2. The Hybrid Architecture

LFMs alternate Grouped Query Attention (GQA) blocks with gated short convolutional layers (LIV layers). In LFM2.5-8B, the ratio is 18 convolution layers to 6 GQA layers.

Loading diagram...

Why the ratio matters:

KV cache: Only the 6 GQA blocks generate KV pairs. Memory scales with 6 attention layers, not 24.
Prefill speed: Convolutions parallelize across the sequence dimension. Long prompts are processed faster than pure self-attention.
Constant memory at long context: Convolution layers use fixed memory regardless of sequence length, unlike transformers where KV cache grows linearly.

3. Model Catalog

Text (LFM2 / LFM2.5)

Model	Params	Training	Key Scores	Inference	License
LFM2-350M	350M	10T tokens	Browser control fine-tuning	Transformers, LEAP SDK	LFM Open
LFM2-700M	700M	10T tokens	Edge text inference	Transformers, LEAP SDK	LFM Open
LFM2-1.2B	1.2B	10T tokens	Competitive with Llama 3.2-1B	Transformers, ExecuTorch, LEAP SDK	LFM Open
LFM2-2.6B	2.6B	10T tokens	82.41% GSM8K, 79.56% IFEval	Transformers, ExecuTorch, LEAP SDK	LFM Open
LFM2.5-1.2B	1.2B	28T tokens	RL-tuned: instruction, tool use, math	Transformers, ExecuTorch, LEAP SDK	LFM Open

[!TIP] LFM2.5-1.2B was trained on 28T tokens. Nearly 3x the LFM2 budget. One of the most heavily trained models at the 1B scale. Japanese variant available (LFM2.5-1.2B-JP).

Vision-Language

Model	Params	Vision Encoder	Key Feature	License
LFM2.5-VL-1.6B	~1.6B	LFM2.5 native	Native 512x512 resolution, WebGPU video captioning, multi-image OCR	LFM Open

Audio

Model	Params	Modality	Key Feature	License
LFM2.5-Audio-1.5B	~1.5B	Audio >> Text/Audio	Edge voice agent, speech understanding	LFM Open

Mixture-of-Experts

Model	Total Params	Active Params	Context	Architecture
LFM2.5-8B-A1B	8.3B	1.5B	128K tokens	MoE with hybrid conv+GQA

8.3B total parameters, 1.5B active per token. 8B-class knowledge at 1.5B-class decode speed.

4. Head-to-Head by Parameter Tier

Tier 1: Under 500M

LiquidAI has no serious production-grade competition here. SmolLM and SmolVLM are research baselines.

Model	Provider	Params	Modality	Use Case
LFM2-350M	Liquid AI	350M	Text	Browser control, edge text
LFM2-VL-450M	Liquid AI	450M	Image+Text	Hyper-efficient edge VLM
SmolVLM-256M	Hugging Face	256M	Image+Text	Smallest practical VLM
SmolLM2-360M	Hugging Face	360M	Text	Research baseline

Tier 2: 1B to 2B

Model	Params	Modality	Context	Key Score	License
LFM2.5-1.2B	1.2B	Text	—	28T token pretraining	LFM Open
LFM2.5-VL-1.6B	1.6B	Image+Text	—	WebGPU video captioning	LFM Open
MiniCPM-V 4.6	1.3B	Image+Text+Video	262K	Intelligence Index 13 (best under 2B)	Apache 2.0
Qwen3.5-0.8B	0.8B	Image+Text+Video	262K	Intelligence Index 9	Apache 2.0
Llama 3.2 1B	1B	Text	128K	Tool routing, ExecuTorch	Llama Community

~1.3B VLM comparison:

Feature	MiniCPM-V 4.6 (1.3B)	LFM2.5-VL-1.6B	SmolVLM-2.2B
Vision encoder	SigLIP2-400M	LFM2.5 native	SigLIP-400M
Video understanding	Yes	Yes (real-time)	Yes
Mobile deployment	iOS, Android, HarmonyOS	LEAP SDK only	Limited
Visual token compression	Mixed 4x/16x	Native resolution	81 tokens/patch
Intelligence Index	13	—	—
Context window	262K	—	—

[!WARNING] At the 1-2B VLM tier, MiniCPM-V 4.6 leads on every measurable benchmark. LFM2.5-VL-1.6B's strength is inference speed. It loses on context window, benchmark scores, and mobile SDK breadth.

Tier 3: 2.6B to 4B

Model	Params	MMLU	GSM8K	HumanEval	Context	Multimodal
LFM2-2.6B	2.6B	—	82.41%	—	—	No
Phi-4-mini	3.8B	67.3%	88.6%	55%	128K	No
Gemma 3 4B	4B	65%	89.2%	71.3%	128K	Image
Qwen3.5-4B	4B	64%	—	—	262K	Image+Video
SmolLM3-3B	3B	63%	80%	—	128K (genuine)	No
Llama 3.2 3B	3B	61.8%	—	—	128K	No

LFM2-2.6B scores 82.41% on GSM8K. Phi-4-mini (88.6%) and Gemma 3 4B (89.2%) both beat it on math. LiquidAI's advantage at this tier is decode latency and memory, not benchmark scores.

Inference speed (3-4B, Q4_K_M):

Model	Tok/s (M2 MacBook)	Tok/s (RTX 4060)	VRAM
Phi-4-mini	40-50	95	3 GB
Gemma 3 4B	30-40	70	3.5 GB
SmolLM3-3B	35-45	80	2.5 GB
Llama 3.2 3B	35-45	80	2.5 GB
LFM2-2.6B	—	—	Lower (hybrid arch)

[!NOTE] LiquidAI does not publish standardized tok/s numbers in the same format. Their edge is constant-memory decode at long context, which short-context benchmarks don't capture well.

5. Inference Engine Support

The biggest practical gap in the LFM ecosystem.

Model Family	Transformers	Ollama	vLLM	llama.cpp	MLX	SGLang	ONNX	ExecuTorch
Liquid AI LFM2	✅	—	—	—	—	—	—	✅
Qwen3.5	✅	✅	✅	✅	✅	✅	—	—
MiniCPM	✅	✅	✅	✅	—	✅	—	—
Phi-4	✅	✅	✅	✅	✅	—	✅	—
Gemma 3/4	✅	✅	✅	✅	✅	—	✅	—
Llama 3.2	✅	✅	✅	✅	✅	—	✅	✅

LFMs run only through HuggingFace Transformers, ExecuTorch, and the proprietary LEAP SDK. No Ollama, no vLLM, no llama.cpp, no MLX.

Practical implications:

Cannot use Ollama for local chat or API serving.
Cannot deploy through vLLM in production.
No GGUF quantization via llama.cpp.
No Apple Silicon acceleration via MLX.

[!CAUTION] If your stack depends on Ollama, vLLM, or llama.cpp, LFMs are not drop-in replacements for Qwen, Llama, or Phi today. This is the primary adoption barrier.

6. Community Sentiment

From r/LocalLLaMA and GitHub:

Speed holds up. The 1B and 1.2B variants consistently draw comments about generation feeling like streaming. The architecture's latency claims are real.
Quality above class. LFM2.5-1.2B output quality is frequently compared to 2-3B transformer models for summarization, filtering, and categorization tasks.
Reasoning is the weak spot. Multi-step logical reasoning is where LFMs lose to Qwen and Phi at equivalent sizes. The architecture is optimized for throughput, not chain-of-thought depth.
Ecosystem friction. Lack of Ollama and llama.cpp support is the most common complaint. Users who want to test LFMs often find their existing local setup simply doesn't load the model.

7. Decision Guide

Use Case	Best Pick	Reason
Ultra-fast edge text	LFM2.5-1.2B	28T training, lowest latency at 1B scale
Edge VLM + real-time video	LFM2.5-VL-1.6B	WebGPU captioning, native resolution
Edge voice agent	LFM2.5-Audio-1.5B	No competing model at this size and modality
8B knowledge at 1.5B speed	LFM2.5-8B-A1B	MoE, 1.5B active params per token
Best multimodal under 2B	MiniCPM-V 4.6	Intelligence Index 13, 262K context, full mobile OS support
Best reasoning under 4B	Phi-4-mini	88.6% GSM8K, 70.4% BBH
Best code under 4B	Gemma 3 4B	71.3% HumanEval
Broadest inference support	Qwen3.5 (any size)	Ollama, vLLM, llama.cpp, MLX, SGLang all supported

Conclusion

LiquidAI's hybrid architecture delivers on its core promise: lower memory and faster decode than pure transformers at equivalent parameter counts. The LFM2.5-Audio-1.5B has no direct competitor at its size and modality.

The blocker is ecosystem. Until LFMs get Ollama, vLLM, and llama.cpp support, they are a specialist choice. The architecture is production-ready. The tooling is not.