AI & Machine Learning

Liquid AI's LFM2-VL-3B Brings a 3B-Parameter Vision-Language Model (VLM) to Edge-Class Devices

By NextTech · October 25, 2025 · 5 Mins Read


Liquid AI released LFM2-VL-3B, a 3B-parameter vision-language model for image-text-to-text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.

Model overview and interface

LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML-like template. The processor inserts an image sentinel that is replaced with encoded image tokens at run time. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model with existing multimodal pipelines.

(Figure: LFM2-VL-3B model overview. Source: https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge)

Architecture

The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution-plus-attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters; it preserves native aspect ratios and avoids distortion. The connector is a 2-layer MLP with pixel unshuffle; it compresses image tokens before fusion with the language space. This design lets users cap vision-token budgets without retraining the model.

The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches. A thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for minimum and maximum image tokens and for the tiling switch. These controls tune speed and quality at inference time.
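As a sanity check on the documented mapping, the 96-token figure for a 256×384 image is consistent with a 16-pixel patch grid followed by 2×2 pixel unshuffle. The patch size and unshuffle factor here are our assumptions, not figures stated in the post, and the helper only covers images that fit within a single 512×512 tile:

```python
def estimate_image_tokens(width: int, height: int,
                          patch: int = 16, unshuffle: int = 2) -> int:
    """Estimate vision tokens for an image that fits in one 512x512 tile.

    Assumes a `patch`-pixel patch grid and an `unshuffle` x `unshuffle`
    pixel-unshuffle step in the projector (both assumed, not documented).
    """
    patches = (width // patch) * (height // patch)   # raw encoder patches
    return patches // (unshuffle ** 2)               # compressed by the projector

# Documented example: a 256x384 image maps to 96 tokens.
print(estimate_image_tokens(256, 384))  # -> 96
```

Under the same assumptions, a full 512×512 tile yields 256 tokens, which lines up with the recommended maximum image-token budget mentioned below.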

Inference settings

The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min-p 0.15, and a repetition penalty of 1.05. Vision settings use a minimum of 64 image tokens, a maximum of 256 image tokens, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
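A minimal sketch of these settings with the Transformers API, assuming the `LiquidAI/LFM2-VL-3B` repo id and that the processor accepts `min_image_tokens`, `max_image_tokens`, and `do_image_splitting` keyword arguments (assumptions based on the model card's description, not verified here):

```python
# Recommended decoding settings from the LFM2-VL-3B model card.
GEN_KWARGS = dict(
    do_sample=True,
    temperature=0.1,           # low temperature for stable outputs
    min_p=0.15,
    repetition_penalty=1.05,
    max_new_tokens=256,
)

# Recommended vision-side controls (processor-level).
VISION_KWARGS = dict(
    min_image_tokens=64,       # lower bound on the per-image token budget
    max_image_tokens=256,      # upper bound; caps latency on large images
    do_image_splitting=True,   # tile inputs larger than 512x512
)

def describe(image_path: str, prompt: str) -> str:
    # Heavy dependencies are imported lazily so the settings above can be
    # inspected without torch / transformers installed.
    import torch
    from PIL import Image
    from transformers import AutoModelForImageTextToText, AutoProcessor

    model_id = "LiquidAI/LFM2-VL-3B"  # repo id assumed from the release
    processor = AutoProcessor.from_pretrained(model_id, **VISION_KWARGS)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    conversation = [{
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open(image_path)},
            {"type": "text", "text": prompt},
        ],
    }]
    # The processor applies the ChatML-like template and inserts the
    # image sentinel automatically.
    inputs = processor.apply_chat_template(
        conversation,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, **GEN_KWARGS)
    return processor.batch_decode(out, skip_special_tokens=True)[0]

# Example (requires the checkpoint download and a GPU-class machine):
#   print(describe("photo.jpg", "Describe this image."))
```

The interesting design point is that speed/quality trade-offs live entirely in `VISION_KWARGS`: tightening `max_image_tokens` shrinks the vision budget without touching the weights.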

How is it trained?

Liquid AI describes a staged approach. The team performs joint mid-training that adjusts the text-to-image ratio over time. The model then undergoes supervised fine-tuning focused on image understanding. The data sources are large-scale open datasets plus in-house synthetic vision data for task coverage.

Benchmarks

The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench-dev-en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit. The table excludes Qwen3-VL-2B because that system was released one day earlier.

(Figure: benchmark comparison table for lightweight open VLMs. Source: https://www.liquid.ai/blog/lfm2-vl-3b-a-new-efficient-vision-language-for-the-edge)

The language capability stays close to the LFM2-2.6B backbone. The research team cites 30 percent on GPQA and 63 percent on MMLU. This matters when perception tasks include knowledge queries. The team also states expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.

Why should edge users care?

The architecture keeps compute and memory within small-device budgets. Image tokens are compressible and user-constrained, so throughput is predictable. The SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine-grained perception. The projector reduces tokens at the connector, which improves tokens per second. The research team also published a GGUF build for on-device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.

Key Takeaways

  1. Compact multimodal stack: the 3B-parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.
  2. Resolution handling and token budgets: images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
  3. Inference interface: ChatML-like prompting with an image sentinel, a default text context of 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and easy integration in multimodal pipelines.
  4. Measured performance: reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% on GPQA and 63% on MMLU, useful for mixed perception-plus-knowledge workloads.

LFM2-VL-3B is a practical step for edge multimodal workloads. The 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector, which lowers image-token counts for predictable latency. Native-resolution processing with 512×512 tiling and token caps gives deterministic budgets. Reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge-ready VLM release with clear controls and transparent benchmarks.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
