Jina AI has launched Jina-VLM, a 2.4B parameter imaginative and prescient language mannequin that targets multilingual visible query answering and doc understanding on constrained {hardware}. The mannequin {couples} a SigLIP2 imaginative and prescient encoder with a Qwen3 language spine and makes use of an consideration pooling connector to scale back visible tokens whereas preserving spatial construction. Amongst open 2B scale VLMs, it reaches cutting-edge outcomes on multilingual benchmarks corresponding to MMMB and Multilingual MMBench.

Structure, overlapping tiles with consideration pooling connector
Jina-VLM retains the usual VLM format, however optimizes the imaginative and prescient facet for arbitrary decision and low token depend. The imaginative and prescient encoder is SigLIP2 So400M/14 384, a 27 layer Imaginative and prescient Transformer with about 400M parameters. It processes 378×378 pixel crops right into a 27×27 grid of 14×14 patches, so every tile produces 729 patch tokens.
To deal with excessive decision pictures, the mannequin doesn’t resize the total enter to a single sq.. As an alternative, it constructs a grid of as much as 12 overlapping tiles together with a world thumbnail. Every tile is a 378×378 crop, adjoining tiles overlap by 112 pixels, and the stride between tile origins is 266 pixels. A 4×3 grid covers an efficient decision of 1176×910 pixels earlier than downscaling bigger pictures to suit contained in the tile funds.
The core design is the imaginative and prescient language connector. Reasonably than utilizing the ultimate ViT layer, Jina-VLM concatenates options from two intermediate layers, the third from final and ninth from final, that correspond to layers 24 and 18. This combines excessive stage semantics and mid stage spatial element. The connector then applies consideration pooling over 2×2 patch neighborhoods. It computes a imply pooled question for every 2×2 area, attends over the total concatenated characteristic map, and outputs a single pooled token per neighborhood. This reduces 729 visible tokens per tile to 182 tokens, which is a 4 instances compression. A SwiGLU projection maps the pooled options to the Qwen3 embedding dimension.
With the default 12 tile configuration plus thumbnail, a naive connector would feed 9,477 visible tokens into the language mannequin. Consideration pooling cuts this to 2,366 visible tokens. The ViT compute doesn’t change, however for the language spine this yields about 3.9 instances fewer prefill FLOPs and 4 instances smaller KV cache. When together with the shared ViT value, the general FLOPs drop by about 2.3 instances for the default setting.
The language decoder is Qwen3-1.7B-Base. The mannequin introduces particular tokens for pictures, with and across the tile sequence and to mark rows within the patch grid. Visible tokens from the connector and textual content embeddings are concatenated and handed to Qwen3 to generate solutions.
Coaching pipeline and multilingual information combine
Coaching proceeds in 2 levels. All parts, encoder, connector and decoder, are up to date collectively, with out freezing. The complete corpus accommodates about 5M multimodal samples and 12B textual content tokens throughout greater than 30 languages. Roughly half of the textual content is English, and the remaining covers excessive and mid useful resource languages corresponding to Chinese language, Arabic, German, Spanish, French, Italian, Japanese and Korean.
Stage 1 is alignment coaching. The purpose is cross language visible grounding, not instruction following. The group makes use of caption heavy datasets PixmoCap and PangeaIns, which span pure pictures, paperwork, diagrams and infographics. They add 15 % textual content solely information from the PleiAS frequent corpus to regulate degradation on pure language duties. The connector makes use of the next studying charge and shorter warmup than the encoder and decoder to hurry up adaptation with out destabilizing the backbones.
Stage 2 is instruction tremendous tuning. Right here Jina VLM learns to comply with prompts for visible query answering and reasoning. The combo combines LLaVA OneVision, Cauldron, Cambrian, PangeaIns and FineVision, plus Aya model multilingual textual content solely directions. The Jina analysis group first practice for 30,000 steps with single supply batches, then for an additional 30,000 steps with blended supply batches. This schedule stabilizes studying within the presence of very heterogeneous supervision.
Throughout pretraining and tremendous tuning, the mannequin sees about 10B tokens within the first stage and 37B tokens within the second stage, with a complete of roughly 1,300 GPU hours reported for the principle experiments.
Benchmark profile, 2.4B mannequin with multilingual power
On normal English VQA duties that embody diagrams, charts, paperwork, OCR and blended scenes, Jina-VLM reaches a median rating of 72.3 throughout 8 benchmarks. These are AI2D, ChartQA, TextVQA, DocVQA, InfoVQA, OCRBench, SEED Bench 2 Plus and CharXiv. That is the most effective common among the many 2B scale comparability fashions on this analysis paper from Jina AI.
On multimodal comprehension and actual world understanding duties, the mannequin scores 67.4 on the multimodal group, which incorporates MME, MMB v1.1 and MMStar. It scores 61.9 on the true world group, which incorporates RealWorldQA, MME RealWorld and R Bench, and it reaches 68.2 accuracy on RealWorldQA itself, which is the most effective outcome among the many baselines thought of.


Multi picture reasoning is a weaker space. On BLINK, MuirBench and MMT, Jina-VLM reaches a median of 47.3. The analysis group level to restricted multi-image coaching information as the explanation. In distinction, hallucination management is robust. On the POPE benchmark, which measures object hallucination, the mannequin scores 90.3, the most effective rating within the comparability desk.
For mathematical and structured reasoning, the mannequin makes use of the identical structure, with out considering mode. It reaches 59.5 on MMMU and an general math rating of 33.3 throughout MathVista, MathVision, MathVerse, WeMath and LogicVista. Jina-VLM is corresponding to InternVL3-2B on this set and clearly forward of Qwen2-VL-2B, whereas InternVL3.5-2B stays stronger attributable to its bigger scale and extra specialised math coaching.
On pure textual content benchmarks, the image is blended. The analysis group experiences that Jina-VLM retains many of the Qwen3-1.7B efficiency on MMLU, GSM 8K, ARC C and HellaSwag. Nevertheless, MMLU-Professional drops from 46.4 for the bottom mannequin to 30.3 after multimodal tuning. The analysis group attribute this to instruction tuning that pushes the mannequin towards very quick solutions, which clashes with the lengthy multi step reasoning required by MMLU Professional.
The principle spotlight is multilingual multimodal understanding. On MMMB throughout Arabic, Chinese language, English, Portuguese, Russian and Turkish, Jina-VLM reaches a median of 78.8. On Multilingual MMBench throughout the identical languages, it reaches 74.3. The analysis group experiences these as cutting-edge averages amongst open 2B scale VLMs.
Comparability Desk
| Mannequin | Params | VQA Avg | MMMB | Multi. MMB | DocVQA | OCRBench |
|---|---|---|---|---|---|---|
| Jina-VLM | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |
Key Takeaways
- Jina-VLM is a 2.4B parameter VLM that {couples} SigLIP2 So400M as imaginative and prescient encoder with Qwen3-1.7B as language spine by way of an consideration pooling connector that cuts visible tokens by 4 instances whereas conserving spatial construction.
- The mannequin makes use of overlapping 378×378 tiles, 12 tiles plus a world thumbnail, to deal with arbitrary decision pictures as much as roughly 4K, then feeds solely pooled visible tokens to the LLM which reduces prefill FLOPs and KV cache dimension by about 4 instances in comparison with naive patch token utilization.
- Coaching makes use of about 5M multimodal samples and 12B textual content tokens throughout almost 30 languages in a 2 stage pipeline, first alignment with caption model information, then instruction tremendous tuning with LLaVA OneVision, Cauldron, Cambrian, PangeaIns, FineVision and multilingual instruction units.
- On English VQA, Jina-VLM reaches 72.3 common throughout 8 VQA benchmarks, and on multilingual multimodal benchmarks it leads the open 2B scale class with 78.8 on MMMB and 74.3 on Multilingual MMBench whereas conserving aggressive textual content solely efficiency.
Try the Paper, Mannequin on HF and Technical particulars. Be happy to take a look at our GitHub Web page for Tutorials, Codes and Notebooks. Additionally, be at liberty to comply with us on Twitter and don’t overlook to hitch our 100k+ ML SubReddit and Subscribe to our Publication. Wait! are you on telegram? now you’ll be able to be part of us on telegram as effectively.
Asif Razzaq is the CEO of Marktechpost Media Inc.. As a visionary entrepreneur and engineer, Asif is dedicated to harnessing the potential of Synthetic Intelligence for social good. His most up-to-date endeavor is the launch of an Synthetic Intelligence Media Platform, Marktechpost, which stands out for its in-depth protection of machine studying and deep studying information that’s each technically sound and simply comprehensible by a large viewers. The platform boasts of over 2 million month-to-month views, illustrating its reputation amongst audiences.
Elevate your perspective with NextTech Information, the place innovation meets perception.
Uncover the most recent breakthroughs, get unique updates, and join with a world community of future-focused thinkers.
Unlock tomorrow’s tendencies as we speak: learn extra, subscribe to our e-newsletter, and develop into a part of the NextTech neighborhood at NextTech-news.com

