How do you convert complex, multilingual documents (dense layouts, small scripts, formulas, charts, and handwriting) into faithful structured Markdown/JSON with state-of-the-art accuracy while keeping inference latency and memory low enough for real deployments? Baidu's PaddlePaddle team has released PaddleOCR-VL, a 0.9B-parameter vision-language model designed for end-to-end document parsing across text, tables, formulas, charts, and handwriting. The core model combines a NaViT-style (Native-resolution ViT) dynamic-resolution vision encoder with the ERNIE-4.5-0.3B decoder. It supports 109 languages.

Understanding the system design
PaddleOCR-VL is deployed as a two-stage pipeline. Stage one (PP-DocLayoutV2) performs page-level layout analysis: an RT-DETR detector localizes and classifies regions, and a pointer network predicts reading order. Stage two (PaddleOCR-VL-0.9B) performs element-level recognition conditioned on the detected layout. Final outputs are aggregated into Markdown and JSON for downstream consumption. This decoupling mitigates the long-sequence decoding latency and instability that end-to-end VLMs face on dense, multi-column pages that mix text and graphics.
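The two-stage flow described above can be illustrated with a minimal sketch. Everything here is hypothetical: `Region`, `detect_layout`, `recognize`, and `parse_page` are illustrative stand-ins with canned outputs, not the PaddleOCR API. The point is only the control flow: detect regions, sort them by predicted reading order, recognize each one independently, then aggregate to Markdown and JSON.

```python
import json
from dataclasses import dataclass


@dataclass
class Region:
    """One detected layout element (hypothetical structure, not the PaddleOCR API)."""
    kind: str    # e.g. "title", "text", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels
    order: int   # reading-order index predicted in stage one


def detect_layout(page):
    """Stage-one stand-in: returns regions with a reading order.

    A real system would run a detector and a reading-order model here;
    this stub returns canned regions in detection (not reading) order.
    """
    return [
        Region("text", (50, 120, 550, 300), order=1),
        Region("title", (50, 40, 550, 100), order=0),
        Region("formula", (50, 320, 550, 380), order=2),
    ]


def recognize(page, region):
    """Stage-two stand-in: returns the content of one cropped region."""
    canned = {
        "title": "Quarterly Report",
        "text": "Revenue grew in all regions.",
        "formula": "E = mc^2",
    }
    return canned[region.kind]


def to_markdown(kind, content):
    """Render one element using a simple, assumed Markdown convention."""
    if kind == "title":
        return f"# {content}"
    if kind == "formula":
        return f"$$ {content} $$"
    return content


def parse_page(page):
    """Run both stages and aggregate results into Markdown and JSON."""
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    elements = [
        {"type": r.kind, "bbox": list(r.bbox), "content": recognize(page, r)}
        for r in regions
    ]
    markdown = "\n\n".join(to_markdown(e["type"], e["content"]) for e in elements)
    return markdown, json.dumps(elements, indent=2)


markdown, structured = parse_page(page=None)
print(markdown)
```

Because each region is recognized on its own, the decoder only ever produces short sequences, which is the latency and stability argument the article makes for decoupling layout from recognition.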
At the model level, PaddleOCR-VL-0.9B integrates a NaViT-style dynamic high-resolution encoder (native-resolution sequence packing) with a 2-layer MLP projector and the ERNIE-4.5-0.3B language model; 3D-RoPE is used for positional representation. The technical report attributes fewer hallucinations and better performance on text-dense pages to native-resolution processing, relative to fixed-resize or tiling approaches. The NaViT idea (patch-and-pack variable-resolution inputs without destructive resizing) originates from prior work showing improved efficiency and robustness; PaddleOCR-VL adopts this encoder style directly.
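To make the patch-and-pack idea concrete, here is a rough sketch of how crops of different sizes can be turned into patch sequences at native resolution and packed into fixed token budgets. The patch size (14), the token budget (700), the crop sizes, and the greedy first-fit strategy are all assumptions chosen for illustration; they are not taken from the PaddleOCR-VL report.

```python
import math

PATCH = 14        # assumed patch size in pixels
MAX_TOKENS = 700  # assumed per-sequence token budget


def num_patches(width, height, patch=PATCH):
    """Patch count at native resolution (edges padded up to a full patch)."""
    return math.ceil(width / patch) * math.ceil(height / patch)


def pack(images, max_tokens=MAX_TOKENS):
    """Greedy first-fit-decreasing packing of variable-length patch sequences.

    images: list of (name, width, height) tuples.
    Returns a list of bins, each recording its items and tokens used.
    """
    bins = []
    for name, w, h in sorted(images, key=lambda im: -num_patches(im[1], im[2])):
        n = num_patches(w, h)
        if n > max_tokens:
            raise ValueError(f"{name} needs {n} tokens, budget is {max_tokens}")
        for b in bins:
            if b["used"] + n <= max_tokens:
                b["items"].append(name)
                b["used"] += n
                break
        else:
            # No existing sequence has room, so start a new one.
            bins.append({"items": [name], "used": n})
    return bins


# Hypothetical element crops with very different aspect ratios.
crops = [
    ("table", 420, 280),
    ("formula", 210, 56),
    ("paragraph", 392, 140),
    ("title", 280, 42),
]
bins = pack(crops)
for b in bins:
    print(b["items"], b["used"])
```

No crop is resized or tiled: a wide, short formula simply contributes fewer patches than a large table. A real encoder would also need an attention mask so that patches from different images packed into the same sequence do not attend to each other.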
Benchmarks
PaddleOCR-VL achieves state-of-the-art results on OmniDocBench v1.5 and competitive or leading scores on v1.0, covering overall quality as well as sub-tasks (text edit distance, Formula-CDM, Table-TEDS/TEDS-S, and reading-order edit distance), with complementary strength on olmOCR-Bench and on in-house handwriting, table, formula, and chart evaluations.
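Several of these sub-task scores are edit-distance metrics, where lower is better. As a simplified illustration, the sketch below computes a Levenshtein distance normalized by the longer sequence. The normalization choice is an assumption; the benchmark's exact preprocessing and matching rules may differ.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions to turn a into b.

    Works on strings or on any sequences of comparable items.
    """
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1] by the longer sequence (assumed convention)."""
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))


# Text recognition: compare predicted and reference strings.
print(normalized_edit_distance("kitten", "sitting"))

# Reading order: compare predicted and reference sequences of region indices.
print(normalized_edit_distance([0, 2, 1, 3], [0, 1, 2, 3]))
```

The same function covers both the text and reading-order cases because it only relies on element equality, so a sequence of region indices is scored the same way as a string of characters.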


Key Takeaways
- The 0.9B-parameter PaddleOCR-VL integrates a NaViT-style dynamic-resolution encoder with ERNIE-4.5-0.3B for document parsing.
- It targets end-to-end extraction across text, tables, formulas, charts, and handwriting, with structured Markdown/JSON outputs.
- It claims SOTA performance on public document benchmarks with fast inference suitable for deployment.
- It supports 109 languages, including small scripts and complex page layouts.
This release is significant because it pairs a NaViT-style dynamic-resolution visual encoder with the lightweight ERNIE-4.5-0.3B decoder to deliver SOTA page-level document parsing and element-level recognition at practical inference cost. The two-stage PP-DocLayoutV2 → PaddleOCR-VL-0.9B design stabilizes reading order and preserves native typography cues, which matter for small scripts, formulas, charts, and handwriting across 109 languages. Structured Markdown/JSON outputs and optional vLLM/SGLang acceleration make the system operationally clean for production document intelligence.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.