AI & Machine Learning

Meet oLLM: A Lightweight Python Library That Brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload, No Quantization Required

By NextTech · September 29, 2025 · 4 min read


oLLM is a lightweight Python library built on top of Hugging Face Transformers and PyTorch that runs large-context Transformers on NVIDIA GPUs by aggressively offloading weights and KV cache to fast local SSDs. The project targets offline, single-GPU workloads and explicitly avoids quantization, using FP16/BF16 weights with FlashAttention-2 and disk-backed KV caching to keep VRAM within 8–10 GB while handling up to ~100K tokens of context.

What's new?

(1) KV-cache reads/writes that bypass mmap to reduce host RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) Llama-3 FlashAttention-2 for stability; and (4) GPT-OSS memory reductions via "flash-attention-like" kernels and chunked MLP. The table published by the maintainer reports end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):

  • Qwen3-Next-80B (bf16, 160 GB weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; noted throughput "≈ 1 tok/2 s".
  • GPT-OSS-20B (packed bf16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.
  • Llama-3.1-8B (fp16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.

How it works

oLLM streams layer weights directly from SSD into the GPU, offloads the attention KV cache to SSD, and optionally offloads layers to CPU. It uses FlashAttention-2 with online softmax so the full attention matrix is never materialized, and chunks large MLP projections to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the oLLM project emphasizes NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
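
To make those two tricks concrete, here is a minimal, illustrative PyTorch sketch of the general pattern. This is not oLLM's actual code; the file layout, the build_layer helper, and the chunk size are assumptions:

    import torch

    # Illustrative sketch of SSD-streamed, layer-by-layer inference; `layer_files`
    # and `build_layer` are hypothetical stand-ins for however a runtime shards
    # and rebuilds Transformer blocks from disk.
    def run_layers_streamed(hidden, layer_files, build_layer, device="cuda"):
        for path in layer_files:
            state = torch.load(path, map_location="cpu")   # weights stay on SSD until needed
            layer = build_layer().to(device, dtype=torch.bfloat16)
            layer.load_state_dict(state)                   # copies CPU weights into VRAM
            with torch.no_grad():
                hidden = layer(hidden)                     # only ~one layer resident at a time
            del layer, state
            torch.cuda.empty_cache()
        return hidden

    # Chunked MLP: run a large projection over the tokens in slices so the wide
    # intermediate activation is never materialized for the whole sequence at once.
    def chunked_mlp(x, mlp, chunk_size=4096):
        outs = [mlp(x[i:i + chunk_size]) for i in range(0, x.shape[0], chunk_size)]
        return torch.cat(outs, dim=0)

The layer loop trades VRAM for repeated SSD reads on every forward pass, which is exactly why storage bandwidth, not GPU memory, becomes the bottleneck.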

Supported models and GPUs

Out of the box, the examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper; Qwen3-Next requires a dev build of Transformers (≥ 4.57.0.dev). Notably, Qwen3-Next-80B is a sparse MoE (80B total, ~3B active) that vendors typically position for multi-A100/H100 deployments; oLLM's claim is that you can execute it offline on a single consumer GPU by paying the SSD penalty and accepting low throughput. This stands in contrast to the vLLM docs, which suggest multi-GPU servers for the same model family.

Installation and minimal usage

The project is MIT-licensed and available on PyPI (pip install ollm), with an additional kvikio-cu{cuda_version} dependency for high-speed disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows Inference(...).DiskCache(...) wiring and generate(...) with a streaming text callback. (PyPI currently lists 0.4.1; the README references 0.4.2 changes.)
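
As rough orientation, the wiring looks something like the sketch below. The model identifier, parameter names, and callback signature are approximations of the README's description rather than a verified API, so check the repo before relying on them. (For Qwen3-Next, the Transformers dev build can be installed with, e.g., pip install git+https://github.com/huggingface/transformers.)

    from ollm import Inference

    # Approximate wiring per the README description; names and parameters are
    # illustrative, not confirmed against the actual oLLM API.
    llm = Inference("llama3-8B-chat")        # hypothetical model id
    llm.DiskCache(cache_dir="./kv_cache")    # spill the KV cache to an NVMe SSD

    def on_text(chunk: str):
        print(chunk, end="", flush=True)     # streaming text callback

    llm.generate("Summarize this 90K-token incident log: ...", callback=on_text)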

Performance expectations and trade-offs

  • Throughput: The maintainer reports ~0.5 tok/s for Qwen3-Next-80B at 50K context on an RTX 3060 Ti, usable for batch/offline analytics but not for interactive chat. SSD latency dominates.
  • Storage pressure: Long contexts require very large KV caches; oLLM writes these to SSD to keep VRAM flat (a back-of-envelope sizing sketch follows this list). This mirrors broader industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach is still storage-bound and workload-specific.
  • Hardware reality check: Running Qwen3-Next-80B "on consumer hardware" is feasible with oLLM's disk-centric design, but typical high-throughput inference for this model still expects multi-GPU servers. Treat oLLM as an execution path for large-context, offline passes rather than a drop-in replacement for production serving stacks like vLLM/TGI.
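
For a sense of scale, here is a simple estimator; the model dimensions are illustrative of an 8B-class GQA model rather than read from oLLM's configs, and the project's actual on-disk layout (and thus the footprints in the table above) may differ:

    # Rough KV-cache sizing: K and V per layer, fp16/bf16 = 2 bytes per element.
    # Dimensions below are illustrative, not oLLM's actual configuration.
    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
        return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

    size_gb = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                             n_tokens=100_000) / 1e9
    print(f"KV cache at 100K tokens: ~{size_gb:.1f} GB")       # ~13.1 GB

    # Generation-time arithmetic at the reported ~0.5 tok/s:
    print(f"1,000-token output: ~{1000 / 0.5 / 60:.0f} min")   # ~33 minutes

At these rates, a long-context pass is an overnight batch job rather than a chat session, which matches the project's own positioning.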

Bottom line

oLLM pushes a clear design point: keep precision high, push memory to SSD, and make ultra-long contexts viable on a single 8 GB NVIDIA GPU. It won't match data-center throughput, but for offline document/log analysis, compliance review, or large-context summarization, it's a pragmatic way to run 8B–20B models comfortably and even step up to MoE-80B if you can tolerate ~100–200 GB of fast local storage and sub-1 tok/s generation.


Check out the GitHub repo here.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence media platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable to a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
