AI & Machine Learning

Tencent Hunyuan Releases HPC-Ops: A High-Performance LLM Inference Operator Library

By NextTech | January 28, 2026 | 5 min read


Tencent Hunyuan has open-sourced HPC-Ops, a production-grade operator library for large language model inference. HPC-Ops focuses on low-level CUDA kernels for core operators such as Attention, Grouped GEMM, and Fused MoE, and exposes them through a compact C++ and Python API for integration into existing inference stacks.

HPC-Ops already runs in large-scale internal services. In these deployments it delivers roughly a 30 percent queries-per-minute (QPM) improvement for Tencent-HY models and roughly a 17 percent improvement for DeepSeek models on mainstream inference GPUs. These gains are reported at the service level, so they reflect the cumulative effect of faster kernels inside a real inference pipeline.

Scope and design of HPC-Ops

HPC-Ops is a production-grade, high-performance, easy-to-use operator library for LLM inference, developed by the Tencent Hunyuan AI Infra team. The project does not try to replace serving frameworks. Instead, it provides kernels and clean APIs that can be called from systems that already handle scheduling, KV-cache management, batching, and transport.

The API is designed for seamless use inside popular inference frameworks such as vLLM and SGLang. This means a framework team can swap in HPC-Ops kernels behind its own abstractions without changing the external behavior of its servers.
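The swap-behind-an-abstraction pattern can be pictured with a small adapter sketch. The module name `hpc_ops` and the function `paged_attention_decode` below are hypothetical stand-ins, not HPC-Ops' real binding; the point is the pattern, not the names.

```python
# Illustrative sketch: a serving framework keeps one attention abstraction,
# and a faster external kernel library is dropped in behind it without
# changing the server's external behavior. `hpc_ops` is a hypothetical module.

class AttentionBackend:
    """Framework-side interface; callers never see which kernel runs."""
    def decode(self, query, kv_cache):
        raise NotImplementedError

class ReferenceBackend(AttentionBackend):
    """The framework's existing built-in path (a trivial stand-in here)."""
    def decode(self, query, kv_cache):
        return sum(q * k for q, k in zip(query, kv_cache))

class HpcOpsBackend(AttentionBackend):
    """Thin adapter around the external operator library."""
    def decode(self, query, kv_cache):
        import hpc_ops  # hypothetical Python binding, imported lazily
        return hpc_ops.paged_attention_decode(query, kv_cache)

def make_backend(name: str) -> AttentionBackend:
    # Selected once at server start-up; scheduling, batching, and KV-cache
    # management elsewhere in the stack are untouched.
    return HpcOpsBackend() if name == "hpc-ops" else ReferenceBackend()
```

Because the backend is chosen behind `make_backend`, enabling the faster kernels becomes a one-line configuration change rather than a rewrite of the serving loop.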

HPC-Ops uses C++ and CUDA, with CuTe and CUTLASS as building blocks. Kernels are written as relatively small examples that double as a modern CUDA tutorial.

Kernel performance characteristics

The project publishes maximum observed speedup numbers for each operator relative to established baselines. These are microbenchmarks, and the research team stresses that performance varies across shapes and workloads, but they show the optimization ceiling.

For Attention in bf16, compared with FlashInfer, FlashAttention-2, FlashAttention-3, and TensorRT-LLM, HPC-Ops reports up to 1.33x speedup in prefill and up to 2.22x in decode. For Attention in fp8, compared with FlashInfer, FlashAttention-3, and TensorRT-LLM, it reports up to 1.12x in prefill and up to 2.0x in decode.

For Fused MoE in fp8, compared with TensorRT-LLM and vLLM, the maximum observed speedup is up to 1.49x in prefill and 1.14x in decode. For GroupGEMM in fp8, compared with DeepGEMM, the reported gains are up to 1.1x in prefill and 1.88x in decode.

These numbers matter because decode is usually the latency bottleneck in autoregressive generation, where batch sizes shrink and memory traffic dominates. The fact that Attention and GroupGEMM show their largest relative gains in decode suggests that HPC-Ops targets the part of the pipeline most users actually notice.
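A rough Amdahl's-law estimate shows how a kernel-level speedup dilutes into a smaller step-level gain, which is also why service-level QPM improvements are lower than the microbenchmark ceilings. The 50 percent attention share below is an illustrative assumption, not a figure from the release.

```python
# Rough Amdahl's-law estimate of how a kernel speedup dilutes into an
# end-to-end gain. The fraction of step time spent in the kernel is
# illustrative, not measured.

def end_to_end_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Overall speedup when only `kernel_fraction` of step time is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)

# Suppose attention accounts for 50% of decode-step time and gets 2.22x faster:
print(round(end_to_end_speedup(0.5, 2.22), 3))  # 1.379 — about 1.38x per step
```

The same dilution happens again between the decode step and the full service (scheduling, tokenization, transport), which is consistent with the reported 17 to 30 percent service-level gains.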

Supported kernels and precision

The current release groups its functionality into three operator families:

  • Attention kernels cover both prefill and decode and include support for paged attention. Paged attention is the memory layout that frameworks like vLLM use to place key and value cache blocks in a paged structure, which improves memory reuse for long sequences.
  • Grouped GEMM is implemented as quantized GroupGEMM with fp8 weights. HPC-Ops supports block-wise and per-tensor scaling, so teams can trade off quantization granularity against parameter storage and calibration cost.
  • Fused MoE combines mixture-of-experts routing and expert computation in a single quantized operator. It also uses fp8 expert weights and supports block-wise and per-tensor scaling strategies.
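The paged KV-cache layout mentioned above can be sketched in a few lines. This toy version only shows the indexing idea, one per-sequence block table mapping logical token positions to physical blocks in a shared pool; real implementations manage GPU memory and fuse the lookup into the attention kernel.

```python
# Toy paged KV cache: fixed-size blocks in a shared pool, plus a per-sequence
# block table. Real systems (e.g. vLLM's PagedAttention) do this on the GPU.

BLOCK_SIZE = 4  # tokens per block (real systems often use 16)

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.pool = [[None] * BLOCK_SIZE for _ in range(num_blocks)]
        self.free = list(range(num_blocks))   # physical blocks not yet in use
        self.tables = {}                      # seq_id -> list of block ids
        self.lengths = {}                     # seq_id -> tokens cached so far

    def append(self, seq_id, kv):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:               # current block full: grab a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        block = self.tables[seq_id][n // BLOCK_SIZE]
        self.pool[block][n % BLOCK_SIZE] = kv
        self.lengths[seq_id] = n + 1

    def get(self, seq_id, pos):
        # Logical position -> (physical block, slot) via the block table.
        block = self.tables[seq_id][pos // BLOCK_SIZE]
        return self.pool[block][pos % BLOCK_SIZE]
```

Because blocks are allocated on demand and need not be contiguous, long and short sequences can share the pool without reserving worst-case contiguous buffers, which is what improves memory reuse for long sequences.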

Across these kernels, HPC-Ops provides native support for bf16 and fp8 data types. That matches the current production trend of moving inference toward lower-precision formats that preserve accuracy while reducing memory bandwidth and improving tensor-core utilization.
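The trade-off between per-tensor and block-wise scaling can be seen with a toy quantizer. This is a generic symmetric-quantization sketch, not HPC-Ops' actual fp8 path; only the e4m3 maximum magnitude of 448 is a real fp8 constant.

```python
# Toy illustration of scaling granularity: one scale for the whole tensor
# versus one scale per fixed-size block. A single outlier forces a large
# per-tensor scale and crushes the small values; block-wise scaling keeps
# precision inside the well-behaved block at the cost of storing more scales.

QMAX = 448.0  # max representable magnitude of fp8 e4m3

def quantize(values, scale):
    # Map into the representable grid, round, then dequantize back.
    return [round(v / scale * QMAX) / QMAX * scale for v in values]

def per_tensor(values):
    scale = max(abs(v) for v in values)
    return quantize(values, scale)

def block_wise(values, block=4):
    out = []
    for i in range(0, len(values), block):
        chunk = values[i:i + block]
        out.extend(quantize(chunk, max(abs(v) for v in chunk)))
    return out

def max_err(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

small = [0.01, 0.02, -0.015, 0.03]          # well-behaved block
w = small + [100.0, 90.0, -80.0, 95.0]      # block with large outliers

# Error on the small values: per-tensor scaling rounds them all to zero.
print(max_err(small, per_tensor(w)[:4]) > max_err(small, block_wise(w)[:4]))  # True
```

This is the granularity-versus-storage trade-off the release describes: block-wise scaling stores one scale per block instead of one per tensor, buying local dynamic range with extra parameter storage and calibration cost.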

Key Takeaways

  • Tencent Hunyuan open-sourced HPC-Ops as a production-grade operator library for LLM inference on NVIDIA SM90 GPUs, including the H20, with C++ and CUDA kernels built on CuTe and CUTLASS.
  • In production deployments, HPC-Ops reports about a 30 percent QPM gain for Tencent-HY models and about a 17 percent QPM gain for DeepSeek models on mainstream inference GPUs.
  • Operator microbenchmarks show maximum speedups of up to 2.22x for bf16 Attention decode, up to 2.0x for fp8 Attention decode, up to 1.49x for fp8 Fused MoE prefill, and up to 1.88x for fp8 GroupGEMM decode, compared with strong baselines such as FlashInfer, FlashAttention, TensorRT-LLM, and DeepGEMM.
  • The library covers three operator families: Attention with paged-attention support, quantized GroupGEMM with fp8 weights, and quantized Fused MoE with fp8 expert weights, each with both block-wise and per-tensor scaling, plus native bf16 and fp8 precision support.
  • HPC-Ops is designed as an operator layer that integrates into existing inference frameworks such as vLLM and SGLang, and the roadmap targets sparse attention for long-context LLMs, extended quantization including 4-bit and 8-bit strategies, and kernels that better overlap computation with multi-GPU communication.

Check out the repo for code and documentation.


Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.
