New Open Source Qwen3&#x2d;Next Models Preview Hybrid MoE Architecture Delivering Improved Accuracy and

Anu Srivastava

As AI models grow larger and process longer sequences of text, efficiency becomes just as important as scale. To showcase what’s next, Alibaba released two new…

NVIDIA

•

Anu Srivastava

•4 min read•intermediate•

--

•View Original

Hugging FaceLessTransformer

Overview

The article discusses the release of two new open-source models, Qwen3-Next 80B-A3B-Thinking and Qwen3-Next 80B-A3B-Instruct, which utilize a hybrid Mixture of Experts (MoE) architecture to enhance efficiency and accuracy in processing long sequences of text. It highlights the models' capabilities, deployment options, and the significance of NVIDIA's technology in optimizing their performance.

What You'll Learn

1

How to deploy Qwen3-Next models using SGLang framework

2

How to run Qwen3-Next models with vLLM serving framework

3

How to utilize NVIDIA NIM for production-ready deployment of AI models

4

Why the hybrid MoE architecture improves model efficiency

Prerequisites & Requirements

Understanding of AI model architectures and deployment frameworks
Familiarity with NVIDIA NIM and SGLang(optional)

Key Questions Answered

What is the significance of the hybrid Mixture of Experts architecture in Qwen3-Next models?

The hybrid Mixture of Experts architecture allows Qwen3-Next models to activate only a subset of their 80 billion parameters, enhancing efficiency while maintaining high accuracy. This architecture enables the models to process long input sequences effectively, with only 3 billion parameters activated per token, optimizing resource usage and performance.

How can developers deploy Qwen3-Next models?

Developers can deploy Qwen3-Next models using frameworks like SGLang and vLLM, or through NVIDIA NIM for production-ready deployments. Specific commands are provided for each framework, allowing easy integration and testing of the models in various environments.

What are the performance benefits of using NVIDIA's Blackwell NVLink with Qwen3-Next models?

NVIDIA's Blackwell NVLink provides 1.8 TB/s of direct GPU-to-GPU bandwidth, crucial for minimizing latency during expert routing in MoE models. This high-speed interconnect enhances inference speed and token throughput, making it essential for efficient AI processing.

What are the architectural features of the Qwen3-Next models?

The Qwen3-Next models consist of 48 layers, with a combination of GQA attention and linear attention mechanisms. This design allows for efficient processing of long context lengths, with 10 experts activated per token, improving reasoning capabilities and overall model performance.

Key Statistics & Figures

Total parameters in Qwen3-Next models

80 billion

Each model activates only 3 billion parameters per token due to its sparse MoE structure.

Bandwidth provided by Blackwell's NVLink

1.8 TB/s

This bandwidth is essential for minimizing latency during expert routing in the MoE architecture.

Number of experts in the MoE module

512 routed experts and 1 shared expert

The model activates 10 experts per token for efficient processing.

Technologies & Tools

Some links below are affiliate links. We may earn a commission if you make a purchase.

Backend

Nvidia Nim

Used for deploying Qwen3-Next models in production environments.

Backend

Sglang

Framework for serving Qwen3-Next models.

Backend

Vllm

Another framework for serving Qwen3-Next models.

Hardware

Blackwell Nvlink

Provides high-speed interconnect for GPU communication.

Key Actionable Insights

1
Leverage the hybrid MoE architecture to optimize AI model performance in your applications.
By using the Qwen3-Next models, developers can achieve significant efficiency gains while maintaining high accuracy, especially for applications requiring long context processing.

2
Utilize NVIDIA NIM for deploying AI models in production environments.
NVIDIA NIM provides a streamlined way to deploy and manage AI models, ensuring that developers can focus on building applications without worrying about underlying infrastructure.

3
Experiment with different deployment frameworks like SGLang and vLLM to find the best fit for your needs.
Each framework offers unique features and optimizations, allowing developers to tailor their deployment strategy based on specific application requirements.

Common Pitfalls

1

Failing to optimize inter-GPU communication can lead to increased latency.

Without utilizing high-speed connections like NVLink, the performance of MoE models can suffer significantly, impacting overall inference speed.

Related Concepts

Mixture Of Experts Architecture

Long Context Processing

AI Model Deployment Frameworks