Data processing method and related device

The method introduces multiple parallel strategies and communication operations into the Transformer model, splits the weight matrix and input data, and optimizes the communication order. This addresses the shortage of memory and computing resources when processing ultra-long sequences and improves data processing efficiency.

WO2026012454A9 · Publication Date: 2026-06-18 · HUAWEI TECH CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
HUAWEI TECH CO LTD
Filing Date
2025-07-11
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

In the Transformer architecture, processing very long sequences requires a large amount of memory and computing resources, resulting in low data processing efficiency.

Method used

Multiple parallel strategies are introduced into the multi-head attention network, using reduce-scatter, all-to-all and all-gather communication operations. The weight matrix and input data are split, and the communication operations are ordered around the QKV computation to reduce communication overhead.

Benefits of technology

It significantly reduces communication and computing resource requirements, and improves the data processing efficiency of multi-head attention networks, especially in large model and long sequence training and inference scenarios.

Generated by Eureka AI based on patent content.

Abstract

The present application provides a data processing method, which can be applied to a compute cluster deployed with a multi-head attention network. A first compute card performs an allgather operation on first data to obtain second data, the first data being a sub-matrix of a feature matrix of input data of the multi-head attention network; the first compute card performs QKV computation on the second data and a first sub-model, the first sub-model being a sub-matrix obtained by performing row partitioning and column partitioning on a weight matrix of the multi-head attention network; and the first compute card performs a reducescatter operation on the result of the QKV computation to obtain third data, the third data being used for acquiring a processing result of the first data. In the present solution, in terms of model parallelism, two communication domains are set, the weight matrix of the multi-head attention network is partitioned along two dimensions, each training node only stores a sub-matrix, and the sub-matrix occupies less GPU memory, thereby significantly reducing communication volume.
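
The three steps in the abstract (allgather, QKV computation against a two-dimensionally partitioned weight, reducescatter) can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not the claimed implementation: the 2x2 card grid, the choice of which communication domain runs each collective, the sharding of the input along the sequence dimension, and all names (`first`, `w_sub`, `second`, `partial`, `third`) are hypothetical. A single projection weight `W` stands in for the Q, K and V projections.

```python
# Single-process simulation: nested lists stand in for per-card buffers.
# Grid size, shapes and sharding layout are assumptions made for illustration.
R, C = 2, 2                    # hypothetical grid: R row groups x C column groups
SEQ, D_IN, D_OUT = 4, 4, 4     # sequence length, input features, output features

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def block(m, r0, r1, c0, c1):
    """Return the sub-matrix m[r0:r1, c0:c1]."""
    return [row[c0:c1] for row in m[r0:r1]]

def all_gather(shards):
    """Each participant receives the row-wise concatenation of all shards."""
    full = [row for shard in shards for row in shard]
    return [full for _ in shards]

def reduce_scatter(partials):
    """Sum equal-shaped partials elementwise; participant i keeps row chunk i."""
    n = len(partials)
    total = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
    chunk = len(total) // n
    return [total[i * chunk:(i + 1) * chunk] for i in range(n)]

# Integer test data so the comparison with the reference is exact.
X = [[i * D_IN + j for j in range(D_IN)] for i in range(SEQ)]
W = [[(i + 2 * j) % 5 - 2 for j in range(D_OUT)] for i in range(D_IN)]

s_in, fb, ob, s_out = SEQ // C, D_IN // R, D_OUT // C, SEQ // R

# Card (r, c) stores one input sub-matrix and one weight sub-matrix
# (the weight is partitioned by rows AND by columns).
first = [[block(X, c * s_in, (c + 1) * s_in, r * fb, (r + 1) * fb)
          for c in range(C)] for r in range(R)]
w_sub = [[block(W, r * fb, (r + 1) * fb, c * ob, (c + 1) * ob)
          for c in range(C)] for r in range(R)]

# Step 1: allgather inside the first (assumed) communication domain:
# cards sharing r rebuild the full sequence for their feature block.
second = [all_gather([first[r][c] for c in range(C)]) for r in range(R)]

# Step 2: local projection; each card produces a partial sum over feature block r.
partial = [[matmul(second[r][c], w_sub[r][c]) for c in range(C)] for r in range(R)]

# Step 3: reducescatter inside the second (assumed) communication domain:
# cards sharing c sum their partials and each keeps one sequence chunk.
third = [[None] * C for _ in range(R)]
for c in range(C):
    scattered = reduce_scatter([partial[r][c] for r in range(R)])
    for r in range(R):
        third[r][c] = scattered[r]

# Reference: the unpartitioned projection X @ W.
ref = matmul(X, W)
ok = all(third[r][c] == block(ref, r * s_out, (r + 1) * s_out, c * ob, (c + 1) * ob)
         for r in range(R) for c in range(C))
print("partitioned result matches X @ W:", ok)
```

In this layout each card holds only 1/(R·C) of the weight matrix, which is the memory saving the abstract attributes to partitioning along two dimensions; the sketch does not model communication volume or timing.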