Data processing method and related device
Multiple parallelism strategies and communication operations are introduced into the Transformer model: the weight matrices and input data are partitioned, and the order of communication is optimized. This addresses the shortage of memory and computing resources when processing ultra-long sequences and improves data processing efficiency.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- HUAWEI TECH CO LTD
- Filing Date
- 2025-07-11
- Publication Date
- 2026-06-18
AI Technical Summary
In the Transformer architecture, processing very long sequences demands large amounts of memory and computing resources, which lowers data processing efficiency.
Multiple parallelism strategies are introduced into the multi-head attention network, using reduce-scatter, all-to-all and all-gather operations. The weight matrices and input data are split, and the communication order is optimized in combination with the QKV computation to reduce communication overhead.
The method significantly reduces communication and computing resource requirements and improves the data processing efficiency of multi-head attention networks, particularly when training and running inference on large models with long sequences.
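To make the three communication operations named in the summary concrete, the following is a minimal single-process sketch in Python with NumPy. It is an illustration under stated assumptions, not the scheme claimed in the application: the helper functions `all_gather`, `reduce_scatter` and `all_to_all` are hypothetical simulations in which a Python list stands in for the per-device tensors, the particular split (input by columns, weight by rows) and the order of the operations are chosen only for demonstration, and only the Q projection is shown.

```python
import numpy as np

def all_gather(shards, axis):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = np.concatenate(shards, axis=axis)
    return [full.copy() for _ in shards]

def reduce_scatter(tensors, axis):
    """Sum the per-rank tensors, then hand rank r the r-th slice of the sum."""
    total = np.sum(np.stack(tensors), axis=0)
    return np.array_split(total, len(tensors), axis=axis)

def all_to_all(tensors, split_axis, concat_axis):
    """Rank src sends its dst-th slice (along split_axis) to rank dst;
    each rank concatenates what it receives along concat_axis."""
    n = len(tensors)
    pieces = [np.array_split(t, n, axis=split_axis) for t in tensors]
    return [
        np.concatenate([pieces[src][dst] for src in range(n)], axis=concat_axis)
        for dst in range(n)
    ]

# Illustrative sizes only (assumptions, not taken from the application).
rng = np.random.default_rng(0)
ranks, seq_len, d_model = 2, 8, 4
x = rng.normal(size=(seq_len, d_model))    # input sequence
w_q = rng.normal(size=(d_model, d_model))  # Q projection weight
q_ref = x @ w_q                            # unsplit reference result

# 1) Split the input by columns and the weight by rows. Each rank computes
#    a partial sum of Q; reduce-scatter sums them and leaves each rank with
#    the Q rows for its own chunk of the sequence.
x_cols = np.array_split(x, ranks, axis=1)
w_rows = np.array_split(w_q, ranks, axis=0)
partials = [xc @ wr for xc, wr in zip(x_cols, w_rows)]
q_seq = reduce_scatter(partials, axis=0)

# 2) All-to-all converts the sequence-sharded layout into a column-sharded
#    (per-head-group) layout: each rank now holds the full sequence for its
#    own slice of the feature dimension.
q_heads = all_to_all(q_seq, split_axis=1, concat_axis=0)

# 3) All-gather reassembles the full Q on every rank.
q_full = all_gather(q_heads, axis=1)

print(q_seq[0].shape, q_heads[0].shape, q_full[0].shape)
```

In this sketch each rank never holds more than a fraction of the intermediate result until the final all-gather, which is the general reason such splits ease memory pressure on long sequences. In a real multi-device setting the same roles would be played by a collective communication library rather than list manipulation.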