Data processing method, related apparatus and medium

By allocating the text sequence and the partitioned image sub-sequences across the processing units for attention computation, the method addresses the low efficiency of attention computation when text and images coexist and achieves more efficient multimodal data processing.

CN122197967A · Pending · Publication Date: 2026-06-12 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2024-12-11
Publication Date
2026-06-12

Figure CN122197967A_ABST

Abstract

The method comprises: for each processing unit, generating a text sub-sequence based on a text sequence, generating an image sub-sequence based on an image sequence, and generating a combined sub-sequence; obtaining, through the processing unit, a first attention sub-result and a second attention sub-result based on the target query corresponding to the combined sub-sequence together with, respectively, the target key-value pair corresponding to the text sub-sequence and the target key-value pair corresponding to the image sub-sequence; obtaining the image sub-sequences of other processing units, and obtaining a third attention sub-result based on the target query corresponding to the combined sub-sequence and the target key-value pairs corresponding to the obtained image sub-sequences; and generating an attention result based on the first attention sub-result, the second attention sub-result, and the third attention sub-result. The embodiments of the present disclosure can improve the overall efficiency of model operation when the application data contains both text and images, and can be applied to the fields of text-to-image and text-to-video generation.

Description

Technical Field

[0001] This disclosure relates to the field of chips, and in particular to a data processing method, related apparatus and medium.

Background Technology

[0002] With the rapid development of AI-generated content, various text-to-image and text-to-video models have emerged. These models share the same network backbone: a generative model that combines a diffusion model with a Transformer. This model relies heavily on attention computation, so improving the efficiency of attention computation has become a key concern. Existing techniques distribute attention computation across multiple processing units (such as GPUs), but they are suited only to single-modal inputs. When the target application data contains both text and images, such techniques can only mechanically merge the text input and the image input into a single input for attention computation. However, the two inputs also take part in other computations within the model besides attention, and those other computations differ between the two inputs. They therefore conflict with the single merged text-and-image input, which reduces the overall efficiency of the model.

Summary of the Invention

[0003] This disclosure provides a data processing method, related apparatus, and medium that can improve the overall efficiency of model computation when the application data simultaneously contains text and images.

[0004] According to one aspect of this disclosure, a data processing method is provided, applied to a processing device comprising a group of processing units, the group of processing units comprising a first number of processing units, the data processing method comprising:

[0005] Acquire target application data, wherein the target application data comprises a text sequence and an image sequence;

[0006] For each processing unit, a text subsequence is generated based on the text sequence for processing by the processing unit, an image subsequence is generated based on the image sequence for processing by the processing unit, and a combined subsequence is generated based on the text subsequence and the image subsequence;

[0007] Through the processing unit, attention calculation is performed based on the target query corresponding to the combined subsequence of the processing unit and, respectively, the target key-value pair corresponding to the text subsequence of the processing unit and the target key-value pair corresponding to the image subsequence of the processing unit, to obtain a first attention sub-result and a second attention sub-result;

[0008] The processing unit obtains image sub-sequences from other processing units among the first number of processing units, and performs attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pair corresponding to the obtained image sub-sequence to obtain a third attention sub-result.

[0009] An attention result is generated based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.
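
The steps in [0005] to [0009] can be illustrated with a minimal single-unit sketch. It is not the claimed implementation. The disclosure states only that the three sub-results are combined into an attention result; the sketch assumes one common way of doing so exactly, namely keeping the per-row score maximum and exponential sum of each key-value block and rescaling when the blocks are merged. The names block_attention and merge_blocks, the toy data, and the use of the raw rows as queries, keys, and values (no projection matrices) are all assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_attention(q, k, v, scale):
    """Attention of queries q against ONE key/value block.

    Returns, per query row, (max score, sum of exponentials, unnormalised
    weighted values) so that blocks can be merged later.
    """
    out = []
    for qi in q:
        scores = [dot(qi, kj) * scale for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        acc = [sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))]
        out.append((m, sum(w), acc))
    return out

def merge_blocks(blocks):
    """Merge per-block partial results into exact softmax attention (assumed scheme)."""
    merged = []
    for parts in zip(*blocks):  # one (m, s, acc) triple per block, same query row
        m_all = max(m for m, _, _ in parts)
        denom = sum(s * math.exp(m - m_all) for m, s, _ in parts)
        width = len(parts[0][2])
        numer = [sum(acc[c] * math.exp(m - m_all) for m, _, acc in parts)
                 for c in range(width)]
        merged.append([x / denom for x in numer])
    return merged

# Hypothetical toy data held by one processing unit (N = 2).
text_rows = [[1.0, 0.0], [0.0, 1.0]]          # local text subsequence
local_image_rows = [[0.5, 0.5], [1.0, 1.0]]   # local image subsequence
remote_image_rows = [[0.2, 0.8], [0.9, 0.1]]  # obtained from another unit
queries = text_rows + local_image_rows        # combined subsequence
scale = 1.0 / math.sqrt(2)

first = block_attention(queries, text_rows, text_rows, scale)
second = block_attention(queries, local_image_rows, local_image_rows, scale)
third = block_attention(queries, remote_image_rows, remote_image_rows, scale)
attention_result = merge_blocks([first, second, third])

# Reference: attention over all keys and values at once.
all_rows = text_rows + local_image_rows + remote_image_rows
reference = merge_blocks([block_attention(queries, all_rows, all_rows, scale)])
```

Under these assumptions the merged result matches attention computed over the concatenated keys and values, which is why the text block never has to leave the unit that holds it.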

[0010] According to one aspect of this disclosure, a data processing apparatus is provided, applied to a processing device comprising a group of processing units, the group of processing units comprising a first number of processing units, the data processing apparatus comprising:

[0011] A first acquisition unit is used to acquire target application data, wherein the target application data has a text sequence and an image sequence.

[0012] The first generation unit is configured to, for each processing unit, generate a text subsequence for processing by the processing unit based on the text sequence, generate an image subsequence for processing by the processing unit based on the image sequence, and generate a combined subsequence based on the text subsequence and the image subsequence;

[0013] An attention calculation unit is used to perform, through the processing unit, attention calculation based on the target query corresponding to the combined subsequence of the processing unit and, respectively, the target key-value pair corresponding to the text subsequence of the processing unit and the target key-value pair corresponding to the image subsequence of the processing unit, to obtain a first attention sub-result and a second attention sub-result.

[0014] The second acquisition unit is used to acquire image sub-sequences of other processing units among the first number of processing units through the processing unit, and to perform attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pair corresponding to the acquired image sub-sequence to obtain a third attention sub-result;

[0015] The second generation unit is used to generate an attention result based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

[0016] Optionally, the processing device includes a second number of processing unit groups, wherein the product of the first number and the second number is a third number;

[0017] The first generation unit is specifically used for:

[0018] The image sequence is divided into a third number of image sub-sequences;

[0019] The text sequence is assigned to each processing unit in the processing device, and the third number of image sub-sequences are respectively assigned to each processing unit in the processing device;
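
The assignment in [0018] and [0019] can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the function name assign_sequences, the (group, index) keys, the row-wise equal split of the image sequence, and the order in which chunks are handed out are all hypothetical.

```python
def assign_sequences(text_rows, image_rows, first_number, second_number):
    """Give every unit the whole text sequence and one image chunk.

    first_number: units per processing unit group.
    second_number: number of processing unit groups.
    """
    third_number = first_number * second_number
    if len(image_rows) % third_number:
        raise ValueError("image sequence length must be divisible by the unit count")
    chunk = len(image_rows) // third_number
    assignment = {}
    for group in range(second_number):
        for index in range(first_number):
            unit = group * first_number + index  # assumed ordering of units
            image_sub = image_rows[unit * chunk:(unit + 1) * chunk]
            assignment[(group, index)] = (list(text_rows), image_sub)
    return assignment

# Hypothetical example: 2 groups of 3 units, 2 text rows, 12 image rows.
text = [["t0"], ["t1"]]
image = [[i] for i in range(12)]
assignment = assign_sequences(text, image, first_number=3, second_number=2)
```

Each unit receives a full copy of the text rows and a distinct slice of the image rows, so only image data ever needs to be exchanged between units later.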

[0020] Through mutual communication between processing units with the same sequence number in different processing unit groups, the text sequence and the image subsequence assigned to the processing units with the same sequence number in different processing unit groups are converted into the text subsequence and the image subsequence.

[0021] Optionally, the text sequence is an M1×N matrix, the image subsequence is an M2×N matrix, and the second number is L;

[0022] The first generation unit is further configured to:

[0023] The text sequence and the image sub-sequence assigned to the processing unit are combined into a (M1+M2)×N combination matrix;

[0024] The columns of the combined matrix are divided according to a second number to obtain a second number of (M1+M2)×(N / L) submatrices;

[0025] Among the second number of submatrices, the first submatrix corresponding to the processing unit group of the processing unit is retained, and the second submatrices corresponding to the other processing unit groups are sent to the processing units with the same sequence number in those other processing unit groups;

[0026] Receive the third submatrices sent by the processing units with the same sequence number in the other processing unit groups, and integrate the retained first submatrix and the received third submatrices into an integrated matrix of (M1+M2)L×(N / L).

[0027] The portion corresponding to the text sequence in the integrated matrix is determined as the text subsequence, and the portion corresponding to the image subsequence in the integrated matrix is determined as the image subsequence.
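
The exchange in [0023] to [0026] can be simulated in a single process as a sketch. The function names split_columns and exchange_columns, the toy matrices, and the order in which received blocks are stacked (by sending group) are assumptions; the disclosure specifies only the resulting (M1+M2)L×(N / L) shape.

```python
def split_columns(matrix, parts):
    """Split a row-major matrix into `parts` equal column blocks."""
    n_cols = len(matrix[0])
    if n_cols % parts:
        raise ValueError("column count must be divisible by the number of groups")
    width = n_cols // parts
    return [[row[p * width:(p + 1) * width] for row in matrix] for p in range(parts)]

def exchange_columns(combined_per_group):
    """Simulate the exchange between same-numbered units of L groups.

    combined_per_group[g] is the (M1+M2) x N combined matrix held by the
    unit in group g. Returns, for each group, the integrated matrix of
    shape (M1+M2)*L x (N/L).
    """
    groups = len(combined_per_group)
    blocks = [split_columns(m, groups) for m in combined_per_group]
    integrated = []
    for g in range(groups):
        stacked = []
        for sender in range(groups):          # block g of every sender goes to group g
            stacked.extend(blocks[sender][g])
        integrated.append(stacked)
    return integrated

# Hypothetical example: L = 2 groups, M1 + M2 = 3 rows, N = 4 columns.
group0 = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
group1 = [[100, 101, 102, 103], [110, 111, 112, 113], [120, 121, 122, 123]]
integrated = exchange_columns([group0, group1])
```

In this example each unit ends up with 6 rows of 2 columns: every row held by the same-numbered units, but only the column slice assigned to its own group.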

[0028] Optionally, each processing unit contains hc attention heads;

[0029] The first generation unit is further configured to: divide the columns of the combined matrix according to the second number such that the number of columns of each (M1+M2)×(N / L) submatrix is an integer multiple of hc, so that each (M1+M2)×(N / L) submatrix can be evenly distributed across the hc attention heads for execution.

[0030] Optionally, the target key-value pair includes a target key and a target value;

[0031] The attention calculation unit is specifically used for:

[0032] The processing unit performs attention calculations based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the text subsequence of the processing unit, and obtains the first attention sub-result.

[0033] The processing unit performs attention calculations based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the image subsequence of the processing unit, and obtains the second attention sub-result.

[0034] Optionally, the combined subsequence is an M×N original combined matrix, and the text subsequence is an M1×N original text matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix;

[0035] The data processing device further includes:

[0036] The third generation unit is used to generate the M×N target query matrix based on the original combination matrix of M×N and the query transformation matrix of N×N.

[0037] The fourth generation unit is used to generate the first target key matrix of M1×N based on the original text matrix of M1×N and the key transformation matrix of N×N.

[0038] The fifth generation unit is used to generate the first target value matrix of M1×N based on the original text matrix of M1×N and the value transformation matrix of N×N.

[0039] Optionally, the combined subsequence is an M×N original combined matrix, and the image subsequence is an M2×N original image matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix;

[0040] The data processing device further includes:

[0041] The sixth generation unit is used to generate the M×N target query matrix based on the original M×N combination matrix and the N×N query transformation matrix;

[0042] The seventh generation unit is used to generate the second target key matrix of M2×N based on the original image matrix of M2×N and the key transformation matrix of N×N;

[0043] The eighth generation unit is used to generate the second target value matrix of M2×N based on the original image matrix of M2×N and the value transformation matrix of N×N.
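
The projections described in [0034] to [0043] can be written compactly as below. The symbols X (original combined matrix), T (original text matrix), I (original image matrix), and W_Q, W_K, W_V (query, key, and value transformation matrices) are notation introduced here for illustration. The relation M = M1 + M2 and the use of the same key and value transformation matrices for text and image are assumptions; the disclosure does not state either explicitly.

```latex
\begin{aligned}
Q   &= X\,W_Q, & X &\in \mathbb{R}^{M \times N},\quad W_Q, W_K, W_V \in \mathbb{R}^{N \times N} \\
K_1 &= T\,W_K, \qquad V_1 = T\,W_V, & T &\in \mathbb{R}^{M_1 \times N} \\
K_2 &= I\,W_K, \qquad V_2 = I\,W_V, & I &\in \mathbb{R}^{M_2 \times N}
\end{aligned}
```

With these shapes, Q is M×N, K_1 and V_1 are M1×N, and K_2 and V_2 are M2×N, matching the dimensions given for the target query, key, and value matrices.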

[0044] Optionally, the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix;

[0045] The attention calculation unit is also specifically used for:

[0046] Transpose the first target key matrix to obtain the first transpose matrix of N×M1;

[0047] Multiply the target query matrix and the first transpose matrix to obtain the first product matrix of M×M1;

[0048] Based on the first product matrix and the first target value matrix, an M×N first attention matrix is generated as the first attention sub-result.

[0049] Optionally, each processing unit contains hc attention heads, each attention head having a size of hs, and N being the product of hc and hs;

[0050] The attention calculation unit is also specifically used for:

[0051] Based on the size hs of the attention head, the first product matrix is scaled to obtain the first scaled matrix;

[0052] The first scaled matrix is exponentially normalized to obtain the first exponentially normalized matrix;

[0053] Based on the first exponentially normalized matrix and the first target value matrix, the first attention matrix is generated as the first attention sub-result.
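
The sequence in [0046] to [0053] (transpose, multiply, scale, exponentially normalize, multiply by the value matrix) can be sketched as follows. Two points are assumptions rather than statements from the disclosure: that scaling "based on hs" means dividing by the square root of hs, and that exponential normalization means a row-wise softmax. The function names and toy matrices are hypothetical.

```python
import math

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def softmax_rows(m):
    """Row-wise exponential normalization, shifted by the row maximum for stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def first_attention(query, key, value, head_size):
    key_t = transpose(key)                                  # N x M1
    product = matmul(query, key_t)                          # M x M1
    scaled = [[x / math.sqrt(head_size) for x in row] for row in product]
    weights = softmax_rows(scaled)                          # M x M1
    return matmul(weights, value)                           # M x N

# Hypothetical example: M = 3, M1 = 2, N = 2, hs = 2.
query = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
key = [[1.0, 0.0], [1.0, 0.0]]      # identical keys give equal weights
value = [[2.0, 4.0], [4.0, 8.0]]
result = first_attention(query, key, value, head_size=2)
```

Because the two keys are identical, every query weights the two value rows equally and each output row is their mean. The same function applies to the second attention sub-result when the image key and value matrices are passed in.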

[0054] Optionally, the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and the target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix;

[0055] The attention calculation unit is also specifically used for:

[0056] Transpose the second target key matrix to obtain an N×M2 second transpose matrix;

[0057] Multiply the target query matrix and the second transpose matrix to obtain the second product matrix of M×M2;

[0058] Based on the second product matrix and the second target value matrix, an M×N second attention matrix is generated as the second attention sub-result.

[0059] Optionally, each processing unit contains hc attention heads, each attention head having a size of hs, and N being the product of hc and hs;

[0060] The attention calculation unit is also specifically used for:

[0061] Based on the size hs of the attention head, the second product matrix is scaled to obtain the second scaled matrix;

[0062] The second scaled matrix is exponentially normalized to obtain the second exponentially normalized matrix;

[0063] Based on the second exponentially normalized matrix and the second target value matrix, the second attention matrix is generated as the second attention sub-result.

[0064] Optionally, the data processing apparatus further includes:

[0065] The first sending unit is used to send the image subsequence corresponding to the j-th processing unit to the (j+1)-th processing unit in the processing unit group through the j-th processing unit in the first cycle.

[0066] The second sending unit is used to send, through the j-th processing unit in the i-th cycle, the image subsequence received by the j-th processing unit in the (i-1)-th cycle to the (j+1)-th processing unit in the processing unit group, where i is an integer greater than 1 and less than the first number.

[0067] Optionally, the second acquisition unit is further configured to:

[0068] In the first to the i-th cycles, the j-th processing unit receives the image subsequence of the (j-1)-th processing unit in the group of processing units.
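
The cyclic forwarding in [0065] to [0068] can be simulated as below. This is a sketch assuming that the units in a group form a ring, so that the last unit sends to the first; the wrap-around and the function name ring_exchange are assumptions, as the disclosure speaks only of the j-th unit sending to the (j+1)-th.

```python
def ring_exchange(image_subsequences):
    """Simulate P - 1 cycles of forwarding among P units.

    Returns, for each unit, the blocks it has held in order: its own block
    first, then the block received in cycle 1, cycle 2, and so on.
    """
    count = len(image_subsequences)
    in_flight = list(image_subsequences)      # what each unit sends in the next cycle
    seen = [[block] for block in image_subsequences]
    for _cycle in range(1, count):            # cycles 1 .. P - 1
        # Unit j receives whatever unit j - 1 sends in this cycle.
        received = [in_flight[(j - 1) % count] for j in range(count)]
        for j, block in enumerate(received):
            seen[j].append(block)
        in_flight = received                  # forwarded unchanged in the next cycle
    return seen

# Hypothetical example with 4 units.
blocks = ["img0", "img1", "img2", "img3"]
seen = ring_exchange(blocks)
```

After P - 1 cycles every unit has held every image subsequence exactly once, so each unit can compute its third attention sub-result against all remote image blocks.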

[0069] Optionally, the processing device includes a second number of processing unit groups, wherein the first attention sub-result, the second attention sub-result, and the third attention sub-result are respectively an M×N first attention matrix, a second attention matrix, and a third attention matrix;

[0070] The second generation unit is specifically used for:

[0071] The first attention matrix, the second attention matrix, and the third attention matrix are superimposed to obtain the M×N superimposed attention matrix corresponding to the processing unit;

[0072] The attention result is generated based on the superimposed attention matrices corresponding to the processing units with the same sequence number in the second number of processing unit groups.

[0073] Optionally, the second number is L;

[0074] The second generation unit is further used for:

[0075] The rows of the superimposed attention matrix are divided according to a second number to obtain a second number of (M / L)×N attention submatrices;

[0076] Among the second number of attention submatrices, the first attention submatrix corresponding to the processing unit group of the processing unit is retained, and the second attention submatrices corresponding to the other processing unit groups are sent to the processing units with the same sequence number in those other processing unit groups;

[0077] Receive the third attention submatrices sent by the processing units with the same sequence number in the other processing unit groups, and integrate the retained first attention submatrix and the received third attention submatrices into a fourth attention submatrix of (M / L)×NL.

[0078] The fourth attention submatrix is determined as the attention result.
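
The row split and integration in [0075] to [0077] can be sketched as a single-process simulation. The function names split_rows and exchange_rows, the toy matrices, and the left-to-right order in which received blocks are concatenated (by sending group) are assumptions; the disclosure fixes only the (M / L)×NL shape of the result.

```python
def split_rows(matrix, parts):
    """Split a row-major matrix into `parts` equal row blocks."""
    if len(matrix) % parts:
        raise ValueError("row count must be divisible by the number of groups")
    height = len(matrix) // parts
    return [matrix[p * height:(p + 1) * height] for p in range(parts)]

def exchange_rows(superimposed_per_group):
    """Simulate the exchange between same-numbered units of L groups.

    superimposed_per_group[g] is the M x N superimposed attention matrix
    held by the unit in group g. Returns, for each group, an attention
    result of shape (M/L) x (N*L).
    """
    groups = len(superimposed_per_group)
    blocks = [split_rows(m, groups) for m in superimposed_per_group]
    results = []
    for g in range(groups):
        merged = [[] for _ in blocks[0][g]]
        for sender in range(groups):          # row block g of every sender goes to group g
            for r, row in enumerate(blocks[sender][g]):
                merged[r].extend(row)
        results.append(merged)
    return results

# Hypothetical example: L = 2 groups, M = 4 rows, N = 2 columns.
group0 = [[1, 2], [3, 4], [5, 6], [7, 8]]
group1 = [[11, 12], [13, 14], [15, 16], [17, 18]]
results = exchange_rows([group0, group1])
```

This is the mirror image of the earlier column split: rows are divided among the groups and the received pieces are joined along the column dimension.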

[0079] According to one aspect of this disclosure, an electronic device is provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the data processing method described above.

[0080] According to one aspect of this disclosure, a computer-readable storage medium is provided, the storage medium storing a computer program that, when executed by a processor, implements the data processing method described above.

[0081] According to one aspect of this disclosure, a computer program product is provided, comprising a computer program that is read and executed by a processor of a computer device, causing the computer device to perform the data processing method as described above.

[0082] This embodiment of the disclosure processes text sequences and image sequences in the target application data separately. For text subsequences generated from text sequences, since their size is generally smaller than image subsequences, they are typically stored within individual processing units and not migrated to other processing units. Since image sequences are generally larger, multiple image subsequences are generated and distributed across various processing units to improve the processing efficiency of attention operations. During attention operations, attention calculations are performed based on the target query corresponding to the combined subsequence of the processing unit, and the target key-value pairs corresponding to the text and image subsequences stored within the processing unit itself, respectively, yielding a first attention sub-result and a second attention sub-result. For image subsequences stored in other processing units, they are obtained from those units, and attention calculations are performed based on the target query corresponding to the combined subsequence of the processing unit and the target key-value pairs corresponding to the obtained image subsequences, yielding a third attention sub-result. Then, based on the first, second, and third attention sub-results, an attention result is generated. Using multiple processing units simultaneously improves the processing efficiency of attention operations. Meanwhile, since each processing unit stores the entire text subsequence, it does not need to obtain text subsequences from other processing units during the attention operation. It only needs to obtain image subsequences stored by other processing units, which reduces the time of multimodal operation and improves the efficiency of attention operation.

[0083] Other features and advantages of this disclosure will be set forth in the following description and will be apparent in part from the description or may be learned by practicing the disclosure. The objectives and other advantages of this disclosure may be realized and obtained by means of the structures particularly pointed out in the description, claims and drawings.

Brief Description of the Drawings

[0084] The accompanying drawings are provided to further understand the technical solutions of this disclosure and constitute a part of the specification. They are used together with the embodiments of this disclosure to explain the technical solutions of this disclosure and do not constitute a limitation on the technical solutions of this disclosure.

[0085] Figure 1 is a diagram of the system architecture to which the data processing method according to the embodiments of this disclosure is applied;

[0086] Figures 2A and 2B are schematic diagrams of the data processing method of this disclosure applied in a scenario of generating video based on text and images;

[0087] Figure 3 is a general flowchart of the data processing method provided in the embodiments of this disclosure;

[0088] Figure 4 is a schematic diagram of a processing device provided in an embodiment of this disclosure;

[0089] Figures 5A and 5B are schematic diagrams of a data processing method provided in an embodiment of this disclosure;

[0090] Figure 6 is a flowchart of step 320 in Figure 3, generating text subsequences and image subsequences;

[0091] Figure 7 is a schematic diagram of generating text subsequences and image subsequences in Figure 6;

[0092] Figure 8 is a flowchart of step 630 in Figure 6, converting into text subsequences and image subsequences through mutual communication between processing units with the same sequence number in different processing unit groups;

[0093] Figure 9 is a schematic diagram of the conversion in Figure 8 into text subsequences and image subsequences through mutual communication between processing units with the same sequence number in different processing unit groups;

[0094] Figure 10 is a flowchart of step 820 in Figure 8, obtaining the second number of submatrices;

[0095] Figure 11 is a flowchart of step 330 in Figure 3, obtaining the first attention sub-result and the second attention sub-result;

[0096] Figure 12 is a flowchart of step 1110 in Figure 11, obtaining the first attention sub-result;

[0097] Figure 13 is a schematic diagram of obtaining the first attention sub-result in Figure 12;

[0098] Figure 14 is a flowchart of step 1230 in Figure 12, generating the first attention matrix;

[0099] Figure 15 is a flowchart of step 1120 in Figure 11, obtaining the second attention sub-result;

[0100] Figure 16 is a schematic diagram of obtaining the second attention sub-result in Figure 15;

[0101] Figure 17 is a flowchart of step 1530 in Figure 15, generating the second attention matrix;

[0102] Figure 18 is a flowchart of step 350 in Figure 3, generating the attention result;

[0103] Figure 19 is a schematic diagram of generating the superimposed attention matrix in Figure 18;

[0104] Figure 20 is a flowchart of step 1820 in Figure 18, generating the attention result;

[0105] Figure 21 is a schematic diagram of generating the attention result in Figure 20;

[0106] Figure 22 is a flowchart of communication between the processing units in a processing unit group provided in an embodiment of this disclosure;

[0107] Figure 23 is a schematic diagram of the communication in Figure 22 between the processing units within a processing unit group;

[0108] Figure 24 is a flowchart of step 340 in Figure 3, obtaining the image subsequences of other processing units among the first number of processing units;

[0109] Figure 25 is a flowchart of calculating the target key-value pair corresponding to a text subsequence provided in an embodiment of this disclosure;

[0110] Figure 26 is a schematic diagram of calculating the target key-value pair corresponding to a text subsequence in Figure 25;

[0111] Figure 27 is a flowchart of calculating the target key-value pair corresponding to an image subsequence provided in an embodiment of this disclosure;

[0112] Figure 28 is a schematic diagram of calculating the target key-value pair corresponding to an image subsequence in Figure 27;

[0113] Figure 29 is another schematic diagram of the data processing method provided in the embodiments of this disclosure;

[0114] Figures 30A and 30B are schematic diagrams of different text-to-video models in the same processing unit provided in the embodiments of this disclosure;

[0115] Figures 31A and 31B are further schematic diagrams of different text-to-video models in the same processing unit provided in the embodiments of this disclosure;

[0116] Figure 32 is a schematic diagram of the structure of a data processing apparatus according to an embodiment of the present disclosure;

[0117] Figure 33 is a structural diagram of a terminal that performs the data processing method according to an embodiment of the present disclosure;

[0118] Figure 34 is a structural diagram of a server that performs the data processing method according to an embodiment of the present disclosure.

Detailed Description

[0119] To make the objectives, technical solutions, and advantages of this disclosure clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the scope of this disclosure.

[0120] Before providing a further detailed description of the embodiments of this disclosure, the terms and concepts used in these embodiments are explained, and they are subject to the following interpretations:

[0121] Artificial Intelligence Generated Content (AIGC) is a technology that uses artificial intelligence to automatically generate content such as text, images, audio, and video. Based on generative adversarial networks (GANs), large-scale pre-trained models, and other AI technologies, it learns patterns and rules from vast amounts of data to generate relevant content with appropriate generalization capabilities. The development of AIGC technology has not only improved the efficiency and quality of content generation but also lowered the barriers and costs associated with content creation and interaction.

[0122] Diffusion Transformer (DiT) is a diffusion model architecture that replaces the traditional U-Net convolutional neural network with a Transformer architecture. This replacement brings significant advantages, especially in image generation tasks. By introducing the self-attention mechanism of the Transformer, DiT can process and fuse feature information more effectively, thereby improving the quality of the generated data.

[0123] Attention mechanisms are widely used techniques in deep learning. Their core idea is to simulate the selective attention capabilities of the human brain when processing information. By assigning different weights to different parts of the input, attention mechanisms enable the model to prioritize and strengthen the features or regions most critical to the current task while suppressing or ignoring less important information.

[0124] Attention heads are an important concept in Transformer architectures, primarily used to enhance the model's understanding and processing of input data. In a Transformer, the attention mechanism allows the model to focus on different parts of the input sequence, and the multi-head attention mechanism does so by introducing multiple parallel attention heads.

[0125] Multimodal data refers to data containing multiple different types, such as text, images, audio, and video. This data typically originates from different sensory channels or information sources, and each data type has unique characteristics and information expression methods. The processing and analysis of multimodal data requires considering the correlations and complementarities between different modalities, and achieving a more comprehensive understanding and decision-making process through the fusion of these data.

[0126] Graphics Processing Unit (GPU): A GPU is a microprocessor specifically designed for processing image and graphics-related computations. It possesses powerful parallel computing capabilities, enabling it to efficiently handle large-scale data streams, making it outstanding in fields such as graphics rendering, deep learning, and machine learning. AI-generated content tasks, such as text-to-image and text-to-video creation, are typically assigned to GPUs for execution.

[0127] With the rapid development of AI-generated content, various text-to-image and text-to-video models have emerged. These models share the same network backbone: a generative model that combines a diffusion model with a Transformer. This model relies heavily on attention computation, so improving the efficiency of attention computation has become a key concern. Existing techniques distribute attention computation across multiple processing units (such as GPUs), but they are suited only to single-modal inputs. When the target application data contains both text and images, such techniques can only mechanically merge the text input and the image input into a single input for attention computation. However, the two inputs also take part in other computations within the model besides attention, and those other computations differ between the two inputs. They therefore conflict with the single merged text-and-image input, which reduces the overall efficiency of the model.

[0128] Based on this, embodiments of the present disclosure provide a data processing method, related apparatus, and medium that can improve the overall efficiency of model computation when application data simultaneously contains text and images.

[0129] System architecture and scenario description of the embodiments disclosed herein

[0130] Figure 1 is a diagram of the system architecture to which the data processing method according to embodiments of this disclosure is applied. As shown in Figure 1, the system architecture includes terminal 110, Internet 120, gateway 130, and server 140.

[0131] Terminal 110 is a device that generates data processing tasks. An object (for example, a user) can upload application data, such as text sequences, image sequences, and video sequences, to terminal 110. Terminal 110 can take various forms, including desktop computers, laptops, tablets, PDAs (personal digital assistants), mobile phones, in-vehicle terminals, home theater terminals, and dedicated terminals. It can be a single device or a collection of multiple devices. For example, multiple devices can be connected via a local area network and share a single display device to work collaboratively, together forming one terminal. Terminal 110 can also communicate with the Internet 120 via wired or wireless means to exchange data.

[0132] Gateway 130, also known as an internetwork connector or protocol converter, is a computer system or device that enables network interconnection at the transport layer and acts as a translator between two systems that use different communication protocols, data formats, languages, or even completely different architectures. Gateway 130 also provides filtering and security functions. Messages sent from terminal 110 to server 140 are forwarded to the corresponding server 140 via gateway 130, and messages sent from server 140 to terminal 110 are likewise forwarded to the corresponding terminal 110 via gateway 130.

[0133] Server 140 refers to a computer system that provides data processing services. Server 140 is a processing device containing a group of processing units, which in turn contains multiple processing units. For data processing tasks such as text-to-image and text-to-video processing, the processing unit is typically a graphics processing unit (GPU). Compared to terminal 110, server 140 has higher requirements in terms of stability, security, and performance. Server 140 can be a single high-performance computer in a network platform, a cluster of multiple high-performance computers, a portion of a single high-performance computer (e.g., a virtual machine), or a combination of portions of multiple high-performance computers (e.g., virtual machines). Server 140 can also communicate with the Internet 120 via wired or wireless means to exchange data.

[0134] The embodiments disclosed herein can be applied in various scenarios, such as the scenarios shown in Figure 2A and Figure 2B, in which videos are generated based on text and images.

[0135] The model for generating videos based on text and images is a novel generative model that combines a diffusion model with a transformer architecture, and it uses a large number of attention operations. The technique illustrated in Figure 2A distributes attention operations across multiple processing units. However, it mechanically integrates text A and image A into a single input and assigns that input to processing units 1, 2, and 3 for attention operations and other calculations. Apart from attention operations, the other operations applied to text A and image A are inconsistent. These other operations conflict with the single integration of text and image, so processing units 1, 2, and 3 require extensive communication and computation to perform the attention operations and other calculations that generate video A. In this process, the communication time between the processing units is 600ms, and the computation time of each processing unit is 800ms. Therefore, the time required to generate video A based on text A and image A is 3000ms.

[0136] Figure 2B shows a scenario in which the data processing method provided in embodiments of this disclosure is applied to generate video A based on text A and image A. In Figure 2B, the complete data corresponding to text A is stored in each of processing units 1, 2, and 3, while the data corresponding to image A is distributed among these three units. During processing, text data does not need to be migrated to other processing units; only the image data stored in other processing units is retrieved, which reduces the time spent on multimodal computation. Furthermore, this arrangement meets the single integration requirement for text and image data in all computations except attention operations. Therefore, in the process of generating video A, the communication time between processing units is reduced to 200ms, and the computation time of each processing unit is 300ms. Thus, the time spent generating video A from text A and image A is 1100ms.

[0137] As can be seen from the above, the data processing method provided in this embodiment can reduce the multimodal computation time and improve the attention computation efficiency for multimodal data containing text and images.
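
The totals quoted for Figure 2A and Figure 2B can be checked with a short calculation. The source does not state how the totals are derived; the sketch below assumes they are the communication time plus the per-unit computation time summed over the three processing units, which happens to reproduce both figures. The function name total_time_ms is hypothetical.

```python
def total_time_ms(comm_ms: int, comp_ms_per_unit: int, num_units: int) -> int:
    # Assumption: total = communication time + per-unit computation time
    # summed over all units. This matches the quoted totals but is not
    # stated in the source.
    return comm_ms + comp_ms_per_unit * num_units


baseline = total_time_ms(comm_ms=600, comp_ms_per_unit=800, num_units=3)  # Figure 2A
proposed = total_time_ms(comm_ms=200, comp_ms_per_unit=300, num_units=3)  # Figure 2B
speedup = baseline / proposed
print(baseline, proposed, round(speedup, 2))
```

Under this assumption the quoted numbers correspond to a reduction of roughly 2.7 times in end-to-end time.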

[0138] It should be understood that the above description only illustrates some application scenarios of this disclosure. The business scenarios to which this disclosure can be applied may include, but are not limited to, the specific embodiments described above.

[0139] General Description of Embodiments in this Disclosure

[0140] It should be noted that the embodiments of this disclosure are applicable to various application scenarios, such as generating images from text and images, generating videos from text and images, and generating videos from text and videos. Related technologies that distribute attention operations across multiple processing units are only suitable for single-modal inputs. When the target application data contains both text and images, it can only mechanically integrate the text and image inputs into a single input for attention operations. However, besides attention operations, the two inputs also participate in other operations within the model. These other operations for the two inputs are inconsistent, causing conflicts between these other operations and the single integration of text and images, thus reducing the overall efficiency of the model. Some embodiments of this disclosure provide a data processing method, related apparatus, and medium that can improve the overall efficiency of model operations when application data simultaneously contains text and images.

[0141] The data processing method of this disclosure is a method for allocating acquired target application data containing text sequences and image sequences to a processing unit group containing a first number of processing units to generate attention results. This method can improve the overall efficiency of model computation when the application data contains both text and images.

[0142] The data processing method of this disclosure can be run on server 140, on terminal 110, or partially on server 140 and partially on terminal 110.

[0143] As shown in Figure 3, a data processing method according to an embodiment of the present disclosure is applied to a processing device including a group of processing units, the group of processing units including a first number of processing units, and the data processing method includes:

[0144] Step 310: Obtain target application data, which includes text sequences and image sequences;

[0145] Step 320: Based on each processing unit, generate a text subsequence for processing by the processing unit based on the text sequence, generate an image subsequence for processing by the processing unit based on the image sequence, and generate a combined subsequence based on the text subsequence and the image subsequence;

[0146] Step 330: Through the processing unit, attention is calculated based on the target query corresponding to the combined subsequence of the processing unit, and the target key-value pairs corresponding to the text subsequence and image subsequence of the processing unit, respectively, to obtain the first attention sub-result and the second attention sub-result.

[0147] Step 340: Through the processing unit, obtain the image subsequences of other processing units in the first number of processing units, and perform attention calculation based on the target query corresponding to the combined subsequence of the processing unit and the target key-value pair corresponding to the obtained image subsequence to obtain the third attention sub-result;

[0148] Step 350: Generate attention results based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

[0149] Steps 310 to 350 are described in detail below.

[0150] A processing device is a device that performs data processing methods. A processing device can be a terminal, a server, a combination of a server and a terminal, or multiple servers, as long as the processing device contains multiple processing units.

[0151] A processing unit is a unit within a processing device that performs specific processing on application data, such as text and images. For novel generative models based on diffusion and transformer models, such as text-to-image and text-to-video generation models, the processing unit is a graphics processing unit (GPU).

[0152] A processing unit group refers to a group consisting of a first number of processing units, where the first number is an integer greater than 1. A processing device includes at least one processing unit group. Figure 4 is a schematic diagram of a processing device provided in an embodiment of this disclosure. The processing device shown in Figure 4 includes processing unit group A, processing unit group B and processing unit group C. If application data is distributed to multiple processing units in processing unit group A for execution, and processing unit group A includes processing unit A1, processing unit A2, processing unit A3 and processing unit A4, then the first number is 4.

[0153] In step 310, target application data is obtained, which includes text sequences and image sequences.

[0154] The target application data is data that requires data processing, and it is multimodal data. Assuming the data processing method provided in this embodiment is applied to a text-to-image model, the target application data includes text and images. Assuming the data processing method provided in this embodiment is applied to a text-to-video model, the target application data includes text and images, or text and video.

[0155] A text sequence refers to the text portion of the target application data, while an image sequence refers to the image portion of the target application data.

[0156] Assuming the target application data consists of the text "Generate an image of a person picking flowers in image A" and an image A showing a person, then the text sequence corresponds to the text "Generate an image of a person picking flowers in image A", and the text sequence can be [person in image A, picking flowers]. The image sequence corresponds to image A.

[0157] Suppose the target application data is the text "Generate a video of the person in video A picking flowers, and generate a video with the same style as video A" together with a video A that has an animated style and shows a person. Then the text sequence corresponds to the text "Generate a video of the person in video A picking flowers, and generate a video with the same style as video A", while the image sequence is the sequence of image frames corresponding to video A.

[0158] In target application data, images contain image style, pixel values, and other information; the amount of information contained in images is usually much greater than that contained in text. Therefore, the size of an image sequence is usually much larger than the size of a text sequence.

[0159] In step 320, for each processing unit, a text subsequence for processing by the processing unit is generated based on the text sequence, an image subsequence for processing by the processing unit is generated based on the image sequence, and a combined subsequence is generated based on the text subsequence and the image subsequence.

[0160] A text subsequence refers to a text-related sequence that needs to be processed by a processing unit within a processing unit group. Text subsequences are generated based on the text sequence. Since the size of a text sequence is typically much smaller than the size of an image sequence, every text subsequence is placed in each processing unit of the processing unit group; that is, each processing unit stores all the text subsequences.

[0161] Referring to Figure 5A, the processing unit group includes processing unit 1 and processing unit 2. Text subsequence 1 is generated based on the text sequence in the target application data and is placed in both processing unit 1 and processing unit 2. In this case, text subsequence 1 can be regarded as the text sequence itself. Alternatively, if text subsequence 1 and text subsequence 2 are generated based on the text sequence, then both text subsequence 1 and text subsequence 2 are placed in processing unit 1 and in processing unit 2.

[0162] An image subsequence refers to an image-related sequence that needs to be processed by one of the processing units in a processing unit group. Image subsequences are generated based on the image sequence, which is generally large. Therefore, to reduce the computational demands on each processing unit, multiple image subsequences are generated based on the image sequence and distributed across the processing units for processing. Each processing unit in the processing unit group stores only some of the image subsequences corresponding to the image sequence, and the image subsequences held by different processing units are all different.

[0163] Referring to Figure 5A, the processing unit group includes processing unit 1 and processing unit 2. The image sequence in the target application data is processed to generate image subsequence 1 and image subsequence 2. Then, image subsequence 1 is assigned to processing unit 1 and image subsequence 2 is assigned to processing unit 2.

[0164] It should be noted that the number of image subsequences is a multiple of the first number, which facilitates the equal distribution of each image subsequence to the first number of processing units for processing.

[0165] A combined subsequence refers to a text- and image-related sequence that a processing unit in a processing unit group needs to process. A combined subsequence is obtained by combining a text subsequence and an image subsequence. Therefore, the combined subsequences of the processing units differ from one another. For example, if the text subsequence in processing unit A is [object A, picking flowers] and the image subsequence is [DhPPY&_%*Fk19jkf], then the combined subsequence is [object A, picking flowers, DhPPY&_%*Fk19jkf].

[0166] Referring to Figure 5A, combined subsequence 1 in processing unit 1 is obtained by combining text subsequence 1 and image subsequence 1, and combined subsequence 2 in processing unit 2 is obtained by combining text subsequence 1 and image subsequence 2.
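
The construction of the subsequences in step 320 can be sketched as follows. This is a minimal illustration, assuming that the text subsequence is the whole text sequence (one of the cases described for Figure 5A) and that the image sequence is split into equal contiguous chunks, one per processing unit; the function name build_subsequences and the placeholder tokens are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_subsequences(
    text_seq: Sequence[str], image_seq: Sequence[str], num_units: int
) -> List[Tuple[List[str], List[str], List[str]]]:
    """Return (text_sub, image_sub, combined_sub) for each processing unit."""
    if len(image_seq) % num_units != 0:
        raise ValueError("image sequence must split evenly across units")
    chunk = len(image_seq) // num_units
    result = []
    for unit in range(num_units):
        text_sub = list(text_seq)  # replicated on every unit
        image_sub = list(image_seq[unit * chunk:(unit + 1) * chunk])  # unique per unit
        combined_sub = text_sub + image_sub  # text part followed by image part
        result.append((text_sub, image_sub, combined_sub))
    return result


subs = build_subsequences(
    ["object A", "picking flowers"], ["p0", "p1", "p2", "p3"], num_units=2
)
for unit, (_, _, combined) in enumerate(subs, start=1):
    print(f"processing unit {unit}: {combined}")
```

Every unit ends up with the same text part and a different image part, which is why the combined subsequences differ between units.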

[0167] In step 330, the processing unit performs attention calculations on the target query corresponding to the combined subsequence of the processing unit, and the target key-value pairs corresponding to the text subsequence and image subsequence of the processing unit, respectively, to obtain the first attention sub-result and the second attention sub-result.

[0168] The query (Q) is an important component of the attention mechanism in the transformer model, and the target query corresponding to the combined subsequence refers to the Q obtained by calculating the combined subsequence.

[0169] Key-value pairs refer to the combination of key (K) and value (V) in the attention mechanism of a transformer model. Keys and values are also important components of the attention mechanism. The target key-value pair for a text subsequence refers to the K and V combination corresponding to the text subsequence. The target key-value pair for an image subsequence refers to the K and V combination corresponding to the image subsequence.

[0170] The first attention sub-result corresponds to the text sub-sequence. In any processing unit of the processing unit group, attention operations are performed based on the target query corresponding to the combined sub-sequence and the target key-value pair corresponding to the text sub-sequence to obtain the first attention sub-result.

[0171] Referring to Figure 5A, Q1 represents the target query corresponding to the combined subsequence of processing unit 1, and K1-T and V1-T represent the target key-value pair corresponding to text subsequence 1 of processing unit 1. Attention operations are performed based on Q1, K1-T, and V1-T to obtain the first attention sub-result O1-1 for processing unit 1. Similarly, Q2 represents the target query corresponding to the combined subsequence of processing unit 2, and K2-T and V2-T represent the target key-value pair corresponding to text subsequence 1 of processing unit 2. Attention operations are performed based on Q2, K2-T, and V2-T to obtain the first attention sub-result O2-1 for processing unit 2.

[0172] The second attention sub-result corresponds to the image subsequence. In any processing unit of the processing unit group, attention operations are performed based on the target query corresponding to the combined subsequence and the target key-value pair corresponding to the image subsequence to obtain the second attention sub-result.

[0173] Referring to Figure 5A, Q1 represents the target query corresponding to the combined subsequence of processing unit 1, and K1-I and V1-I are the target key-value pair corresponding to image subsequence 1 of processing unit 1. Attention operations are performed based on Q1, K1-I, and V1-I to obtain the second attention sub-result O1-2 of processing unit 1. Similarly, Q2 represents the target query corresponding to the combined subsequence of processing unit 2, and K2-I and V2-I are the target key-value pair corresponding to image subsequence 2 of processing unit 2. Attention operations are performed based on Q2, K2-I, and V2-I to obtain the second attention sub-result O2-2 of processing unit 2.

[0174] It should be noted that if the first number of processing units in the processing unit group all possess the complete text subsequences, then the target key-value pairs corresponding to the text subsequences of these processing units are all equal. For example, in Figure 5A, the target key-value pair corresponding to the text subsequence of processing unit 1 is equal to that of processing unit 2, that is, K1-T is equal to K2-T, and V1-T is equal to V2-T.

[0175] In step 340, the processing unit obtains the image subsequences of other processing units in the first number of processing units, and performs attention calculation based on the target query corresponding to the combined subsequence of the processing unit and the target key-value pair corresponding to the obtained image subsequence to obtain the third attention sub-result.

[0176] The third attention sub-result corresponds to the other processing units in the processing unit group besides the current processing unit. The third attention sub-result is obtained by performing attention operations on the target query corresponding to the combined sub-sequence of the current processing unit and the target key-value pairs corresponding to the image sub-sequences from other processing units.

[0177] Referring to Figure 5A, the processing unit group includes processing unit 1 and processing unit 2. For processing unit 1, attention operations are performed based on the target query Q1 corresponding to its combined subsequence and the target key-value pair K2-I and V2-I corresponding to image subsequence 2 from processing unit 2, to obtain the third attention sub-result O1-3. For processing unit 2, attention operations are performed based on the target query Q2 corresponding to its combined subsequence and the target key-value pair K1-I and V1-I corresponding to image subsequence 1 from processing unit 1, to obtain the third attention sub-result O2-3.

[0178] It should be noted that, for any processing unit in the processing unit group, the number of third attention sub-results is related to the number of processing units in the processing unit group, i.e., the first number. Specifically, the number of third attention sub-results is the first number minus 1. Assuming the processing unit group contains 7 processing units, then for any processing unit, there are 6 third attention sub-results.

[0179] It should be noted that the target key-value pairs corresponding to the image subsequences of other processing units can be processed in other processing units. Therefore, the current processing unit only needs to obtain the target key-value pairs of the image subsequences of other processing units in the first number of processing units.

[0180] In step 350, an attention result is generated based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

[0181] The attention result refers to the attention result of one of the processing units in the processing unit group. The attention result of a processing unit is determined based on the first attention sub-result, the second attention sub-result, and the third attention sub-result. Referring to Figure 5A, the attention result O1 of processing unit 1 is determined based on the first attention sub-result O1-1, the second attention sub-result O1-2, and the third attention sub-result O1-3. The attention result O2 of processing unit 2 is determined based on the first attention sub-result O2-1, the second attention sub-result O2-2, and the third attention sub-result O2-3.
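
The relationship between the three sub-results and the final attention result can be illustrated with a minimal sketch for processing unit 1 of Figure 5A. It assumes scaled dot-product attention, and it assumes that each sub-result carries its row-wise softmax maximum and sum so that sub-results can be merged exactly. This merging rule is a commonly used technique and is not taken from the source, which details step 350 separately. The names partial_attention and merge and all numeric values are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
# (unnormalised output rows, row-wise score maximum, row-wise sum of exponentials)
Partial = Tuple[Matrix, List[float], List[float]]


def partial_attention(q: Matrix, k: Matrix, v: Matrix) -> Partial:
    """Attention of q against one key-value block, kept unnormalised."""
    scale = 1.0 / math.sqrt(len(q[0]))  # assumption: scale by sqrt of head size
    outs, maxes, sums = [], [], []
    for q_row in q:
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        outs.append([sum(w * v_row[j] for w, v_row in zip(weights, v))
                     for j in range(len(v[0]))])
        maxes.append(m)
        sums.append(sum(weights))
    return outs, maxes, sums


def merge(parts: List[Partial]) -> Matrix:
    """Combine sub-results into the normalised attention result."""
    merged = []
    for r in range(len(parts[0][0])):
        m = max(p[1][r] for p in parts)
        factors = [math.exp(p[1][r] - m) for p in parts]  # rescale to a common maximum
        denom = sum(f * p[2][r] for f, p in zip(factors, parts))
        merged.append([sum(f * p[0][r][j] for f, p in zip(factors, parts)) / denom
                       for j in range(len(parts[0][0][r]))])
    return merged


q1 = [[1.0, 0.0], [0.5, 0.5]]                # target query of combined subsequence 1
k_text, v_text = [[1.0, 0.0]], [[1.0, 2.0]]  # K1-T, V1-T
k_img1, v_img1 = [[0.0, 1.0]], [[3.0, 4.0]]  # K1-I, V1-I (local)
k_img2, v_img2 = [[1.0, 1.0]], [[5.0, 6.0]]  # K2-I, V2-I (from processing unit 2)

o1_1 = partial_attention(q1, k_text, v_text)  # first attention sub-result
o1_2 = partial_attention(q1, k_img1, v_img1)  # second attention sub-result
o1_3 = partial_attention(q1, k_img2, v_img2)  # third attention sub-result
o1 = merge([o1_1, o1_2, o1_3])

# Reference: attention of q1 against all keys and values at once.
reference = merge([partial_attention(q1, k_text + k_img1 + k_img2,
                                     v_text + v_img1 + v_img2)])
```

In this sketch the merged result equals attention computed over the complete key-value set, and each key-value block contributes exactly once.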

[0182] It should be noted that the formula for calculating the attention of the target application data is assumed to be:

[0183]

[0184] Where O represents the attention result corresponding to the target application data, Q represents the query corresponding to the target application data, K and V represent the key-value pair corresponding to the target application data, and hs and softmax represent the relevant parameters for attention computation. From the above formula, if Q is split row-wise, the attention computation can be expressed as:

[0185]

[0186] Here, Q1 and Q2 are the results obtained by splitting Q. According to this formula, Q1 and Q2 each need to undergo attention calculation with the complete K and V, and for Q1 or Q2, K and V participate in the attention operation only once. Therefore, in this embodiment, the target query corresponds to the combined subsequence. Since each processing unit has a complete text subsequence, to ensure the accuracy of the attention results, only the image subsequences of other processing units in the processing unit group are obtained, so that the target key-value pair corresponding to the text subsequence participates in the operation only once, in the current processing unit.
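
For reference, the two formulas discussed in paragraphs [0182] to [0186] can plausibly be written as follows. This assumes the standard scaled dot-product form and assumes that hs denotes the per-head hidden size used for scaling; the source does not define hs in this passage, so both points are assumptions.

```latex
O = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{hs}}\right) V,
\qquad
\begin{bmatrix} O_1 \\ O_2 \end{bmatrix}
=
\begin{bmatrix}
\operatorname{softmax}\!\left(\dfrac{Q_1 K^{\top}}{\sqrt{hs}}\right) V \\[2ex]
\operatorname{softmax}\!\left(\dfrac{Q_2 K^{\top}}{\sqrt{hs}}\right) V
\end{bmatrix},
\qquad
Q = \begin{bmatrix} Q_1 \\ Q_2 \end{bmatrix}
```

Because softmax is applied row by row, each block of query rows can be processed independently, provided it sees the complete K and V exactly once.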

[0187] Furthermore, assuming the text sequence has a length of L1 and the image sequence has a length of L2, if attention operations are performed on the target data using a single processing unit, then the latency of the attention operation can be expressed as:

[0188] T = O((L1 + L2)^2),

[0189] Where T represents the latency of the attention operation on the target application data, and O() represents the latency function corresponding to the processing unit. In this embodiment, the sub-sequences corresponding to the text sequence are set in each processing unit, while the sub-sequences corresponding to the image sequence are set in different processing units. The latency of the attention operation can be expressed as:

[0190]

[0191] Where N is the number of processing units participating in the attention operation, T_comp represents the computational latency of the processing unit, and T_comm represents the communication delay between the processing unit and other processing units. Since the size L1 of the text sequence is much smaller than the size L2 of the image sequence, a near-linear speedup can be obtained in the computation part, and the communication delay is usually less than the computation delay, thereby improving the computational efficiency of multimodal processing.

[0192] In addition, in this embodiment of the present disclosure, the processing unit only needs all the text subsequences and some image subsequences, and the memory consumption is less than that of the method of performing attention operation with a single processing unit.
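
The near-linear speedup and the memory saving can be made concrete with a rough cost model. The sketch below is not the formula of paragraph [0190], which is not reproduced here. It assumes that computation cost is proportional to the number of query-key pairs, that each unit holds L1 + L2/N tokens as queries and attends to all L1 + L2 keys, and that communication is ignored. The function names and the sizes L1 = 100, L2 = 10000, N = 4 are hypothetical.

```python
def single_unit_pairs(l1: int, l2: int) -> int:
    # One unit computes every query against every key: (L1 + L2)^2 pairs.
    return (l1 + l2) ** 2


def per_unit_pairs(l1: int, l2: int, n: int) -> int:
    # Assumption: each unit's queries are its combined subsequence (L1 + L2/N),
    # and they attend to the text keys plus all image keys (L1 + L2).
    if l2 % n != 0:
        raise ValueError("image length must be divisible by the number of units")
    return (l1 + l2 // n) * (l1 + l2)


def per_unit_tokens(l1: int, l2: int, n: int) -> int:
    # Tokens stored per unit: the whole text plus one image share.
    return l1 + l2 // n


l1, l2, n = 100, 10_000, 4
speedup = single_unit_pairs(l1, l2) / per_unit_pairs(l1, l2, n)
print(f"compute speedup over a single unit: {speedup:.2f}x with {n} units")
print(f"tokens stored per unit: {per_unit_tokens(l1, l2, n)} of {l1 + l2}")
```

With L1 much smaller than L2, the replicated text adds little per-unit work, so the speedup in this model approaches N.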

[0193] In the embodiments described in steps 310 to 350 above, text sequences and image sequences in the target application data are processed separately. For text subsequences generated from text sequences, since their size is generally smaller than image subsequences, they are typically placed within individual processing units and not migrated to other processing units. Since image sequences are generally larger, multiple image subsequences are generated and distributed across various processing units to improve the processing efficiency of attention operations. During attention operations, attention calculations are performed based on the target query corresponding to the combined subsequence of the processing unit, and the target key-value pairs corresponding to the text and image subsequences stored within the processing unit itself, respectively, yielding a first attention sub-result and a second attention sub-result. For image subsequences stored in other processing units, they are obtained from other processing units, and attention calculations are performed based on the target query corresponding to the combined subsequence of the processing unit and the target key-value pairs corresponding to the obtained image subsequences, yielding a third attention sub-result. Then, based on the first, second, and third attention sub-results, an attention result is generated. Using multiple processing units simultaneously improves the processing efficiency of attention operations. Meanwhile, since each processing unit stores the entire text subsequence, it does not need to obtain text subsequences from other processing units during the attention operation. It only needs to obtain image subsequences stored by other processing units, which reduces the time of multimodal operation and improves the efficiency of attention operation.

[0194] The above is a general description of steps 310 to 350. Since steps 310 and 340 have been described in sufficient detail above, the specific implementation processes of steps 320, 330 and 350 will be described in detail below.

[0195] Detailed description of step 320

[0196] In step 320, for each processing unit, a text subsequence for processing by the processing unit is generated based on the text sequence, an image subsequence for processing by the processing unit is generated based on the image sequence, and a combined subsequence is generated based on the text subsequence and the image subsequence.

[0197] In one embodiment, the processing device includes a second number of processing unit groups, and the product of the first number and the second number is a third number. Referring to Figure 6, step 320 includes:

[0198] Step 610: Divide the image sequence into a third number of image subsequences;

[0199] Step 620: Assign a text sequence to each processing unit in the processing device, and assign the third number of image sub-sequences to each processing unit in the processing device respectively;

[0200] Step 630: Through mutual communication between processing units with the same sequence number in different processing unit groups, the text sequence and image subsequence assigned to the processing units with the same sequence number in different processing unit groups are converted into text subsequence and image subsequence.

[0201] Steps 610 to 630 are described in detail below.

[0202] The second number is the number of processing unit groups in the processing device that participate in data processing. The processing device shown in Figure 4 includes processing unit group A, processing unit group B and processing unit group C, so the second number is 3.

[0203] The third number is the product of the first number and the second number, and is the number of processing units involved in data processing within the processing device. Referring to Figure 4, the processing device contains 3 processing unit groups, and each processing unit group contains 6 processing units. Therefore, the first number is 6, the second number is 3, and the third number is 18; that is, the number of processing units involved in data processing in the processing device of Figure 4 is 18.

[0204] In step 610, the image sequence is divided into a third number of image sub-sequences.

[0205] An image subsequence is the result of dividing the image sequence according to the third number. Since the number of processing units involved in data processing in the processing device is the third number, and in order to improve data processing efficiency, a different part of the image sequence is assigned to each processing unit; the image sequence is therefore divided into a third number of image subsequences.

[0206] For the processing device shown in Figure 5A, the image sequence is divided into two image subsequences. Referring to Figure 5B, the image sequence is divided into 4 image subsequences.

[0207] To ensure the proper execution of the attention operation, the third number of image subsequences are of equal size.

[0208] In step 620, a text sequence is assigned to each processing unit in the processing device, and a third number of image sub-sequences are respectively assigned to each processing unit in the processing device.

[0209] Since the size of the text sequence is much smaller than that of the image sequence, in order to improve the efficiency of multimodal computing, the text sequence is distributed to each processing unit in the processing device, and the third number of image subsequences are distributed to the processing units respectively.

[0210] Referring to Figure 7, the text sequence in the target application data is assigned to GPUs 0 through 3. Based on the image sequence in the target application data, image subsequences 0, 1, 2, and 3 are obtained and assigned to GPUs 0 through 3, respectively. Ultimately, GPU 0 holds the text sequence and image subsequence 0, GPU 1 holds the text sequence and image subsequence 1, GPU 2 holds the text sequence and image subsequence 2, and GPU 3 holds the text sequence and image subsequence 3.
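
Steps 610 and 620 can be sketched as a simple allocation routine. This is an illustration only, assuming that the image sequence is cut into equal contiguous pieces and that image subsequence i goes to GPU i as in the Figure 7 example; the function name assign_to_units and the integer placeholder tokens are hypothetical.

```python
from typing import Dict, List, Sequence


def assign_to_units(
    text_seq: Sequence[int],
    image_seq: Sequence[int],
    first_number: int,
    second_number: int,
) -> Dict[int, Dict[str, List[int]]]:
    """Give every unit the full text sequence and one equal image share."""
    third_number = first_number * second_number  # units taking part overall
    if len(image_seq) % third_number != 0:
        raise ValueError("image sequence must split into equal subsequences")
    size = len(image_seq) // third_number
    return {
        gpu: {
            "text": list(text_seq),                                 # step 620: replicated
            "image": list(image_seq[gpu * size:(gpu + 1) * size]),  # step 610: equal split
        }
        for gpu in range(third_number)
    }


allocation = assign_to_units(
    text_seq=[100, 101], image_seq=list(range(8)), first_number=2, second_number=2
)
for gpu, data in allocation.items():
    print(f"GPU {gpu}: text={data['text']} image={data['image']}")
```

With two groups of two units, the third number is 4, matching the four GPUs of the Figure 7 example.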

[0211] In step 630, through mutual communication between processing units with the same sequence number in different processing unit groups, the text sequence and image subsequence assigned to the processing units with the same sequence number in different processing unit groups are converted into text subsequence and image subsequence.

[0212] Each processing unit in each processing unit group has the same attention head, and the attention heads of processing units with the same index in different processing unit groups constitute the complete attention head of the attention mechanism. In order to determine the activation value of the corresponding combined subsequence on the complete attention head, the processing unit needs to obtain the image subsequence from the processing units with the same index in other processing unit groups in the second number of processing unit groups, and then convert the text sequence and image subsequence into text subsequence and image subsequence, respectively.

[0213] Processing unit group 1 shown in Figure 7 includes GPU0 and GPU2, where GPU0 has sequence number 1 and GPU2 has sequence number 2. Processing unit group 2 shown in Figure 7 includes GPU1 and GPU3, where GPU1 has sequence number 1 and GPU3 has sequence number 2. Therefore, through communication between GPU0 and GPU1, the text sequences and image subsequences in GPU0 and GPU1 can be converted into text subsequences and image subsequences. Similarly, through communication between GPU2 and GPU3, the text sequences and image subsequences in GPU2 and GPU3 can be converted into text subsequences and image subsequences.
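
The grouping in the Figure 7 example can be expressed as an index mapping. The sketch below assumes an interleaved layout inferred only from that example (GPU0 and GPU2 in one group, GPU1 and GPU3 in the other) and uses zero-based group and sequence numbers, so sequence number 1 in the text corresponds to index 0 here. The function names are hypothetical.

```python
from typing import List, Tuple


def group_and_index(gpu_id: int, second_number: int) -> Tuple[int, int]:
    """Return (group, sequence index), both zero-based, for a GPU id."""
    # Assumption: consecutive GPU ids cycle through the groups.
    return gpu_id % second_number, gpu_id // second_number


def same_index_peers(gpu_id: int, second_number: int) -> List[int]:
    """GPU ids that share this GPU's sequence index across all groups."""
    _, index = group_and_index(gpu_id, second_number)
    return [index * second_number + group for group in range(second_number)]


for gpu in range(4):
    group, index = group_and_index(gpu, second_number=2)
    peers = same_index_peers(gpu, second_number=2)
    print(f"GPU{gpu}: group {group + 1}, sequence number {index + 1}, peers {peers}")
```

Under this layout GPU0 exchanges data with GPU1, and GPU2 with GPU3, as described for step 630.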

[0214] In the embodiments described in steps 610 to 630 above, a second number of processing unit groups are provided in the processing device, and each processing unit in the second number of processing unit groups participates in the attention calculation of the target application data, further improving the multimodal computing efficiency. Furthermore, the attention heads of processing units with the same sequence number in different processing unit groups constitute the complete attention head of the attention mechanism. Therefore, the text subsequences and image subsequences obtained through mutual communication between processing units with the same sequence number in different processing unit groups correspond to the complete attention head, ensuring that the final attention result is calculated via the complete attention head, thereby improving the accuracy of the attention result.

[0215] The above is a general description of steps 610 to 630. Since steps 610 and 620 have been described in sufficient detail above, only the specific implementation process of step 630 will be described in detail below.

[0216] In step 630, through mutual communication between processing units with the same sequence number in different processing unit groups, the text sequence and image subsequence assigned to the processing units with the same sequence number in different processing unit groups are converted into text subsequence and image subsequence.

[0217] In one embodiment, the text sequence is an M1×N matrix, the image sequence is an M2×N matrix, and the second number is L. Referring to Figure 8, step 630 includes:

[0218] Step 810: Combine the text sequence and image sub-sequence allocated to the processing unit into a (M1+M2)×N combination matrix;

[0219] Step 820: Divide the columns of the combined matrix according to the second number to obtain the second number of (M1+M2)×(N / L) submatrices;

[0220] Step 830: Keep the first submatrix, among the second number of submatrices, that corresponds to the processing unit group of the processing unit, and send the second submatrices corresponding to the other processing unit groups to the processing units with the same sequence number in those processing unit groups;

[0221] Step 840: Receive the third submatrices sent by the processing units with the same sequence number in the other processing unit groups, and integrate the retained first submatrix and the received third submatrices into an integrated matrix of (M1+M2)L×(N / L);

[0222] Step 850: Determine the part corresponding to the text sequence in the integrated matrix as the text subsequence, and determine the part corresponding to the image subsequence in the integrated matrix as the image subsequence.

[0223] Steps 810 to 850 are described in detail below.

[0224] To facilitate combining the text sequence and the image sequence, both are represented as matrices with the same number of columns: the text sequence is an M1×N matrix and the image sequence is an M2×N matrix. M1 is the number of rows of the matrix corresponding to the text sequence, M2 is the number of rows of the matrix corresponding to the image sequence, and N is the number of columns of both matrices. Referring to Figure 9, if the text sequence is a 1×3 matrix and the image sequence is a 4×3 matrix, then M1 equals 1, M2 equals 4, and N equals 3.

[0225] L is the value of the second number, that is, the number of processing unit groups in the processing device. Referring to Figure 5B and Figure 7, the second number L is 2.

[0226] In step 810, the text sequence and image sub-sequence allocated to the processing unit are combined into a combination matrix of (M1+M2)×N.

[0227] The combination matrix is the result of combining the matrix corresponding to the text sequence with the matrix corresponding to the image sub-sequence. The two matrices are combined along the row direction, so the combination matrix has M1+M2 rows and N columns.

[0228] Referring to Figure 9, if the text sequence of the processing unit is a 1×3 matrix and the image sequence is a 4×3 matrix, then the combined matrix corresponding to this processing unit is a 5×3 matrix.
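The row-wise combination of step 810 can be sketched as follows. This is a minimal illustration using NumPy; the function name `combine` and the toy values are assumptions made for the example, and only the sizes (M1 = 1, M2 = 4, N = 3) follow the Figure 9 example.

```python
import numpy as np

def combine(text_seq, image_subseq):
    """Stack the text rows on top of the image rows (step 810)."""
    if text_seq.shape[1] != image_subseq.shape[1]:
        raise ValueError("text and image matrices must share the column count N")
    return np.vstack([text_seq, image_subseq])

# Toy values only; the sizes follow the Figure 9 example (M1=1, M2=4, N=3).
text_seq = np.arange(3, dtype=float).reshape(1, 3)
image_subseq = np.arange(3, 15, dtype=float).reshape(4, 3)
combined = combine(text_seq, image_subseq)
print(combined.shape)  # (5, 3)
```

The column-count check mirrors the requirement that both matrices have N columns; the row count of the result is M1+M2.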

[0229] In step 820, the columns of the combined matrix are divided according to the second number to obtain the second number of (M1+M2)×(N / L) submatrices.

[0230] In the second number of processing unit groups, processing units with the same index are assigned different attention heads, and the columns of the combination matrix correspond to the attention heads. Therefore, the columns of the combination matrix are divided according to the second number to determine the second number of sub-matrices, one for each processing unit group. Referring to Figure 9, assuming that the processing units in processing unit group 1 are assigned attention head 1, the processing units in processing unit group 2 are assigned attention head 2, and the processing units in processing unit group 3 are assigned attention head 3, then the first column of the resulting 5×3 combination matrix corresponds to attention head 1, the second column corresponds to attention head 2, and the third column corresponds to attention head 3.

[0231] M1+M2 is the number of rows of each of the second number of submatrices obtained from the combined matrix, and N / L is the number of columns of each submatrix. Referring to Figure 9, the processing device is provided with 3 processing unit groups, that is, the second number is 3, and the 5×3 combination matrix is divided by columns into 3 submatrices of size 5×1. Likewise, if the combination matrix is a 5×9 matrix and the second number is 3, then there are 3 submatrices, each of size 5×3.
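The column division of step 820 can be illustrated with a short NumPy sketch. The helper name `split_columns` is hypothetical, and the matrices hold placeholder values; only the sizes come from the two examples in the paragraph above.

```python
import numpy as np

def split_columns(combined, num_groups):
    """Split the combined matrix into num_groups equal column blocks (step 820)."""
    n_cols = combined.shape[1]
    if n_cols % num_groups != 0:
        raise ValueError("N must be divisible by the number of groups L")
    return np.split(combined, num_groups, axis=1)

small = split_columns(np.arange(15).reshape(5, 3), 3)  # three 5x1 blocks
wide = split_columns(np.arange(45).reshape(5, 9), 3)   # three 5x3 blocks
print([s.shape for s in small])  # [(5, 1), (5, 1), (5, 1)]
print([w.shape for w in wide])   # [(5, 3), (5, 3), (5, 3)]
```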

[0232] In step 830, the first submatrix corresponding to the processing unit group in the second number of submatrixes is retained, and the second submatrix corresponding to the other processing unit groups in the second number of submatrixes is sent to the processing unit with the same sequence number in the other processing unit groups.

[0233] The second number of submatrices correspond one-to-one with the attention heads of the processing unit groups. The first submatrix is the submatrix, among the second number of submatrices, that corresponds to the processing unit group to which the current processing unit belongs. A second submatrix is a submatrix, among the second number of submatrices, that corresponds to another processing unit group. Therefore, there is 1 first submatrix, and the number of second submatrices is the second number minus 1.

[0234] After the second submatrices are determined, each second submatrix is sent to the processing unit that has the same sequence number as the current processing unit in the processing unit group corresponding to that second submatrix.

[0235] Referring to Figure 9, the processing device comprises three processing unit groups, and the current processing unit is the first processing unit in processing unit group 2, i.e., its sequence number is 1. The submatrix formed by the second column of the combined matrix corresponds to processing unit group 2 and is therefore the first submatrix for the current processing unit. The submatrices formed by the first and third columns of the combined matrix are the second submatrices. The submatrix formed by the first column is sent to the first processing unit in processing unit group 1, and the submatrix formed by the third column is sent to the first processing unit in processing unit group 3.

[0236] In step 840, the third submatrices sent by the processing units with the same sequence number in the other processing unit groups are received, and the retained first submatrix and the received third submatrices are integrated into an integrated matrix of (M1+M2)L×(N / L).

[0237] A third submatrix is a submatrix received from the processing unit that has the same sequence number as the current processing unit in another processing unit group. The number of third submatrices is the second number minus one. Referring to Figure 9, if the current processing unit is the first processing unit in processing unit group 2, then it obtains a total of 2 third submatrices: one from the first processing unit in processing unit group 1 and the other from the first processing unit in processing unit group 3.

[0238] The integrated matrix is the result of stacking the first submatrix and each of the third submatrices along the row direction, so that (M1+M2)L is the number of rows of the integrated matrix and N / L is its number of columns.

[0239] The integrated matrix represents the values, at the attention head of the current processing unit, of the text sequences and image sub-sequences held by the processing units with the same sequence number in the second number of processing unit groups. Correspondingly, if the first number of processing units in a processing unit group are assigned the same attention head, then the integrated matrices of the first number of processing units in that group represent the value of the target application data at the attention head corresponding to that processing unit group.

[0240] Referring to Figure 9, the first processing unit in processing unit group 2 receives one third submatrix from the first processing unit in processing unit group 1 and another from the first processing unit in processing unit group 3. The first submatrix and the two third submatrices are then integrated to obtain the integrated matrix. The first and third submatrices are each of size 5×1, and the integrated matrix is of size 15×1. Furthermore, in the integrated matrix, the submatrices corresponding to processing unit group 1, processing unit group 2, and processing unit group 3 are arranged sequentially.

[0241] It should be noted that the processing unit receives the third submatrices sent by the processing units with the same sequence number in the other processing unit groups through all2all (all-to-all) communication. All2all is a many-to-many communication method and an efficient data exchange mechanism: it allows each processing unit to send data to multiple processing units and to receive data from multiple processing units.
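The exchange of steps 830 and 840 can be simulated in a single process to make the data movement concrete. The sketch below is not a real communication call: `all_to_all_exchange` is a hypothetical helper that takes, for one sequence number, the combined matrix held in each of the L processing unit groups and returns the integrated matrix each of those units would hold after the exchange. The values are placeholders; the sizes follow the Figure 9 example.

```python
import numpy as np

def all_to_all_exchange(combined_per_group):
    """Simulate steps 830-840 for the same-index unit of every group."""
    num_groups = len(combined_per_group)
    # blocks[src][dst]: column block that unit src keeps (src == dst)
    # or sends to the same-index unit in group dst.
    blocks = [np.split(m, num_groups, axis=1) for m in combined_per_group]
    # Each destination stacks the blocks it holds in group order.
    return [np.vstack([blocks[src][dst] for src in range(num_groups)])
            for dst in range(num_groups)]

text = np.array([[0.1, 0.2, 0.3]])                      # shared 1x3 text sequence
images = [np.full((4, 3), g + 1.0) for g in range(3)]   # distinct 4x3 image sub-sequences
combined_per_group = [np.vstack([text, img]) for img in images]

integrated = all_to_all_exchange(combined_per_group)
print(integrated[1].shape)  # (15, 1)
```

Here `integrated[1]` corresponds to the unit in processing unit group 2: it holds the second column of every combined matrix, stacked in the order group 1, group 2, group 3.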

[0242] In step 850, the part corresponding to the text sequence in the integrated matrix is determined as the text subsequence, and the part corresponding to the image subsequence in the integrated matrix is determined as the image subsequence.

[0243] In the processing device, each of the third number of processing units is assigned the text sequence, and the integrated matrix is determined from the combination matrices of the processing units with the same index in the second number of processing unit groups. Therefore, the portion of the integrated matrix corresponding to the text sequence needs to be divided into the second number of parts, and one of these parts is taken as the text subsequence. The size of the text subsequence is M1×(N / L). For example, in Figure 7, the integrated matrix of size 15×1 includes three 1×1 parts corresponding to the text sequence. Since the text parts of the first submatrix and the third submatrices are the same, only one of the 1×1 parts is taken as the text subsequence.

[0244] The image sequence is divided into a third number of image sub-sequences, which are assigned to the third number of processing units. Therefore, the parts corresponding to the image sub-sequences in the first submatrix and in each of the third submatrices of the integrated matrix are different, and all parts of the integrated matrix corresponding to image sub-sequences are determined as the image sub-sequence. The size of the image sub-sequence is (M2*L)×(N / L). For example, in Figure 7, the portion of the 15×1 integrated matrix corresponding to the image sub-sequences gives an image sub-sequence of size 12×1.

[0245] In addition, the third number of processing units in the processing device of this embodiment all store complete text sequences. Therefore, the text sequence of size M1×N can be directly divided into M1×(N / L) sequences by column, and the sequence in the M1×(N / L) sequences that corresponds to the processing unit group in which the processing unit is located is taken as the text subsequence.

[0246] The combined subsequence is determined based on the text subsequence and the image subsequence, and the integrated matrix can be used directly as the combined subsequence. For example, in the processing unit shown in Figure 9, the 15×1 integrated matrix can be used as the combined subsequence.
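Step 850 can be sketched as a slicing operation on the integrated matrix. The helper `split_integrated` is a hypothetical name, and the sketch assumes the block layout described above: L consecutive blocks of M1+M2 rows, each with its text rows first. The values are placeholders.

```python
import numpy as np

def split_integrated(integrated, m1, m2, num_groups):
    """Extract the text and image sub-sequences from the integrated matrix."""
    # One block of (M1+M2) rows per source processing unit group.
    blocks = integrated.reshape(num_groups, m1 + m2, -1)
    if not np.allclose(blocks[:, :m1], blocks[0, :m1]):
        raise ValueError("text rows are expected to be identical in every block")
    text_sub = blocks[0, :m1]                                 # M1 x (N/L), one copy
    image_sub = blocks[:, m1:].reshape(num_groups * m2, -1)   # (M2*L) x (N/L)
    return text_sub, image_sub

# 15x1 integrated matrix: text value 0.2 followed by four image rows per group.
integrated = np.concatenate([[0.2] + [g] * 4 for g in (1.0, 2.0, 3.0)]).reshape(15, 1)
text_sub, image_sub = split_integrated(integrated, m1=1, m2=4, num_groups=3)
print(text_sub.shape, image_sub.shape)  # (1, 1) (12, 1)
```

The integrated matrix itself remains available as the combined subsequence, so the slicing does not need to modify it.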

[0247] In the embodiments described in steps 810 to 850, the M1×N text sequence and the M2×N image sub-sequences are combined to form a (M1+M2)×N combination matrix. Since multiple attention heads in the attention mechanism are assigned to processing units with the same index in different processing unit groups, to obtain the calculation results of the text sequence and image sub-sequences in other processing units from the other attention heads, the columns of the combination matrix are divided according to a second number. The first sub-matrix corresponding to the processing unit group in the second number of sub-matrixes is retained, and the second sub-matrix corresponding to other processing unit groups is sent to the processing units with the same index in those processing unit groups. The processing unit also receives the third sub-matrix sent by the processing units with the same index in other processing unit groups, and integrates the first sub-matrix and each third sub-matrix to obtain the integrated matrix, thereby determining the text sub-sequence, image sub-sequence, and combination sub-sequence. This facilitates determining the results of the text sequence and image sub-sequence of the processing unit with the same index in other processing unit groups on the attention head of that processing unit, improving the accuracy of the attention results.

[0248] The above is a general description of steps 810 to 850. Since steps 810 and 830 to 850 have been described in sufficient detail above, the specific implementation process of step 820 will be described in detail below.

[0249] In step 820, the columns of the combined matrix are divided according to the second number to obtain the second number of (M1+M2)×(N / L) submatrices.

[0250] In one embodiment, each processing unit includes hc attention heads. Referring to Figure 10, step 820 includes:

[0251] Step 1010: Divide the columns of the combined matrix according to the second number such that the number of columns of each (M1+M2)×(N / L) submatrix is an integer multiple of hc, so that each (M1+M2)×(N / L) submatrix is evenly distributed to the hc attention heads for execution.

[0252] Step 1010 will be described in detail below.

[0253] In step 1010, the columns of the combined matrix are divided according to the second number such that the number of columns of each (M1+M2)×(N / L) submatrix is an integer multiple of hc, so that each (M1+M2)×(N / L) submatrix is evenly distributed to the hc attention heads for execution.

[0254] hc represents the number of attention heads in each processing unit of the processing device. During attention calculation, a matrix is evenly distributed column-wise among the attention heads for processing. Therefore, the number of columns of each submatrix obtained by partitioning the combined matrix should be an integer multiple of the number of attention heads in the processing unit, i.e., N / L is an integer multiple of hc. For example, if hc is 3, then N / L is 3 or a multiple of 3. It follows that the processing unit shown in Figure 9, where N / L is 1, contains only one attention head.

[0255] In the embodiment of step 1010 above, the number of columns of the sub-matrix obtained by dividing the combined matrix is ​​an integer multiple of the number of attention heads in the processing unit. This makes it easier to evenly distribute the text subsequence, image subsequence, and combined subsequence to multiple attention heads for execution, which is beneficial to the normal performance of attention operations and improves the efficiency of multimodal data attention operations.
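The divisibility condition of step 1010 can be checked with a few lines of plain Python. The function name `columns_per_head` and the second set of numbers (N = 18, L = 2, hc = 3) are assumptions chosen for illustration; the first call uses the Figure 9 sizes.

```python
def columns_per_head(n_cols, num_groups, heads_per_unit):
    """Return how many columns each attention head receives after step 1010."""
    if n_cols % num_groups != 0:
        raise ValueError("N must be divisible by L")
    sub_cols = n_cols // num_groups              # N / L columns per submatrix
    if sub_cols % heads_per_unit != 0:
        raise ValueError("N / L must be an integer multiple of hc")
    return sub_cols // heads_per_unit

print(columns_per_head(3, 3, 1))   # 1  (Figure 9 sizes: N=3, L=3, hc=1)
print(columns_per_head(18, 2, 3))  # 3
```

A configuration such as N = 9, L = 3, hc = 2 is rejected, because the 3 columns of each submatrix cannot be shared evenly between 2 attention heads.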

[0256] Detailed description of step 330

[0257] In step 330, the processing unit performs attention calculations on the target query corresponding to the combined subsequence of the processing unit, and the target key-value pairs corresponding to the text subsequence and image subsequence of the processing unit, respectively, to obtain the first attention sub-result and the second attention sub-result.

[0258] In one embodiment, the target key-value pair includes a target key and a target value. Referring to Figure 11, step 330 includes:

[0259] Step 1110: Through the processing unit, attention is calculated based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the text subsequence of the processing unit, to obtain the first attention sub-result;

[0260] Step 1120: Through the processing unit, attention is calculated based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the image subsequence of the processing unit, to obtain the second attention sub-result.

[0261] Steps 1110 and 1120 are described in detail below.

[0262] In step 1110, attention is calculated by the processing unit based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the text subsequence of the processing unit, to obtain the first attention sub-result.

[0263] The target key and target value are important components of the attention mechanism. The target key is the Key (K) in the attention mechanism, and the target value is the Value (V). Referring to Figure 5A, K1-T is the target key corresponding to text subsequence 1 of processing unit 1, and V1-T is the target value corresponding to text subsequence 1 of processing unit 1. K2-T is the target key corresponding to text subsequence 1 of processing unit 2, and V2-T is the target value corresponding to text subsequence 1 of processing unit 2.

[0264] The first attention sub-result corresponds to the text subsequence and is obtained by performing attention operations based on the target query corresponding to the combined subsequence and the target key and target value corresponding to the text subsequence. Referring to Figure 5A, in processing unit 1, attention operations are performed based on the target query Q1, the target key K1-T, and the target value V1-T to obtain the first attention sub-result in processing unit 1. Similarly, attention operations are performed based on the target query Q2, the target key K2-T, and the target value V2-T to obtain the first attention sub-result in processing unit 2.

[0265] In step 1120, attention is calculated by the processing unit based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the image subsequence of the processing unit, to obtain the second attention sub-result.

[0266] The second attention sub-result corresponds to the image subsequence and is obtained by performing attention operations based on the target query corresponding to the combined subsequence and the target key and target value corresponding to the image subsequence. Referring to Figure 5B, K1-I is the target key corresponding to image subsequence 1 of processing unit 1, and V1-I is the target value corresponding to image subsequence 1 of processing unit 1. K2-I is the target key corresponding to image subsequence 2 of processing unit 2, and V2-I is the target value corresponding to image subsequence 2 of processing unit 2. Attention operations are performed based on target query Q1, target key K1-I, and target value V1-I to obtain the second attention sub-result in processing unit 1. Attention operations are performed based on target query Q2, target key K2-I, and target value V2-I to obtain the second attention sub-result in processing unit 2.

[0267] In the embodiments of steps 1110 and 1120 above, the target key-value pair includes the target key and the target value. The first attention sub-result is determined based on the target query corresponding to the combined sub-sequence, the target key and the target value corresponding to the text sub-sequence, and the second attention sub-result is determined based on the target query corresponding to the combined sub-sequence, the target key and the target value corresponding to the image sub-sequence. This makes the first and second attention sub-results more accurate and improves the accuracy of the attention results corresponding to each processing unit.

[0268] The above is a general description of steps 1110 and 1120. The specific implementation process of steps 1110 and 1120 will be described in detail below.

[0269] In step 1110, attention is calculated by the processing unit based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the text subsequence of the processing unit, to obtain the first attention sub-result.

[0270] In one embodiment, the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix. Referring to Figure 12, step 1110 includes:

[0271] Step 1210: Transpose the first target key matrix to obtain the first transpose matrix of N×M1;

[0272] Step 1220: Multiply the target query matrix and the first transpose matrix to obtain the first product matrix of M×M1;

[0273] Step 1230: Based on the first product matrix and the first target value matrix, generate an M×N first attention matrix as the first attention sub-result.

[0274] Steps 1210 to 1230 are described in detail below.

[0275] The target query matrix is the matrix representation of the target query corresponding to the combined subsequence, where M is the number of rows in the target query matrix and N is the number of columns. Referring to Figure 13, if the target query matrix is a 4×4 matrix, then both M and N are equal to 4.

[0276] The first target key matrix is the matrix representation of the target key corresponding to the text subsequence, where M1 is the number of rows and N is the number of columns. The first target value matrix is the matrix representation of the target value corresponding to the text subsequence. The first target value matrix and the first target key matrix have the same size, M1×N. Referring to Figure 13, if both the first target key matrix and the first target value matrix are 1×4 matrices, then M1 is 1 and N is 4.

[0277] It should be noted that, in order to perform attention operations, the target query matrix, the first target key matrix, and the first target value matrix have the same number of columns.

[0278] In step 1210, the first target key matrix is transposed to obtain the first transpose matrix of N×M1.

[0279] The first transpose matrix is the transpose of the first target key matrix. The number of rows in the first transpose matrix is equal to the number of columns in the first target key matrix, and the number of columns in the first transpose matrix is equal to the number of rows in the first target key matrix. Therefore, the first transpose matrix has N rows and M1 columns.

[0280] Referring to Figure 13, if the first target key matrix is a 1×4 matrix, then the first transpose matrix is a 4×1 matrix.

[0281] In step 1220, the target query matrix and the first transpose matrix are multiplied to obtain the first product matrix of M×M1.

[0282] The first product matrix is the result of multiplying the target query matrix and the first transpose matrix. The number of rows in the first product matrix is the same as the number of rows in the target query matrix, both being M. The number of columns in the first product matrix is the same as the number of columns in the first transpose matrix, both being M1.

[0283] Referring to Figure 13, if the target query matrix is a 4×4 matrix and the first transpose matrix is a 4×1 matrix, then the first product matrix is a 4×1 matrix.

[0284] In step 1230, based on the first product matrix and the first target value matrix, an M×N first attention matrix is generated as the first attention sub-result.

[0285] The first attention matrix is the matrix representation of the first attention sub-result. It is determined based on the first product matrix and the first target value matrix. The size of the first attention matrix is the same as the size of the target query matrix, with M rows and N columns.

[0286] Referring to Figure 13, the first attention matrix can be obtained based on the 4×1 first product matrix and the 1×4 first target value matrix. The first attention matrix has the same size as the target query matrix, both being 4×4 matrices.
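The matrix sizes in steps 1210 to 1230 can be traced with simple shape arithmetic. The sketch below tracks only dimensions, not values; the function name `trace_shapes` is an assumption made for illustration.

```python
def trace_shapes(m, m1, n):
    """Follow the matrix sizes through steps 1210 to 1230."""
    query, key, value = (m, n), (m1, n), (m1, n)
    key_t = (key[1], key[0])            # step 1210: transpose, N x M1
    product = (query[0], key_t[1])      # step 1220: (M x N) times (N x M1) gives M x M1
    attention = (product[0], value[1])  # step 1230: (M x M1) times (M1 x N) gives M x N
    return {"transpose": key_t, "product": product, "attention": attention}

print(trace_shapes(4, 1, 4))
# {'transpose': (4, 1), 'product': (4, 1), 'attention': (4, 4)}
```

With the Figure 13 sizes (M = 4, M1 = 1, N = 4) the result has the same size as the target query matrix, as stated above.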

[0287] The embodiments of steps 1210 to 1230 above process the M×N target query matrix, the M1×N first target key matrix, and the M1×N first target value matrix to determine the M×N first attention sub-result. Throughout the calculation, the sizes of the intermediate matrices ensure that the first attention matrix is the same size as the target query matrix, thereby improving the accuracy of the first attention sub-result.

[0288] The above is a general description of steps 1210 to 1230. Since steps 1210 and 1220 have been described in sufficient detail above, only the specific implementation process of step 1230 will be described in detail below.

[0289] In step 1230, based on the first product matrix and the first target value matrix, an M×N first attention matrix is generated as the first attention sub-result.

[0290] In one embodiment, each processing unit includes hc attention heads, each attention head has a size of hs, and N is the product of hc and hs. Referring to Figure 14, step 1230 includes:

[0291] Step 1410: Based on the size hs of the attention head, scale the first product matrix to obtain the first scaling matrix;

[0292] Step 1420: Perform exponential normalization on the first scaling matrix to obtain the first exponentially normalized matrix;

[0293] Step 1430: Generate a first attention matrix based on the first exponentially normalized matrix and the first target value matrix, as the first attention sub-result.

[0294] Steps 1410 to 1430 are described in detail below.

[0295] hc is the number of attention heads in each processing unit of the processing device, and hs is the dimension of each attention head. Assuming the query matrix corresponding to a single attention head has size M×hs, the attention matrix output by that attention head also has size M×hs. If the attention matrices output by the hc attention heads are combined, the size of the query matrix and of the attention matrix is M×(hc*hs), so N is the product of hc and hs. For example, if a processing unit includes 4 attention heads and each attention head has a size of 3, then both the target query matrix and the first attention matrix have 12 columns.
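The relation N = hc × hs can be illustrated by splitting a matrix column-wise into per-head blocks and merging them again. This is a minimal NumPy sketch; `split_heads` is a hypothetical helper, and the row count M = 5 is an arbitrary choice for the example.

```python
import numpy as np

def split_heads(matrix, num_heads):
    """Split an M x N matrix column-wise into num_heads blocks of size M x hs."""
    if matrix.shape[1] % num_heads != 0:
        raise ValueError("N must equal hc * hs")
    return np.split(matrix, num_heads, axis=1)

hc, hs, m = 4, 3, 5  # m is an arbitrary row count chosen for illustration
query = np.arange(m * hc * hs, dtype=float).reshape(m, hc * hs)
heads = split_heads(query, hc)   # four 5x3 blocks, one per attention head
merged = np.hstack(heads)        # combining per-head blocks restores M x (hc*hs)
print(heads[0].shape, merged.shape)  # (5, 3) (5, 12)
```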

[0296] In step 1410, the first product matrix is scaled based on the size hs of the attention head to obtain the first scaling matrix.

[0297] The first scaling matrix is obtained by scaling the first product matrix based on the size of the attention head. Scaling the first product matrix in this way does not change the number of rows and columns of the matrix. Referring to Figure 13, the first scaling matrix and the first product matrix are both 4×1 matrices.

[0298] Furthermore, the first product matrix is scaled according to the square root of hs. Specifically, the first product matrix divided by the square root of hs is used as the first scaling matrix.

[0299] In step 1420, the first scaling matrix is exponentially normalized to obtain the first exponentially normalized matrix.

[0300] Exponential normalization is performed with the normalized exponential function, i.e., the softmax function. As an activation function, the softmax function transforms the activation values of the previous layer, here the first scaling matrix, into a probability distribution. This transformation helps the model express its preference between text and images more clearly.

[0301] The first exponentially normalized matrix is the normalized result of the first scaling matrix, and it has the same size as the first scaling matrix. Referring to Figure 13, the first scaling matrix is a 4×1 matrix, and the first exponentially normalized matrix obtained by exponentially normalizing it is also a 4×1 matrix.

[0302] In step 1430, a first attention matrix is generated based on the first exponentially normalized matrix and the first target value matrix, and serves as the first attention sub-result.

[0303] The first attention matrix is the matrix representation of the first attention sub-result. It is determined by the product of the first exponentially normalized matrix and the first target value matrix. Therefore, the number of rows in the first attention matrix is equal to the number of rows in the first exponentially normalized matrix, and the number of columns in the first attention matrix is equal to the number of columns in the first target value matrix. Referring to Figure 13, if the first exponentially normalized matrix is a 4×1 matrix and the first target value matrix is a 1×4 matrix, then the first attention matrix is a 4×4 matrix.

[0304] It should be noted that the formula for calculating the first attention matrix can be expressed as:

[0305] $$O = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{hs}}\right)V$$

[0306] Where O is the first attention matrix, Q is the target query matrix corresponding to the combined subsequence, K is the first target key matrix corresponding to the text subsequence, V is the first target value matrix corresponding to the text subsequence, $K^{T}$ is the first transpose matrix, $QK^{T}$ is the first product matrix, $QK^{T}/\sqrt{hs}$ is the first scaling matrix, and $\mathrm{softmax}(QK^{T}/\sqrt{hs})$ is the first exponentially normalized matrix.
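Steps 1210 to 1230 and 1410 to 1430 can be put together in a short NumPy sketch. It assumes a single attention head, so that hs equals N, and it assumes the softmax is taken row-wise over the key dimension, which is the usual convention but is not stated explicitly in the description. The function name `first_attention` and the random inputs are illustrative only.

```python
import numpy as np

def first_attention(q, k, v, head_size):
    """Compute the first attention matrix and the softmax weights."""
    k_t = k.T                                    # step 1210: N x M1
    product = q @ k_t                            # step 1220: M x M1
    scaled = product / np.sqrt(head_size)        # step 1410: divide by sqrt(hs)
    # Subtracting the row maximum leaves the softmax unchanged and avoids overflow.
    shifted = scaled - scaled.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=1, keepdims=True)  # step 1420: row-wise softmax
    return weights @ v, weights                  # step 1430: M x N

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 4))  # target query matrix, M=4, N=4
k = rng.normal(size=(1, 4))  # first target key matrix, M1=1
v = rng.normal(size=(1, 4))  # first target value matrix
out, weights = first_attention(q, k, v, head_size=4)
print(out.shape, weights.shape)  # (4, 4) (4, 1)
```

With a single text row (M1 = 1), each softmax row contains one entry equal to 1, so every row of the output equals the value row. With more key rows the weights spread across them while each row still sums to 1.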

[0307] In the embodiments of steps 1410 to 1430 above, after the first product matrix and the first target value matrix are determined, the first product matrix is scaled according to the size of the attention head in the processing unit, and the resulting first scaling matrix is exponentially normalized. This enables more effective extraction of key information in the text subsequence and the combined subsequence, improves the accuracy of the first attention sub-result, and thus improves the efficiency of data processing.

[0308] In step 1120, attention is calculated by the processing unit based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the image subsequence of the processing unit, to obtain the second attention sub-result.

[0309] In one embodiment, the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix. Referring to Figure 15, step 1120 includes:

[0310] Step 1510: Transpose the second target key matrix to obtain an N×M2 second transpose matrix;

[0311] Step 1520: Multiply the target query matrix and the second transpose matrix to obtain the second product matrix of M×M2;

[0312] Step 1530: Based on the second product matrix and the second target value matrix, generate an M×N second attention matrix as the second attention sub-result.

[0313] Steps 1510 to 1530 are described in detail below.

[0314] The target query matrix is the matrix representation of the target query corresponding to the combined subsequence, where M is the number of rows in the target query matrix and N is the number of columns. Referring to Figure 16, if the target query matrix is a 4×4 matrix, then both M and N are equal to 4.

[0315] The second target key matrix is the matrix representation of the target key corresponding to the image subsequence, where M2 is the number of rows and N is the number of columns. The second target value matrix is the matrix representation of the target value corresponding to the image subsequence. The second target value matrix and the second target key matrix have the same size, M2×N. Referring to Figure 16, if both the second target key matrix and the second target value matrix are 3×4 matrices, then M2 is 3 and N is 4.

[0316] It should be noted that, in order to perform attention operations, the target query matrix, the second target key matrix, and the second target value matrix have the same number of columns.

[0317] In step 1510, the second target key matrix is ​​transposed to obtain an N×M2 second transpose matrix.

[0318] The second transpose matrix is ​​the transpose of the second target key matrix. The number of rows in the second transpose matrix is ​​equal to the number of columns in the second target key matrix, and the number of columns in the second transpose matrix is ​​equal to the number of rows in the second target key matrix. Therefore, the second transpose matrix has N rows and M2 columns.

[0319] Referring to Figure 16, if the second target key matrix is a 3×4 matrix, then the second transpose matrix is a 4×3 matrix.

[0320] In step 1520, the target query matrix and the second transpose matrix are multiplied to obtain the second product matrix of M×M2.

[0321] The second product matrix is the result of multiplying the target query matrix by the second transpose matrix. The number of rows in the second product matrix equals the number of rows in the target query matrix, namely M. The number of columns in the second product matrix equals the number of columns in the second transpose matrix, namely M2.

[0322] Referring to Figure 16, if the target query matrix is a 4×4 matrix and the second transpose matrix is a 4×3 matrix, then the second product matrix is a 4×3 matrix.

[0323] In step 1530, based on the second product matrix and the second target value matrix, an M×N second attention matrix is ​​generated as the second attention sub-result.

[0324] The second attention matrix is ​​the matrix representation of the second attention sub-result. It is determined based on the second product matrix and the second target value matrix. The size of the second attention matrix is ​​the same as that of the target query matrix, with M rows and N columns.

[0325] Referring to Figure 16, the second attention matrix is obtained from the 4×3 second product matrix and the 3×4 second target value matrix. The second attention matrix has the same size as the target query matrix; both are 4×4 matrices.

[0326] The embodiments of steps 1510 to 1530 above process the M×N target query matrix, the M2×N second target key matrix, and the M2×N second target value matrix to determine the M×N second attention sub-result. Because the dimensions of each intermediate matrix are fixed at every step, the second attention matrix has the same size as the target query matrix, which improves the accuracy of the second attention sub-result.

[0327] The above is a general description of steps 1510 to 1530. Since steps 1510 and 1520 have been described in sufficient detail above, only the specific implementation process of step 1530 will be described in detail below.

[0328] In step 1530, based on the second product matrix and the second target value matrix, an M×N second attention matrix is ​​generated as the second attention sub-result.

[0329] In one embodiment, each processing unit includes hc attention heads, each attention head has a size of hs, and N is the product of hc and hs. Referring to Figure 17, step 1530 includes:

[0330] Step 1710: Based on the size hs of the attention head, scale the second product matrix to obtain the second scaling matrix;

[0331] Step 1720: Perform exponential normalization on the second scaling matrix to obtain the second exponentially normalized matrix;

[0332] Step 1730: Based on the second exponential normalization matrix and the second target value matrix, generate the second attention matrix as the second attention sub-result.

[0333] Steps 1710 to 1730 are described in detail below.

[0334] `hc` is the number of attention heads in each processing unit of the processing device, and `hs` is the dimension of each attention head. Assuming the query matrix corresponding to a single attention head has size M × `hs`, the attention matrix output by that attention head also has size M × `hs`. If the attention matrices output by the `hc` attention heads are concatenated, the query matrix and the attention matrix both have size M × (`hc` × `hs`), so N is the product of `hc` and `hs`. For example, if a processing unit includes 4 attention heads and each attention head has a size of 3, then the target query matrix and the second attention matrix both have 12 columns.
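The relationship N = hc × hs can be illustrated with a minimal NumPy sketch. The helper names `split_heads` and `merge_heads`, the row count M = 5, and the assumption that each head occupies a contiguous block of hs columns are illustrative choices, not details given in the disclosure.

```python
import numpy as np

def split_heads(x, hc, hs):
    """Split an M x (hc*hs) matrix into hc per-head matrices of size M x hs."""
    m, n = x.shape
    assert n == hc * hs, "N must equal hc * hs"
    # Result shape (hc, M, hs); head h holds columns h*hs .. (h+1)*hs - 1
    # (contiguous-column layout is an assumption of this sketch).
    return x.reshape(m, hc, hs).transpose(1, 0, 2)

def merge_heads(heads):
    """Concatenate hc per-head M x hs matrices back into one M x (hc*hs) matrix."""
    hc, m, hs = heads.shape
    return heads.transpose(1, 0, 2).reshape(m, hc * hs)

hc, hs, M = 4, 3, 5                      # 4 heads of size 3; M = 5 is arbitrary
q = np.arange(M * hc * hs, dtype=float).reshape(M, hc * hs)
heads = split_heads(q, hc, hs)           # shape (4, 5, 3)
merged = merge_heads(heads)              # shape (5, 12), identical to q
```

With 4 heads of size 3, the merged matrix has 12 columns, matching the example in the paragraph above.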

[0335] In step 1710, based on the size hs of the attention head, the second product matrix is ​​scaled to obtain the second scaling matrix.

[0336] The second scaling matrix is obtained by scaling the second product matrix based on the size of the attention head. Scaling does not change the number of rows or columns of the matrix. Referring to Figure 16, the second scaling matrix and the second product matrix are both 4×3 matrices.

[0337] Furthermore, the second product matrix is scaled according to the square root of hs. Specifically, each element of the second product matrix is divided by the square root of hs, and the result is used as the second scaling matrix.

[0338] In step 1720, the second scaling matrix is ​​exponentially normalized to obtain the second exponentially normalized matrix.

[0339] Exponential normalization is performed with the normalized exponential function, i.e., the softmax function. As an activation function, softmax transforms the activation values of the previous layer, here the second scaling matrix, into a probability distribution. This transformation helps the model express its preference between text and images more clearly.

[0340] The second exponentially normalized matrix is the normalized result of the second scaling matrix, and it has the same size as the second scaling matrix. Referring to Figure 16, the second scaling matrix is a 4×3 matrix, and the second exponentially normalized matrix obtained by exponentially normalizing it is also a 4×3 matrix.

[0341] In step 1730, a second attention matrix is ​​generated based on the second exponential normalization matrix and the second target value matrix, which serves as the second attention sub-result.

[0342] The second attention matrix is the matrix representation of the second attention sub-result. It is determined by the product of the second exponentially normalized matrix and the second target value matrix. Therefore, the number of rows in the second attention matrix equals the number of rows in the second exponentially normalized matrix, and the number of columns in the second attention matrix equals the number of columns in the second target value matrix. Referring to Figure 16, the second exponentially normalized matrix is a 4×3 matrix, the second target value matrix is a 3×4 matrix, and the second attention matrix is a 4×4 matrix.

[0343] It should be noted that the formula for calculating the second attention matrix can be expressed as:

[0344] $$O = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{hs}}\right)V$$

[0345] Where O is the second attention matrix, Q is the target query matrix corresponding to the combined subsequence, K is the second target key matrix corresponding to the image subsequence, V is the second target value matrix corresponding to the image subsequence, $K^{T}$ is the second transpose matrix, $QK^{T}$ is the second product matrix, $QK^{T}/\sqrt{hs}$ is the second scaling matrix, and $\mathrm{softmax}(QK^{T}/\sqrt{hs})$ is the second exponentially normalized matrix.
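Steps 1510 to 1530 together with steps 1710 to 1730 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the inputs are random, and a single attention head with hs = 4 is assumed so that the shapes match the 4×4 and 3×4 matrices of the Figure 16 example.

```python
import numpy as np

def softmax_rows(x):
    """Exponential normalization applied independently to each row."""
    shifted = x - x.max(axis=1, keepdims=True)   # subtract row max for stability
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def attention_sub_result(q, k, v, hs):
    """Compute softmax(Q K^T / sqrt(hs)) V and check the shape of each step."""
    m, n = q.shape
    m2 = k.shape[0]
    assert k.shape == (m2, n) and v.shape == (m2, n)
    k_t = k.T                        # step 1510: N x M2 transpose matrix
    product = q @ k_t                # step 1520: M x M2 product matrix
    scaled = product / np.sqrt(hs)   # step 1710: scaling matrix, still M x M2
    weights = softmax_rows(scaled)   # step 1720: exponentially normalized matrix
    out = weights @ v                # step 1730: M x N attention matrix
    assert out.shape == (m, n)
    return out, weights

rng = np.random.default_rng(0)       # hypothetical random inputs
q = rng.normal(size=(4, 4))          # target query matrix, M = 4, N = 4
k = rng.normal(size=(3, 4))          # second target key matrix, M2 = 3
v = rng.normal(size=(3, 4))          # second target value matrix
o, w = attention_sub_result(q, k, v, hs=4)
```

Each row of `w` sums to 1, which is the probability distribution mentioned for the softmax step, and `o` has the same 4×4 size as the target query matrix.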

[0346] It should be noted that the target key obtained by the processing unit from the image subsequences of other processing units is an M2×N third target key matrix, and the target value is an M2×N third target value matrix. The third target key matrix is transposed to obtain an N×M2 third transpose matrix. The target query matrix is then multiplied by the third transpose matrix to obtain an M×M2 third product matrix. Based on the size hs of the attention head in the processing unit, the third product matrix is scaled to obtain a third scaling matrix, which is exponentially normalized to obtain a third exponentially normalized matrix. Finally, based on the third exponentially normalized matrix and the third target value matrix, a third attention matrix is generated as the third attention sub-result.

[0347] In the embodiments of steps 1710 to 1730 above, after determining the second product matrix and the second target value matrix, the second product matrix is ​​scaled by the size of the attention head in the processing unit, and the scaled second scaling matrix is ​​exponentially normalized. This enables more effective extraction of key information in image subsequences and combined subsequences, improves the accuracy of the second attention result, and thus improves the efficiency of data processing.

[0348] Detailed description of step 350

[0349] In step 350, an attention result is generated based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

[0350] In one embodiment, the processing device includes a second number of processing unit groups, and the first attention sub-result, the second attention sub-result, and the third attention sub-result are, respectively, an M×N first attention matrix, an M×N second attention matrix, and an M×N third attention matrix. Referring to Figure 18, step 350 includes:

[0351] Step 1810: Superimpose the first attention matrix, the second attention matrix, and the third attention matrix to obtain the M×N superimposed attention matrix corresponding to the processing unit;

[0352] Step 1820: Generate attention results based on the superimposed attention matrix corresponding to the processing units with the same sequence number in the second number of processing unit groups.

[0353] Steps 1810 and 1820 are described in detail below.

[0354] The second number is the number of processing unit groups involved in data processing within the processing device. Referring to Figure 4, the processing device includes three processing unit groups, so the second number is 3. In Figure 5B, processing unit group 1 and processing unit group 2 jointly participate in the data processing of the target application data, so the second number is 2.

[0355] The first attention matrix is ​​the matrix representation of the first attention sub-result. The second attention matrix is ​​the matrix representation of the second attention sub-result. The third attention matrix is ​​the matrix representation of the third attention sub-result. All attention matrices obtained after attention operations have the same size as the target query matrix; therefore, the first, second, and third attention matrices have the same size, and M is the number of rows and N is the number of columns in the three attention matrices.

[0356] In step 1810, the first attention matrix, the second attention matrix, and the third attention matrix are superimposed to obtain the M×N superimposed attention matrix corresponding to the processing unit.

[0357] The superimposed attention matrix is ​​the result of superimposing the first attention matrix, the second attention matrix, and the third attention matrix. The superposition is calculated based on the corresponding elements of the first, second, and third attention matrices, and the final superimposed attention matrix has the same size as the first, second, and third attention matrices, which is M×N.

[0358] The superimposed attention matrix can be the sum of the first attention matrix, the second attention matrix, and the third attention matrix, or it can be a weighted sum of the first attention matrix, the second attention matrix, and the third attention matrix. Since each processing unit in the processing unit group is assigned the same attention head, the superimposed attention matrix is ​​usually the sum of the first attention matrix, the second attention matrix, and the third attention matrix.

[0359] Referring to Figure 19, the superimposed attention matrix is the sum of the first attention matrix, the second attention matrix, and the third attention matrix. Each element of the superimposed attention matrix is the sum of the elements at the corresponding positions in the first, second, and third attention matrices. For example, the element 16 in the first row and third column of the superimposed attention matrix is obtained by summing the element 6 in the first row and third column of the first attention matrix, the element 4 in the first row and third column of the second attention matrix, and the element 6 in the first row and third column of the third attention matrix.
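The superposition in step 1810 can be sketched as an element-wise sum, with an optional weighted variant. In the sketch below, only the three entries in the first row and third column (6, 4, and 6) come from the Figure 19 example; every other value, the 2×3 size, and the `superimpose` helper are hypothetical.

```python
import numpy as np

def superimpose(o1, o2, o3, weights=(1.0, 1.0, 1.0)):
    """Element-wise (optionally weighted) sum of three equally sized matrices."""
    assert o1.shape == o2.shape == o3.shape
    w1, w2, w3 = weights
    return w1 * o1 + w2 * o2 + w3 * o3

# Hypothetical 2 x 3 attention matrices; only the [0, 2] entries follow Figure 19.
o1 = np.array([[1.0, 2.0, 6.0], [0.0, 1.0, 2.0]])
o2 = np.array([[3.0, 0.0, 4.0], [1.0, 1.0, 1.0]])
o3 = np.array([[2.0, 5.0, 6.0], [4.0, 0.0, 3.0]])

total = superimpose(o1, o2, o3)      # plain sum, the usual case in the text
```

Here `total[0, 2]` equals 6 + 4 + 6 = 16, and `total` keeps the M×N size of its inputs. Passing a different `weights` tuple gives the weighted-sum variant.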

[0360] Referring to Figure 5B, in processing unit 1, the first attention matrix O1-1, the second attention matrix O1-2, and the third attention matrix O1-3 are superimposed to obtain attention matrix O1. In processing unit 2, the first attention matrix O2-1, the second attention matrix O2-2, and the third attention matrix O2-3 are superimposed to obtain attention matrix O2.

[0361] In step 1820, attention results are generated based on the superimposed attention matrix corresponding to the processing units with the same sequence number in the second number of processing unit groups.

[0362] If the processing device consists of only one processing unit group, the superimposed attention matrix can be used directly as the attention result of the processing unit. For example, the processing device shown in Figure 5A includes one processing unit group, so attention matrix O1 can be used directly as the attention result of processing unit 1 and attention matrix O2 as the attention result of processing unit 2.

[0363] If the processing device includes a second number of processing unit groups, the superimposed attention matrix includes not only the attention values ​​of the text sequence and image sequence of the current processing unit at the attention head of that processing unit, but also the attention values ​​of the text sequence and image sequence corresponding to the processing units with the same index in other processing unit groups within the second number of processing unit groups at the attention head of that processing unit. Therefore, it is necessary to generate the attention result based on the superimposed attention matrix corresponding to the processing units with the same index in the second number of processing unit groups.

[0364] Referring to Figure 5B, for processing unit GPU0, whose sequence number in processing unit group 1 is 1, attention result 0 is determined based on superimposed attention matrix 0 and superimposed attention matrix 1, where superimposed attention matrix 1 corresponds to processing unit GPU1, whose sequence number in processing unit group 2 is 1.

[0365] In the embodiments of steps 1810 and 1820 above, the superimposed attention matrix is ​​a superposition of the matrices corresponding to the first attention sub-result, the second attention sub-result, and the third attention sub-result. Its size is the same as the matrix size of each attention sub-result, which improves the accuracy of the superimposed attention matrix. Furthermore, the superimposed attention matrix includes not only the attention values ​​of the text sequence and image sequence of the current processing unit at the attention head of that processing unit, but also the attention values ​​of the text sequence and image sequence corresponding to the processing units with the same index in other processing unit groups within the second number of processing unit groups at the attention head of that processing unit. Therefore, the attention result generated based on the superimposed attention matrix corresponding to the processing units with the same index in the second number of processing unit groups can reflect the values ​​of the data in that processing unit at other attention heads, improving the accuracy of the attention result and thus improving the accuracy of the multimodal data processing result.

[0366] The above is a general description of steps 1810 and 1820. Since step 1810 has been described in sufficient detail above, the specific implementation process of step 1820 will be described in detail below.

[0367] In step 1820, attention results are generated based on the superimposed attention matrix corresponding to the processing units with the same sequence number in the second number of processing unit groups.

[0368] In one embodiment, the second number is L. Referring to Figure 20, step 1820 includes:

[0369] Step 2010: Divide the rows of the superimposed attention matrix according to the second number to obtain the second number of (M / L)×N attention submatrices;

[0370] Step 2020: Among the second number of attention submatrices, retain the first attention submatrix corresponding to the current processing unit group, and send each second attention submatrix corresponding to another processing unit group to the processing unit with the same sequence number in that processing unit group.

[0371] Step 2030: Receive the third attention sub-matrix sent by the processing unit with the same sequence number in other processing unit groups, and integrate the retained first attention sub-matrix and the received third attention sub-matrix into a fourth attention sub-matrix of (M / L)×NL.

[0372] Step 2040: Determine the fourth attention submatrix as the attention result.

[0373] Steps 2010 to 2040 are described in detail below.

[0374] In step 2010, the rows of the superimposed attention matrix are divided according to a second number to obtain a second number of (M / L)×N attention submatrices.

[0375] Corresponding to steps 810 and 850, the superimposed attention matrix includes not only the attention values ​​of the text sequence and image sequence of the current processing unit at the attention head of that processing unit, but also the attention values ​​of the text sequence and image sequence corresponding to the processing units with the same sequence number in other processing unit groups within the second number of processing unit groups at the attention head of that processing unit. Therefore, the rows of the superimposed attention matrix need to be divided according to the second number.

[0376] The attention submatrix is ​​the result of partitioning the superimposed attention. If the superimposed attention matrix is ​​an M×N matrix, then the number of rows in the attention submatrix is ​​M / L, and the number of columns is N. The second number of submatrices corresponds one-to-one with the second number of processing unit groups.

[0377] Referring to Figure 21, the processing device includes processing unit group 1 and processing unit group 2. In the second processing unit of processing unit group 2, the superimposed attention matrix is a 6×2 matrix. The superimposed attention matrix is divided to obtain two 3×2 attention submatrices. One attention submatrix corresponds to processing unit group 1 and the other corresponds to processing unit group 2.

[0378] In step 2020, among the second number of attention submatrices, the first attention submatrix corresponding to the current processing unit group is retained, and each second attention submatrix corresponding to another processing unit group is sent to the processing unit with the same sequence number in that processing unit group.

[0379] The first attention submatrix is the attention submatrix, among the second number of submatrices, that corresponds to the processing unit group in which the current processing unit is located. A second attention submatrix is an attention submatrix, among the second number of submatrices, that corresponds to another processing unit group. The first and second attention submatrices have the same size, but there is one first attention submatrix, whereas the number of second attention submatrices is the second number minus 1.

[0380] Since the first attention submatrix corresponds to the processing unit group where the current processing unit is located, the first attention submatrix is ​​retained. The second attention submatrix is ​​then sent to the processing unit with the same sequence number as the current processing unit in the processing unit group corresponding to the second attention submatrix.

[0381] Referring to Figure 21, the processing device comprises two processing unit groups, and the current processing unit is the second processing unit in processing unit group 2, i.e., its sequence number is 2. In the second processing unit of processing unit group 2, the first attention submatrix is retained, and the second attention submatrix is sent to the second processing unit in processing unit group 1.

[0382] In step 2030, the third attention sub-matrix sent by the processing unit with the same sequence number in other processing unit groups is received, and the retained first attention sub-matrix and the received third attention sub-matrix are integrated into a fourth attention sub-matrix of (M / L)×NL.

[0383] A third attention submatrix is an attention submatrix received from the processing unit in another processing unit group that has the same sequence number as the current processing unit. The number of third attention submatrices is the second number minus 1. A third attention submatrix represents the result of the text sequence and image subsequence of the current processing unit on the attention head of the processing unit that generated that third attention submatrix.

[0384] Referring to Figure 21, the current processing unit is the second processing unit in processing unit group 2. Since there are two processing unit groups in the processing device, the processing unit receives only one third attention submatrix, which comes from the second processing unit in processing unit group 1. If the second number were 9, the number of third attention submatrices would be 8.

[0385] The fourth attention submatrix is the result of combining the first attention submatrix with all of the third attention submatrices. The fourth attention submatrix has M / L rows and NL columns. It represents the numerical values of the text sequence and image subsequence of the current processing unit on the complete attention head.

[0386] Referring to Figure 21, the second processing unit of processing unit group 2 receives the third attention submatrix from the second processing unit of processing unit group 1, and concatenates the 3×2 first attention submatrix and the 3×2 third attention submatrix side by side, so that each row is extended, to obtain the 3×4 fourth attention submatrix.

[0387] In step 2040, the fourth attention submatrix is ​​determined as the attention result.

[0388] The fourth attention submatrix represents the numerical values of the text sequence and image subsequence of the current processing unit on the complete attention head. Therefore, the fourth attention submatrix can be used as the attention result of the processing unit. For example, in Figure 4, the attention result of the second processing unit in processing unit group 2 is a 3×4 fourth attention submatrix.

[0389] In the embodiments of steps 2010 to 2040 described above, the superimposed attention matrix includes the attention values, at the attention head of the current processing unit, of the processing units with the same sequence number in the second number of processing unit groups. Therefore, the superimposed attention matrix is divided by rows, and the division result yields the first attention submatrix corresponding to the current processing unit group and the second attention submatrices corresponding to the other processing unit groups. This ensures that the processing units with the same sequence number in each processing unit group can obtain the values of the text sequence and image subsequence on the complete attention head. Furthermore, the processing unit also receives the third attention submatrices sent by the processing units with the same sequence number in the other processing unit groups, so that the attention result includes the values of the text sequence and image subsequence of the current processing unit on the complete attention head, improving the accuracy of the attention result and thus the efficiency of attention calculation.
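Steps 2010 to 2040 can be simulated in a single process as a row split followed by an exchange and a column-wise concatenation. The sketch below rests on several assumptions that the text does not fix: M is divisible by L, the received blocks are concatenated in order of source group index, and the send and receive operations are replaced by plain list lookups. The name `exchange_and_merge` and the input values are hypothetical.

```python
import numpy as np

def exchange_and_merge(superimposed):
    """superimposed[g] is the M x N superimposed attention matrix held by the
    processing unit with a given sequence number in group g (L groups total).
    Returns the (M/L) x (N*L) fourth attention submatrix for each group."""
    L = len(superimposed)
    m, n = superimposed[0].shape
    assert m % L == 0, "this sketch assumes M is divisible by L"
    # Step 2010: each unit splits its matrix by rows into L blocks of (M/L) x N.
    blocks = [np.split(mat, L, axis=0) for mat in superimposed]
    results = []
    for g in range(L):
        # Steps 2020 and 2030: the unit in group g keeps its own block g and
        # receives block g from the same-numbered unit of every other group.
        pieces = [blocks[src][g] for src in range(L)]
        # Blocks are placed side by side (ordering by source group is assumed).
        results.append(np.concatenate(pieces, axis=1))
    return results

# Two groups, each holding a 6 x 2 superimposed matrix, as in the Figure 21 sizes.
a = np.arange(12, dtype=float).reshape(6, 2)
b = a + 100.0
results = exchange_and_merge([a, b])   # step 2040: results[g] is the attention result
```

Each entry of `results` is a 3×4 matrix, matching the (M / L)×NL size stated for the fourth attention submatrix when M = 6, N = 2, and L = 2.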

[0390] Communication between processing units in the processing unit group

[0391] Each processing unit in the processing unit group is assigned the same attention head, but the image subsequences corresponding to each processing unit in the processing unit group are different. Therefore, to determine the values ​​of the text sequence and the image subsequence on the complete attention head, it is necessary to communicate with other processing units in the processing unit group to obtain the image subsequences from the other processing units in the first number of processing units in the processing unit group.

[0392] In one embodiment, referring to Figure 22, prior to step 340, the data processing method provided in this embodiment of the disclosure further includes:

[0393] Step 2210: In the first cycle, the image subsequence corresponding to the j-th processing unit is sent to the (j+1)-th processing unit in the processing unit group through the j-th processing unit.

[0394] Step 2220: In the i-th cycle, the image subsequence received by the j-th processing unit in the i-1-th cycle is sent to the j+1-th processing unit in the processing unit group through the j-th processing unit, where i is an integer greater than 1 and less than the first number.

[0395] Steps 2210 and 2220 are described in detail below.

[0396] In step 2210, during the first cycle, the image subsequence corresponding to the j-th processing unit is sent to the (j+1)-th processing unit in the processing unit group through the j-th processing unit.

[0397] The j-th processing unit is a processing unit in the processing unit group; therefore, j is an integer greater than or equal to 1 and less than or equal to the first number. For example, the processing unit groups shown in Figure 5A and Figure 5B each contain only two processing units, so the range of j is [1, 2]. For the processing unit group shown in Figure 23, the range of j is [1, 3].

[0398] In the first cycle, each processing unit in the processing unit group sends its own image subsequence to the next processing unit, and the last processing unit, whose sequence number equals the first number, sends its image subsequence to the first processing unit. Referring to Figure 23, in the first cycle, processing unit 1 sends image subsequence 1 to processing unit 2, processing unit 2 sends image subsequence 2 to processing unit 3, and processing unit 3 sends image subsequence 3 to processing unit 1.

[0399] In step 2220, during the i-th period, the j-th processing unit sends the image sub-sequence received by the j-th processing unit in the (i-1)-th period to the (j+1)-th processing unit in the processing unit group, where i is an integer greater than 1 and less than the first number.

[0400] Because i is an integer greater than 1 and less than the first number, a processing unit needs a number of cycles equal to the first number minus 1 to acquire the image subsequences of all other processing units in the processing unit group. For example, if the first number is 13, the processing unit needs 12 cycles.

[0401] In the i-th cycle, each processing unit in the processing unit group sends the image subsequence it received in the (i-1)-th cycle to the next processing unit, and the last processing unit, whose sequence number equals the first number, sends to the first processing unit. Referring to Figure 23, in the first cycle, processing unit 1 receives image subsequence 3 from processing unit 3, processing unit 2 receives image subsequence 1 from processing unit 1, and processing unit 3 receives image subsequence 2 from processing unit 2. Therefore, in the second cycle, processing unit 1 sends image subsequence 3 to processing unit 2, processing unit 2 sends image subsequence 1 to processing unit 3, and processing unit 3 sends image subsequence 2 to processing unit 1.

[0402] It should be noted that the processing units in the processing unit group communicate in a peer-to-peer (P2P) manner, meaning that a processing unit is only allowed to receive data from one processing unit and can only send data to one processing unit.

[0403] The embodiments of steps 2210 and 2220 above employ a ring communication strategy for communication between processing units in the processing unit group. In the first cycle, the processing unit sends its own image subsequence to the next processing unit. In the i-th cycle, the processing unit sends the image subsequence received in the i-1-th cycle to the next processing unit to ensure that the processing unit can receive multiple image subsequences corresponding to the processing unit group, thereby improving the accuracy of the third attention result and thus improving the accuracy of multimodal attention operation.
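The ring communication of steps 2210 and 2220 can be simulated in a few lines of Python. This is a single-process sketch: the function name `ring_exchange` and the string labels are hypothetical, units are indexed from 0 in the code (the text numbers them from 1), and real transfers between processing units are replaced by list operations.

```python
def ring_exchange(subsequences):
    """Simulate ring communication among the units of one processing unit group.

    subsequences[j] is the image subsequence owned by unit j (0-based here).
    Returns, for each unit, the subsequences it received, in cycle order.
    """
    n = len(subsequences)
    received = [[] for _ in range(n)]
    in_flight = list(subsequences)       # what each unit sends in the current cycle
    for _ in range(n - 1):               # (first number - 1) cycles in total
        # Unit j sends to unit (j + 1) % n, so unit j receives from (j - 1) % n.
        arrived = [in_flight[(j - 1) % n] for j in range(n)]
        for j in range(n):
            received[j].append(arrived[j])
        in_flight = arrived              # next cycle forwards what was just received
    return received

received = ring_exchange(["img1", "img2", "img3"])
# received[0] == ["img3", "img2"]  (processing unit 1 in the text's numbering)
# received[1] == ["img1", "img3"]  (processing unit 2)
# received[2] == ["img2", "img1"]  (processing unit 3)
```

After two cycles, each of the three units holds the image subsequences of both other units, which is consistent with the Figure 23 walk-through.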

[0404] In one embodiment, corresponding to steps 2210 and 2220 and referring to Figure 24, step 340 includes:

[0405] Step 2410: In the first to the i-th cycles, the image subsequence of the (j-1)-th processing unit in the processing unit group is received through the j-th processing unit.

[0406] Step 2410 will be described in detail below.

[0407] In step 2410, during the first to the i-th cycles, the image subsequence sent by the (j-1)-th processing unit in the processing unit group is received through the j-th processing unit.

[0408] In each of the first to the i-th cycles, each processing unit in the processing unit group receives the image subsequence sent by the previous processing unit. Referring to Figure 23, in the first cycle, processing unit 1 receives image subsequence 3 from processing unit 3, processing unit 2 receives image subsequence 1 from processing unit 1, and processing unit 3 receives image subsequence 2 from processing unit 2. In the second cycle, processing unit 1 receives image subsequence 2 from processing unit 3, processing unit 2 receives image subsequence 3 from processing unit 1, and processing unit 3 receives image subsequence 1 from processing unit 2.

[0409] In the embodiment of step 2410 above, the processing unit receives an image subsequence from the previous processing unit in each of the first to the i-th cycles. Therefore, after the last cycle ends, the processing unit has received the image subsequences of all other processing units among the first number of processing units in the processing unit group. This ensures the correctness and integrity of the data received by the processing unit, thereby improving the correctness of the multimodal attention operation.

[0410] Target query and calculation of target key-value pairs

[0411] The first attention sub-result is determined based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pairs corresponding to the text sub-sequence. The second attention sub-result is determined based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pairs corresponding to the image sub-sequence. The calculation process of the target query and target key-value pairs is described in detail below.

[0412] In the embodiments of steps 1110 and 1120 above, the target key-value pair includes a target key and a target value. Through the processing unit, attention calculation is performed based on the target query corresponding to the combined subsequence of the processing unit, the target key and target value corresponding to the text subsequence of the processing unit, and the target key and target value corresponding to the image subsequence, to obtain the first attention sub-result and the second attention sub-result, respectively. In this case, in one embodiment, the combined subsequence is an M×N original combination matrix, and the text subsequence is an M1×N original text matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are, respectively, an M1×N first target key matrix and an M1×N first target value matrix. Referring to Figure 25, prior to step 1110, the data processing method provided in this embodiment of the disclosure further includes:

[0413] Step 2510: Based on the original M×N combination matrix and the N×N query transformation matrix, generate the M×N target query matrix;

[0414] Step 2520: Based on the original text matrix M1×N and the key transformation matrix N×N, generate the first target key matrix M1×N;

[0415] Step 2530: Based on the original text matrix M1×N and the value transformation matrix N×N, generate the first target value matrix M1×N.

[0416] Steps 2510 to 2530 are described in detail below.

[0417] The original combination matrix is the matrix representation of the combined subsequence. The original combination matrix has M rows and N columns. Referring to Figure 26, if the original combination matrix is a 4×6 matrix, then M is 4 and N is 6.

[0418] The original text matrix is the matrix representation of the text subsequence. The original text matrix has M1 rows and N columns. Referring to Figure 26, if the original text matrix is a 1×6 matrix, then M1 is 1 and N is 6.

[0419] The target query matrix is the matrix representation of the target query. The number of rows in the target query matrix is the same as the number of rows in the original combined matrix, which is M.

[0420] The first target key matrix is a matrix representation of the target keys of the text subsequence, and the first target value matrix is a matrix representation of the target values of the text subsequence. The first target key matrix and the first target value matrix have the same dimensions.

[0421] In step 2510, an M×N target query matrix is generated based on the original M×N combination matrix and the N×N query transformation matrix.

[0422] The query transformation matrix is a transformation matrix between the original combined matrix and the target query matrix. Based on the query transformation matrix, the original combined matrix can be transformed into the target query matrix.

[0423] The target query matrix is the product of the original combination matrix and the query transformation matrix. If the original combination matrix is an M×N matrix and the query transformation matrix is an N×N matrix, then the target query matrix is also an M×N matrix. Referring to Figure 26, if the original combination matrix is a 4×6 matrix and the query transformation matrix is a 6×6 matrix, then the target query matrix is a 4×6 matrix.

[0424] It should be noted that if the original combination matrix is an M×K matrix and the query transformation matrix is a K×N matrix, then the generated target query matrix will still be an M×N matrix.

[0425] In step 2520, based on the original text matrix M1×N and the key transformation matrix N×N, the first target key matrix M1×N is generated.

[0426] The key transformation matrix is a transformation matrix between the original text matrix and the first target key matrix. Based on the key transformation matrix, the original text matrix can be transformed into the first target key matrix.

[0427] The first target key matrix is the product of the original text matrix and the key transformation matrix. If the original text matrix is an M1×N matrix and the key transformation matrix is an N×N matrix, then the first target key matrix is an M1×N matrix. Referring to Figure 26, if the original text matrix is a 1×6 matrix and the key transformation matrix is a 6×6 matrix, then the first target key matrix is a 1×6 matrix.

[0428] In step 2530, based on the original text matrix M1×N and the value transformation matrix N×N, the first target value matrix M1×N is generated.

[0429] The value transformation matrix is a transformation matrix between the original text matrix and the first target value matrix. Based on the value transformation matrix, the original text matrix can be transformed into the first target value matrix.

[0430] The first target value matrix is the product of the original text matrix and the value transformation matrix. If the original text matrix is an M1×N matrix and the value transformation matrix is an N×N matrix, then the first target value matrix is an M1×N matrix. Referring to Figure 26, if the original text matrix is a 1×6 matrix and the value transformation matrix is a 6×6 matrix, then the first target value matrix is a 1×6 matrix.

[0431] The embodiments of steps 2510 to 2530 above process the combined subsequence and the text subsequence through the query transformation matrix, the key transformation matrix, and the value transformation matrix to obtain a target query matrix, a first target key matrix, and a first target value matrix that all have the same number of columns. This facilitates determining the first attention sub-result from the target query matrix, the first target key matrix, and the first target value matrix, thereby improving the accuracy of the first attention sub-result and of the attention operation.
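Steps 2510 to 2530 can be illustrated with the 4×6 and 1×6 example sizes used above. The following is a minimal sketch, assuming NumPy and randomly generated placeholder data; the variable names (`combined`, `text`, `W_q`, `W_k`, `W_v`) are hypothetical and do not appear in the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

M, M1, N = 4, 1, 6                        # example sizes taken from Figure 26
combined = rng.standard_normal((M, N))    # original combined matrix (placeholder data)
text = rng.standard_normal((M1, N))       # original text matrix (placeholder data)

# N x N transformation matrices; random stand-ins for learned weights.
W_q = rng.standard_normal((N, N))
W_k = rng.standard_normal((N, N))
W_v = rng.standard_normal((N, N))

Q = combined @ W_q    # step 2510: M x N target query matrix
K1 = text @ W_k       # step 2520: M1 x N first target key matrix
V1 = text @ W_v       # step 2530: M1 x N first target value matrix
```

All three results keep N columns, which is what allows the target query matrix to be multiplied by the transpose of the first target key matrix in the later attention calculation.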

[0432] In one embodiment, the combined subsequence is an M×N original combined matrix, and the image subsequence is an M2×N original image matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix. Referring to Figure 27, prior to step 1110, the data processing method provided in this embodiment of the disclosure further includes:

[0433] Step 2710: Based on the original M×N combination matrix and the N×N query transformation matrix, generate the M×N target query matrix;

[0434] Step 2720: Based on the original image matrix of M2×N and the key transformation matrix of N×N, generate the second target key matrix of M2×N;

[0435] Step 2730: Generate a second target value matrix of M2×N based on the original image matrix of M2×N and the value transformation matrix of N×N.

[0436] Steps 2710 to 2730 are described in detail below.

[0437] The original combination matrix is the matrix representation of the combined subsequence. The original combination matrix has M rows and N columns. Referring to Figure 28, if the original combination matrix is a 4×6 matrix, then M is 4 and N is 6.

[0438] The original image matrix is a matrix representation of the image subsequence. The original image matrix has M2 rows and N columns. Referring to Figure 28, if the original image matrix is a 3×6 matrix, then M2 is 3 and N is 6.

[0439] The target query matrix is the matrix representation of the target query. The number of rows in the target query matrix is the same as the number of rows in the original combined matrix, which is M.

[0440] The second target key matrix is a matrix representation of the target keys of the image subsequence, and the second target value matrix is a matrix representation of the target values of the image subsequence. The second target key matrix and the second target value matrix have the same dimensions.

[0441] In step 2710, an M×N target query matrix is generated based on the original M×N combination matrix and the N×N query transformation matrix.

[0442] The query transformation matrix is a transformation matrix between the original combined matrix and the target query matrix. Based on the query transformation matrix, the original combined matrix can be transformed into the target query matrix.

[0443] The target query matrix is the product of the original combination matrix and the query transformation matrix. If the original combination matrix is an M×N matrix and the query transformation matrix is an N×N matrix, then the target query matrix is also an M×N matrix. Referring to Figure 28, if the original combination matrix is a 4×6 matrix and the query transformation matrix is a 6×6 matrix, then the target query matrix is a 4×6 matrix.

[0444] It should be noted that if the original combination matrix is an M×K matrix and the query transformation matrix is a K×N matrix, then the generated target query matrix will still be an M×N matrix.

[0445] In step 2720, a second target key matrix of M2×N is generated based on the original image matrix of M2×N and the key transformation matrix of N×N.

[0446] The key transformation matrix is the transformation matrix between the original image matrix and the second target key matrix. Based on the key transformation matrix, the original image matrix can be transformed into the second target key matrix.

[0447] The second target key matrix is the product of the original image matrix and the key transformation matrix. If the original image matrix is an M2×N matrix and the key transformation matrix is an N×N matrix, then the second target key matrix is an M2×N matrix. Referring to Figure 28, if the original image matrix is a 3×6 matrix and the key transformation matrix is a 6×6 matrix, then the second target key matrix is a 3×6 matrix.

[0448] In step 2730, a second target value matrix of M2×N is generated based on the original image matrix of M2×N and the value transformation matrix of N×N.

[0449] The value transformation matrix is a transformation matrix between the original image matrix and the second target value matrix. Based on the value transformation matrix, the original image matrix can be transformed into the second target value matrix.

[0450] The second target value matrix is the product of the original image matrix and the value transformation matrix. If the original image matrix is an M2×N matrix and the value transformation matrix is an N×N matrix, then the second target value matrix is an M2×N matrix. Referring to Figure 28, if the original image matrix is a 3×6 matrix and the value transformation matrix is a 6×6 matrix, then the second target value matrix is a 3×6 matrix.

[0451] The embodiments of steps 2710 to 2730 above process the combined subsequence and the image subsequence through the query transformation matrix, the key transformation matrix, and the value transformation matrix to obtain a target query matrix, a second target key matrix, and a second target value matrix that all have the same number of columns. This facilitates determining the second attention sub-result from the target query matrix, the second target key matrix, and the second target value matrix, thereby improving the accuracy of the second attention sub-result and of the attention operation.
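One reason the text and image key matrices can be computed separately is that a matrix product acts on each row independently. The sketch below checks this numerically under two assumptions that the disclosure does not state explicitly: that the combined matrix is formed by stacking the text rows above the image rows, and that the same key transformation matrix is applied to both modalities. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

M1, M2, N = 1, 3, 6                       # example sizes from Figures 26 and 28
text = rng.standard_normal((M1, N))       # original text matrix (placeholder data)
image = rng.standard_normal((M2, N))      # original image matrix (placeholder data)
combined = np.vstack([text, image])       # assumed row order: text first, then image

W_k = rng.standard_normal((N, N))         # shared key transformation matrix (assumption)

K1 = text @ W_k                           # first target key matrix, M1 x N
K2 = image @ W_k                          # second target key matrix, M2 x N
K_full = combined @ W_k                   # keys computed on the combined matrix

# Projecting the pieces separately gives the same rows as projecting the whole.
same = np.allclose(K_full, np.vstack([K1, K2]))
```

Under these assumptions `same` is true, so keeping the text keys local and fetching only image keys from other processing units loses no information relative to projecting the combined matrix.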

[0452] In one embodiment, referring to Figure 29, this embodiment of the disclosure sets up a second number of processing unit groups, each containing a first number of processing units; the product of the first number and the second number is a third number. The image sequence is divided into the third number of image sub-sequences, and the text sequence and a single image sub-sequence are assigned to each processing unit. Unlike step 630, this embodiment directly uses the text sequence as the text sub-sequence, directly uses the assigned image sub-sequence as the image sub-sequence, and determines the combined sub-sequence based on the text sequence and the image sub-sequence. The target key-value pairs of the text sub-sequence, the target key-value pairs of the image sub-sequence, and the target query of the combined sub-sequence are then calculated. Next, the target query and the target key-value pairs are transformed through mutual communication between processing units with the same sequence number in the second number of processing unit groups. Specifically, in Figure 29, GPU0 and GPU2 are processing units with the same sequence number in different processing unit groups. If the target query Q is an (M1+M2)×N matrix and the second number is L, the target query is divided by column into the second number of (M1+M2)×(N / L) sub-matrices. The sub-matrix corresponding to processing unit group 1 is retained, and the other sub-matrix is sent to GPU2. The sub-matrix received from GPU2 is then integrated with the retained sub-matrix to obtain an (M1+M2)L×(N / L) integrated matrix, and the target query is updated based on this integrated matrix.
Similarly, the target keys of the M1×N text sub-sequence are updated to an M1L×(N / L) matrix, the target values of the M1×N text sub-sequence are updated to an M1L×(N / L) matrix, the target keys of the M2×N image sub-sequence are updated to an M2L×(N / L) matrix, and the target values of the M2×N image sub-sequence are updated to an M2L×(N / L) matrix. An attention operation is performed on the updated target query corresponding to the combined sub-sequence in the processing unit and the target key-value pairs corresponding to the text sub-sequence to obtain the first attention sub-result. An attention operation is then performed on the updated target query and the target key-value pairs corresponding to the image sub-sequence to obtain the second attention sub-result. The target key-value pairs corresponding to the updated image sub-sequences of the other processing units in the same processing unit group are obtained, and an attention operation is performed on these target key-value pairs and the target query to obtain the third attention sub-result. The first, second, and third attention sub-results are then superimposed to obtain a superimposed attention matrix. The attention result is generated based on the superimposed attention matrices of the processing units with the same sequence number in the second number of processing unit groups. For example, if superimposed attention matrix 0 and superimposed attention matrix 1 are both (M1+M2)L×(N / L) matrices, then for GPU0, superimposed attention matrix 0 and superimposed attention matrix 1 are combined row-wise to obtain an (M1+M2)L×N matrix. The (M1+M2)L×N matrix is then divided into the second number of (M1+M2)×N matrices, and the matrix corresponding to the current processing unit group is taken as attention result 0. Attention result 0 is an (M1+M2)×N matrix, the same size as the combined sub-sequence before the update.
Using multiple processing units simultaneously improves the processing efficiency of the attention operation. In addition, because each processing unit stores the entire text sub-sequence, there is no need to obtain text sub-sequences from other processing units during the attention operation; only the image sub-sequences stored by other processing units need to be obtained, which reduces the time of the multimodal operation and improves the efficiency of the attention operation.
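The column split and row integration between same-index processing units can be sketched as an in-process simulation. The code below is illustrative only: the function names are hypothetical, Python lists stand in for the processing unit groups, no real device communication takes place, and the order in which received sub-matrices are stacked is an assumption. The inverse function follows the row-split and column-integration form described for the second generation unit later in the text.

```python
import numpy as np

def scatter_columns_gather_rows(per_group):
    """per_group[g] is the M x N matrix held by the same-index unit of group g.

    Each unit splits its matrix into L column blocks, keeps block g, and
    receives block g from the other groups. Returns L matrices of shape
    (M * L) x (N // L).
    """
    L = len(per_group)
    blocks = [np.split(x, L, axis=1) for x in per_group]   # requires N % L == 0
    # Rows are stacked in source-group order (an assumption).
    return [np.vstack([blocks[src][g] for src in range(L)]) for g in range(L)]

def scatter_rows_gather_columns(per_group):
    """Inverse exchange: split by rows, integrate received blocks by columns."""
    L = len(per_group)
    blocks = [np.split(x, L, axis=0) for x in per_group]   # requires rows % L == 0
    return [np.hstack([blocks[src][g] for src in range(L)]) for g in range(L)]


rng = np.random.default_rng(2)
M, N, L = 4, 6, 2                       # M = M1 + M2; two processing unit groups
originals = [rng.standard_normal((M, N)) for _ in range(L)]

integrated = scatter_columns_gather_rows(originals)    # each is (M*L) x (N/L)
restored = scatter_rows_gather_columns(integrated)     # each is M x N again
```

In this sketch each unit ends up with all rows from both groups but only N / L of the columns, and applying the inverse exchange restores the original M×N layout.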

[0453] In one embodiment, CogVideoX is a family of text-to-video models that includes variants such as CogVideoX-2B and CogVideoX-5B. CogVideoX-5B generates videos with better quality and visual effects, but its structure is more complex and it has more parameters. Figure 30A is a schematic diagram comparing different data processing schemes for CogVideoX-2B on the same type of processing unit. The vertical axis of Figure 30A represents the time required by each data processing scheme to generate a 49-frame video at 720×480 resolution through 50 iterations. As the parallelism increases, i.e., as the number of processing units increases, the inference latency decreases continuously. The optimal configuration sets up three processing unit groups on the processing device, each consisting of two processing units. Compared with the single-processing-unit method, this achieves a 3.18× speedup, requiring only 0.66 seconds per iteration, or 33 seconds in total for the 50 iterations, to generate 49 frames, i.e., about 6 seconds of video.

[0454] Figure 30B is a schematic diagram comparing different data processing schemes for CogVideoX-5B on the same type of processing unit. Figure 30A and Figure 30B use the same processing unit. As with CogVideoX-2B, the inference latency decreases continuously as the parallelism increases. Compared with the single-processing-unit method, the optimal scheme improves inference speed by up to 4.09 times.

[0455] Figure 31A and Figure 31B are schematic diagrams comparing different data processing schemes for CogVideoX-2B and CogVideoX-5B on the same type of processing unit. The processing unit used for Figure 31A and Figure 31B differs from that used for Figure 30A and Figure 30B. Nevertheless, it is equally evident that as the parallelism increases, i.e., as the number of processing units increases, the inference latency decreases continuously. Compared with methods using a single processing unit, the solution of this disclosure achieves a several-fold increase in inference speed.

[0456] It is understood that although the steps in the above flowcharts are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated in this embodiment, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the above flowcharts may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages in other steps.

[0457] It should be noted that in various specific embodiments of this application, when processing is required based on data related to the characteristics of the target object, such as target object attribute information or a set of attribute information, the permission or consent of the target object will be obtained first. Furthermore, the collection, use, and processing of this data will comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require obtaining target object attribute information, separate permission or consent from the target object will be obtained through pop-ups or redirection to a confirmation page. Only after obtaining the target object's separate permission or consent will the necessary target object-related data for the normal operation of the embodiments of this application be obtained.

[0458] Device description of embodiments of this disclosure

[0459] Figure 32 is a schematic diagram of the structure of a data processing apparatus provided in an embodiment of the present disclosure. Referring to Figure 32, the data processing apparatus 3200 is applied to a processing device including a group of processing units, the group of processing units including a first number of processing units, and the data processing apparatus 3200 includes:

[0460] The first acquisition unit 3210 is used to acquire target application data, which includes a text sequence and an image sequence.

[0461] The first generation unit 3220 is configured to generate, based on each processing unit, a text subsequence for processing by the processing unit based on the text sequence, an image subsequence for processing by the processing unit based on the image sequence, and a combined subsequence based on the text subsequence and the image subsequence;

[0462] Attention calculation unit 3230 is used to perform attention calculation based on the target query corresponding to the combined subsequence of the processing unit and the target key-value pairs corresponding to the text subsequence and image subsequence of the processing unit, respectively, to obtain the first attention sub-result and the second attention sub-result.

[0463] The second acquisition unit 3240 is used to acquire image sub-sequences of other processing units in the first number of processing units through the processing unit, and to perform attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pair corresponding to the acquired image sub-sequence to obtain a third attention sub-result;

[0464] The second generation unit 3250 is used to generate an attention result based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

[0465] Optionally, the processing device includes a second number of processing unit groups, and the product of the first number and the second number is a third number;

[0466] The first generation unit 3220 is specifically used for:

[0467] Divide the image sequence into a third number of image subsequences;

[0468] Assign the text sequence to each processing unit in the processing device, and assign the third number of image sub-sequences to the processing units in the processing device respectively, one image sub-sequence per processing unit;

[0469] Through communication between processing units with the same sequence number in different processing unit groups, the text sequences and image subsequences assigned to processing units with the same sequence number in different processing unit groups are converted into text subsequences and image subsequences.

[0470] Optionally, the text sequence is an M1×N matrix, the image sub-sequence is an M2×N matrix, and the second number is L;

[0471] The first generating unit 3220 is also specifically used for:

[0472] The text sequence and image sub-sequence allocated to the processing unit are combined into a combination matrix of (M1+M2)×N;

[0473] Divide the columns of the combined matrix according to the second number to obtain the second number of (M1+M2)×(N / L) submatrices;

[0474] Retain the first submatrix corresponding to the processing unit group in the second number of submatrixes, and send the second submatrix corresponding to the other processing unit groups in the second number of submatrixes to the processing unit with the same sequence number in the other processing unit groups;

[0475] Receive the third sub-matrix sent by the processing unit with the same sequence number in other processing unit groups, and integrate the retained first sub-matrix and the received third sub-matrix into an integrated matrix of (M1+M2)L×(N / L).

[0476] The part corresponding to the text sequence in the integrated matrix is determined as the text subsequence, and the part corresponding to the image subsequence in the integrated matrix is determined as the image subsequence.

[0477] Optionally, each processing unit contains hc attention heads;

[0478] The first generation unit 3220 is further configured to: divide the columns of the combined matrix according to the second number, such that the number of columns of each (M1+M2)×(N / L) submatrix is an integer multiple of hc, thereby distributing the (M1+M2)×(N / L) submatrix evenly to the hc attention heads for execution.

[0479] Optionally, the target key-value pair includes a target key and a target value;

[0480] Attention computing unit 3230 is specifically used for:

[0481] Through the processing unit, attention is calculated based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the text subsequence of the processing unit, and the first attention sub-result is obtained;

[0482] Through the processing unit, attention is calculated based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the image subsequence of the processing unit, and the second attention sub-result is obtained.

[0483] Optionally, the combined subsequence is an M×N original combined matrix, and the text subsequence is an M1×N original text matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix;

[0484] The data processing device also includes:

[0485] The third generation unit is used to generate an M×N target query matrix based on the original M×N combination matrix and the N×N query transformation matrix.

[0486] The fourth generation unit is used to generate the first target key matrix of M1×N based on the original text matrix of M1×N and the key transformation matrix of N×N.

[0487] The fifth generation unit is used to generate the first target value matrix M1×N based on the original text matrix M1×N and the value transformation matrix N×N.

[0488] Optionally, the combined subsequence is an M×N original combined matrix, and the image subsequence is an M2×N original image matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix;

[0489] The data processing device also includes:

[0490] The sixth generation unit is used to generate an M×N target query matrix based on the original M×N combination matrix and the N×N query transformation matrix.

[0491] The seventh generation unit is used to generate a second target key matrix of M2×N based on the original image matrix of M2×N and the key transformation matrix of N×N;

[0492] The eighth generation unit is used to generate a second target value matrix of M2×N based on the original image matrix of M2×N and the value transformation matrix of N×N.

[0493] Optionally, the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix;

[0494] The attention computing unit 3230 is also specifically used for:

[0495] Transpose the first target key matrix to obtain the first transpose matrix of N×M1;

[0496] Multiply the target query matrix and the first transpose matrix to obtain the first product matrix of M×M1;

[0497] Based on the first product matrix and the first target value matrix, an M×N first attention matrix is generated as the first attention sub-result.

[0498] Optionally, each processing unit contains hc attention heads, each attention head having a size of hs, and N being the product of hc and hs;

[0499] The attention computing unit 3230 is also specifically used for:

[0500] Based on the size hs of the attention head, the first product matrix is scaled to obtain the first scaling matrix;

[0501] The first scaling matrix is exponentially normalized to obtain the first exponentially normalized matrix;

[0502] Based on the first exponentially normalized matrix and the first target value matrix, a first attention matrix is generated as the first attention sub-result.
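The transpose, product, scaling, and exponential normalization steps listed above can be sketched in NumPy as follows. This is a hedged illustration, not the disclosed implementation: the function name `attention_sub_result` is hypothetical, the scaling is assumed to be division by the square root of hs (the text only says the product matrix is scaled based on hs), exponential normalization is assumed to be a row-wise softmax, and the N columns are assumed to be split into hc contiguous heads of size hs.

```python
import numpy as np

def attention_sub_result(Q, K, V, hc):
    """Multi-head attention of an M x N query against Mk x N keys and values."""
    M, N = Q.shape
    hs = N // hc                                     # head size, N = hc * hs
    # Split the N columns into hc heads: shape (hc, rows, hs).
    q = Q.reshape(M, hc, hs).transpose(1, 0, 2)
    k = K.reshape(-1, hc, hs).transpose(1, 0, 2)
    v = V.reshape(-1, hc, hs).transpose(1, 0, 2)
    # Product with the transposed keys, scaled by sqrt(hs) (assumption).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(hs)  # (hc, M, Mk)
    # Exponential normalization (softmax), shifted by the row max for stability.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    out = weights @ v                                # (hc, M, hs)
    # Merge the heads back into an M x N attention matrix.
    return out.transpose(1, 0, 2).reshape(M, N)


rng = np.random.default_rng(3)
Q = rng.standard_normal((4, 6))     # M x N target query matrix (placeholder data)
K1 = rng.standard_normal((1, 6))    # M1 x N first target key matrix
V1 = rng.standard_normal((1, 6))    # M1 x N first target value matrix
first_attention = attention_sub_result(Q, K1, V1, hc=2)
```

With a single text row (M1 = 1), every normalized weight equals 1, so each row of the resulting M×N matrix equals the single value row; this makes a convenient sanity check for the sketch.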

[0503] Optionally, the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix;

[0504] The attention computing unit 3230 is also specifically used for:

[0505] Transpose the second target key matrix to obtain an N×M2 second transpose matrix;

[0506] Multiply the target query matrix and the second transpose matrix to obtain the second product matrix of M×M2;

[0507] Based on the second product matrix and the second target value matrix, an M×N second attention matrix is generated as the second attention sub-result.

[0508] Optionally, each processing unit contains hc attention heads, each attention head having a size of hs, and N being the product of hc and hs;

[0509] The attention computing unit 3230 is also specifically used for:

[0510] Based on the size hs of the attention head, the second product matrix is scaled to obtain the second scaling matrix;

[0511] The second scaling matrix is exponentially normalized to obtain the second exponentially normalized matrix;

[0512] Based on the second exponentially normalized matrix and the second target value matrix, a second attention matrix is generated as the second attention sub-result.

[0513] Optionally, the data processing apparatus further includes:

[0514] The first sending unit is used to send the image subsequence corresponding to the j-th processing unit to the (j+1)-th processing unit in the processing unit group through the j-th processing unit in the first cycle.

[0515] The second transmitting unit is used to transmit the image subsequence received by the j-th processing unit in the (i-1)-th period to the (j+1)-th processing unit in the processing unit group through the j-th processing unit in the i-th period, where i is an integer greater than 1 and less than the first number.

[0516] Optionally, the second acquisition unit 3240 is further used for:

[0517] In each of the first to i-th cycles, the image subsequence sent by the (j-1)-th processing unit in the processing unit group is received through the j-th processing unit.

[0518] Optionally, the processing device includes a second number of processing unit groups, wherein the first attention sub-result, the second attention sub-result, and the third attention sub-result are respectively an M×N first attention matrix, a second attention matrix, and a third attention matrix;

[0519] The second generation unit 3250 is specifically used for:

[0520] The first attention matrix, the second attention matrix, and the third attention matrix are superimposed to obtain the M×N superimposed attention matrix corresponding to the processing unit;

[0521] Attention results are generated based on the superimposed attention matrices corresponding to the processing units with the same index in the second number of processing unit groups.
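The text does not spell out how the three attention matrices are weighted when they are superimposed. One common way to combine attention computed over separate key-value blocks so that the result equals attention over all blocks together is to carry each block's row maximum and row sum and rescale before adding. The sketch below illustrates that technique as an assumption; it is not stated to be the method of this disclosure, it treats the N columns as a single head for brevity, and all function names are hypothetical.

```python
import numpy as np

def partial_attention(Q, K, V, scale):
    """Unnormalized output, row max, and row sum for one key/value block."""
    s = (Q @ K.T) * scale
    m = s.max(axis=1, keepdims=True)
    p = np.exp(s - m)
    return p @ V, m, p.sum(axis=1, keepdims=True)

def merge(parts):
    """Combine per-block results into attention over the union of the blocks."""
    m_all = np.max(np.hstack([m for _, m, _ in parts]), axis=1, keepdims=True)
    num = sum(o * np.exp(m - m_all) for o, m, _ in parts)
    den = sum(l * np.exp(m - m_all) for _, m, l in parts)
    return num / den

def full_attention(Q, K, V, scale):
    """Reference: softmax attention over all keys at once."""
    s = (Q @ K.T) * scale
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ V


rng = np.random.default_rng(4)
N = 6
scale = 1.0 / np.sqrt(N)
Q = rng.standard_normal((4, N))
# Text block, local image block, and an image block obtained from another unit.
Ks = [rng.standard_normal((r, N)) for r in (1, 3, 3)]
Vs = [rng.standard_normal((r, N)) for r in (1, 3, 3)]

merged = merge([partial_attention(Q, K, V, scale) for K, V in zip(Ks, Vs)])
reference = full_attention(Q, np.vstack(Ks), np.vstack(Vs), scale)
```

Under these assumptions `merged` matches `reference` to floating-point precision, so the three sub-results can be computed independently, in any order, and still reproduce attention over the full set of keys and values.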

[0522] Optionally, the second number is L;

[0523] The second generation unit 3250 is also specifically used for:

[0524] The rows of the superimposed attention matrix are divided according to the second number to obtain the second number of (M / L)×N attention submatrices;

[0525] Retain the first attention submatrix corresponding to the processing unit group in the second number of attention submatrixes, and send the second attention submatrix corresponding to the other processing unit groups in the second number of attention submatrixes to the processing unit with the same sequence number in the other processing unit groups;

[0526] Receive the third attention sub-matrix sent by the processing unit with the same sequence number in other processing unit groups, and integrate the retained first attention sub-matrix and the received third attention sub-matrix into a fourth attention sub-matrix of (M / L)×NL;

[0527] The fourth attention submatrix is ​​determined as the attention result.

[0528] In addition, this disclosure provides a computer device including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the data processing method of the preceding embodiments. Specifically, the computer device can be configured as a terminal, server, etc.

[0529] Figure 33 is a structural block diagram of a computer device in the form of terminal 110. Referring to Figure 33, terminal 110 includes: a radio frequency (RF) circuit 3310, a memory 3315, an input unit 3330, a display unit 3340, a sensor 3350, an audio circuit 3360, a wireless fidelity (WiFi) module 3370, a processor 3380, and a power supply 3390, etc. Those skilled in the art will understand that the structure of terminal 110 shown in Figure 33 does not constitute a limitation on a mobile phone or computer, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

[0530] The RF circuit 3310 can be used to receive and transmit signals during information transmission or calls. In particular, it receives downlink information from the base station and processes it with the processor 3380; in addition, it transmits uplink data to the base station.

[0531] The memory 3315 can be used to store software programs and modules. The processor 3380 executes various functional applications and data processing of the content terminal by running the software programs and modules stored in the memory 3315.

[0532] The input unit 3330 can be used to receive input numeric or character information, and to generate key signal inputs related to the settings and function control of the content terminal. Specifically, the input unit 3330 may include a touch panel 3331 and other input devices 3332.

[0533] Display unit 3340 can be used to display input or provided information, as well as various menus of the content terminal. Display unit 3340 may include display panel 3341.

[0534] Audio circuit 3360, speaker 3361, and microphone 3362 provide an audio interface.

[0535] In this embodiment, the processor 3380 included in the terminal 110 can execute the data processing method of the previous embodiment.

[0536] Figure 34 is a block diagram of a computer device implemented as server 140. The server 140 can vary significantly depending on its configuration or performance, and may include one or more Central Processing Units (CPUs) 3422 (e.g., one or more processors) and memory 3432, and one or more storage media 3430 (e.g., one or more mass storage devices) for storing application programs 3442 or data 3444. The memory 3432 and storage media 3430 can be temporary or persistent storage. The program stored in the storage media 3430 may include one or more modules (not shown in the diagram), each module including a series of instruction operations on the server. Furthermore, the CPU 3422 may be configured to communicate with the storage media 3430 and execute, on the server 140, the series of instruction operations stored in the storage media 3430.

[0537] Server 140 may also include one or more power supplies 3426, one or more wired or wireless network interfaces 3450, one or more input / output interfaces 3458, and / or one or more operating systems 3441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

[0538] The central processing unit 3422 in server 140 can be used to execute the data processing method of the embodiments of this disclosure.

[0539] This disclosure also provides a computer-readable storage medium for storing a computer program for executing the data processing methods of the foregoing embodiments.

[0540] This disclosure also provides a computer program product comprising a computer program. A processor of a computer device reads and executes the computer program, causing the computer device to perform the data processing method described above.

[0541] In this application embodiment, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0542] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in this disclosure and the foregoing drawings are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate, so that the embodiments of this disclosure described herein can be implemented, for example, in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “including,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that includes a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatuses.

[0543] It should be understood that in this disclosure, "at least one item" refers to one or more items, and "more than one item" refers to two or more items. "And / or" is used to describe the relationship between related content, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related content are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0544] It should be understood that in the description of the embodiments disclosed herein, "multiple" means two or more, "greater than", "less than", "exceeding" etc. are understood to exclude the number itself, and "above", "below", "within" etc. are understood to include the number itself.

[0545] In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, indirect coupling or communication connection between apparatuses or units, and may be electrical, mechanical, or other forms.

[0546] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0547] Furthermore, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0548] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0549] It should also be understood that the various implementation methods provided in this disclosure can be combined arbitrarily to achieve different technical effects.

[0550] The above is a detailed description of the embodiments of this disclosure. However, this disclosure is not limited to the above embodiments. Those skilled in the art can make various equivalent modifications or substitutions without departing from the spirit of this disclosure. All such equivalent modifications or substitutions are included within the scope defined by the claims of this disclosure.

Claims

1. A data processing method, characterized in that, The data processing method is applied to a processing device comprising a group of processing units, the group of processing units comprising a first number of processing units, and the data processing method includes: Acquiring target application data, wherein the target application data comprises a text sequence and an image sequence; For each processing unit, generating a text subsequence for processing by the processing unit based on the text sequence, generating an image subsequence for processing by the processing unit based on the image sequence, and generating a combined subsequence based on the text subsequence and the image subsequence; Performing, through the processing unit, attention calculation based on the target query corresponding to the combined subsequence of the processing unit and on the target key-value pairs corresponding respectively to the text subsequence and the image subsequence of the processing unit, to obtain a first attention sub-result and a second attention sub-result; Obtaining, through the processing unit, image sub-sequences of other processing units among the first number of processing units, and performing attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pair corresponding to the obtained image sub-sequence, to obtain a third attention sub-result; Generating an attention result based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

2. The data processing method according to claim 1, characterized in that, The processing device includes a second number of processing unit groups, and the product of the first number and the second number is a third number; The step of generating a text subsequence for processing by the processing unit based on the text sequence and generating an image subsequence for processing by the processing unit based on the image sequence, based on each processing unit, includes: The image sequence is divided into a third number of image sub-sequences; The text sequence is assigned to each processing unit in the processing device, and the third number of image sub-sequences are respectively assigned to each processing unit in the processing device; Through mutual communication between processing units with the same sequence number in different processing unit groups, the text sequence and the image subsequence assigned to the processing units with the same sequence number in different processing unit groups are converted into the text subsequence and the image subsequence.

3. The data processing method according to claim 2, characterized in that, The text sequence is an M1×N matrix, the image sequence is an M2×N matrix, and the second number is L; The step of converting the text sequence and image subsequence assigned to the processing units with the same sequence number in different processing unit groups into the text subsequence and image subsequence through mutual communication includes: The text sequence and the image sub-sequence assigned to the processing unit are combined into an (M1+M2)×N combined matrix; The columns of the combined matrix are divided according to the second number to obtain a second number of (M1+M2)×(N / L) submatrices; The first submatrix corresponding to the processing unit group among the second number of submatrices is retained, and the second submatrices corresponding to the other processing unit groups among the second number of submatrices are sent to the processing units with the same sequence number in the other processing unit groups; The third submatrices sent by the processing units with the same sequence number in the other processing unit groups are received, and the retained first submatrix and the received third submatrices are integrated into an integrated matrix of (M1+M2)L×(N / L); The portion corresponding to the text sequence in the integrated matrix is determined as the text subsequence, and the portion corresponding to the image subsequence in the integrated matrix is determined as the image subsequence.

4. The data processing method according to claim 3, characterized in that, Each processing unit contains hc attention heads; The step of dividing the columns of the combined matrix according to a second number to obtain a second number of (M1+M2)×(N / L) submatrices includes: dividing the columns of the combined matrix according to the second number such that the number of columns of each (M1+M2)×(N / L) submatrix is an integer multiple of hc, so that the columns of each (M1+M2)×(N / L) submatrix are evenly distributed to the hc attention heads for execution.
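
As an illustration only, and not part of the claim language, the column division of claims 3 and 4 can be sketched in NumPy. The function names `split_columns_for_heads` and `exchange_columns`, the list of arrays standing in for the processing units with the same sequence number in each group, and the stacking of received submatrices in group order are assumptions.

```python
import numpy as np

def split_columns_for_heads(combined, L, hc):
    """Split an (M1+M2) x N combined matrix into L column blocks.

    Raises ValueError unless each block's width N/L is an integer
    multiple of hc, the number of attention heads per processing unit.
    """
    n_cols = combined.shape[1]
    if n_cols % L != 0:
        raise ValueError("N must be divisible by L")
    width = n_cols // L
    if width % hc != 0:
        raise ValueError("N / L must be an integer multiple of hc")
    return np.split(combined, L, axis=1)

def exchange_columns(combined_per_group, hc):
    """Simulate the exchange between L groups (hypothetical stand-in).

    Each group keeps its own column block and receives the matching block
    from every other group; blocks are stacked by rows in group order,
    giving an (M1+M2)*L x (N/L) integrated matrix per group.
    """
    L = len(combined_per_group)
    blocks = [split_columns_for_heads(m, L, hc) for m in combined_per_group]
    return [
        np.concatenate([blocks[h][g] for h in range(L)], axis=0)
        for g in range(L)
    ]
```

For example, with L = 2, N = 8 and hc = 2, each block is 4 columns wide, so every head can take 2 columns; with hc = 3 the sketch rejects the division.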

5. The data processing method according to claim 1, characterized in that, The target key-value pair includes a target key and a target value; The process involves the processing unit performing attention calculations based on the target query corresponding to the combined subsequence of the processing unit, and the target key-value pairs corresponding to the text subsequence and the image subsequence of the processing unit, respectively, to obtain a first attention sub-result and a second attention sub-result, including: The processing unit performs attention calculations based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the text subsequence of the processing unit, and obtains the first attention sub-result. The processing unit performs attention calculations based on the target query corresponding to the combined subsequence of the processing unit, the target key and the target value corresponding to the image subsequence of the processing unit, and obtains the second attention sub-result.

6. The data processing method according to claim 5, characterized in that, The combined subsequence is an M×N original combined matrix, and the text subsequence is an M1×N original text matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix; Before obtaining the first attention sub-result by performing attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit, the target key and the target value corresponding to the text sub-sequence of the processing unit, and the processing unit, the data processing method further includes: Based on the original M×N combination matrix and the N×N query transformation matrix, the target query matrix of M×N is generated. Based on the original text matrix M1×N and the key transformation matrix N×N, the first target key matrix M1×N is generated; Based on the original text matrix M1×N and the value transformation matrix N×N, the first target value matrix M1×N is generated.

7. The data processing method according to claim 5, characterized in that, The combined subsequence is an M×N original combined matrix, and the image subsequence is an M2×N original image matrix; the target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix; Before obtaining the second attention sub-result by performing attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit, the target key and the target value corresponding to the image sub-sequence of the processing unit, and the processing unit, the data processing method further includes: Based on the original M×N combination matrix and the N×N query transformation matrix, the target query matrix of M×N is generated. Based on the original image matrix of M2×N and the key transformation matrix of N×N, the second target key matrix of M2×N is generated; Based on the original image matrix M2×N and the value transformation matrix N×N, the second target value matrix M2×N is generated.

8. The data processing method according to claim 5, characterized in that, The target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the text subsequence are respectively an M1×N first target key matrix and an M1×N first target value matrix; The first attention sub-result is obtained by performing attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit, the target key corresponding to the text sub-sequence of the processing unit, and the target value, through the processing unit, including: Transpose the first target key matrix to obtain the first transpose matrix of N×M1; Multiply the target query matrix and the first transpose matrix to obtain the first product matrix of M×M1; Based on the first product matrix and the first target value matrix, an M×N first attention matrix is generated as the first attention sub-result.

9. The data processing method according to claim 8, characterized in that, Each processing unit contains hc attention heads, each attention head is of size hs, and N is the product of hc and hs; The step of generating an M×N first attention matrix based on the first product matrix and the first target value matrix, as the first attention sub-result, includes: Based on the size hs of the attention head, the first product matrix is scaled to obtain the first scaling matrix; The first scaling matrix is exponentially normalized to obtain the first exponentially normalized matrix; Based on the first exponentially normalized matrix and the first target value matrix, the first attention matrix is generated as the first attention sub-result.
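
As an illustration only, and not part of the claim language, the sequence of operations in claims 8 and 9 can be sketched in NumPy. Two points are assumptions not stated in the claims: the scaling is taken to be division by the square root of hs, and the exponential normalization is taken to be a row-wise softmax. The function name `attention_sub_result` is hypothetical.

```python
import numpy as np

def attention_sub_result(Q, K, V, hs):
    """Compute one attention sub-result at the matrix level.

    Q: (M, N) target query matrix of the combined subsequence.
    K: (M1, N) target key matrix; V: (M1, N) target value matrix.
    hs: attention head size (scaling by sqrt(hs) is an assumption).
    Returns an (M, N) attention matrix.
    """
    product = Q @ K.T                      # transpose K, then M x M1 product
    scaled = product / np.sqrt(hs)         # scale based on head size hs
    # Subtract the row maximum for numerical stability; this does not
    # change the softmax result.
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    weights = np.exp(scaled)
    weights = weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                     # (M, M1) @ (M1, N) -> (M, N)
```

The same function applies to the second attention sub-result of claims 10 and 11 when the image subsequence's M2×N key and value matrices are passed in place of the text ones.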

10. The data processing method according to claim 5, characterized in that, The target query corresponding to the combined subsequence is an M×N target query matrix, and the target key and target value corresponding to the image subsequence are respectively an M2×N second target key matrix and an M2×N second target value matrix; The second attention sub-result is obtained by performing attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit, the target key corresponding to the image sub-sequence of the processing unit, and the target value, through the processing unit, including: Transpose the second target key matrix to obtain an N×M2 second transpose matrix; Multiply the target query matrix and the second transpose matrix to obtain the second product matrix of M×M2; Based on the second product matrix and the second target value matrix, an M×N second attention matrix is generated as the second attention sub-result.

11. The data processing method according to claim 10, characterized in that, Each processing unit contains hc attention heads, each attention head is of size hs, and N is the product of hc and hs; The step of generating an M×N second attention matrix based on the second product matrix and the second target value matrix, as the second attention sub-result, includes: Based on the size hs of the attention head, the second product matrix is scaled to obtain the second scaling matrix; The second scaling matrix is exponentially normalized to obtain the second exponentially normalized matrix; Based on the second exponentially normalized matrix and the second target value matrix, the second attention matrix is generated as the second attention sub-result.

12. The data processing method according to claim 1, characterized in that, Before obtaining the image sub-sequences of other processing units among the first number of processing units through the processing unit, and performing attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pair corresponding to the obtained image sub-sequence to obtain the third attention sub-result, the data processing method includes: In the first cycle, the j-th processing unit sends its own image subsequence to the (j+1)-th processing unit in the processing unit group; In the i-th cycle, the j-th processing unit sends the image sub-sequence received in the (i-1)-th cycle to the (j+1)-th processing unit in the processing unit group, where i is an integer greater than 1 and less than the first number.

13. The data processing method according to claim 11, characterized in that, The step of obtaining image sub-sequences from other processing units among the first number of processing units through the processing unit includes: In the first to the i-th cycles, the j-th processing unit receives the image subsequence of the (j-1)-th processing unit in the group of processing units.
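
As an illustration only, and not part of the claim language, the cyclic forwarding of claims 12 and 13 can be simulated in plain Python. The sketch assumes that the last processing unit forwards to the first one (a ring), which the claims do not state explicitly, and the function name `ring_schedule` is hypothetical. Units are indexed from 0.

```python
def ring_schedule(image_subsequences):
    """Simulate image subsequences being forwarded around a group.

    image_subsequences: list of P items; item j is the image subsequence
    initially held by processing unit j.
    Returns a list of P lists; entry j holds what unit j receives in
    cycles 1 .. P-1, in order.
    """
    P = len(image_subsequences)
    received = [[] for _ in range(P)]
    # In cycle 1 every unit sends its own image subsequence.
    in_flight = list(image_subsequences)
    for _cycle in range(1, P):
        # Unit j receives what unit j-1 sent; wrap-around is an assumption.
        arrived = [in_flight[(j - 1) % P] for j in range(P)]
        for j in range(P):
            received[j].append(arrived[j])
        # In the next cycle each unit forwards what it has just received.
        in_flight = arrived
    return received
```

Under these assumptions, after P-1 cycles every unit has received the image subsequence of each other unit in its group exactly once, which is what the third attention sub-result consumes.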

14. The data processing method according to claim 1, characterized in that, The processing device includes a second number of processing unit groups, wherein the first attention sub-result, the second attention sub-result, and the third attention sub-result are respectively an M×N first attention matrix, a second attention matrix, and a third attention matrix; The step of generating an attention result based on the first attention sub-result, the second attention sub-result, and the third attention sub-result includes: The first attention matrix, the second attention matrix, and the third attention matrix are superimposed to obtain the M×N superimposed attention matrix corresponding to the processing unit; The attention result is generated based on the superimposed attention matrix corresponding to the processing units with the same index in the second number of processing unit groups.

15. The data processing method according to claim 14, characterized in that, The second number is L; The attention result is generated based on the superimposed attention matrix corresponding to the processing units with the same index in the second number of processing unit groups, including: The rows of the superimposed attention matrix are divided according to the second number to obtain a second number of (M / L)×N attention submatrices; The first attention submatrix corresponding to the processing unit group in the second number of attention submatrices is retained, and the second attention submatrix corresponding to the other processing unit groups in the second number of attention submatrices is sent to the processing unit with the same index in the other processing unit groups; Receive the third attention sub-matrix sent by the processing unit with the same sequence number in the other processing unit group, and integrate the retained first attention sub-matrix and the received third attention sub-matrix into a fourth attention sub-matrix of (M / L)×NL; The fourth attention submatrix is determined as the attention result.

16. A data processing apparatus, characterized in that, Applied to a processing device comprising a group of processing units, the group of processing units comprising a first number of processing units, the data processing apparatus comprising: A first acquisition unit is used to acquire target application data, wherein the target application data has a text sequence and an image sequence. The first generation unit is configured to, based on each processing unit, generate a text subsequence for processing by the processing unit based on the text sequence, generate an image subsequence for processing by the processing unit based on the image sequence, and generate a combined subsequence based on the text subsequence and the image subsequence; An attention calculation unit is used to perform attention calculation based on the target query corresponding to the combined subsequence of the processing unit, the target key-value pairs corresponding to the text subsequence and the image subsequence of the processing unit, respectively, to obtain a first attention sub-result and a second attention sub-result. The second acquisition unit is used to acquire image sub-sequences of other processing units among the first number of processing units through the processing unit, and to perform attention calculation based on the target query corresponding to the combined sub-sequence of the processing unit and the target key-value pair corresponding to the acquired image sub-sequence to obtain a third attention sub-result; The second generation unit is used to generate an attention result based on the first attention sub-result, the second attention sub-result, and the third attention sub-result.

17. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the data processing method according to any one of claims 1 to 15.

18. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the data processing method according to any one of claims 1 to 15.

19. A computer program product comprising a computer program that is read and executed by a processor of a computer device, causing the computer device to perform the data processing method according to any one of claims 1 to 15.