Model training method, electronic device, and storage medium

By introducing a composite mapping layer and hierarchical bidirectional information aggregation into the Transformer model, the problems of low computational efficiency and insufficient capture of long-range dependencies in long sequence processing are solved, achieving efficient parallel computing and accurate sequence modeling.

CN122198019A · Pending · Publication Date: 2026-06-12 · HONG KONG UNIV OF SCI & TECH (GUANGZHOU) +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
HONG KONG UNIV OF SCI & TECH (GUANGZHOU)
Filing Date
2026-02-25
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies are computationally inefficient and complex when processing long sequences, making it difficult to effectively capture long-range dependencies, and they are also difficult to deploy efficiently on resource-constrained edge devices.

Method used

A composite mapping layer (CM) neural network structure is adopted, which combines dynamic weight generation and hierarchical bidirectional information aggregation strategy. By dividing the feature dimension of the input sequence into forward and backward components and performing hierarchical feature aggregation, the information is finally fused in parallel computing, thereby reducing computational complexity.

Benefits of technology

While maintaining parallel computing capabilities, it significantly reduces computational complexity and improves the efficiency and accuracy of long sequence processing, making it suitable for fields such as natural language processing, time series prediction, and speech recognition.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122198019A_ABST
Patent Text Reader

Abstract

Embodiments of the present application provide a model training method, an electronic device and a storage medium. The method comprises: obtaining an input sequence; inputting the input sequence into a sequence processing model to obtain a target sequence output by the sequence processing model; wherein the sequence processing model performs the following processes: uniformly segmenting N feature dimensions of each character to obtain an initial forward component and an initial backward component of a character sequence; performing first feature aggregation processing on the initial forward component of the character sequence according to a forward propagation direction to obtain a target forward component of the character sequence; performing second feature aggregation processing on the initial backward component of the character sequence according to a backward propagation direction to obtain a target backward component of the character sequence; performing fusion processing on the target forward component of the character sequence and the target backward component of the character sequence to obtain a target component of the character sequence; and obtaining the target sequence according to the character sequence and the target component. The embodiments of the present application can improve the long sequence calculation efficiency.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and in particular to a model training method, electronic device, and storage medium.

Background Technology

[0002] Sequence modeling is a core task in fields such as Natural Language Processing (NLP), speech recognition, machine translation, and bioinformatics, with the key being the effective capture of long-range dependencies in sequence data. Traditional methods primarily employ Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) for sequence processing. RNNs model sequences by recursively updating hidden states, but they suffer from vanishing or exploding gradients and are difficult to parallelize, limiting their application in long-sequence tasks. While CNNs support parallel computation, their fixed receptive field size makes it difficult to effectively capture long-range dependencies, and pooling operations can lead to the loss of sequence information.

[0003] The Transformer model significantly improves sequence modeling capabilities: its self-attention mechanism can directly model global dependencies and supports efficient parallel training. However, the computational complexity of this self-attention mechanism is O(L²) (where L is the sequence length), which makes it computationally inefficient when dealing with very long sequences.

Summary of the Invention

[0004] The main objective of this application is to propose a model training method, apparatus, electronic device, and storage medium that can solve the problem of low computational efficiency in processing long sequences in the prior art, reduce the complexity of long sequence processing, and improve computational efficiency.

[0005] To achieve the above objectives, a first aspect of this application proposes a model training method, the method comprising: Obtain an input sequence, which includes a character sequence consisting of multiple characters and N feature dimensions for describing each character; The input sequence is fed into the sequence processing model to obtain the target sequence output by the sequence processing model; The sequence processing model performs the following process: The N feature dimensions of each character are uniformly segmented to obtain the initial forward component and the initial backward component of the character sequence. The initial forward component includes N1 feature dimensions of each character, and the initial backward component includes N2 feature dimensions of each character. The sum of N1 and N2 is N, where N is an integer greater than or equal to 2. The initial forward component of the character sequence is subjected to a first feature aggregation process according to the forward propagation direction to obtain the target forward component of the character sequence, wherein the forward propagation direction is the propagation direction from the first character to the last character of the character sequence; The initial backward component of the character sequence is subjected to a second feature aggregation process according to the backward propagation direction to obtain the target backward component of the character sequence, wherein the backward propagation direction is the propagation direction from the last character to the first character of the character sequence; The target forward component and the target backward component of the character sequence are fused to obtain the target component of the character sequence. The target sequence is obtained based on the character sequence and the target component.

[0006] In some embodiments, the first feature aggregation process includes: for a first character in the character sequence, fusing the N1 feature dimensions of the first character and the N1 feature dimensions of a second character located in the forward propagation direction to obtain the N1 feature dimensions corresponding to the first character in the target forward component, wherein the second character is spaced n1 characters away from the first character, and n1 is a positive integer; And / or, The second feature aggregation process includes: for the third character in the character sequence, fusing the N2 feature dimensions of the third character and the N2 feature dimensions of the fourth character located in the backward propagation direction to obtain the N2 feature dimensions corresponding to the third character in the target backward component, wherein the third character and the fourth character are separated by n2 characters, where n2 is a positive integer.

[0007] In some embodiments, the first feature aggregation process is performed M times, where M is determined according to the length of the character sequence. During the i-th first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction, n1 is 2 to the power of i, and i is a natural number less than M. And / or, the second feature aggregation process is performed M times. During the j-th second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction, n2 is 2 to the power of j, and j is a natural number less than M.
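The stride schedule described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `aggregation_strides` is hypothetical, the exponent i is assumed to start at 0, and M is assumed to be the ceiling of log2(L), since the text only states that M is determined by the sequence length.

```python
def aggregation_strides(seq_len):
    """Return the spacing 2**i used by the i-th aggregation pass, i = 0..M-1.

    Assumption: M = ceil(log2(seq_len)), with at least one pass.
    The source only says that M depends on the sequence length.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    # (seq_len - 1).bit_length() equals ceil(log2(seq_len)) for seq_len >= 1.
    num_passes = max(1, (seq_len - 1).bit_length())
    return [2 ** i for i in range(num_passes)]


if __name__ == "__main__":
    print(aggregation_strides(8))
    print(len(aggregation_strides(1000)))
```

Under this assumption the number of passes grows logarithmically with the sequence length, which is consistent with the complexity reduction the application describes.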

[0008] In some embodiments, during the i-th first feature aggregation process on the initial forward components of the character sequence according to the forward propagation direction: Obtain N1 weight values corresponding to the first character from the L×N weight matrix corresponding to the i-th first feature aggregation process, wherein the L×N weight matrix corresponding to the i-th first feature aggregation process is obtained by performing three-dimensional reshaping processing on the input sequence, and L is the length of the character sequence; Obtain N1 feature dimensions of the first character and N1 feature dimensions of the second character from the initial forward component; Multiply the N1 weight values element-wise with the N1 feature dimensions of the first character to obtain the N1 first target feature dimensions; Add the corresponding elements of the N1 feature dimensions of the second character to the N1 first target feature dimensions to obtain the target forward component of the second character after the i-th first feature aggregation process; And / or, During the i-th second feature aggregation process on the initial backward components of the character sequence according to the backward propagation direction: Obtain N2 weight values corresponding to the third character from the L×N weight matrix corresponding to the i-th second feature aggregation process, wherein the L×N weight matrix corresponding to the i-th second feature aggregation process is obtained by performing three-dimensional reshaping processing on the input sequence; Obtain the N2 feature dimensions of the third character and the N2 feature dimensions of the fourth character from the initial backward component; Multiply the N2 weight values element-wise with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions; Add the corresponding elements of the N2 feature dimensions of the fourth character to the N2 second target feature dimensions to obtain the target backward component of the fourth character after the i-th second feature aggregation process.
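A single forward aggregation pass of this kind might look like the sketch below. The function name `forward_aggregation_pass` and the nested-list data layout are assumptions made for illustration. `features` holds the N1 forward feature dimensions of each character, `weights` holds the matching slice of the L×N weight matrix for this pass, and `stride` is the spacing between the first and second character.

```python
def forward_aggregation_pass(features, weights, stride):
    """Apply one weighted shift-and-add pass in the forward direction.

    For each "first" character at position p, its features are multiplied
    element-wise by its weights and added to the features of the "second"
    character at position p + stride.
    """
    seq_len = len(features)
    # Positions with no predecessor at this stride keep their features.
    result = [row[:] for row in features]
    for first in range(seq_len - stride):
        second = first + stride
        result[second] = [
            x_second + w * x_first
            for x_first, x_second, w in zip(
                features[first], features[second], weights[first]
            )
        ]
    return result


if __name__ == "__main__":
    feats = [[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]]
    half = [[0.5, 0.5]] * 3
    print(forward_aggregation_pass(feats, half, stride=1))
```

The backward pass would mirror this by pairing each character with the one located `stride` positions earlier in the sequence. Every position reads only from the input of the pass, so all positions can be updated in parallel.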

[0009] In some embodiments, the three-dimensional reshaping process includes: The input sequence is subjected to a linear transformation to obtain an intermediate representation; The intermediate representation is converted into a three-dimensional tensor to obtain a three-dimensional structure, which includes the length L, the number M, and N feature dimensions of each character sequence. The three-dimensional structure is normalized to obtain M L×N weight matrices.
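The weight generation step can be illustrated with the following sketch. It is written under several assumptions: the names `softmax` and `generate_weights` are hypothetical, the linear transformation is a plain matrix product with an N × (M·N) projection matrix and no bias, and the normalization is a Softmax taken over the N feature dimensions of each character. The source does not specify the normalization axis.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def generate_weights(inputs, projection, num_passes):
    """Produce M weight matrices of shape L x N from an L x N input.

    inputs:     L rows of N features.
    projection: N rows of M*N columns (assumed linear map, no bias).
    """
    num_features = len(inputs[0])
    width = num_passes * num_features
    # Linear transformation: (L, N) x (N, M*N) -> (L, M*N).
    intermediate = [
        [sum(x * projection[k][c] for k, x in enumerate(row)) for c in range(width)]
        for row in inputs
    ]
    # View each row as (M, N), normalize each group, and regroup by pass.
    weights = []
    for m in range(num_passes):
        start = m * num_features
        weights.append(
            [softmax(row[start:start + num_features]) for row in intermediate]
        )
    return weights


if __name__ == "__main__":
    x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    zero_projection = [[0.0] * 4 for _ in range(2)]
    print(generate_weights(x, zero_projection, num_passes=2))
```

Because the weights are computed from the input itself, each sequence receives its own set of M weight matrices, which corresponds to the dynamic weight generation mentioned in the application.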

[0010] In some embodiments, the step of element-wise multiplying the N1 weight values with the N1 feature dimensions of the first character to obtain the N1 first target feature dimensions includes: Obtain the first parameter matrix corresponding to the first feature aggregation processing. The first parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character. Obtain the N1 first parameters corresponding to the first character in the first parameter matrix; The N1 feature dimensions of the first character, the N1 weight values, and the N1 first parameters are multiplied element-wise to obtain the N1 first target feature dimensions. And / or, The step of element-wise multiplying the N2 weight values with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions includes: Obtain the second parameter matrix corresponding to the second feature aggregation process. The second parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character. Obtain the N2 second parameters in the second parameter matrix that correspond to the third character; The N2 feature dimensions of the third character, the N2 weight values, and the N2 second parameters are multiplied element-wise to obtain the N2 second target feature dimensions.

[0011] In some embodiments, the target parameter matrix is obtained according to the following process: Obtain the initial parameter matrix randomly generated by the sequence processing model. The initial parameter matrix includes the length L of the character sequence and N feature dimensions of each character. Obtain a training sample set, which includes training samples and the true label of each training sample; The training sample set is input into the sequence processing model, which then calculates the prediction result using the initial parameter matrix through target feature aggregation processing; Based on the prediction results and the true labels, the loss function value is obtained; Based on the loss function value, the gradient of the initial parameter matrix is calculated using the backpropagation algorithm; Based on the gradient, the values of the initial parameter matrix are adjusted to obtain a temporary parameter matrix; The temporary parameter matrix is used as the initial parameter matrix, and the process returns to the step of inputting the training sample set into the sequence processing model and calculating the prediction result using the initial parameter matrix through target feature aggregation processing, until the iteration stopping condition is met. The last obtained temporary parameter matrix is used as the target parameter matrix; Wherein, when the target parameter matrix is the first parameter matrix, the target feature aggregation processing adopts the first feature aggregation processing; when the target parameter matrix is the second parameter matrix, the target feature aggregation processing adopts the second feature aggregation processing.
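The iterative update described above follows the usual predict, measure loss, compute gradient, adjust, repeat pattern. The sketch below shows only that loop structure on a deliberately tiny stand-in model. It does not reproduce the aggregation layer: the parameter matrix is reduced to a single scalar, the prediction is assumed to be `param * x`, the loss is assumed to be mean squared error, and the stopping condition is assumed to be a loss tolerance or a step limit. All names are hypothetical.

```python
def train_parameter(samples, initial_param, learning_rate=0.1,
                    max_steps=500, tolerance=1e-12):
    """Fit one parameter p of the toy model prediction = p * x.

    samples: list of (x, true_label) pairs.
    Returns the last parameter value, playing the role of the
    "target parameter matrix" in the text.
    """
    param = initial_param
    for _ in range(max_steps):
        # Forward computation and loss (mean squared error, assumed).
        errors = [param * x - y for x, y in samples]
        loss = sum(e * e for e in errors) / len(samples)
        if loss < tolerance:  # assumed iteration stopping condition
            break
        # Gradient of the loss with respect to the parameter.
        gradient = sum(2 * e * x for e, (x, _y) in zip(errors, samples)) / len(samples)
        # The adjusted value becomes the parameter for the next round.
        param = param - learning_rate * gradient
    return param


if __name__ == "__main__":
    data = [(1.0, 2.0), (2.0, 4.0)]
    print(train_parameter(data, initial_param=0.0))
```

In a real model the gradient would come from backpropagation through the aggregation passes rather than from a hand-written formula.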

[0012] In some embodiments, after inputting the input sequence to a sequence processing model to obtain the target sequence output by the sequence processing model, the method further includes: The initial loss function value is calculated based on the target sequence. Based on the initial loss function value, the initial gradients of each parameter in the sequence processing model are calculated using the backpropagation algorithm. Calculate the gradient norm of the initial gradient of each of the parameters; Obtain the gradient norm, historical gradient norm, and the changing trend of historical loss function values; The gradient clipping threshold is dynamically adjusted based on the gradient norm, historical gradient norm, and the changing trend of historical loss function values. The gradient norm is compared with the gradient clipping threshold. If the gradient norm exceeds the gradient clipping threshold, the initial gradient of each parameter is proportionally reduced to the range of the gradient clipping threshold to obtain the target gradient of each parameter. The model parameters in the sequence processing model are updated based on the target gradients of each parameter.
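The dynamic clipping procedure can be sketched as follows. The source does not give the rule for adjusting the threshold, so the one used here is purely an assumption: the threshold is a multiple of the average of the current and historical gradient norms, and it is tightened when the most recent loss value has risen. The function names and the `scale` and `tighten` parameters are hypothetical.

```python
import math


def gradient_norm(gradients):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in gradients))


def dynamic_threshold(current_norm, norm_history, loss_history,
                      scale=1.5, tighten=0.5):
    """Assumed rule: scale the mean recent norm, tighten if the loss rose."""
    recent = list(norm_history) + [current_norm]
    threshold = scale * sum(recent) / len(recent)
    if len(loss_history) >= 2 and loss_history[-1] > loss_history[-2]:
        threshold *= tighten
    return threshold


def clip_gradients(gradients, threshold):
    """Scale all gradients proportionally so their norm is at most threshold."""
    norm = gradient_norm(gradients)
    if norm <= threshold:
        return list(gradients)
    factor = threshold / norm
    return [g * factor for g in gradients]


if __name__ == "__main__":
    grads = [3.0, 4.0]
    limit = dynamic_threshold(gradient_norm(grads), [2.0, 2.0], [1.0, 0.9])
    print(limit, clip_gradients(grads, limit))
```

Proportional scaling keeps the direction of the update unchanged and limits only its magnitude.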

[0013] To achieve the above objectives, a second aspect of this application provides a model training apparatus, the apparatus comprising: The acquisition module is used to acquire an input sequence, which includes a character sequence consisting of multiple characters and N feature dimensions for describing each character; The input module is used to input the input sequence into the sequence processing model to obtain the target sequence output by the sequence processing model; The sequence processing model performs the following process: The N feature dimensions of each character are uniformly segmented to obtain the initial forward component and the initial backward component of the character sequence. The initial forward component includes N1 feature dimensions of each character, and the initial backward component includes N2 feature dimensions of each character. The sum of N1 and N2 is N, where N is an integer greater than or equal to 2. The initial forward component of the character sequence is subjected to a first feature aggregation process according to the forward propagation direction to obtain the target forward component of the character sequence, wherein the forward propagation direction is the propagation direction from the first character to the last character of the character sequence; The initial backward component of the character sequence is subjected to a second feature aggregation process according to the backward propagation direction to obtain the target backward component of the character sequence, wherein the backward propagation direction is the propagation direction from the last character to the first character of the character sequence; The target forward component and the target backward component of the character sequence are fused to obtain the target component of the character sequence. The target sequence is obtained based on the character sequence and the target component.

[0014] To achieve the above objectives, a third aspect of this application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the method described in the first aspect.

[0015] To achieve the above objectives, a fourth aspect of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method described in the first aspect.

[0016] The model training method, apparatus, electronic device, and storage medium proposed in this application uniformly divide the N feature dimensions of each character in the input sequence into initial forward and backward components, and perform hierarchical feature aggregation along the forward and backward propagation directions respectively. Finally, the bidirectional aggregation results are fused to generate the target sequence. This achieves effective capture of long-term dependencies in the sequence while reducing computational complexity, significantly improving the efficiency and accuracy of sequence processing. The embodiments of this application, through an innovative bidirectional feature aggregation mechanism, enable each character to obtain complete contextual information while maintaining the model's parallel computing capabilities, thereby achieving more accurate sequence conversion and generation in natural language processing tasks such as machine translation and text generation.

Attached Figure Description

[0017] Figure 1 is a flowchart illustrating the model training method provided in the embodiments of this application; Figure 2 is a schematic diagram of the structure of the model training device provided in the embodiments of this application; Figure 3 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0019] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, and the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.

[0020] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0021] Sequence modeling is a core task in fields such as Natural Language Processing (NLP), speech recognition, machine translation, and bioinformatics, with the key being the effective capture of long-range dependencies in sequence data. Traditional methods primarily employ Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) for sequence processing. RNNs model sequences by recursively updating hidden states, but they suffer from vanishing or exploding gradients and are difficult to parallelize, limiting their application in long-sequence tasks. While CNNs support parallel computation, their fixed receptive field size makes it difficult to effectively capture long-range dependencies, and pooling operations can lead to the loss of sequence information.

[0022] The introduction of Transformers significantly improved sequence modeling capabilities. Its self-attention mechanism can directly model global dependencies and supports efficient parallel training. However, the computational complexity of Transformer's self-attention is O(L²) (where L is the sequence length), which presents significant computational and memory overhead when processing long sequences (such as genomic data or long documents), limiting its application in practical industrial scenarios. Although existing research has attempted improvements, such as linear attention and state-space models (e.g., Mamba), these methods often sacrifice the model's accurate modeling ability for long-range dependencies or have shortcomings in parallel training while maintaining computational efficiency. Furthermore, existing models still have limitations in cross-domain and cross-language generalization capabilities and have a large number of parameters, making efficient deployment on resource-constrained edge devices difficult.

[0023] Based on this, embodiments of this application provide a model training method, apparatus, electronic device, and storage medium, aiming to provide a sequence modeling method with low computational complexity, high parallelization efficiency, and strong long-range modeling capabilities to meet the demands for real-time performance, accuracy, and scalability in industrial scenarios. To address the technical bottlenecks of traditional Transformer models in processing long sequences, such as high computational complexity, large memory consumption, and insufficient capture of long-range dependencies, this application constructs a novel neural network structure, the Composite Mapping Layer (CM), and combines it with a dynamic weight generation mechanism and a hierarchical bidirectional information aggregation strategy. This reduces the computational complexity below the O(L²) of traditional Transformers while maintaining the model's parallel computing capabilities, which represents a breakthrough advancement. The approach can be widely applied in fields such as natural language processing, time series prediction, and speech recognition. For example, in long text machine translation tasks, given a source text of thousands of words, the composite mapping layer model can generate fluent and semantically consistent translations while reducing computational complexity. In financial time series prediction tasks, given historical transaction data, the model can efficiently capture long-term dependencies and output predictions of future trends. Therefore, this application not only improves the efficiency of long sequence modeling but also ensures the accuracy and practicality of the prediction results.

[0024] This application proposes a structural sharing method for information mapping functions. This method assumes that the relationship between any two time points depends only on their distance, i.e., it satisfies translation invariance. Furthermore, a method is proposed to decompose long-distance information mapping into an iterative combination of multiple fixed-distance unit mapping functions $F_1$, for example... This structure is essentially a type of Markov process, similar to the computational flow of a recurrent neural network (RNN), with both a small number of parameters and low computational complexity. However, due to the iterative nature of the structure, it is difficult to compute in parallel, and long-distance dependent information is prone to decay.

[0025] To address the aforementioned shortcomings, the CM layer employs a family of information mapping functions whose distances are powers of 2. By combining these basic mapping functions, information transmission over arbitrary distances can be achieved. Specifically, to map information from one time step to a later time step, the difference between the two time indices is expressed as a sum of several powers of 2, and the mapping is achieved by combining the corresponding mapping functions. This method significantly shortens the information transmission path and effectively alleviates the problem of information attenuation during long-distance transmission.
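The decomposition of a distance into powers of 2 is simply its binary expansion, as the following small sketch shows. The function name `power_of_two_steps` is hypothetical.

```python
def power_of_two_steps(distance):
    """Split a non-negative distance into distinct powers of 2 that sum to it."""
    if distance < 0:
        raise ValueError("distance must be non-negative")
    steps = []
    bit = 0
    while distance:
        if distance & 1:
            steps.append(1 << bit)
        distance >>= 1
        bit += 1
    return steps


if __name__ == "__main__":
    # A gap of 13 positions is covered by hops of 1, 4 and 8.
    print(power_of_two_steps(13))
```

A gap of d positions therefore needs at most about log2(d) + 1 hops, instead of the d unit hops required by a purely iterative structure.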

[0026] To improve computational efficiency and reduce the number of parameters, the CM layer uses a grouped linear transformation approach for mapping calculations. The input features are divided into groups, each with a feature dimension of 1. A set of weighting coefficients is introduced; it is obtained by performing a linear transformation on the input features and then normalizing the result using Softmax. Subsequently, for a given distance, the input sequence is shifted by that distance along the sequence dimension, a linear transformation is applied to the shifted feature vector at each position, and the transformation result is added to the state representation of the current position.

[0027] This process is equivalent to a prefix sum operation with distance weights; that is, by shifting and accumulating layer by layer, information from different positions is gradually aggregated into the current sequence representation. In specific implementations, index slicing combined with prefix sum operations can be used to complete the shift and accumulation: for example, the leading elements of the sequence are sliced out and aligned with the elements located a given distance after them, multiplied by the parameter matrix, and then added to the corresponding positions. This achieves efficient information aggregation while keeping the computational complexity at the logarithmic level. This processing flow is highly parallel and suitable for GPU acceleration.
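The link between layer-by-layer shifting and a prefix sum can be checked with a small sketch. It makes simplifying assumptions: each character has a single scalar feature, and every distance weight is fixed to the same constant `weight` instead of being generated from the input. With `weight` equal to 1 the passes reproduce an ordinary inclusive prefix sum. The function name `shift_accumulate` is hypothetical.

```python
def shift_accumulate(values, weight=1):
    """Run shift-and-add passes with strides 1, 2, 4, ... using slicing."""
    state = list(values)
    stride = 1
    while stride < len(state):
        leading = state[:-stride]   # elements that are shifted forward
        trailing = state[stride:]   # elements that receive the shifted values
        state = state[:stride] + [
            current + weight * previous
            for previous, current in zip(leading, trailing)
        ]
        stride *= 2
    return state


if __name__ == "__main__":
    print(shift_accumulate([1, 2, 3, 4, 5]))
```

Each pass is a single aligned element-wise operation over two slices, which is why the procedure parallelizes well. The loop runs about log2(L) times for a sequence of length L.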

[0028] Furthermore, to meet the need for full-sequence information modeling in certain applications, this application designs a bidirectional CM layer structure. It divides the input features into two channels, forward and reverse, performs information aggregation on each channel separately, and finally concatenates them along the feature dimension to form a complete bidirectional context representation. The forward part is processed in the same way as described above; the reverse part is shifted and accumulated in the opposite direction. Through this structure, the model can fully acquire global context information.
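A bidirectional layer of this shape can be sketched as below. As a simplification, all aggregation weights are assumed to be 1, so each channel reduces to a running sum. The function names `scan` and `bidirectional_cm` are hypothetical, and the split point is assumed to be the middle of the feature vector.

```python
def scan(rows):
    """Unweighted shift-and-add over a list of feature vectors (strides 1, 2, 4, ...)."""
    state = [row[:] for row in rows]
    stride = 1
    while stride < len(state):
        updated = [row[:] for row in state]
        for pos in range(stride, len(state)):
            updated[pos] = [
                cur + prev for cur, prev in zip(state[pos], state[pos - stride])
            ]
        state = updated
        stride *= 2
    return state


def bidirectional_cm(rows):
    """Aggregate the first half of the features forward and the second half backward."""
    half = len(rows[0]) // 2
    forward = scan([row[:half] for row in rows])
    # Reverse the sequence, aggregate, then restore the original order.
    backward = scan([row[half:] for row in rows][::-1])[::-1]
    # Concatenate both channels along the feature dimension.
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    print(bidirectional_cm([[1, 10], [2, 20], [3, 30]]))
```

After fusion, every position holds information from all earlier positions in its forward channel and from all later positions in its backward channel.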

[0029] Following each CM layer, this application also introduces a nonlinear feedforward network to enhance the feature transformation capability. The feedforward network consists of two linear transformation layers and a LeakyReLU activation function, and its overall function is to enhance the nonlinear expressive power of the features. First, the first linear transformation layer projects the input features from the original dimension to a higher intermediate dimension. By expanding features in a higher-dimensional space, the model can learn more complex and diverse feature combinations. Subsequently, the LeakyReLU activation function applies a nonlinear transformation in this higher-dimensional space. This not only improves the model's ability to express complex patterns, but also avoids the neuron inactivation problem that may occur with the traditional ReLU activation function, because LeakyReLU retains non-zero gradients in the negative region. Finally, a second linear transformation layer maps the high-dimensional features after the nonlinear transformation back to the original dimension. This ensures dimensional consistency between the input and output of the feedforward network and maintains compatibility with the overall model structure. Therefore, the intermediate layer dimension and the input and output dimension serve different purposes: the former is used to expand and enrich the feature modeling capabilities in the latent space, while the latter ensures the uniformity and interoperability of network modules in terms of input and output interfaces.
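The feedforward block can be sketched for a single feature vector as follows. The function names, the bias terms, and the default negative slope of 0.01 are assumptions, since the source does not give these details. Weight matrices are stored as lists of output rows.

```python
def leaky_relu(value, slope=0.01):
    """LeakyReLU: identity for positive inputs, a small slope for the rest."""
    return value if value > 0 else slope * value


def linear(vector, matrix, bias):
    """Dense layer: matrix has one row of input weights per output unit."""
    return [
        sum(w * v for w, v in zip(row, vector)) + b
        for row, b in zip(matrix, bias)
    ]


def feed_forward(vector, w_up, b_up, w_down, b_down, slope=0.01):
    """Expand to the intermediate dimension, apply LeakyReLU, project back."""
    hidden = [leaky_relu(h, slope) for h in linear(vector, w_up, b_up)]
    return linear(hidden, w_down, b_down)


if __name__ == "__main__":
    w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]    # 2 -> 4
    w_down = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]        # 4 -> 2
    print(feed_forward([1.0, -2.0], w_up, [0.0] * 4, w_down, [0.0] * 2, slope=0.5))
```

The output has the same length as the input, so the block can be stacked after every CM layer without any reshaping.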

[0030] The model training method, apparatus, electronic device, and storage medium provided in the embodiments of this application are specifically described through the following embodiments. First, the model training method in the embodiments of this application is described.

[0031] The model training method provided in this application relates to the field of deep learning technology. The model training method provided in this application can be applied to a terminal, a server, or software running on either a terminal or a server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, etc.; the server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software can be an application implementing the model training method, but is not limited to the above forms.

[0032] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0033] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user is obtained through pop-ups or redirection to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data required for the proper functioning of these embodiments acquired.

[0034] Figure 1 is an optional flowchart of the model training method provided in the embodiments of this application. The method shown in Figure 1 may include, but is not limited to, steps S100 to S700.

[0035] Step S100: Obtain the input sequence, which includes a character sequence consisting of multiple characters and N feature dimensions for describing each character.

[0036] In this embodiment, the input sequence to be processed is obtained. This input sequence can be a piece of text, a feature sequence of a speech signal, or any other form of sequence data. The input sequence is a character sequence composed of multiple characters, which can include letters, punctuation marks, numbers, etc., and each character is described by N feature dimensions. These feature dimensions can include semantic features, syntactic features, positional features, etc. For example, in a machine translation task, the input sequence can be an English sentence, where each word or character is represented by a high-dimensional vector.

[0037] Step S200: Input the input sequence into the sequence processing model to obtain the target sequence output by the sequence processing model.

[0038] In this embodiment, the input sequence is fed into a sequence processing model. The sequence processing model is trained to process the input sequence and output a target sequence. The target sequence can be a translated sentence, a summary text, or other form of sequence output. For example, in long text machine translation tasks, with thousands of words of original text as input, the composite mapping layer model can generate fluent and semantically consistent translations while reducing computational complexity; in financial time series prediction tasks, with historical transaction data as input, the model can efficiently capture long-term dependencies and output predictions of future trends.

[0039] The sequence processing model performs the following process: Step S300: Perform uniform segmentation on the N feature dimensions of each character to obtain the initial forward component and initial backward component of the character sequence. The initial forward component includes N1 feature dimensions of each character, and the initial backward component includes N2 feature dimensions of each character. The sum of N1 and N2 is N, where N is an integer greater than or equal to 2.

[0040] In this embodiment, the model first performs a feature segmentation process. The N feature dimensions of each character are uniformly segmented, meaning all characters are segmented using the same rules, dividing their N feature dimensions into two parts: the first N1 dimensions are used as the initial forward component, and the last N2 dimensions are used as the initial backward component, where N1 + N2 = N. The purpose of this segmentation is to provide independent channels for subsequent bidirectional information processing, avoiding interference between information in a single channel. Preferably, N1 = N2.

[0041] Specifically, the input sequence is represented as a two-dimensional matrix, where the first dimension is the sequence length (L), representing the number of elements (such as characters or words) in the sequence, and the second dimension is the feature dimension (d), representing the length of the feature vector for each element. The input sequence matrix X, of shape L×d, is linearly transformed by the learnable weight matrix of a linear layer to generate an intermediate representation. The intermediate representation is reshaped into a three-dimensional tensor, and Softmax normalization is then performed to form a dynamic weight matrix. The three-dimensional reshaping introduces a distance scale dimension, enabling the model to handle dependencies at different distances in a hierarchical manner.

[0042] Then, the input X, where L is the sequence length, g is the number of groups, and d is the feature dimension, is divided into two parts along the feature dimension: the feature slices corresponding to the first half are used as the forward component, and the feature slices corresponding to the second half are used as the backward component. In an implementation, both components can be obtained by index operations. The two parts are then reshaped into a (...) tensor structure to facilitate subsequent bidirectional information modeling. Through this partitioning, the model can process the forward and backward information flows in parallel, ensuring that each sequence position can obtain complete bidirectional contextual information and enhancing the model's ability to capture long-range dependencies while maintaining low computational complexity.
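
The index-based split described above can be sketched in NumPy as follows. This is a minimal illustration that assumes the preferred equal split (N1 = N2); the function name `split_features` and the toy input are hypothetical and do not come from the application.

```python
import numpy as np

def split_features(x):
    """Split an (L, d) sequence matrix into forward and backward halves."""
    d = x.shape[1]
    n1 = d // 2            # assumption: equal split, N1 = N2
    forward = x[:, :n1]    # first N1 feature dimensions of every position
    backward = x[:, n1:]   # remaining N2 feature dimensions
    return forward, backward

x = np.arange(12, dtype=float).reshape(3, 4)  # toy input with L = 3, d = 4
x_fwd, x_bwd = split_features(x)
```

Because the split is a plain slice, concatenating the two halves along the feature axis recovers the original input.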

[0043] Step S400: Perform a first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction to obtain the target forward component of the character sequence. The forward propagation direction is the propagation direction from the first character to the last character of the character sequence.

[0044] In this embodiment, the number of iterations is approximately the base-2 logarithm of the sequence length. First, the aggregation step size is set: the distance over which information is propagated forward is determined by the current iteration count. In the i-th iteration (i = 0, 1, 2, 3, ...), information is passed between elements that are 2^i positions apart. For example, the 0th iteration uses an interval of 2^0, the 1st iteration uses an interval of 2^1, the 2nd iteration uses an interval of 2^2, and so on. This power-of-2 skipping propagation significantly reduces the number of steps required for information to traverse a long sequence. For the initial forward component of the character sequence, the first feature aggregation process is performed in the forward propagation direction (i.e., from the first character to the last character of the character sequence). The core idea is to allow the forward features of each character to inherit and integrate the effective information from the preceding characters.

[0045] Specifically, in each iteration, the portion of the sequence matrix from element position 2^i to the end of the sequence is updated (where i is the current iteration, counted from 0, and positions are also counted from 0): the portion of the sequence from the beginning up to, but not including, the last 2^i elements is multiplied element-wise with the weights at the corresponding positions in the dynamic weight matrix, and then multiplied with the distance-related learnable parameter matrix corresponding to the current iteration. Finally, the result is added to the portion from element position 2^i to the end of the sequence, thus achieving information transfer and aggregation in the forward direction.
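
A single forward round of this kind might look like the NumPy sketch below. It assumes that the multiplication with the distance-related parameters is element-wise and that the parameters for one round have the same L×N1 shape as the forward component; `forward_step`, `weights` and `theta` are hypothetical names chosen for illustration.

```python
import numpy as np

def forward_step(x_fwd, weights, theta, i):
    """One forward aggregation round with jump distance 2**i.

    x_fwd:   (L, N1) forward component
    weights: (L, N1) dynamic weights for this round
    theta:   (L, N1) assumed distance-related parameters for this round
    """
    step = 2 ** i
    out = x_fwd.copy()
    if step >= x_fwd.shape[0]:
        return out  # jump distance exceeds the sequence: nothing to pass on
    # Every position except the last `step` ones sends a weighted message
    # `step` positions to the right.
    message = x_fwd[:-step] * weights[:-step] * theta[:-step]
    out[step:] += message
    return out

x_fwd = np.array([[1.0], [2.0], [3.0], [4.0]])  # toy component, L = 4, N1 = 1
ones = np.ones_like(x_fwd)
after_round0 = forward_step(x_fwd, ones, ones, 0)  # distance 1
after_round1 = forward_step(x_fwd, ones, ones, 1)  # distance 2
```

With unit weights, each target position simply receives the value located 2^i positions to its left.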

[0046] Step S500: Perform a second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction to obtain the target backward component of the character sequence. The backward propagation direction is the propagation direction from the last character to the first character of the character sequence.

[0047] In this embodiment, for the initial backward component of the character sequence, the second feature aggregation process is performed according to the "backward propagation direction" (i.e., from the last character to the first character of the character sequence). The logic is symmetrical with the first feature aggregation. The core is to allow the backward features of each character to "inherit and integrate the effective information of the subsequent characters".

[0048] Specifically, in parallel with forward propagation, in each iteration the portion of the sequence matrix from the beginning up to, but not including, the last 2^i elements is updated: the portion of the sequence from element position 2^i (counted from 0) to the end of the sequence is multiplied element-wise with the weights at the corresponding positions in the dynamic weight matrix, then multiplied with the distance-related learnable parameter matrix corresponding to the current iteration, and finally the result is added to the portion from the beginning up to, but not including, the last 2^i elements, achieving information transfer and aggregation in the reverse direction. Through bidirectional synchronous propagation, the model can capture long-range dependencies of the sequence from both directions simultaneously, avoiding information omissions caused by unidirectional propagation.

[0049] This embodiment achieves bidirectional information propagation through a logarithmic hierarchical loop: 1. Forward propagation: for each iteration i, from the first iteration to the last, the forward update described above is executed. 2. Backward propagation: the backward update is executed synchronously in the same iteration. In both updates, the product is element-wise. The distance-dependent learnable parameter matrices for the two directions are generated from the weight matrices of the corresponding linear layers. Their initial values are obtained during model construction using conventional neural network parameter initialization methods (such as uniform or normal random initialization), and they are stored as trainable parameters along with the other neural network parameters. During model training, these parameter matrices are iteratively updated through the backpropagation algorithm to minimize the task loss function, thereby gradually learning the information propagation rules adapted to different distance scales. In other words, the two parameter matrices are not manually preset; they are distance-related mapping parameters that are automatically learned during model training.
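
The complete loop over both directions can be sketched as follows. This is an illustrative reading rather than the application's implementation: it assumes element-wise products, assumes that each round operates on the result of the previous round, and uses the hypothetical names `bidirectional_aggregate`, `w_fwd`, `w_bwd`, `theta_fwd` and `theta_bwd`.

```python
import numpy as np

def bidirectional_aggregate(x_fwd, x_bwd, w_fwd, w_bwd, theta_fwd, theta_bwd):
    """Run all power-of-two rounds in both directions.

    x_fwd, x_bwd:         (L, N1) and (L, N2) initial components
    w_fwd, w_bwd:         (M, L, N1) and (M, L, N2) dynamic weights, one slice per round
    theta_fwd, theta_bwd: same shapes; assumed distance-related learnable parameters
    """
    f, b = x_fwd.copy(), x_bwd.copy()
    for i in range(w_fwd.shape[0]):
        step = 2 ** i
        if step >= f.shape[0]:
            break
        # Forward: position p receives from position p - step.
        f_msg = f[:-step] * w_fwd[i, :-step] * theta_fwd[i, :-step]
        # Backward: position p receives from position p + step.
        b_msg = b[step:] * w_bwd[i, step:] * theta_bwd[i, step:]
        f[step:] += f_msg
        b[:-step] += b_msg
    return f, b

L, M = 4, 2
comp = np.ones((L, 1))
unit = np.ones((M, L, 1))
tgt_fwd, tgt_bwd = bidirectional_aggregate(comp, comp, unit, unit, unit, unit)
```

In this toy run with unit weights, two rounds are enough for every position to accumulate all positions on its left (forward) or on its right (backward).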

[0050] Step S600: The target forward component and the target backward component of the character sequence are fused to obtain the target component of the character sequence.

[0051] In this embodiment, the target forward component and the target backward component obtained through bidirectional aggregation are fused. Fusion methods include, but are not limited to, vector concatenation, weighted summation, or combination via a fully connected layer. Preferably, vector concatenation is used to connect the two components along the feature dimension, forming the final representation of each character.
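
Two of the fusion options named above can be sketched as follows. The function names and the mixing weight `alpha` are hypothetical; the weighted sum additionally assumes N1 = N2 so that the two components have the same shape.

```python
import numpy as np

def fuse_concat(target_fwd, target_bwd):
    """Preferred option: concatenate along the feature axis, giving (L, N1 + N2)."""
    return np.concatenate([target_fwd, target_bwd], axis=1)

def fuse_weighted_sum(target_fwd, target_bwd, alpha=0.5):
    """Alternative: weighted sum with an assumed mixing weight; needs N1 == N2."""
    return alpha * target_fwd + (1.0 - alpha) * target_bwd

fwd = np.full((3, 2), 2.0)
bwd = np.full((3, 2), 4.0)
fused_cat = fuse_concat(fwd, bwd)        # shape (3, 4)
fused_sum = fuse_weighted_sum(fwd, bwd)  # shape (3, 2)
```

Concatenation restores the original feature width N, which is why it pairs naturally with the initial split.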

[0052] Step S700: Obtain the target sequence based on the character sequence and the target component.

[0053] In this embodiment, the target sequence is generated through an output layer based on the fused target components and the original character sequence. The output layer can be a softmax classifier used to predict the probability distribution of the characters output at each position.

[0054] Specifically, first, the target component is input into a linear transformation layer, which maps the N-dimensional complete feature of each character to the dimension of the target character candidate set; then, Softmax normalization is performed on the mapping result to obtain the probability of each target character in the candidate set corresponding to each character; finally, the candidate character with the highest probability for each character is selected and arranged in the character order of the input sequence to obtain the target sequence output by the model.
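
The three output steps (linear mapping, Softmax, selection of the most probable candidate) can be sketched as below. The names `decode`, `w_out` and `b_out`, as well as the presence of a bias term, are assumptions made for illustration.

```python
import numpy as np

def decode(target_component, w_out, b_out):
    """Map (L, N) features to candidate probabilities and pick the best per position."""
    logits = target_component @ w_out + b_out            # (L, V), V = candidate set size
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)            # Softmax over the candidates
    return probs, probs.argmax(axis=1)                   # indices kept in input order

features = np.array([[2.0, 0.0], [0.0, 3.0]])  # toy target component, L = 2, N = 2
probs, predicted = decode(features, np.eye(2), np.zeros(2))
```

The returned indices follow the order of the input positions, so mapping them back to candidate characters yields the target sequence.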

[0055] Specifically, the output projection matrix is the learnable parameter matrix of a linear layer. After model initialization, this matrix is updated progressively through the backpropagation training process, thereby achieving a linear transformation of the feature space. The sequence representation obtained from the aforementioned bidirectional feature fusion is multiplied by the output projection matrix to obtain the final output, whose shape is L×d. This output represents the contextual features of each sequence position and can be used for downstream tasks depending on the application scenario. For example, in a natural language processing scenario, the output can be input into a classification or prediction layer to generate translation results; in time series prediction scenarios, future values can be obtained from the output through further regression.

[0056] Furthermore, before each parameter update, the gradient norm (e.g., L2 norm) of the current batch is first calculated, and a threshold that dynamically changes with the training process is set. This threshold can be adaptively adjusted based on training epochs, historical gradient distribution, or changes in the loss function. This strategy aims to prevent numerical instability caused by gradient explosion or gradient imbalance in long-sequence modeling, improve the stability and convergence speed of model training, and ensure more balanced updates of the parameter matrices at different distance scales, thereby improving overall model performance. During the inference phase, a caching mechanism is used to reduce single-step complexity, supporting real-time sequence processing.
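
One possible reading of the dynamic threshold is gradient-norm clipping whose threshold follows a running average of past norms. The sketch below is only that: the application does not state how the threshold is applied or updated, so the rescaling rule, the class name `DynamicGradClipper` and the parameters `momentum` and `scale` are all assumptions.

```python
import math

class DynamicGradClipper:
    """Clips gradients to a threshold that tracks the history of gradient norms."""

    def __init__(self, init_threshold=1.0, momentum=0.9, scale=2.0):
        self.threshold = init_threshold  # current clipping threshold
        self.momentum = momentum         # smoothing factor for the norm history
        self.scale = scale               # threshold = scale * smoothed norm
        self.avg_norm = None

    def clip(self, grads):
        """Return (clipped gradients, original L2 norm) and update the threshold."""
        norm = math.sqrt(sum(g * g for g in grads))
        if norm > self.threshold:
            grads = [g * self.threshold / norm for g in grads]
        # Track the post-clipping norm so a single spike cannot inflate the threshold.
        seen = min(norm, self.threshold)
        if self.avg_norm is None:
            self.avg_norm = seen
        else:
            self.avg_norm = self.momentum * self.avg_norm + (1 - self.momentum) * seen
        self.threshold = self.scale * self.avg_norm
        return grads, norm

clipper = DynamicGradClipper(init_threshold=1.0)
clipped, norm = clipper.clip([3.0, 4.0])  # norm 5.0 exceeds the threshold 1.0
```

Other schedules mentioned in the text, such as adjusting the threshold by training epoch or by changes in the loss, would replace the running-average update.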

[0057] The above steps constitute the bidirectional hierarchical aggregation process of the sequence processing model. During the training phase, the target sequence output by the model is compared with the true label, and the loss function is calculated. Then, the gradient is calculated using the backpropagation algorithm, and the optimizer is used to update all learnable parameters in the model (such as the weight matrices involved in feature segmentation, aggregation, and fusion). By repeating the above training process, the model's parameters are continuously optimized, ultimately enabling it to accurately handle various sequence tasks.

[0058] In a specific embodiment, the input sequence is a 100×300 matrix, where L = 100 represents 100 time points (e.g., 100 words) in the sequence and d = 300 represents the 300 features describing each time point.

[0059] The 100×300 input X is equally divided along the feature dimension (300) to obtain two 100×150 matrices, where the forward component has a shape of 100×150 and the backward component has a shape of 100×150.

[0060] Given L = 100, log2(100) ≈ 6.64, so the loop iterates through i = 0, 1, 2, 3, 4, 5, 6. The corresponding jump distances are 2^0 = 1, 2^1 = 2, 2^2 = 4, 2^3 = 8, 2^4 = 16, 2^5 = 32, and 2^6 = 64.

[0061] Taking i=2 (i.e., distance 2²=4) as an example: the forward transmission (aggregating information from left to right) involves passing the information from position 1 to position 5, the information from position 2 to position 6, and so on, until the information from position 96 is passed to position 100. After the forward transmission is completed, each position in the forward component has incorporated information from all distances (1, 2, 4, 8, 16, 32, 64) to its left.

[0062] Similarly, backward propagation (aggregating information from right to left) involves passing the information from position 100 to position 96, from position 99 to position 95, and so on, until the information from position 5 is passed to position 1. After backward propagation, each position in the backward component incorporates information from all distances (1, 2, 4, 8, 16, 32, 64) to its right.
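
The numbers in this worked example can be reproduced with a few lines of Python. The variable names are illustrative, and positions are 1-based to match the example.

```python
L = 100

# Jump distances: every power of two smaller than the sequence length.
distances = []
i = 0
while 2 ** i < L:
    distances.append(2 ** i)
    i += 1

# Source -> target position pairs for the round with distance 4 (i = 2).
step = 4
forward_pairs = [(src, src + step) for src in range(1, L - step + 1)]
backward_pairs = [(src, src - step) for src in range(L, step, -1)]
```

Here `forward_pairs` runs from (1, 5) to (96, 100) and `backward_pairs` from (100, 96) to (5, 1), matching the transfers described above.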

[0063] This embodiment employs a unique bidirectional feature aggregation and fusion mechanism, enabling the model to simultaneously capture contextual information from the past and future of the sequence, effectively solving the problem of long-range dependency modeling. By dividing the feature dimension into two independent channels, forward and backward, the model can learn dependency patterns in different directions. Finally, through fusion, a richer and more comprehensive feature representation is obtained, improving the overall performance of the model. The feature aggregation mechanism used has a significantly lower computational complexity than the traditional Transformer, achieving efficient processing of long sequences and reducing resource consumption for training and inference. While ensuring efficient parallel computing, it achieves powerful modeling capabilities for long-range dependencies of sequences, providing an effective solution for industrial-scale sequence processing applications.

[0064] In some embodiments, the first feature aggregation process includes: for a first character in the character sequence, fusing the N1 feature dimensions of the first character and the N1 feature dimensions of a second character located in the forward propagation direction to obtain the N1 feature dimensions corresponding to the first character in the target forward component, wherein the second character is spaced n1 characters away from the first character, and n1 is a positive integer; and/or the second feature aggregation process includes: for a third character in the character sequence, fusing the N2 feature dimensions of the third character and the N2 feature dimensions of a fourth character located in the backward propagation direction to obtain the N2 feature dimensions corresponding to the third character in the target backward component, wherein the third character and the fourth character are separated by n2 characters, and n2 is a positive integer.

[0065] In this embodiment, in the first feature aggregation process, for the first character in the character sequence (e.g., the character at position i), its N1 feature dimensions are fused with the N1 feature dimensions of the second character located in the forward propagation direction (e.g., the character at position i+n1). The interval n1 can be dynamically adjusted according to the needs of the model design. For example, in hierarchical aggregation, n1 can be a power of 2 (e.g., 1, 2, 4, etc.) to gradually expand the receptive field.

[0066] In the second feature aggregation process, for the third character in the character sequence (e.g., the character at position j), its N2 feature dimensions are fused with the N2 feature dimensions of the fourth character (e.g., the character at position j-n2) located in the backward propagation direction. The fusion process is symmetrical to the forward aggregation, but the propagation direction is reversed. The interval n2 can also be set according to the sequence length and task requirements; for example, n2 can be the same as or different from n1 to flexibly adapt to different contexts.

[0067] In one implementation of this embodiment, n1 and n2 can take values from a predefined set, such as {1, 2, 4, 8, …}. The model is equipped with distance-dependent learnable parameter matrices for forward and backward propagation, respectively. The distance-related mapping parameters for each distance value (e.g., n1 = 2, n1 = 4) can be obtained from the corresponding learnable parameter matrix. For a given character, its final target feature is the sum of its fusion results with the characters at all of these distances. This aggregation strategy based on power-of-2 distances can transfer information over arbitrary distances using a logarithmic number of aggregation rounds, for an overall computational complexity of O(L log L), greatly improving the efficiency of long sequence processing.

[0068] This embodiment uses this interval fusion mechanism, which allows the model to directly capture dependencies across multiple characters when processing long sequences, avoiding the bottleneck of passing information layer by layer in traditional methods. Compared with global random aggregation, interval fusion only selects characters at a definite interval from the current character for fusion, reducing the introduction of irrelevant character features and making feature aggregation more efficient and better targeted.

[0069] In some embodiments, the first feature aggregation process is performed M times, where M is determined according to the length of the character sequence. During the i-th first feature aggregation process performed on the initial forward component of the character sequence according to the forward propagation direction, n1 is 2 to the power of i, and i is a natural number less than M; and/or the second feature aggregation process is performed M times, and during the j-th second feature aggregation process performed on the initial backward component of the character sequence according to the backward propagation direction, n2 is 2 to the power of j, and j is a natural number less than M.

[0070] In this embodiment, M represents the total number of times the first and second feature aggregation processes are performed. This number is directly determined by the length of the character sequence (denoted as L). Specifically, it is calculated by taking the logarithm of the sequence length L to the base 2 (log₂L); the integer part of this logarithm is then taken as the value of M (rounded down). For example, if the character sequence length L = 100, log₂100 ≈ 6.64, then M = 6. This keeps the number of aggregation processes extremely low (e.g., M is only 9 when L = 1000), significantly reducing computational complexity.
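
The rounding-down rule for M can be computed with integer arithmetic, which avoids floating-point rounding near exact powers of two. The function name `num_rounds` is hypothetical.

```python
def num_rounds(seq_len):
    """M = floor(log2(L)), computed from the bit length of L."""
    if seq_len < 1:
        raise ValueError("sequence length must be positive")
    return seq_len.bit_length() - 1

m_100 = num_rounds(100)    # 6, since 2**6 = 64 <= 100 < 128
m_1000 = num_rounds(1000)  # 9, since 2**9 = 512 <= 1000 < 1024
intervals_100 = [2 ** i for i in range(m_100)]  # intervals for i = 0 .. M-1
```

With i restricted to values below M, the largest interval used for L = 100 under this rule is 2^5 = 32.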

[0071] Specifically, the first feature aggregation process targets the initial forward component (each character contains N1 feature dimensions) and is executed in M rounds along the forward propagation direction. The interval of the i-th round of aggregation is n1 = 2^i (i is a natural number less than M, i.e., i = 0, 1, 2, ..., M-1). The steps of each round are as follows: First, based on the current round i, calculate n1 = 2^i (e.g., when i = 0, n1 = 2^0 = 1; when i = 1, n1 = 2^1 = 2; when i = 2, n1 = 2^2 = 4; and so on).

[0072] Secondly, for the first character at position k in the character sequence, k ≥ n1 + 1 must be satisfied (ensuring that there is a second character with an interval of n1 characters in the forward direction); the position of the second character is k - (n1 + 1), ensuring that there is an interval of n1 characters between the two. For example, when n1 = 1, the first character at k = 3 corresponds to the second character at k = 3 - (1 + 1) = 1.

[0073] Then, obtain the current forward features of the first character and the current forward features of the second character, and update the forward features of the first character through a fusion operation; for characters that do not satisfy k≥n1+1 (such as the first n1+1 characters of the sequence), their forward features remain unchanged (using the initial or previous round results).

[0074] Finally, following the order of i from 0 to M-1, M rounds of aggregation are performed sequentially. With each round of updating, the forward features of the first character progressively incorporate information from characters at intervals of 1, 2, 4, ..., 2^(M-1), and the target forward component is thereby obtained.

[0075] Similarly, the process of second feature aggregation can be obtained.

[0076] This embodiment uses M rounds of power-of-2 interval aggregation to cover dependencies at any distance in the sequence, since any integer distance can be decomposed into a sum of distinct powers of 2 (e.g., interval 5 = 4 + 1, interval 7 = 4 + 2 + 1), ensuring that the target component contains global association information between characters. The number of aggregations, M = log2(L) rounded down, grows very slowly with the sequence length, far more slowly than the number of steps required by linear aggregation, so ultra-long character sequences can be processed efficiently.
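
The decomposition argument corresponds to reading off the binary digits of the distance. A short sketch, with the hypothetical helper name `power_of_two_parts`:

```python
def power_of_two_parts(distance):
    """Return the distinct powers of two that sum to `distance`, largest first."""
    parts = []
    bit = 1
    while bit <= distance:
        if distance & bit:  # this power of two appears in the binary expansion
            parts.append(bit)
        bit <<= 1
    return parts[::-1]

parts_5 = power_of_two_parts(5)  # [4, 1]
parts_7 = power_of_two_parts(7)  # [4, 2, 1]
```

Each returned part corresponds to one aggregation round, so a dependency at distance 5 is bridged by the rounds with intervals 4 and 1.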

[0077] In some embodiments, during the i-th first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction: obtain N1 weight values corresponding to the first character from the L×N weight matrix corresponding to the i-th first feature aggregation process, wherein the L×N weight matrix corresponding to the i-th first feature aggregation process is obtained by performing three-dimensional reshaping processing on the input sequence, and L is the length of the character sequence; obtain the N1 feature dimensions of the first character and the N1 feature dimensions of the second character from the initial forward component; multiply the N1 weight values element-wise with the N1 feature dimensions of the first character to obtain N1 first target feature dimensions; add the corresponding elements of the N1 feature dimensions of the second character to the N1 first target feature dimensions to obtain the target forward component of the second character after the i-th first feature aggregation process; and/or, during the i-th second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction: obtain N2 weight values corresponding to the third character from the L×N weight matrix corresponding to the i-th second feature aggregation process, wherein the L×N weight matrix corresponding to the i-th second feature aggregation process is obtained by performing three-dimensional reshaping processing on the input sequence; obtain the N2 feature dimensions of the third character and the N2 feature dimensions of the fourth character from the initial backward component; multiply the N2 weight values element-wise with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions; add the corresponding elements of the N2 feature dimensions of the fourth character to the N2 second target feature dimensions to obtain the target backward component of the fourth character after the i-th second feature aggregation process.

[0078] In this embodiment, the model performs M iterations of first feature aggregation on the initial forward component. Taking the i-th iteration as an example, the model first generates a dynamic weight matrix for the i-th forward aggregation based on the original input sequence. The weight matrix is an L×N two-dimensional matrix. The input sequence (initially an L×N matrix) is reshaped into an L×M×(N/g) three-dimensional tensor, where M is the total number of aggregations and g is the number of groups (in this embodiment, g is 2). Then, through Softmax normalization and dimension transformation, the L×N weight matrix corresponding to the i-th aggregation is obtained. This matrix matches the character sequence length L and the total feature dimension N, and its weight values change dynamically with the content of the input sequence (different character sequences correspond to different weight matrices).

[0079] From the L×N weight matrix, obtain the N1 weight values corresponding to the first character (position k), denoted W1. From the initial forward component, extract the N1 feature dimensions of the first character (position k), denoted F1, and the N1 feature dimensions of the second character (position k-(n1+1), with interval n1 = 2^i), denoted F2. The weight values W1 are multiplied element-wise with the feature dimensions F1 of the first character to obtain the N1 first target feature dimensions (denoted as F1'). The core of this operation is to dynamically adjust the contribution of the first character's features through the weights (e.g., dimensions with higher weight values correspond to more important features and are strengthened; dimensions with lower weight values are weakened).

[0080] Finally, the N1 feature dimensions F2 of the second character are added element-wise to the first target feature dimension F1' to obtain the target forward component of the second character (denoted as F2') after the i-th first feature aggregation process. At this point, F2' retains the features of the second character itself and incorporates the weighted features of the first character, thus achieving targeted dependency capture.

[0081] Symmetrically to the forward process, the model performs the second feature aggregation process on the initial backward component. Taking the i-th iteration as an example, the model first generates, based on the original input sequence, a dynamic weight matrix for the i-th backward aggregation. This weight matrix is also an L×N two-dimensional matrix, generated by performing three-dimensional reshaping, Softmax normalization, and dimension transformation on the input sequence. It matches the character sequence length L and the total feature dimension N, and changes dynamically with the input sequence.

[0082] From the L×N weight matrix, obtain the N2 weight values corresponding to the third character (position m), denoted W2, one for each of the N2 feature dimensions. From the initial backward component, extract the N2 feature dimensions of the third character (position m), denoted F3, and the N2 feature dimensions of the fourth character (position m + (n2 + 1), with interval n2 = 2^i), denoted F4. The weight values W2 are multiplied element-wise with the feature dimensions F3 of the third character to obtain the N2 second target feature dimensions (denoted as F3').

[0083] Finally, by symmetry with the forward process, the feature dimensions F4 of the fourth character are added element-wise to the second target feature dimensions F3' to obtain the target backward component of the fourth character (denoted as F4') after the i-th second feature aggregation process.

[0084] This embodiment introduces a weight matrix dynamically generated based on the input sequence content. The model no longer mechanically and equally transmits information. Instead, it intelligently determines, based on context, which feature dimensions of the source character are important for the current information transmission task, and with what strength the information from the source character should be transmitted to the target character. This adaptive mechanism greatly enhances the model's expressive power, enabling it to handle more complex and subtle sequence dependencies.

[0085] In some embodiments, the three-dimensional reshaping process includes: performing a linear transformation on the input sequence to obtain an intermediate representation; converting the intermediate representation into a three-dimensional tensor to obtain a three-dimensional structure whose dimensions are the length L of the character sequence, the number M, and the N feature dimensions of each character; and normalizing the three-dimensional structure to obtain M weight matrices of size L×N.

[0086] In this embodiment, the input sequence is the feature matrix corresponding to the character sequence, with dimensions L×N (L is the length of the character sequence, N is the total feature dimension of each character, N = N1 + N2). A learnable weight matrix is used to perform a linear transformation on the input sequence. This transformation maps the input sequence to a new feature space, generating an intermediate representation. The transformation aims to extract the latent features of the input sequence in preparation for subsequent processing.

[0087] The intermediate representation is reshaped from a two-dimensional matrix into a three-dimensional tensor. The reshaped dimensions include the sequence length (L), the logarithmic distance scale (the logarithm of the sequence length to base 2), and the group feature dimension (N). This reshaping operation introduces the distance scale dimension, enabling the model to handle dependencies at different distances hierarchically.

[0088] The reshaped 3D tensor is then subjected to Softmax normalization, typically along the distance scale dimension. This step ensures that each element of the resulting dynamic weight matrices lies between 0 and 1 and that the weights sum to 1 along the normalized dimension, so that they can be interpreted as probabilistic weights. The dynamic weight matrices reflect the intensity of information interaction at different positions and distance scales in the input sequence.
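
The three steps (linear transformation, three-dimensional reshaping, Softmax along the distance scale dimension) can be sketched as follows. The sketch assumes that the linear layer outputs M·N values per position so that the reshape to L×M×N is possible; the names `dynamic_weights` and `w_lin` and the random toy data are hypothetical.

```python
import numpy as np

def dynamic_weights(x, w_lin, num_rounds):
    """Return an array of shape (M, L, N): one L x N weight matrix per round.

    x:     (L, N) input sequence
    w_lin: (N, M * N) assumed weights of the linear layer
    """
    seq_len, n = x.shape
    hidden = x @ w_lin                             # intermediate representation, (L, M*N)
    cube = hidden.reshape(seq_len, num_rounds, n)  # 3D reshape to (L, M, N)
    cube = cube - cube.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(cube)
    soft = e / e.sum(axis=1, keepdims=True)        # Softmax along the distance-scale axis
    return np.transpose(soft, (1, 0, 2))           # reorder to (M, L, N)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))           # toy input with L = 5, N = 4
w_lin = rng.normal(size=(4, 3 * 4))   # M = 3 rounds
weights = dynamic_weights(x, w_lin, 3)
```

For any fixed position and feature, the weights across the M rounds sum to 1, which is the property described above when normalizing along the distance scale dimension.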

[0089] This embodiment introduces a content-adaptive weighting mechanism through a dynamic weight matrix. The weight matrix is generated from the input sequence through a linear transformation and is strongly correlated with the sequence content (such as character semantics and position). Different sequences correspond to different weights, thus solving the problem of poor adaptability of fixed weights.

[0090] In some embodiments, the step of element-wise multiplying the N1 weight values with the N1 feature dimensions of the first character to obtain the N1 first target feature dimensions includes: Obtain the first parameter matrix corresponding to the first feature aggregation process. The first parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character. Obtain the N1 first parameters corresponding to the first character in the first parameter matrix; The N1 feature dimensions of the first character, the N1 weight values, and the N1 first parameters are multiplied element-wise to obtain the N1 first target feature dimensions. And / or, The step of element-wise multiplying the N2 weight values with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions includes: Obtain the second parameter matrix corresponding to the second feature aggregation process. The second parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character. Obtain the N2 second parameters in the second parameter matrix that correspond to the third character; The N2 feature dimensions of the third character, the N2 weight values, and the N2 second parameters are multiplied element-wise to obtain the N2 second target feature dimensions.

[0091] In this embodiment, in the i-th first feature aggregation process, the N1 first target feature dimensions are obtained by multiplying the N1 feature dimensions of the first character, the corresponding N1 weight values, and the corresponding N1 first parameters element by element.

[0092] First, obtain the first parameter matrix. The first parameter matrix is a distance-related mapping parameter automatically learned by the model during training. Its structure is L×M×N, providing a fixed, learnable set of parameters for each position and each aggregation in the sequence. Initially, this matrix contains random values, which are adapted to task requirements through model training (iterative optimization by backpropagation). From the L×M×N first parameter matrix, locate the i-th aggregation (corresponding to the i-th position in the M dimension), the first character (position k, corresponding to the k-th position in the L dimension), and the N1 forward feature dimensions (corresponding to the first N1 positions in the N dimension), extracting N1 first parameters. Then, from the initial forward component, obtain the N1 feature dimensions of the first character (source character); and from the dynamic weight matrix, obtain the N1 weight values corresponding to the first character.

[0093] Then, the N1 feature dimensions (source features) of the first character, the N1 weight values (dynamic weights) corresponding to the first character obtained from the dynamic weight matrix, and the N1 first parameters (static parameters) corresponding to the first character in the first parameter matrix are multiplied element by element to obtain the N1 first target feature dimensions.

[0094] Finally, from the initial forward component, obtain the N1 feature dimensions of the second character (target character), add the weighted source features (i.e., the N1 first target feature dimensions) to the original features of the target character element by element, and obtain the target forward component of the second character after the i-th first feature aggregation process.

[0095] Similarly, the processing of the second feature aggregation (backward propagation) is completely symmetrical to the forward propagation process, and will not be elaborated here.
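One round of the forward aggregation described in [0091] to [0094] can be sketched as below. This is a hedged illustration only: the function name `forward_step` is hypothetical, the weight and parameter tensors are assumed to be already sliced to the N1 forward dimensions, and all target characters are assumed to be updated in parallel from the values held before the round. The backward round would mirror it with the slices reversed.

```python
import numpy as np

def forward_step(fwd, weights, params, i):
    """One forward aggregation round with stride 2**i.

    fwd:     (L, N1) forward component before the round
    weights: (L, M, N1) dynamic weights (content dependent)
    params:  (L, M, N1) static learned parameters
    """
    stride = 2 ** i
    out = fwd.copy()
    # Source character k contributes feature * dynamic weight * static parameter ...
    msg = fwd[:-stride] * weights[:-stride, i, :] * params[:-stride, i, :]
    # ... which is added element-wise to target character k + stride.
    out[stride:] += msg
    return out

fwd = np.ones((4, 2))
weights = np.full((4, 2, 2), 0.5)
params = np.full((4, 2, 2), 2.0)
out0 = forward_step(fwd, weights, params, 0)   # stride 1
out1 = forward_step(fwd, weights, params, 1)   # stride 2
```

In this toy input every message equals 1 × 0.5 × 2.0 = 1, so each character that has a source at the given stride ends the round at 2 while the leading characters keep their original value of 1.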

[0096] This embodiment introduces a static parameter matrix, making the model's information transmission mechanism more refined and powerful. The dynamic weight matrix determines the intensity of information transmission based on the specific content of the current input sequence, exhibiting high flexibility. The static parameter matrix performs a linear transformation on the feature representation of the sequence, aiming to enhance its expressive power and adapt to information transmission patterns at different distance scales, determining how information is transformed. This dynamic-static combination design allows the model to adapt to changing inputs while utilizing stable patterns learned during training, thereby improving the model's generalization ability and robustness.

[0097] In some embodiments, the target parameter matrix is obtained according to the following process: Obtain the initial parameter matrix randomly generated by the sequence processing model. The initial parameter matrix includes the length L of the character sequence and N feature dimensions of each character. Obtain a training sample set, which includes training samples and the true label of each training sample; The training sample set is input into the sequence processing model, which then calculates the prediction result using the initial parameter matrix through target feature aggregation processing. Based on the prediction results and the true labels, the loss function value is obtained; Based on the loss function value, the gradient of the initial parameter matrix is calculated using the backpropagation algorithm; Based on the gradient, the values of the initial parameter matrix are adjusted to obtain a temporary parameter matrix; The temporary parameter matrix is used as the initial parameter matrix. The process jumps to the step of inputting the training sample set into the sequence processing model, and the sequence processing model calculates the prediction result by using the initial parameter matrix through target feature aggregation processing, until the iteration stopping condition is met. The last obtained temporary parameter matrix is used as the target parameter matrix; Wherein, when the target parameter matrix is the first parameter matrix, the target feature aggregation processing adopts the first feature aggregation processing; when the target parameter matrix is the second parameter matrix, the target feature aggregation processing adopts the second feature aggregation processing.

[0098] In this embodiment, all learnable parameter matrices in the sequence processing model need to be initialized before training begins. These parameters include, but are not limited to: the first parameter matrix and the second parameter matrix, the linear transformation matrix used to generate dynamic weights, and all weight matrices involved in feature segmentation, fusion, output projection, and other processes. The initial values of these parameter matrices are typically generated randomly, for example, by sampling from a specific normal or uniform distribution. These randomly generated initial matrices are called the initial parameter matrices. Model training is an iterative optimization process aimed at adjusting the parameter matrices so that the model's predictions are as close as possible to the true labels.

[0099] First, a training sample set is obtained, containing a large number of training samples (e.g., pairs of source language sentences and target language sentences) and the corresponding ground truth label for each training sample. The training samples are then input into the sequence processing model, which uses the current initial parameter matrix (or a temporary parameter matrix during iteration) and calculates the prediction result through target feature aggregation. If the first parameter matrix is being trained (the target parameter matrix is the first parameter matrix), the target feature aggregation is the first feature aggregation; if the second parameter matrix is being trained (the target parameter matrix is the second parameter matrix), the target feature aggregation is the second feature aggregation. The target forward / backward components are obtained after target feature aggregation, fused, and then used to generate the prediction result through the output layer.

[0100] The model's predictions are compared to the true labels, and the difference between the two is quantified using a predefined loss function. A larger loss function value indicates a worse current model performance. Based on the calculated loss function value, the gradient of the loss function with respect to each parameter in the model is calculated using the backpropagation algorithm. The gradient indicates the direction and magnitude by which each parameter should be adjusted to reduce the loss.

[0101] Adjust the values of the initial parameter matrix based on the gradient to obtain a temporary parameter matrix. This adjustment can be efficiently achieved using an optimizer (such as Adam or SGD). Use the obtained temporary parameter matrix as the new initial parameter matrix and repeat the above steps until a preset iteration stopping condition is met. The stopping condition may be: reaching a preset maximum number of training epochs, the loss function value no longer decreasing on the validation set, or the model performance reaching a preset metric.

[0102] When the training process stops, the temporary parameter matrix obtained from the last iteration is saved and becomes the target parameter matrix of the model. If the first parameter matrix is trained, the target parameter matrix is used for the first feature aggregation process (adjusting the forward features); if the second parameter matrix is trained, the target parameter matrix is used for the second feature aggregation process (adjusting the backward features).

[0103] This embodiment enables the parameter matrix to capture task-specific patterns through iterative learning of the training sample set, thereby improving the model's prediction accuracy in specific scenarios. It also allows the model to automatically discover the optimal information delivery strategy in the data without human intervention, thus achieving powerful performance.
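The iterative procedure of [0097] to [0102] can be illustrated with a deliberately small example. The sketch below is not the training procedure of the application: the forward pass is reduced to an element-wise product standing in for the target feature aggregation, the gradient is written out by hand instead of being produced by backpropagation, and the names `train_parameter_matrix`, `lr`, `max_iters`, and `tol` are hypothetical.

```python
import numpy as np

def train_parameter_matrix(x, y, lr=0.5, max_iters=500, tol=1e-10, seed=0):
    """Fit an L×N parameter matrix by plain gradient descent on a toy model."""
    rng = np.random.default_rng(seed)
    params = rng.normal(size=x.shape)        # randomly generated initial parameter matrix
    loss = np.inf
    for _ in range(max_iters):               # stop 1: maximum number of iterations
        pred = x * params                    # stand-in for the target feature aggregation
        err = pred - y
        loss = 0.5 * float(np.sum(err ** 2)) # loss function value (squared error)
        if loss < tol:                       # stop 2: loss has reached a preset level
            break
        grad = err * x                       # gradient of the loss w.r.t. params
        params = params - lr * grad          # temporary matrix becomes the new initial matrix
    return params, loss

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 1.5, size=(4, 3))       # toy training inputs
true_params = rng.normal(size=(4, 3))
y = x * true_params                          # toy "true labels"
learned, final_loss = train_parameter_matrix(x, y)
```

The matrix returned when the loop stops plays the role of the target parameter matrix; in practice an optimizer such as Adam or SGD and automatic differentiation would replace the hand-written update.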

[0104] In some embodiments, after step S700, the following steps may also be included, but are not limited to: The initial loss function value is calculated based on the target sequence. Based on the initial loss function value, the initial gradients of each parameter in the sequence processing model are calculated using the backpropagation algorithm. Calculate the gradient norm of the initial gradient of each of the parameters; Obtain the gradient norm, historical gradient norm, and the changing trend of historical loss function values; The gradient clipping threshold is dynamically adjusted based on the gradient norm, historical gradient norm, and the changing trend of historical loss function values. The gradient norm is compared with the gradient clipping threshold. If the gradient norm exceeds the gradient clipping threshold, the initial gradient of each parameter is proportionally reduced to the range of the gradient clipping threshold to obtain the target gradient of each parameter. The model parameters in the sequence processing model are updated based on the target gradients of each parameter.

[0105] In this embodiment, the initial loss function value is calculated based on the target sequence output by the model and the true labels in the training samples. Based on this loss value, the initial gradients of all learnable parameters in the model are calculated using the standard backpropagation algorithm. To measure the overall magnitude of the gradients, the gradient norm of the initial gradients of all parameters is calculated. The most commonly used is the L2 norm (i.e., the Euclidean norm), which is the square root of the sum of the squares of the gradients of all parameters.

[0106] To dynamically set the clipping threshold, the model needs to refer to historical training states. This includes historical gradient norms (the sequence of gradient norms calculated over the past K iterations) and the trend of historical loss function values (whether the loss value has shown a downward, oscillating, or upward trend over the past K iterations).

[0107] The model sets a gradient clipping threshold based on current and historical information. This threshold is not fixed but dynamically adjusted according to the real-time state of the training process. Specifically, it can be adjusted based on the current training epoch, the historical gradient distribution, and the changing trend of the loss function. For example, in the early stages, when the model needs to quickly learn basic dependencies, the threshold can be appropriately relaxed (allowing for larger gradient updates); in later stages, when fine-tuning is needed, the threshold is tightened (to suppress large fluctuations). If the gradient norm has been continuously increasing over the past K iterations (e.g., K = 5), it indicates that gradient explosion may be approaching, and the current threshold needs to be lowered (e.g., set to 0.8 times the historical maximum value); if historical gradients are stable, the threshold can be maintained or slightly increased. If the loss decreases rapidly, it indicates that the current gradient is effective, and the threshold can be relaxed to accelerate convergence; if the loss oscillates (e.g., successive changes alternate between increases and decreases), the threshold needs to be lowered to stabilize training.

[0108] The current gradient norm is compared with the dynamic gradient clipping threshold. If the gradient norm is less than or equal to the threshold, the gradient magnitude is within a reasonable range and clipping is unnecessary. The initial gradient becomes the target gradient. If the gradient norm is greater than the threshold, gradient explosion has occurred, and clipping is required. The initial gradients of all parameters are proportionally reduced until the overall gradient norm equals the threshold.

[0109] Finally, the clipped (or unclipped) target gradient is used to update all model parameters in the sequence processing model via an optimizer (such as Adam).
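The dynamic clipping steps of [0104] to [0109] can be sketched as follows. The 0.8 factor on the historical maximum comes from the example in [0107]; the relax and tighten factors of 1.2 and 0.8 for the loss trend, the base threshold, and the function names `dynamic_threshold` and `clip_gradients` are assumptions chosen only to make the example concrete.

```python
import numpy as np

def dynamic_threshold(base, grad_norms, losses, k=5):
    """Heuristic clipping threshold from the last k gradient norms and loss values."""
    recent_norms = list(grad_norms)[-k:]
    recent_losses = list(losses)[-k:]
    threshold = base
    rising = all(b > a for a, b in zip(recent_norms, recent_norms[1:]))
    if len(recent_norms) >= 2 and rising:
        # Norms rising steadily: tighten to 0.8 x the historical maximum.
        threshold = min(threshold, 0.8 * max(recent_norms))
    if len(recent_losses) >= 3:
        diffs = np.diff(recent_losses)
        if np.all(diffs < 0):
            threshold *= 1.2   # loss falling steadily: relax (assumed factor)
        elif np.any(diffs > 0) and np.any(diffs < 0):
            threshold *= 0.8   # loss oscillating: tighten (assumed factor)
    return threshold

def clip_gradients(grads, threshold):
    """Scale all gradients proportionally so the global L2 norm is at most threshold."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total <= threshold:
        return [g.copy() for g in grads], total   # within range: no clipping
    scale = threshold / total
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]            # global L2 norm is 13
thr = dynamic_threshold(10.0, [2.0, 2.0, 2.0], [1.0, 2.0, 1.0, 2.0])
clipped, norm_before = clip_gradients(grads, thr)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Because every gradient is multiplied by the same scale, the direction of the overall update is preserved and only its magnitude is limited.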

[0110] This embodiment introduces a dynamic gradient clipping mechanism to ensure that the gradient norm does not exceed a threshold, thus avoiding model oscillations caused by gradient explosion. At the same time, it uses historical information to avoid gradient vanishing caused by over-clipping, balancing update magnitude and stability. The dynamic threshold increases when the loss decreases rapidly (preserving effective gradients) and decreases when the loss stagnates (suppressing ineffective oscillations), making parameter updates more accurate and shortening the convergence period.

[0111] The embodiments of this application reduce the computational complexity of long sequence processing from the traditional O(L²) to O(L log L) through an iterative aggregation mechanism, achieving efficient processing of ultra-long sequences. At the same time, through bidirectional feature fusion and a weighting mechanism combining static and dynamic elements, the model can adaptively capture complex long-range dependencies, improving modeling accuracy and expressive power. Furthermore, the introduced dynamic gradient clipping strategy addresses the gradient explosion problem during training, supporting stable model training and rapid convergence. While maintaining low computational cost and high parallelism, the embodiments of this application achieve adaptive and stable modeling of long sequence data, providing a technical basis for large-scale industrial sequence applications.
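The complexity figures above can be checked with a short operation count. The derivation below is a sketch under the assumption that M = ⌈log₂ L⌉ aggregation rounds are performed and that each round does a constant amount of element-wise work per character and per feature dimension; the sequence length L = 4096 is an arbitrary illustrative value.

```latex
\begin{align*}
\text{pairwise interaction:}\quad & C_{\text{pair}} = O(L^{2} N) \\
\text{hierarchical aggregation:}\quad & C_{\text{agg}} = \sum_{i=0}^{M-1} O(L N) = O(L N \log_{2} L),
  \qquad M = \lceil \log_{2} L \rceil \\
\text{for } L = 4096:\quad & L^{2} = 16\,777\,216, \qquad L \log_{2} L = 4096 \times 12 = 49\,152 \\
& \frac{L^{2}}{L \log_{2} L} = \frac{4096}{12} \approx 341
\end{align*}
```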

[0112] Please see Figure 2. This application also provides a model training apparatus 800, which can implement the above-described model training method. The apparatus includes: The acquisition module 10, used to acquire an input sequence, which includes a character sequence consisting of multiple characters and N feature dimensions for describing each character; The input module 20, used to input the input sequence into the sequence processing model to obtain the target sequence output by the sequence processing model; The sequence processing model performs the following process: The N feature dimensions of each character are uniformly segmented to obtain the initial forward component and the initial backward component of the character sequence. The initial forward component includes N1 feature dimensions of each character, and the initial backward component includes N2 feature dimensions of each character. The sum of N1 and N2 is N, where N is an integer greater than or equal to 2. The initial forward component of the character sequence is subjected to a first feature aggregation process according to the forward propagation direction to obtain the target forward component of the character sequence, wherein the forward propagation direction is the propagation direction from the first character to the last character of the character sequence; The initial backward component of the character sequence is subjected to a second feature aggregation process according to the backward propagation direction to obtain the target backward component of the character sequence, wherein the backward propagation direction is the propagation direction from the last character to the first character of the character sequence; The target forward component and the target backward component of the character sequence are fused to obtain the target component of the character sequence. The target sequence is obtained based on the character sequence and the target component.

[0113] In some implementations, the first feature aggregation process includes: for the first character in the character sequence, fusing the N1 feature dimensions of the first character and the N1 feature dimensions of the second character located in the forward propagation direction to obtain the N1 feature dimensions corresponding to the first character in the target forward component, wherein the second character is spaced n1 characters away from the first character, and n1 is a positive integer; And / or, The second feature aggregation process includes: for the third character in the character sequence, fusing the N2 feature dimensions of the third character and the N2 feature dimensions of the fourth character located in the backward propagation direction to obtain the N2 feature dimensions corresponding to the third character in the target backward component, wherein the third character and the fourth character are separated by n2 characters, where n2 is a positive integer.

[0114] In some implementations, the number of times the first feature aggregation process is M, where M is determined according to the length of the character sequence. In the process of performing the i-th first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction, n1 is 2 to the power of i, and i is a natural number less than M. And / or, The second feature aggregation process is performed M times. During the j-th second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction, n2 is 2 to the power of j, and j is a natural number less than M.
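The stride schedule n1 = 2^i and n2 = 2^j can be illustrated with an unweighted version of the two aggregations. The sketch below makes several assumptions that are not taken from the application: the weights and parameters are set to 1, M is taken as ⌈log₂ L⌉, each round reads the values held before that round, and the fusion is a simple concatenation. Under these assumptions the forward pass reduces to a parallel prefix sum, which shows that M rounds are enough for every character to receive information from all preceding characters.

```python
import numpy as np

def aggregate(component, num_rounds, direction):
    """Run num_rounds aggregation rounds with strides 1, 2, 4, ... (weights fixed to 1)."""
    out = component.copy()
    for i in range(num_rounds):
        stride = 2 ** i
        prev = out.copy()                    # every target reads pre-round values
        if direction == "forward":
            out[stride:] += prev[:-stride]   # character k feeds character k + stride
        else:
            out[:-stride] += prev[stride:]   # character k feeds character k - stride
    return out

def bidirectional(x, n1):
    """Split N features into N1 forward and N2 backward dims, aggregate, then fuse."""
    L = x.shape[0]
    M = int(np.ceil(np.log2(L)))             # assumed: M = ceil(log2(L))
    fwd = aggregate(x[:, :n1], M, "forward")
    bwd = aggregate(x[:, n1:], M, "backward")
    return np.concatenate([fwd, bwd], axis=1)  # fusion by concatenation (assumption)

x = np.arange(16, dtype=float).reshape(8, 2)
target = bidirectional(x, n1=1)
```

With L = 8 only three rounds (strides 1, 2, and 4) are executed, yet the forward column of `target` equals the running sum from the first character and the backward column equals the running sum from the last character.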

[0115] In some implementations, during the i-th first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction: Obtain N1 weight values corresponding to the first character from the L×N weight matrix corresponding to the i-th first feature aggregation process, wherein the L×N weight matrix corresponding to the i-th first feature aggregation process is obtained by performing three-dimensional reshaping processing on the input sequence, and L is the length of the character sequence. Obtain N1 feature dimensions of the first character and N1 feature dimensions of the second character from the initial forward component; Multiply the N1 weight values element-wise with the N1 feature dimensions of the first character to obtain the N1 first target feature dimensions; Add the corresponding elements of the N1 feature dimensions of the second character to the N1 first target feature dimensions to obtain the target forward component of the second character after the i-th first feature aggregation process; And / or, During the i-th second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction: Obtain N2 weight values corresponding to the third character from the L×N weight matrix corresponding to the i-th second feature aggregation process, wherein the L×N weight matrix corresponding to the i-th second feature aggregation process is obtained by performing three-dimensional reshaping processing on the input sequence; Obtain the N2 feature dimensions of the third character and the N2 feature dimensions of the fourth character from the initial backward component; Multiply the N2 weight values element-wise with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions; Add the corresponding elements of the N2 feature dimensions of the fourth character to the N2 second target feature dimensions to obtain the target backward component of the fourth character after the i-th second feature aggregation process.

[0116] In some implementations, the three-dimensional reshaping process includes: The input sequence is subjected to a linear transformation to obtain an intermediate representation; The intermediate representation is converted into a three-dimensional tensor to obtain a three-dimensional structure, which includes the length L, the number M, and N feature dimensions of each character sequence. The three-dimensional structure is normalized to obtain M L×N weight matrices.

[0117] In some implementations, the step of element-wise multiplying the N1 weight values with the N1 feature dimensions of the first character to obtain the N1 first target feature dimensions includes: Obtain the first parameter matrix corresponding to the first feature aggregation process. The first parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character. Obtain the N1 first parameters corresponding to the first character in the first parameter matrix; The N1 feature dimensions of the first character, the N1 weight values, and the N1 first parameters are multiplied element-wise to obtain the N1 first target feature dimensions. And / or, The step of element-wise multiplying the N2 weight values with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions includes: Obtain the second parameter matrix corresponding to the second feature aggregation process. The second parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character. Obtain the N2 second parameters in the second parameter matrix that correspond to the third character; The N2 feature dimensions of the third character, the N2 weight values, and the N2 second parameters are multiplied element-wise to obtain the N2 second target feature dimensions.

[0118] In some implementations, the target parameter matrix is obtained according to the following process: Obtain the initial parameter matrix randomly generated by the sequence processing model. The initial parameter matrix includes the length L of the character sequence and N feature dimensions of each character. Obtain a training sample set, which includes training samples and the true label of each training sample; The training sample set is input into the sequence processing model, which then calculates the prediction result using the initial parameter matrix through target feature aggregation processing. Based on the prediction results and the true labels, the loss function value is obtained; Based on the loss function value, the gradient of the initial parameter matrix is calculated using the backpropagation algorithm; Based on the gradient, the values of the initial parameter matrix are adjusted to obtain a temporary parameter matrix; The temporary parameter matrix is used as the initial parameter matrix. The process jumps to the step of inputting the training sample set into the sequence processing model, and the sequence processing model calculates the prediction result by using the initial parameter matrix through target feature aggregation processing, until the iteration stopping condition is met. The last obtained temporary parameter matrix is used as the target parameter matrix; Wherein, when the target parameter matrix is the first parameter matrix, the target feature aggregation processing adopts the first feature aggregation processing; when the target parameter matrix is the second parameter matrix, the target feature aggregation processing adopts the second feature aggregation processing.

[0119] In some embodiments, the apparatus further includes: The first calculation module is used to calculate the initial loss function value based on the target sequence; The second calculation module is used to calculate the initial gradient of each parameter in the sequence processing model based on the initial loss function value using the backpropagation algorithm. The third calculation module is used to calculate the gradient norm of the initial gradient of each parameter; The second acquisition module is used to acquire the gradient norm, historical gradient norm, and the changing trend of historical loss function values; The setting module is used to set a dynamically adjusted gradient clipping threshold based on the gradient norm, historical gradient norm, and the changing trend of historical loss function values. The comparison module is used to compare the gradient norm with the gradient clipping threshold. If the gradient norm exceeds the gradient clipping threshold, the initial gradient of each parameter is proportionally reduced to the range of the gradient clipping threshold to obtain the target gradient of each parameter. The update module is used to update the model parameters in the sequence processing model according to the target gradient of each parameter.

[0120] The specific implementation of this model training device is basically the same as the specific implementation of the model training method described above, and will not be repeated here.

[0121] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described model training method. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.

[0122] Please see Figure 3, which illustrates the hardware structure of an electronic device according to another embodiment. The electronic device includes: The processor 801, which can be implemented using a general-purpose central processing unit (CPU), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application. The memory 802, which can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 802 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 802 and is called by the processor 801 to execute the model training method of the embodiments of this application. The input / output interface 803 is used to implement information input and output. The communication interface 804 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, Wi-Fi, Bluetooth, etc.). The bus 805 transmits information between various components of the device (e.g., processor 801, memory 802, input / output interface 803, and communication interface 804); The processor 801, memory 802, input / output interface 803, and communication interface 804 are connected to each other within the device via the bus 805.

[0123] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described model training method.

[0124] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0125] The model training method, model training device, electronic device, and storage medium provided in this application embodiment uniformly divide the N feature dimensions of each character in the input sequence into initial forward and initial backward components, and perform hierarchical feature aggregation along the forward and backward propagation directions respectively. Finally, the bidirectional aggregation results are fused to generate the target sequence. This achieves effective capture of long-term dependencies in the sequence while reducing computational complexity, significantly improving the efficiency and accuracy of sequence processing. Through an innovative bidirectional feature aggregation mechanism, this application embodiment enables each character to obtain complete contextual information while maintaining the model's parallel computing capabilities, thereby achieving more accurate sequence conversion and generation in natural language processing tasks such as machine translation and text generation.

[0126] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0127] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.

[0128] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0129] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.

[0130] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0131] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And/or" describes the relationship between related objects and indicates that three relationships can exist. For example, "A and/or B" covers three cases: only A exists, only B exists, and both A and B exist, where A and B can each be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one (item) of the following" and similar expressions refer to any combination of the listed items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can each be single or multiple.

[0132] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division into units is only a logical functional division, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.

[0133] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0134] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0135] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0136] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.

Claims

1. A model training method, characterized in that the method comprises: obtaining an input sequence, which includes a character sequence consisting of multiple characters and N feature dimensions describing each character; and inputting the input sequence into a sequence processing model to obtain a target sequence output by the sequence processing model; wherein the sequence processing model performs the following process: uniformly segmenting the N feature dimensions of each character to obtain an initial forward component and an initial backward component of the character sequence, wherein the initial forward component includes N1 feature dimensions of each character, the initial backward component includes N2 feature dimensions of each character, the sum of N1 and N2 is N, and N is an integer greater than or equal to 2; performing a first feature aggregation process on the initial forward component of the character sequence according to a forward propagation direction to obtain a target forward component of the character sequence, wherein the forward propagation direction is the propagation direction from the first character to the last character of the character sequence; performing a second feature aggregation process on the initial backward component of the character sequence according to a backward propagation direction to obtain a target backward component of the character sequence, wherein the backward propagation direction is the propagation direction from the last character to the first character of the character sequence; fusing the target forward component and the target backward component of the character sequence to obtain a target component of the character sequence; and obtaining the target sequence based on the character sequence and the target component.
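The process recited in claim 1 can be pictured with a minimal Python sketch. It is illustrative only and not part of the claims: the function names, the half-and-half split, the single unweighted aggregation step with an offset of one character, the concatenation used for fusion, and the residual addition used to form the target sequence are all assumptions, because claim 1 does not fix any of these details.

```python
from typing import List, Tuple

Vector = List[float]
Sequence = List[Vector]  # one feature vector of length N per character


def split_features(seq: Sequence) -> Tuple[Sequence, Sequence]:
    """Split each character's N features into N1 forward and N2 backward features."""
    n1 = len(seq[0]) // 2  # assumption: "uniform" segmentation means halving
    return [v[:n1] for v in seq], [v[n1:] for v in seq]


def aggregate(component: Sequence, offset: int, forward: bool) -> Sequence:
    """One unweighted aggregation step (assumed form).

    Forward: each character adds the features of the character `offset`
    positions before it. Backward: it adds those `offset` positions after it.
    """
    length = len(component)
    out = [list(v) for v in component]
    for t in range(length):
        src = t - offset if forward else t + offset
        if 0 <= src < length:
            out[t] = [a + b for a, b in zip(component[t], component[src])]
    return out


def fuse(fwd: Sequence, bwd: Sequence) -> Sequence:
    """Assumed fusion: concatenate the two target components per character."""
    return [f + b for f, b in zip(fwd, bwd)]


def process(seq: Sequence) -> Sequence:
    fwd, bwd = split_features(seq)
    target_fwd = aggregate(fwd, offset=1, forward=True)
    target_bwd = aggregate(bwd, offset=1, forward=False)
    target_component = fuse(target_fwd, target_bwd)
    # Assumed residual combination of the input and the target component.
    return [[x + c for x, c in zip(v, tc)] for v, tc in zip(seq, target_component)]
```

For a three-character input with N = 2, the first feature of each character only sees earlier characters and the second feature only sees later ones, which is the bidirectional split the claim describes.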

2. The method according to claim 1, characterized in that the first feature aggregation process includes: for a first character in the character sequence, fusing the N1 feature dimensions of the first character and the N1 feature dimensions of a second character located in the forward propagation direction to obtain the N1 feature dimensions corresponding to the first character in the target forward component, wherein the second character is spaced n1 characters away from the first character, and n1 is a positive integer; and/or the second feature aggregation process includes: for a third character in the character sequence, fusing the N2 feature dimensions of the third character and the N2 feature dimensions of a fourth character located in the backward propagation direction to obtain the N2 feature dimensions corresponding to the third character in the target backward component, wherein the fourth character is spaced n2 characters away from the third character, and n2 is a positive integer.

3. The method according to claim 2, characterized in that the first feature aggregation process is performed M times, where M is determined based on the length of the character sequence; during the i-th first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction, n1 is 2 to the power of i, where i is a natural number less than M; and/or the second feature aggregation process is performed M times; during the j-th second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction, n2 is 2 to the power of j, where j is a natural number less than M.
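The doubling offsets in claim 3 can be illustrated with a short Python sketch. Two assumptions are made that the claim does not state: M is taken as the smallest integer with 2^M ≥ L, and each aggregation is a plain unweighted sum over scalar features. Under those assumptions the M passes behave like a doubling-offset parallel prefix sum, so every character ends up holding information from all characters before it (forward) or after it (backward).

```python
from typing import List


def rounds_for(length: int) -> int:
    """Assumed rule for M: the smallest M with 2**M >= length."""
    return (length - 1).bit_length()


def offsets(length: int) -> List[int]:
    """Offsets n = 2**i for i = 0 .. M-1, as recited in claim 3."""
    return [2 ** i for i in range(rounds_for(length))]


def forward_scan(values: List[float]) -> List[float]:
    """Apply M unweighted forward aggregation passes to scalar features."""
    current = list(values)
    for offset in offsets(len(values)):
        # Every position is updated from the previous pass only, so one
        # pass can be computed for all positions in parallel.
        current = [
            current[t] + (current[t - offset] if t >= offset else 0)
            for t in range(len(current))
        ]
    return current


def backward_scan(values: List[float]) -> List[float]:
    """Mirror image of forward_scan: aggregate from the last character back."""
    return forward_scan(values[::-1])[::-1]
```

With five characters, three passes with offsets 1, 2, and 4 are enough for the last character to have aggregated all four predecessors, rather than the four sequential steps a recurrent update would need.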

4. The method according to claim 3, characterized in that, during the i-th first feature aggregation process on the initial forward component of the character sequence according to the forward propagation direction: N1 weight values corresponding to the first character are obtained from the L×N weight matrix corresponding to the i-th first feature aggregation process, wherein that L×N weight matrix is obtained by performing three-dimensional reshaping processing on the input sequence, and L is the length of the character sequence; the N1 feature dimensions of the first character and the N1 feature dimensions of the second character are obtained from the initial forward component; the N1 weight values are multiplied element-wise with the N1 feature dimensions of the first character to obtain N1 first target feature dimensions; and the N1 feature dimensions of the second character are added element-wise to the N1 first target feature dimensions to obtain the target forward component of the second character after the i-th first feature aggregation process; and/or, during the i-th second feature aggregation process on the initial backward component of the character sequence according to the backward propagation direction: N2 weight values corresponding to the third character are obtained from the L×N weight matrix corresponding to the i-th second feature aggregation process, wherein that L×N weight matrix is obtained by performing three-dimensional reshaping processing on the input sequence; the N2 feature dimensions of the third character and the N2 feature dimensions of the fourth character are obtained from the initial backward component; the N2 weight values are multiplied element-wise with the N2 feature dimensions of the third character to obtain N2 second target feature dimensions; and the N2 feature dimensions of the fourth character are added element-wise to the N2 second target feature dimensions to obtain the target backward component of the fourth character after the i-th second feature aggregation process.
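The weighted update in claim 4 amounts to a gated addition between two characters that are a fixed offset apart. The Python sketch below is one possible reading, not the applicant's implementation: the function name `weighted_step` is hypothetical, the weights are passed in directly instead of being generated from the input, and the backward branch is assumed to mirror the forward branch, with the receiving character lying `offset` positions toward the start of the sequence.

```python
from typing import List

Vector = List[float]


def weighted_step(
    component: List[Vector],
    weights: List[Vector],
    offset: int,
    forward: bool = True,
) -> List[Vector]:
    """One weighted aggregation pass over a forward or backward component.

    For a source character `src` and a destination character `dst` that is
    `offset` positions away in the propagation direction:
        out[dst] = component[dst] + weights[src] * component[src]
    where the product and the sum are element-wise.
    """
    length = len(component)
    step = offset if forward else -offset
    out = [list(v) for v in component]  # characters without a source stay unchanged
    for src in range(length):
        dst = src + step
        if 0 <= dst < length:
            out[dst] = [
                x_dst + w * x_src
                for x_dst, w, x_src in zip(component[dst], weights[src], component[src])
            ]
    return out
```

Each destination is computed from the values of the previous pass only, so all positions in a pass can be updated independently of one another.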

5. The method according to claim 4, characterized in that the three-dimensional reshaping process includes: performing a linear transformation on the input sequence to obtain an intermediate representation; converting the intermediate representation into a three-dimensional tensor to obtain a three-dimensional structure whose dimensions are the length L of the character sequence, the number M, and the N feature dimensions of each character; and normalizing the three-dimensional structure to obtain M L×N weight matrices.
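The three steps of claim 5 (linear transformation, reshaping to L × M × N, normalization) can be sketched in Python as follows. The projection matrix `proj` of shape N × (M·N) and the use of a sigmoid as the normalization are assumptions made for illustration; the claim does not specify the shape of the linear map or the normalization function.

```python
import math
from typing import List

Matrix = List[List[float]]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def generate_weights(seq: Matrix, proj: Matrix, m: int) -> List[Matrix]:
    """Produce M weight matrices of shape L x N from an L x N input sequence.

    seq  : L x N input sequence
    proj : N x (M*N) projection matrix (hypothetical learned parameter)
    m    : number of aggregation passes M
    """
    length, n = len(seq), len(seq[0])
    # Step 1: linear transformation, giving an L x (M*N) intermediate representation.
    intermediate = [
        [sum(x[k] * proj[k][c] for k in range(n)) for c in range(m * n)]
        for x in seq
    ]
    # Step 2: reshape every row into M x N, giving an L x M x N structure.
    cube = [[row[i * n:(i + 1) * n] for i in range(m)] for row in intermediate]
    # Step 3: normalize (assumed: sigmoid) and slice out one L x N matrix per pass.
    return [
        [[sigmoid(v) for v in cube[t][i]] for t in range(length)]
        for i in range(m)
    ]
```

Because the weights are computed from the input sequence itself, each character receives its own gate values for every aggregation pass, which is what makes the weights dynamic.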

6. The method according to claim 4, characterized in that the step of element-wise multiplying the N1 weight values with the N1 feature dimensions of the first character to obtain the N1 first target feature dimensions includes: obtaining the first parameter matrix corresponding to the first feature aggregation process, wherein the first parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character; obtaining the N1 first parameters corresponding to the first character in the first parameter matrix; and multiplying the N1 feature dimensions of the first character, the N1 weight values, and the N1 first parameters element-wise to obtain the N1 first target feature dimensions; and/or the step of element-wise multiplying the N2 weight values with the N2 feature dimensions of the third character to obtain the N2 second target feature dimensions includes: obtaining the second parameter matrix corresponding to the second feature aggregation process, wherein the second parameter matrix includes the length L of the character sequence, the number M, and N feature dimensions for each character; obtaining the N2 second parameters corresponding to the third character in the second parameter matrix; and multiplying the N2 feature dimensions of the third character, the N2 weight values, and the N2 second parameters element-wise to obtain the N2 second target feature dimensions.

7. The method according to claim 6, characterized in that a target parameter matrix is obtained through the following process: obtaining an initial parameter matrix randomly generated by the sequence processing model, wherein the initial parameter matrix includes the length L of the character sequence and N feature dimensions of each character; obtaining a training sample set, which includes training samples and the true label of each training sample; inputting the training sample set into the sequence processing model, which calculates a prediction result using the initial parameter matrix through a target feature aggregation process; obtaining a loss function value based on the prediction result and the true labels; calculating the gradient of the initial parameter matrix from the loss function value using the backpropagation algorithm; adjusting the values of the initial parameter matrix based on the gradient to obtain a temporary parameter matrix; and taking the temporary parameter matrix as the initial parameter matrix and returning to the step of inputting the training sample set into the sequence processing model until an iteration stopping condition is met, whereupon the last temporary parameter matrix obtained is taken as the target parameter matrix; wherein, when the target parameter matrix is the first parameter matrix, the target feature aggregation process is the first feature aggregation process, and when the target parameter matrix is the second parameter matrix, the target feature aggregation process is the second feature aggregation process.
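The iterative procedure in claim 7 follows the usual gradient-descent pattern. The Python sketch below reduces it to a single scalar parameter `p` so that the gradient can be written by hand. Everything specific here is an assumption for illustration: the one-step model `x_dst + p * w * x_src` (a scalar version of the triple product in claim 6), the mean squared error loss, the learning rate, the tolerance used as the stopping condition, and the fixed starting value, which stands in for the random initialization recited in the claim so that the example is reproducible.

```python
from typing import List, Tuple

# Each sample is (x_src, x_dst, w, label); all values are scalars in this sketch.
Sample = Tuple[float, float, float, float]


def train_parameter(
    samples: List[Sample],
    lr: float = 0.1,
    tol: float = 1e-10,
    max_iters: int = 1000,
) -> float:
    """Fit one aggregation parameter by gradient descent (illustrative only)."""
    p = 0.0  # claim 7 recites random initialization; fixed here for reproducibility
    for _ in range(max_iters):
        loss = 0.0
        grad = 0.0
        for x_src, x_dst, w, label in samples:
            pred = x_dst + p * w * x_src   # prediction via one aggregation step
            err = pred - label
            loss += err * err              # squared error (assumed loss)
            grad += 2.0 * err * w * x_src  # d(loss)/dp, derived by hand
        loss /= len(samples)
        grad /= len(samples)
        if loss < tol:                     # assumed iteration stopping condition
            break
        p -= lr * grad                     # temporary value becomes the new initial value
    return p
```

In a real model the gradient would come from automatic differentiation over the full parameter matrix; the loop structure, however, matches the compute, compare, adjust, and repeat cycle that the claim recites.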

8. The method according to claim 1, characterized in that, After inputting the input sequence into the sequence processing model to obtain the target sequence output by the sequence processing model, the method further includes: The initial loss function value is calculated based on the target sequence. Based on the initial loss function value, the initial gradients of each parameter in the sequence processing model are calculated using the backpropagation algorithm. Calculate the gradient norm of the initial gradient of each of the parameters; Obtain the gradient norm, historical gradient norm, and the changing trend of historical loss function values; The gradient clipping threshold is dynamically adjusted based on the gradient norm, historical gradient norm, and the changing trend of historical loss function values. The gradient norm is compared with the gradient clipping threshold. If the gradient norm exceeds the gradient clipping threshold, the initial gradient of each parameter is proportionally reduced to the range of the gradient clipping threshold to obtain the target gradient of each parameter. The model parameters in the sequence processing model are updated based on the target gradients of each parameter.
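Claim 8 describes gradient clipping with a threshold that adapts during training. The Python sketch below shows one way this could work. The adjustment rule in `adaptive_threshold` is hypothetical, since the claim states only that the threshold depends on the current gradient norm, the historical gradient norms, and the trend of historical loss values. Here the threshold is the mean of the recent norms including the current one, halved when the loss has just increased. The gradient is also treated as one flat list with a single global norm, which is a simplification.

```python
import math
from typing import List


def global_norm(grads: List[float]) -> float:
    """Euclidean norm over all gradient entries."""
    return math.sqrt(sum(g * g for g in grads))


def adaptive_threshold(
    current_norm: float,
    norm_history: List[float],
    loss_history: List[float],
) -> float:
    """Hypothetical rule: mean of recent norms, tightened when the loss rises."""
    norms = norm_history + [current_norm]
    threshold = sum(norms) / len(norms)
    if len(loss_history) >= 2 and loss_history[-1] > loss_history[-2]:
        threshold *= 0.5  # assumed: clip harder when training looks unstable
    return threshold


def clip(grads: List[float], threshold: float) -> List[float]:
    """Scale the gradient proportionally so that its norm does not exceed the threshold."""
    norm = global_norm(grads)
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]
```

Proportional scaling keeps the direction of the update unchanged and only shortens its length, so a spike in the gradient reduces the step size without redirecting it.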

9. An electronic device, characterized in that, The electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the model training method according to any one of claims 1 to 8.

10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the model training method according to any one of claims 1 to 8.