Model processing method, device and medium for semantic understanding

By optimizing the inter-layer feature transfer of the large language model through adaptive diffusion and dynamic connection modules, the problems of rigidity of inter-layer connections and uniformity of information diffusion intensity are solved, thereby improving the accuracy of semantic understanding and reasoning speed, and making it suitable for a variety of text processing tasks.

CN122088515B · Active · Publication Date: 2026-06-26 · XIAMEN UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIAMEN UNIV OF TECH
Filing Date
2026-04-21
Publication Date
2026-06-26

Abstract

A model processing method, device and medium for semantic understanding relate to the technical field of natural language processing. The method comprises: obtaining text information to be processed. The text information to be processed is preprocessed, and a text feature matrix is generated for each word element. The text feature matrix of each word element is encoded layer by layer to obtain an output feature matrix of each word element at each layer. The feature importance and inter-layer similarity are calculated based on the output feature matrix to generate a diffusion intensity matrix of each word element at each layer. The input feature matrix of the current round is optimized based on the diffusion intensity matrix of the previous round. The inter-layer dynamic connection path is determined based on the inter-layer similarity of the previous round and the current calculation load. When a certain word element skips the first layer connection in the current round, simulation generation and completion are performed to obtain a feature fusion completion matrix. After completion, subsequent layer encoding is continued according to the dynamic connection path to output a feature matrix, and finally a semantic understanding result is obtained.

Description

Technical Field

[0001] This invention relates to the field of natural language processing technology, and more specifically, to a model processing method, device, and medium for semantic understanding.

Background Technology

[0002] Currently, large language models face the challenge of balancing efficiency and accuracy in natural language processing tasks. These models need to handle diverse language structures, ranging from simple to complex, such as short sentences and long, difficult sentences, while also distinguishing the importance of different language features, such as key entities and auxiliary function words. This requires models to dynamically adapt to varying input semantic complexity and differentiate the processing of different features to achieve efficient utilization of computational resources and accurate capture of semantic information.

[0003] To optimize model efficiency, existing technologies have proposed several improvement schemes. Sparse Transformers reduce the number of fixed connections between model layers to decrease the number of parameters and computational overhead. Hybrid expert models, on the other hand, dynamically activate some network layers (experts) to process input, aiming to reduce the computational cost of each inference. These methods primarily optimize the model structure by reducing computational operations.

[0004] However, these existing improvements still have significant limitations. Sparse Transformers only statically reduce connections, failing to dynamically adjust information flow paths based on the semantic complexity of the input text. Hybrid expert models, while dynamically activating network layers, do not finely adjust the strength of information transfer between layers. Therefore, neither fundamentally solves the problems of inefficient feature transfer and wasted computational resources caused by rigid inter-layer connections and uniform information diffusion strength.

Summary of the Invention

[0005] The present invention provides a model processing method, device, and medium for semantic understanding, to address at least one of the above-mentioned technical problems.

[0006] In a first aspect, the present invention provides a model processing method for semantic understanding, comprising steps S1 to S8.

[0007] S1. Obtain the text information to be processed.

[0008] S2. Preprocess the text information to be processed to generate a text feature matrix for each word.

[0009] S3. The Transformer-based hierarchical encoding module encodes the text feature matrix of each word layer by layer to obtain the output feature matrix of each word at each layer.

[0010] S4. Calculate feature importance and inter-layer similarity based on the output feature matrix of each word in each layer to generate the diffusion intensity matrix of each word in each layer.

[0011] S5. Based on the diffusion intensity matrix of each word in each layer in the previous round, optimize the input feature matrix of each word in each layer in the current round.

[0012] S6. Determine the inter-layer dynamic connection path for the current round based on the inter-layer similarity of each word in the previous round and the current computational load.

[0013] S7. In the case of an inter-layer skip connection, when the previous round of hierarchical encoding determines that a word skips the connection to layer $k$ in the current round, the word's missing features at the skipped layer are simulated and completed based on the diffusion intensity matrix to obtain a feature fusion completion matrix, which is used when the contextual features of other words are calculated at layer $k$.

[0014] S8. Based on the current-layer feature matrix of each word after completion, continue the subsequent layer encoding along the dynamic connection path to obtain the output feature matrix of the final layer, and output the semantic understanding result according to the output feature matrix of the final layer.

[0015] As a further aspect of the present invention, the preprocessing in S2 includes word segmentation, lexical feature extraction, and contextual grammar feature extraction of the text information to be processed. The lexical features include one or more of word vectors, word frequency, word length, and part-of-speech tags. The contextual grammar features include one or more of phrase structure, punctuation features, positional features, and dependency relations.

[0016] A text feature matrix is constructed for each word segmentation unit.

[0017] Let $X$ denote the text feature matrix obtained after preprocessing a given word, with matrix size $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence. Different dimensions in the text feature matrix are used to jointly represent word-level semantic information and contextual structure information.

[0018] As a further aspect of the present invention, in S3, the hierarchical encoding module is embedded in the Transformer model to perform layer-by-layer semantic representation learning on the input features. The hierarchical encoding module includes multiple encoding or decoding layers from layer 1 to layer $n$, where $n$ is the total number of layers in the hierarchical encoding module, and [value missing].

[0019] The first layer of the hierarchical encoding receives the text feature matrix $X$ of a given word as input and outputs the word's first-layer output feature matrix $H_1$.

[0020] The $k$-th layer of the hierarchical encoding receives the word's output feature matrix $H_{k-1}$ of layer $k-1$ as input and outputs the word's output feature matrix $H_k$ of layer $k$, where $k$ is the layer index and $2 \le k \le n$.

[0021] Each word is encoded and calculated at layers 1 to n according to the previous encoding steps. When calculating the context-related features of a word, the feature matrices of other related words at the corresponding layers are called to participate in the calculation.

[0022] As a further aspect of the present invention, in step S4, before calculating the diffusion intensity matrix, the feature importance matrix is initialized first. The feature importance matrix has the same size as the text feature matrix. During initialization, the initial weights of entity words and verbs are set to 0.8, the initial weights of adjectives are set to 0.6, the initial weights of function words are set to 0.2, and the initial weights of contextual grammatical features are uniformly set to 0.5.
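The initialization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `init_importance`, the word-class keys, and the row layout (lexical-feature rows first, contextual rows after) are hypothetical and do not come from the patent text. Only the four weights and the uniform 0.5 are taken from it.

```python
# Initial weights quoted in the text, keyed by a coarse word class.
# The key names are illustrative; the patent does not name them.
INITIAL_LEXICAL_WEIGHT = {
    "entity": 0.8,
    "verb": 0.8,
    "adjective": 0.6,
    "function": 0.2,
}
CONTEXT_WEIGHT = 0.5  # contextual grammatical features, uniform


def init_importance(word_class, n_lexical_rows, n_context_rows, seq_len):
    """Build an initial importance matrix as a list of rows.

    Assumption: the first n_lexical_rows rows hold lexical features and the
    remaining n_context_rows rows hold contextual grammatical features.
    Word classes outside the table raise KeyError, because the text gives
    no default weight for them.
    """
    lexical = INITIAL_LEXICAL_WEIGHT[word_class]
    rows = [[lexical] * seq_len for _ in range(n_lexical_rows)]
    rows += [[CONTEXT_WEIGHT] * seq_len for _ in range(n_context_rows)]
    return rows
```

The result has the same shape as the word's text feature matrix, as the paragraph requires, provided the two row counts add up to the feature dimension.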

[0023] The feature importance matrix for the first layer is: $W_1 = W_0 \odot A_1$. In the formula, $W_1$ denotes the feature importance matrix of the first layer, $W_0$ denotes the feature importance matrix after initialization, $A_1$ denotes the multi-head attention weight matrix of the first layer of the Transformer, and $\odot$ denotes the Hadamard product operation.

[0024] The feature importance matrix of the $k$-th layer is: $W_k = W_{k-1} \odot A_k$. In the formula, $W_k$ denotes the feature importance matrix of the $k$-th layer, $W_{k-1}$ denotes the feature importance matrix of layer $k-1$, $A_k$ denotes the multi-head attention weight matrix of the $k$-th layer of the Transformer, and $k$ is the layer index.

[0025] The size of the multi-head attention weight matrix is $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence.

[0026] Probability distribution normalization is performed on the feature importance matrix of the $k$-th layer: $\tilde{W}_k(i,j) = \dfrac{e^{W_k(i,j)}}{\sum_{p}\sum_{q} e^{W_k(p,q)}}$. In the formula, $\tilde{W}_k(i,j)$ denotes the element at row $i$, column $j$ of the normalized $k$-th layer feature importance matrix, $W_k(i,j)$ denotes the element at row $i$, column $j$ of the $k$-th layer feature importance matrix before normalization, $e$ is the natural constant, and $p$ and $q$ are the row index and column index in the summation, respectively.
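As a minimal sketch of the normalization step, the following Python function applies an exponential normalization over all elements of a matrix, so that the entries sum to 1. It assumes the normalization is a softmax taken jointly over rows and columns, which is how the description of the summation indices reads. The name `normalize_importance` is hypothetical, and subtracting the maximum is a standard numerical-stability step that does not change the result.

```python
import math


def normalize_importance(w):
    """Softmax over every element of the matrix w (a list of rows)."""
    # Shifting by the maximum avoids overflow in exp() and cancels out
    # in the ratio, so the output equals exp(w_ij) / sum(exp(w_pq)).
    shift = max(max(row) for row in w)
    exps = [[math.exp(v - shift) for v in row] for row in w]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]
```

A matrix with identical entries normalizes to a uniform distribution, and larger entries always receive a larger share.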

[0027] As a further aspect of the present invention, the inter-layer similarity is used to characterize the degree of consistency between the feature representations of two adjacent layers. The higher the inter-layer similarity, the lower the diffusion intensity of the corresponding layer, thereby reducing the transmission of redundant features.

[0028] The inter-layer similarity of layer $k$ relative to layer $k-1$ is: $S_k = \dfrac{\langle H_k, H_{k-1} \rangle}{\lVert H_k \rVert \, \lVert H_{k-1} \rVert}$. In the formula, $S_k$ denotes the inter-layer similarity of layer $k$ relative to layer $k-1$, $H_k$ denotes the word's output feature matrix of layer $k$, $H_{k-1}$ denotes the word's output feature matrix of layer $k-1$, $\langle \cdot, \cdot \rangle$ denotes the inner product operation, $\lVert \cdot \rVert$ denotes the Euclidean norm, and $k$ is the layer index.
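The similarity described here (an inner product divided by two Euclidean norms) can be illustrated with a small cosine-similarity function. This is a sketch, assuming the matrices are treated as flattened vectors, so the norm is the Frobenius norm. The name `inter_layer_similarity` and the error raised for an all-zero input are illustrative choices, not part of the patent.

```python
import math


def inter_layer_similarity(h_k, h_prev):
    """Cosine similarity between two same-shaped matrices (lists of rows)."""
    dot = sum(a * b for ra, rb in zip(h_k, h_prev) for a, b in zip(ra, rb))
    norm_k = math.sqrt(sum(a * a for row in h_k for a in row))
    norm_prev = math.sqrt(sum(b * b for row in h_prev for b in row))
    if norm_k == 0.0 or norm_prev == 0.0:
        raise ValueError("similarity is undefined for an all-zero matrix")
    return dot / (norm_k * norm_prev)
```

Two layers whose outputs differ only by a positive scale factor score 1, which matches the reading that a value close to 1 signals redundant features.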

[0029] The diffusion intensity matrix of the $k$-th layer is: [formula missing]. In the formula, $D_k$ denotes the diffusion intensity matrix of the $k$-th layer, $\alpha$ denotes the adjustment factor, $W_k$ denotes the feature importance matrix of the $k$-th layer, $\odot$ denotes the Hadamard product operation, and $D_0$ denotes the base diffusion matrix, which has the same dimensions as the text feature matrix, $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence.

[0030] The inter-layer similarity $S_k$ and the diffusion intensity matrix $D_k$ calculated in each round of hierarchical encoding are both retained for use in the next round of hierarchical encoding. The adjustment factor $\alpha$ takes a value from 0.5 to 1.

[0031] As a further aspect of the present invention, S5 specifically includes: based on the $k$-th layer diffusion intensity matrix retained from the previous round of hierarchical encoding, the input feature matrix of the $k$-th layer in the current round is updated: $\tilde{H}_{k-1} = H_{k-1} \odot D_k$. In the formula, $\tilde{H}_{k-1}$ denotes the input matrix of the $k$-th layer after the diffusion intensity is superimposed, $H_{k-1}$ denotes the word's output feature matrix of layer $k-1$, $\odot$ denotes the Hadamard product operation, and $D_k$ denotes the diffusion intensity matrix of the $k$-th layer.

[0032] As a further aspect of the present invention, in S6, the connection probability from layer $k-1$ to layer $k$ is: [formula missing]. In the formula, $P_k$ denotes the connection probability from layer $k-1$ to layer $k$, $\sigma(\cdot)$ denotes the Sigmoid function, $S_k$ denotes the inter-layer similarity of layer $k$ relative to layer $k-1$, $\beta$ denotes the load adjustment factor, with a value ranging from 0 to 0.5, $C_k$ denotes the current computational load of the $k$-th layer, and $k$ is the layer index.

[0033] The Sigmoid function is: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$. In the formula, $\sigma(x)$ denotes the Sigmoid function value corresponding to the independent variable $x$, $x$ denotes the input variable of the function, and $e$ is the natural constant.

[0034] As a further aspect of the present invention, when the current round of hierarchical encoding calculates the connection probability $P_k$, the inter-layer similarity $S_k$ used is the inter-layer similarity calculated and retained in the previous round of hierarchical encoding. The current computational load $C_k$ is taken as the GPU memory usage of the $k$-th layer in the previous round of training. On this basis, the calculation results of the previous round determine whether layer $k-1$ is connected to layer $k$ in the current round.

[0035] As a further aspect of the present invention, the dynamic connection path is determined according to the following rules:

[0036] The first layer remains active and is not turned off.

[0037] When the connection probability $P_k$ corresponding to layer $k$ is greater than or equal to 0.6, layer $k$ is activated; otherwise, layer $k$ is closed.

[0038] No two consecutive layers may be closed at the same time, to avoid excessive loss of features.
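The three rules above can be sketched as a single pass over per-layer connection probabilities. The text does not say how a conflict with the third rule is resolved, so this sketch assumes that when two consecutive layers would both be closed, the later one is re-activated. That tie-break, the function name `select_active_layers`, and the list-based interface are all illustrative assumptions.

```python
def select_active_layers(probabilities, threshold=0.6):
    """Return one boolean per layer: True = active, False = closed.

    probabilities[i] is the connection probability for layer i + 1.
    The value for the first layer is ignored, since that layer stays active.
    """
    active = []
    for index, p in enumerate(probabilities):
        if index == 0:
            keep = True  # rule 1: the first layer is never closed
        else:
            keep = p >= threshold  # rule 2: activate at or above 0.6
            if not keep and not active[-1]:
                # Rule 3: never close two consecutive layers.
                # Assumption: the later layer of the pair is re-activated.
                keep = True
        active.append(keep)
    return active
```

For example, `select_active_layers([0.1, 0.9, 0.2, 0.3, 0.7])` keeps layers 1 and 2, closes layer 3, and re-activates layer 4 even though its probability is below the threshold.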

[0039] As a further aspect of the present invention, in S7, when it is determined from the results of the previous round of hierarchical encoding that a word skips the connection to layer $k$ in the current round, the word's missing features at layer $k$ are simulated from the word's output feature matrix of layer $k-1$ and the word's diffusion intensity matrix of layer $k$ from the previous round of hierarchical encoding, generating the word's feature fusion completion matrix of layer $k$ in the current round: $F_k = H_{k-1} \odot D_k$. In the formula, $F_k$ denotes the word's feature fusion completion matrix of layer $k$, $H_{k-1}$ denotes the word's output feature matrix of layer $k-1$, $D_k$ denotes the word's diffusion intensity matrix of layer $k$ from the previous round of hierarchical encoding, $\odot$ denotes the Hadamard product operation, and $k$ is the layer index.

[0040] The feature fusion completion matrix solves the problem that, because the word skips layer $k$, other words lack the features of an adjacent word when calculating contextual features at layer $k$.
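A minimal sketch of the completion step follows, assuming it is an element-wise (Hadamard) product of the word's last available layer output with the diffusion intensity matrix retained from the previous round. The function names are hypothetical, and which layer's diffusion matrix is used is an assumption, since the layer indices are not legible in the text.

```python
def hadamard(a, b):
    """Element-wise product of two same-shaped matrices (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def complete_skipped_layer(h_prev_layer, d_prev_round):
    """Stand-in features for a skipped layer.

    h_prev_layer: the word's output at the last layer it actually ran.
    d_prev_round: diffusion intensity matrix kept from the previous round
                  (assumed here to be the one for the skipped layer).
    """
    return hadamard(h_prev_layer, d_prev_round)
```

Other words can then read this matrix in place of the missing layer output when they compute their contextual features at that layer.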

[0041] In a second aspect, the present invention provides a model processing apparatus for semantic understanding, comprising a processor, a memory, and a computer program stored in the memory. The computer program can be executed by the processor to implement a model processing method for semantic understanding as described in any paragraph of the first aspect.

[0042] In a third aspect, the present invention provides a computer-readable storage medium. The computer-readable storage medium includes a stored computer program, wherein, when the computer program is executed, it controls the device containing the computer-readable storage medium to perform a model processing method for semantic understanding as described in any paragraph of the first aspect.

[0043] By adopting the above technical solution, the present invention can achieve the following technical effects:

[0044] This invention generates a diffusion intensity matrix based on feature importance and inter-layer similarity, and dynamically adjusts the feature transmission intensity. This allows for precise focus on key semantic features such as entity words and verbs, while weakening the interference of redundant features such as function words. This effectively improves the semantic capture accuracy of text classification and text understanding, reduces model perplexity, and significantly improves task accuracy.

[0045] This invention also dynamically determines inter-layer connection paths by combining inter-layer similarity with the current computational load, activating or deactivating model coding layers as needed, reducing redundant inter-layer connections and invalid computational overhead, reducing GPU memory usage, and significantly improving model inference speed. At the same time, it does not require modification of the Transformer core structure and can be directly embedded into mainstream large language models such as GPT and LLaMA, adapting to various text processing scenarios. It maximizes the utilization of computing resources while ensuring model performance, and has good versatility and practical value.

Attached Figure Description

[0046] To more clearly illustrate the technical solution of the present invention, the accompanying drawings used in the specific embodiments of the present invention will be briefly introduced below. It should be understood that the following drawings only show some specific embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained from these drawings without creative effort.

[0047] Figure 1 is the overall flowchart of model processing.

[0048] Figure 2 is a structural diagram of the adaptive diffusion module, used to reflect the adaptive diffusion principle.

[0049] Figure 3 is a graph showing the results of the ablation experiment, used to verify the independent effects of the adaptive diffusion module and the dynamic connection module. The horizontal axis represents the model type (ADLC-full, ADLC-no diffusion, ADLC-no connection), and the vertical axis represents PPL (left axis) and inference speed (right axis).

Detailed Implementation

[0050] The technical solutions of the present invention will now be clearly and completely described with reference to the accompanying drawings in the embodiments of the present invention.

[0051] Example 1: Referring to Figures 1 to 3, the first embodiment of this invention provides a model processing method for semantic understanding. This method constructs an adaptive diffusion module, a dynamic hierarchical connection module, and a feature fusion and completion module, embedding them into the encoding / decoding layer of a Transformer model to form an end-to-end architecture of "input -- preprocessing -- hierarchical encoding (embedded: adaptive diffusion module, dynamic hierarchical connection module, and feature fusion and completion module) -- output". The input module, preprocessing module, and output module are the same as in existing Transformer models, and will not be described further in this invention.

[0052] As shown in Figure 1, the improved model consists of seven core modules: an input module, a preprocessing module, a hierarchical coding module (containing n layers from L1 to Ln, n≥12 in this embodiment), an adaptive diffusion module, a dynamic hierarchical connection module, a feature fusion and completion module, and an output module. The adaptive diffusion module receives features from all layers of the hierarchical coding module, completes the calculation, and feeds them back to each layer to achieve dynamic adjustment of the next round of hierarchical coding.

[0053] The model processing method for semantic understanding of the present invention includes steps S1 to S8.

[0054] S1. Obtain the text information to be processed. Specifically, the input module is used to input the text information to be processed.

[0055] S2. Preprocess the text information to be processed to generate a text feature matrix for each word.

[0056] Preferably, the preprocessing includes word segmentation, lexical feature extraction, and contextual grammar feature extraction of the text information to be processed. The lexical features include one or more of word vectors, word frequency, word length, and part-of-speech tags. The contextual grammar features include one or more of phrase structure, punctuation features, positional features, and dependency relations.

[0057] A text feature matrix is constructed for each word segmentation unit.

[0058] Let $X$ denote the text feature matrix obtained after preprocessing a given word, with matrix size $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence. Different dimensions in the text feature matrix are used to jointly represent word-level semantic information and contextual structure information.

[0059] Specifically, the preprocessing module first segments the text information to be processed into words, and then generates a text feature matrix for each word. As shown in Table 1, the text feature matrix of each lexical unit includes both the internal lexical features of that lexical unit and the contextual grammatical features related to other lexical units.

[0060] Although the computation of a single lexical unit in the L1 to Ln layers of the hierarchical encoder is performed layer by layer, the hierarchical encoding of different lexical units can be executed in parallel. When calculating the context-related features of a certain lexical unit, it is necessary to query the feature matrices of other related lexical units in the corresponding layers to participate in the computation.

[0061] Table 1. Text Feature Matrix of Lexical Units.

[0062]

[0063] S3. The Transformer-based hierarchical encoding module encodes the text feature matrix of each word layer by layer, obtaining the output feature matrix of each word at each layer. Specifically, the hierarchical encoding module is used to complete the encoder or decoder of the Transformer.

[0064] Preferably, the hierarchical encoding module is embedded in the Transformer model to perform layer-by-layer semantic representation learning on the input features. The hierarchical encoding module includes multiple encoding or decoding layers from layer 1 to layer $n$, where $n$ is the total number of layers in the hierarchical encoding module, and [value missing].

[0065] The first layer of the hierarchical encoding receives the text feature matrix $X$ of a given word as input and outputs the word's first-layer output feature matrix $H_1$.

[0066] The $k$-th layer of the hierarchical encoding receives the word's output feature matrix $H_{k-1}$ of layer $k-1$ as input and outputs the word's output feature matrix $H_k$ of layer $k$, where $k$ is the layer index and $2 \le k \le n$.

[0067] Each word is encoded and calculated at layers 1 to n according to the previous encoding steps. When calculating the context-related features of a word, the feature matrices of other related words at the corresponding layers are called to participate in the calculation.

[0068] S4. Based on the output feature matrix of each word in each layer, calculate the feature importance and inter-layer similarity to generate the diffusion intensity matrix of each word in each layer. Specifically, the adaptive diffusion module is used to calculate the feature importance weights, calculate the inter-layer similarity, and generate the diffusion intensity matrix.

[0069] Preferably, the feature importance matrix is initialized before calculating the diffusion intensity matrix. The feature importance matrix has the same size as the text feature matrix. During initialization, the initial weights for entity words and verbs are set to 0.8, the initial weights for adjectives are set to 0.6, the initial weights for function words are set to 0.2, and the initial weights for contextual grammatical features are uniformly set to 0.5.

[0070] The feature importance matrix for the first layer is: $W_1 = W_0 \odot A_1$. In the formula, $W_1$ denotes the feature importance matrix of the first layer, $W_0$ denotes the feature importance matrix after initialization, $A_1$ denotes the multi-head attention weight matrix of the first layer of the Transformer, and $\odot$ denotes the Hadamard product operation.

[0071] The feature importance matrix of the $k$-th layer is: $W_k = W_{k-1} \odot A_k$. In the formula, $W_k$ denotes the feature importance matrix of the $k$-th layer, $W_{k-1}$ denotes the feature importance matrix of layer $k-1$, $A_k$ denotes the multi-head attention weight matrix of the $k$-th layer of the Transformer, and $k$ is the layer index.

[0072] The size of the multi-head attention weight matrix is $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence.

[0073] Probability distribution normalization is performed on the feature importance matrix of the $k$-th layer: $\tilde{W}_k(i,j) = \dfrac{e^{W_k(i,j)}}{\sum_{p}\sum_{q} e^{W_k(p,q)}}$. In the formula, $\tilde{W}_k(i,j)$ denotes the element at row $i$, column $j$ of the normalized $k$-th layer feature importance matrix, $W_k(i,j)$ denotes the element at row $i$, column $j$ of the $k$-th layer feature importance matrix before normalization, $e$ is the natural constant, and $p$ and $q$ are the row index and column index in the summation, respectively.

[0074] Preferably, the inter-layer similarity is used to characterize the degree of consistency between the feature representations of two adjacent layers. The higher the inter-layer similarity, the lower the diffusion intensity of the corresponding layer, thereby reducing the transmission of redundant features.

[0075] The inter-layer similarity of layer $k$ relative to layer $k-1$ is: $S_k = \dfrac{\langle H_k, H_{k-1} \rangle}{\lVert H_k \rVert \, \lVert H_{k-1} \rVert}$. In the formula, $S_k$ denotes the inter-layer similarity of layer $k$ relative to layer $k-1$, $H_k$ denotes the word's output feature matrix of layer $k$, $H_{k-1}$ denotes the word's output feature matrix of layer $k-1$, $\langle \cdot, \cdot \rangle$ denotes the inner product operation, $\lVert \cdot \rVert$ denotes the Euclidean norm, and $k$ is the layer index.

[0076] The diffusion intensity matrix of the $k$-th layer is: [formula missing]. In the formula, $D_k$ denotes the diffusion intensity matrix of the $k$-th layer, $\alpha$ denotes the adjustment factor, $W_k$ denotes the feature importance matrix of the $k$-th layer, $\odot$ denotes the Hadamard product operation, and $D_0$ denotes the base diffusion matrix, which has the same dimensions as the text feature matrix, $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence.

[0077] The inter-layer similarity $S_k$ and the diffusion intensity matrix $D_k$ calculated in each round of hierarchical encoding are both retained for use in the next round of hierarchical encoding. The adjustment factor $\alpha$ takes a value from 0.5 to 1, and is generally taken as 0.85.

[0078] Specifically, the adaptive diffusion module is embedded into the traditional hierarchical coding module to calculate feature importance weights and inter-layer similarity. Then, based on feature importance and inter-layer similarity, a diffusion intensity matrix is generated to dynamically adjust the weights of information transmission, thereby improving the problem of inefficient diffusion. The specific steps are as follows.

[0079] The first layer, L1, of the hierarchical encoding works as follows. First, it receives the text feature matrix $X$ output by the preprocessing module (as shown in Table 1), of size $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence. Then, it performs the standard Transformer layer encoding operation on the input feature matrix $X$. Finally, it outputs the feature matrix $H_1$ of the first layer (shallow understanding), of size $d \times m$.

[0080] The $k$-th layer, Lk, of the hierarchical encoding works as follows. First, it receives the output $H_{k-1}$ of layer $k-1$. Then, it performs the Transformer layer encoding operation on the input feature matrix $H_{k-1}$. Finally, it outputs the output $H_k$ of the $k$-th layer of the Transformer encoder (deepening layer by layer), of size $d \times m$, where $2 \le k \le n$.

[0081] Initialize the feature importance matrix $W_0$. The matrix size is $d \times m$, where $d$ is the feature dimension and $m$ is the length of the feature sequence. The initial matrix values are as follows. Lexical features: entity word weight 0.8, verb weight 0.8, adjective weight 0.6, function word weight 0.2. Contextual grammatical features: uniformly initialized to 0.5.

[0082] Calculate feature importance. To adapt to the text features in Table 1, the number of heads of the multi-head attention mechanism in each layer of the Transformer is set to [value missing]. Therefore, the size of the multi-head attention weight matrix $A_k$ of the $k$-th layer is $d \times m$.

[0083] Calculate the feature importance matrix $W_1$ of the first layer.

[0084] $W_1 = W_0 \odot A_1$.

[0085] In the formula, $W_0$ is the initialized feature importance matrix, $A_1$ is the multi-head attention weight matrix of the first layer of the Transformer, and $\odot$ is the Hadamard product of matrices.

[0086] By analogy, calculate the feature importance matrix $W_k$ of the $k$-th layer.

[0087] $W_k = W_{k-1} \odot A_k$.

[0088] In the formula, $W_{k-1}$ is the feature importance matrix output by the feature importance calculation of layer $k-1$, $A_k$ is the multi-head attention weight matrix of the $k$-th layer of the Transformer, and $\odot$ is the Hadamard product of matrices.
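The layer-by-layer recursion described above can be sketched as a running element-wise product, starting from the initialized importance matrix and multiplying in one attention weight matrix per layer. This is an illustration under stated assumptions: the name `propagate_importance` is hypothetical, and each attention weight matrix is assumed to already have the same shape as the importance matrix.

```python
def propagate_importance(w0, attention_per_layer):
    """Return the importance matrices for layers 1..n as a list.

    w0: initialized importance matrix (list of rows).
    attention_per_layer: one attention weight matrix per layer, in layer
        order, each with the same shape as w0 (an assumption of this sketch).
    """
    importances = []
    current = w0
    for a_k in attention_per_layer:
        # Element-wise (Hadamard) product with the previous layer's result.
        current = [
            [w * a for w, a in zip(w_row, a_row)]
            for w_row, a_row in zip(current, a_k)
        ]
        importances.append(current)
    return importances
```

Because every factor here is at most 1, the raw products shrink as the layer index grows, which may be one reason the text normalizes each layer's matrix afterwards.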

[0089] Perform probability distribution normalization on the feature importance matrix $W_k$. The calculation formula for each element is as follows.

[0090] $\tilde{W}_k(i,j) = \dfrac{e^{W_k(i,j)}}{\sum_{p}\sum_{q} e^{W_k(p,q)}}$.

[0091] In the formula, $\tilde{W}_k(i,j)$ denotes the element at row $i$, column $j$ of the normalized $k$-th layer feature importance matrix, $W_k(i,j)$ denotes the element at row $i$, column $j$ of the $k$-th layer feature importance matrix before normalization, $e$ is the natural constant, and $p$ and $q$ are the row index and column index in the summation, respectively.

[0092] Calculate inter-layer similarity. Specifically, calculate the inter-layer similarity between two adjacent layers. The inter-layer similarity $S_k$ of layer $k$ relative to layer $k-1$ is: $S_k = \dfrac{\langle H_k, H_{k-1} \rangle}{\lVert H_k \rVert \, \lVert H_{k-1} \rVert}$. In the formula, $\lVert \cdot \rVert$ is the Euclidean norm, $H_k$ is the output of the $k$-th layer of the layer encoder, and $H_{k-1}$ is the output of the $(k-1)$-th layer of the layer encoder.

[0093] The closer $S_k$ is to 1, the higher the overlap between the features of the two layers, and the diffusion intensity needs to be reduced to avoid redundancy.

[0094] Generate the diffusion intensity matrix. Specifically, the feature importance matrix $W_k$ and the inter-layer similarity $S_k$ are used to calculate the diffusion intensity matrix $D_k$ of the $k$-th layer.

[0095] [formula missing].

[0096] In the formula, $\alpha$ is the adjustment factor, which takes a value between 0.5 and 1 and is typically set to 0.85; $\odot$ is the Hadamard product of matrices; and $D_0$ is the base diffusion matrix, which has the same dimensions as the text feature matrix and is used to ensure that each feature has a basic diffusion path.

[0097] The larger an element of $D_k$, the stronger the diffusion intensity of the feature in that dimension.

[0098] S5. Based on the diffusion intensity matrix of each word in each layer in the previous round, optimize the input feature matrix of each word in each layer in the current round.

[0099] Preferably, S5 specifically includes: based on the $k$-th layer diffusion intensity matrix retained from the previous round of hierarchical encoding, the input feature matrix of the $k$-th layer in the current round is updated. Specifically, the diffusion intensity matrix $D_k$ generated in the current round is used in the next round of computation of the layer encoder (in this embodiment, the number of computation rounds is set to ≥10). In the next round of computation, the input of the $k$-th layer changes from $H_{k-1}$ to $\tilde{H}_{k-1}$.

[0100] $\tilde{H}_{k-1} = H_{k-1} \odot D_k$.

[0101] In the formula, $\tilde{H}_{k-1}$ is the input matrix of the $k$-th layer after the diffusion intensity is superimposed, $H_{k-1}$ is the output of layer $k-1$, $D_k$ denotes the diffusion intensity matrix of the $k$-th layer, $\odot$ denotes the Hadamard product operation, and $2 \le k \le n$.

[0102] This enables the hierarchical coding module in each subsequent round to perform adaptive diffusion based on the feature importance, inter-layer similarity, and diffusion intensity matrix calculated and retained in the previous round.
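The following is a minimal sketch of how such an update might look. It rests on an assumption: the update is taken to be the element-wise (Hadamard) product of the preceding layer's output with the diffusion intensity matrix retained from the previous round, because these are the only operands named in the text. The exact formula and layer indices are not recoverable from the text, and the names `hadamard` and `diffused_input` are hypothetical.

```python
def hadamard(a, b):
    """Element-wise product of two equally shaped matrices (lists of rows)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def diffused_input(h_prev_layer, d_retained):
    """Assumed update: scale the preceding layer's output element by element
    with the diffusion intensity matrix retained from the previous round."""
    return hadamard(h_prev_layer, d_retained)

h_prev_layer = [[1.0, 2.0], [3.0, 4.0]]   # output of the preceding layer
d_retained = [[0.5, 1.0], [0.0, 2.0]]     # illustrative diffusion intensities

print(diffused_input(h_prev_layer, d_retained))
```

Under this assumption, an intensity below 1 damps a feature, an intensity of 1 passes it through unchanged, and an intensity above 1 amplifies it.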

[0103] S6. Based on the inter-layer similarity calculated and retained from the previous round of hierarchical coding and the current computational load, determine the dynamic inter-layer connection path for the current round. Specifically, the dynamic hierarchical connection module is used to calculate and optimize the connection path.

[0104] In this embodiment, the dynamic hierarchical connection module uses the inter-layer similarity calculated and retained in the previous round of hierarchical encoding, together with the current computing load, to calculate the connection probability for the current round and to dynamically control the activation and deactivation of inter-layer connection paths in the current round. The specific steps are as follows.

[0105] Preferably, the connection probability from the (k-1)-th layer to the k-th layer is:

[0106] .

[0107] In the formula, the left-hand side is the connection probability from the (k-1)-th layer to the k-th layer; σ denotes the Sigmoid function; the remaining terms are the inter-layer similarity of the k-th layer relative to the (k-1)-th layer, the load adjustment factor, and the current computational load of the k-th layer; k denotes the layer index.

[0108] The Sigmoid function is: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$. In the formula, $\sigma(x)$ is the Sigmoid function value corresponding to the independent variable $x$, $x$ is the input variable of the function, and $e$ is the natural constant.

[0109] The Sigmoid function is a non-linear function that maps real numbers to the interval (0, 1).
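As an illustration of this mapping, the sketch below implements the standard Sigmoid function. Splitting the computation by the sign of the input is an implementation choice made here to avoid floating-point overflow for large negative inputs; it is not taken from the embodiment.

```python
import math

def sigmoid(x):
    """Standard logistic function 1 / (1 + e^(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)          # x < 0, so z lies in (0, 1) and cannot overflow
    return z / (1.0 + z)

for x in (-5.0, 0.0, 5.0):
    print(x, sigmoid(x))
```

The function is monotonically increasing and symmetric about the point (0, 0.5), so an argument of 0 corresponds to a probability of exactly 0.5.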

[0110] In this context, when the current round of hierarchical encoding calculates the connection probability, the inter-layer similarity used is the one calculated and retained in the previous round of hierarchical encoding. The current computational load of the k-th layer is taken as the GPU memory usage of that layer in the previous training round. On this basis, the calculation results of the previous round determine whether the (k-1)-th layer and the k-th layer are connected in the current round.

[0111] The load adjustment factor takes the values shown in Table 2.

[0112] Table 2. Load adjustment factor values.

[0113]

[0114] That is, the load adjustment factor is set to 0 to 0.1 in the early stage of training, 0.1 to 0.3 in the middle stage of training, and 0.3 to 0.4 in the later stage of training.

[0115] Preferably, the dynamic connection path is determined according to the following rules:

[0116] The first layer remains active and is never turned off. Specifically, because there is no L0 layer, no inter-layer similarity can be calculated for the L1 layer, so the L1 layer cannot be turned off.

[0117] When the connection probability corresponding to the k-th layer is greater than or equal to 0.6, the k-th layer is activated; otherwise, the k-th layer is turned off.

[0118] Any two consecutive layers should not be turned off simultaneously to avoid excessive damage to features (i.e., severe loss or degradation of feature information).
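The three rules above can be sketched as follows. The function takes the connection probabilities of layers 2 to n as given, since the probability formula itself is not reproduced in the text. One point is an assumption made for this sketch: when the threshold would turn off two consecutive layers, the later of the two is kept active. The text only states that two consecutive layers should not be turned off together, and the name `connection_path` is hypothetical.

```python
def connection_path(probs_from_layer2, threshold=0.6):
    """Return one boolean per layer (layer 1 first): True means active.

    probs_from_layer2[i] is the connection probability of layer i + 2.
    """
    active = [True]                      # rule 1: layer 1 is always active
    for p in probs_from_layer2:
        keep = p >= threshold            # rule 2: activate at probability >= 0.6
        if not keep and not active[-1]:  # rule 3: never two consecutive layers off
            keep = True                  # assumption: keep the later layer active
        active.append(keep)
    return active

probs = [0.7, 0.4, 0.3, 0.9, 0.59]      # illustrative values for layers 2 to 6
print(connection_path(probs))
```

In this example layer 3 is turned off by the threshold, layer 4 would also fall below it but is kept because layer 3 is already off, and layer 6 is turned off because layer 5 is active.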

[0119] S7. In the case of inter-layer skip connections, when it is determined from the previous round of hierarchical encoding that a token skips the k-th layer connection in the current round, the token's missing features at the skipped layer are simulated and completed based on the diffusion intensity matrix to obtain a feature fusion completion matrix, which is used when contextual features are calculated for other tokens at the k-th layer.

[0120] Preferably, in S7, when it is determined from the results of the previous round of hierarchical encoding that a token will skip the k-th layer connection in the current round, the token's output feature matrix and its diffusion intensity matrix retained from the previous round of hierarchical encoding are used to simulate the token's missing features at the k-th layer, generating the token's feature fusion completion matrix for the k-th layer in the current round.

[0121] .

[0122] In the formula, the left-hand side is the token's feature fusion completion matrix at the k-th layer; the operands are the token's output feature matrix and the token's diffusion intensity matrix from the previous round of hierarchical encoding; ⊙ denotes the Hadamard product operation; k denotes the layer index.

[0123] The feature fusion completion matrix solves the problem that, when a token skips the k-th layer, other tokens at the k-th layer lack adjacent-token features when calculating contextual features.

[0124] Specifically, as shown in Table 1, the second row of the text feature matrix of a single token contains contextual grammar features. Although the computation of a single token through layers L1 to Ln of the hierarchical encoder is sequential, the hierarchical encoding of different tokens can be performed in parallel. With the introduction of dynamic hierarchical connections, if a token skips the k-th layer connection, other tokens that calculate contextual grammar features at that layer may still need the skipped token's feature values at that layer. If these feature values are missing, the encoding of the other tokens cannot be performed at that layer.

[0125] To address this issue, this embodiment further introduces a feature fusion and completion module, which quickly simulates and generates the skipped token's missing features at the k-th layer, ensuring that the dependent features required for encoding other tokens at the same layer are complete. That is, when a token skips the k-th layer connection, the token's feature fusion completion matrix is quickly generated and filled in, so that other tokens at the k-th layer do not lack adjacent-token features when calculating contextual features.
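A minimal sketch of one encoding round with skipped layers follows. It is written under stated assumptions: the completion for a skipped layer is taken to be the Hadamard product of the token's most recent available feature matrix with the diffusion intensity matrix retained for that layer from the previous round, and the layers are stand-in callables rather than Transformer layers. The names `run_round`, `double`, and `add_one` are hypothetical.

```python
def hadamard(a, b):
    """Element-wise product of two equally shaped matrices (lists of rows)."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def run_round(x, layers, active, d_prev_round):
    """Return the per-layer feature matrices of one token for one round.

    layers[i], active[i] and d_prev_round[i] all refer to layer i + 1.
    """
    outputs = []
    h = x
    for layer, is_active, d in zip(layers, active, d_prev_round):
        if is_active:
            h = layer(h)        # normal encoding by this layer
        else:
            h = hadamard(h, d)  # assumed completion for the skipped layer
        outputs.append(h)       # available to other tokens at this layer
    return outputs

def double(m):
    return [[2.0 * v for v in row] for row in m]

def add_one(m):
    return [[v + 1.0 for v in row] for row in m]

x = [[1.0, 2.0]]
layers = [double, add_one, double]
active = [True, False, True]                         # layer 2 is skipped
d_prev = [[[1.0, 1.0]], [[0.5, 1.0]], [[1.0, 1.0]]]  # illustrative intensities

print(run_round(x, layers, active, d_prev))
```

Every layer, skipped or not, contributes an entry to the returned list, which reflects the purpose of the completion step: other tokens that look up this token's features at the skipped layer still find a matrix of the expected shape.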

[0126] S8. Based on the current layer feature matrix of each padded word, continue to execute subsequent layer encoding according to the dynamic connection path to obtain the output feature matrix of the final layer, and output the semantic understanding result based on the output feature matrix of the final layer. Specifically, the output module is used to output text classification and text understanding (i.e., semantic understanding).

[0127] This embodiment presents a model processing method for semantic understanding that focuses on key features through adaptive diffusion, improving text understanding accuracy by over 3%. By reducing inter-layer connections by 20%-30% through a dynamic connection module, inference speed is improved by over 30%. Furthermore, it requires no modification to the Transformer core structure and can be directly embedded into mainstream models such as GPT and LLaMA, adapting to tasks such as text understanding and text classification.

[0128] Ablation experiments were conducted on the model processing method for semantic understanding in this embodiment, and the results are shown in Figure 3.

[0129] The chart type for the ablation experiment is a biaxial bar chart (the left axis represents the accuracy index, and the right axis represents the efficiency index).

[0130] The core objective of the ablation experiment is to verify the independent effects of the adaptive diffusion module and the dynamic connection module.

[0131] In Figure 3, the horizontal axis represents the model type (3 types): ADLC-full, ADLC-no-diffusion, and ADLC-no-connection.

[0132] "ADLC-full" refers to the complete model of this invention (including adaptive diffusion and dynamic connection).

[0133] "ADLC-no-diffusion" means that the adaptive diffusion module is removed and only the dynamic connection is retained.

[0134] "ADLC-no-connection" means that the dynamic connection module is removed and only the adaptive diffusion is retained.

[0135] In Figure 3, the left vertical axis (blue) represents perplexity (PPL), where lower values indicate better performance, with a range of 7 to 14. The right vertical axis (orange) represents inference speed (tokens/s), where higher values indicate better efficiency, with a range of 60 to 140.

[0136] Accuracy dimension: as shown in Figure 3, the PPL of ADLC-full is 9.8, which is lower than that of ADLC-no-diffusion (11.5) and ADLC-no-connection (10.2), indicating that the adaptive diffusion module can effectively focus on key features and provide more accurate semantic information.

[0137] Efficiency dimension: as shown in Figure 3, the inference speed of ADLC-full is 120 tokens/s, which is higher than that of ADLC-no-diffusion (90 tokens/s) and ADLC-no-connection (115 tokens/s), indicating that the dynamic connection module can reduce redundant connections and reduce computational overhead.
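To put the ablation figures on a common scale, the short calculation below expresses the difference between ADLC-full and each ablated variant as a relative change. The numbers are those stated for Figure 3; the helper name `relative_change` is illustrative.

```python
def relative_change(full, ablated):
    """Change of the full model relative to an ablated variant, in percent."""
    return (full - ablated) / ablated * 100.0

ppl = {"ADLC-full": 9.8, "ADLC-no-diffusion": 11.5, "ADLC-no-connection": 10.2}
speed = {"ADLC-full": 120.0, "ADLC-no-diffusion": 90.0, "ADLC-no-connection": 115.0}

for variant in ("ADLC-no-diffusion", "ADLC-no-connection"):
    d_ppl = relative_change(ppl["ADLC-full"], ppl[variant])
    d_speed = relative_change(speed["ADLC-full"], speed[variant])
    print(f"{variant}: PPL {d_ppl:+.1f}%, speed {d_speed:+.1f}%")
```

Relative to ADLC-no-diffusion, the full model has about 14.8% lower perplexity and about 33.3% higher inference speed. Relative to ADLC-no-connection, the differences are about 3.9% and 4.3%, respectively.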

[0138] To facilitate understanding of the present invention, a specific application scenario is used below to verify the technical effects of the embodiments of the present invention. The experimental design is as follows. Wherein, LLaMA 3 7B is the 7B version of the third-generation LLaMA model.

[0139] Baseline Model 1: Standard LLaMA 3 7B (fully connected layered).

[0140] Baseline Model 2: LLaMA 3 7B + MoE (8 experts, dynamic activation).

[0141] Baseline Model 3: LLaMA 3 7B + Sparse Transformer (connection sparsity 50%).

[0142] Dataset: The training set, validation set, and test set are 45%, 25%, and 15% non-overlapping subsets randomly sampled from CCI 4.0-M2-Base-V1 Chinese web pages (1.7 billion lines of text), respectively.

[0143] Evaluation metrics include performance metrics and efficiency metrics.

[0144] Performance metrics include: perplexity (PPL, lower is better) and GLUE accuracy (higher is better). Perplexity measures the model's uncertainty in predicting the next word; a lower value indicates a better model. GLUE accuracy is the arithmetic mean of the accuracy rates for semantic, syntactic, and inference tasks, reflecting the model's overall ability in language understanding tasks; a higher value indicates a better model.
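The two performance metrics can be sketched as follows, using the standard definition of perplexity as the exponential of the average negative log-likelihood per token, and the arithmetic mean stated above for the averaged accuracy. The input values are illustrative only and are not experimental results; the function names are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability of the observed tokens."""
    if not token_log_probs:
        raise ValueError("at least one token is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def mean_accuracy(task_accuracies):
    """Arithmetic mean of per-task accuracies."""
    return sum(task_accuracies) / len(task_accuracies)

# A model that assigns probability 0.25 to every observed token is as
# uncertain as a uniform choice among 4 options.
log_probs = [math.log(0.25)] * 5
print(perplexity(log_probs))

# Illustrative accuracies for semantic, syntactic, and inference tasks.
print(mean_accuracy([0.80, 0.90, 0.70]))
```

A perplexity of 4 in the first example corresponds to a uniform choice among four candidates at each step, which is why lower values indicate a less uncertain model.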

[0145] Efficiency metrics include: number of parameters (inter-layer connections), inference speed (tokens / s, the higher the better), and GPU memory usage (GB, the lower the better).

[0146] Hardware: GPU (NVIDIA 4090 24GB) ×8, RAM 128GB.

[0147] The experimental results are as follows, including comparisons of accuracy and efficiency.

[0148] Table 3. Accuracy Comparison.

[0149]

[0150] Percentage improvement in average accuracy: .

[0151] Table 4. Efficiency Comparison.

[0152]

[0153] Efficiency improvement percentage: .

[0154] The experimental results include performance, efficiency, and overall advantages.

[0155] In terms of performance: the PPL of this invention is 1.4 lower than the baseline model 1, and the GLUE accuracy is 3.2% higher. This is because the adaptive diffusion module enhances the transmission of key features.

[0156] In terms of efficiency: the inference speed of this invention is 35 tokens / s higher than the baseline model 1, and the memory usage is 4.2GB lower. This is because dynamic linking reduces redundant Transformer layers.

[0157] In terms of overall advantages: This invention significantly outperforms the baseline in terms of performance-efficiency balance, is suitable for a wide range of large language model application scenarios, and is particularly valuable in low-computing-power devices and high-concurrency real-time systems.

[0158] Example 2: This invention provides a model processing device for semantic understanding, comprising a processor, a memory, and a computer program stored in the memory. The computer program can be executed by the processor to implement a model processing method for semantic understanding as described in any paragraph of Example 1.

[0159] Example 3: The present invention provides a computer-readable storage medium. The computer-readable storage medium includes a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to perform a model processing method for semantic understanding as described in any paragraph of Example 1.

[0160] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A model processing method for semantic understanding, characterized in that it comprises: S1. obtaining the text information to be processed; S2. preprocessing the text information to be processed to generate a text feature matrix for each token; S3. encoding the text feature matrix of each token layer by layer with a Transformer-based hierarchical encoding module to obtain the output feature matrix of each token at each layer; S4. calculating the feature importance and the inter-layer similarity based on the output feature matrix of each token at each layer to generate the diffusion intensity matrix of each token at each layer; S5. based on the diffusion intensity matrix of each token at each layer in the previous round, optimizing the input feature matrix of each token at each layer in the current round; S6. determining the inter-layer dynamic connection path for the current round based on the inter-layer similarity of each token in the previous round and the current computational load; S7. in the case of inter-layer skip connections, when it is determined from the previous round of hierarchical encoding that a token skips the k-th layer connection in the current round, simulating and completing the token's missing features at the skipped layer based on the diffusion intensity matrix to obtain a feature fusion completion matrix, which is used when contextual features are calculated for other tokens at the k-th layer; S8. based on the completed current-layer feature matrix of each token, continuing the subsequent layer encoding according to the dynamic connection path to obtain the output feature matrix of the final layer, and outputting the semantic understanding result according to the output feature matrix of the final layer.

2. The model processing method for semantic understanding according to claim 1, characterized in that, in S4, before the diffusion intensity matrix is calculated, the feature importance matrix is first initialized; wherein the feature importance matrix has the same size as the text feature matrix; during initialization, the initial weights of entity words and verbs are set to 0.8, the initial weights of adjectives are set to 0.6, the initial weights of function words are set to 0.2, and the initial weights of contextual grammatical features are uniformly set to 0.5; the feature importance matrix of the first layer is: $I_1 = I_0 \odot A_1$; in the formula, $I_1$ denotes the feature importance matrix of the first layer; $I_0$ denotes the feature importance matrix after initialization; $A_1$ denotes the multi-head attention weight matrix of the first layer of the Transformer; $\odot$ denotes the Hadamard product operation; the feature importance matrix of the $k$-th layer is: $I_k = I_{k-1} \odot A_k$; in the formula, $I_k$ denotes the feature importance matrix of the $k$-th layer; $I_{k-1}$ denotes the feature importance matrix of the $(k-1)$-th layer; $A_k$ denotes the multi-head attention weight matrix of the $k$-th layer of the Transformer; $k$ denotes the layer index; the size of the multi-head attention weight matrix is given by the feature dimension and the length of the feature sequence; probability distribution normalization is performed on the feature importance matrix of the $k$-th layer: $\tilde{I}_k(i,j) = \dfrac{e^{I_k(i,j)}}{\sum_{p}\sum_{q} e^{I_k(p,q)}}$; in the formula, $\tilde{I}_k(i,j)$ denotes the element in the $i$-th row and $j$-th column of the normalized feature importance matrix of the $k$-th layer; $I_k(i,j)$ denotes the element in the $i$-th row and $j$-th column of the feature importance matrix of the $k$-th layer before normalization; $e$ denotes the natural constant; $p$ and $q$ denote the row index and column index in the summation, respectively.

3. The model processing method for semantic understanding according to claim 1, characterized in that the inter-layer similarity characterizes the degree of consistency between the feature representations of two adjacent layers; the higher the inter-layer similarity, the lower the diffusion intensity of the corresponding layer, so as to reduce the transmission of redundant features; the inter-layer similarity of the $k$-th layer relative to the $(k-1)$-th layer is: $S_k = \dfrac{\langle H_k, H_{k-1} \rangle}{\lVert H_k \rVert \, \lVert H_{k-1} \rVert}$; in the formula, $S_k$ denotes the inter-layer similarity of the $k$-th layer relative to the $(k-1)$-th layer; $H_k$ denotes the output feature matrix of a token at the $k$-th layer; $H_{k-1}$ denotes the output feature matrix of the token at the $(k-1)$-th layer; $\langle \cdot , \cdot \rangle$ denotes the inner product operation; $\lVert \cdot \rVert$ denotes the Euclidean norm; $k$ denotes the layer index; the diffusion intensity matrix of the $k$-th layer is determined from the following quantities: the adjustment factor; the feature importance matrix of the $k$-th layer; and the base diffusion matrix, which has the same dimensions as the text feature matrix, given by the feature dimension and the length of the feature sequence; the Hadamard product operation is used in the calculation; the inter-layer similarity and the diffusion intensity matrix calculated in each round of hierarchical encoding are both retained for use in the next round of hierarchical encoding; the adjustment factor takes a value from 0.5 to 1.

4. The model processing method for semantic understanding according to claim 1, characterized in that S5 specifically includes: based on the diffusion intensity matrix retained from the previous round of hierarchical encoding, updating the input feature matrix of the k-th layer in the current round; the update relates the following quantities: the input matrix of the k-th layer after the diffusion intensity is superimposed; the output feature matrix of a token at the preceding layer; the retained diffusion intensity matrix; and the Hadamard product operation.

5. The model processing method for semantic understanding according to claim 1, characterized in that, in S6, the connection probability from the $(k-1)$-th layer to the $k$-th layer is determined from the following quantities: the Sigmoid function; the inter-layer similarity of the $k$-th layer relative to the $(k-1)$-th layer; the load adjustment factor, with a value ranging from 0 to 0.5; the current computational load of the $k$-th layer; and the layer index $k$; the Sigmoid function is: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$; in the formula, $\sigma(x)$ denotes the Sigmoid function value corresponding to the independent variable $x$; $x$ denotes the input variable of the function; $e$ denotes the natural constant; when the current round of hierarchical encoding calculates the connection probability, the inter-layer similarity used is the one calculated and retained in the previous round of hierarchical encoding; the current computational load of the $k$-th layer is taken as the GPU memory usage of that layer in the previous training round; on this basis, the calculation results of the previous round determine whether the $(k-1)$-th layer and the $k$-th layer are connected in the current round; the dynamic connection path is determined according to the following rules: the first layer remains active and is not turned off; when the connection probability corresponding to the $k$-th layer is greater than or equal to 0.6, the $k$-th layer is activated, otherwise the $k$-th layer is turned off; any two consecutive layers are not turned off simultaneously, to avoid excessive damage to features.

6. The model processing method for semantic understanding according to claim 1, characterized in that, in S7, when it is determined from the results of the previous round of hierarchical encoding that a token will skip the k-th layer connection in the current round, the token's output feature matrix and its diffusion intensity matrix retained from the previous round of hierarchical encoding are used to simulate the token's missing features at the k-th layer, generating the token's feature fusion completion matrix for the k-th layer in the current round; the feature fusion completion matrix is determined from the following quantities: the token's output feature matrix; the token's diffusion intensity matrix from the previous round of hierarchical encoding; the Hadamard product operation; and the layer index k; the feature fusion completion matrix solves the problem that, when the token skips the k-th layer, other tokens at the k-th layer lack adjacent-token features when calculating contextual features.

7. A model processing method for semantic understanding according to any one of claims 1 to 6, characterized in that, in S3, a hierarchical encoding module is embedded in the Transformer model to perform layer-by-layer semantic representation learning on the input features; the hierarchical encoding module includes multiple encoding or decoding layers from the first layer to the n-th layer, where n denotes the total number of layers in the hierarchical encoding module; the first layer of the hierarchical encoding receives the text feature matrix of a token as input and outputs the token's output feature matrix at the first layer; the k-th layer of the hierarchical encoding receives the token's output feature matrix at the (k-1)-th layer as input and outputs the token's output feature matrix at the k-th layer, where k is the layer index and 2 ≤ k ≤ n; each token is encoded at layers 1 to n according to the preceding encoding steps; when the context-related features of a token are calculated, the feature matrices of other related tokens at the corresponding layer are called to participate in the calculation.

8. A model processing method for semantic understanding according to any one of claims 1 to 6, characterized in that the preprocessing in S2 includes word segmentation, lexical feature extraction, and contextual grammar feature extraction of the text information to be processed; wherein the lexical features include one or more of word vectors, word frequency, word length, and part of speech; the contextual grammar features include one or more of phrase structure, punctuation features, positional features, and dependency relations; a text feature matrix is constructed for each token obtained by word segmentation; the text feature matrix obtained after preprocessing a token has a size given by the feature dimension and the length of the feature sequence; different dimensions in the text feature matrix jointly represent word-level semantic information and contextual structure information.

9. A model processing device for semantic understanding, characterized in that it includes a processor, a memory, and a computer program stored in the memory; the computer program can be executed by the processor to implement a model processing method for semantic understanding as described in any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein, when the computer program is executed, it controls the device on which the computer-readable storage medium is located to perform a model processing method for semantic understanding as described in any one of claims 1 to 8.