Self-attention mechanism optimization method based on sentence structure constraint
By constructing sentence structure constraints in the self-attention mechanism, limiting attention computation to within sentences and utilizing sentence boundaries for information compression and transmission, the problem of excessive resource consumption in long text processing by the self-attention mechanism is solved, improving the model's efficiency and similarity to human brain language processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing self-attention mechanisms consume too many computational resources when processing long texts and do not make full use of sentence structure rules or of the human brain's language processing mechanisms, so models are inefficient in long-text processing and in low-resource deployment scenarios.
By acquiring a large-scale unlabeled corpus, identifying sentence boundaries and constructing structured attention constraints, attention computation is restricted to within sentences. Information compression and cross-sentence information transfer are performed using sentence boundaries to optimize the self-attention mechanism.
It significantly reduces the computational size and memory consumption of the attention matrix, improves the model's processing efficiency and interpretability, and enhances its consistency with the human brain's language processing mechanism.
Figure CN122242483A_ABST
Description
Technical Field
[0001] This invention belongs to the technical field of language model training, and in particular relates to an optimization method for a self-attention mechanism based on sentence structure constraints.
Background Technology
[0002] Language models are probabilistic models that predict the next word based on preceding context in a text sequence. They are also the foundational models for many downstream tasks such as question answering, machine translation, and reasoning. The rapid improvement in the performance of large language models in recent years is largely attributed to the self-attention mechanism in the Transformer architecture. This mechanism establishes dependencies between any two positions in a text sequence, effectively capturing long-distance semantic relationships in language. The self-attention mechanism has directly propelled the leapfrog development of language models and has also spawned general-purpose artificial intelligence systems such as ChatGPT and DeepSeek.
[0003] Existing Transformer-based self-attention mechanisms typically calculate attention weights from the relevance between the query and the key, and then use these weights to form a weighted combination of the information at different positions. For an input sequence of length n, self-attention usually requires calculating the pairwise correlations between all positions in the sequence, resulting in a computational complexity of O(n²). As the length of the input sequence and the scale of the model parameters increase, self-attention computation leads to very high memory consumption and significantly increases training and inference time, thus limiting the application of the model in scenarios such as long text processing, low-resource deployment, and real-time inference.
[0004] To address the aforementioned issues, various self-attention optimization schemes have been proposed in existing technologies. Some studies employ methods such as sparse attention, local window attention, sliding window attention, or block attention, which avoid having each position interact with the entire context. Instead, they only perform calculations with adjacent positions, positions within a fixed window, or pre-divided local blocks, thereby reducing unnecessary pairwise calculations and thus lowering computational and storage requirements.
[0005] However, existing self-attention optimization methods still have the following drawbacks. First, although existing methods reduce computational cost to some extent, the memory and computation they require still grow with sequence length, so the overall overhead for long text sequences remains very high. Large language models are now expected to handle sequences whose lengths reach the millions, and these lengths are still growing rapidly. Existing methods cannot fundamentally solve the problem of excessive resource consumption by self-attention mechanisms.
[0006] Secondly, existing methods often fail to fully consider the inherent rules of language, such as sentence structure. In natural language, words within the same sentence typically exhibit stronger dependencies, while dependencies between different sentences are relatively weaker. Traditional self-attention mechanisms generate highly redundant and unnecessary computations; and existing attention optimization methods often rely on fixed-length partitions and pre-defined sparse patterns to artificially limit the attention scope, lacking effective utilization of the inherent structural rules of language, resulting in relatively insufficient model interpretability.
[0007] Furthermore, existing methods do not fully utilize the information integration and compression mechanisms of the human brain based on sentence structure during language processing. Psycholinguistic and neuroscience research indicates that human language processing exhibits hierarchical encoding characteristics based on sentence structure. When processing text sequences, the human brain does not retain all word information from preceding text, nor does it comprehensively calculate the semantic relationships between any two words. Instead, it integrates and compresses preceding information at sentence structure boundaries, thereby achieving efficient language understanding with lower resource consumption. In contrast, existing self-attention mechanisms and their optimization schemes underutilize this mechanism, causing models to still rely on large amounts of memory and extensive pairwise computations during language processing, resulting in significant differences from the human brain's language processing mechanisms.
Summary of the Invention
[0008] To address the aforementioned problems, this invention provides a self-attention mechanism optimization method based on sentence structure constraints, employing the following technical solution:
[0009] A brain-inspired optimization method for sentence structure-based self-attention mechanisms includes the following steps:
[0010] Obtain a large-scale unlabeled corpus, preprocess the corpus text and identify sentence boundaries to obtain sentence boundary information;
[0011] Structured attention constraints are constructed based on sentence boundaries: intra-sentence attention constraints are constructed for ordinary characters to allow direct attention computation between characters within a sentence; intra-sentence compression constraints are constructed for sentence boundary positions to form a sentence-level compressed representation; and cross-sentence constraints are constructed to allow both ordinary characters and boundary positions in the current sentence to access the boundary positions of the previous sentence.
[0012] Based on the corpus, structured attention constraints are applied to the self-attention modules of each layer of the pre-trained language model. The model with structured attention constraints is then trained or fine-tuned so that the model learns information integration methods based on sentence structure.
[0013] Perform performance testing and brain-likeness analysis on the optimized model.
[0014] Furthermore, acquiring a large-scale unlabeled corpus and performing preprocessing and sentence boundary identification on the corpus text includes:
[0015] Collect an unannotated large-scale text corpus including news corpus, Baidu Encyclopedia corpus or other continuous discourse data;
[0016] Preprocess the corpus by removing non-Chinese characters and converting traditional Chinese to simplified Chinese;
[0017] Match sentence boundaries through end-of-sentence punctuation marks such as periods, question marks, and exclamation marks, or identify sentence boundaries using sentence splitting tools such as NLTK, spaCy, and HanLP;
[0018] Mark the sentence boundary positions, either by adopting the end-of-sentence punctuation itself as the boundary marker or by adding a dedicated sentence boundary marker at the corresponding position.
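The preparation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the punctuation set, the range of characters kept by `clean`, the character-level tokenization, and the marker string `<sb>` are all assumptions made for the example. Traditional-to-simplified conversion is omitted because it needs an external conversion table or tool.

```python
import re

# Assumptions for illustration only: the source does not fix the punctuation
# set, the kept-character range, or the name of the boundary marker.
SENTENCE_END = "。？！"
BOUNDARY = "<sb>"  # hypothetical dedicated sentence boundary marker

def clean(text):
    """Keep CJK ideographs and common Chinese punctuation; drop everything else."""
    return re.sub(r"[^\u4e00-\u9fff。？！，、；：]", "", text)

def split_sentences(text):
    """Split the text after each run of sentence-final punctuation."""
    pattern = "[^" + SENTENCE_END + "]+[" + SENTENCE_END + "]*"
    return re.findall(pattern, text)

def mark_boundaries(sentences):
    """Return character tokens, the sentence index c(i) of every position,
    and a flag telling whether the position is a sentence boundary."""
    tokens, sent_ids, is_boundary = [], [], []
    for k, sentence in enumerate(sentences):
        for ch in sentence:
            tokens.append(ch)
            sent_ids.append(k)
            is_boundary.append(False)
        # One dedicated marker closes sentence k.
        tokens.append(BOUNDARY)
        sent_ids.append(k)
        is_boundary.append(True)
    return tokens, sent_ids, is_boundary

sentences = split_sentences(clean("今天下雨。Hello 你带伞了吗？"))
tokens, sent_ids, is_boundary = mark_boundaries(sentences)
print(sentences)
print(tokens)
```

The `sent_ids` and `is_boundary` lists are the sentence boundary information that the later constraint-construction step consumes.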
[0019] Further, in constructing the attention mask matrix based on the sentence boundary, let the input sequence be $W = \{w_1, w_2, \ldots, w_T\}$, where $T$ is the sequence length and $c(i)$ represents the number of the sentence to which position $i$ belongs; for any non-boundary position $i$, if the historical position $j$ satisfies $j \le i$ and $c(j) = c(i)$, then a direct attention association is allowed to be established, and if $c(j) < c(i)$, no association is established.
[0020] Further, constructing intra-sentence compression constraints for sentence boundary positions includes: for the boundary position $b_k$ of any sentence $k$, allowing it to access all historical word positions within the sentence, thereby forming a sentence-level compressed representation.
[0021] Further, constructing cross-sentence constraints includes: for the ordinary word $i$ in any sentence $k$, in addition to allowing it to access the word positions within the sentence, it is also allowed to access the boundary position $b_{k-1}$ of the previous sentence; for the boundary position $b_k$ in any sentence $k$, in addition to accessing the words within the sentence, it is also allowed to access the boundary position $b_{k-1}$ of the previous sentence, so as to recursively transfer the compressed information of the previous sentence into the boundary representation of the current sentence.
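The three constraints above can be combined into a single Boolean attention mask. The sketch below is one possible reading, assuming each sentence has exactly one boundary position and that it is the sentence's last position; the function name `build_structured_mask` and the toy inputs are hypothetical.

```python
import numpy as np

def build_structured_mask(sent_ids, is_boundary):
    """mask[i, j] is True when position i may attend to position j.

    sent_ids[i]    : sentence index c(i) of position i
    is_boundary[i] : True if position i is a boundary position b_k
    Assumes one boundary per sentence, placed at the end of the sentence.
    """
    c = np.asarray(sent_ids)
    b = np.asarray(is_boundary, dtype=bool)
    T = len(c)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    causal = j <= i                                   # only historical positions
    same_sentence = c[None, :] == c[:, None]          # c(j) == c(i)
    prev_boundary = (c[None, :] == c[:, None] - 1) & b[None, :]  # j is b_{k-1}
    return causal & (same_sentence | prev_boundary)

# Two sentences of three positions each; positions 2 and 5 are boundaries.
sent_ids = [0, 0, 0, 1, 1, 1]
is_boundary = [False, False, True, False, False, True]
mask = build_structured_mask(sent_ids, is_boundary)
print(mask.astype(int))
```

Because a boundary sits at the end of its sentence, the causal condition alone lets it see every word of that sentence, so ordinary and boundary positions share one rule in this sketch.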
[0022] Further, applying the structured attention constraint to each layer of the self-attention module of the pre-trained language model includes: the first N - 1 layers of the N-layer self-attention module are used to complete information encoding, and the last layer completes the training task based on the word-level representation of the current sentence and the boundary compressed representation of the previous sentence to predict the next word.
[0023] Further, training or fine-tuning the model with attention constraints imposed based on the corpus includes:
[0024] Load an existing pre-trained language model and modify the attention mask matrix to fine-tune the model by applying structured attention constraints, or directly train the language model from scratch using structured attention constraints; the training objective adopts an autoregressive language modeling task, predicting the next word based on the preceding text.
[0025] Furthermore, performance testing of the optimized model includes: training a model without structured attention constraints and a model with structured attention constraints under the same model size, the same training corpus, and the same training configuration, and then comparing and evaluating performance test metrics on the same test set.
[0026] Furthermore, the performance test metrics include prediction accuracy, perplexity, model convergence speed, inference speed, and memory usage. Prediction accuracy is defined as the proportion of test-set positions at which the model correctly predicts the next word; perplexity is defined as the exponential of the average negative log-probability assigned to the true next word on the test set; model convergence speed is defined as the number of training steps required for the training loss or perplexity to reach a preset threshold for the first time; inference speed is defined as the number of words or tokens processed per unit time; and memory usage is defined as the peak memory usage during the training or inference phase.
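Three of the metric definitions above translate directly into code. The following is a minimal sketch with hypothetical function names and toy inputs; it assumes natural-log probabilities and counts training steps from 1.

```python
import math

def next_word_accuracy(predicted, actual):
    """Proportion of positions where the predicted next word equals the true one."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def perplexity(true_word_log_probs):
    """exp of the average negative natural-log probability of the true next word."""
    avg_nll = -sum(true_word_log_probs) / len(true_word_log_probs)
    return math.exp(avg_nll)

def steps_to_threshold(loss_curve, threshold):
    """First training step (1-based) at which the loss reaches the threshold,
    or None if it never does."""
    for step, loss in enumerate(loss_curve, start=1):
        if loss <= threshold:
            return step
    return None

acc = next_word_accuracy(["a", "b", "c"], ["a", "x", "c"])
ppl = perplexity([math.log(0.25)] * 4)
steps = steps_to_threshold([3.0, 2.5, 1.9, 1.8], threshold=2.0)
print(acc, ppl, steps)
```

Inference speed and peak memory depend on the runtime and hardware, so they are left to whatever profiling tools the training framework provides.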
[0027] Furthermore, the brain-likeness analysis of the optimized model includes: extracting the hidden layer activation representations of each layer of the model at each word position, and aligning the representations with the corresponding human brain neural signals according to the stimulus presentation time; constructing model representation features containing multiple time delays, using a linear coding model to predict the neural responses on each electrode or channel, and calculating the correlation coefficient between the predicted neural responses and the real neural responses on the test set as the alignment index.
[0028] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0029] This invention, by drawing on the linguistic principle of stronger intra-sentence dependencies and the information integration mechanism of the human brain at sentence structural boundaries, restricts attention computation primarily to within sentences and compresses cross-sentence information interaction into access to sentence boundary representations. This allows the model to focus on information within a single sentence length. While preserving long-range information integration capabilities, this method significantly reduces the computational size and memory consumption of the attention matrix, lowers computational complexity, improves the rationality, interpretability, and processing efficiency of attention allocation, and enhances the consistency between the model and the human brain's language processing mechanisms.
Attached Figure Description
[0030] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0031] Figure 1 is a schematic diagram of the brain-inspired self-attention mechanism optimization method based on sentence structure constraints proposed in this application;
[0032] Figure 2 is a schematic diagram of the syntactically constrained attention mechanism used in the brain-inspired, sentence-structure-constrained self-attention mechanism optimization method of this application.
Detailed Implementation
[0033] Embodiments of the present invention are described in detail below. Examples of these embodiments are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
[0034] As shown in Figures 1 and 2, the brain-inspired, sentence-structure-based self-attention mechanism optimization method according to this application comprises the following steps:
[0035] S1) Obtain a large-scale unlabeled corpus, preprocess the corpus text and identify sentence boundaries to obtain sentence boundary information;
[0036] S2) Constructing structured attention constraints based on sentence boundaries: Constructing intra-sentence attention constraints for ordinary characters to allow direct attention computation between characters within a sentence; constructing intra-sentence compression constraints for sentence boundary positions to form a sentence-level compressed representation; and constructing cross-sentence constraints to allow both ordinary characters and boundary positions in the current sentence to access the boundary positions of the previous sentence.
[0037] S3) Based on the corpus, the structured attention constraints are applied to the self-attention modules of each layer of the pre-trained language model. The model after applying the structured attention constraints is trained or fine-tuned so that the model learns the information integration method based on sentence structure.
[0038] S4) Perform performance testing and brain-likeness analysis on the optimized model.
[0039] Step S1) above is the data preparation stage of the method. First, the system obtains a large-scale unannotated corpus, preprocesses the corpus text, and identifies sentence boundaries to obtain sentence boundary information. Specifically, the system collects a large-scale unannotated text corpus including news corpus, Baidu Encyclopedia corpus, or other continuous discourse data. The collected corpus is then preprocessed, for example by removing non-Chinese characters and converting traditional Chinese to simplified Chinese, to ensure a consistent data format. Next, sentence boundaries are either matched through end-of-sentence punctuation marks such as periods, question marks, and exclamation marks, or identified with sentence segmentation tools such as NLTK, spaCy, and HanLP. Finally, the positions of the sentence boundaries are marked.
[0040] The above solution ensures the quality and consistency of the input data through a standardized data collection, preprocessing and sentence boundary recognition process. The core output is sentence boundary information, which lays a foundation for the accurate application of subsequent structured attention constraints. The compatibility design of multiple sentence segmentation tools improves the applicability and engineering implementation flexibility of the method.
[0041] Step S2) of the above steps constructs structured attention constraints based on sentence boundaries. Specifically, it includes:
[0042] Construct intra-sentence attention constraints for ordinary characters, allowing direct attention computation between characters within a sentence. Let the input sequence be $W = \{w_1, w_2, \ldots, w_T\}$, where $T$ is the sequence length, $c(i)$ represents the number of the sentence to which position $i$ belongs, and $b_k$ represents the boundary position of the $k$-th sentence. For any non-boundary position $i$, if the historical position $j$ satisfies $j \le i$ and $c(j) = c(i)$, a direct attention association is allowed to be established. If $c(j) < c(i)$, that is, the historical position $j$ belongs to an earlier sentence, no direct attention association is established under this constraint; access to the boundary position of the previous sentence is governed separately by the cross-sentence constraint in [0044]. This solution transforms linguistic rules into computable formal rules, giving the construction of the attention mask matrix a clear implementation basis and supporting the reproducibility and engineering feasibility of the method. It precisely defines the constraints of full connection to historical positions within a sentence and no direct connection to the words of other sentences, so that direct word-to-word attention calculation is limited to within the sentence.
[0043] Construct intra-sentence compression constraints for sentence boundary positions to form sentence-level compressed representations. Specifically, the boundary position $b_k$ of any sentence $k$ is allowed to access all historical word positions within the sentence, thus forming a sentence-level compressed representation. This compressed representation handles sentence-level information aggregation and cross-sentence information transfer, realizing the information integration mechanism of the human brain at sentence boundaries. Through this boundary-position compression mechanism, the scheme encodes sentence-level semantics efficiently without adding computational overhead, providing a compact and information-rich carrier for cross-sentence information transfer and significantly reducing the memory pressure of long text processing.
[0044] Construct cross-sentence constraints such that both ordinary characters and boundary positions in the current sentence are allowed to access the boundary position of the previous sentence. Specifically, any ordinary character $i$ in sentence $k$, in addition to being allowed to access character positions within the current sentence, is also allowed to access the boundary position $b_{k-1}$ of the previous sentence. The boundary position $b_k$ of any sentence $k$, in addition to accessing words within the current sentence, is also allowed to access the boundary position $b_{k-1}$ of the previous sentence. This constraint recursively passes the compressed information of previous sentences into the boundary representation of the current sentence. Cross-sentence information interaction is thereby compressed into access to sentence boundary representations, which significantly reduces the computational overhead and memory consumption of long text processing.
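To make the reduction concrete, the number of permitted query-key pairs can be counted under an idealized assumption that is not part of the source: the sequence consists of $K$ sentences that each occupy $m$ positions ($m-1$ ordinary positions plus one boundary position), so that $T = Km$. The symbols $K$, $m$, $N_{\text{causal}}$, and $N_{\text{structured}}$ are introduced here only for this illustration.

```latex
% Assumption: K sentences of equal length m (including the boundary), T = K m.
\begin{align*}
N_{\text{causal}} &= \frac{T(T+1)}{2}, \\
N_{\text{structured}} &=
  \underbrace{K\,\frac{m(m+1)}{2}}_{\text{intra-sentence pairs}}
  + \underbrace{(K-1)\,m}_{\text{access to } b_{k-1}}
  = \frac{T(m+3)}{2} - m, \\
\frac{N_{\text{structured}}}{N_{\text{causal}}} &\approx \frac{m+3}{T+1}
  \qquad (K \gg 1).
\end{align*}
```

For example, with $T = 10000$ and $m = 25$, $N_{\text{causal}} = 50{,}005{,}000$ while $N_{\text{structured}} = 139{,}975$, roughly 0.28% of the causal count. Under this assumption the number of pairs grows linearly in $T$ for a bounded sentence length, rather than quadratically.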
[0045] Step S3 above is the model training phase of the method, which is based on the corpus, applying structured attention constraints to the self-attention modules of each layer of the pre-trained language model, and training or fine-tuning the model after applying structured attention constraints.
[0046] Specifically, the application of structured attention constraints to the self-attention modules of the pre-trained language model includes: the first N-1 layers of the N-layer self-attention module are used for information encoding, and the last layer completes the training task based on the character-level representation of the current sentence and the boundary compression representation of the previous sentence, i.e., predicting the next character. In other words, for a pre-trained language model containing N-layer self-attention modules, the first N-1 layers perform the information encoding function, integrating information within and across sentences; the last layer performs the prediction task, with its input including the character-level representation of the current sentence and the boundary compression representation of the previous sentence, and its output being the prediction result of the next character. This layered design ensures that the information encoding capability of the deep network and the prediction capability of the output layer work together, enabling structured attention constraints to effectively function in the deep architecture. This scheme, through layered functional division, makes structured attention constraints deeply compatible with the architecture of existing pre-trained language models, supporting both direct fine-tuning and de novo training modes, thus improving the engineering applicability and deployment flexibility of the method.
[0047] Furthermore, in the method for training or fine-tuning the model with applied attention constraints based on the corpus, an existing pre-trained language model is first loaded, and then the structured attention constraints are applied to the model for fine-tuning by modifying the attention mask matrix, or the language model is trained from scratch using the structured attention constraints without relying on the existing pre-trained model. The training objective adopts an autoregressive language modeling task, that is, predicting the next word based on the preceding context.
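Applying the constraint by modifying the attention mask can be illustrated with a single-head scaled dot-product attention in NumPy. This is a sketch under stated assumptions, not the patented implementation: the function name, the random toy inputs, and the single-head setting are all hypothetical, and a real model would pass an equivalent mask to its framework's attention layer.

```python
import numpy as np

def masked_self_attention(Q, K, V, mask):
    """Single-head attention in which disallowed pairs get zero weight.

    Q, K, V : arrays of shape (T, d)
    mask    : Boolean array of shape (T, T); True means "may attend"
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)              # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Toy structured mask: two sentences of three positions, boundaries at 2 and 5.
c = np.array([0, 0, 0, 1, 1, 1])
b = np.array([False, False, True, False, False, True])
idx = np.arange(len(c))
mask = (idx[None, :] <= idx[:, None]) & (
    (c[None, :] == c[:, None]) | ((c[None, :] == c[:, None] - 1) & b[None, :])
)

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((6, 8)) for _ in range(3))
output, weights = masked_self_attention(Q, K, V, mask)
print(np.round(weights, 2))
```

Every row keeps at least its diagonal entry, so the softmax is always well defined. Only the mask changes between the baseline and the constrained model, which is why the same weights can be fine-tuned rather than retrained.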
[0048] This implementation supports both fine-tuning of existing pre-trained language models (using pre-trained weights to accelerate convergence) and de novo training without relying on pre-trained models. The training objective is uniformly an autoregressive language modeling task, i.e., predicting the next word based on the preceding context, ensuring compatibility with the training paradigm of existing language models. This design allows the method to be directly applied to existing models and also used to build new structured language models. Constraint application is achieved by modifying the attention mask matrix without altering the core model architecture, reducing engineering implementation difficulty. Supporting both fine-tuning and de novo training modes balances development efficiency and customization needs, improving the method's versatility and scalability.
[0049] Step S4 above involves performance testing and brain-likeness analysis of the optimized model. Specifically:
[0050] First, the performance of the models with and without attention constraints is compared. Under the same model size, training corpus, and training configuration, a baseline model without structured attention constraints and an optimized model with structured attention constraints are trained separately, and then compared and evaluated on the same test set. Performance metrics include prediction accuracy, perplexity, model convergence speed, inference speed, and memory usage. Prediction accuracy is defined as the proportion of test-set positions at which the model correctly predicts the next character; perplexity is defined as the exponential of the average negative log-probability assigned to the true next character on the test set; model convergence speed is defined as the number of training steps required for the training loss or perplexity to first reach a preset threshold; inference speed is defined as the number of characters or tokens processed per unit time; and memory usage is defined as the peak memory usage during the training or inference phase. If the model with structured attention constraints maintains similar or better prediction accuracy and perplexity while needing fewer convergence steps, achieving higher inference speed, and using less peak memory, this indicates that the method can effectively improve computational efficiency and resource utilization while preserving language modeling performance.
[0051] Based on the above approach, the correlation between model prediction difficulty and human reading time and neural response can be analyzed. For each position, the model provides the conditional probability of the next word given the current context. From this conditional probability, a processing difficulty index, namely surprisal, is constructed. Generally speaking, the lower the conditional probability of the actual next word, the higher the surprisal at the corresponding position, indicating that the position is harder to predict and the cognitive processing load is greater. The correlation coefficients between surprisal and reading time, and between surprisal and neural response intensity, are calculated on publicly available human reading datasets (such as the Provo Corpus and Natural Stories datasets) and neural response datasets (such as the ECoG-Podcast and MEG-MASC datasets). Furthermore, a linear regression model can be established, using surprisal as the independent variable and factors such as word frequency, word length, and position as control variables, to analyze the independent explanatory power of the model's prediction difficulty for reading time or neural response. If the model with structured attention constraints shows a higher correlation with reading time and neural response, or retains stronger explanatory power after controlling for the above confounding factors, this indicates that the model's predictions are closer to the actual human language processing process.
[0052] Furthermore, an alignment analysis is performed between the hidden-layer representations of the model with structured attention constraints and human brain neural responses. Specifically, the hidden-layer activation representations of each layer of the model at each word position are extracted, and these representations are temporally aligned with the corresponding human brain neural signals according to the stimulus presentation time. Model representation features containing multiple time delays are then constructed, and a linear coding model is used to predict the neural responses at each electrode or channel. The linear coding model can be trained using ridge regression with a regularization term, and the correlation coefficient between the predicted and actual neural responses on the test set is used as the alignment metric. If the model with structured attention constraints achieves higher predictive performance across more channels or more time windows, this indicates that the model has formed an internal representation that more closely resembles the human brain's language processing mechanism.
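The surprisal analysis described above can be sketched with NumPy. The data below are synthetic and generated only so that the example runs; they are not results from the Provo Corpus, Natural Stories, or any other dataset. The function names, the choice of base-2 logarithm, and the use of ordinary least squares are assumptions made for illustration.

```python
import numpy as np

def surprisal(p_true):
    """Surprisal in bits: -log2 of the probability assigned to the actual next word."""
    return -np.log2(np.asarray(p_true, dtype=float))

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def ols_coefficients(surp, controls, response):
    """Least-squares fit of response ~ intercept + surprisal + controls.
    The returned vector holds the surprisal coefficient at index 1."""
    X = np.column_stack([np.ones(len(surp)), surp, controls])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    return beta

# Synthetic illustration only: 100 positions with made-up control variables
# (e.g. standardized word frequency and word length) and a made-up reading time.
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=100)
s = surprisal(p)
controls = rng.standard_normal((100, 2))
reading_time = 300 + 20 * s + 5 * controls[:, 0]

r = pearson_r(s, reading_time)
beta = ols_coefficients(s, controls, reading_time)
print(r, beta[1])
```

Comparing the surprisal coefficient, or the correlation, between the baseline and the constrained model on the same human data is then the comparison the paragraph describes.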
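A linear encoding analysis of this kind can be sketched as follows. The hidden states and neural responses here are synthetic stand-ins, the delays are expressed in word positions rather than in seconds, and the closed-form ridge solution and all function names are assumptions for illustration rather than the procedure actually used.

```python
import numpy as np

def add_delays(features, delays):
    """Stack time-delayed copies of the features side by side.
    features: (T, D); each delay d (0 <= d < T) shifts the rows down by d."""
    T, D = features.shape
    out = np.zeros((T, D * len(delays)))
    for n, d in enumerate(delays):
        if d == 0:
            out[:, n * D:(n + 1) * D] = features
        else:
            out[d:, n * D:(n + 1) * D] = features[:-d]
    return out

def ridge_fit(X, Y, alpha):
    """Closed-form ridge regression weights: (X^T X + alpha I)^-1 X^T Y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ Y)

def channel_correlations(Y_pred, Y_true):
    """Pearson correlation between prediction and target for each channel."""
    yp = Y_pred - Y_pred.mean(axis=0)
    yt = Y_true - Y_true.mean(axis=0)
    return (yp * yt).sum(axis=0) / np.sqrt((yp ** 2).sum(axis=0) * (yt ** 2).sum(axis=0))

# Synthetic stand-ins: 200 word positions, 4-dim hidden states, 3 channels.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((200, 4))
X = add_delays(hidden, delays=[0, 1, 2])
W_true = rng.standard_normal((X.shape[1], 3))
neural = X @ W_true + 0.01 * rng.standard_normal((200, 3))

W = ridge_fit(X[:150], neural[:150], alpha=1.0)   # fit on the training part
r = channel_correlations(X[150:] @ W, neural[150:])  # score on held-out part
print(r)
```

The per-channel correlations on the held-out part play the role of the alignment metric; in a real analysis they would be computed per layer and compared between the baseline and the constrained model.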
[0053] This verification system ensures that the method achieves synergistic results in both engineering optimization and brain-like characteristics.
[0054] This approach establishes a complete evaluation system for the effectiveness of the method through multi-dimensional performance testing and brain-inspired analysis, providing a quantitative basis for the optimization and iteration of the method. It also verifies the dual advantages of the method in terms of computational efficiency and biological rationality based on brain inspiration.
[0055] The method described herein is implemented by modifying the attention mask matrix. It can be directly applied to fine-tuning existing pre-trained language models or used to train new structured language models from scratch. This method is not only applicable to sentence units but can also be further extended to language units at different levels, such as phrases and clauses, to achieve more efficient attention constraints and optimizations that better conform to language structure rules and refined syntactic rules.
[0056] In other words, this invention proposes to constrain the model's attention weight matrix to high-level language units, rather than associating all characters pairwise. Taking a sentence unit as an example, attentional associations can be directly established between characters within a sentence, and information within a sentence is integrated at the sentence boundary to form a sentence-level compressed representation. Cross-sentence information transmission is achieved by accessing this sentence-level compressed representation, instead of directly calculating attention for each character in the preceding sentence. In this way, the attention matrix simultaneously realizes the structural constraints of intra-sentence dependencies and sentence-level information integration and compression. By drawing on the human brain's processing mechanism of information integration and compression at structural boundaries, cross-sentence information no longer relies on direct attention calculation for all preceding characters, but is transmitted through the compressed representation formed at the sentence boundary. This significantly reduces the computational scale of the attention matrix while preserving the ability to transmit cross-sentence information.
[0057] Specifically, this invention first segments the input text into sentences to determine sentence boundaries. Then, sentence structure constraints are applied to the model's self-attention weight matrix, enabling direct attention associations between characters within the same sentence. Sentence boundary positions integrate intra-sentence information to form a sentence-level compressed representation. Characters in subsequent sentences no longer directly access all characters in the preceding sentence, but instead obtain cross-sentence information by accessing the compressed representation at the preceding sentence's boundary positions. Simultaneously, the sentence boundary compressed representation can be recursively updated, meaning that while aggregating current sentence information, it receives the compressed representation of the previous sentence's boundary. Therefore, long-range contextual information propagates gradually along sentence boundaries without requiring attention calculations at the character level for all historical positions.
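One consequence of these access rules, drawn here as an illustration rather than stated in the source, is that at any position only the current sentence's earlier positions and the previous sentence's boundary need to remain readable. The sketch below lists those positions and compares the peak count with a fully causal cache; the function name and toy inputs are hypothetical, and it assumes one boundary per sentence at the sentence's end.

```python
def retained_positions(sent_ids, is_boundary, i):
    """Historical positions j <= i that position i is allowed to read:
    the positions of its own sentence and the previous sentence's boundary."""
    k = sent_ids[i]
    keep = []
    for j in range(i + 1):
        same_sentence = sent_ids[j] == k
        prev_boundary = sent_ids[j] == k - 1 and is_boundary[j]
        if same_sentence or prev_boundary:
            keep.append(j)
    return keep

# Three sentences with boundaries at positions 2, 5 and 7.
sent_ids = [0, 0, 0, 1, 1, 1, 2, 2]
is_boundary = [False, False, True, False, False, True, False, True]

per_position = [retained_positions(sent_ids, is_boundary, i) for i in range(len(sent_ids))]
peak_structured = max(len(p) for p in per_position)
peak_causal = len(sent_ids)  # a fully causal model reads every historical position
print(per_position[6], peak_structured, peak_causal)
```

In this toy case position 6, the first word of the third sentence, reads only the boundary at position 5 and itself, yet the boundary at 5 was itself computed with access to the boundary at 2, which is how earlier context is passed along.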
[0058] In summary, the method of this application restricts attention computation mainly to within sentences. By exploiting the sentence structure of language itself and the human brain's mechanism of integrating information at sentence boundaries, it constrains and optimizes the information interaction pattern of the self-attention weight matrix. Cross-sentence interaction is compressed into access to sentence boundary representations, so that each position only needs to attend to information within the span of roughly one sentence. Thus, while retaining the ability to integrate long-range information, the method significantly reduces the computational scale and memory consumption of the attention matrix, lowers computational complexity, improves the rationality, interpretability, and processing efficiency of attention allocation, and enhances the consistency between the model and the human brain's language processing mechanism.
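To give a rough sense of the reduction in attention-matrix size, the sketch below counts how many (query, key) pairs are allowed under ordinary causal attention and under the sentence-structured constraint. It assumes that each position attends causally within its own sentence and to exactly one boundary position of the previous sentence. The function names and the example of 50 sentences of 20 characters each are illustrative assumptions, not figures from the application.

```python
def causal_entries(seq_len):
    """Allowed (query, key) pairs under ordinary causal attention."""
    return seq_len * (seq_len + 1) // 2


def structured_entries(sentence_lengths):
    """Allowed pairs under the sentence-structured constraint.

    Each sentence length includes its boundary position.
    """
    total = 0
    for k, n in enumerate(sentence_lengths):
        total += n * (n + 1) // 2  # causal attention inside the sentence
        if k > 0:
            total += n  # every position also reads the previous boundary
    return total


if __name__ == "__main__":
    lengths = [20] * 50  # hypothetical text: 50 sentences of 20 characters
    full = causal_entries(sum(lengths))
    sparse = structured_entries(lengths)
    print(full, sparse)  # 500500 11480
```

In this hypothetical setting the constrained mask keeps about 2.3% of the causal entries. The count grows linearly with the number of sentences rather than quadratically with the sequence length. It reflects the mask only, so any measured saving depends on an attention kernel that skips the masked entries.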
[0059] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the above embodiments do not limit the present invention in any way, and all technical solutions obtained by equivalent substitution or equivalent transformation fall within the protection scope of the present invention.
Claims
1. A brain-inspired optimization method based on sentence structure for self-attention mechanisms, characterized in that it comprises the following steps: obtaining a large-scale unlabeled corpus, preprocessing the corpus text, and identifying sentence boundaries to obtain sentence boundary information; constructing structured attention constraints based on the sentence boundaries, wherein intra-sentence attention constraints are constructed for ordinary characters to allow direct attention computation between characters within a sentence, intra-sentence compression constraints are constructed for sentence boundary positions to form a sentence-level compressed representation, and cross-sentence constraints are constructed to allow both ordinary characters and boundary positions in the current sentence to access the boundary position of the previous sentence; applying the structured attention constraints to the self-attention modules of each layer of a pre-trained language model; training or fine-tuning, based on the corpus, the model to which the structured attention constraints have been applied, so that the model learns the information integration method based on sentence structure; and performing performance testing and brain-likeness analysis on the optimized model.
2. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that the process of acquiring a large-scale unlabeled corpus, preprocessing the text, and identifying sentence boundaries includes: collecting a large-scale unlabeled text corpus including news corpora, Baidu Encyclopedia corpora, or other continuous discourse data; preprocessing the corpus to remove non-Chinese characters and convert traditional Chinese characters to simplified Chinese characters; identifying sentence boundaries either by matching sentence-final punctuation (period, question mark, and exclamation mark) or by using a sentence segmentation tool such as NLTK, spaCy, or HanLP; and marking the sentence boundary positions either by letting the sentence-final punctuation serve as the boundary marker or by adding dedicated sentence boundary markers.
3. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that, in the construction of the attention mask matrix based on sentence boundaries, let the input sequence be W = {w_1, w_2, ..., w_T}, where T is the sequence length and c(i) denotes the index of the sentence to which position i belongs; for any non-boundary position i, if a historical position j satisfies j ≤ i and c(j) = c(i), a direct attention association is allowed to be established, and if c(j) < c(i), no association is established.
4. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that the construction of intra-sentence compression constraints at sentence boundary positions includes: for any sentence k, the boundary position b_k is allowed to access all historical word positions within that sentence, thereby forming a sentence-level compressed representation.
5. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that the construction of cross-sentence constraints includes: for any ordinary character i in sentence k, in addition to being allowed to access character positions within the current sentence, it is also allowed to access the boundary position b_{k-1} of the previous sentence; for the boundary position b_k of sentence k, in addition to being allowed to access words within the current sentence, it is also allowed to access the boundary position b_{k-1} of the previous sentence, so that the compressed information from preceding sentences is recursively passed to the boundary representation of the current sentence.
6. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that applying the structured attention constraints to the self-attention modules of the pre-trained language model includes: the first N-1 layers of the N-layer self-attention module complete information encoding, and the last layer performs the training task, predicting the next character based on the character-level representations of the current sentence and the boundary compressed representation of the previous sentence.
7. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that the training or fine-tuning of the attention-constrained model based on the corpus includes: loading an existing pre-trained language model and fine-tuning it with the attention mask matrix modified to apply the structured attention constraints, or training a language model from scratch under the structured attention constraints; the training objective is an autoregressive language modeling task that predicts the next word from the preceding text.
8. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that the performance testing of the optimized model includes: training a model without structured attention constraints and a model with structured attention constraints under the same model size, the same training corpus, and the same training configuration, and then comparing their performance test metrics on the same test set.
9. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 8, characterized in that the performance test metrics include: prediction accuracy, perplexity, model convergence speed, inference speed, and GPU memory usage. The prediction accuracy is defined as the proportion of positions in the test set at which the model correctly predicts the next character; the perplexity is defined as the exponential of the average negative log-probability of the true next character in the test set; the model convergence speed is defined as the number of training steps required for the training loss or perplexity to first reach a preset threshold; the inference speed is defined as the number of characters or tokens processed per unit time; and the GPU memory usage is defined as the peak GPU memory usage during the training or inference phase.
10. The brain-inspired sentence structure-based self-attention mechanism optimization method according to claim 1, characterized in that the brain-likeness analysis of the optimized model includes: extracting the hidden activation representations of each layer of the model at each word position, and aligning the representations with the corresponding human brain neural signals according to the stimulus presentation time; constructing model representation features containing multiple time delays, using a linear encoding model to predict the neural responses on each electrode or channel, and calculating the correlation coefficient between the predicted and actual neural responses on the test set as the alignment index.
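The brain-likeness analysis recited in claim 10 can be illustrated with a small time-lagged linear encoding model. The sketch below is only one possible reading. It assumes ridge regression as the linear model (the claim specifies only a linear model), time delays expressed in samples, and model activations already aligned to the neural sampling grid. The names lagged_features, fit_ridge, and alignment_scores are hypothetical, and the data in the usage example are synthetic.

```python
import numpy as np


def lagged_features(X, lags):
    """Stack copies of X (time x features), each delayed by a lag in samples.

    Rows that would need samples from before the recording start are zero.
    """
    T, D = X.shape
    out = np.zeros((T, D * len(lags)))
    for n, lag in enumerate(lags):
        cols = slice(n * D, (n + 1) * D)
        if lag == 0:
            out[:, cols] = X
        else:
            out[lag:, cols] = X[:-lag]
    return out


def fit_ridge(F, Y, alpha=1.0):
    """Closed-form ridge weights: (F^T F + alpha I)^-1 F^T Y."""
    gram = F.T @ F + alpha * np.eye(F.shape[1])
    return np.linalg.solve(gram, F.T @ Y)


def alignment_scores(F_train, Y_train, F_test, Y_test, alpha=1.0):
    """Pearson correlation between predicted and actual responses, per channel."""
    W = fit_ridge(F_train, Y_train, alpha)
    pred = F_test @ W
    pc = pred - pred.mean(axis=0)
    yc = Y_test - Y_test.mean(axis=0)
    return (pc * yc).sum(axis=0) / np.sqrt((pc ** 2).sum(axis=0) * (yc ** 2).sum(axis=0))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.standard_normal((200, 4))        # synthetic layer activations
    F = lagged_features(acts, lags=(0, 1, 2))   # features with three delays
    neural = F @ rng.standard_normal((12, 3))   # synthetic 3-channel responses
    neural += 0.01 * rng.standard_normal(neural.shape)
    print(alignment_scores(F[:150], neural[:150], F[150:], neural[150:]))
```

Lagged features are built on the full recording before the train/test split so that the first test samples still see real preceding activations rather than zero padding.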