Unsupervised sentence representation learning method based on tensor norm constraint and self-distillation
By using tensor norm constraints and self-distillation, the direction and norm of sentence embedding are optimized, which improves the semantic discriminative ability and model stability of unsupervised sentence representation learning. This solves the problem of insufficient vector direction optimization in existing methods and achieves more efficient sentence representation learning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TIANJIN POLYTECHNIC UNIV
- Filing Date
- 2026-05-15
- Publication Date
- 2026-06-12
AI Technical Summary
Existing unsupervised sentence representation learning methods only focus on vector direction when optimizing sentence embeddings, ignoring the semantic information contained in the vector norm. This limits the model's ability to discriminate semantic details, and the quality of the initial pseudo-labels is limited during the self-distillation process, affecting model performance.
The method constructs a dual encoder and a cross-encoder and introduces tensor norm constraints and self-distillation. A tensor norm loss function optimizes both the direction and the norm of sentence embeddings, and iterative mutual distillation, combined with the cross-encoder's self-attention mechanism, generates high-quality pseudo-similarity labels and improves the collaborative performance of the two encoders.
The method improves the discriminative ability of sentence representations and the model's semantic understanding. It outperforms existing methods on multiple semantic text similarity benchmarks, suppresses the cross-encoder's overfitting to pseudo-labels, and improves the stability and accuracy of the distillation process.
Figure CN122197908A_ABST
Description
Technical Field
[0001] This invention belongs to the field of semantic text representation technology, and in particular relates to an unsupervised sentence representation learning method based on tensor norm constraints and self-distillation.
Background Technology
[0002] Sentence representation learning is one of the core foundational tasks in natural language processing, aiming to map variable-length text to a fixed-dimensional semantic space, providing a unified feature representation for downstream applications such as semantic search, question answering systems, text clustering, and sentiment analysis. In recent years, pre-trained language models, represented by BERT and RoBERTa, have achieved breakthroughs in various sentence-level tasks thanks to their powerful context modeling capabilities. However, directly using sentence embeddings (such as [CLS] vectors or average pooling vectors) output by these models often exhibits anisotropic distribution characteristics, resulting in poor semantic similarity measurement performance.
[0003] To alleviate the anisotropy problem, contrastive learning has been introduced into the field of unsupervised sentence representation learning and has quickly become a mainstream paradigm. This type of method constructs positive and negative sample pairs and optimizes the cosine similarity of the embedding vectors, causing semantically similar sentences to be close together in space and dissimilar sentences to be far apart. Representative works include SimCSE, DiffCSE, and RankCSE, which have achieved excellent performance on multiple semantic text similarity benchmarks, greatly promoting the development of unsupervised sentence representation learning.
[0004] Existing contrastive learning methods generally focus only on the direction (i.e., angle) of the embedded vectors, neglecting the semantic information that the vector norm (i.e., length) may carry. From the perspective of information geometry, the semantic space should consider both direction and norm, but traditional contrastive learning only constrains the former, discarding the fine-grained information such as semantic richness and sentiment intensity implied by the norm. For example, a long sentence with dense information and detailed expression and a short sentence with vague semantics and concise structure may have similar thematic directions, but their norms should have significant differences; optimizing only the angle while ignoring norm alignment will limit the model's ability to discriminate semantic details.
[0005] To alleviate the aforementioned problems, existing research has attempted to introduce tensor norm constraints, simultaneously optimizing direction and norm in a dual-encoder structure. By designing a tensor norm loss, the norms of positive sample pairs are forced to converge, achieving better results than traditional contrastive learning on semantic similarity tasks. However, this method only focuses on the dual encoder itself, without extending it to sentence pair interaction modeling or deeply integrating it with self-distillation frameworks. On the other hand, self-distillation methods can combine the advantages of dual encoders and cross-encoders through iterative mutual distillation: the dual encoder provides efficient pseudo-labels, while the cross-encoder uses a self-attention mechanism to capture fine-grained inter-sentence interactions and, after training, feeds back into the dual encoder, forming a closed performance loop. However, existing self-distillation still uses traditional contrastive loss when initializing the dual encoder and lacks norm supervision, which limits the quality of the initial pseudo-labels and thus affects subsequent distillation results. Therefore, how to simultaneously optimize the direction and norm of sentence representations in unsupervised scenarios, and how to effectively improve the collaborative performance of dual encoders and cross-encoders through self-distillation, is a pressing technical problem.
Summary of the Invention
[0006] In view of this, the present invention aims to overcome the shortcomings of the above-mentioned problems in the prior art and proposes an unsupervised sentence representation learning method based on tensor norm constraints and self-distillation, which can simultaneously optimize the direction and norm of sentence embedding, and improve the collaborative performance of dual encoders and cross encoders through iterative mutual distillation.
[0007] To achieve the above objectives, the technical solution of the present invention is implemented as follows:
[0008] An unsupervised sentence representation learning method based on tensor norm constraints and self-distillation includes the following steps:
[0009] Step 1: Obtain an unsupervised sentence pair dataset, which contains multiple sentence pairs, each consisting of a first sentence and a second sentence;
[0010] Step 2: Construct a dual encoder, which includes a first encoder and a second encoder. The first encoder and the second encoder are two encoders with the same structure but trained independently with different parameters.
[0011] Step 3: Input each sentence in the sentence pair into the dual encoder, generate positive sample pairs using dropout, calculate the contrastive loss between positive sample pairs within each encoder and the interaction constraint loss between the two encoders, and introduce the tensor norm constraint loss to jointly optimize the dual encoder, obtaining an initial sentence representation aligned in both direction and norm.
[0012] Step 4: Use the dual encoder trained in Step 3 to generate pseudo-similarity labels for each sentence pair in the unsupervised sentence pair dataset, which will serve as the training target for the cross encoder.
[0013] Step 5: Construct a cross encoder. The cross encoder concatenates the two sentences in a sentence pair and inputs them into a pre-trained language model. It captures fine-grained interactions between sentences through a self-attention mechanism and uses the pseudo-similarity labels generated in Step 4 as supervision signals for training to obtain an optimized cross encoder.
[0014] Step 6: Use the cross-encoder trained in Step 5 to generate pseudo-similarity labels for each sentence pair in the unsupervised sentence pair dataset, as the training target for the dual encoder.
[0015] Step 7: Use the pseudo-similarity labels generated in Step 6 to retrain the dual encoder and update the parameters of the dual encoder;
[0016] Step 8: Repeat steps 4 to 7 multiple times until the performance of the dual encoder and the cross encoder converges, and obtain the final dual encoder and cross encoder.
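The alternation in Steps 4 to 8 can be sketched as a short control loop. This is a minimal illustration only: the names `mutual_distillation`, `train_cross`, and `train_bi` are hypothetical, scorers are modelled as plain functions from a sentence pair to a similarity score, and the actual model training is left to the caller-supplied callables.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]
Scorer = Callable[[Pair], float]
Trainer = Callable[[Sequence[Pair], List[float]], Scorer]


def mutual_distillation(
    pairs: Sequence[Pair],
    bi_scorer: Scorer,
    train_cross: Trainer,
    train_bi: Trainer,
    rounds: int = 3,
) -> Tuple[Scorer, Scorer]:
    """Alternate Steps 4-7 for a fixed number of rounds (Step 8)."""
    if rounds < 1:
        raise ValueError("at least one distillation round is required")
    cross_scorer: Optional[Scorer] = None
    for _ in range(rounds):
        # Step 4: the dual encoder produces pseudo-labels for every pair.
        labels = [bi_scorer(pair) for pair in pairs]
        # Step 5: the cross-encoder is trained on those pseudo-labels.
        cross_scorer = train_cross(pairs, labels)
        # Step 6: the trained cross-encoder relabels the same pairs.
        labels = [cross_scorer(pair) for pair in pairs]
        # Step 7: the dual encoder is retrained on the new pseudo-labels.
        bi_scorer = train_bi(pairs, labels)
    assert cross_scorer is not None
    return bi_scorer, cross_scorer
```

The same unlabeled sentence pairs are reused in every round; only the source of the pseudo-labels changes.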
[0017] Furthermore, in step 2, the first encoder and the second encoder employ different initialization strategies. The first encoder uses SimCSE for pre-training initialization, while the second encoder uses back-translation enhancement for pre-training initialization. This allows the two encoders to generate diverse initial embedding representations during the early stages of training. SimCSE fine-tunes the pre-trained language model through contrastive learning, enabling the model to produce more discriminative sentence embeddings. Back-translation enhancement generates semantically similar but expressively diverse sentence pairs by translating sentences into an intermediate language and then back into the original language, which are used for pre-training the encoder. These two different initialization strategies allow the dual encoders to capture semantic information from different perspectives, providing diverse initial states for subsequent norm constraint learning.
[0018] Furthermore, in step 3, the contrast loss between positive sample pairs within each encoder is achieved using the InfoNCE loss function, the expression of which is:
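[0019] $$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right)}$$
[0020] where $h_i$ and $h_i^{+}$ are the embedding representations of the positive sample pair generated by the same encoder for the same sentence using different dropout masks; all samples in the current batch of $N$ sentences other than the positive sample are treated as negative samples; $\mathrm{sim}$ is the cosine similarity function; and $\tau$ is the temperature coefficient, with a value between 0.05 and 0.1.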
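As a small numerical illustration of the in-batch InfoNCE loss described above, the following sketch works on plain Python lists rather than model outputs. The function names and the list-of-vectors representation are assumptions made for the example; a real implementation would operate on batched tensors.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchors: Sequence[Vector], positives: Sequence[Vector],
             temperature: float = 0.05) -> float:
    """Mean InfoNCE loss over a batch.

    positives[i] is the positive for anchors[i]; every other positives[j]
    serves as an in-batch negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(z - top) for z in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Passing the two dropout views from one encoder gives the per-encoder contrastive loss; passing the first encoder's embeddings as `anchors` and the second encoder's as `positives` gives a loss of the same form between the two encoders.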
[0021] Furthermore, in step 3, the interaction constraint loss between the two encoders adopts the interaction constraint InfoNCE loss function, the expression of which is:
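[0022] $$\mathcal{L}_{\mathrm{ICNCE}} = -\log \frac{\exp\left(\mathrm{sim}(h_i^{\mathrm{I}}, h_i^{\mathrm{II}})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i^{\mathrm{I}}, h_j^{\mathrm{II}})/\tau\right)}$$
[0023] where $h_i^{\mathrm{I}}$ is the last-hidden-layer output embedding of the input sentence from the first encoder, $h_i^{\mathrm{II}}$ is the last-hidden-layer output embedding of the same sentence from the second encoder, and the sum runs over the $N$ sentences in the batch. This loss treats the last hidden state of encoder I and the corresponding state of encoder II as a positive sample pair, forcing the two encoders to align with each other at the representation level so that they produce consistent embedding representations for the same input, thereby enhancing the stability of the dual encoder output.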
[0024] Furthermore, the calculation method for the tensor norm constraint loss in step 3 is as follows:
[0025] ;
[0026] ;
[0027] where the symbols denote, in order, the output embeddings of the first sentence from the first encoder and the second encoder, the output embeddings of the second sentence from the first encoder and the second encoder, and the corresponding pooling-layer outputs. This tensor norm constraint loss simultaneously optimizes the directional consistency and the norm consistency of the embeddings output by the two encoders.
[0028] The first term of the loss function constrains direction, forcing the embedding vectors output by the two encoders to converge in direction; the second term constrains the norm, forcing the vector lengths of positive sample pairs to be equal. By jointly optimizing direction and norm, sentence embeddings can preserve semantic information more completely.
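The separate roles of a direction term and a norm term can be shown with a toy penalty. This is not the tensor norm constraint loss of the invention, whose exact expression is not restated here; the cosine-distance direction term, the squared norm-difference term, the `norm_weight` parameter, and the function name are all assumptions chosen only to illustrate that two vectors can agree in direction yet still be penalized for differing in length.

```python
import math
from typing import Sequence


def direction_and_norm_penalty(u: Sequence[float], v: Sequence[float],
                               norm_weight: float = 1.0) -> float:
    """Illustrative stand-in: (1 - cosine) + norm_weight * (|u| - |v|)^2."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Direction term: zero when the two vectors point the same way.
    direction_term = 1.0 - dot / (norm_u * norm_v)
    # Norm term: zero when the two vectors have the same length.
    norm_term = (norm_u - norm_v) ** 2
    return direction_term + norm_weight * norm_term
```

A purely cosine-based objective would score `[1, 0]` and `[3, 0]` as a perfect match; the added norm term is what distinguishes them.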
[0029] Furthermore, in step 8, steps 4 to 7 are iterated three times. In each iteration, the dual encoder and the cross encoder are trained alternately, and the same unsupervised sentence pair dataset is used in each iteration. In the first iteration, pseudo-labels are generated by the dual encoder trained in step 3 to train the cross encoder. In the second iteration, pseudo-labels are generated by the cross encoder trained in the first iteration to retrain the dual encoder. In the third iteration, pseudo-labels are generated again by the updated dual encoder to further optimize the cross encoder. When the number of iterations exceeds three, the performance of the dual encoder and the cross encoder tends to saturate or decline, and the iteration stops.
[0030] Furthermore, in step 5, the cross-encoder uses BERT or RoBERTa as the base pre-trained language model. It concatenates the first and second sentences of a sentence pair into an input sequence using a delimiter, and then performs joint encoding on the concatenated sequence through a self-attention mechanism, outputting a scalar score representing the semantic similarity between the two sentences. Preferably, the input sequence format is "[CLS]first sentence[SEP]second sentence[SEP]", and the output vector at the [CLS] position is mapped to a similarity score through a linear transformation layer.
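The input layout and the scoring head described above can be sketched as follows. Both function names are hypothetical, and the string concatenation only illustrates the layout: tokenizers for BERT-style models normally insert the special tokens themselves when given a sentence pair. The linear head is written as a plain dot product over an already computed [CLS] vector.

```python
from typing import Sequence


def build_cross_encoder_input(first: str, second: str) -> str:
    """Lay out a sentence pair as [CLS]first[SEP]second[SEP]."""
    return f"[CLS]{first}[SEP]{second}[SEP]"


def linear_score(cls_vector: Sequence[float], weights: Sequence[float],
                 bias: float = 0.0) -> float:
    """Map the [CLS] output vector to a scalar similarity score."""
    return sum(w * x for w, x in zip(weights, cls_vector)) + bias
```

Because both sentences share one input sequence, every self-attention layer can attend across the sentence boundary, which is what lets the cross-encoder model fine-grained interactions that a dual encoder cannot.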
[0031] Furthermore, in step 5, the training of the cross-encoder uses the binary cross-entropy loss function, the expression of which is:
[0032] $$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i^{\mathrm{bi}} \log \sigma\!\left(s_i^{\mathrm{cross}}\right) + \left(1 - y_i^{\mathrm{bi}}\right) \log\!\left(1 - \sigma\!\left(s_i^{\mathrm{cross}}\right)\right) \right]$$
[0033] where $N$ is the batch size, $\sigma$ is the sigmoid activation function, $s_i^{\mathrm{cross}}$ is the cross-encoder's predicted similarity score for the $i$-th sentence pair, and $y_i^{\mathrm{bi}}$ is the pseudo-similarity label generated by the dual encoder in step 4. Binary cross-entropy loss is used instead of mean squared error loss because it is more tolerant of numerical differences, which prevents the cross-encoder from overfitting to the pseudo-labels and maintains an appropriate gap between the student and teacher models, so that the iterative learning process can continue.
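A minimal sketch of binary cross-entropy with soft (non-binary) targets is given below. It assumes the pseudo-similarity labels have already been scaled into the range [0, 1], which the text does not state explicitly; the function names and the clamping constant `eps` are likewise illustrative.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_with_soft_labels(scores: Sequence[float], labels: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy between sigmoid(scores) and soft labels.

    Assumption: every label lies in [0, 1].
    """
    total = 0.0
    for score, label in zip(scores, labels):
        # Clamp the probability so that log() never receives zero.
        p = min(max(sigmoid(score), eps), 1.0 - eps)
        total -= label * math.log(p) + (1.0 - label) * math.log(1.0 - p)
    return total / len(scores)
```

With a soft label the loss is minimized when the predicted probability equals the label, but it never reaches zero, so the student is not driven to reproduce the teacher exactly.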
[0034] Furthermore, in step 7, the training of the dual encoders uses the mean squared error loss function, the expression of which is:
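[0035] $$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(s_i^{\mathrm{bi}} - y_i^{\mathrm{cross}}\right)^{2}$$
[0036] where $N$ is the batch size, $s_i^{\mathrm{bi}}$ is the dual encoder's predicted similarity score for the $i$-th sentence pair, and $y_i^{\mathrm{cross}}$ is the pseudo-similarity label generated by the cross-encoder in step 6.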
[0037] For distillation with dual encoders, since the dual encoders encode two sentences separately and then compare them, the model structure itself is not prone to overfitting to pseudo-labels. Therefore, using mean squared error loss can achieve more accurate knowledge transfer.
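The dual encoder's distillation objective can be illustrated on toy vectors: each sentence is encoded separately, the prediction is the cosine similarity of the two embeddings, and the loss is the mean squared error against the pseudo-labels. The function names and the list-based embeddings are assumptions made for this sketch.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def dual_encoder_mse(first_embeddings: Sequence[Vector],
                     second_embeddings: Sequence[Vector],
                     pseudo_labels: Sequence[float]) -> float:
    """Mean squared error between cosine predictions and pseudo-labels."""
    predictions = [cosine(a, b) for a, b in zip(first_embeddings, second_embeddings)]
    errors = [(p - y) ** 2 for p, y in zip(predictions, pseudo_labels)]
    return sum(errors) / len(errors)
```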
[0038] Furthermore, the unsupervised sentence pair dataset in step 1 includes at least: the SemEval STS 2012-2016 dataset, the STS-Benchmark dataset, the SICK-Relatedness dataset, and sentence pair data constructed by sampling from Wikipedia or a general corpus; wherein the amount of data sampled from Wikipedia or a general corpus is no less than 100,000 entries, used to enhance the model's generalization ability in different domains and semantic scenarios. Preferably, sentences are randomly extracted from Wikipedia, using consecutive sentence pairs as positive samples and sentences from different documents as negative samples to construct a large-scale unsupervised sentence pair dataset, providing sufficient training samples for norm-constrained learning.
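The pairing rule for general-corpus data (consecutive sentences as positives, sentences from different documents as negatives) can be sketched as follows. The function name, the one-negative-per-document sampling rate, and the assumption that documents arrive already split into sentences are illustrative choices, not details from the text.

```python
import random
from typing import List, Sequence, Tuple

Pair = Tuple[str, str]


def build_sentence_pairs(documents: Sequence[Sequence[str]],
                         rng: random.Random) -> Tuple[List[Pair], List[Pair]]:
    """Return (positive_pairs, negative_pairs) from sentence-split documents."""
    positives: List[Pair] = []
    negatives: List[Pair] = []
    for index, doc in enumerate(documents):
        # Positives: each sentence paired with the one that follows it.
        positives.extend(zip(doc, doc[1:]))
        # Negatives: one sentence from this document, one from a different document.
        others = [i for i in range(len(documents)) if i != index and documents[i]]
        if doc and others:
            other_doc = documents[rng.choice(others)]
            negatives.append((rng.choice(doc), rng.choice(other_doc)))
    return positives, negatives
```

Passing a seeded `random.Random` instance keeps the sampled negatives reproducible across runs.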
[0039] Compared with existing technologies, the unsupervised sentence representation learning method based on tensor norm constraints and self-distillation described in this invention has the following advantages:
[0040] 1. This invention introduces tensor norm constraints into unsupervised sentence representation learning, expanding the optimization objective from focusing solely on vector direction to simultaneously constraining both direction and norm. Traditional contrastive learning methods only constrain the angle of the embedded vectors through cosine similarity, neglecting the fine-grained information such as semantic richness and sentiment intensity implied by vector length. This invention, by designing a tensor norm loss function, forces the embedded vectors of positive sample pairs to converge in both direction and length, enabling sentence embeddings to more completely preserve semantic information and significantly improving the discriminative ability of the representation.
[0041] 2. This invention deeply integrates norm constraints with a self-distillation framework, achieving synergistic enhancement between the dual encoder and the cross encoder through iterative inter-distillation. After training with norm constraints, the dual encoder generates higher-quality pseudo-similarity labels, providing more accurate supervision signals for the cross encoder. The cross encoder utilizes a self-attention mechanism to capture fine-grained interactions between sentences, feeding back into the dual encoder after training and further improving its representation quality. Through multiple alternating iterations, the performance of both encoders improves simultaneously, ultimately resulting in an efficient and high-precision sentence representation model.
[0042] 3. This invention has been experimentally validated on multiple standard semantic text similarity benchmarks. On seven datasets (SemEval STS 2012-2016, STS-Benchmark, and SICK-Relatedness), the average Spearman correlation coefficient of this invention reaches 79.87%, outperforming most existing unsupervised methods. Ablation experiments confirm that after introducing norm constraints, the cross-encoder's performance continuously improves during iterative training, while the baseline model using only direction constraints shows a performance decline after three iterations. This demonstrates that the method of this invention suppresses overfitting of the cross-encoder to pseudo-labels through norm constraints, improving the stability and effectiveness of the distillation process.
Attached Figure Description
[0043] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an undue limitation of the invention. In the drawings:
[0044] Figure 1 is a schematic diagram of the overall framework of the method of the present invention;
[0045] Figure 2 is a schematic diagram of the norm constraint structure inside the dual encoder in this invention;
[0046] Figure 3 is a schematic diagram of the iterative self-distillation process of the dual encoder and cross-encoder in this invention;
[0047] Figure 4 is a graph showing the performance variation of the present invention under different training data volumes;
[0048] Figure 5 is a comparison chart of the ablation performance of the norm-constraint technique in this invention;
[0049] Figure 6 is a graph showing the impact of norm constraints on the performance of the cross-encoder on the STS-B and SICK-R validation sets in this invention;
[0050] Figure 7 is a performance comparison chart of the dual encoder in this invention and the baseline model at different numbers of cycles.
Detailed Implementation
[0051] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other.
[0052] The present invention will now be described in detail with reference to the accompanying drawings and embodiments.
[0053] Embodiment 1:
[0054] This embodiment provides an unsupervised sentence representation learning method based on tensor norm constraints and self-distillation, which specifically includes the following steps.
[0055] Step 1: Obtain the unsupervised sentence pair dataset. This embodiment obtains all sentence pairs from the SemEval STS 2012-2016 dataset, the STS-Benchmark dataset, and the SICK-Relatedness dataset. After the manually annotated similarity labels are removed, approximately 37,000 unlabeled sentence pairs are obtained. In addition, 1 million sentences are randomly sampled from English Wikipedia, and additional unsupervised sentence pair data is constructed by sampling consecutive sentence pairs, to enhance the model's generalization ability across different semantic scenarios.
[0056] Step 2: Construct a dual encoder. The overall framework is shown in Figure 1. The dual encoder consists of a first encoder and a second encoder, both using BERT-based models as their foundation. They have identical structures but are trained independently. The first encoder is pre-trained and initialized using the SimCSE approach, which generates positive sample pairs using dropout within an unsupervised contrastive learning framework and fine-tunes the model with the InfoNCE loss, giving it an initial ability to discriminate directions. The second encoder is pre-trained and initialized using back-translation enhancement. Specifically, the original sentence is translated into Chinese, French, and Spanish, and then back into English, generating semantically similar but expressively diverse sentence variants for pre-training the encoder. These two different initialization strategies allow the two encoders to generate diverse initial embedding representations in the early stages of training, providing differentiated initial states for subsequent norm constraint learning.
[0057] Step 3: Jointly optimize the dual encoder. Input each sentence from the sentence pair into the dual encoder separately, and generate positive sample pairs using dropout. The contrastive loss between positive sample pairs within each encoder uses the InfoNCE loss function, whose expression is:
[0058] $$\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right)}$$
[0059] where $h_i$ and $h_i^{+}$ are the embedding representations of the positive sample pair generated by the same encoder for the same sentence using different dropout masks; all samples in the current batch of $N$ sentences other than the positive sample are treated as negative samples; $\mathrm{sim}$ is the cosine similarity function; and $\tau$ is the temperature coefficient, with a value of 0.05.
[0060] The interaction constraint loss between the two encoders adopts the interaction constraint InfoNCE loss function, the expression of which is:
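[0061] $$\mathcal{L}_{\mathrm{ICNCE}} = -\log \frac{\exp\left(\mathrm{sim}(h_i^{\mathrm{I}}, h_i^{\mathrm{II}})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i^{\mathrm{I}}, h_j^{\mathrm{II}})/\tau\right)}$$
[0062] where $h_i^{\mathrm{I}}$ is the last-hidden-layer output embedding of the input sentence from the first encoder, $h_i^{\mathrm{II}}$ is the last-hidden-layer output embedding of the same sentence from the second encoder, and the sum runs over the $N$ sentences in the batch. This loss forces the two encoders to align with each other at the representation level.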
[0063] The method for calculating the tensor norm constraint loss is as follows:
[0064]
[0065]
[0066] where the symbols denote, in order, the output embeddings of the first sentence from the first encoder and the second encoder, the output embeddings of the second sentence from the first encoder and the second encoder, and the corresponding pooling-layer outputs. This tensor norm constraint loss simultaneously optimizes the directional consistency and the norm consistency of the embeddings output by the two encoders. The norm constraint structure inside the dual encoder is shown in Figure 2. In this embodiment, the total loss function is the sum of the three losses described above. Joint optimization is performed using the AdamW optimizer with a batch size of 32 and a learning rate of 2e-5 to obtain an initial sentence representation aligned in both direction and norm.
[0067] Step 4: Use the dual encoder trained in Step 3 to generate pseudo-similarity labels for each sentence pair in the unsupervised sentence pair dataset. For any sentence pair, each sentence is input into both encoders to obtain the output embeddings of the two encoders. For each sentence, the embeddings from the two encoders are summed, and the cosine similarity between the two summed embeddings is used as the pseudo-similarity score of the sentence pair.
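The pseudo-label computation in Step 4 can be sketched on toy vectors. The function name and argument layout are assumptions; the four inputs stand for the embeddings that the first and second encoders produce for each of the two sentences.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def pseudo_similarity(first_enc1: Vector, first_enc2: Vector,
                      second_enc1: Vector, second_enc2: Vector) -> float:
    """Sum each sentence's two encoder embeddings, then take the cosine."""
    first: List[float] = [a + b for a, b in zip(first_enc1, first_enc2)]
    second: List[float] = [a + b for a, b in zip(second_enc1, second_enc2)]
    dot = sum(a * b for a, b in zip(first, second))
    norm_first = math.sqrt(sum(a * a for a in first))
    norm_second = math.sqrt(sum(b * b for b in second))
    return dot / (norm_first * norm_second)
```

Summing before taking the cosine means an encoder that outputs a longer vector for a sentence contributes more to the combined direction, so the embedding norms influence the pseudo-label rather than being discarded.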
[0068] Step 5: Construct the cross-encoder. The cross-encoder uses SimCSE-BERT-base as the base pre-trained language model. It concatenates the first and second sentences of a sentence pair using a delimiter to form the input sequence, formatted as "[CLS]first sentence[SEP]second sentence[SEP]". A self-attention mechanism is used to jointly encode the concatenated sequence, and the output vector at the [CLS] position is mapped to a similarity score through a linear transformation layer. The pseudo-similarity labels generated in Step 4 are used as the supervision signal for training, employing the binary cross-entropy loss function.
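[0069] $$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i^{\mathrm{bi}} \log \sigma\!\left(s_i^{\mathrm{cross}}\right) + \left(1 - y_i^{\mathrm{bi}}\right) \log\!\left(1 - \sigma\!\left(s_i^{\mathrm{cross}}\right)\right) \right]$$
[0070] where $N$ is the batch size, $\sigma$ is the sigmoid activation function, $s_i^{\mathrm{cross}}$ is the cross-encoder's predicted similarity score for the $i$-th sentence pair, and $y_i^{\mathrm{bi}}$ is the pseudo-similarity label generated by the dual encoder in Step 4. In this embodiment, the cross-encoder is trained for 1 epoch with a learning rate of 5e-5.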
[0071] Step 6: Using the cross-encoder trained in Step 5, generate pseudo-similarity labels for each sentence pair in the unsupervised sentence pair dataset. Concatenate each sentence pair and input it into the cross-encoder, then apply the linear transformation to the output at the [CLS] position to obtain the similarity score, which is used as the new pseudo-label. The cyclic distillation logic of the dual encoder and cross-encoder is shown in Figure 3.
[0072] Step 7: Use the pseudo-similarity labels generated in Step 6 to retrain the dual encoder. The training of the dual encoder uses the mean squared error loss function:
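[0073] $$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\left(s_i^{\mathrm{bi}} - y_i^{\mathrm{cross}}\right)^{2}$$
[0074] where $N$ is the batch size, $s_i^{\mathrm{bi}}$ is the dual encoder's predicted similarity score for the $i$-th sentence pair, and $y_i^{\mathrm{cross}}$ is the pseudo-similarity label generated by the cross-encoder in Step 6. In this embodiment, the dual encoder is trained for 10 epochs, with the learning rate restored to 2e-5.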
[0075] Step 8: Repeat steps 4 through 7 a total of 3 times. In the first iteration, pseudo-labels are generated from the dual encoder trained in step 3 to train the cross encoder. In the second iteration, pseudo-labels are generated from the cross encoder trained in the first iteration to retrain the dual encoder. In the third iteration, pseudo-labels are again generated from the updated dual encoder to further optimize the cross encoder. After three iterations, the performance of the dual encoder and the cross encoder tends to be optimal, and the iteration stops, yielding the final dual encoder and cross encoder.
[0076] To verify the effectiveness of the proposed method in this embodiment, experiments were conducted on seven standard semantic text similarity datasets: SemEval STS 2012-2016 (STS12-16), STS-Benchmark (STS-B), and SICK-Relatedness (SICK-R). These datasets cover various text types, including news text, online forums, and image descriptions, with similarity labels continuously distributed from 0 to 5, enabling a comprehensive evaluation of the method's performance in different semantic scenarios. The Spearman rank correlation coefficient was used as the evaluation metric, which measures the consistency between the similarity ranking predicted by the model and the manually labeled ranking, with a value ranging from -1 to 1; a higher value indicates stronger semantic understanding. In this embodiment, the trained dual encoder is denoted as Ours(bi), and the cross encoder is denoted as Ours(cross), and compared with existing mainstream unsupervised methods. The comparison methods include SimCSE, ESimCSE, DiffCSE, Trans-Encoder, SNCSE, RankEncoder, and TNCSE, as shown in Table 1.
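The evaluation metric can be made concrete with a small standard-library implementation of the Spearman rank correlation (Pearson correlation computed on average ranks, so ties are handled). The function names are illustrative; the scores reported in this document correspond to this coefficient multiplied by 100.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def spearman(predicted: Sequence[float], gold: Sequence[float]) -> float:
    """Spearman rank correlation between model scores and human labels."""
    return pearson(average_ranks(predicted), average_ranks(gold))
```

Because only ranks matter, the model's cosine scores in [-1, 1] can be compared directly with human labels on the 0 to 5 scale without rescaling.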
[0077] Table 1
[0078]
[0079] As shown in Table 1, the dual encoder Ours(bi) trained in this embodiment achieves an average Spearman correlation coefficient of 79.52% across the seven datasets, and the cross-encoder Ours(cross) achieves 79.87%, both outperforming the existing comparison methods. The most significant improvements are observed on the STS15 and STS-B tasks, reaching 85.11% and 82.61% respectively, which are 0.14 and 0.89 percentage points higher than TNCSE. The experimental results demonstrate that optimizing the direction and norm alignment of the dual encoder through tensor norm constraints generates higher-quality pseudo-labels, thereby enhancing the fine-grained semantic modeling capability of the cross-encoder.
[0080] Embodiment 2:
[0081] This embodiment, building upon Embodiment 1, explores the impact of training data size on model performance. The experiments used sentence pairs from the SemEval STS 2012-2016, STS-Benchmark, and SICK-Relatedness datasets as in-domain training data, and sentence pairs sampled from English Wikipedia as additional training data. All experiments followed the steps in Embodiment 1, performing three rounds of cyclic distillation training and evaluating the Spearman correlation coefficient of the dual encoder on the STS-Benchmark development set. To examine the impact of data size, this embodiment included several comparative experiments with different data volumes. The first group used only 15,000 sentence pairs randomly selected from the STS datasets as training data. The second group used all STS in-domain data, totaling approximately 37,000 sentence pairs. The third group added 10,000 sentence pairs sampled from Wiki1M to all of the STS data. The fourth group added 30,000 Wiki1M samples to all of the STS data, and the amount was increased in this way until 200,000 Wiki1M samples were added, for a total of eight groups of experiments. As shown in Figure 4, model performance increases with the amount of training data, but the improvement saturates or even slightly reverses once the data volume exceeds 100k. The reasons are as follows. Out-of-domain data provides rich language expression patterns, but its distribution differs from that of the downstream tasks, and too many irrelevant samples may introduce noise that interferes with the model's focus on the features of the target task. Furthermore, the Wiki1M dataset contains a large number of truncated sentences, meaningless symbols, and repetitive content, so its quality is lower than that of the STS datasets. Therefore, simply increasing the amount of data yields limited improvement.
In this embodiment, some training data is generated through a back-translation augmentation strategy, that is, using multilingual round-trip translation to construct semantically equivalent but expressively diverse sentence variants, further enriching the diversity of the training data.
[0082] Embodiment 3:
[0083] This embodiment, based on Embodiment 1, sets up an ablation experiment that compares the baseline model without norm constraints against the complete model with norm constraints, for both the overall model and the cross-encoder. The experiments were conducted on STS-all, STS-B, and SICK-R, and the changes in the Spearman correlation coefficients of the overall model and the cross-encoder were recorded over three rounds of cyclic distillation. As shown in Figure 5, on STS-all, the dual encoder and cross-encoder of the model with norm constraints have higher Spearman correlation coefficients than the baseline model in all three rounds. As shown in Figure 6, on the STS-B dataset, the Spearman correlation coefficients of the baseline model for the three rounds of distillation were 80.62%, 81.11%, and 81.14%, respectively, with performance saturating after the second round and showing almost no improvement in the third round. In contrast, the model with norm constraints achieved Spearman correlation coefficients of 81.72%, 82.79%, and 82.61% for the three rounds, peaking in the second round and then slightly declining, but still outperforming the baseline model in every round, by 1.10, 1.68, and 1.47 percentage points respectively. On the SICK-R dataset, the baseline model achieved Spearman correlation coefficients of 70.97%, 71.88%, and 71.80% for the three rounds, with slow performance improvement. The model with norm constraints achieved 71.28%, 72.28%, and 73.05% for the three rounds, peaking in the third round, where it exceeds the baseline model by 1.25 percentage points. The experimental results indicate that the introduction of tensor norm constraints enables the cross-encoder to keep gaining from multiple rounds of distillation, while the performance of the baseline model without norm constraints stagnates or even declines after the second round.
This is because the norm constraint optimizes both the directional consistency and norm consistency of the sentence representation, enabling the pseudo-labels generated by the dual encoder to carry richer semantic strength information. This provides a more accurate supervision signal for the cross encoder, thereby effectively suppressing the tendency of the cross encoder to overfit pseudo-labels in multiple rounds of distillation and improving the stability of the distillation process and the model's generalization ability.
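The per-round gaps between the two models can be recomputed directly from the Spearman values reported above. The following is a minimal sketch; the dictionary layout and the function name `per_round_gaps` are illustrative choices, not part of the described method.

```python
# Spearman correlations (%) reported in the ablation for rounds 1, 2 and 3.
reported = {
    "STS-B": {"baseline": [80.62, 81.11, 81.14], "norm": [81.72, 82.79, 82.61]},
    "SICK-R": {"baseline": [70.97, 71.88, 71.80], "norm": [71.28, 72.28, 73.05]},
}


def per_round_gaps(scores):
    """Norm-constrained minus baseline, in percentage points, for each round."""
    return [round(n - b, 2) for b, n in zip(scores["baseline"], scores["norm"])]


gaps = {name: per_round_gaps(scores) for name, scores in reported.items()}
for name, values in gaps.items():
    print(name, values)
```

With these numbers the third-round gaps are 1.47 points on STS-B and 1.25 points on SICK-R, and the model with norm constraints is ahead in every round on both datasets.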
[0084] Example 4:
[0085] This embodiment, based on Embodiment 1, explores the impact of the number of cyclic distillation rounds on model performance. All hyperparameters were kept identical to those in Embodiment 1, except for the number of iterations in step 8. Training was conducted with 1, 2, 3, 4, and 5 cyclic rounds respectively, and the Spearman correlation coefficient of the dual encoder was evaluated on the STS-Benchmark development set. As shown in Figure 7, the performance of the dual encoder increases steadily from round 1 to round 3, peaking in the third round, and then declines when the number of rounds is increased to 4 and 5. This is because three rounds of cyclic distillation fully extract the semantic information in the data, enabling co-evolution of the dual encoder and the cross-encoder. Beyond three rounds, the pseudo-label errors accumulated over successive iterations amplify the small biases introduced in the initial stage, leading to overfitting and impaired generalization ability. This embodiment demonstrates that setting the number of cyclic distillation rounds to 3 achieves optimal performance in the method described in this invention.
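The round-selection procedure described above can be sketched as follows: compute the Spearman correlation on a development set after each round and keep the round with the highest value. This is a minimal standard-library sketch; the helper names and the per-round scores in `hypothetical_dev_scores` are invented for illustration, since the text reports only the shape of the curve in Figure 7.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(predicted, gold):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(ranks(predicted), ranks(gold))


def best_round(dev_scores):
    """1-based index of the round with the highest score (earliest on ties)."""
    return max(range(len(dev_scores)), key=lambda i: dev_scores[i]) + 1


# Hypothetical dev-set Spearman values for rounds 1..5 (not from the source).
hypothetical_dev_scores = [80.1, 81.0, 81.6, 81.3, 80.9]
chosen = best_round(hypothetical_dev_scores)
```

Selecting the round on a development set, rather than on the test benchmarks, keeps the choice of 3 rounds from leaking test information.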
[0086] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. An unsupervised sentence representation learning method based on tensor norm constraints and self-distillation, characterized in that it includes the following steps:
Step 1: Obtain an unsupervised sentence pair dataset, which contains multiple sentence pairs, each consisting of a first sentence and a second sentence;
Step 2: Construct a dual encoder, which includes a first encoder and a second encoder, the first encoder and the second encoder being two encoders with the same structure but trained independently with different parameters;
Step 3: Input each sentence of the sentence pair into the dual encoder, generate positive sample pairs using dropout, calculate the contrastive loss between positive sample pairs within each encoder and the interaction constraint loss between the two encoders, and introduce a tensor norm constraint loss to jointly optimize the dual encoder, so as to obtain initial sentence representations that are aligned in both direction and norm;
Step 4: Use the dual encoder trained in Step 3 to generate pseudo-similarity labels for each sentence pair in the unsupervised sentence pair dataset, which serve as the training target for the cross encoder;
Step 5: Construct a cross encoder, which concatenates the two sentences of a sentence pair and inputs them into a pre-trained language model, captures fine-grained interactions between the sentences through a self-attention mechanism, and is trained with the pseudo-similarity labels generated in Step 4 as supervision signals to obtain an optimized cross encoder;
Step 6: Use the cross encoder trained in Step 5 to generate pseudo-similarity labels for each sentence pair in the unsupervised sentence pair dataset, which serve as the training target for the dual encoder;
Step 7: Use the pseudo-similarity labels generated in Step 6 to retrain the dual encoder and update the parameters of the dual encoder;
Step 8: Repeat Steps 4 to 7 multiple times until the performance of the dual encoder and the cross encoder converges, and obtain the final dual encoder and cross encoder.
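The alternation in Steps 4 to 8 can be sketched as a simple loop. The sketch below shows only the control flow: `ToyScorer` is a hypothetical stand-in that memorises one score per pair instead of training a neural encoder, `jaccard` is an arbitrary initial similarity, and a fixed round count replaces the convergence check. None of these names come from the source.

```python
def jaccard(a, b):
    """Word-overlap similarity, used here only as a toy starting point."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)


class ToyScorer:
    """Hypothetical stand-in for an encoder: stores one score per sentence pair."""

    def __init__(self, init):
        self.init = init   # function (a, b) -> initial score in [0, 1]
        self.table = {}

    def score(self, a, b):
        return self.table.get((a, b), self.init(a, b))

    def fit(self, pairs, labels, lr=0.5):
        # Move each stored score part of the way toward its pseudo-label.
        for (a, b), y in zip(pairs, labels):
            s = self.score(a, b)
            self.table[(a, b)] = s + lr * (y - s)


def cyclic_distillation(pairs, dual, cross, rounds=3):
    for _ in range(rounds):
        labels = [dual.score(a, b) for a, b in pairs]    # Step 4
        cross.fit(pairs, labels)                         # Step 5
        labels = [cross.score(a, b) for a, b in pairs]   # Step 6
        dual.fit(pairs, labels)                          # Step 7
    return dual, cross                                   # Step 8 (fixed rounds)


def max_gap(pairs, dual, cross):
    return max(abs(dual.score(a, b) - cross.score(a, b)) for a, b in pairs)


pairs = [("a cat sits", "a cat sat"), ("a cat sits", "stocks fell")]
dual = ToyScorer(jaccard)
cross = ToyScorer(lambda a, b: 0.5)
gap_before = max_gap(pairs, dual, cross)
cyclic_distillation(pairs, dual, cross, rounds=3)
gap_after = max_gap(pairs, dual, cross)
```

In this toy setting the two scorers simply converge toward each other; in the claimed method each `fit` call would be a full training pass of the corresponding encoder on the same unsupervised sentence pair dataset.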
2. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation as described in claim 1, characterized in that: In step 2, the first encoder and the second encoder adopt different initialization strategies: the first encoder is initialized from SimCSE pre-training, while the second encoder is initialized from pre-training with back-translation augmentation, so that the two encoders generate diverse initial embedding representations in the early stage of training.
3. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation as described in claim 1, characterized in that: In step 3, the contrastive loss between positive sample pairs within each encoder is calculated using the InfoNCE loss function, which is expressed as follows: $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \dfrac{\exp\left(\mathrm{sim}(h_i, h_i^{+})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i, h_j^{+})/\tau\right)}$; where $h_i$ and $h_i^{+}$ are the embedding representations of a positive sample pair generated by the same encoder for the same sentence using different dropout masks; all samples in the current batch other than the positive sample are treated as negative samples; $N$ is the batch size; sim is the cosine similarity function; and $\tau$ is the temperature coefficient, with a value between 0.05 and 0.1.
4. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation as described in claim 3, characterized in that: In step 3, the interaction constraint loss between the two encoders adopts the interaction constraint InfoNCE loss function, the expression of which is: $\mathcal{L}_{\mathrm{inter}} = -\log \dfrac{\exp\left(\mathrm{sim}(h_i^{\mathrm{I}}, h_i^{\mathrm{II}})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(h_i^{\mathrm{I}}, h_j^{\mathrm{II}})/\tau\right)}$; where $h_i^{\mathrm{I}}$ is the last-hidden-layer output of the first encoder for the input sentence, $h_i^{\mathrm{II}}$ is the last-hidden-layer output of the second encoder for the same sentence, sim is the cosine similarity function, $\tau$ is the temperature coefficient, and $N$ is the batch size; this loss treats the last hidden state of encoder I and the corresponding state of encoder II as a positive sample pair, forcing the two encoders to align with each other at the representation level.
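The two contrastive terms in claims 3 and 4 share one computation: each anchor embedding is scored against every candidate in the batch with temperature-scaled cosine similarity, and the matching candidate is the positive. The following is a minimal standard-library sketch, assuming the in-batch convention in which the other rows' positives act as negatives; the function names and the toy 2-dimensional vectors are illustrative, not from the source.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchors, positives, tau=0.05):
    """Mean InfoNCE over a batch; positives[i] is the positive for anchors[i]."""
    total = 0.0
    for i, h in enumerate(anchors):
        logits = [cosine(h, p) / tau for p in positives]
        m = max(logits)  # log-sum-exp shift for numerical stability
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Intra-encoder use (claim 3): two dropout views from the same encoder.
view_a = [[1.0, 0.0], [0.0, 1.0]]
view_b = [[0.9, 0.1], [0.1, 0.9]]
loss_aligned = info_nce(view_a, view_b, tau=0.05)

# Inter-encoder use (claim 4): pass encoder I outputs as anchors and
# encoder II outputs as positives. Swapping rows simulates misalignment.
loss_misaligned = info_nce(view_a, list(reversed(view_b)), tau=0.05)
```

The loss is small when each anchor is closest to its own positive and grows when the pairing is wrong, which is what pushes matching representations together.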
5. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation according to claim 4, characterized in that: In step 3, the tensor norm constraint loss is calculated as follows: [formulas not reproduced]; where the symbols denote, respectively, the output embeddings of the first sentence from the first encoder and the second encoder, the output embeddings of the second sentence from the first encoder and the second encoder, and the corresponding pooling layer outputs; this tensor norm constraint loss simultaneously optimizes the directional consistency and norm consistency of the embeddings output by the two encoders.
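The formulas of the tensor norm constraint loss are not legible in this text, so the sketch below does not attempt to reproduce them. It only illustrates the motivation stated in the claim: cosine similarity sees direction but not norm, so a separate norm term is needed to penalise embeddings that point the same way with different magnitudes. The combined form `illustrative_loss` and its weight `lam` are assumptions made for this example.

```python
import math


def norm(u):
    return math.sqrt(sum(a * a for a in u))


def cosine(u, v):
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def direction_and_norm_terms(u, v):
    """Return (direction mismatch, squared norm mismatch) for two embeddings."""
    return 1.0 - cosine(u, v), (norm(u) - norm(v)) ** 2


def illustrative_loss(u, v, lam=0.1):
    # Assumed combination for illustration only, not the claimed formula.
    direction_term, norm_term = direction_and_norm_terms(u, v)
    return direction_term + lam * norm_term


# Same direction, different magnitude: invisible to cosine similarity alone.
u = [1.0, 2.0]
v = [3.0, 6.0]
direction_term, norm_term = direction_and_norm_terms(u, v)
loss = illustrative_loss(u, v)
```

Here the direction term is zero while the norm term is not, so a purely cosine-based objective would treat the two embeddings as identical.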
6. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation according to claim 1, characterized in that: In step 8, steps 4 to 7 are iterated 3 times. In each iteration, the dual encoder and the cross encoder are trained alternately, and the same unsupervised sentence pair dataset is used in each iteration. When the number of iterations exceeds 3, the performance of the dual encoder and the cross encoder tends to saturate or decline, so the iteration stops after 3 rounds.
7. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation according to claim 1, characterized in that: In step 5, the cross encoder uses BERT or RoBERTa as the basic pre-trained language model, concatenates the first and second sentences of the sentence pair into an input sequence using a delimiter, and performs joint encoding on the concatenated sequence through a self-attention mechanism to output a scalar score representing the semantic similarity between the two sentences.
8. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation according to claim 1, characterized in that: In step 5, the training of the cross encoder uses the binary cross-entropy loss function, the expression of which is: $\mathcal{L}_{\mathrm{BCE}} = -\dfrac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \sigma(\hat{s}_i) + (1 - y_i) \log\left(1 - \sigma(\hat{s}_i)\right) \right]$; where $N$ is the batch size, $\sigma$ is the sigmoid activation function, $\hat{s}_i$ is the similarity score predicted by the cross encoder for the $i$-th sentence pair, and $y_i$ is the pseudo-similarity label generated by the dual encoder in step 4.
9. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation according to claim 1, characterized in that: In step 7, the training of the dual encoder uses the mean squared error loss function, the expression of which is: $\mathcal{L}_{\mathrm{MSE}} = \dfrac{1}{N} \sum_{i=1}^{N} \left( \hat{s}_i - y_i \right)^2$; where $N$ is the batch size, $\hat{s}_i$ is the similarity score predicted by the dual encoder for the $i$-th sentence pair, and $y_i$ is the pseudo-similarity label generated by the cross encoder in step 6.
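The two distillation losses named in claims 8 and 9 can be sketched in a few lines. This is a minimal standard-library version, assuming the cross encoder outputs a raw score that is passed through a sigmoid and that pseudo-labels lie in [0, 1]; the function names and the clamping constant `eps` are illustrative choices.

```python
import math


def sigmoid(x):
    # Branching keeps exp() from overflowing for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_loss(raw_scores, pseudo_labels, eps=1e-12):
    """Binary cross-entropy with soft targets (cross encoder, claim 8)."""
    total = 0.0
    for s, y in zip(raw_scores, pseudo_labels):
        p = min(max(sigmoid(s), eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    return total / len(raw_scores)


def mse_loss(predicted, pseudo_labels):
    """Mean squared error against pseudo-labels (dual encoder, claim 9)."""
    return sum((p - y) ** 2 for p, y in zip(predicted, pseudo_labels)) / len(predicted)


cross_loss = bce_loss([0.0], [0.5])           # sigmoid(0) = 0.5, so loss = ln 2
dual_loss = mse_loss([0.5, 1.0], [0.0, 1.0])  # (0.25 + 0.0) / 2 = 0.125
```

Because the pseudo-labels are continuous similarities rather than hard 0/1 classes, the binary cross-entropy here acts as a soft-target loss.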
10. The unsupervised sentence representation learning method based on tensor norm constraints and self-distillation according to claim 1, characterized in that: In step 1, the unsupervised sentence pair dataset includes at least: the SemEval STS 2012-2016 dataset, the STS-Benchmark dataset, the SICK-Relatedness dataset, and sentence pair data constructed by sampling from Wikipedia or a general corpus; wherein the amount of data sampled from Wikipedia or a general corpus is no less than 100,000, which is used to enhance the model's generalization ability in different domains and semantic scenarios.