A semantic similarity calculation method based on a twin mini-BERT model
By employing weight sharing and hybrid pooling strategies in the twin mini-BERT model, the high resource consumption of the BERT model is addressed, achieving efficient semantic similarity calculation and making it suitable for unsupervised learning tasks in the field of natural language processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HOHAI UNIV
- Filing Date
- 2024-12-12
- Publication Date
- 2026-06-12
AI Technical Summary
Existing BERT models consume high resources and require additional post-processing steps in text similarity calculation tasks, making it difficult to efficiently extract sentence-level semantic similarity.
We employ a twin mini-BERT model, compress the model using weight sharing, and design a twin network structure by combining a hybrid pooling strategy and feature vector concatenation. We then use deep semantic information to calculate semantic similarity.
It significantly reduces computational resource consumption while improving the efficiency and accuracy of semantic similarity calculation, making it suitable for unsupervised learning tasks.
Figure CN122196718A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to a semantic similarity calculation method based on a twin mini-BERT model, belonging to the field of natural language processing.
Background Technology
[0002] Semantic similarity calculation is one of the core problems in the field of natural language processing (NLP) and a key research direction in text analysis and understanding. In early NLP techniques, text similarity assessment typically relied on simple affix matching and word frequency statistics, lacking the ability to deeply understand semantics. This resulted in limited accuracy in similarity assessment, making it difficult to meet the needs of complex application scenarios. In recent years, with the development of deep learning technology, especially the widespread application of artificial neural networks, computers' ability to understand text semantics has significantly improved, demonstrating broad application potential in multiple fields such as intelligent question answering, semantic retrieval, and text generation. Methods for semantic similarity calculation are also continuously advancing and improving.
[0003] The development of semantic similarity algorithms has gone through several stages. Early semantic similarity algorithms mainly relied on word-based similarity calculations, such as edit distance and the Jaccard similarity coefficient. However, these methods ignored semantic information and relied solely on string or set operations, making it difficult to understand the context. Subsequently, statistical models gradually emerged, with algorithms such as TF-IDF and cosine similarity using word frequency information to measure similarity. While this solved the semantic understanding problem to some extent, it lacked modeling of deep semantic relationships between words. With the advent of neural networks, word embedding techniques such as Word2Vec and GloVe represented words as vectors, extending semantic similarity calculation from the word level to the phrase and sentence level. In recent years, deep learning-based pre-trained models have shown superior performance in natural language processing, significantly accelerating the development of related technologies. Large-scale models, represented by BERT, are particularly adept at encoding complex semantic relationships between sentences. Their bidirectional Transformer architecture considers the context on both sides of each word simultaneously, a design that performs exceptionally well in tasks involving semantic information. By performing unsupervised pre-training on massive corpora, the BERT model learns rich contextual relationships and grammatical structures, effectively capturing deep semantic relationships between sentences. However, BERT's outstanding performance comes at the cost of extremely high resource consumption. During the pre-training phase, BERT requires extensive training on large amounts of unlabeled text data, often taking days or even weeks to complete, placing extremely high demands on computational resources. Furthermore, the traditional BERT model is not designed for sentence-level similarity tasks, requiring additional post-processing steps to extract sentence embeddings.
[0004] To address these issues, a semantic similarity calculation method based on a Siamese mini-BERT model is proposed, which optimizes the bidirectional encoder and constructs a Siamese network architecture. This method extracts as much semantic information as possible while significantly reducing computational resource consumption and supporting unsupervised learning tasks.
Summary of the Invention
[0005] This invention addresses the limitations of the BERT model in text similarity calculation tasks by proposing a semantic similarity calculation method based on a Siamese mini-BERT model. This method compresses the BERT model through weight sharing, constructs a Siamese network structure based on it, and designs a hybrid pooling strategy and a combination of specific feature vectors for concatenation. This approach fully utilizes the deep semantic information of the text while significantly improving the efficiency of semantic similarity calculation.
[0006] The technical implementation steps of this invention are as follows:
[0007] (1) Token encoding of text data: The input sentence is segmented by a word segmentation algorithm, and special tokens are added to mark the start of the sentence (used as the aggregate representation for the classification task) and its end. The sentence is also padded to a fixed length so that the model can handle inputs of uniform length.
[0008] (2) Absolute position encoding: For the token sequence after word segmentation, the model will add absolute position encoding, and calculate the encoding value of each position to help the model capture the position information of the token in the sentence.
[0009] (3) Mini BERT network extracts semantic features of text: The mini BERT model uses a multi-layer Transformer network to extract semantic features of text. It generates queries, keys and values through an attention mechanism, and computes attention in parallel with multiple heads. Finally, it processes the output through residual connections and layer normalization to further refine semantic information.
[0010] (4) Hybrid pooling dimensionality reduction: The extracted semantic features are processed by hybrid pooling to reduce dimensionality and enrich feature representation. Hybrid pooling is performed on each feature dimension to generate a hybrid feature vector with higher semantic generalization ability.
[0011] (5) Cosine distance calculation of similarity (inference stage): In the inference stage, the model uses cosine similarity to calculate the similarity between different texts. This process measures the semantic similarity of texts by comparing the angular relationship between feature vectors.
[0012] (6) Feature vector concatenation (non-inference stage): During the training stage, multiple feature vectors are concatenated together to construct a synthetic feature vector for classification tasks, thereby improving the model's discriminative ability.
[0013] (7) Classification task training: The concatenated feature vector will be input into the fully connected layer to perform a binary classification task. The class probability will be output through the Softmax activation function, and the cross-entropy loss function will be used to optimize the model parameters and improve the classification performance.
[0014] The specific process of word segmentation and token encoding involved in step (1) is as follows:
[0015] 1.1 The input sentence is segmented using the WordPiece algorithm.
[0016] 1.2 Add a [CLS] token at the beginning of the sentence to serve as the aggregate representation for the classification task.
[0017] 1.3 Add a [SEP] token to the end of the sentence to separate sentences and indicate the end.
[0018] 1.4 Use [PAD] tokens to fill sentences, making the length uniform to 512 tokens.
[0019] The results of step (1) are used to perform position encoding, and the encoding value of each position and dimension is calculated using absolute position encoding.
[0020] The final encoded data output from step (2) is input into the mini BERT network to extract text semantic features, as follows:
[0021] 3.1 Generate queries, keys, and values, and calculate attention scores.
[0022] 3.2 This is expanded to multiple attention heads, each of which calculates independently and the results are then concatenated.
[0023] 3.3 Multi-head attention output is processed through residual connections and layer normalization.
[0024] 3.4 The result passes through a feedforward neural network (FFN) and undergoes residual connections and layer normalization again.
[0025] 3.5 All parameters are shared across layers, and the shared layer is applied 12 times in a loop.
[0026] The dimension compression is achieved by performing a hybrid pooling operation on the output feature vector in step (3), as follows:
[0027] 4.1 Perform average pooling and max pooling on each feature dimension.
[0028] 4.2 The results of average pooling and max pooling are concatenated to obtain the hybrid pooling feature vector.
[0029] For the compressed feature vector output in step (4), calculate the cosine distance as a similarity index (the inference stage jumps directly to this step).
[0030] The compressed feature vectors output in step (4) are concatenated to construct a training feature vector for classification training (jump to this step if not in the inference stage).
[0031] During the training phase, the concatenated feature vector result from step (6) is input into the fully connected layer for classification training, as follows:
[0032] 7.1 Input the concatenated feature vector into a fully connected layer for binary classification.
[0033] 7.2 Use the Softmax activation function to output the prediction results.
[0034] 7.3 The cross-entropy loss function is used to optimize the model.
[0035] The beneficial effects of this invention are as follows:
[0036] The semantic similarity calculation method based on the twin mini-BERT model takes into account the deep semantic relationships behind the text information. It can significantly reduce the computational cost and complete the final similarity calculation task while fully extracting the original text semantic information, thus providing strong support for the field of semantic similarity analysis in natural language processing.
Attached Figure Description
[0037] Figure 1 is a flowchart of a semantic similarity calculation method based on a twin mini-BERT model.
Detailed Implementation
[0038] The present invention will now be described in further detail with reference to the accompanying drawings.
[0039] As shown in Figure 1, which is a flowchart of the semantic similarity calculation method based on a Siamese miniature BERT model, the implementation steps of the method are as follows:
[0040] (1) Token-encode the text data;
[0041] (2) Perform positional encoding on the text data;
[0042] (3) Mini BERT network extracts semantic features of text;
[0043] (4) Hybrid pooling reduces the dimensionality of feature data;
[0044] (5) If in the inference stage, proceed to step (6); otherwise, proceed to step (7).
[0045] (6) Calculate the cosine distance as the final similarity index;
[0046] (7) Concatenate the feature vectors;
[0047] (8) Classification training.
[0048] Take the LCQMC dataset as an example. This dataset was released by Harbin Institute of Technology at the 2018 International Conference on Computational Linguistics and was constructed by extracting user questions from different domains on Baidu Knows. It contains 238,766 training samples, 8,802 validation samples, and 12,500 test samples. Each sample consists of a pair of short sentences and a label, where 1 indicates that the pair is similar and 0 that it is dissimilar. Training and inference were performed on a Tesla P100 GPU with 16 GB of VRAM. During training, parameters were updated by stochastic gradient-based optimization with the AdamW optimizer. The number of epochs was set to 8, and the initial learning rate was 5e-5, adjusted according to the epoch by decreasing it by 1e-5 every two epochs. The loss function used was the cross-entropy loss function.
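The learning-rate schedule described above can be sketched as follows. This is only an illustration: the function name `learning_rate` is hypothetical, and it assumes epochs are counted from 0 and that the first decrease happens after the first two epochs, which the text does not state explicitly.

```python
def learning_rate(epoch: int, base_lr: float = 5e-5,
                  step: float = 1e-5, every: int = 2) -> float:
    """Learning rate for a 0-indexed epoch (assumed convention):
    start at base_lr and subtract `step` once per `every` completed epochs."""
    return base_lr - step * (epoch // every)


# Eight epochs, as in the training setup described in the text.
schedule = [learning_rate(e) for e in range(8)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Under these assumptions the rate stays at 5e-5 for the first two epochs and ends at roughly 2e-5 for the last two.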
[0049] The text data is then fed into step (1) for token encoding. The detailed process is as follows:
[0050] 1.1 Tokenization: The input sentence is broken down into tokens. BERT uses the WordPiece algorithm for tokenization, employing a Chinese vocabulary of 21,128 entries for Chinese text tasks.
[0051] 1.2 Add special tokens. Add a [CLS] token at the beginning of the sentence for aggregate representation of the classification task; add a [SEP] token at the end of the sentence to separate sentences and indicate the end; use [PAD] tokens to fill sentences, unifying the input to a length of 512 tokens.
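A minimal sketch of the special-token step is given below. It does not implement WordPiece: the input is assumed to be already tokenized (here, simply one token per character), the function name `encode` is hypothetical, and truncating over-long sentences is an assumption, since the text only describes padding.

```python
def encode(tokens: list[str], max_len: int = 512) -> list[str]:
    """Wrap a token list in [CLS] ... [SEP] and pad with [PAD] to max_len.

    Truncation of long inputs is an assumption made for this sketch.
    """
    body = tokens[: max_len - 2]            # leave room for [CLS] and [SEP]
    seq = ["[CLS]"] + body + ["[SEP]"]
    return seq + ["[PAD]"] * (max_len - len(seq))


# Character-level split stands in for WordPiece in this illustration.
encoded = encode(list("你好吗"))
print(encoded[:7], len(encoded))
```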
[0052] The results of step 1.2 are then subjected to absolute position encoding. The detailed process is as follows:
[0053] 2.1 In the position encoding formula, for position pos and dimension index i:
[0054] $$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
[0055] where:
[0056] pos represents the position index in the sequence.
[0057] i represents the dimension index in the position encoding.
[0058] $d_{model}$ is the dimension size of the positional encoding, which is set to 512 in this method.
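As an illustration of absolute position encoding with the symbols pos, i, and d_model, the sketch below assumes the standard sinusoidal form from the original Transformer (sine on even dimensions, cosine on odd ones). Whether the patent uses exactly this form is an assumption; the function name `positional_encoding` is hypothetical.

```python
import math


def positional_encoding(seq_len: int, d_model: int = 512) -> list[list[float]]:
    """Sinusoidal absolute position encoding (assumed standard form)."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # `even` is the even dimension index, i.e. 2i in the usual notation.
        for even in range(0, d_model, 2):
            angle = pos / (10000 ** (even / d_model))
            pe[pos][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[pos][even + 1] = math.cos(angle)
    return pe


pe = positional_encoding(seq_len=4, d_model=8)
print(pe[1][:4])
```

The encoding is added element-wise to the token embeddings, so it must have the same width as the embedding dimension.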
[0059] The output of step 2.1 is fed into the Siamese miniature BERT model, and the steps are as follows:
[0060] 3.1 For input X, first generate the query, key, and value, denoted as Q, K, and V respectively:
[0061] $$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$
[0062] Here $W^Q$, $W^K$, and $W^V$ are learnable weight matrices, with dimensions set to $d_q = d_k = d_v = 64$. Then, the dot product of each word's query with the keys of the other words is calculated to obtain the attention scores. After normalization with softmax, the scores are multiplied by the values to obtain the attention output. The attention calculation formula is:
[0063] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
[0064] Here $\sqrt{d_k}$ is a scaling factor used to avoid excessively large inner product values and maintain gradient stability.
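The single-head attention computation in step 3.1 can be sketched with NumPy as follows. The random inputs, the initialization scale, and the sequence length of 5 are placeholders chosen for the example; only d_model = 512 and d_k = 64 come from the text.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V


rng = np.random.default_rng(0)
X = rng.normal(size=(5, 512))                       # 5 tokens, d_model = 512
W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(512, 64)) for _ in range(3))
out = attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)
```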
[0065] 3.2 Extend the calculation in step 3.1 to multiple attention heads (e.g., h heads). Each head independently calculates a set of Q, K, V and obtains an attention output. The results from all heads are concatenated and subjected to a linear transformation to obtain the final multi-head attention output:
[0066] $$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^O$$
[0067] where:
[0068] $$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i)$$
[0069] $W^O$ is the weight matrix used for the linear transformation after concatenation.
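A rough NumPy sketch of the multi-head step follows. The number of heads is not given in the text; h = 8 is assumed here only because 8 × 64 = 512 matches the stated dimensions. The function name `multi_head` and the random weights are illustrative.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def multi_head(X, W_Q, W_K, W_V, W_O):
    """W_Q, W_K, W_V have shape (h, d_model, d_k); W_O has shape (h * d_k, d_model)."""
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):           # one projection set per head
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V)
    # Concatenate the head outputs, then apply the output projection W_O.
    return np.concatenate(heads, axis=-1) @ W_O


rng = np.random.default_rng(0)
h, d_model, d_k, n = 8, 512, 64, 5                  # h = 8 is an assumption
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V = (rng.normal(scale=0.02, size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(scale=0.02, size=(h * d_k, d_model))
out = multi_head(X, W_Q, W_K, W_V, W_O)
print(out.shape)
```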
[0070] 3.3 The multi-head attention output from step 3.2 is processed through residual connections and layer normalization:
[0071] Z1=LayerNorm(X+MultiHead(Q,K,V))
[0072] Z1 is the intermediate result after multi-head attention processing.
[0073] 3.4 For the representation vector at each word position in step 3.3, a two-layer feedforward neural network (FFN) is used, where the FFN operates independently at each position:
[0074] FFN(Z1)=ReLU(Z1W1+b1)W2+b2
[0075] Here W1, b1, W2, and b2 are the parameters of the FFN layer, and $d_{ff}$ is the dimension size of the hidden layer, set to 2048.
[0076] 3.5 Perform residual connections and layer normalization again on the FFN output from step 3.4:
[0077] Z2 = LayerNorm(Z1 + FFN(Z1))
[0078] After processing by the encoder, Z2 is the output of the encoder layer. This output contains contextual information about the input sequence and can be used as input to the next encoder layer or as the final output for downstream tasks.
[0079] 3.6 A strategy of sharing all parameters across layers is adopted, so the current layer also serves as the logically next layer of the network. The output from step 3.5 is fed back into step 3.1 and passed through the same layer again, and this process is repeated 12 times to obtain $Z_{out}$.
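The cross-layer sharing in step 3.6 amounts to applying one set of layer parameters 12 times in sequence. The sketch below illustrates this under simplifying assumptions: a single attention head, layer normalization without learnable gain and bias, and small toy dimensions (d_model = 16, d_ff = 32) instead of 512 and 2048. All names are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def encoder_layer(X, p):
    """One encoder layer: attention, residual + norm, FFN, residual + norm."""
    Q, K, V = X @ p["W_Q"], X @ p["W_K"], X @ p["W_V"]
    attn = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V @ p["W_O"]
    Z1 = layer_norm(X + attn)
    ffn = np.maximum(Z1 @ p["W1"] + p["b1"], 0.0) @ p["W2"] + p["b2"]  # ReLU FFN
    return layer_norm(Z1 + ffn)


def shared_encoder(X, p, num_layers=12):
    Z = X
    for _ in range(num_layers):
        Z = encoder_layer(Z, p)     # the same parameter dict is reused every pass
    return Z


rng = np.random.default_rng(0)
d_model, d_ff, n = 16, 32, 5        # toy sizes for the sketch
shapes = {"W_Q": (d_model, d_model), "W_K": (d_model, d_model),
          "W_V": (d_model, d_model), "W_O": (d_model, d_model),
          "W1": (d_model, d_ff), "b1": (d_ff,),
          "W2": (d_ff, d_model), "b2": (d_model,)}
params = {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}
Z_out = shared_encoder(rng.normal(size=(n, d_model)), params)
print(Z_out.shape)
```

Because only one layer's parameters are stored, the parameter count is independent of the number of passes, which is where the compression comes from.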
[0080] The final output feature $Z_{out}$ from step 3.6 is used as input to the hybrid pooling layer. The specific steps are as follows:
[0081] 4.1 Hybrid pooling strategy: Hybrid pooling combines the advantages of average pooling and max pooling to obtain richer feature representations. The specific operation is as follows:
[0082] Average pooling: Takes the average value for each feature dimension.
[0083] $$\mathrm{avg} = \frac{1}{n}\sum_{i=1}^{n} z_{2,i}$$
[0084] Max pooling: Takes the maximum value for each feature dimension.
[0085] $$\mathrm{max} = \max_{i=1,2,\ldots,n} z_{2,i}$$
[0086] Hybrid pooling concatenates the results of average pooling and max pooling.
[0087] $$\mathrm{vector} = [\mathrm{avg};\ \mathrm{max}]$$
[0088] Where [;] represents the vector concatenation operation, and n is the number of positions over which each feature dimension is pooled.
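The hybrid pooling operation can be illustrated in plain Python as below. The sketch treats the encoder output as a list of token vectors and pools over the token positions for each feature dimension; it includes every position, since the text does not say whether padding positions are masked out. The function name `hybrid_pool` is hypothetical.

```python
def hybrid_pool(Z: list[list[float]]) -> list[float]:
    """Concatenate per-dimension average pooling and max pooling.

    Z holds one vector per token position; the result has twice the
    feature width of a single token vector.
    """
    n = len(Z)
    columns = list(zip(*Z))                 # one tuple per feature dimension
    avg = [sum(col) / n for col in columns]
    mx = [max(col) for col in columns]
    return avg + mx                         # [avg; max]


vec = hybrid_pool([[1.0, 4.0],
                   [3.0, 0.0]])
print(vec)  # [2.0, 2.0, 3.0, 4.0]
```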
[0089] 5. Apply the hybrid pooling strategy from step 4.1 to the two outputs of the Siamese network to obtain feature vectors u and v.
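[0090] The detailed process of the method in the inference stage is as follows:
[0091] 6.1 Cosine similarity is defined as follows, for the vectors u and v output in step 5:
[0092] $$\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert\, \lVert v \rVert}$$
[0093] where:
[0094] $u \cdot v$ represents the dot product of u and v:
[0095] $$u \cdot v = \sum_{i} u_i v_i$$
[0096] $\lVert u \rVert$ and $\lVert v \rVert$ denote the Euclidean norm (L2 norm) of u and v:
[0097] $$\lVert u \rVert = \sqrt{\sum_{i} u_i^2}, \qquad \lVert v \rVert = \sqrt{\sum_{i} v_i^2}$$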
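The inference-stage similarity can be sketched directly from the definitions of the dot product and the L2 norm. The function name `cosine_similarity` is illustrative, and the sketch does not guard against zero vectors.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))            # orthogonal vectors
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # parallel vectors
```

Since only the two pooled vectors are needed, sentence vectors can be computed once and compared many times at inference.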
[0098] The detailed process of the training phase is as follows:
[0099] 7.1 To capture the similarity between the two, the following feature vectors are constructed for the vectors u and v output in step 5:
[0100] $$h = [u;\ v;\ |u - v|]$$
[0101] Where $|u - v|$ represents the absolute difference between corresponding elements, and [;] represents the vector concatenation operation.
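A small sketch of the training feature vector follows, reading the third component as the element-wise absolute difference between u and v, as the text describes. The function name `build_training_vector` is hypothetical.

```python
def build_training_vector(u, v):
    """Concatenate u, v, and their element-wise absolute difference."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    return list(u) + list(v) + diff


h_vec = build_training_vector([1.0, 2.0], [3.0, 0.5])
print(h_vec)  # [1.0, 2.0, 3.0, 0.5, 2.0, 1.5]
```

The result is three times as long as a single pooled vector, which fixes the input width of the fully connected layer.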
[0102] The feature vector h obtained from the concatenation in step 7.1 is input into the downstream classification network. The specific steps are as follows:
[0103] 8.1 Input the feature vector h into a fully connected layer for a binary classification task. The weight matrix of the fully connected layer is W, and the bias is b. The output is:
[0104] o = σ(Wh + b)
[0105] Where σ is the Softmax activation function.
[0106] The cross-entropy loss function is used to optimize the model. For binary classification tasks, the cross-entropy loss function is defined as:
[0107]
[0108] Where m is the number of samples (m = 2 in this task scenario), $y_i$ is the true label of the i-th sample, and $o_i$ represents the probability predicted by the model.
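The classification head can be sketched as a linear layer followed by Softmax, scored with cross-entropy. This assumes the usual categorical cross-entropy over two classes (the negative log-probability of the true class); the exact form of the loss formula is not reproduced here, and the weights and names below are made up for the example.

```python
import math


def linear(h, W, b):
    """Fully connected layer: one output per row of W."""
    return [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]


def softmax(logits):
    m = max(logits)                              # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Negative log-probability of the true class (label is 0 or 1)."""
    return -math.log(probs[label])


h_vec = [1.0, 2.0, 3.0]
W = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]           # toy weights: two output classes
b = [0.0, 0.0]
probs = softmax(linear(h_vec, W, b))
loss = cross_entropy(probs, label=1)
print(probs, loss)
```

With all-zero weights both classes receive probability 0.5, so the loss equals ln 2, a convenient sanity check for an untrained head.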
[0109] The experimental results are as follows:
[0110] Table 1 Comparison of the method of the present invention with other methods
[0111]
[0112] As shown in Table 1, the method of this invention achieves high accuracy with a low parameter count on mainstream evaluation metrics, demonstrating the model's robustness and feature extraction capabilities. It also greatly reduces the hardware overhead of deploying the model locally and has broad application potential.
Claims
1. A semantic similarity calculation method based on a twin miniature BERT model, characterized in that: during semantic extraction and feature value calculation, text pairs are preprocessed to generate word embedding matrices, which are then input into a self-designed Siamese mini-BERT model; the mini-BERT model employs weight sharing to reuse the Transformer module and outputs a high-dimensional vector matrix rich in semantic information; a hybrid pooling strategy is introduced to reduce the dimensionality of the backbone network's output and extract features; and combinations of specific feature vectors are concatenated to form the final feature vector for training.
2. A semantic similarity calculation method based on a twin miniature BERT model according to claim 1, characterized in that a miniature BERT model is designed which reuses the Transformer module through cross-layer parameter sharing, significantly reducing the number of model parameters.
3. A semantic similarity calculation method based on a twin miniature BERT model according to claim 1, characterized in that a twin network architecture based on parameter sharing is constructed, which solves the problem of low efficiency of traditional BERT models when dealing with large-scale semantic similarity calculation tasks.
4. A semantic similarity calculation method based on a twin miniature BERT model according to claim 1, characterized in that a hybrid pooling strategy based on mean pooling and max pooling is designed for data dimensionality reduction and feature extraction.
5. A semantic similarity calculation method based on a twin miniature BERT model according to claim 1, characterized in that training feature vectors are constructed by concatenating the original feature vectors and the difference between the feature vectors, enhancing the model's perception of semantic similarity and differences in text.