A standard content summarization generation method fusing embedded vectors and semantic supervision
By fusing word vectors and character vectors generated by the BERT model, and combining them with BiLSTM and TextCNN neural networks, keyword pre-classification and text recombination are performed, solving the problem of inaccurate word vector extraction in existing technologies and achieving efficient and accurate generation of standard content summaries.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: CHINA NAT INST OF STANDARDIZATION
- Filing Date: 2026-03-12
- Publication Date: 2026-06-30
AI Technical Summary
Existing technologies only output text feature vectors based on word vectors, without considering the joint analysis of word vectors and character vectors. This results in inaccurate extraction of text keywords and a lack of semantic relevance judgment, affecting the accuracy and completeness of the summary generation.
Word and character vectors are generated using the BERT model. Combined with BiLSTM and TextCNN neural networks, word and character feature representation vectors are extracted. Keyword pre-classification is performed using attention mechanisms and classifiers. Text is reorganized based on semantic similarity and relevance to form a standard content summary.
It improves the accuracy and efficiency of abstract generation, ensures that the generated abstracts are semantically consistent with the original text, fully cover the core content, and enhances the ability to identify keywords and the accuracy of semantic supervision.
[Figure CN122309732A_ABST: abstract drawing, image not reproduced]
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a standard content summarization method that integrates embedded vectors and semantic supervision.
Background Technology
[0002] Standard content summarization is one of the core tasks in the field of natural language processing. It aims to extract or generate concise, complete, and semantically relevant summaries from original text. Standard content summarization has strict requirements for the accuracy, semantic completeness, and keyword coverage of the summaries. It must simultaneously meet the requirements of formal standardization and semantic clarity. With the development of pre-trained language models and deep learning architectures, semantic representation techniques based on embedding vectors and semantic supervision mechanisms have become key directions for improving the quality of summary generation. Existing methods mostly use single word-level or character-level embedding vectors, without effectively concatenating the word vectors and character vectors generated by BERT, and without constructing a semantic supervision link. This results in a lack of continuous semantic constraints in the summary generation process. At the same time, pre-classified keywords are not combined with semantic relevance for secondary verification to ensure that the semantics of the summary are highly consistent with the meaning of the original text.
[0003] Chinese Patent Publication No. CN115481222A discloses a training method and apparatus for a semantic vector extraction model and a semantic vector representation method. The training method for the semantic vector extraction model includes: acquiring dialogue corpora from downstream tasks in the same scenario corresponding to the semantic vectors as first pre-training text; performing secondary training on a pre-trained BERT language network based on the first pre-training text to generate a BERT language network focusing on nouns and verbs in the text; acquiring text classification data and generating a text similarity dataset based on the text classification data; and fine-tuning the multi-task classification network based on the text similarity dataset to generate the semantic vector extraction model.
[0004] The existing technology therefore has the following problems: for semantic vector extraction, it outputs text feature vectors based on word vectors alone. It does not consider that jointly analyzing word vectors and character vectors can make keyword extraction more accurate, nor does it judge, from the correlation between different segments of the text, whether words need to be re-segmented and the text reorganized.
Summary of the Invention
[0005] To address this, the present invention provides a standard content summarization method that integrates embedded vectors and semantic supervision. It overcomes the problems of existing technologies, which output text feature vectors based on word vectors alone for semantic vector extraction, do not consider that jointly analyzing word vectors and character vectors can make the keywords of the text more accurate, and do not judge from the correlation between different segments of the text whether words need to be re-segmented and the text reorganized.
[0006] To achieve the above objectives, this invention provides a standard content summarization method that integrates embedded vectors and semantic supervision, comprising:
dividing the text into several words, inputting each word into the BERT model to generate word vectors, inputting the word vectors into the BiLSTM model to update the hidden layer variables, processing the hidden layer variables with an attention mechanism, and calculating word feature representation vectors;
dividing the text into several characters, inputting each character into the BERT model to generate character vectors, inputting the character vectors into the TextCNN neural network to generate several feature map vectors, and selecting the maximum response value of each feature map vector and concatenating these values to generate character feature representation vectors;
concatenating the word feature representation vector and the character feature representation vector to generate a concatenated feature vector, selecting the concatenated feature vectors corresponding to several time points to generate a concatenated feature vector matrix, and calculating the probability distribution of the concatenated feature vector matrix with a classifier;
performing keyword pre-classification based on the probability distribution, combining several pre-classified keywords to generate combined segments, obtaining the combined semantic vector of each combined segment, and calculating the natural semantic vectors of several natural segments;
calculating the semantic similarity between the combined semantic vector and the corresponding natural semantic vector, and performing text recombination based on the semantic similarity;
calculating the semantic correlation between each recombined segment and each remaining pre-classified keyword in the recombined text, and determining whether the pre-classified keyword is a keyword based on the semantic correlation;
further segmenting the non-keywords in the text into words and repeating the above steps;
wherein the hidden layer variables include the outputs of the forward network and the backward network of the BiLSTM model, and a natural segment is a segment of the text delimited by punctuation marks.
[0007] Furthermore, the process of calculating the word feature representation vectors includes: concatenating the updated hidden layer variables to form the hidden layer state of the BiLSTM model; and inputting the hidden layer state into the attention matrix to obtain the word feature representation vector.
[0008] Furthermore, the process of selecting the maximum response value of each feature map vector and concatenating them to generate a character feature representation vector includes: obtaining a single feature map from the convolution operation; combining several feature maps to generate a feature map vector; and obtaining the maximum response value of each feature map vector and concatenating these values to generate the character feature representation vector.
[0009] Furthermore, keyword pre-classification based on the probability distribution proceeds as follows: if the keyword probability of a word is greater than or equal to the probability threshold, the word is determined to be a keyword; if the keyword probability is less than the probability threshold, the word is determined to be a non-keyword.
[0010] Furthermore, the natural semantic vector of a natural segment that does not have a corresponding combined segment is denoted as the zero vector.
[0011] Furthermore, text recombination based on the semantic similarity proceeds as follows: if the semantic similarity is greater than or equal to the similarity threshold, the combined segment is determined to be usable as a recombined segment for text recombination to generate the standard content summary; if the semantic similarity is less than the similarity threshold, the combined segment is determined not to be usable as a recombined segment.
[0012] Furthermore, the process of text recombination based on the semantic similarity includes: if a combined segment is determined not to be usable as a recombined segment, the natural segment corresponding to that combined segment is used as a replacement segment for text recombination to generate the standard content summary; natural segments without corresponding combined segments are used as supplementary segments for text recombination to generate the standard content summary.
[0013] Furthermore, the process of calculating the semantic correlation between each recombined segment and each remaining pre-classified keyword in the recombined text includes: calculating the semantic correlation between each recombined segment and the replacement segment, as well as the semantic correlation between each recombined segment and the supplementary segment.
[0014] Furthermore, the process of determining whether a pre-classified keyword is a keyword based on the semantic correlation includes: if the semantic correlation is less than the correlation threshold, the pre-classified keyword is determined to be a non-keyword; if the semantic correlation is greater than or equal to the correlation threshold, the pre-classified keyword is determined to be a keyword.
[0015] Furthermore, the non-keywords in the text are split or combined into several words or several single characters to regenerate word vectors or character vectors.
[0016] Compared with existing technologies, the word vectors generated by the BERT model in this invention are context-dependent and can accurately distinguish the semantics of professional terms in standard text under different contexts. Combined with the bidirectional recurrent structure of BiLSTM, the method captures the semantic dependencies of words on both the preceding and the following context, and fully restores the temporal semantic logic of long sentences in standard text. The attention mechanism weights the hidden layer variables output by BiLSTM and increases the weight of core words in standard text, so that the generated word feature representation vectors better match the key semantics of the standard content summary. TextCNN extracts local features of character vectors through multi-scale convolutional kernels, which can accurately capture key character combinations in standard text. Selecting the maximum response value of each feature map vector and concatenating these values keeps the optimal local feature for each convolutional kernel and improves the discriminability of the character feature representation vectors. Word feature representation vectors alone do not fully capture the features of short natural segments in standard text, whereas character feature representation vectors make full use of the semantic information of each character. Concatenating the two forms a complete feature representation vector, which improves the accuracy of the standard content summary generation method that integrates embedded vectors and semantic supervision.
[0017] Furthermore, this invention ensures the integrity of multi-granular features by generating both word and character features based on the BERT model, providing high-quality input for accurate judgment by the subsequent classifier. Selecting concatenated feature vectors corresponding to several time points to construct a matrix is essentially a temporal slice model of the text, which can obtain the semantic dynamic changes in different positions and contexts in the standard text. This allows the feature matrix to completely restore the temporal semantic structure of the text. The classifier can simultaneously concatenate the commonalities and differences of different words in the feature vector matrix, improving the ability to distinguish between keywords and non-keywords, increasing the efficiency of feature processing, and improving the accuracy of the standard content summarization generation method that integrates embedded vectors and semantic supervision.
[0018] Furthermore, this invention performs keyword pre-classification based on the probability distribution output by the classifier. By setting a probability threshold, it can filter out words with high core semantic relevance in the text to generate combined segments, and filter out redundant words with low semantic relevance or irrelevant meaning, thereby improving the efficiency of standard content summary generation. Based on the comparison results of the similarity between the combined segments and the corresponding natural segments and their thresholds, for combined segments that cannot be used as recombined segments, their corresponding natural segments are used as replacement segments for text recombination to improve the semantic integrity of the recombined text. At the same time, for natural segments that do not have corresponding combined segments, they are used as supplementary segments to fill in the recombined text, ensuring that the summary content generated by the recombined text is consistent with the meaning of the original text, thereby improving the accuracy of the standard content summary generation method that integrates embedding vectors and semantic supervision.
[0019] Furthermore, this invention further verifies the correlation between natural segments that cannot be directly used as recombined segments and combined segments that can be directly used as recombined segments by calculating the semantic correlation between each recombined segment and the replacement segment, as well as the semantic correlation between each recombined segment and the supplementary segment. By setting a threshold, segments with a correlation higher than the threshold are identified as keywords that can generate summaries. For non-keywords that do not meet the requirements, vocabulary is re-segmented and multi-granular feature extraction is performed to uncover core information that was missed in the initial screening. This ensures that the generated summary can cover all the core content of the original text. From keyword pre-classification to text recombination, and then to semantic correlation verification and non-keyword iteration, the entire process forms a semantic supervision link, which improves the accuracy of the standard content summary generation method that integrates embedded vectors and semantic supervision.
Attached Figure Description
[0020] Figure 1 is an overall flowchart of the standard content summarization generation method that integrates embedded vectors and semantic supervision according to an embodiment of the present invention;
Figure 2 is a flowchart illustrating the logic of keyword pre-classification based on probability distribution in an embodiment of the present invention;
Figure 3 is a flowchart illustrating the logic of text reorganization based on semantic similarity in an embodiment of the present invention;
Figure 4 is a flowchart illustrating the logic of determining whether a pre-classified keyword is a keyword based on semantic relevance in an embodiment of the present invention.
Detailed Implementation
[0021] To make the objectives and advantages of the present invention clearer, the present invention will be further described below with reference to embodiments; it should be understood that the specific embodiments described herein are merely for explaining the present invention and are not intended to limit the present invention.
[0022] Preferred embodiments of the present invention will now be described with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are merely illustrative of the technical principles of the present invention and are not intended to limit the scope of protection of the present invention.
[0023] Figure 1 is an overall flowchart of the standard content summarization method integrating embedding vectors and semantic supervision according to an embodiment of the present invention. The present invention provides a standard content summarization method integrating embedding vectors and semantic supervision, comprising:
Step S1: Divide the text into several words, input each word into the BERT model to generate word vectors, input the word vectors into the BiLSTM model to update the hidden layer variables, process the hidden layer variables with the attention mechanism, and calculate the word feature representation vectors.
Step S2: Divide the text into several characters, input each character into the BERT model to generate character vectors, input the character vectors into the TextCNN neural network to generate several feature map vectors, and select the maximum response value of each feature map vector and concatenate these values to generate character feature representation vectors.
Step S3: Concatenate the word feature representation vector and the character feature representation vector to generate a concatenated feature vector. Select the concatenated feature vectors corresponding to several time points to generate a concatenated feature vector matrix. Calculate the probability distribution of the concatenated feature vector matrix using a classifier.
Step S4: Based on the probability distribution, perform keyword pre-classification, combine several pre-classified keywords to generate combined segments, obtain the combined semantic vector of the combined segments, and calculate the natural semantic vectors of several natural segments.
Step S5: Calculate the semantic similarity between the combined semantic vector and the corresponding natural semantic vector, and perform text recombination based on the semantic similarity.
Step S6: Calculate the semantic correlation between each recombined segment and each remaining pre-classified keyword in the recombined text, and determine whether the pre-classified keyword is a keyword based on the semantic correlation.
Step S7: Further segment the non-keywords in the text into words and repeat the above steps.
Step S8, wherein the hidden layer variables include the output of the forward network and the output of the backward network of the BiLSTM model, and a natural segment is a segment of the text delimited by punctuation marks.
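The definition of a natural segment as a punctuation-delimited piece of text can be illustrated with a short sketch. The delimiter set and the function name `split_natural_segments` are assumptions made for illustration; the source does not list which punctuation marks are used.

```python
import re

# Assumed delimiter set: common Chinese and Western clause/sentence punctuation.
# The source only says "punctuation marks", so this choice is hypothetical.
_DELIMITERS = r"[，。；：！？,.;:!?]"


def split_natural_segments(text):
    """Split text at punctuation marks and drop empty pieces."""
    pieces = re.split(_DELIMITERS, text)
    return [piece.strip() for piece in pieces if piece.strip()]


if __name__ == "__main__":
    sample = "Gas leaks are dangerous. Close the valve; then ventilate!"
    print(split_natural_segments(sample))
```

Note that this naive split would also break decimal numbers such as "0.85" at the period, so a production splitter would probably need a more careful rule.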
[0024] Understandably, this implementation case uses two standard data sets for testing: the gas accident standard and the hazardous chemical accident standard. The following is a detailed explanation of the experimental process for the gas accident standard dataset. There are a total of 244 gas accident standard documents, with an average document length of 7064 characters and an average of 34 keywords per document.
[0025] Specifically, the process of calculating the word feature representation vector includes: concatenating the updated hidden layer variables to form the hidden layer state of the BiLSTM model; and inputting the hidden layer state into the attention matrix to obtain the word feature representation vector.
[0026] Specifically, (1) determine the text word-vector input $X$: The BERT model is used to pre-train word vectors. This model transforms each word in the text into a $d$-dimensional vector, facilitating efficient feature extraction by the neural network.
[0027] The sentence representation $S = [w_1, w_2, \dots, w_n]$ is retrieved from the text; after passing through the word embedding layer, the text representation is transformed into $X = [x_1, x_2, \dots, x_n] \in \mathbb{R}^{n \times d}$, where $S$ is the character array representing the preprocessed sentence text, $w_i$ is the serialized characters of the $i$-th word in the text, $d$ is the word vector dimension, $X$ is the input text sequence, $\mathbb{R}$ is the set of real numbers, and $x_1$ represents the first word in the sentence, and so on. In the experiment on the gas accident standard dataset, the word vector dimension $d$ is [value missing in source].
[0028] (2) Determine the feature representation vector of standard text words: This invention uses BiLSTM, a variant optimized on the basis of LSTM, to extract text word feature representation vectors; it receives the word vectors as input. The method is as follows: the input vector sequence is $X = [x_1, x_2, \dots, x_n]$, where $n$ is the number of words in the text; the recurrent structure updates the hidden layer variable $h_t$ using the following formulas:

$$f_t = \sigma\left(W_f [h_{t-1}, x_t] + b_f\right)$$
$$i_t = \sigma\left(W_i [h_{t-1}, x_t] + b_i\right)$$
$$\tilde{c}_t = \tanh\left(W_c [h_{t-1}, x_t] + b_c\right)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma\left(W_o [h_{t-1}, x_t] + b_o\right)$$
$$h_t = o_t \odot \tanh(c_t)$$

[0029] where $f_t$ is the output of the forget gate, which controls the proportion of the cell state $c_{t-1}$ of the previous time step that is retained; $i_t$ is the output of the input gate, which controls the degree to which the current input information is written into the cell state; $o_t$ is the output of the output gate, which controls the proportion of the current cell state that is output to the hidden state; $c_t$ is the cell state at time $t$; $h_{t-1}$ is the hidden state variable of the previous time step; $x_t$ is the input word vector at time $t$; $c_{t-1}$ is the cell state at time $t-1$; and $\sigma$ and $\tanh$ are the Sigmoid activation function and the hyperbolic tangent activation function, respectively.
[0030] $W_f$, $W_i$, $W_c$, $W_o$ and $b_f$, $b_i$, $b_c$, $b_o$ are the weight matrices and bias vectors corresponding to the forget gate, input gate, candidate cell state, and output gate, respectively. The parameters are shared across time steps. The weight matrices are initialized from the normal distribution N(0,1), and the initial value of each bias vector is uniformly set to 0.01. Each parameter is optimized through supervised training.
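The gated update described here (forget gate, input gate, candidate cell state, output gate) can be sketched as a single LSTM time step in plain Python. This is a minimal illustration, not the patent's implementation: the helper names (`init_gate`, `lstm_step`), the list-based matrices, and the concatenated-input form of the weights are assumptions; only the N(0,1) weight initialization and the 0.01 bias come from the text.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def affine(W, b, v):
    """Return W @ v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def init_gate(hidden, inputs, rng):
    """Weights drawn from N(0, 1), biases set to 0.01, as stated in the text."""
    W = [[rng.gauss(0.0, 1.0) for _ in range(hidden + inputs)] for _ in range(hidden)]
    b = [0.01] * hidden
    return W, b


def lstm_step(params, h_prev, c_prev, x_t):
    """One LSTM time step; params maps 'f', 'i', 'c', 'o' to (W, b) pairs."""
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]    # input gate
    g = [math.tanh(a) for a in affine(*params["c"], z)]  # candidate cell state
    o = [sigmoid(a) for a in affine(*params["o"], z)]    # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


if __name__ == "__main__":
    rng = random.Random(0)
    params = {name: init_gate(3, 2, rng) for name in "fico"}
    h, c = lstm_step(params, [0.0] * 3, [0.0] * 3, [0.5, -0.5])
    print(h, c)
```

A BiLSTM would run one such recurrence left to right and another right to left over the word vectors, then concatenate the two hidden states at each position.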
[0031] Understandably, supervised training is based on labeled standard text samples. Word sequences and their corresponding semantic labels are used as training data input into the model. The weight matrix and bias vector in the model are iteratively optimized by minimizing the loss function between the model's prediction results and the manually labeled results.
[0032] This invention uses a BiLSTM model to combine the outputs of two LSTM networks: the outputs are concatenated and fed into the next layer, ultimately forming the hidden layer state:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

[0033] where $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$ are the outputs of the forward network and the backward network, respectively, at time step $t$.
[0034] (3) Further extract the word feature representation vectors through the attention mechanism: This module takes the output of the preceding BiLSTM layer as input and uses an attention mechanism to further extract text features.
[0035] The specific calculations are as follows:

$$M = g\left(W_a\,[\overrightarrow{h}; \overleftarrow{h}] + b_a\right)$$

[0036] where $M$ is the defined attention matrix, which takes the BiLSTM output as input; $W_a$ is a weight matrix whose initial values follow a random distribution [distribution missing in source]; $\overrightarrow{h}$ and $\overleftarrow{h}$ are the forward and backward LSTM outputs, respectively; $b_a$ is the bias vector, initialized to 0; and $g$ is the activation function.
[0037]

$$\alpha_{t,i} = \frac{\exp(M_{t,i})}{\sum_{j=1}^{n} \exp(M_{t,j})}$$

[0038] The above formula gives $\alpha_{t,i}$, the attention probability weight of the $i$-th node with respect to the $t$-th node at time step $t$, where $n$ is the number of words.
[0039]

$$\overrightarrow{h'_t} = \sum_{i=1}^{n} \alpha_{t,i}\,\overrightarrow{h_i}$$

[0040] The above formula calculates the new forward output feature value $\overrightarrow{h'_t}$ of the $t$-th word. The new backward feature value $\overleftarrow{h'_t}$ of each word is calculated in the same way. Here $\overrightarrow{h_i}$ and $\overleftarrow{h_i}$ are the hidden states of the BiLSTM model, which serve as the input vectors of the attention mechanism.
[0041] The output feature at time step $t$ after the attention mechanism is $h'_t = [\overrightarrow{h'_t}; \overleftarrow{h'_t}]$.
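As a rough illustration of the attention step, the sketch below turns a row of attention scores into probability weights and uses them to form a weighted sum of hidden states. It assumes the attention probability weights are obtained by softmax normalization, which is a common choice but an assumption here; the function names `softmax` and `attend` are hypothetical.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that sum to 1 (max-shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(score_row, hidden_states):
    """Return (weights, weighted sum of hidden states) for one time step.

    score_row: one attention score per hidden state.
    hidden_states: list of equal-length vectors (e.g. forward LSTM outputs).
    """
    weights = softmax(score_row)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, context


if __name__ == "__main__":
    weights, context = attend([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
    print(weights, context)
```

The same routine would be applied once to the forward outputs and once to the backward outputs, and the two results concatenated per time step.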
[0042] Specifically, the process of selecting the maximum response value of each feature map vector and concatenating them to generate a character feature representation vector includes: obtaining a single feature map from the convolution operation; combining several feature maps to generate a feature map vector; and obtaining the maximum response value of each feature map vector and concatenating these values to generate the character feature representation vector.
[0043] Specifically, (1) determine the text character-vector input: To compensate for the shortcomings of word vector models during training, this invention uses character-level vectors to further extract text features. Character vectors are obtained by training on Chinese text character by character, generating a vector of a certain dimension for each character. Character vectors can better express the meaning of each character and can also serve as input to neural networks, making them an effective supplement to word vectors.
[0044] The sentence representation $S' = [s_1, s_2, \dots, s_m]$ is retrieved from the text.
[0045] After passing through the character embedding layer, the text representation is transformed into $E = [e_1, e_2, \dots, e_m] \in \mathbb{R}^{m \times d_c}$,
[0046] where $S'$ is the character array representing the preprocessed sentence text, $s_i$ is the $i$-th serialized character in the text, and $d_c$ is the character vector dimension. In the experiment on the gas accident standard dataset, the character vector dimension $d_c$ is [value missing in source].
[0047] (2) Determine the character feature representation vector $z$: A TextCNN neural network is used to extract the character feature representation vectors. Its input vector sequence is $E = [e_1, e_2, \dots, e_m]$, of size $m \times d_c$, where $d_c$ is the dimension of the character vector.
[0048] Convolutional kernels of equal width are set up with lengths of (2, 3, 4), with M kernels for each length. The convolution operation is as follows:

$$a_i = f\left(W \cdot e_{j-k+1:j} + b\right)$$

[0049] where $a_i$ is the $i$-th numerical result obtained from the convolution operation; $e$ denotes the character vectors; $e_{j-k+1:j}$ is the vector segment of text covered by the convolution kernel; $j$ is the end position of the window; $k$ is the length of the convolution kernel, that is, the number of characters the kernel covers; $W$ is the weight of the convolution kernel, initialized from a distribution [distribution missing in source]; $b$ is the bias, initially set to 0.01; and $f$ is the non-linear activation function ReLU. For a convolution kernel of size [size missing], the convolution operation yields a feature map of size [size missing], which depends on $m$, the length of the input character sequence.
[0050] The feature map vector is represented as:

$$A = [a_1, a_2, \dots, a_L]$$

where $L$ is the number of window positions.
[0051] A max pooling layer then takes the maximum value of a single feature map:

$$\hat{a} = \max(A)$$

[0052] The maximum response values of all feature maps are concatenated to obtain the text character feature representation vector:

$$z = [\hat{a}_1, \hat{a}_2, \dots, \hat{a}_{3M}]$$

where $\hat{a}_1$ is the maximum response value of the first feature map, $\hat{a}_2$ is the maximum response value of the second feature map, and $\hat{a}_{3M}$ is the maximum response value of the $3 \times M$-th feature map. The parameters to be optimized include the character vector matrix, the convolution kernel weights, and the bias vectors.
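The convolution, ReLU, max pooling, and concatenation steps can be sketched in plain Python. This is an illustrative sketch under stated assumptions: it uses a "valid" convolution with no padding (the source does not say whether padding is used), requires the sequence to be at least as long as the longest kernel, and the function names `conv_feature_map` and `char_feature_vector` are hypothetical.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0 else 0.0


def conv_feature_map(chars, kernel, bias):
    """Slide one kernel (k rows x d cols) over a character-vector sequence (m x d).

    Assumes no padding, so the result has m - k + 1 entries.
    """
    k = len(kernel)
    feature_map = []
    for start in range(len(chars) - k + 1):
        window = chars[start:start + k]
        total = sum(
            w * x
            for kernel_row, char_vec in zip(kernel, window)
            for w, x in zip(kernel_row, char_vec)
        )
        feature_map.append(relu(total + bias))
    return feature_map


def char_feature_vector(chars, kernels, bias=0.01):
    """Max-pool each kernel's feature map and concatenate the maxima."""
    return [max(conv_feature_map(chars, kernel, bias)) for kernel in kernels]


if __name__ == "__main__":
    # Four characters with 1-dimensional vectors; one kernel each of length 2, 3, 4.
    chars = [[1.0], [2.0], [3.0], [4.0]]
    kernels = [
        [[1.0], [1.0]],
        [[1.0], [0.0], [-1.0]],
        [[0.5], [0.5], [0.5], [0.5]],
    ]
    print(char_feature_vector(chars, kernels, bias=0.0))
```

With M kernels for each of the three lengths, `kernels` would hold 3 × M entries and the returned vector would have 3 × M components, matching the concatenation described in the text.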
[0053] (3) Determine the keyword label category output $y$: The feature vector output by the attention mechanism at time step $t$ is concatenated with the text character feature representation vector:

$$v_t = [h'_t; z]$$

[0054] In the above formula, $v_t$ is the concatenated feature vector at time step $t$, obtained by concatenating the word feature representation vector with the character feature representation vector $z$, and $h'_t$ is the word feature representation vector at time step $t$.
[0055]

$$V = [v_1, v_2, \dots, v_T]$$

[0056] In the above formula, $V$ is the vector matrix formed from the concatenated feature vectors, and $v_T$ is the concatenated feature vector at time step $T$.
[0057]

$$y = \mathrm{softmax}\left(W_s V + b_s\right)$$

[0058] In the above formula, $y$ is the probability distribution calculated by the classifier; $W_s$ and $b_s$ are the weight matrix and bias vector that map the fused feature vector to the keyword classification label space. The initial values of $W_s$ follow a random distribution, and $b_s$ is initialized to 0. $W_s V + b_s$ is the raw score, and $\mathrm{softmax}$ converts the raw scores into probabilities. Through the above calculations, the fused high-dimensional feature vector is compressed and mapped to a low-dimensional label space containing only keywords and non-keywords. This mapping uses a linear transformation to weight the different feature dimensions and uses the $\mathrm{softmax}$ function to turn the results into a probability distribution, thereby enabling binary classification of whether a word is a keyword.
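The mapping from a fused feature vector to a two-class probability distribution can be sketched as a linear layer followed by normalization. The sketch assumes softmax as the normalizing function (the function name is not legible in the source) and uses the hypothetical name `keyword_probabilities`; the weights in the usage example are arbitrary illustrative values, not trained parameters.

```python
import math


def keyword_probabilities(features, W, b):
    """Map a fused feature vector to [p_keyword, p_non_keyword].

    W has one row per label (two rows here); b has one bias per label.
    """
    # Linear transformation: raw score for each label.
    scores = [
        sum(w * x for w, x in zip(row, features)) + bias
        for row, bias in zip(W, b)
    ]
    # Softmax (max-shifted for numerical stability) turns scores into probabilities.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    W = [[2.0, 0.0], [0.0, 0.0]]  # illustrative weights only
    b = [0.0, 0.0]                # biases initialized to 0, as in the text
    print(keyword_probabilities([1.0, 0.0], W, b))
```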
[0059] Specifically, the word vectors that this invention generates with the BERT model are context-dependent and accurately distinguish the semantics of technical terms in standard texts under different contexts. Combined with the bidirectional recurrent structure of BiLSTM, the method captures the semantic dependencies of words on both the preceding and the following context, fully restoring the temporal semantic logic of long sentences in standard texts. The attention mechanism weights the hidden layer variables output by BiLSTM and increases the weight of core words in standard texts, so that the generated word feature representation vectors more closely match the key semantics of standard content summaries. TextCNN extracts local features of character vectors through multi-scale convolutional kernels, accurately capturing key character combinations in standard texts. Selecting the maximum response value of each feature map vector and concatenating these values keeps the optimal local feature for each convolutional kernel and improves the discriminability of the character feature representation vectors. Word feature representation vectors alone are not comprehensive enough for extracting features from short natural segments in standard texts, whereas character feature representation vectors make full use of the semantic information of each character. Concatenating the two forms a complete feature representation vector, improving the accuracy of the standard content summarization generation method that integrates embedded vectors and semantic supervision.
[0060] Specifically, this invention generates both word and character features based on the BERT model, ensuring the integrity of multi-granular features and providing high-quality input for accurate judgment by the subsequent classifier. Selecting concatenated feature vectors corresponding to several time points to construct a matrix is essentially a temporal slice model of the text, which can obtain the semantic dynamic changes in different positions and contexts in the standard text. This allows the feature matrix to completely restore the temporal semantic structure of the text. The classifier can simultaneously concatenate the commonalities and differences of different words in the feature vector matrix, improving the ability to distinguish between keywords and non-keywords, increasing the efficiency of feature processing, and improving the accuracy of the standard content summarization generation method that integrates embedded vectors and semantic supervision.
[0061] Figure 2 is a flowchart of the logic for keyword pre-classification based on the probability distribution in an embodiment of the present invention. Keyword pre-classification is performed based on the probability distribution as follows: if the keyword probability is greater than or equal to the probability threshold, the word is determined to be a keyword; if the keyword probability is less than the probability threshold, the word is determined to be a non-keyword.
[0062] Understandably, the probability distribution calculated by the classifier contains two values: the keyword probability and the non-keyword probability. The keyword probability is compared with a threshold that takes both text length and keyword density into account. To prevent redundant words from being misclassified as keywords while avoiding the omission of low-frequency but important professional terms, the probability threshold is generally set between 60% and 70%.
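The threshold comparison can be sketched as below. The two-class softmax used to turn classifier scores into a probability is an assumption (the description does not name the classifier's output function), and the default threshold of 0.65 is simply a value inside the stated 60% to 70% range, chosen for illustration.

```python
import math

def keyword_probability(logits):
    """Two-class softmax; logits = (keyword score, non-keyword score)."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[0] / sum(exps)

def pre_classify(probability, threshold=0.65):
    """Return True for a pre-classified keyword, False for a non-keyword."""
    return probability >= threshold
```

A probability exactly equal to the threshold counts as a keyword, matching the "greater than or equal to" condition above.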
[0063] Specifically, the natural semantic vector of a natural segment that does not have a corresponding combined segment is denoted as a zero vector.
[0064] Understandably, in order to improve computational efficiency when calculating the semantic similarity between the combined semantic vector and the corresponding natural semantic vector, the natural segments that do not have corresponding combined segments are directly added to the reconstructed text to generate a standard content summary. Therefore, the natural semantic vector of the natural segment that does not have a corresponding combined segment is denoted as the zero vector.
[0065] Figure 3 is a flowchart of the logic for text recombination based on semantic similarity according to an embodiment of the present invention. Text recombination is performed based on the semantic similarity as follows: if the semantic similarity is greater than or equal to the similarity threshold, the combined segment is determined to be usable as a recombined segment for text recombination to generate a standard content summary; if the semantic similarity is less than the similarity threshold, the combined segment is determined not to be usable as a recombined segment for text recombination to generate a standard content summary.
[0066] In one specific embodiment, a similarity threshold of 0.85 is set. If the semantic similarity is 0.97, which is greater than the similarity threshold, it is determined that the combined segment can be used as a recombined segment for text recombination to generate a standard content summary. If the semantic similarity is 0.71, which is less than the similarity threshold, then the combined segment is determined not to be a recombined segment for text recombination to generate a standard content summary.
[0067] Understandably, in order to ensure that the combined segments selected based on the similarity threshold can clearly indicate the semantics of the corresponding natural segments and make the text reconstruction more complete, the similarity threshold is generally set in the range of 0.8 to 0.9.
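The similarity check can be sketched as follows. Cosine similarity is an assumption, since the description does not name the similarity measure; the default threshold of 0.85 is the value from the specific embodiment. Because a zero vector has no direction, the sketch returns 0.0 for it, so a natural segment whose vector is denoted as the zero vector never passes the threshold and is handled separately.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two vectors; 0.0 if either is the zero vector."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # zero vector: no corresponding combined segment
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def is_recombined_segment(similarity, threshold=0.85):
    """Apply the 'greater than or equal to' rule to a similarity value."""
    return similarity >= threshold

# Values from the specific embodiment:
print(is_recombined_segment(0.97))  # True
print(is_recombined_segment(0.71))  # False
```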
[0068] Specifically, the process of text recombination based on semantic similarity includes the following: if it is determined that the combined segment cannot be used as a recombined segment for text recombination to generate a standard content summary, the natural segment corresponding to the combined segment is used as a replacement segment for text recombination to generate a standard content summary; natural segments without corresponding combined segments are used as supplementary segments for text recombination to generate a standard content summary.
[0069] Understandably, when it is determined that the combined segments cannot be used as recombined segments for text recombination to generate standard content summaries, in order to ensure that the summaries generated by each segment do not lack the core content of the original text, the inaccurate combined segments are replaced with the natural segments corresponding to the combined segments, and natural segments that do not have corresponding combined segments are added to the recombined text to make the semantics of the generated summaries complete.
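The three cases (recombined, replacement, and supplementary segments) can be sketched as one pass over the natural segments. The `Segment` structure and its field names are hypothetical and exist only for this illustration; the similarity values are assumed to have been computed beforehand.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Segment:
    natural_text: str                     # natural segment from the original text
    combined_text: Optional[str] = None   # combined segment, if one exists
    similarity: Optional[float] = None    # similarity of combined vs. natural

def recombine(segments: List[Segment], threshold: float = 0.85) -> List[Tuple[str, str]]:
    """Return (role, text) pairs in original order for the recombined text."""
    result = []
    for seg in segments:
        if seg.combined_text is None:
            # No combined segment: add the natural segment as a supplement.
            result.append(("supplementary", seg.natural_text))
        elif seg.similarity is not None and seg.similarity >= threshold:
            # Combined segment passed the similarity check.
            result.append(("recombined", seg.combined_text))
        else:
            # Combined segment failed: fall back to the natural segment.
            result.append(("replacement", seg.natural_text))
    return result
```

Keeping the role label next to each piece of text makes it straightforward to run the later relevance check between recombined segments and the replacement or supplementary segments.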
[0070] Specifically, this invention pre-classifies keywords based on the probability distribution output by a classifier. By setting a probability threshold, it can filter out words with high core semantic relevance in the text to generate combined segments, and filter out redundant words with low semantic relevance or irrelevant meaning, thus improving the efficiency of standard content summary generation. Based on the comparison results of the similarity between the combined segments and the corresponding natural segments and their thresholds, for combined segments that cannot be used as recombined segments, their corresponding natural segments are used as replacement segments for text recombination to improve the semantic integrity of the recombined text. At the same time, for natural segments that do not have corresponding combined segments, they are used as supplementary segments to fill in the recombined text, ensuring that the summary content generated by the recombined text is consistent with the meaning of the original text, thus improving the accuracy of the standard content summary generation method that integrates embedding vectors and semantic supervision.
[0071] Specifically, the process of calculating the semantic relevance between each recombined segment and each remaining pre-classified keyword in the recombined text includes calculating the semantic relevance between each recombined segment and the replacement segment, as well as the semantic relevance between each recombined segment and the supplementary segment.
[0072] Understandably, the semantic relevance between each recombined segment and the replacement segment, as well as the semantic relevance between each recombined segment and the supplementary segment, directly affects the comprehensiveness and semantic fluency of the summary jointly generated by the segments. Among them, the recombined segments have passed the semantic similarity verification with natural text and have been determined to be the content for generating the summary. The replacement segments and supplementary segments are both corresponding natural text segments, so it is necessary to test the semantic relevance between them and the recombined segments separately.
[0073] Figure 4 is a flowchart of the logic for determining whether a pre-classified keyword is a keyword based on semantic relevance in an embodiment of the present invention. The process of determining whether a pre-classified keyword is a keyword based on semantic relevance includes: if the semantic relevance is less than the relevance threshold, the pre-classified keyword is determined to be a non-keyword; if the semantic relevance is greater than or equal to the relevance threshold, the pre-classified keyword is determined to be a keyword.
[0074] In one specific embodiment, the relevance threshold is set to 0.7. If the semantic relevance is 0.63, which is less than the relevance threshold, then the pre-classified keyword is determined to be a non-keyword. If the semantic relevance score is 0.84, which is greater than the relevance threshold, then the pre-classified keyword is determined to be a keyword.
[0075] It is understandable that semantic relevance is used to ensure that the standard content summary generated by each segment can accurately summarize the content of the original text and to avoid omitting key segments as much as possible. Therefore, the relevance threshold is generally set between 0.6 and 0.8.
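The relevance check can be sketched as a simple partition. How the semantic relevance itself is computed is not specified here, so the sketch takes precomputed relevance values as input; the function name and the dictionary layout are assumptions, and the default threshold of 0.7 is the value from the specific embodiment.

```python
def verify_keywords(relevance_by_keyword, threshold=0.7):
    """Split pre-classified keywords into confirmed keywords and non-keywords.

    relevance_by_keyword: mapping from a pre-classified keyword to its
    precomputed semantic relevance (how it is computed is outside this sketch).
    """
    keywords, non_keywords = [], []
    for term, relevance in relevance_by_keyword.items():
        if relevance >= threshold:
            keywords.append(term)
        else:
            non_keywords.append(term)
    return keywords, non_keywords
```

With the embodiment's values, a term with relevance 0.63 lands in the non-keyword list and a term with relevance 0.84 in the keyword list.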
[0076] Specifically, non-keywords in the text are split or combined into several words or several single characters to regenerate word vectors or character vectors.
[0077] It is understandable that, in practice, a word may be determined to be a non-keyword because the text was split insufficiently or because the word vectors and character vectors were extracted incompletely. Therefore, non-keywords are processed again iteratively to recombine the text and generate a standard content summary.
[0078] Specifically, this invention further verifies the correlation between natural segments that cannot be directly used as recombined segments and combined segments that can be directly used as recombined segments by calculating the semantic correlation between each recombined segment and the replacement segment, as well as the semantic correlation between each recombined segment and the supplementary segment. By setting a threshold, segments with a correlation higher than the threshold are identified as keywords that can generate summaries. For non-keywords that do not meet the requirements, vocabulary is re-segmented and multi-granular feature extraction is performed to more meticulously uncover the core information that was missed in the initial screening. This ensures that the generated summary can cover all the core content of the original text. From keyword pre-classification to text recombination, and then to semantic correlation verification and non-keyword iteration, the entire process forms a semantic supervision link, which improves the accuracy of the standard content summary generation method that integrates embedded vectors and semantic supervision.
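The iteration over non-keywords can be sketched as a bounded loop. This is a heavily simplified illustration: `relevance_of` and `resegment` are hypothetical callables standing in for the relevance calculation and the re-splitting step, the intermediate text recombination is omitted, and the `max_rounds` limit is an assumption, since no stopping condition is stated.

```python
def supervise(candidates, relevance_of, resegment, threshold=0.7, max_rounds=3):
    """Confirm keywords, re-split rejected terms, and check the pieces again.

    relevance_of(term) -> float and resegment(term) -> list of smaller terms
    are caller-supplied placeholders for the steps described in the text.
    """
    keywords = []
    pending = list(candidates)
    for _ in range(max_rounds):
        if not pending:
            break
        rejected = []
        for term in pending:
            if relevance_of(term) >= threshold:
                keywords.append(term)
            else:
                rejected.append(term)
        # Re-split each rejected term; terms that cannot be split drop out.
        pending = [piece for term in rejected for piece in resegment(term)]
    return keywords, pending
```

The second return value holds any terms still unresolved when the round limit is reached, so the caller can decide whether to discard them or iterate further.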
[0079] The technical solution of the present invention has been described above with reference to the preferred embodiments shown in the accompanying drawings. However, it will be readily understood by those skilled in the art that the scope of protection of the present invention is obviously not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after these changes or substitutions will all fall within the scope of protection of the present invention.
Claims
1. A standard content summarization generation method that integrates embedded vectors and semantic supervision, characterized in that it comprises: dividing the text into several words, inputting each word into the BERT model to generate word vectors, inputting the word vectors into the BiLSTM model to update the hidden layer variables, processing the hidden layer variables using an attention mechanism, and calculating word feature representation vectors; dividing the text into several characters, inputting each character into the BERT model to generate character vectors, inputting the character vectors into the TextCNN neural network to generate several feature map vectors, and selecting and concatenating the maximum response value of each feature map vector to generate character feature representation vectors; concatenating the word feature representation vector and the character feature representation vector to generate a concatenated feature vector, selecting the concatenated feature vectors corresponding to several time points to generate a concatenated feature vector matrix, and calculating the probability distribution of the concatenated feature vector matrix with a classifier; performing keyword pre-classification based on the probability distribution, combining several pre-classified keywords to generate combined segments, obtaining the combined semantic vector of the combined segments, and calculating the natural semantic vectors of several natural segments; calculating the semantic similarity between the combined semantic vector and the corresponding natural semantic vector, and performing text recombination based on the semantic similarity; calculating the semantic relevance between each recombined segment and each remaining pre-classified keyword in the recombined text, and determining whether the pre-classified keyword is a keyword based on the semantic relevance; further segmenting the non-keywords in the text into words and repeating the above steps; wherein the hidden layer variables include the outputs of the forward neural network and the backward neural network, and a natural segment is a segment of the text divided by punctuation marks.
2. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 1, characterized in that the process of calculating the word feature representation vectors includes: concatenating the updated hidden layer variables to form the hidden layer state based on the BiLSTM model; and inputting the hidden layer state into the attention matrix to obtain the word feature representation vector.
3. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 2, characterized in that the process of selecting the maximum response value of each feature map vector and concatenating them to generate a character feature representation vector includes: obtaining a single feature map from the convolution operation; combining several feature maps to generate a feature map vector; and obtaining the maximum response value of each feature map vector and concatenating these values to generate the character feature representation vector.
4. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 3, characterized in that the keyword pre-classification is performed based on the probability distribution as follows: if the keyword probability is greater than or equal to the probability threshold, the word is determined to be a keyword; if the keyword probability is less than the probability threshold, the word is determined to be a non-keyword.
5. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 4, characterized in that the natural semantic vector of a natural segment that does not have a corresponding combined segment is denoted as the zero vector.
6. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 5, characterized in that the text recombination is performed based on the semantic similarity as follows: if the semantic similarity is greater than or equal to the similarity threshold, the combined segment is determined to be usable as a recombined segment for text recombination to generate a standard content summary; if the semantic similarity is less than the similarity threshold, the combined segment is determined not to be usable as a recombined segment for text recombination to generate a standard content summary.
7. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 6, characterized in that the process of text recombination based on the semantic similarity includes: if it is determined that the combined segment cannot be used as a recombined segment for text recombination to generate a standard content summary, using the natural segment corresponding to the combined segment as a replacement segment for text recombination to generate a standard content summary; and using natural segments without corresponding combined segments as supplementary segments for text recombination to generate a standard content summary.
8. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 7, characterized in that the process of calculating the semantic relevance between each recombined segment and each remaining pre-classified keyword in the recombined text includes: calculating the semantic relevance between each recombined segment and the replacement segment, as well as the semantic relevance between each recombined segment and the supplementary segment.
9. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 8, characterized in that the process of determining whether the pre-classified keyword is a keyword based on the semantic relevance includes: if the semantic relevance is less than the relevance threshold, the pre-classified keyword is determined to be a non-keyword; if the semantic relevance is greater than or equal to the relevance threshold, the pre-classified keyword is determined to be a keyword.
10. The standard content summarization generation method that integrates embedded vectors and semantic supervision according to claim 9, characterized in that the non-keywords in the text are split or combined into several words or several single characters to regenerate word vectors or character vectors.