Semantic analysis method and apparatus, electronic device, and storage medium

By dividing documents into text fragments and using a semantic structure model to determine label probabilities, and automatically aggregating the fragments, the shortcomings of existing models in semantic understanding and cross-language documents are addressed, achieving more accurate semantic structure and topic analysis.

WO2026124525A1 · PCT designated stage · Publication Date: 2026-06-18 · CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
CHINA TELECOM ARTIFICIAL INTELLIGENCE TECHNOLOGY (BEIJING) CO LTD
Filing Date
2025-12-10
Publication Date
2026-06-18

AI Technical Summary

Technical Problem

Existing models struggle to accurately capture the semantic coherence and themes within a document, especially when dealing with short texts and cross-language documents, where they fail to effectively understand semantic structure. Furthermore, reliance on manual annotation can easily lead to inaccurate labeling.

Method used

The document is divided into multiple text fragments, and each fragment is assigned a label. The probability of the label and text fragment is determined through a semantic structure model. The process is iteratively optimized until preset conditions are met, and fragments with the same label are automatically aggregated to generate new text fragments.

Benefits of technology

It improves the accuracy of semantic structure and topic analysis across document collections, reduces the workload of manual annotation, and makes labeling more accurate and efficient.


Abstract

Embodiments of the present application provide a semantic analysis method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a target document; dividing the target document into a plurality of text fragments, and configuring label sets for the plurality of text fragments; inputting the plurality of text fragments and the plurality of label sets corresponding thereto into a semantic structure model; determining, by means of the semantic structure model, generation probabilities of words in each text fragment; and determining target words from the words in each text fragment on the basis of the generation probabilities of the words in each text fragment, and using the target words as a new label set for the text fragment; after determining new label sets for the plurality of text fragments, returning to the step of inputting into the semantic structure model, until a preset condition is satisfied; and when the preset condition is satisfied, aggregating text fragments having identical labels on the basis of the label sets corresponding to the plurality of text fragments, so as to obtain a plurality of new text fragments.

Description

Semantic analysis method and apparatus, electronic device, and storage medium

[0001] Related applications

[0002] This application claims priority to Chinese patent application No. 2024118143802, filed on December 10, 2024, entitled "A Semantic Analysis Method, Apparatus, Electronic Device and Storage Medium", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of large language model technology, and in particular to a semantic analysis method, a semantic analysis device, an electronic device, and a computer-readable storage medium.

Background Technology

[0004] Some existing models can handle large and complex vocabularies, but they primarily focus on word distribution without considering word order and contextual relationships within a document. They utilize pre-trained word vectors to enhance topic modeling and improve the semantic expressiveness of topics. However, they struggle to capture the semantic coherence and themes within a document and are insufficient for handling short texts and cross-lingual documents. Other models iteratively adjust and optimize model parameters to estimate the probability distribution between documents and potential topics, as well as the probability distribution between potential topics and words, revealing the potential relationships between different documents and words in a document set, thereby aiding in understanding the semantic structure of the document set. However, these models have a higher barrier to entry, requiring manual interpretation and annotation of topics, and they struggle to effectively capture deep-seated relationships and contextual information between words.

[0005] When documents are processed with these methods, it is difficult to infer the topic distribution of new, unseen documents, to supply enough contextual information for the topics of short texts to be identified accurately, and to quickly understand and handle semantic differences between languages in cross-language documents. These methods also cannot accurately mine the high-level semantic structure and latent topics of a document collection, and relying on manual interpretation to annotate topic semantics easily leads to inaccurate labels.

Summary of the Invention

[0006] To address the aforementioned problems, a first aspect of this application provides a semantic analysis method, the method comprising:

[0007] Obtain the target document;

[0008] The target document is divided into multiple text fragments, and a tag set is configured for each of the multiple text fragments;

[0009] The multiple text fragments and their corresponding tag sets are input into a semantic structure model. Using the semantic structure model, the probability of each tag in the tag set appearing in the corresponding text fragment is determined, as is the distribution probability of each tag in the tag set and the probability of the text fragment appearing in all text fragments. Based on the probability of each tag appearing in the corresponding text fragment, the distribution probability of each tag in the tag set in all text fragments, and the probability of the text fragment appearing in all text fragments, the probability of word generation in the text fragment is determined. Based on the probability of word generation in the text fragment, a target word is determined from the words in the text fragment, and the target word is used as a new tag set for the text fragment.

[0010] After determining the new tag sets for the multiple text fragments, return to the step of inputting the multiple text fragments and their corresponding tag sets into the semantic structure model, until the preset conditions are met;

[0011] When the preset conditions are met, text fragments with the same tags are aggregated according to the tag set corresponding to the multiple text fragments to obtain multiple new text fragments.
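The iterate-then-aggregate flow of the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: `relabel` is a hypothetical callable standing in for the semantic structure model together with target-word selection, and the stopping rule (label sets stop changing, or `max_iters` is reached) is assumed, because the preset condition is not specified in this passage. Fragments are treated as having the same tags when their tag sets are identical, which is likewise an assumption.

```python
from collections import defaultdict


def refine_labels(fragments, label_sets, relabel, max_iters=10):
    """Re-label every fragment until the label sets stop changing.

    relabel(fragment, label_set) is a hypothetical stand-in for the
    semantic structure model plus target-word selection.
    """
    current = [frozenset(labels) for labels in label_sets]
    for _ in range(max_iters):
        updated = [frozenset(relabel(frag, labels))
                   for frag, labels in zip(fragments, current)]
        if updated == current:  # assumed preset condition: labels converged
            break
        current = updated
    return current


def aggregate_by_labels(fragments, label_sets):
    """Join fragments whose label sets are identical into one new fragment."""
    groups = defaultdict(list)
    for frag, labels in zip(fragments, label_sets):
        groups[frozenset(labels)].append(frag)
    # Fragments keep their original order inside each group.
    return {labels: " ".join(frags) for labels, frags in groups.items()}
```

Using frozensets as dictionary keys makes two fragments land in the same group regardless of the order in which their tags were produced.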

[0012] Optionally, dividing the target document into multiple text segments includes:

[0013] The sentences in the target document are preprocessed, and a vector is generated for each sentence;

[0014] The similarity between adjacent sentences in the target document is calculated based on the sentence vectors;

[0015] The similarity between the adjacent sentences is compared with a similarity threshold;

[0016] If the similarity between the adjacent sentences is greater than the similarity threshold, the adjacent sentences are merged into a text segment.

[0017] Optionally, determining the target word from the words in the text segment based on the generation probability of the words in the text segment includes:

[0018] The words in the text segment are sorted according to their generation probability.

[0019] Based on the sorting results, the target word is determined from the words in the text segment.

[0020] Optionally, sorting the words in the text segment according to their generation probabilities includes:

[0021] The words in the text segment are sorted in descending order based on their generation probability.

[0022] Optionally, the semantic structure model is trained in the following manner:

[0023] Obtain the training document set;

[0024] Each training document in the training document set is divided into multiple text segments, and a tag set is configured for each of the multiple text segments;

[0025] Input the text fragment and its corresponding tag set into the large language model to obtain the word generation probability of the text fragment;

[0026] Determine a new set of tags for the text fragment based on the word generation probability;

[0027] After determining the new tag set for the text fragment, return to the step of inputting the text fragment and the corresponding tag set into the large language model until the preset conditions are met;

[0028] When the preset conditions are met, text fragments with the same tags are aggregated according to the tag set corresponding to the multiple text fragments to obtain multiple new text fragments;

[0029] Obtain the labeling accuracy of the new text fragment;

[0030] The large language model is adjusted based on the labeling accuracy to obtain the semantic structure model.

[0031] Optionally, dividing each training document in the training document set into multiple text segments includes:

[0032] The sentences in the training document set are preprocessed, and a vector is generated for each sentence;

[0033] The similarity between adjacent sentences in the training document set is calculated based on the sentence vectors;

[0034] The similarity between the adjacent sentences is compared with a similarity threshold;

[0035] If the similarity between the adjacent sentences is greater than the similarity threshold, the adjacent sentences are merged into a text segment, thereby obtaining multiple text segments.

[0036] Optionally, the step of inputting the text fragment and its corresponding label into a large language model to obtain the word generation probability of the text fragment includes:

[0037] Using the large language model, the probability of each tag in the tag set appearing in the corresponding text segment is determined, the distribution probability of each tag in the tag set is determined, and the probability of the text segment appearing in the training document set is determined.

[0038] Based on the occurrence probability of each tag in the corresponding text segment, the distribution probability of each tag in the tag set in the training document set, and the occurrence probability of the text segment in the training document set, the generation probability of words in the text segment is determined.

[0039] Optionally, obtaining the labeling accuracy of the new text fragment includes:

[0040] Obtain the new label and the actual label of the text fragment;

[0041] Based on the new labels and the actual labels of the text fragments, determine the number of text fragments that the large language model labeled correctly;

[0042] Based on the new label and the actual label of the text fragment, determine the number of labeling errors made by the large language model in labeling the text fragment;

[0043] The labeling accuracy rate is determined based on the number of correct labels and the number of incorrect labels;

[0044] The learning rate and number of iterations of the large language model are adjusted based on the labeling accuracy.
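The counting steps above can be illustrated with a short Python sketch. The source does not give the accuracy formula, so the ratio correct / (correct + incorrect) used here is an assumption, as is the rule that a fragment counts as correctly labeled only when its new label set equals its actual label set. The function names are hypothetical. How the accuracy then drives the learning rate and iteration count is not specified, so it is not sketched.

```python
def count_label_outcomes(new_labels, actual_labels):
    """Return (correct, incorrect) counts over aligned lists of label sets."""
    if len(new_labels) != len(actual_labels):
        raise ValueError("both lists must cover the same text fragments")
    # Assumption: a fragment is correct only if the two label sets match exactly.
    correct = sum(1 for new, actual in zip(new_labels, actual_labels)
                  if set(new) == set(actual))
    return correct, len(new_labels) - correct


def labeling_accuracy(correct, incorrect):
    """Assumed formula: share of correctly labeled fragments."""
    total = correct + incorrect
    return correct / total if total else 0.0
```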

[0045] Optionally, the similarity calculation method includes cosine similarity, Euclidean distance, Manhattan distance, or Jaccard similarity.
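For reference, the four measures named above can be written in plain Python as below. This is a generic sketch of the standard definitions, not code from the application; the function names are illustrative, and the handling of zero vectors and empty sets is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)
```

Cosine and Jaccard are similarities (larger means more alike), whereas Euclidean and Manhattan are distances (smaller means more alike), so a "greater than the threshold" test would need to be inverted or the distance converted before it is applied to the latter two.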

[0046] A second aspect of this application provides a semantic analysis apparatus, the apparatus comprising:

[0047] The document acquisition module is used to acquire the target document.

[0048] The tag configuration module is used to divide the target document into multiple text fragments and configure a tag set for the multiple text fragments;

[0049] A probability determination module is used to input the multiple text fragments and their corresponding multiple tag sets into a semantic structure model; through the semantic structure model, determine the occurrence probability of each tag in the tag set in the corresponding text fragment, determine the distribution probability of each tag in the tag set, and determine the occurrence probability of the text fragment in all text fragments; based on the occurrence probability of each tag in the corresponding text fragment, the distribution probability of each tag in the tag set in all text fragments, and the occurrence probability of the text fragment in all text fragments, determine the generation probability of words in the text fragments; based on the generation probability of words in the text fragments, determine the target word from the words in the text fragments, and use the target word as a new tag set for the text fragments;

[0050] The condition determination module is used to return to the step of inputting the multiple text fragments and their corresponding multiple tag sets into the semantic structure model after the new tag sets of the multiple text fragments are determined, until the preset conditions are met;

[0051] The text aggregation module is used to aggregate text fragments with the same tags according to the tag set corresponding to the multiple text fragments when the preset conditions are met, so as to obtain multiple new text fragments.

[0052] Optionally, the label configuration module includes:

[0053] The vector generation submodule is used to preprocess the sentences in the target document and generate vectors of the sentences;

[0054] The similarity determination submodule is used to calculate the similarity between adjacent sentences in the target document based on the sentence vectors;

[0055] The first similarity comparison submodule is used to compare the similarity between the adjacent sentences with a similarity threshold;

[0056] The sentence merging submodule is used to merge the adjacent sentences into a text segment if the similarity between them is greater than the similarity threshold.

[0057] Optionally, the probability determination module includes:

[0058] The word sorting submodule is used to sort the words in the text segment according to the generation probability of the words in the text segment;

[0059] The word determination submodule is used to determine the target word from the words in the text segment based on the sorting results.

[0060] Optionally, the semantic structure model is trained using the following modules:

[0061] The training document acquisition module is used to acquire a set of training documents;

[0062] The training label configuration module is used to divide each training document in the training document set into multiple text segments and configure a label set for the multiple text segments.

[0063] The word probability generation module is used to input the text fragment and the corresponding tag set of the text fragment into the large language model to obtain the word generation probability of the text fragment;

[0064] A tag determination module is used to determine a new set of tags for the text segment based on the word generation probability.

[0065] The return operation module is used to return to the step of inputting the text fragment and the corresponding tag set into the large language model after determining the new tag set of the text fragment, until the preset conditions are met;

[0066] The text fragment aggregation module is used to aggregate text fragments with the same tags according to the tag set corresponding to the multiple text fragments when the preset conditions are met, so as to obtain multiple new text fragments.

[0067] The accuracy acquisition module is used to acquire the labeling accuracy of the new text fragment;

[0068] The semantic model generation module is used to adjust the large language model based on the labeling accuracy of the labels to obtain the semantic structure model.

[0069] Optionally, the training label configuration module includes:

[0070] The vector determination submodule is used to preprocess the sentences in the training document set and generate vectors of the sentences;

[0071] The similarity calculation submodule is used to calculate the similarity between adjacent sentences in the training document set based on the sentence vectors;

[0072] The second similarity comparison submodule is used to compare the similarity between the adjacent sentences with a similarity threshold;

[0073] The text fragment generation submodule is used to merge the adjacent sentences into a text fragment if the similarity between them is greater than the similarity threshold, thereby obtaining multiple text fragments.

[0074] Optionally, the word probability generation module includes:

[0075] The label probability determination submodule is used to determine, through the large language model, the probability of occurrence of each label in the label set in the corresponding text segment, the distribution probability of each label in the label set, and the probability of occurrence of the text segment in the training document set;

[0076] The word probability determination submodule is used to determine the generation probability of words in the text segment based on the occurrence probability of each tag in the corresponding text segment, the distribution probability of each tag in the tag set in the training document set, and the occurrence probability of the text segment in the training document set.

[0077] Optionally, the accuracy acquisition module includes:

[0078] The tag acquisition submodule is used to acquire new tags for the text fragment and the actual tags for the text fragment.

[0079] The correct number determination submodule is used to determine the number of text fragments that the large language model labeled correctly, based on the new label of the text fragment and the actual label of the text fragment;

[0080] The error count determination submodule is used to determine the number of labeling errors of the text fragment by the large language model based on the new label of the text fragment and the actual label of the text fragment;

[0081] The accuracy determination submodule is used to determine the labeling accuracy based on the number of correct labels and the number of incorrect labels;

[0082] The adjustment submodule is used to adjust the learning rate and number of iterations of the large language model based on the labeling accuracy.

[0083] According to a third aspect of this application, an electronic device is provided, comprising: a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program, when executed by the processor, implements the steps of a semantic analysis method as described above.

[0084] According to a fourth aspect of this application, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of a semantic analysis method as described above.

[0085] Details of one or more embodiments of this application are set forth in the following drawings and description. Other features, objects, and advantages of this application will become apparent from the specification, drawings, and claims.

Description of the Drawings

[0086] Figure 1 is a flowchart of the steps of a semantic analysis method provided in an embodiment of this application;

[0087] Figure 2 is a flowchart of the semantic structure model training steps of a semantic analysis method provided in an embodiment of this application;

[0088] Figure 3 is a schematic diagram of the semantic structure model training process of a semantic analysis method provided in an embodiment of this application;

[0089] Figure 4 is a structural block diagram of a semantic analysis device provided in an embodiment of this application.

Detailed Description

[0090] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, this application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0091] Existing models cannot accurately uncover the high-level semantic structure and potential topics in document collections, and relying on manual interpretation of topic annotation semantics can easily lead to inaccurate labels.

[0092] One of the core concepts of this application is to divide a document into multiple text fragments and assign tags to these fragments. The text fragments and their corresponding tags are then input into a semantic structure model to determine the probability of each tag appearing in the document, the probability distribution of tags within the document and the text fragments, and the probability of each text fragment appearing in the document. These probabilities are used to determine word generation probabilities and thereby identify target words. New tags are generated from the target words, and the process is repeated until preset conditions are met. Once the preset conditions are met, text fragments with the same tags are aggregated into new text fragments, each of which groups content sharing the same semantic structure information. In this way the semantic structure information and topics of a document collection can be obtained accurately, and because the tags are produced by the semantic structure model rather than by manual interpretation, topic annotation becomes more accurate.

[0093] Referring to Figure 1, a flowchart of the steps of a semantic analysis method provided in an embodiment of this application is shown. The method may specifically include the following steps:

[0094] Step 101, Obtain the target document;

[0095] The target document typically refers to the specific document that a user or system hopes to find or process in tasks such as information retrieval, text classification, and clustering.

[0096] In this embodiment, the document to be processed is obtained as the target document, and semantic analysis is performed on the document.

[0097] Step 102: Divide the target document into multiple text fragments and configure a tag set for the multiple text fragments;

[0098] A text fragment is a portion of a complete text, typically a combination of sentences, phrases, or words. In applications of large language models (such as GPT and BERT), text fragments can be used for various tasks, including but not limited to text classification, information retrieval, clustering, and generation.

[0099] Labels are typically used to represent the category, attribute, or feature of data. Labels can be used in supervised learning tasks such as classification and regression, as well as unsupervised learning tasks such as clustering and dimensionality reduction. In applications of large language models (such as GPT and BERT), labels can be used for a variety of tasks, including but not limited to text classification, information retrieval, clustering, and generation.

[0100] In this embodiment, the document needs to be divided into multiple text fragments, and a tag set needs to be configured for each text fragment. The document is divided into multiple text fragments, each containing relatively complete semantic information. Based on this, a set of tags is initialized; these tags are lists of multiple words that will be used for subsequent operations.

[0101] In some embodiments, step 102 includes the following sub-steps:

[0102] Sub-step S11: Preprocess the sentences in the target document and generate vectors of the sentences;

[0103] Preprocessing sentences in a target document and generating sentence vectors typically includes: text cleaning, word segmentation, feature extraction, and vectorization.

[0104] Text cleaning primarily removes noise and unnecessary characters from text, making it cleaner and more standardized. Tokenization divides the text into words or phrases for easier subsequent processing. Feature extraction extracts meaningful features from the segmented text to represent it. The extracted features are then converted into numerical vectors for easier processing by machine learning models.

[0105] In this embodiment, each sentence in the input target document is preprocessed, including document segmentation and stop word removal. The preprocessed sentences are then input into the pre-trained model BERT to obtain the vector representation of the sentences.
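The preprocessing part of this sub-step can be sketched as follows. This is a simplified Python illustration for English text only: the regular expressions and the small `STOP_WORDS` set are assumptions made for the example, and the BERT encoding step is deliberately left out because it depends on the particular pre-trained model and library in use.

```python
import re

# Illustrative stop-word list; a real system would use a fuller,
# language-specific list.
STOP_WORDS = {"a", "an", "and", "in", "is", "it", "of", "the"}


def split_sentences(document):
    """Split on whitespace that follows sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [part for part in parts if part]


def remove_stop_words(sentence):
    """Lower-case the sentence, tokenize it, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", sentence.lower())
    return [token for token in tokens if token not in STOP_WORDS]


def preprocess(document):
    """Return one list of retained tokens per sentence."""
    return [remove_stop_words(s) for s in split_sentences(document)]
```

The token lists returned by `preprocess` would then be joined back into text or passed directly to whichever encoder produces the sentence vectors.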

[0106] Sub-step S12: Calculate the similarity between adjacent sentences in the target document based on the sentence vectors;

[0107] Similarity is a metric that measures how similar two objects are. In Natural Language Processing (NLP), similarity is commonly used to compare objects such as text, vectors, and images. Common similarity metrics include cosine similarity, Euclidean distance, Manhattan distance, and Jaccard similarity.

[0108] In this embodiment, the sentence vectors of the target document are first obtained from the pre-trained model, and cosine similarity is then calculated between the vectors of adjacent sentences. The closer the calculated value is to 1, the more similar the sentences are. Vectorizing the sentences and calculating the similarity between adjacent sentences in this way effectively measures how similar the sentences are.

[0109] Sub-step S13: Compare the similarity between the adjacent sentences with a similarity threshold;

[0110] A similarity threshold is a critical value used in similarity calculations to determine whether two objects are similar. By setting a similarity threshold, dissimilar objects can be filtered out, while similar objects are retained.

[0111] In this embodiment, a similarity threshold is set, and the calculated similarity between adjacent sentences is compared with the similarity threshold to determine which sentences are similar.

[0112] Sub-step S14: If the similarity between the adjacent sentences is greater than the similarity threshold, the adjacent sentences are merged into a text segment.

[0113] In this embodiment, a similarity threshold is set, and the calculated similarity scores of adjacent sentences are compared with the threshold. If the similarity scores of two adjacent sentences exceed the threshold, they are considered semantically similar and can be considered as a single text segment. This merging operation helps us better understand the structure and semantics of the text, thereby enabling more in-depth analysis and processing.
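Sub-steps S13 and S14 can be sketched as a single pass over the sentences. In this minimal Python illustration, `similarities[i]` is assumed to hold the already computed similarity between sentence `i` and sentence `i + 1`; the function name and the list-of-lists output are choices made for the example.

```python
def merge_adjacent(sentences, similarities, threshold):
    """Group consecutive sentences whose similarity exceeds the threshold.

    similarities[i] is the similarity between sentences[i] and sentences[i + 1].
    """
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("expected one similarity per adjacent sentence pair")
    segments = [[sentences[0]]]
    for sentence, similarity in zip(sentences[1:], similarities):
        if similarity > threshold:
            segments[-1].append(sentence)  # same segment as the previous sentence
        else:
            segments.append([sentence])    # start a new text segment
    return segments
```

With three sentences, similarities of 0.9 and 0.3, and a threshold of 0.8, the first two sentences form one segment and the third forms its own, which matches the worked example in which Sentence 1 and Sentence 2 are similar and Sentence 3 is not.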

[0114] For example, there is a document that states: "Natural Language Processing (NLP) is a branch of artificial intelligence and linguistics. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural Language Processing and information retrieval are related fields, both involving the understanding and generation of language."

[0115] Preprocessing the sentences in the target document includes: Document segmentation: splitting the document into sentences or smaller text segments. Sentence 1: "Natural Language Processing (NLP) is a branch of artificial intelligence and linguistics." Sentence 2: "It studies the theories and methods that enable effective communication between humans and computers using natural language." Sentence 3: "Natural Language Processing and information retrieval are related fields, both involving the understanding and generation of language." Stop-word removal: deleting stop words such as "of" and "is".

[0116] Divide the target document into multiple text segments, and configure a set of tags for the multiple text segments. Initialize a tag for each sentence. These tags can be keywords or phrases related to the sentence content. Sentence 1 tags: ["Natural Language Processing", "Artificial Intelligence"], Sentence 2 tags: ["Natural Language Processing", "Communication"], Sentence 3 tags: ["Natural Language Processing", "Information Retrieval"].

[0117] Calculate the similarity of adjacent statements in the target document based on the vectors of the statements, obtaining the similarity of adjacent statements. Use the BERT model to encode each sentence and calculate the similarity between sentences. Sentence encoding: Input each sentence into the BERT model to obtain the vector representation of the sentence. Similarity calculation: Calculate the cosine similarity between sentence vectors.

[0118] Compare the similarity of adjacent statements with the similarity threshold. If the similarity of adjacent statements is greater than the similarity threshold, merge the adjacent statements into a text segment. Compare the similarity of Sentence 1 and Sentence 2 and find that their similarity is high, so they may belong to the same topic. Compare the similarity of Sentence 2 and Sentence 3 and find that their similarity is low, so they may belong to different topics.

[0119] Update the tags after merging into text segments: according to the output of the BERT model, adjust the tags of the sentences to better reflect their content and context. Sentence 1 tag update: ["Natural Language Processing", "Artificial Intelligence", "Linguistics"]; Sentence 2 tag update: ["Natural Language Processing", "Communication", "Human-computer Interaction"]; Sentence 3 tag update: ["Information Retrieval", "Language Understanding", "Language Generation"].

[0120] Step 103: Input the multiple text fragments and corresponding multiple tag sets into a semantic structure model; using the semantic structure model, determine the occurrence probability of each tag in the tag set in the corresponding text fragment, determine the distribution probability of each tag in the tag set, and determine the occurrence probability of the text fragment in all text fragments; based on the occurrence probability of each tag in the corresponding text fragment, the distribution probability of each tag in the tag set in all text fragments, and the occurrence probability of the text fragment in all text fragments, determine the generation probability of words in the text fragments; based on the generation probability of words in the text fragments, determine the target word from the words in the text fragments, and use the target word as a new tag set for the text fragments;

[0121] A semantic structure model is a model used to understand and represent the semantic relationships between words, phrases, and sentences in a language.

[0122] The probability of each label in the label set appearing in the corresponding text segment refers to the probability modeling p(x) k |t), the probability distribution of each label in the label set refers to p(x) k The probability of a text segment appearing in all text segments is p(t|x). k ,d).

[0123] The word generation probability of a text fragment is determined by the following formula:

[0124] p(x k ) represents the text fragment x k The probability of appearing in a document; probability modeling p(x) k |t) is used to predict the probability distribution of a text segment with a given label; the probability of the label is p(t|x). k ,d) represents the probabilities of the global and local perspectives, including: the global (document) probability distribution of tags and the local (text fragment) probability distribution of tags.

[0125] In this embodiment, text fragments are clustered based on the labels configured for them. For each category, each text fragment and its corresponding label are input into the semantic structure model. The semantic structure model first uses the label of a text fragment as a condition to perform probability modeling $p(x_k \mid t)$ on it; then, using the document text and the text fragment as conditions, it jointly calculates the label probability $p(t \mid x_k, d)$ from both global and local perspectives; and it determines the probability $p(x_k)$ of the text fragment $x_k$ appearing in the document.

[0126] The word generation probability is calculated from the probability modeling, the label probability, and the probability of the text fragment appearing in the document. Based on the generation probabilities of the words in the text fragment, target words are identified from the words in the text fragment, and these target words are used as a new tag set for the text fragment.

[0127] In some embodiments, step 103 includes the following sub-steps:

[0128] Sub-step S21: Sort the words in the text segment according to the generation probability of the words in the text segment;

[0129] In this embodiment, the word generation probability of a text segment is determined by a semantic structure model, and the words in the text segment are sorted according to their generation probability, with words with higher generation probability listed first and words with lower generation probability listed last.

[0130] Sub-step S22: Based on the sorting results, determine the target word from the words in the text segment.

[0131] In this embodiment, the words with the highest generation probability in the text fragment are retained as target words, and new tags are formed from these target words.
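As a non-limiting illustration, sub-steps S21 and S22 may be sketched as follows. The function name `select_target_words`, the parameter `top_k`, and the example probabilities are assumptions introduced only for this sketch; the application does not specify how many words are retained.

```python
from typing import Dict, List


def select_target_words(word_probs: Dict[str, float], top_k: int = 3) -> List[str]:
    """Rank words by generation probability and keep the top_k as the new tag set."""
    # Sub-step S21: sort in descending order of generation probability.
    # Ties are broken alphabetically so that the result is deterministic.
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    # Sub-step S22: the highest-ranked words are taken as the target words.
    return [word for word, _ in ranked[:top_k]]


# Hypothetical generation probabilities for the words of one text fragment.
probs = {"language": 0.31, "model": 0.27, "the": 0.02, "semantic": 0.22, "of": 0.01}
new_tags = select_target_words(probs, top_k=3)
print(new_tags)
```

Keeping a fixed number of words is only one possible choice; a probability cutoff could serve the same purpose.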

[0132] Step 104: After determining the new tag set of the multiple text fragments, return to the step of inputting the multiple text fragments and the corresponding multiple tag sets into the semantic structure model until the preset conditions are met;

[0133] In this embodiment, the words in the text segment are sorted according to their generation probability, and words with high generation probability are selected to form new labels. The process then returns to the step of inputting the multiple text segments and their corresponding multiple label sets into the semantic structure model.

[0134] The model takes multiple text fragments and their corresponding sets of labels as input, processes the input using a semantic structure model, and outputs new text fragments and sets of labels. Through multiple iterations of optimization, it improves the accuracy and semantic consistency of the labels. It sets preset conditions (such as the number of iterations, label accuracy, etc.) and determines whether the iteration termination condition is met. If the preset condition is not met, it returns to the previous step and inputs multiple text fragments and their corresponding sets of labels into the semantic structure model to redetermine the new set of labels. If the preset condition is met, it outputs the final text fragments and set of labels.
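The iteration described above may be sketched as follows, purely for illustration. The callable `relabel` is a hypothetical stand-in for one pass through the semantic structure model, and the stopping rule used here (a maximum number of iterations, or the tag sets no longer changing) is an assumed example of a preset condition. The application names the number of iterations and the label accuracy as examples and does not mention tag stability.

```python
from typing import Callable, Dict, List, Tuple

TagSets = Dict[int, List[str]]  # fragment index -> tag set


def iterate_tag_sets(
    fragments: List[str],
    tag_sets: TagSets,
    relabel: Callable[[List[str], TagSets], TagSets],
    max_iters: int = 10,
) -> Tuple[TagSets, int]:
    """Repeat relabeling until the tag sets stop changing or max_iters is reached."""
    for iteration in range(1, max_iters + 1):
        new_tag_sets = relabel(fragments, tag_sets)
        if new_tag_sets == tag_sets:  # assumed preset condition: tags are stable
            return tag_sets, iteration
        tag_sets = new_tag_sets
    return tag_sets, max_iters  # assumed preset condition: iteration budget used up


def toy_relabel(fragments: List[str], tag_sets: TagSets) -> TagSets:
    # Hypothetical stand-in for the model: tag each fragment with its longest word.
    return {i: [max(f.lower().split(), key=len)] for i, f in enumerate(fragments)}


fragments = ["Cats purr softly", "Dogs bark loudly"]
initial = {0: ["animal"], 1: ["animal"]}
final_tags, used = iterate_tag_sets(fragments, initial, toy_relabel)
print(final_tags, used)
```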

[0135] Step 105: When the preset conditions are met, aggregate text fragments with the same tags according to the tag set corresponding to the multiple text fragments to obtain multiple new text fragments.

[0136] In this embodiment, it is determined whether preset conditions are met, such as the number of iterations and label accuracy. If the preset conditions are met, text fragments with the same label are aggregated together to form new text fragments. Multiple new text fragments are generated, each text fragment corresponding to a label.

[0137] The aggregated text fragments are treated as new text fragments, and the tags corresponding to each new text fragment are retained. Text fragments belonging to the same tag have similar semantic information.
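A minimal sketch of the aggregation in step 105 follows. It assumes that "the same tags" means identical tag sets regardless of order, and that aggregated fragments are joined with a space in their original order. Both are assumptions for illustration, as is the function name `aggregate_by_tags`.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence


def aggregate_by_tags(
    fragments: Sequence[str], tag_sets: Sequence[Sequence[str]]
) -> Dict[FrozenSet[str], str]:
    """Group fragments whose tag sets are identical and join each group into one fragment."""
    groups: Dict[FrozenSet[str], List[str]] = defaultdict(list)
    for fragment, tags in zip(fragments, tag_sets):
        # frozenset makes the tag set hashable and independent of tag order.
        groups[frozenset(tags)].append(fragment)
    # Each new fragment keeps the tag set it was grouped under.
    return {tags: " ".join(parts) for tags, parts in groups.items()}


fragments = ["NLP studies language.", "Search engines rank pages.", "NLP is part of AI."]
tag_sets = [["nlp", "ai"], ["retrieval"], ["ai", "nlp"]]
merged = aggregate_by_tags(fragments, tag_sets)
for tags, text in merged.items():
    print(sorted(tags), "->", text)
```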

[0138] This embodiment divides the target document into multiple text fragments, enabling the semantic structure model to better classify these fragments. By aggregating text fragments with the same tags, the semantic structure model interprets the topic annotation semantics more accurately. Automated tag generation via the semantic structure model reduces manual annotation workload and improves efficiency.

[0139] Referring to Figure 2, a flowchart of the semantic structure model training steps of a semantic analysis method provided in an embodiment of this application is shown;

[0140] The semantic structure model is obtained by training a large language model. The large language model therefore requires continuous adjustment over multiple iterations before it yields the semantic structure model.

[0141] Step 201: Obtain the training document set;

[0142] A training document set is a collection of documents used to train machine learning models or natural language processing (NLP) systems. These documents typically contain a large amount of text data, which is used to allow the model to learn information such as the structure, semantics, and context of the language.

[0143] In this embodiment, a document used to train the semantic structure model is obtained as a training document, and multiple training documents are obtained as a training document set for training the semantic structure model.

[0144] Step 202: Divide each training document in the training document set into multiple text segments, and configure a tag set for each of the multiple text segments;

[0145] In this embodiment, each document in the training document set needs to be divided into multiple text segments, and a label set needs to be configured for each text segment. The document is divided into multiple text segments, each containing relatively complete semantic information. Based on this, a set of labels is initialized; these labels are lists of multiple words that will be used for subsequent operations.

[0146] In some embodiments, step 202 includes the following sub-steps:

[0147] Sub-step S31: Preprocess the statements in the training document set and generate vectors of the statements;

[0148] In this embodiment, each sentence in the input training document set is preprocessed, including document segmentation and stop word removal. The preprocessed sentences are then input into the pre-trained model BERT to obtain the vector representation of the sentences.

[0149] Sub-step S32: Calculate the similarity between adjacent statements in the training document set based on the vector of the statement to obtain the similarity between the adjacent statements;

[0150] In this embodiment, after the pre-trained model has produced the vectors of the sentences in the training document set, cosine similarity is used to calculate the similarity between the vectors of adjacent sentences. The closer the calculated similarity value is to 1, the more similar the sentences are. By vectorizing the sentences and calculating the similarity between adjacent sentences in the training document set, the degree of similarity between sentences can be effectively measured.

[0151] Sub-step S33: Compare the similarity of the adjacent statements with a similarity threshold;

[0152] In this embodiment, a similarity threshold is set, and the calculated similarity between adjacent statements is compared with the similarity threshold to determine similar statements.

[0153] Sub-step S34: If the similarity between adjacent statements is greater than the similarity threshold, then the adjacent statements are merged into a text segment to obtain multiple text segments.

[0154] In this embodiment, a similarity threshold is set, and the calculated similarity scores of adjacent sentences are compared with the threshold. If the similarity score of two adjacent sentences exceeds the threshold, the sentences are considered semantically similar and are merged into a single text segment. This merging operation helps capture the structure and semantics of the text, thereby enabling more in-depth analysis and processing.
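Sub-steps S32 to S34 may be illustrated by the following sketch. It assumes the sentence vectors have already been produced (for example by BERT, as in sub-step S31) and uses small hand-written vectors instead. The function names, the threshold value 0.8, and the example vectors are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def merge_adjacent(
    sentences: Sequence[str], vectors: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[str]:
    """Merge each sentence into the current segment when it is similar to its predecessor."""
    if not sentences:
        return []
    segments: List[List[str]] = [[sentences[0]]]
    for i in range(1, len(sentences)):
        # Sub-steps S32/S33: compare the adjacent-sentence similarity with the threshold.
        if cosine(vectors[i - 1], vectors[i]) > threshold:
            segments[-1].append(sentences[i])  # Sub-step S34: merge into the same segment.
        else:
            segments.append([sentences[i]])  # Otherwise start a new segment.
    return [" ".join(parts) for parts in segments]


sentences = ["NLP is a branch of AI.", "It studies human language.", "The weather is sunny."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # hypothetical sentence vectors
segments = merge_adjacent(sentences, vectors, threshold=0.8)
print(segments)
```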

[0155] Step 203: Input the text fragment and the corresponding tag set of the text fragment into the large language model to obtain the word generation probability of the text fragment;

[0156] In this embodiment, text segments are clustered based on the labels configured for them. For each group, each text segment and its corresponding label are input into the large language model. The large language model first uses the label of a text segment as a condition to perform probability modeling $p(x_k \mid t)$ on it; then, using the document text and the text segment as conditions, it jointly calculates the label probability $p(t \mid x_k, d)$ from both global and local perspectives; and it determines the probability $p(x_k)$ of the text segment $x_k$ appearing in the document.

[0157] The word generation probability is calculated from the probability modeling, the label probability, and the probability of the text fragment appearing in the document.

[0158] In some embodiments, step 203 includes the following sub-steps:

[0159] Sub-step S41: Using the large language model, determine the probability of each tag in the tag set appearing in the corresponding text segment, determine the distribution probability of each tag in the tag set, and determine the probability of the text segment appearing in the training document set;

[0160] In this embodiment, text fragments are clustered according to the labels configured for them. For each group of classifications, each text fragment and its corresponding label are input into the large language model. The large language model first uses the label of a certain text fragment as a condition to perform probability modeling, and determines the probability of each label in the label set appearing in the corresponding text fragment. Then, using the text and text fragment as conditions, it jointly calculates the distribution probability of each label in the label set from both global and local perspectives, and determines the probability of the text fragment appearing in the training document set.

[0161] Sub-step S42: Based on the occurrence probability of each tag in the corresponding text segment, the distribution probability of each tag in the tag set in the training document set, and the occurrence probability of the text segment in the training document set, determine the generation probability of words in the text segment.

[0162] In this embodiment, the occurrence probability of each label in the corresponding text segment refers to the probability modeling $p(x_k \mid t)$; the distribution probability of each label in the label set in the training document set refers to $p(t \mid x_k, d)$; and the occurrence probability of a text fragment in the training document set refers to $p(x_k)$.

[0163] The word generation probability of a text fragment is determined by the following formula:

[0164] $p(x_k)$ represents the probability of the text fragment $x_k$ appearing in a document; the probability modeling $p(x_k \mid t)$ is used to predict the probability distribution of a text segment given a label; and the label probability $p(t \mid x_k, d)$ represents the probabilities from the global and local perspectives, including the global (document) probability distribution of tags and the local (text fragment) probability distribution of tags.

[0165] Step 204: Determine a new set of tags for the text fragment based on the word generation probability;

[0166] In this embodiment, the target word is determined from the words in the text fragment based on the generation probability of the words in the text fragment, and the target word is used as a new tag set for the text fragment.

[0167] Step 205: After determining the new tag set of the text fragment, return to the step of inputting the text fragment and the corresponding tag set into the large language model until the preset conditions are met;

[0168] In this embodiment, the words in the text segment are sorted according to their generation probability, and words with high generation probability are selected to form new labels. The process then returns to the step of inputting the text segments and their corresponding label sets into the large language model.

[0169] The model takes multiple text fragments and their corresponding sets of labels as input, processes the input using the large language model, and outputs new text fragments and sets of labels. Through multiple iterations of optimization, it improves the accuracy and semantic consistency of the labels. It sets preset conditions (such as the number of iterations, label accuracy, etc.) and determines whether the iteration termination condition is met. If the preset condition is not met, it returns to the previous step and inputs the multiple text fragments and their corresponding sets of labels into the large language model to redetermine the new set of labels. If the preset condition is met, it outputs the final text fragments and set of labels.

[0170] Step 206: When the preset conditions are met, aggregate text fragments with the same tags according to the tag set corresponding to the multiple text fragments to obtain multiple new text fragments;

[0171] In this embodiment, it is determined whether a preset condition is met. If the preset condition is met, text fragments with the same tag are aggregated together to form a new text fragment. Multiple new text fragments are generated, each text fragment corresponding to a tag.

[0172] The aggregated text fragments are treated as new text fragments, and the label corresponding to each new text fragment is retained. Text fragments belonging to the same label have similar semantic information. This process is repeated iteratively; in each iteration, the words with the highest generation probability in the text fragment are retained to form new labels, thereby training the semantic structure model.

[0173] Step 207: Obtain the labeling accuracy of the new text fragment;

[0174] Labeling accuracy refers to the proportion of text segments correctly labeled in terms of semantic structure, as determined by models or manual annotation.

[0175] In this embodiment, the labeling accuracy of a new text fragment is determined by comparing the new label of the text fragment with the actual label of the text fragment.

[0176] In some embodiments, step 207 includes the following sub-steps:

[0177] Sub-step S51: Obtain the new label of the text fragment and the actual label of the text fragment;

[0178] Obtaining the predicted labels and ground truth labels for text fragments is a key step in evaluating model performance.

[0179] In this embodiment, the new label refers to the label predicted by the model, and the actual label refers to the true or correct label. The actual label and the new label of the text segment are obtained. Each text segment should have a corresponding actual label and a new label. The labeling accuracy is calculated by comparing the actual label and the new label of each text segment.

[0180] Sub-step S52: Based on the new label of the text fragment and the actual label of the text fragment, determine the number of correct labels for the text fragment marked by the large language model;

[0181] In this embodiment, the actual label and the new label of the text fragment are obtained. Each text fragment should have a corresponding actual label and a new label. The actual label and the new label of the text fragment are compared, and the number of correctly labeled text fragments is calculated.

[0182] Sub-step S53: Based on the new label of the text fragment and the actual label of the text fragment, determine the number of labeling errors of the text fragment in the large language model;

[0183] In this embodiment, the actual label and the new label of the text fragment are obtained. Each text fragment should have a corresponding actual label and a new label. The actual label and the new label of the text fragment are compared to calculate the number of labeling errors.

[0184] Sub-step S54: Determine the labeling accuracy rate based on the number of correct labels and the number of incorrect labels; the labeling accuracy rate is calculated using the formula...

[0185] The labeling accuracy is used to evaluate the model's performance and to adjust its learning rate and iteration count. Here, TP represents the number of correctly predicted positive classes, FP represents the number of incorrectly predicted positive classes, and P represents the labeling accuracy. These values are obtained by comparing the model's predictions with the true labels in the test set.

[0186] In this embodiment, data on the number of correctly labeled tags and the number of incorrectly labeled tags are obtained. The total number of tags equals the number of correctly labeled tags plus the number of incorrectly labeled tags. The labeling accuracy is calculated using the number of correctly labeled tags and the total number of tags.
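As an illustrative sketch of sub-steps S51 to S54, the labeling accuracy may be computed as the number of correct labels divided by the total number of labels, that is, P = TP / (TP + FP) in the notation above. Here TP is assumed to be the count of fragments whose new label equals the actual label and FP the count of the remaining fragments. The function name `labeling_accuracy`, the exact-match comparison, and the example labels are assumptions for illustration.

```python
from typing import Sequence


def labeling_accuracy(new_labels: Sequence[str], actual_labels: Sequence[str]) -> float:
    """Return correct / (correct + incorrect) over paired new and actual labels."""
    if len(new_labels) != len(actual_labels):
        raise ValueError("each text fragment needs one new label and one actual label")
    # Sub-step S52: number of correctly labeled fragments (TP).
    correct = sum(1 for new, actual in zip(new_labels, actual_labels) if new == actual)
    # Sub-step S53: number of labeling errors (FP).
    incorrect = len(new_labels) - correct
    total = correct + incorrect
    # Sub-step S54: accuracy; defined as 0.0 here when there are no fragments.
    return correct / total if total else 0.0


new_labels = ["nlp", "retrieval", "nlp", "hci"]
actual_labels = ["nlp", "retrieval", "hci", "hci"]
accuracy = labeling_accuracy(new_labels, actual_labels)
print(accuracy)
```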

[0187] Sub-step S55: Adjust the learning rate and number of iterations of the large language model based on the labeling accuracy.

[0188] In this embodiment, the learning rate and number of iterations of the model are adjusted based on the calculated label accuracy. For example, if the label accuracy is low, the learning rate can be reduced or the number of iterations increased to allow the model to better fit the data. If the label accuracy is high, the learning rate can be increased or the number of iterations decreased to avoid overfitting.
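The adjustment rule of sub-step S55 may be sketched as follows. The thresholds `low` and `high`, the scaling `factor`, and the choice to change both the learning rate and the iteration count at once are assumptions made for this sketch; the application only gives the direction of each adjustment and presents the two adjustments as alternatives.

```python
from typing import Tuple


def adjust_hyperparameters(
    learning_rate: float,
    iterations: int,
    accuracy: float,
    low: float = 0.6,    # assumed threshold below which accuracy counts as low
    high: float = 0.9,   # assumed threshold above which accuracy counts as high
    factor: float = 0.5, # assumed scaling factor for the learning rate
) -> Tuple[float, int]:
    """Return an adjusted (learning_rate, iterations) pair based on labeling accuracy."""
    if accuracy < low:
        # Low accuracy: reduce the learning rate and train for more iterations.
        return learning_rate * factor, iterations * 2
    if accuracy > high:
        # High accuracy: raise the learning rate and train for fewer iterations.
        return learning_rate / factor, max(1, iterations // 2)
    # Otherwise keep the current settings.
    return learning_rate, iterations


print(adjust_hyperparameters(0.001, 10, accuracy=0.5))
print(adjust_hyperparameters(0.001, 10, accuracy=0.95))
print(adjust_hyperparameters(0.001, 10, accuracy=0.75))
```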

[0189] Step 208: Adjust the large language model according to the labeling accuracy to obtain the semantic structure model.

[0190] In this embodiment, by continuously adjusting and training the large language model, a semantic structure model can be obtained based on the trained large language model.

[0191] Referring to Figure 3, a schematic diagram of the semantic structure model training process of a semantic analysis method provided in an embodiment of this application is shown;

[0192] In this embodiment, training the semantic structure model requires document preprocessing to facilitate similarity calculation. Similarity is calculated between adjacent sentences in the preprocessed document, and adjacent sentences whose similarity exceeds a similarity threshold are merged to obtain text segments, thus completing document segmentation. After document segmentation, multiple text segments are obtained, and labels are assigned to these text segments. The text segments and their corresponding labels are input into a large language model to determine word generation probabilities. New labels are obtained based on these word generation probabilities, and text segments with the same labels are aggregated. After multiple iterations, a semantic structure model is generated.

[0193] In this embodiment, the training document set is divided to obtain multiple text fragments. Based on the tag set corresponding to the multiple text fragments, text fragments with the same tag are aggregated to obtain multiple new text fragments. The large language model is adjusted according to the tag accuracy to train a semantic structure model, which can be used for natural language generation tasks. Through the model, more natural, coherent and accurate text can be generated to understand the semantic structure of the text.

[0194] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this application are not limited to the described order of actions, because according to the embodiments of this application, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.

[0195] Referring to Figure 4, a structural block diagram of a semantic analysis device provided in an embodiment of this application is shown, which may specifically include the following modules:

[0196] Document acquisition module 401 is used to acquire the target document;

[0197] Tag configuration module 402 is used to divide the target document into multiple text fragments and configure a tag set for the multiple text fragments;

[0198] The probability determination module 403 is used to input the plurality of text fragments and corresponding plurality of tag sets into a semantic structure model; through the semantic structure model, determine the occurrence probability of each tag in the tag set in the corresponding text fragment, determine the distribution probability of each tag in the tag set, and determine the occurrence probability of the text fragment in all text fragments; based on the occurrence probability of each tag in the corresponding text fragment, the distribution probability of each tag in the tag set in all text fragments, and the occurrence probability of the text fragment in all text fragments, determine the generation probability of words in the text fragments; based on the generation probability of words in the text fragments, determine the target word from the words in the text fragments, and use the target word as a new tag set for the text fragments;

[0199] The condition determination module 404 is used to return to the step of inputting the multiple text fragments and their corresponding multiple tag sets into the semantic structure model after the new tag sets of the multiple text fragments are determined, until the preset conditions are met;

[0200] The text aggregation module 405 is used to aggregate text fragments with the same tags according to the tag set corresponding to the multiple text fragments when the preset conditions are met, so as to obtain multiple new text fragments.

[0201] Optionally, the label configuration module 402 includes:

[0202] The vector generation submodule is used to preprocess the statements in the target document and generate vectors of the statements;

[0203] The similarity determination submodule is used to calculate the similarity between adjacent sentences in the target document based on the vector of the sentence, and obtain the similarity between the adjacent sentences;

[0204] The first similarity comparison submodule is used to compare the similarity of the adjacent statements with a similarity threshold.

[0205] The statement merging submodule is used to merge adjacent statements into a text segment if the similarity between adjacent statements is greater than the similarity threshold.

[0206] Optionally, the probability determination module 403 includes:

[0207] The word sorting submodule is used to sort the words in the text segment according to the generation probability of the words in the text segment;

[0208] The word determination submodule is used to determine the target word from the words in the text segment based on the sorting results.

[0209] Optionally, the semantic structure model is trained using the following modules:

[0210] The training document acquisition module is used to acquire a set of training documents;

[0211] The training label configuration module is used to divide each training document in the training document set into multiple text segments and configure a label set for the multiple text segments.

[0212] The word probability generation module is used to input the text fragment and the corresponding tag set of the text fragment into the large language model to obtain the word generation probability of the text fragment;

[0213] A tag determination module is used to determine a new set of tags for the text segment based on the word generation probability.

[0214] The return operation module is used to return to the step of inputting the text fragment and the corresponding tag set into the large language model after determining the new tag set of the text fragment, until the preset conditions are met;

[0215] The text fragment aggregation module is used to aggregate text fragments with the same tags according to the tag set corresponding to the multiple text fragments when the preset conditions are met, so as to obtain multiple new text fragments.

[0216] The accuracy acquisition module is used to acquire the labeling accuracy of the new text fragment;

[0217] The semantic model generation module is used to adjust the large language model based on the labeling accuracy of the labels to obtain the semantic structure model.

[0218] Optionally, the training label configuration module includes:

[0219] The vector determination submodule is used to preprocess the statements in the training document set and generate vectors for the statements;

[0220] The similarity calculation submodule is used to calculate the similarity between adjacent statements in the training document set based on the vector of the statement, and to obtain the similarity between the adjacent statements.

[0221] The second similarity comparison submodule is used to compare the similarity of the adjacent statements with a similarity threshold.

[0222] The text fragment generation submodule is used to merge the adjacent statements into a text fragment if the similarity between the adjacent statements is greater than the similarity threshold, thereby obtaining multiple text fragments.

[0223] Optionally, the word probability generation module includes:

[0224] The label probability determination submodule is used to determine, through the large language model, the probability of occurrence of each label in the label set in the corresponding text segment, the distribution probability of each label in the label set, and the probability of occurrence of the text segment in the training document set;

[0225] The word probability determination submodule is used to determine the generation probability of words in the text segment based on the occurrence probability of each tag in the corresponding text segment, the distribution probability of each tag in the tag set in the training document set, and the occurrence probability of the text segment in the training document set.

[0226] Optionally, the accuracy acquisition module includes:

[0227] The tag acquisition submodule is used to acquire new tags for the text fragment and the actual tags for the text fragment.

[0228] The correct number determination submodule is used to determine the correct number of labels that the large language model uses to label the text fragment based on the new label of the text fragment and the actual label of the text fragment;

[0229] The error count determination submodule is used to determine the number of labeling errors of the text fragment by the large language model based on the new label of the text fragment and the actual label of the text fragment;

[0230] The accuracy determination submodule is used to determine the labeling accuracy based on the number of correct labels and the number of incorrect labels;

[0231] The adjustment submodule is used to adjust the learning rate and number of iterations of the large language model based on the labeling accuracy.

[0232] This embodiment employs a document acquisition module to obtain the document, a tag configuration module to divide the target document into multiple text fragments and configure corresponding tags, a probability determination module to obtain the word generation probability, a condition determination module to determine whether the text fragments and their corresponding tags meet preset conditions, and a text aggregation module to aggregate text fragments that meet the preset conditions and have the same tags to obtain new text fragments. This allows for better classification of text fragments, and by aggregating text fragments with the same tags, the semantic interpretation of topic annotation becomes clearer. Automated tag generation reduces the workload of manual annotation and improves efficiency.

[0233] As the device embodiment is basically similar to the method embodiment, the description is relatively simple, and relevant parts can be found in the description of the method embodiment.

[0234] This application also provides an electronic device, including: a processor, a memory, and a computer program stored in the memory and capable of running on the processor. When the computer program is executed by the processor, it implements the various processes of the above-described semantic analysis method embodiment and achieves the same technical effect. To avoid repetition, it will not be described again here.

[0235] This application also provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, it implements the various processes of the above-described semantic analysis method embodiment and achieves the same technical effect. To avoid repetition, it will not be described again here.

[0236] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0237] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this application can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of this application can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0238] This application describes embodiments with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in one or more blocks of the flowchart illustrations and / or one or more blocks of the block diagrams.

[0239] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flowcharts and / or one or more block diagrams.

[0240] These computer program instructions may also be loaded onto a computer or other programmable data processing terminal equipment to cause a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable terminal equipment, provide steps for implementing the functions specified in one or more flowcharts and / or one or more block diagrams.

[0241] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.

[0242] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0243] The semantic analysis method and semantic analysis apparatus provided in this application have been described in detail above. Specific examples have been used to illustrate the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method and core ideas of this application. Those skilled in the art may, based on the ideas of this application, vary the specific implementations and the scope of application. In summary, the content of this specification should not be construed as a limitation of this application.
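As an illustration only, the segmentation step of the method (merging adjacent sentences whose similarity exceeds a similarity threshold, as recited in claim 2) might be sketched as follows. The bag-of-words vector, the function names, and the threshold value of 0.3 are assumptions made for this sketch and are not taken from the application; cosine similarity is used here as one of the similarity measures named in claim 9.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Hypothetical stand-in for the sentence vector: a bag-of-words count."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_into_fragments(sentences, threshold=0.3):
    """Merge each sentence into the current fragment when its similarity to
    the preceding sentence is greater than the threshold (assumed value)."""
    fragments = []
    for sentence in sentences:
        if fragments and cosine_similarity(
            sentence_vector(fragments[-1][-1]), sentence_vector(sentence)
        ) > threshold:
            fragments[-1].append(sentence)
        else:
            fragments.append([sentence])
    return [" ".join(fragment) for fragment in fragments]
```

In practice the sentence vectors would more plausibly come from an embedding model, and the threshold would be tuned on the document collection; the word-count vector is used here only to keep the sketch self-contained.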

Claims

1. A semantic analysis method, comprising: acquiring a target document; dividing the target document into a plurality of text fragments, and configuring a label set for each of the plurality of text fragments; inputting the plurality of text fragments and the plurality of label sets corresponding thereto into a semantic structure model; determining, by means of the semantic structure model, an occurrence probability of each label of the label set in the corresponding text fragment, a distribution probability of each label of the label set in all text fragments, and an occurrence probability of the text fragment in all text fragments; determining generation probabilities of words in the text fragment on the basis of the occurrence probability of each label in the corresponding text fragment, the distribution probability of each label of the label set in all text fragments, and the occurrence probability of the text fragment in all text fragments; determining a target word from the words in the text fragment on the basis of the generation probabilities of the words in the text fragment, and using the target word as a new label set for the text fragment; after the new label sets for the plurality of text fragments are determined, returning to the step of inputting the plurality of text fragments and the plurality of label sets corresponding thereto into the semantic structure model, until a preset condition is satisfied; and when the preset condition is satisfied, aggregating text fragments having identical labels on the basis of the label sets corresponding to the plurality of text fragments, so as to obtain a plurality of new text fragments.

2. The method according to claim 1, wherein dividing the target document into a plurality of text fragments comprises: preprocessing sentences in the target document, and generating vectors of the sentences; calculating, on the basis of the vectors of the sentences, a similarity between adjacent sentences in the target document; comparing the similarity between the adjacent sentences with a similarity threshold; and if the similarity between the adjacent sentences is greater than the similarity threshold, merging the adjacent sentences into one text fragment.

3. The method according to claim 1, wherein determining the target word from the words in the text fragment on the basis of the generation probabilities of the words in the text fragment comprises: sorting the words in the text fragment according to their generation probabilities; and determining the target word from the words in the text fragment on the basis of the sorting result.

4. The method according to claim 3, wherein sorting the words in the text fragment according to their generation probabilities comprises: sorting the words in the text fragment in descending order of generation probability.

5. The method according to claim 1, wherein the semantic structure model is trained in the following manner: acquiring a training document set; dividing each training document in the training document set into a plurality of text fragments, and configuring a label set for each of the plurality of text fragments; inputting the text fragment and the label set corresponding thereto into a large language model to obtain generation probabilities of words in the text fragment; determining a new label set for the text fragment on the basis of the generation probabilities of the words; after the new label set for the text fragment is determined, returning to the step of inputting the text fragment and the label set corresponding thereto into the large language model, until a preset condition is satisfied; when the preset condition is satisfied, aggregating text fragments having identical labels on the basis of the label sets corresponding to the plurality of text fragments, so as to obtain a plurality of new text fragments; obtaining a labeling accuracy of the new text fragments; and adjusting the large language model on the basis of the labeling accuracy to obtain the semantic structure model.

6. The method according to claim 5, wherein dividing each training document in the training document set into a plurality of text fragments comprises: preprocessing sentences in the training document set, and generating vectors of the sentences; calculating, on the basis of the vectors of the sentences, a similarity between adjacent sentences in the training document set; comparing the similarity between the adjacent sentences with a similarity threshold; and if the similarity between the adjacent sentences is greater than the similarity threshold, merging the adjacent sentences into one text fragment, so as to obtain the plurality of text fragments.

7. The method according to claim 5, wherein inputting the text fragment and the label set corresponding thereto into the large language model to obtain the generation probabilities of the words in the text fragment comprises: determining, by means of the large language model, an occurrence probability of each label of the label set in the corresponding text fragment, a distribution probability of each label of the label set in the training document set, and an occurrence probability of the text fragment in the training document set; and determining the generation probabilities of the words in the text fragment on the basis of the occurrence probability of each label in the corresponding text fragment, the distribution probability of each label of the label set in the training document set, and the occurrence probability of the text fragment in the training document set.

8. The method according to claim 5, wherein obtaining the labeling accuracy of the new text fragments comprises: obtaining new labels and actual labels of the text fragments; determining, on the basis of the new labels and the actual labels of the text fragments, a number of correct labels assigned by the large language model to the text fragments; determining, on the basis of the new labels and the actual labels of the text fragments, a number of incorrect labels assigned by the large language model to the text fragments; determining the labeling accuracy on the basis of the number of correct labels and the number of incorrect labels; and adjusting a learning rate and a number of iterations of the large language model on the basis of the labeling accuracy.

9. The method according to claim 2, wherein methods for calculating the similarity include cosine similarity, Euclidean distance, Manhattan distance, and Jaccard similarity.

10. A semantic analysis apparatus, comprising: a document acquisition module, configured to acquire a target document; a label configuration module, configured to divide the target document into a plurality of text fragments and configure a label set for each of the plurality of text fragments; a probability determination module, configured to: input the plurality of text fragments and the plurality of label sets corresponding thereto into a semantic structure model; determine, by means of the semantic structure model, an occurrence probability of each label of the label set in the corresponding text fragment, a distribution probability of each label of the label set in all text fragments, and an occurrence probability of the text fragment in all text fragments; determine generation probabilities of words in the text fragment on the basis of the occurrence probability of each label in the corresponding text fragment, the distribution probability of each label of the label set in all text fragments, and the occurrence probability of the text fragment in all text fragments; and determine a target word from the words in the text fragment on the basis of the generation probabilities of the words in the text fragment, and use the target word as a new label set for the text fragment; a condition determination module, configured to, after the new label sets for the plurality of text fragments are determined, return to the step of inputting the plurality of text fragments and the plurality of label sets corresponding thereto into the semantic structure model, until a preset condition is satisfied; and a text aggregation module, configured to, when the preset condition is satisfied, aggregate text fragments having identical labels on the basis of the label sets corresponding to the plurality of text fragments, so as to obtain a plurality of new text fragments.

11. An electronic device, comprising: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the semantic analysis method according to any one of claims 1 to 9.

12. A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the semantic analysis method according to any one of claims 1 to 9.
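For illustration only, the iterative relabeling and aggregation recited in claims 1, 3 and 4 might be sketched as follows. This is a sketch under stated assumptions, not the claimed implementation: `score_words` is a hypothetical callable standing in for the semantic structure model, the number of target words `k` is an assumed parameter, and the preset condition is assumed here to be "the label sets no longer change, or a maximum number of rounds is reached", which the claims do not specify.

```python
from collections import defaultdict


def top_words(word_probs, k=2):
    """Sort words by generation probability in descending order and keep
    the first k as the new label set (k is an assumed parameter)."""
    ranked = sorted(word_probs.items(), key=lambda item: (-item[1], item[0]))
    return frozenset(word for word, _ in ranked[:k])


def relabel_until_stable(fragments, labels, score_words, k=2, max_rounds=10):
    """Repeat scoring and relabeling until the assumed stopping condition holds.

    score_words(fragment, label_set) is a hypothetical stand-in for the
    semantic structure model; it returns {word: generation probability}.
    """
    labels = [frozenset(label_set) for label_set in labels]
    for _ in range(max_rounds):
        new_labels = [
            top_words(score_words(fragment, label_set), k)
            for fragment, label_set in zip(fragments, labels)
        ]
        if new_labels == labels:  # assumed preset condition: labels are stable
            break
        labels = new_labels
    return labels


def aggregate_by_label(fragments, labels):
    """Join fragments whose label sets are identical, in first-seen order."""
    groups = defaultdict(list)
    for fragment, label_set in zip(fragments, labels):
        groups[label_set].append(fragment)
    return [" ".join(group) for group in groups.values()]
```

A caller would pass the fragments, their initial label sets, and a model-backed scoring function to `relabel_until_stable`, then hand the resulting label sets to `aggregate_by_label` to obtain the new text fragments.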