A speech recognition text segmentation accuracy optimization system

By combining intent state machine modeling with a cross-granularity semantic anchor extraction module, the system optimizes sentence segmentation in speech recognition. This addresses the failure of existing technologies to model the speaker's intent state and achieves high-accuracy sentence segmentation in multi-turn dialogue and long speech stream scenarios.

CN122245292A · Pending · Publication Date: 2026-06-19 · JIANGSU UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
JIANGSU UNIV OF SCI & TECH
Filing Date
2026-03-23
Publication Date
2026-06-19


Abstract

This invention relates to the field of speech recognition technology and discloses a speech recognition text segmentation accuracy optimization system, comprising: an intent state machine modeling module, used to annotate the intent state of the token sequence output by streaming speech recognition, obtain the posterior probability distribution of the intent state at each token position, and suppress, retain, or forcibly mark the generation of candidate positions for segmentation boundaries according to the intent state type; and a cross-granularity semantic anchor extraction module, used to extract semantic anchors at three levels: phrase granularity, clause granularity, and topic granularity, respectively, to generate a hierarchical constraint set for candidate positions for segmentation boundaries. The purpose of this speech recognition text segmentation accuracy optimization system is to solve the problems of existing segmentation schemes that do not model the speaker's behavioral intent state, have fixed segmentation scoring weights, and lack online adaptive update capabilities.

Description

Technical Field

[0001] This invention relates to the field of speech recognition technology, specifically to a speech recognition text segmentation accuracy optimization system.

Background Technology

[0002] With the continuous development of artificial intelligence technology, speech recognition technology has been widely applied in various scenarios such as intelligent dialogue, meeting transcription, and government Q&A. After completing acoustic decoding, the speech recognition system needs to segment the output continuous text stream into several semantically complete paragraphs for use by downstream natural language processing modules. Existing segmentation solutions mainly follow two paths: one is to locate the endpoints of the pause energy distribution in the acoustic signal based on speech activity detection, using acoustic boundaries as the basis for segmentation; the other is to determine the segmentation points of the text based on lexical features and a pre-set language model after speech recognition output, and to perform local backtracking correction on the segmentation results along the time axis. Both of these solutions have been applied to a certain extent in specific scenarios.

[0003] However, in actual dialogue scenarios, both of the above solutions reduce sentence segmentation to a pure text segmentation task, in which boundaries are determined by a linear scan over local text positions. They do not model the speaker's intent state, which in real dialogue shows up as filler words, verbal corrections, and topic switching. This leads to frequent sentence segmentation errors: filler word intervals are incorrectly segmented into independent semantic paragraphs, text on both sides of a correction marker word is split, and texts on different topics are merged into the same segment. At the same time, the acoustic weights and semantic weights on which the sentence segmentation score depends are fixed values and cannot be dynamically adjusted according to real-time changes in noise environment quality and speech recognition confidence. The existing solutions also lack the ability to continuously optimize the sentence segmentation strategy online based on feedback from downstream tasks and user behavior signals. As a result, sentence segmentation accuracy drops significantly in multi-turn continuous dialogue and long speech stream scenarios.

Summary of the Invention

[0004] The purpose of this invention is to address the problems of existing sentence segmentation schemes that do not model the speaker's intention and state of behavior, have fixed sentence segmentation scoring weights, and lack online adaptive update capabilities, and proposes a speech recognition text sentence segmentation accuracy optimization system.

[0005] The technical solution of the present invention to solve the above-mentioned technical problems is as follows: a speech recognition text segmentation accuracy optimization system includes:
an intent state machine modeling module, used to annotate the intent state of the token sequence output by streaming speech recognition, obtain the posterior probability distribution of the intent state at each token position, and suppress, retain, or forcibly mark the generation of candidate positions for sentence segmentation boundaries according to the intent state type;
a cross-granularity semantic anchor extraction module, used to extract semantic anchors at three levels (phrase granularity, clause granularity, and topic granularity) and generate a hierarchical constraint set for candidate positions of sentence segmentation boundaries;
an acoustic-semantic confidence fusion scoring module, used to dynamically weight and comprehensively score the confidence of each candidate position of a sentence segmentation boundary based on acoustic features, semantic integrity features, and intent state features;
a sentence segmentation candidate graph construction and optimal path solving module, used to construct a directed weighted graph of sentence segmentation candidates with token positions as nodes, transform the hierarchical constraint set into graph weights, and output the sentence segmentation scheme through an optimal path solving algorithm;
an online adaptive feedback learning module, used to collect downstream task consistency feedback signals and implicit user behavior feedback signals, and to update the dynamic weighting coefficients and intent state transition probabilities online;
a sentence segmentation output and posterior correction module, used to merge and revise the output sentence segmentation results within a controlled time window based on the dual conditions of the difference in comprehensive confidence score and the consistency of intent state, and to output the final sentence segmentation text.

[0006] Based on the above technical solution, the present invention can be further improved as follows.

[0007] Furthermore, the intent state machine modeling module divides the speaker's intent state into five types: declarative state, interrogative state, filler state, error correction state, and topic switching state. The intent state machine modeling module adopts a sequence labeling model that combines a bidirectional long short-term memory network with a conditional random field to calculate the posterior probability of the intent state of each token in the current text window in real time. The intent state transition probability matrix of the intent state machine modeling module is obtained through offline training on multi-turn dialogue corpora and is updated online through the online adaptive feedback learning module.

[0008] Furthermore, the intent state machine modeling module processes candidate positions for sentence segmentation boundaries according to the intent state type as follows: When the intent state of a token is the filler state, the module suppresses all candidate positions for sentence segmentation boundaries within the token interval, so that no candidate boundaries are generated within the interval. When the intent state of a token is the error correction state, the module marks the text segments before and after the error correction marker as the same semantic segment, preventing the generation of sentence segmentation boundary candidates on both sides of the marker. When the intent state of a token is the topic switching state, the module generates a high-priority sentence segmentation boundary candidate at the position corresponding to the token and applies a positive enhancement mark to the comprehensive confidence score of the candidate position.

[0009] Furthermore, the cross-granularity semantic anchor extraction module extracts semantic anchors at the three granularities as follows: At the phrase granularity level, the module uses dependency parsing to parse the phrase structure of the current text window, extracts the boundary ranges of complete noun phrases, verb phrases, and prepositional phrases, and marks each token position inside such a complete phrase structure as a hard-constraint prohibited sentence segmentation area. Sentence segmentation boundary candidate positions in a hard-constraint prohibited area are downweighted in the directed weighted graph of sentence segmentation candidates by applying a large positive-valued penalty term. At the clause granularity level, the module identifies the clause boundary positions introduced by conjunctions, transition words, and causal markers through a clause boundary detection model, marks the clause boundary positions as soft-constraint sentence segmentation point candidates, and marks each token position inside a clause as a soft-constraint prohibited sentence segmentation area. Sentence segmentation boundary candidate positions within a soft-constraint prohibited area are downweighted by applying a small positive-valued penalty term. At the topic granularity level, the module adopts an online incremental topic segmentation algorithm to calculate the cosine similarity between the semantic embedding vector of the current text segment and the centroid vector of each topic in the topic vector library. When the cosine similarity is lower than the topic boundary threshold, which is dynamically adjusted according to the topic density of the dialogue history, a forced topic boundary anchor is generated at that position. The forced topic boundary anchor is upweighted in the directed weighted graph of sentence segmentation candidates by applying a positive reward term.

[0010] Furthermore, in the cross-granularity semantic anchor extraction module, the hard constraint prohibited sentence segmentation region, the soft constraint prohibited sentence segmentation region, and the topic boundary forced anchor point constitute a hierarchical constraint set; in the directed weighted graph of the sentence segmentation candidate, for the same directed edge, the weight correction terms of the three types of constraints work together in a superimposed manner on the comprehensive weight of the directed edge; when both the hard constraint prohibited sentence segmentation region and the topic boundary forced anchor point exist within the coverage area of the same directed edge, the penalty term of the hard constraint prohibited sentence segmentation region has a higher priority than the reward term of the topic boundary forced anchor point, that is, the hard constraint penalty term is not offset by the existence of the reward term.

[0011] Furthermore, the acoustic-semantic confidence fusion scoring module obtains the comprehensive confidence score for each candidate position of a sentence segmentation boundary by dynamically weighting three parts: the acoustic feature score, the semantic integrity score, and the intent state score. The acoustic feature score is calculated by normalizing the speech pause duration, fundamental frequency change rate, and speech energy attenuation rate before and after the candidate position. The semantic integrity score is obtained by inputting the text to the left of the candidate position into a pre-trained language model and calculating the probability that the text segment constitutes an independent and complete semantic unit. The intent state score is taken from the posterior probability of the intent state at that position output by the intent state machine modeling module, wherein the declarative state and topic switching state correspond to positive scores, and the filler state and error correction state correspond to negative scores.
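
The three-part score described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the normalization ranges, the averaging of the acoustic cues, the linear weighted sum, and the zero sign for the interrogative state are all assumptions, and every function name is hypothetical.

```python
# Sign convention for the intent state score. The positive and negative
# groups follow the text; the 0.0 for "interrogative" is an assumption,
# since the source does not assign that state a sign.
INTENT_SIGN = {
    "declarative": 1.0,
    "topic_switch": 1.0,
    "filler": -1.0,
    "error_correction": -1.0,
    "interrogative": 0.0,
}


def min_max(value, lo, hi):
    """Clamp value to [lo, hi] and rescale it to [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return (min(max(value, lo), hi) - lo) / (hi - lo)


def acoustic_score(pause_ms, f0_change_rate, energy_decay_rate):
    """Average of three normalized cues (the ranges are illustrative)."""
    cues = (
        min_max(pause_ms, 0.0, 1000.0),
        min_max(f0_change_rate, 0.0, 1.0),
        min_max(energy_decay_rate, 0.0, 1.0),
    )
    return sum(cues) / len(cues)


def intent_score(posterior):
    """Signed expectation over the intent state posterior at one position."""
    return sum(INTENT_SIGN[state] * p for state, p in posterior.items())


def fused_confidence(acoustic, semantic, intent, weights):
    """Weighted sum of the three partial scores; weights = (w_a, w_s, w_i)."""
    w_a, w_s, w_i = weights
    return w_a * acoustic + w_s * semantic + w_i * intent
```

The semantic integrity score is passed in as a plain number here because it would come from a pre-trained language model that this sketch does not attempt to reproduce.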

[0012] Furthermore, the dynamic weights in the acoustic-semantic confidence fusion scoring module are adaptively adjusted based on three dimensions: the noise environment quality of the current dialogue, the estimated speech recognition confidence value, and the dialogue scenario type. When the confidence estimate of speech recognition is lower than the preset confidence threshold, the dynamic weight corresponding to the semantic integrity score increases, and the dynamic weight corresponding to the acoustic feature score decreases accordingly. When the density of technical terms in the current scene output by the dialogue scene classifier is higher than the preset term density threshold, the dynamic weight corresponding to the acoustic feature score increases, and the dynamic weight corresponding to the semantic integrity score decreases accordingly. The dialogue scenario type is output in real time by the scenario classifier based on the lexical features and domain features of the current text window, and the update step size of the dynamic weights is controlled by the reward signal of the online adaptive feedback learning module.
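
A minimal sketch of the two weight-shifting rules above, assuming a fixed step and a final renormalization. The threshold values, the step size, and the function name `adjust_weights` are hypothetical; in the source the step size is controlled by the reward signal of the feedback module rather than fixed.

```python
def adjust_weights(weights, asr_confidence, term_density,
                   conf_threshold=0.6, term_threshold=0.3, step=0.05):
    """Shift weight between the acoustic and semantic scores.

    weights maps "acoustic", "semantic", and "intent" to non-negative values.
    All thresholds and the step size are illustrative assumptions.
    """
    w = dict(weights)
    if asr_confidence < conf_threshold:
        # Low recognition confidence: lean on semantics, away from acoustics.
        shift = min(step, w["acoustic"])
        w["semantic"] += shift
        w["acoustic"] -= shift
    if term_density > term_threshold:
        # Dense technical terminology: lean on acoustics, away from semantics.
        shift = min(step, w["semantic"])
        w["acoustic"] += shift
        w["semantic"] -= shift
    total = sum(w.values())
    return {name: value / total for name, value in w.items()}
```

With equal steps the two rules cancel when both conditions hold at once; the source does not say how that case is resolved, so a real system would need an explicit policy for it.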

[0013] Furthermore, the sentence segmentation candidate graph construction and optimal path solving module uses each Token position in the current text window as a node, and uses a directed edge from node i to node j to represent the text in the interval [i,j] as a complete sentence segment; The comprehensive weight of the directed edge is composed of the following three parts: the comprehensive confidence score at the endpoint position j, the sum of the penalty terms transformed by cross-granularity semantic anchor constraints in the interval [i,j], and the topic boundary reward term at the endpoint position j. The optimal path solving module uses an improved Viterbi algorithm to solve for the highest comprehensive weight path from the start node to the end node on the directed weighted graph of the sentence segmentation candidate, and outputs the node sequence corresponding to the highest comprehensive weight path as the sentence segmentation scheme of the current text window.
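
The path search can be illustrated with an ordinary dynamic program over the candidate graph. The source calls its solver an improved Viterbi algorithm without detailing the improvement, so the sketch below is a plain highest-weight path search on a directed acyclic graph, not the patented variant. The function name and the `edge_weight` callback are hypothetical; how the interval penalty is summed is left to the caller.

```python
from typing import Callable, List, Tuple


def best_segmentation(n_tokens: int,
                      edge_weight: Callable[[int, int], float]
                      ) -> Tuple[float, List[int]]:
    """Highest-weight path from node 0 to node n_tokens.

    An edge i -> j (i < j) stands for one sentence segment. Returns the
    total weight and the list of segment end positions.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n_tokens + 1)
    back = [-1] * (n_tokens + 1)
    best[0] = 0.0
    for j in range(1, n_tokens + 1):
        for i in range(j):
            if best[i] == neg_inf:
                continue  # node i is unreachable, so it cannot start an edge
            candidate = best[i] + edge_weight(i, j)
            if candidate > best[j]:
                best[j] = candidate
                back[j] = i
    if best[n_tokens] == neg_inf:
        raise ValueError("end node is unreachable")
    # Walk the back-pointers from the end node to recover the boundaries.
    boundaries = []
    j = n_tokens
    while j > 0:
        boundaries.append(j)
        j = back[j]
    boundaries.reverse()
    return best[n_tokens], boundaries
```

A usage example, assuming confidence scores centered around zero so that weak candidates contribute negative weight (otherwise the highest-weight path would cut at every position), and charging the penalty only at the edge's end position:

```python
conf = [0.0, -0.4, -0.3, 0.4, -0.4, -0.3, 0.3]   # index = token position
penalty = [0.0] * 7                               # anchor-derived penalties
reward = [0.0] * 7                                # topic boundary rewards
total, cuts = best_segmentation(
    6, lambda i, j: conf[j] - penalty[j] + reward[j])
```

The double loop is quadratic in the window length, which is usually acceptable for a bounded streaming text window.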

[0014] Furthermore, the online adaptive feedback learning module collects two types of feedback signals and performs different parameter update operations for each: Regarding the downstream task consistency feedback signal: when the downstream intent recognition module returns an ambiguous intent marker for a certain segment, a negative reward signal is generated; when the downstream intent recognition module returns a high-confidence intent recognition result for a certain segment, a positive reward signal is generated; the online adaptive feedback learning module updates the dynamic weight coefficients in the acoustic-semantic confidence fusion scoring module in small steps according to the reward signal using an online gradient update method; For implicit feedback signals of user behavior: when a user verbally corrects the content of the previous round of voice input after the system answers, the online adaptive feedback learning module recognizes the behavior as a negative signal of the previous round of sentence segmentation results and triggers the update of the relevant intention state transition probability matrix in the intention state machine modeling module. The online adaptive feedback learning module uses an exponential moving average mechanism to smooth the update amounts of both types of parameters to prevent parameters from exceeding the preset stable range due to a single abnormal feedback signal.
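
One way to picture the small-step update with exponential moving average smoothing is sketched below. It covers only the fusion weights, not the transition probability matrix. The step rule (reward times partial score), the learning rate, the decay, and the stable range are all assumptions, and the class name is hypothetical.

```python
class FusionWeightUpdater:
    """Reward-driven update of the fusion weights with EMA smoothing."""

    def __init__(self, weights, learning_rate=0.1, ema_decay=0.9,
                 lo=0.05, hi=0.9):
        self.weights = dict(weights)
        self.learning_rate = learning_rate
        self.ema_decay = ema_decay
        self.lo = lo    # lower edge of the assumed stable range
        self.hi = hi    # upper edge of the assumed stable range
        self.smoothed = {name: 0.0 for name in self.weights}

    def update(self, reward, partial_scores):
        """Apply one feedback signal.

        reward is positive for a high-confidence downstream result and
        negative for an ambiguous one; partial_scores holds the three
        partial scores of the boundary that produced the feedback.
        """
        for name in self.weights:
            raw_step = self.learning_rate * reward * partial_scores[name]
            # Smooth the step so one abnormal signal has limited effect.
            self.smoothed[name] = (self.ema_decay * self.smoothed[name]
                                   + (1.0 - self.ema_decay) * raw_step)
            updated = self.weights[name] + self.smoothed[name]
            # Clamp to the stable range as a second safeguard.
            self.weights[name] = min(self.hi, max(self.lo, updated))
        return dict(self.weights)
```

The smoothing damps each individual step, while the clamp guarantees that even an extreme reward cannot push a weight outside the preset range.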

[0015] Furthermore, the sentence segmentation output and the posterior correction module must simultaneously meet the following two conditions for performing the merge and revision operation: First condition: Within a controlled time window, the difference between the combined confidence score recalculated after merging two adjacent sentence segments and the sum of the combined confidence scores of the two sentence segments before merging exceeds the preset revision threshold. The second condition is that the intent state annotation result corresponding to the text segment after merging the two adjacent sentence segments belongs to a single intent state type. When the first condition is met but the second condition is not met, that is, when the merged text segment spans two or more intent state types, the sentence segmentation output and posterior correction module does not perform the merge revision operation and maintains the original sentence segmentation result output; when both conditions are met, the sentence segmentation output and posterior correction module performs the merge revision operation, and merges two adjacent sentence segment segments into one sentence segment before outputting it.
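
The dual-condition check can be written as a single predicate. This is a sketch under stated assumptions: the parameter names, the default threshold, and the way the controlled time window is expressed (elapsed seconds against a window length) are hypothetical.

```python
def should_merge(score_left, score_right, score_merged, merged_states,
                 revision_threshold=0.1, elapsed_s=0.0, window_s=2.0):
    """Decide whether two adjacent segments are merged after output.

    merged_states is the per-token intent state labeling of the text that
    would result from the merge.
    """
    if elapsed_s > window_s:
        return False  # outside the controlled time window
    # First condition: the merged score beats the sum of the two
    # original scores by more than the revision threshold.
    gain = score_merged - (score_left + score_right)
    first = gain > revision_threshold
    # Second condition: the merged text carries a single intent state type.
    second = len(set(merged_states)) == 1
    return first and second
```

For example, `should_merge(0.3, 0.2, 0.9, ["declarative", "declarative"])` passes both conditions, while the same scores with `["declarative", "topic_switch"]` are rejected by the second condition and the original segmentation is kept.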

[0016] Compared with the prior art, the technical solution of this application has the following beneficial technical effects: This invention uses an intent state machine modeling module to annotate the token sequence output by streaming speech recognition with intent states, obtains the posterior probability distribution of intent states at each token position, and suppresses, retains, or forcibly marks the generation of sentence segmentation boundary candidate positions according to the intent state type. This introduces modeling of the speaker's behavioral intent at the source of sentence segmentation decision, which suppresses the generation of sentence segmentation boundary candidate positions within the filler word interval, marks the text segments on both sides of the error correction marker word as the same semantic segment, and forcibly generates sentence segmentation boundary candidate positions at topic switching positions. The semantic anchor point extraction module extracts semantic anchor points at three levels: phrase granularity, clause granularity, and topic granularity. This generates a hierarchical constraint set for candidate positions of sentence segmentation boundaries. The hierarchical constraint set is then transformed into weight correction terms for each directed edge in the directed weighted graph of the sentence segmentation candidate. The sentence segmentation candidate graph construction and optimal path solving module outputs the sentence segmentation scheme globally through the optimal path solving algorithm. The acoustic-semantic confidence fusion scoring module dynamically weights and comprehensively scores the confidence of each candidate position of sentence segmentation boundary based on acoustic features, semantic integrity features, and intent state features. This replaces the scoring method in the existing technology where acoustic weights and semantic weights are fixed. 
As the noise environment quality and speech recognition confidence change in real time, the dominant features of the sentence segmentation scoring can be adaptively switched accordingly. The online adaptive feedback learning module continuously collects downstream task consistency feedback signals and implicit user behavior feedback signals to update the dynamic weighting coefficients and intent state transition probabilities online. Within a controlled time window, the sentence segmentation output and posterior correction module merges and revises the output sentence segmentation results based on the dual conditions of the difference in comprehensive confidence scores and consistency of intent states. While retaining the posterior correction capability, the consistency of intent states is used as a constraint condition for the revision operation to prevent the posterior merging operation from incorrectly merging adjacent paragraphs that span different intent states.

Attached Figure Description

[0017] Figure 1 is the overall flowchart of the speech recognition text segmentation accuracy optimization system of the present invention;
Figure 2 is a flowchart illustrating the state machine modeling and processing rules of the present invention;
Figure 3 is a flowchart of the cross-granularity semantic anchor point extraction process of the present invention;
Figure 4 is a flowchart of the sentence segmentation candidate graph construction and optimal path solving process of the present invention;
Figure 5 is a flowchart of the online adaptive feedback learning process of the present invention;
Figure 6 is a flowchart of the sentence segmentation output and posterior correction process of the present invention.

Detailed Implementation

[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] Referring to Figure 1, the present invention provides a speech recognition text segmentation accuracy optimization system, comprising an intent state machine modeling module, a cross-granularity semantic anchor point extraction module, an acoustic-semantic confidence fusion scoring module, a segmentation candidate graph construction and optimal path solving module, an online adaptive feedback learning module, and a segmentation output and posterior correction module. These six modules operate collaboratively as a pipeline in a streaming speech recognition scenario, driven by the speaker's intent state, guided by multi-granularity semantic anchor point constraints, and optimized by online feedback, collectively forming a closed-loop segmentation processing system.

[0020] The intent state machine modeling module is used to annotate the intent state of the token sequence output by streaming speech recognition, obtain the posterior probability distribution of the intent state at each token position, and suppress, retain or forcibly mark the generation of candidate positions for sentence segmentation boundaries according to the intent state type.

[0021] In a preferred embodiment of the present invention, the intention state machine modeling module defines the speaker's intention state into five types: declarative state, interrogative state, filler state, error correction state, and topic switching state. The declarative state corresponds to the context in which the speaker makes a coherent and complete statement; the interrogative state corresponds to the context in which the speaker is constructing an interrogative sentence, at which point the interrogative structure has not yet been closed; the filler state corresponds to the context in which the speaker generates spoken filler words, such as "um," "that," "that is to say," etc.; the error correction state corresponds to the context in which the speaker verbally corrects the content already expressed, such as "No, I mean..."; and the topic switching state corresponds to the context in which the speaker is about to change the topic, at which point the dialogue content will show a clear semantic jump.

[0022] The intent state machine modeling module employs a sequence labeling model combining a bidirectional long short-term memory network (BiLSTM) and a conditional random field (CRF) to calculate the posterior probability of the intent state for each token within the current text window in real time. The BiLSTM extracts contextual features from both the forward and reverse directions of the token sequence, concatenates the forward and reverse hidden states, and inputs them into the CRF layer. The CRF layer applies transition constraints to the intent state labeling results of each token at the overall sequence level, ensuring that the state labeling results of adjacent tokens conform to reasonable state transition rules, thereby outputting the globally optimal intent state labeling sequence. The intent state transition probability matrix of the intent state machine modeling module is obtained through offline training on multi-turn dialogue corpora and updated online via the online adaptive feedback learning module. During the offline training phase, the actual transition frequencies between intent states are counted over a large-scale multi-turn dialogue corpus covering various scenarios such as customer service dialogues, meeting transcription, and government Q&A, and an initial intent state transition probability matrix is fitted from them, enabling the model to recognize intent states across scenarios upon system startup.
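
The offline fitting of the initial transition matrix from observed transition frequencies can be sketched as a count-and-normalize step. The additive smoothing and the function name are assumptions introduced here; the source only states that the matrix is fitted from transition frequencies in the corpus.

```python
STATES = ["declarative", "interrogative", "filler",
          "error_correction", "topic_switch"]


def fit_transition_matrix(sequences, smoothing=1.0):
    """Estimate P(next state | current state) from labeled sequences.

    sequences is an iterable of per-token intent state label lists.
    smoothing is an additive (Laplace) constant, assumed here so that
    transitions never seen in the corpus keep a non-zero probability.
    """
    index = {state: i for i, state in enumerate(STATES)}
    counts = [[smoothing] * len(STATES) for _ in STATES]
    for labels in sequences:
        for current, following in zip(labels, labels[1:]):
            counts[index[current]][index[following]] += 1.0
    # Row-normalize so each row is a probability distribution.
    return [[c / sum(row) for c in row] for row in counts]
```

Each row of the result sums to one, so the matrix can serve directly as the starting point that the online feedback module later adjusts.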

[0023] In a preferred embodiment of the present invention, the intent state machine modeling module processes the candidate positions of sentence segmentation boundaries according to the intent state type as follows: When the intent state of a token is the filler state, the intent state machine modeling module performs a suppression operation on all candidate positions for sentence segmentation boundaries within that token's interval, preventing the generation of any candidate sentence segmentation boundaries within the interval. The suppression operation is implemented by marking all token positions within the filler interval as unselectable in the set of candidate sentence segmentation boundaries. Subsequently, when constructing the directed weighted graph, the sentence segmentation candidate graph construction module skips these unselectable positions and does not set directed edge endpoints on them, thus ensuring that the filler word interval is not segmented into independent semantic paragraphs.

[0024] When the intent state of the token is in the error correction state, the intent state machine modeling module marks the text segments before and after the error correction marker as the same semantic segment, preventing the generation of sentence break boundary candidates on both sides of the error correction marker. The identification of the error correction marker is based on the trigger word pattern appearing in the error correction state token sequence, such as "no," "I mean," "should be," etc. After the system detects the above pattern, it marks the interval between the position of the error correction marker and several tokens before and after it as the error correction merging area. No sentence break boundary candidates are generated within this interval.
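
A minimal sketch of how the error correction merging area might be computed. The trigger list comes from the examples in the text; the `radius` parameter stands in for the unspecified "several tokens" before and after the marker, and the function name is hypothetical.

```python
# Trigger word patterns taken from the examples in the text (lowercased).
TRIGGERS = [("no",), ("i", "mean"), ("should", "be")]


def correction_merge_positions(tokens, states, radius=2):
    """Return sorted token positions where no boundary may be generated.

    tokens and states are parallel lists; a trigger only counts when it
    starts on a token labeled "error_correction".
    """
    lowered = [token.lower() for token in tokens]
    blocked = set()
    for i, state in enumerate(states):
        if state != "error_correction":
            continue
        for trigger in TRIGGERS:
            if tuple(lowered[i:i + len(trigger)]) == trigger:
                start = max(0, i - radius)
                end = min(len(tokens), i + len(trigger) + radius)
                blocked.update(range(start, end))
    return sorted(blocked)
```

For the utterance "book a flight to paris no i mean rome tomorrow" with the three tokens "no i mean" labeled as error correction and `radius=1`, the blocked positions cover "paris" through "rome", so the original phrase and its correction stay in one segment.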

[0025] When the intent state of a token is a topic-switching state, the intent state machine modeling module generates high-priority sentence segmentation boundary candidates at the corresponding position of the token and applies a positive enhancement label to the overall confidence score of the candidate position. This positive enhancement label is passed to the acoustic-semantic confidence fusion scoring module as a superimposed correction amount. It is directly added to the scoring result when this module calculates the overall confidence score, making the overall confidence score of the topic-switching position higher than that of surrounding non-topic boundary positions. Therefore, it is preferentially selected as a sentence segmentation boundary during the optimal path solution stage of the sentence segmentation candidate graph. For declarative states, the intent state machine modeling module retains the sentence segmentation boundary candidates at that position, and the acoustic-semantic confidence fusion scoring module decides whether to segment the sentence at that position based on the overall score. For interrogative states, the intent state machine modeling module retains the sentence segmentation boundary candidates but reduces their initial weight. The sentence segmentation boundary position is then confirmed by subsequent modules after the complete structure of the interrogative sentence is formed.
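
Taken together, the five per-state rules amount to a small dispatch over the intent label of each token. The sketch below assumes hard labels rather than posterior distributions; the numeric bonus and discount values and all names are hypothetical.

```python
from dataclasses import dataclass

TOPIC_SWITCH_BONUS = 1.0        # assumed size of the positive enhancement
INTERROGATIVE_DISCOUNT = -0.5   # assumed reduction of the initial weight


@dataclass
class Candidate:
    position: int
    selectable: bool          # False means no edge may end at this position
    weight_adjustment: float  # added to the comprehensive confidence score


def apply_intent_rules(states):
    """Turn per-token intent states into boundary candidate decisions."""
    candidates = []
    for position, state in enumerate(states):
        if state in ("filler", "error_correction"):
            # Suppress: the position is excluded from the candidate set.
            candidates.append(Candidate(position, False, 0.0))
        elif state == "topic_switch":
            # Force: keep the candidate and boost its score.
            candidates.append(Candidate(position, True, TOPIC_SWITCH_BONUS))
        elif state == "interrogative":
            # Retain with a reduced initial weight until the question closes.
            candidates.append(
                Candidate(position, True, INTERROGATIVE_DISCOUNT))
        elif state == "declarative":
            # Retain unchanged; the fusion score decides.
            candidates.append(Candidate(position, True, 0.0))
        else:
            raise ValueError(f"unknown intent state: {state}")
    return candidates
```

Downstream, the graph construction step would skip every candidate whose `selectable` flag is false and add `weight_adjustment` to the score of the rest.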

[0026] The cross-granularity semantic anchor extraction module is used to extract semantic anchors at three levels: phrase granularity, clause granularity, and topic granularity, and generate a hierarchical constraint set for candidate positions of sentence segmentation boundaries.

[0027] As a preferred embodiment of the present invention, as shown in Figure 2, the cross-granularity semantic anchor extraction module extracts semantic anchors at three granularities in the following specific ways: At the phrase granularity level, the cross-granularity semantic anchor extraction module uses dependency parsing to analyze the phrase structure of the current text window, extracting the boundary ranges of complete noun phrases, verb phrases, and prepositional phrases. Each token position within a complete phrase structure is marked as a hard-constraint prohibited segmentation zone. Segmentation boundary candidate positions within these zones are downweighted in the directed weighted graph of segmentation candidates by applying a large positive penalty term. Dependency parsing locates the shared boundaries between head nouns and their modifying elements by identifying syntactic dependency arcs between tokens, ensuring that the head word of a complete noun phrase and its modifiers and determiners are not cut apart by segmentation. For verb phrases, parsing identifies the range shared by the predicate verb with its object, complement, and adverbial, ensuring that the integrity of the predicate-argument structure is not affected by segmentation. The large positive penalty term corresponding to the hard-constraint prohibited segmentation zone is significantly larger than the penalty terms of other constraint types, which puts directed edges covering hard-constraint intervals at a significant disadvantage in global path optimization, so that they are avoided in actual path selection.
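
Once a parser has produced phrase spans, marking the hard-constraint zone is a matter of collecting the boundary positions that fall strictly inside a span. The sketch below takes the spans as given (no parser is invoked); the half-open span convention and the function name are assumptions.

```python
def hard_constraint_positions(phrase_spans):
    """Boundary positions that would cut through a complete phrase.

    Each span is (start, end) over tokens start .. end - 1. Boundary k sits
    between token k - 1 and token k, so it is interior to a span exactly
    when start < k < end. The span edges themselves remain legal cut points.
    """
    blocked = set()
    for start, end in phrase_spans:
        blocked.update(range(start + 1, end))
    return blocked
```

For a three-token noun phrase covering tokens 0 to 2 and a two-token prepositional phrase covering tokens 4 and 5, `hard_constraint_positions([(0, 3), (4, 6)])` marks boundaries 1, 2, and 5, leaving the cut between the two phrases available.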

[0028] At the clause granularity level, the cross-granularity semantic anchor extraction module identifies clause boundary positions introduced by conjunctions, transition words, and causal markers through a clause boundary detection model. These clause boundary positions are marked as soft-constraint sentence segmentation point candidates, and each token position within a clause is marked as a soft-constraint prohibited sentence segmentation zone. Sentence segmentation boundary candidate positions within the soft-constraint prohibited zone are downweighted by applying a small positive penalty term. The clause boundary detection model identifies introductory words such as "but," "however," "although," "because," and "therefore," marking their positions as medium-priority sentence segmentation point candidates. Simultaneously, each token position within subordinate clauses introduced by conjunctions is marked as a soft-constraint prohibited sentence segmentation zone, with a penalty term value smaller than that for hard-constraint prohibited zones. This allows sentence segmentation boundaries to be generated within clauses under special circumstances, provided the overall confidence score is sufficiently high, thus maintaining system flexibility.

[0029] At the topic granularity level, the cross-granularity semantic anchor extraction module employs an online incremental topic segmentation algorithm. It calculates the cosine similarity between the semantic embedding vector of the current text segment and the centroid vectors of each topic in the topic vector library. When the cosine similarity is lower than the topic boundary threshold dynamically adjusted based on the topic density in the dialogue history, a forced topic boundary anchor is generated at that position. This forced anchor is weighted in the directed weighted graph of the sentence segmentation candidates by applying a positive reward. The topic vector library is continuously updated incrementally during the dialogue. Whenever a new topic segment is confirmed, the moving average of its semantic embedding vector is used as the centroid vector of that topic and stored in the topic vector library. The topic boundary threshold is automatically adjusted based on the observed topic switching frequency in historical dialogues. The threshold is appropriately increased in scenarios with frequent topic switching and appropriately decreased in scenarios with stable topics, adapting to the topic structure characteristics of different dialogue scenarios.
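The topic-granularity procedure in [0029] can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `TopicTracker`, the retention factor `alpha`, the base threshold, and the linear threshold adjustment are all assumptions chosen for illustration, and the sketch assumes that a boundary is triggered when even the best-matching centroid falls below the threshold.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class TopicTracker:
    """Hypothetical online topic tracker; all constants are illustrative."""

    def __init__(self, base_threshold=0.5, alpha=0.8):
        self.centroids = []              # the "topic vector library"
        self.base_threshold = base_threshold
        self.alpha = alpha               # moving-average retention factor (assumed)
        self.switches = 0                # topic switches observed so far
        self.segments = 0                # text segments observed so far

    def threshold(self):
        # Raise the threshold when switches are frequent, lower it when topics are stable.
        rate = self.switches / self.segments if self.segments else 0.0
        return self.base_threshold + 0.2 * (rate - 0.5)

    def observe(self, emb):
        """Return True if a forced topic boundary anchor should be generated."""
        self.segments += 1
        if not self.centroids:
            self.centroids.append(list(emb))
            return False
        sims = [cosine(emb, c) for c in self.centroids]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] < self.threshold():
            # No stored topic is close enough: open a new topic and emit an anchor.
            self.centroids.append(list(emb))
            self.switches += 1
            return True
        # Same topic: update its centroid as a moving average.
        c = self.centroids[best]
        self.centroids[best] = [self.alpha * ci + (1 - self.alpha) * ei
                                for ci, ei in zip(c, emb)]
        return False
```

A return to a previously seen topic is not treated as a boundary in this sketch; the source does not say how that case is handled.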

[0030] In a preferred embodiment of the present invention, in the cross-granularity semantic anchor extraction module, the hard constraint prohibited sentence segmentation area, the soft constraint prohibited sentence segmentation area, and the topic boundary forced anchor point constitute a hierarchical constraint set; in the directed weighted graph of the sentence segmentation candidate, for the same directed edge, the weight correction terms of the three types of constraints work together in a superimposed manner on the comprehensive weight of the directed edge; when both the hard constraint prohibited sentence segmentation area and the topic boundary forced anchor point exist within the coverage area of the same directed edge, the penalty term of the hard constraint prohibited sentence segmentation area has a higher priority than the reward term of the topic boundary forced anchor point, that is, the hard constraint penalty term is not canceled out by the existence of the reward term.

[0031] The specific implementation of the three-constraint superposition calculation is as follows: The sentence segmentation candidate graph construction module scans the positions of each token within the coverage interval of each directed edge from the starting point i to the ending point j, accumulates the weight correction amount corresponding to all constraint types within the interval, and finally obtains the comprehensive constraint correction value of the directed edge, which is then added to the base score of the directed edge weight. When both hard constraint prohibited sentence segmentation areas and topic boundary forced anchor points exist within the coverage interval of the same directed edge, the system prioritizes the superposition of hard constraint penalty items, and the upper limit of the hard constraint penalty items is not affected by the reward items. The basis for the above priority rule is that: the cutting within the phrase structure will destroy the most basic syntactic integrity under any circumstances, which is an uncompromising protective constraint; while the occurrence of topic boundaries within the phrase structure is a very low probability boundary case. When the two conflict, the protection of phrase syntactic integrity takes priority, and the topic boundary sentence segmentation operation is performed only after the hard constraint interval is exceeded. In actual dialogues, the value of hard constraint penalties is about 3 to 5 times that of soft constraint penalties, while the value of positive reward items is about the same as that of soft constraint penalties, thus ensuring that the priority order of the three types of constraints is effectively reflected in the numerical relationship.
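The superposition rule in [0031] can be sketched in Python under stated assumptions. The constants below only respect the ratios given in the text (hard penalty about 3 to 5 times the soft penalty, reward about equal to the soft penalty). The function name `edge_correction`, the set-based zone representation, and the choice to drop the reward whenever a hard-constrained token is covered are one possible reading of "the hard constraint penalty term is not canceled out by the reward term", not a confirmed detail of the invention.

```python
HARD_PENALTY = 4.0   # assumed value, roughly 3 to 5 times the soft penalty
SOFT_PENALTY = 1.0   # assumed value
TOPIC_REWARD = 1.0   # assumed value, about the same size as the soft penalty

def edge_correction(i, j, hard_zone, soft_zone, topic_anchors):
    """Constraint correction for the directed edge i -> j.

    hard_zone, soft_zone and topic_anchors are sets of token positions.
    Every position in [i, j] is scanned and its penalty accumulated,
    following the wording of the paragraph above.
    """
    penalty = 0.0
    hard_hit = False
    for t in range(i, j + 1):
        if t in hard_zone:
            penalty += HARD_PENALTY
            hard_hit = True
        elif t in soft_zone:
            penalty += SOFT_PENALTY
    # Priority rule (one reading): a topic reward never offsets a hard penalty.
    reward = TOPIC_REWARD if (j in topic_anchors and not hard_hit) else 0.0
    return reward - penalty

def edge_weight(base_score, i, j, hard_zone, soft_zone, topic_anchors):
    """Base score of the edge plus its accumulated constraint correction."""
    return base_score + edge_correction(i, j, hard_zone, soft_zone, topic_anchors)
```

With these numbers, an edge that covers two hard-constrained tokens loses 8.0 and cannot recover it through the 1.0 topic reward, which is the ordering the paragraph asks for.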

[0032] The acoustic-semantic confidence fusion scoring module is used to dynamically weight and comprehensively score the confidence of each candidate position of sentence segmentation boundary based on acoustic features, semantic integrity features, and intent state features.

[0033] As a preferred embodiment of the present invention, as shown in Figure 3, the comprehensive confidence score of each candidate position of the sentence break boundary in the acoustic-semantic confidence fusion scoring module is obtained by linearly combining three parts, namely the acoustic feature score, the semantic integrity score, and the intent state score, using dynamic weights.

[0034] The acoustic feature score is calculated by normalizing the duration of speech pauses, the rate of change of fundamental frequency, and the rate of attenuation of speech energy before and after the candidate position. The duration of speech pauses reflects the speaker's natural pausing tendency at that position, with inter-sentence pauses usually being significantly longer than intra-sentence pauses; the rate of change of fundamental frequency reflects the amplitude of pitch change at that position, with the pitch at the end of a sentence usually showing a clear downward or upward trend; the rate of attenuation of speech energy reflects the degree of attenuation of the speaker's vocal intensity at that position. After the three acoustic indicators are processed by max-min normalization, they are weighted and summed according to their respective acoustic sub-weights to obtain the acoustic feature score for the candidate position.
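As an illustration of the acoustic feature score in [0034], the following Python sketch applies max-min normalization to the three indicators and sums them with sub-weights. The normalization ranges, the sub-weight values, and the use of the absolute fundamental frequency change (so that both rising and falling pitch count) are assumptions made here, not values from the source.

```python
def min_max(value, lo, hi):
    """Max-min normalization of value into [0, 1], clamped at the range ends."""
    if hi <= lo:
        return 0.0
    return min(1.0, max(0.0, (value - lo) / (hi - lo)))

def acoustic_score(pause_s, f0_rate, energy_decay, ranges,
                   sub_weights=(0.5, 0.25, 0.25)):
    """Weighted sum of the three normalized acoustic indicators.

    ranges holds one (min, max) pair per indicator, e.g. estimated from
    recent audio; sub_weights are the acoustic sub-weights (assumed values).
    """
    # Pitch may rise or fall at a sentence end, so only the magnitude is used (assumption).
    feats = (pause_s, abs(f0_rate), energy_decay)
    normed = [min_max(v, lo, hi) for v, (lo, hi) in zip(feats, ranges)]
    return sum(w * n for w, n in zip(sub_weights, normed))

# Example call with hypothetical ranges: pause in seconds, F0 change in Hz/s, energy decay in dB/s.
example_ranges = [(0.0, 1.0), (0.0, 40.0), (0.0, 20.0)]
example = acoustic_score(0.5, -20.0, 10.0, example_ranges)
```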

[0035] The semantic integrity score is obtained by inputting the text to the left of the candidate sentence break boundary position into a pre-trained language model and calculating the probability that the text segment constitutes an independent and complete semantic unit. The pre-trained language model receives the complete token sequence to the left of the candidate boundary, performs semantic modeling on the sequence through an encoder, and adds a binary classification linear layer at the [CLS] position at the end of the sequence. The output is the probability value of the text segment as an independent and complete semantic unit. The higher the probability value, the closer the text segment is to a complete expression unit semantically, and the higher the corresponding semantic integrity score.

[0036] The intent state score is derived from the posterior probability of the intent state at that position, output by the intent state machine modeling module. Positive scores are assigned to the declarative and topic-switching states, while negative scores are assigned to the filling and error-correction states. The interrogative state is assigned a neutral score, in which case the final sentence segmentation decision is jointly determined by the acoustic feature score and the semantic integrity score. The signed value of the intent state score is calculated by multiplying the fixed signed weight corresponding to each intent state by the posterior probability of that state, thus quantifying the output of the intent state machine modeling module and integrating it into the comprehensive confidence scoring system.
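The intent state score and the linear fusion of the three scores can be sketched as follows. The state labels, the signed weights of plus or minus 1.0, and the example dynamic weights are hypothetical; the source only fixes the sign assigned to each state and the fact that the fusion is a weighted linear combination.

```python
# Fixed signed weight per intent state (magnitudes are assumed; signs follow the text).
SIGNED_WEIGHT = {
    "statement": 1.0,
    "topic_switch": 1.0,
    "filler": -1.0,
    "correction": -1.0,
    "question": 0.0,     # neutral: acoustic and semantic scores decide
}

def intent_score(posterior):
    """Sum of signed weight times posterior probability over the intent states."""
    return sum(SIGNED_WEIGHT[state] * p for state, p in posterior.items())

def fused_confidence(acoustic, semantic, intent, weights):
    """Linear combination of the three scores with dynamic weights (w_a, w_s, w_i)."""
    w_a, w_s, w_i = weights
    return w_a * acoustic + w_s * semantic + w_i * intent

# Example: a position that is mostly "statement" with some filler probability.
posterior = {"statement": 0.7, "filler": 0.2, "question": 0.1}
score = fused_confidence(0.5, 0.8, intent_score(posterior), (0.4, 0.4, 0.2))
```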

[0037] In a preferred embodiment of the present invention, the dynamic weights in the acoustic-semantic confidence fusion scoring module are adaptively adjusted based on three dimensions: the noise environment quality of the current dialogue, the speech recognition confidence estimate, and the dialogue scenario type. When the speech recognition confidence estimate is lower than a preset confidence threshold, the dynamic weight corresponding to the semantic integrity score increases, while the dynamic weight corresponding to the acoustic feature score decreases accordingly. The speech recognition confidence estimate is performed in real time by weighted fusion of the acoustic model score and the language model score output by the acoustic decoder. When the estimate is lower than the preset threshold, it indicates that the current speech signal quality or recognition accuracy is insufficient, and the reliability of the acoustic features decreases. The system automatically transfers the dynamic weight corresponding to the acoustic feature score to the dynamic weight corresponding to the semantic integrity score proportionally, allowing semantic information to dominate the sentence segmentation decision.

[0038] When the density of technical terms in the current scene output by the dialogue scene classifier exceeds a preset term density threshold, the dynamic weight corresponding to the acoustic feature score increases, while the dynamic weight corresponding to the semantic integrity score decreases accordingly. When the system detects that the current dialogue scene is in professional fields such as medicine, law, or finance, the reliability of the semantic integrity score decreases because the pre-trained language model has relatively limited semantic modeling ability for low-frequency technical terms. The system automatically increases the dynamic weight of the acoustic feature score, using acoustic signals as the primary driver for sentence segmentation in this professional scene, preventing inaccurate sentence segmentation due to insufficient terminology coverage by the semantic model.

[0039] The dialogue scenario type is output in real time by the scenario classifier based on the lexical and domain features of the current text window. The update step size of the dynamic weights is controlled by the reward signal of the online adaptive feedback learning module. The scenario classifier uses a lightweight text classification model to analyze the lexical distribution and domain word frequency of the current text window, and outputs the domain category label and confidence score of the current scenario. The sum of the three dynamic weights is constrained to a fixed constant. When one weight increases, the other weights decrease proportionally by an equal amount, ensuring that the comprehensive confidence score is consistent in scale and comparable across different scenarios.
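The weight adjustment rules of [0037] and [0038], together with the constant-sum constraint of [0039], can be sketched as a transfer of weight between two components. The threshold values, the transfer fraction, and the function names are assumptions for illustration; in the described system the step size would be controlled by the reward signal of the online adaptive feedback learning module rather than fixed.

```python
def shift_weight(weights, src, dst, fraction):
    """Move a fraction of weights[src] to weights[dst]; the total is unchanged."""
    moved = weights[src] * fraction
    out = dict(weights)
    out[src] -= moved
    out[dst] += moved
    return out

def adapt_weights(weights, asr_conf, term_density,
                  conf_threshold=0.6, term_threshold=0.3, fraction=0.25):
    """Apply the two adjustment rules; thresholds and fraction are assumed values."""
    out = dict(weights)
    if asr_conf < conf_threshold:
        # Unreliable recognition: let semantic information dominate.
        out = shift_weight(out, "acoustic", "semantic", fraction)
    if term_density > term_threshold:
        # Dense technical terms: let acoustic cues dominate.
        out = shift_weight(out, "semantic", "acoustic", fraction)
    return out

balanced = {"acoustic": 0.4, "semantic": 0.4, "intent": 0.2}
noisy = adapt_weights(balanced, asr_conf=0.5, term_density=0.1)
jargon = adapt_weights(balanced, asr_conf=0.9, term_density=0.5)
```

Because weight is only moved between components, the sum of the three weights stays at its original constant, which keeps comprehensive confidence scores comparable across scenarios.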

[0040] The sentence segmentation candidate graph construction and optimal path solving module is used to construct a directed weighted graph of sentence segmentation candidates with the Token position as the node, transform the hierarchical constraint set into graph weights, and output the sentence segmentation scheme through the optimal path solving algorithm.

[0041] As a preferred embodiment of the present invention, as shown in Figure 4, the sentence segmentation candidate graph construction and optimal path solving module uses each Token position in the current text window as a node, and a directed edge from node i to node j represents a complete sentence segment within the interval [i,j]. For a current text window of length N, the graph contains N+1 nodes, corresponding to the starting position node and each Token position node respectively; directed edges point only from the node with the smaller number to the node with the larger number, which guarantees that the graph is directed and acyclic and allows the optimal path solving algorithm to run efficiently by dynamic programming.

[0042] The overall weight of a directed edge consists of three superimposed components: the overall confidence score at the endpoint j, the sum of penalty terms transformed by cross-granularity semantic anchor constraints within the interval [i,j], and the topic boundary reward term at the endpoint j. The overall confidence score reflects the linguistic rationality of segmentation at position j; the sum of penalty terms reflects the multi-granularity constraint pressure faced when treating the text within the interval [i,j] as a whole segment, with the large positive penalty corresponding to hard constraint prohibiting segmentation and the small positive penalty corresponding to soft constraint prohibiting segmentation both added to this term; the topic boundary reward term takes a positive value when the endpoint j is exactly located at the mandatory anchor point position of the topic boundary, otherwise it takes zero, encouraging the optimal path to segment at the topic boundary.

[0043] The optimal path finding module employs an improved Viterbi algorithm to solve for the highest comprehensive weight path from the starting node to the ending node on the directed weighted graph of sentence segmentation candidates. The node sequence corresponding to the highest comprehensive weight path is output as the sentence segmentation scheme for the current text window. The improved Viterbi algorithm introduces two improvements on the standard Viterbi dynamic programming framework: First, it introduces an intent state consistency check during path expansion. When a path expands to a node, if the intent state of that node undergoes an unreasonable jump compared to the dominant intent state of the preceding path, a state transition inconsistency penalty is applied to the cumulative weight of that path. Second, during the path pruning stage, path branches covering tokens within the hard-constrained prohibited segmentation area are pruned in advance, reducing the search space size and improving algorithm execution efficiency while ensuring global optimality. For streaming recognition scenarios, graph construction and path finding are executed batch by batch using a sliding text window approach, with a certain overlap between adjacent windows to ensure the semantic coherence of cross-window sentence segmentation results.
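The core of the path search can be sketched as a highest-weight path computation over the directed acyclic graph of N+1 nodes. This is plain dynamic programming, not the improved Viterbi algorithm itself: the intent state consistency penalty is omitted, and pruning is represented only by a caller-supplied predicate `prune`, which is a hypothetical hook. The edge weight function is also supplied by the caller.

```python
def best_segmentation(n, edge_weight, prune=lambda i, j: False):
    """Highest-weight path from node 0 to node n in the candidate DAG.

    edge_weight(i, j) returns the comprehensive weight of the edge i -> j;
    prune(i, j) returns True for edges that should be skipped.
    Returns (list of nodes on the path, total weight).
    """
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)   # best[j]: best cumulative weight reaching node j
    back = [None] * (n + 1)      # back[j]: predecessor of j on that best path
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):       # edges only go from smaller to larger node numbers
            if best[i] == neg_inf or prune(i, j):
                continue
            cand = best[i] + edge_weight(i, j)
            if cand > best[j]:
                best[j] = cand
                back[j] = i
    if best[n] == neg_inf:
        return [], neg_inf       # every route to the end node was pruned
    path = [n]
    while path[-1] != 0:
        path.append(back[path[-1]])
    return path[::-1], best[n]

# Toy example: the edge weight depends only on the confidence at the endpoint j.
conf = {1: -1.0, 2: -1.0, 3: 2.0, 4: -1.0, 5: -1.0, 6: 1.0}
path, total = best_segmentation(6, lambda i, j: conf[j])
```

In the toy example the path passes through node 3, the only interior position with a positive score, so the window is split into two segments.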

[0044] The online adaptive feedback learning module is used to collect downstream task consistency feedback signals and implicit user behavior feedback signals, and to update the dynamic weighting coefficients and intent state transition probabilities online.

[0045] As a preferred embodiment of the present invention, as shown in Figure 5, the online adaptive feedback learning module collects two types of feedback signals and performs a different parameter update operation for each.

[0046] Regarding the downstream task consistency feedback signal: when the downstream intent recognition module returns an ambiguous intent marker for a certain segment, a negative reward signal is generated; when the downstream intent recognition module returns a high-confidence intent recognition result for a certain segment, a positive reward signal is generated. Based on the reward signal, the online adaptive feedback learning module updates the dynamic weight coefficients in the acoustic-semantic confidence fusion scoring module in small steps using an online gradient update method. The ambiguous intent marker in the downstream intent recognition module is automatically triggered when the intent recognition confidence falls below a preset confidence threshold. The online gradient update updates the dynamic weight coefficients according to the sign and amplitude of the reward signal, along the direction that enhances the correlation between the comprehensive confidence score and the downstream task confidence. The step size of each update is set to a small fixed value to ensure that the disturbance to the overall system parameters of a single update is within a controllable range.

[0047] For implicit feedback signals from user behavior: When a user verbally corrects the previous round of voice input after the system's response, the online adaptive feedback learning module recognizes this behavior as a negative signal of the previous sentence segmentation result and triggers an update of the relevant intent state transition probability matrix in the intent state machine modeling module. The recognition mechanism for user verbal correction behavior is as follows: The intent state machine modeling module detects whether a sequence of error-correction state marker words appears in the user's new round of voice input. If it appears and the semantic relevance of the current speech content to the previous system response exceeds a preset threshold, then this event is recorded as an implicit negative feedback signal from user behavior, triggering an incremental update of the error-correction state-related rows and columns in the intent state transition probability matrix, so that the system's sensitivity to error-correction state recognition in similar contexts gradually improves.

[0048] The online adaptive feedback learning module employs an exponential moving average mechanism to smooth the update amounts of both types of parameters, preventing single abnormal feedback signals from causing parameters to exceed the preset stability range. The decay coefficient of the exponential moving average mechanism is set to a value close to 1, ensuring that historical parameter values maintain a strong constraint on the current parameter. The influence of historical parameters decays slowly over time, ensuring that system parameters evolve smoothly with feedback signals. In addition, an upper limit constraint is set for each parameter update amount to prevent occasional abnormal feedback (such as user misoperation or occasional failures of downstream modules) from causing excessive deviations in parameters in a single update, thus ensuring the real-time service stability of the system during the online adaptive learning process.
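The smoothing described in [0048] can be sketched as an exponential moving average followed by a cap on the size of a single update. The decay coefficient of 0.95, the step cap, and the stability range are illustrative assumptions; `proposed` stands for whatever value the feedback signal suggests for the parameter.

```python
def smoothed_update(current, proposed, decay=0.95, max_step=0.05, lo=0.0, hi=1.0):
    """Blend a proposed parameter value into the current one.

    decay close to 1 keeps the historical value dominant; max_step bounds the
    change made by a single update; [lo, hi] is the preset stability range.
    """
    target = decay * current + (1.0 - decay) * proposed
    step = max(-max_step, min(max_step, target - current))
    return min(hi, max(lo, current + step))

# A normal feedback value moves the parameter only slightly ...
w_normal = smoothed_update(0.4, 0.6)
# ... and an abnormal one is limited by the per-update cap.
w_outlier = smoothed_update(0.4, 10.0)
```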

[0049] The sentence segmentation output and posterior correction module is used to merge and revise the output sentence segmentation results within a controlled time window based on the dual conditions of the difference in comprehensive confidence score and consistency of intent state, and output the final sentence segmentation text.

[0050] As a preferred embodiment of the present invention, as shown in Figure 6, the sentence segmentation output and posterior correction module performs the merge and revision operation only when the following two conditions are met simultaneously: The first condition is that, within a controlled time window, the difference between the recalculated overall confidence score of two adjacent sentence segments after merging and the sum of the overall confidence scores of the two sentence segments before merging exceeds a preset revision threshold. The merged overall confidence score is obtained by recalculating the score of the merged text segment with the acoustic-semantic confidence fusion scoring module; the preset revision threshold is determined from the statistical distribution of historical sentence segmentation results, and its value is chosen so that segments that truly need merging and revision are clearly distinguishable, in terms of the difference distribution, from those that do not.

[0051] The second condition is that the intent state annotation results corresponding to the merged text segment of two adjacent sentence breaks belong to a single intent state type. The intent state annotation results are obtained by the intent state machine modeling module re-annotating the sequence of the merged text segment. If the annotation results show that all tokens in the merged text segment belong to the same intent state type, then the second condition is satisfied.
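The two conditions can be sketched as a single predicate. The function name, the argument layout, and the example values are hypothetical; the comparison follows the wording above, in which the merged score is compared with the sum of the two original scores.

```python
def should_merge(score_a, score_b, merged_score, merged_states, revise_threshold):
    """Return True only if both merge conditions hold.

    score_a, score_b: comprehensive confidence scores of the two adjacent segments.
    merged_score: score recomputed for the merged text segment.
    merged_states: intent state label of each token in the merged segment.
    """
    # Condition 1: the score difference exceeds the preset revision threshold.
    gain = merged_score - (score_a + score_b)
    # Condition 2: every token of the merged segment has the same intent state.
    single_intent = len(set(merged_states)) == 1
    return gain > revise_threshold and single_intent

same_intent = should_merge(0.3, 0.2, 0.9, ["statement"] * 4, 0.3)
mixed_intent = should_merge(0.3, 0.2, 0.9, ["statement", "question"], 0.3)
```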

[0052] When the first condition is met but the second condition is not met (i.e., the merged text segment spans two or more intent state types), the sentence segmentation output and posterior correction module does not perform a merge revision operation and outputs the original sentence segmentation result: even though the difference in the comprehensive confidence score exceeds the preset revision threshold, the system maintains the original sentence segmentation boundary on the grounds of inconsistent intent states, preventing adjacent segments with different speaker intents from being incorrectly merged into the same semantic unit. Only when both conditions are met simultaneously does the system determine that the original segmentation was too fine; it then performs a merge revision operation, merges the two adjacent sentence segments into one sentence segment before outputting, and inputs the revision result along with the revision mark into the online adaptive feedback learning module for reference during subsequent feedback signal processing. The length of the controlled time window is dynamically configured based on the real-time requirements of downstream tasks. In real-time dialogue scenarios, the time window is set shorter, while in offline transcription scenarios, the time window can be appropriately extended to achieve higher accuracy in posterior revisions.

Example

[0053] In a customer service conversation, the user's voice input is: "The one I bought... um... the phone I bought last time, its... charger seems to be faulty, it won't charge, and I want to ask if I can return it?" The intent state machine modeling module annotates the above token sequence with intent states, identifying "um," "that's," and "that" as filling state intervals, and performing suppression operations on all candidate positions for sentence break boundaries within these intervals; it identifies "The charger seems to have a problem and can't charge" as a statement state, retaining its candidate sentence break boundaries; it identifies "Then I want to ask something" as a topic switching state, generating a high-priority candidate sentence break boundary at this position and applying a positive enhancement marker. The cross-granularity semantic anchor extraction module identifies that "The charger seems to have a problem" and "can't charge" belong to the same clause granularity semantic continuation, marking the token positions between them as soft-constrained prohibited sentence break areas; simultaneously, it identifies "Then I want to ask something" as triggering a topic granularity boundary, generating a forced topic boundary anchor at this position. Because the conversation takes place in a low-noise indoor environment, the speech recognition confidence estimate in the acoustic-semantic confidence fusion scoring module is at a normal level, and the three dynamic weights are configured in balance. The sentence segmentation candidate graph construction and optimal path solving module solves the optimal sentence segmentation path using the improved Viterbi algorithm, outputting the sentence segmentation results as:

The phone I bought last time seems to have a problem with its charger and won't charge

Then I want to ask if I can return it

[0054] In meeting transcription scenarios, multiple speakers take turns speaking, frequently switching between topics such as product development progress, marketing strategies, and financial budgets. The online incremental topic segmentation algorithm in the cross-granularity semantic anchor extraction module continuously tracks the centroid vectors of each topic. At the moment of a topic switch, it finds that the cosine similarity between the semantic embedding vector of the current text segment and the existing topic centroid vectors is below the dynamic threshold, and it generates forced topic boundary anchors at the corresponding positions, ensuring that cross-topic content is segmented correctly. When a speaker uses dense professional financial terminology, the scene classifier in the acoustic-semantic confidence fusion scoring module identifies that the terminology density exceeds the preset threshold, and the module automatically increases the dynamic weight corresponding to the acoustic feature score, using acoustic signals as the primary driver for sentence segmentation in this professional scenario and preventing inaccurate sentence segmentation due to insufficient terminology coverage in the pre-trained language model. If, in a subsequent round of speech, the optimal path of the sentence segmentation candidate graph wrongly splits two adjacent segments, and the sentence segmentation output and posterior correction module detects within the controlled time window that the difference in comprehensive confidence score after merging exceeds the preset revision threshold and that the intent state annotation result of the merged text segment belongs to a single statement state, the dual conditions are satisfied; merging and revision is then performed, and the revision event record is transmitted to the online adaptive feedback learning module to trigger a small-step online update of the dynamic weight coefficients.

[0055] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the statement "comprising a…" does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0056] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A speech recognition text segmentation accuracy optimization system, characterized in that it comprises: an intent state machine modeling module, used to annotate the intent state of the token sequence output by streaming speech recognition, obtain the posterior probability distribution of the intent state at each token position, and suppress, retain, or forcibly mark the generation of candidate positions for sentence segmentation boundaries according to the intent state type; a cross-granularity semantic anchor extraction module, used to extract semantic anchors at three levels, namely phrase granularity, clause granularity, and topic granularity, and generate a hierarchical constraint set for candidate positions of sentence segmentation boundaries; an acoustic-semantic confidence fusion scoring module, used to dynamically weight and comprehensively score the confidence of each candidate position of sentence segmentation boundary based on acoustic features, semantic integrity features, and intent state features; a sentence segmentation candidate graph construction and optimal path solving module, used to construct a directed weighted graph of sentence segmentation candidates with Token positions as nodes, transform the hierarchical constraint set into graph weights, and output the sentence segmentation scheme through the optimal path solving algorithm; an online adaptive feedback learning module, used to collect downstream task consistency feedback signals and implicit user behavior feedback signals, and to update the dynamic weighting coefficients and intent state transition probabilities online; and a sentence segmentation output and posterior correction module, used to merge and revise the output sentence segmentation results within a controlled time window based on the dual conditions of the difference in comprehensive confidence score and consistency of intent state, and output the final sentence segmentation text.

2. The speech recognition text segmentation accuracy optimization system according to claim 1, characterized in that the intent state machine modeling module divides the speaker's intent state into five types: declarative state, interrogative state, filling state, error correction state, and topic switching state; the intent state machine modeling module adopts a sequence labeling model that combines a bidirectional long short-term memory network with a conditional random field to calculate the posterior probability of the intent state of each token in the current text window in real time; and the intent state transition probability matrix of the intent state machine modeling module is obtained through offline training on multi-round dialogue corpora and is updated online through the online adaptive feedback learning module.

3. The speech recognition text segmentation accuracy optimization system according to claim 2, characterized in that the intent state machine modeling module processes candidate positions for sentence segmentation boundaries according to the intent state type as follows: when the intent state to which the Token belongs is the filling state, the intent state machine modeling module performs a suppression operation on all candidate positions of sentence break boundaries within the Token interval, and no candidate sentence break boundaries are generated within the interval; when the intent state to which the Token belongs is the error correction state, the intent state machine modeling module marks the text segments before and after the error correction flag as the same semantic segment, preventing the generation of sentence break boundary candidates on both sides of the error correction flag; when the intent state to which the Token belongs is the topic switching state, the intent state machine modeling module generates a high-priority sentence segmentation boundary candidate at the position corresponding to the Token, and applies a positive enhancement mark to the comprehensive confidence score of the candidate position.

4. The speech recognition text segmentation accuracy optimization system according to claim 1, characterized in that the specific method by which the cross-granularity semantic anchor extraction module extracts semantic anchors at three granularities is as follows: at the phrase granularity level, the cross-granularity semantic anchor extraction module uses dependency parsing to parse the phrase structure of the current text window, extracts the boundary range of complete noun phrases, verb phrases and prepositional phrases, and marks each token position inside the above complete phrase structure as a hard constraint prohibited sentence segmentation area, the sentence segmentation boundary candidate positions in the hard constraint prohibited sentence segmentation area being reduced in weight in the directed weighted graph of sentence segmentation candidates by applying a large positive value penalty term; at the clause granularity level, the cross-granularity semantic anchor extraction module identifies the clause boundary positions guided by conjunctions, transition words, and causal markers through the clause boundary detection model, marks the clause boundary positions as soft-constraint sentence break point candidates, and marks each token position inside the clause as a soft-constraint prohibited sentence break area, the sentence break boundary candidate positions within the soft-constraint prohibited sentence break area being downweighted by applying a small positive value penalty term; at the topic granularity level, the cross-granularity semantic anchor extraction module adopts an online incremental topic segmentation algorithm to calculate the cosine similarity between the semantic embedding vector of the current text segment and the centroid vectors of each topic in the topic vector library, and when the cosine similarity is lower than the topic boundary threshold dynamically adjusted according to the topic density of the dialogue history, a forced topic boundary anchor is generated at that position, the forced topic boundary anchor being weighted in the directed weighted graph of the sentence segmentation candidate by applying a positive reward term.

5. The speech recognition text segmentation accuracy optimization system according to claim 4, characterized in that, in the cross-granularity semantic anchor extraction module, the hard-constraint prohibited sentence segmentation area, the soft-constraint prohibited sentence segmentation area, and the forced topic boundary anchor constitute a hierarchical constraint set. In the directed weighted graph of sentence segmentation candidates, the weight correction terms of the three types of constraints that apply to the same directed edge are superimposed on the comprehensive weight of that directed edge. When both a hard-constraint prohibited sentence segmentation area and a forced topic boundary anchor exist within the coverage range of the same directed edge, the penalty term of the hard-constraint prohibited sentence segmentation area takes priority over the reward term of the forced topic boundary anchor, that is, the hard-constraint penalty term is not offset by the reward term.
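
The superposition and priority rule of claim 5 can be sketched as follows. The constant values are hypothetical (the claims only say "large", "small", and "positive"). Withholding the reward whenever a hard constraint is present is one possible reading of "not offset"; the claims do not specify the mechanism.

```python
# Hypothetical magnitudes; the claims give no concrete values.
HARD_PENALTY = 100.0  # "large positive value" penalty for hard-constraint areas
SOFT_PENALTY = 1.0    # "small positive value" penalty for soft-constraint areas
TOPIC_REWARD = 5.0    # positive reward for a forced topic boundary anchor


def edge_correction(in_hard_area, in_soft_area, has_topic_anchor):
    """Combined weight correction for one directed edge.

    Penalties lower the edge weight and the reward raises it. When a hard
    constraint applies, the topic reward is withheld so that it cannot
    offset the hard penalty (an assumed interpretation of the priority rule).
    """
    correction = 0.0
    if in_hard_area:
        correction -= HARD_PENALTY
    if in_soft_area:
        correction -= SOFT_PENALTY
    if has_topic_anchor and not in_hard_area:
        correction += TOPIC_REWARD
    return correction
```

Under these assumed constants, an edge covering both a hard-constraint area and a topic anchor receives the full penalty of 100 and no reward, whereas a soft-constraint area combined with a topic anchor nets out to +4.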

6. The speech recognition text segmentation accuracy optimization system according to claim 1, characterized in that the acoustic-semantic confidence fusion scoring module obtains the comprehensive confidence score for each sentence segmentation boundary candidate position by dynamically weighting three components: an acoustic feature score, a semantic integrity score, and an intent state score. The acoustic feature score is calculated by normalizing the speech pause duration, the fundamental frequency change rate, and the speech energy attenuation rate before and after the candidate position. The semantic integrity score is obtained by inputting the text to the left of the sentence segmentation boundary candidate position into a pre-trained language model and calculating the probability that the text segment constitutes an independent and complete semantic unit. The intent state score is taken from the posterior probability of the intent state at that position output by the intent state machine modeling module, wherein the statement state and the topic switching state correspond to positive scores, and the filling state and the error correction state correspond to negative scores.
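
A minimal sketch of the weighted fusion in claim 6 is given below. The state labels, the signed-sum form of the intent state score, the zero contribution of any other state, and the plain linear combination are illustrative assumptions. The claim only fixes the sign associated with each of the four named states and the fact that the three components are dynamically weighted.

```python
# Hypothetical labels for the four states named in claim 6.
POSITIVE_STATES = {"statement", "topic_switching"}
NEGATIVE_STATES = {"filling", "error_correction"}


def intent_state_score(posterior):
    """Signed score from an intent-state posterior distribution.

    Probability mass on statement and topic-switching states counts
    positively; mass on filling and error-correction states counts
    negatively; any other state contributes zero (assumption).
    """
    score = 0.0
    for state, prob in posterior.items():
        if state in POSITIVE_STATES:
            score += prob
        elif state in NEGATIVE_STATES:
            score -= prob
    return score


def fused_confidence(acoustic, semantic, intent, weights):
    """Weighted sum of the three component scores.

    `weights` is a (w_acoustic, w_semantic, w_intent) tuple, assumed to be
    supplied by the dynamic weighting step.
    """
    w_acoustic, w_semantic, w_intent = weights
    return w_acoustic * acoustic + w_semantic * semantic + w_intent * intent
```

As a usage example, a position with posterior `{"statement": 0.7, "filling": 0.3}` yields an intent state score of about 0.4, which is then combined with the acoustic feature score and the semantic integrity score.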

7. The speech recognition text segmentation accuracy optimization system according to claim 6, characterized in that the dynamic weights in the acoustic-semantic confidence fusion scoring module are adaptively adjusted based on three dimensions: the noise environment quality of the current dialogue, the speech recognition confidence estimate, and the dialogue scenario type. When the speech recognition confidence estimate is lower than a preset confidence threshold, the dynamic weight corresponding to the semantic integrity score increases, and the dynamic weight corresponding to the acoustic feature score decreases accordingly. When the density of technical terms in the current scenario, as output by the dialogue scenario classifier, is higher than a preset term density threshold, the dynamic weight corresponding to the acoustic feature score increases, and the dynamic weight corresponding to the semantic integrity score decreases accordingly. The dialogue scenario type is output in real time by the dialogue scenario classifier based on the lexical features and domain features of the current text window, and the update step size of the dynamic weights is controlled by the reward signal of the online adaptive feedback learning module.
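
The two threshold rules in claim 7 can be sketched as a weight-shifting step. The dictionary keys, the threshold defaults, the fixed `step`, and the choice to move weight directly between the two affected components are hypothetical. In the claimed system the step size is controlled by the reward signal, and the noise environment dimension, for which the claim gives no explicit rule, is omitted here.

```python
def adjust_weights(weights, asr_confidence, term_density,
                   conf_threshold=0.6, term_threshold=0.2, step=0.05):
    """Return a new weight dict after applying the two claim-7 rules.

    `weights` has the keys "acoustic", "semantic", and "intent". Weight is
    moved between the acoustic and semantic components so the total is
    preserved, and a component is never driven below zero.
    """
    w = dict(weights)
    if asr_confidence < conf_threshold:
        # Low recognition confidence: shift weight from acoustic to semantic.
        shift = min(step, w["acoustic"])
        w["semantic"] += shift
        w["acoustic"] -= shift
    if term_density > term_threshold:
        # High technical-term density: shift weight from semantic to acoustic.
        shift = min(step, w["semantic"])
        w["acoustic"] += shift
        w["semantic"] -= shift
    return w
```

Note that when both conditions hold at once, the two shifts in this sketch roughly cancel; how the claimed system resolves that case is not stated in the claims.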

8. The speech recognition text segmentation accuracy optimization system according to claim 1, characterized in that the sentence segmentation candidate graph construction and optimal path solving module takes each token position in the current text window as a node, and uses a directed edge from node i to node j to represent the text in the interval [i,j] as one complete sentence segment. The comprehensive weight of the directed edge is composed of the following three parts: the comprehensive confidence score at the endpoint position j, the sum of the penalty terms derived from the cross-granularity semantic anchor constraints within the interval [i,j], and the topic boundary reward term at the endpoint position j. The optimal path solving module uses an improved Viterbi algorithm to solve for the path with the highest comprehensive weight from the start node to the end node on the directed weighted graph of sentence segmentation candidates, and outputs the node sequence corresponding to that path as the sentence segmentation scheme of the current text window.
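
The path search in claim 8 can be illustrated with a plain dynamic program over a directed acyclic graph. This is a standard Viterbi-style recursion, not the "improved Viterbi algorithm" of the claim, whose modifications are not described here. The function names and the per-position lists `confidence`, `penalty`, and `reward` are assumptions. Penalties are subtracted over the closed interval [i, j], following the literal wording of the claim.

```python
def make_edge_weight(confidence, penalty, reward):
    """Build an edge-weight function from three per-position lists.

    weight(i, j) = confidence[j] - sum(penalty[i..j]) + reward[j]
    """
    def edge_weight(i, j):
        return confidence[j] - sum(penalty[i:j + 1]) + reward[j]
    return edge_weight


def best_segmentation(n_nodes, edge_weight):
    """Highest-weight path from node 0 to node n_nodes - 1.

    Every pair i < j is connected by a directed edge i -> j. Returns the
    node sequence of the best path and its total weight.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    best = [float("-inf")] * n_nodes  # best[j]: best total weight reaching j
    back = [-1] * n_nodes             # back[j]: predecessor of j on that path
    best[0] = 0.0
    for j in range(1, n_nodes):
        for i in range(j):
            candidate = best[i] + edge_weight(i, j)
            if candidate > best[j]:
                best[j] = candidate
                back[j] = i
    # Walk the back-pointers from the end node to recover the path.
    path = [n_nodes - 1]
    while path[-1] != 0:
        path.append(back[path[-1]])
    path.reverse()
    return path, best[-1]
```

Because the intervals are closed, a penalized position that is chosen as a boundary is counted in both adjoining edges, so the search tends to avoid cutting there. The double loop costs O(n²) edge evaluations per text window, which is acceptable for short windows; a production system would likely bound the maximum segment length.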

9. The speech recognition text segmentation accuracy optimization system according to claim 1, characterized in that the online adaptive feedback learning module collects two types of feedback signals and performs a different parameter update operation for each. For the downstream task consistency feedback signal: when the downstream intent recognition module returns an ambiguous intent marker for a sentence segment, a negative reward signal is generated; when the downstream intent recognition module returns a high-confidence intent recognition result for a sentence segment, a positive reward signal is generated; the online adaptive feedback learning module uses an online gradient update method to update the dynamic weight coefficients in the acoustic-semantic confidence fusion scoring module in small steps according to the reward signal. For the implicit user behavior feedback signal: when a user verbally corrects the content of the previous round of voice input after the system answers, the online adaptive feedback learning module treats the behavior as a negative signal for the previous round's sentence segmentation result and triggers an update of the relevant intent state transition probability matrix in the intent state machine modeling module. The online adaptive feedback learning module uses an exponential moving average mechanism to smooth the update amounts of both types of parameters, so that a single abnormal feedback signal does not drive the parameters beyond the preset stable range.
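
The smoothing step in claim 9 can be sketched as an exponential moving average over update amounts. The class name, the smoothing factor `alpha`, and the explicit clamp to `[lower, upper]` are assumptions. The claim states only that smoothing keeps parameters within a preset stable range, so the clamp is an added safeguard for illustration.

```python
class SmoothedParameter:
    """A scalar parameter whose updates are smoothed by an exponential moving average."""

    def __init__(self, value, lower, upper, alpha=0.1):
        self.value = value
        self.lower = lower            # lower end of the preset stable range
        self.upper = upper            # upper end of the preset stable range
        self.alpha = alpha            # EMA smoothing factor (hypothetical default)
        self.smoothed_update = 0.0    # running average of update amounts

    def apply(self, raw_update):
        """Blend the raw update into the running average, then apply it.

        The result is clamped to the stable range as an extra safeguard
        (assumption, not stated in the claim).
        """
        self.smoothed_update = (
            (1.0 - self.alpha) * self.smoothed_update + self.alpha * raw_update
        )
        self.value = min(self.upper, max(self.lower, self.value + self.smoothed_update))
        return self.value
```

With `alpha=0.1`, a single raw update of 1.0 moves the parameter by only 0.1, which shows how one outlying feedback signal is damped before it reaches the weight coefficients or the transition probabilities.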

10. The speech recognition text segmentation accuracy optimization system according to claim 1, characterized in that the merge revision operation performed by the sentence segmentation output and posterior correction module must simultaneously meet the following two conditions. The first condition is that, within a controlled time window, the difference between the comprehensive confidence score recalculated after merging two adjacent sentence segments and the sum of the comprehensive confidence scores of the two sentence segments before merging exceeds a preset revision threshold. The second condition is that the intent state annotation result corresponding to the text segment obtained by merging the two adjacent sentence segments belongs to a single intent state type. When the first condition is met but the second condition is not, that is, when the merged text segment spans two or more intent state types, the sentence segmentation output and posterior correction module does not perform the merge revision operation and keeps the original sentence segmentation result as output. When both conditions are met, the sentence segmentation output and posterior correction module performs the merge revision operation and merges the two adjacent sentence segments into one sentence segment before outputting it.
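
The two-condition check of claim 10 reduces to a small predicate, sketched below. The function and parameter names are hypothetical, and the `within_window` flag stands in for the time-window test, whose details are not given in the claim.

```python
def should_merge(merged_score, left_score, right_score,
                 merged_intent_states, revision_threshold, within_window=True):
    """Decide whether two adjacent sentence segments should be merged.

    First condition: inside the time window, the merged segment's score
    exceeds the sum of the two original scores by more than the threshold.
    Second condition: every token of the merged segment carries the same
    intent state type.
    """
    gain = merged_score - (left_score + right_score)
    first_condition = within_window and gain > revision_threshold
    second_condition = len(set(merged_intent_states)) == 1
    return first_condition and second_condition
```

For instance, a merged score of 1.9 against original scores of 0.6 and 0.7 clears a threshold of 0.5, but the merge is still rejected if the merged text mixes a statement state with a filling state.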