AI multi-turn question answering evaluation methods, equipment, and storage media
By constructing a global context feature pool and applying dual-model evaluation, the method addresses the problem of valueless corpora arising in multi-turn AI generation and quantitatively identifies such content, improving the accuracy of quality identification and the efficiency of value screening.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LINGE TECHNOLOGY CO LTD
- Filing Date
- 2026-04-30
- Publication Date
- 2026-06-30
AI Technical Summary
Existing AI question-answering evaluation methods cannot quantify the quality and innovation of AI's multi-round generation in real time and with context awareness. This results in the generated content exhibiting probability mean regression, forming a uniform plateau without any meaningful content, and failing to provide early warnings.
By constructing a global context feature pool, combining the output text with a weighted evaluation of the first and second preset models, an absolute quality evaluation value is obtained. The output text is then compared with the previous round of global context feature pool based on preset dimensions to obtain a relative incremental evaluation value. Finally, the weighted fusion is used to obtain the text evaluation value, thus achieving full-process quantitative identification of AI-generated content in multiple rounds.
It accurately identifies homogeneous and worthless corpora generated by AI in multiple rounds, improves the accuracy of quality identification and the efficiency of value screening, and overcomes the defects of worthless corpora formed by probability mean regression.
Figure CN122309681A_ABST
Description
Technical Field
[0001] This application relates to the field of large model corpus evaluation technology, and in particular to an AI multi-turn question answering evaluation method, device and storage medium.
Background Technology
[0002] AI question answering is developing rapidly. However, because AI models are prone to hallucination, the generated results need to be evaluated. Current evaluations are static, post-hoc checks that cannot provide real-time, context-aware quality and innovation metrics for each round of AI output. Multi-round AI generation naturally exhibits probabilistic mean regression, yet existing multi-round text evaluation only scores each round's output statically and does not use the historical context to determine whether the current output contains new information. As a result, the output easily drifts into a uniform plateau that is linguistically well-formed but devoid of content. Such problems cannot be anticipated and can only be discovered retrospectively.
Summary of the Invention
[0003] The main purpose of this application is to provide an AI multi-turn question answering evaluation method, device and storage medium, aiming to solve the technical problem that AI multi-turn generation naturally exhibits probability mean regression, which produces valueless corpora.
[0004] To achieve the above objectives, this application provides an AI multi-turn question-answering evaluation method, which includes: In response to the input information from the task configuration control, a global context feature pool is constructed based on the global task parsed from the input information and the baseline evaluation value of the domain to which the global task belongs. In response to the output text of the large language model in the nth round, the output text is evaluated based on a weighted average of the first and second preset models to obtain an absolute quality evaluation value; Based on a preset dimension, the output text is compared with the global context feature pool corresponding to the (n-1)th round to obtain a relative incremental evaluation value; Based on the global task completion level of the output text, the absolute quality evaluation value and the relative incremental evaluation value are weighted and fused to obtain the text evaluation value.
[0005] In one embodiment, constructing a global context feature pool based on the global task obtained from the parsed input information and the baseline evaluation value of the domain to which the global task belongs includes: The input information of the task configuration control is parsed to obtain the generation target, core theme, style type, length requirement, and domain of the global task. Based on the domain, the pre-built domain text library is retrieved and the baseline evaluation value is calculated, which includes a first model benchmark value and a second model benchmark value. Based on the generation target, core theme, style type, and length requirement, and combined with the baseline evaluation value, the global context feature pool is initialized. The global context feature pool includes at least a global core concept set, a global topic semantic vector, a global statistical baseline, a style adaptation threshold, a length adaptation threshold, and a plateau determination threshold.
[0006] In one embodiment, the step of weighting the output text based on a preset first model and a second model to obtain an absolute quality evaluation value includes: The output text is processed based on the first model to obtain a first model evaluation value, and the output text is processed based on the second model to obtain a second model evaluation value. The first model includes a reward module, a penalty module, and a compensation module, and the second model includes a preset number of expression factor dimensions. Based on the text type, the evaluation values of the first model and the second model are fused to generate stable field labels and local features corresponding to the input text; The absolute quality evaluation value of the output text is generated based on the stable field label and local features.
[0007] In one embodiment, comparing the output text with the global context feature pool corresponding to the (n-1)th round based on a preset dimension to obtain a relative incremental evaluation value includes: Extract the core concepts, semantic vector, style features, and length features of the output text for this round; Compare the core concepts of this round with the global core concept set in the global context feature pool of the (n-1)th round, and calculate the proportion of newly added core concepts; Compare the semantic vector of this round with the global topic semantic vector in the global context feature pool of the (n-1)th round to calculate the topic matching degree; Compare the style features and length features of this round with the style adaptation threshold and length adaptation threshold in the global context feature pool of the (n-1)th round, respectively, and calculate the feature adaptation degree; Based on the global statistical baseline in the global context feature pool of the (n-1)th round, calculate the abstraction-level leap degree and text redundancy degree of the output text; Integrate the proportion of newly added core concepts, topic matching degree, feature adaptation degree, abstraction-level leap degree, and text redundancy degree to obtain the relative incremental evaluation value.
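Two of the comparisons in this embodiment can be illustrated with a minimal Python sketch. The specification does not give formulas, so this sketch assumes that the proportion of newly added core concepts is the share of this round's concepts absent from the global set, and that topic matching is cosine similarity between the two semantic vectors. The function names are hypothetical.

```python
import math


def new_concept_ratio(round_concepts, global_concepts):
    """Share of this round's core concepts not yet in the global set.

    Assumption: the denominator is the number of distinct concepts in
    this round; the specification does not fix it.
    """
    round_set = set(round_concepts)
    if not round_set:
        return 0.0
    return len(round_set - set(global_concepts)) / len(round_set)


def topic_match(round_vector, global_vector):
    """Cosine similarity between the round vector and the global topic vector.

    Assumption: cosine similarity stands in for the unspecified
    topic matching degree.
    """
    dot = sum(a * b for a, b in zip(round_vector, global_vector))
    norm = math.sqrt(sum(a * a for a in round_vector)) * math.sqrt(
        sum(b * b for b in global_vector)
    )
    return dot / norm if norm else 0.0
```

A round that only repeats concepts already in the pool yields a ratio of 0, which is the low-increment signal the comparison is meant to expose.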
[0008] In one embodiment, after obtaining the text evaluation value by weighted fusion of the absolute quality evaluation value and the relative incremental evaluation value based on the global task completion degree of the output text, the process includes: Obtain the first model evaluation value, the second model evaluation value, and the relative incremental evaluation value corresponding to the output text; The evaluation values of the first model, the second model, and the relative incremental evaluation values are compared one by one with the plateau determination threshold in the global context feature pool. When the evaluation value of the second model meets the standard, the evaluation value of the first model does not meet the standard, and the relative incremental evaluation value does not meet the standard, the single-round low incremental degradation label of the output text is determined to be the first label; Otherwise, the single-round low-increment degradation label of the output text is determined to be the second label.
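The labeling rule above can be sketched as follows. This is an illustrative reading rather than the claimed implementation: it assumes the plateau determination threshold holds one value per compared score and that "meets the standard" means the score is greater than or equal to its threshold. `PlateauThresholds`, `FIRST_LABEL`, and `SECOND_LABEL` are hypothetical names.

```python
from dataclasses import dataclass

FIRST_LABEL = 1   # hypothetical encoding: single-round low-increment degradation
SECOND_LABEL = 0  # hypothetical encoding: no degradation detected


@dataclass
class PlateauThresholds:
    # Assumption: one threshold per score, all on the same 0..1 scale.
    first_model: float
    second_model: float
    relative_increment: float


def single_round_label(first_score, second_score, increment, thresholds):
    """Return FIRST_LABEL when expression is fine but thought and increment are weak."""
    expression_ok = second_score >= thresholds.second_model
    thought_weak = first_score < thresholds.first_model
    increment_weak = increment < thresholds.relative_increment
    if expression_ok and thought_weak and increment_weak:
        return FIRST_LABEL
    return SECOND_LABEL
```

The conjunction captures the plateau pattern described in the background: text that is well expressed (second model passes) but adds neither substance nor new information.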
[0009] In one embodiment, the method further includes: Count the first labels within a preset number of rounds to generate a multi-round low-increment degradation label; Compare the multi-round low-increment degradation label with a preset alarm threshold; If the multi-round low-increment degradation label triggers an alarm, output plateau warning information of the corresponding level.
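One way to picture the multi-round statistic is a sliding window over the single-round labels. The sketch below assumes the multi-round label is simply the count of first labels in the last `window` rounds and that alarm levels are tiered by that count; the class name, window size, and level names are all hypothetical.

```python
from collections import deque


class PlateauMonitor:
    """Counts first labels over a sliding window and maps the count to a warning level."""

    def __init__(self, window=5, alarm_levels=((4, "severe"), (2, "mild"))):
        # alarm_levels: (minimum count, level name), ordered from strictest to mildest.
        self.labels = deque(maxlen=window)
        self.alarm_levels = alarm_levels

    def observe(self, is_first_label):
        """Record one round and return the warning level, or None if no alarm fires."""
        self.labels.append(bool(is_first_label))
        count = sum(self.labels)
        for minimum, level in self.alarm_levels:
            if count >= minimum:
                return level
        return None
```

Because the window discards old rounds, a run of healthy outputs eventually clears the warning without any explicit reset.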
[0010] In one embodiment, the method further includes: The relative incremental evaluation value, the first model evaluation value, and the topic matching degree are weighted and summed according to preset weights to obtain the global semantic incremental contribution value. The corresponding plateauing penalty coefficient is determined based on the single-round low-increment degradation label, and the global quality contribution is obtained based on the text evaluation value and the plateauing penalty coefficient.
[0011] In one embodiment, the method further includes: Extract the core concepts, semantic vector, statistical features, style features, and length features of the output text for this round; Merge the core concepts of this round with the global core concept set to obtain the nth-round global core concept set; Fuse the semantic vector of this round with the global topic semantic vector to obtain the nth-round global topic semantic vector; Correct the global statistical baseline based on the statistical features of this round to obtain the nth-round global statistical baseline; Adjust the style adaptation threshold and length adaptation threshold according to the style features and length features of this round, respectively; Update the plateau determination threshold in combination with the multi-round low-increment degradation label; Integrate the nth-round global core concept set, the nth-round global topic semantic vector, the nth-round global statistical baseline, the style adaptation threshold, the length adaptation threshold, and the plateau determination threshold to obtain the nth-round global context feature pool.
[0012] In addition, to achieve the above objectives, this application also provides an AI multi-turn question-answering evaluation device, which includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program is configured to implement the steps of the AI multi-turn question-answering evaluation method as described above.
[0013] In addition, to achieve the above objectives, this application also provides a storage medium, which is a computer-readable storage medium, on which a program implementing the AI multi-turn question answering evaluation method is stored. The program implementing the AI multi-turn question answering evaluation method is executed by a processor to implement the steps of the AI multi-turn question answering evaluation method as described above.
[0014] This application provides an AI multi-turn question answering evaluation method. First, by responding to the input information of the task configuration control, a global context feature pool is constructed based on the global task obtained from the parsed input information and the baseline evaluation value of the domain to which the global task belongs. This establishes a globally unified evaluation benchmark to locate abnormal features of probability mean regression during AI multi-turn generation. Second, by responding to the output text of the large language model in the nth round, the absolute quality evaluation value is obtained by weighting the output text based on a preset first model and a second model. This quantifies the basic quality attributes of the output text itself and initially filters out low-quality corpora lacking basic quality. Next, by comparing the output text with the global context feature pool corresponding to the (n-1)th round based on a preset dimension, a relative incremental evaluation value is obtained to identify the added value of the output text relative to the global context and determine whether it is corpora without incremental value generated by probability mean regression. Finally, the absolute quality evaluation value and the relative incremental evaluation value are weighted and fused based on the completion degree of the global task to obtain a text evaluation value. This accurately identifies worthless corpora formed by probability mean regression in AI multi-turn generation and completes quality labeling.
[0015] In summary, this application achieves full-process quantitative identification of AI multi-turn generated content by combining global context benchmark construction, dual-model absolute quality assessment, global relative incremental comparison, and staged weighted fusion scoring. This overcomes the technical deficiency inherent in AI multi-turn generation, which naturally leads to probabilistic mean regression and the formation of worthless corpora. It accurately identifies homogeneous, worthless corpora in AI multi-turn generation, improving the accuracy of quality identification and the efficiency of value screening for AI multi-turn question-answering content. Specifically, it addresses the issues of missing context, repeated misjudgments, and degradation trend identification in multi-turn generation evaluation through feature pooling, thresholding, vector fusion, label statistics, and dynamic early warning.
Attached Figure Description
[0016] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a flowchart illustrating an embodiment of the AI multi-turn question-answering evaluation method of this application. Figure 2 is a schematic diagram of the multi-turn dialogue control in Embodiment 9 of the AI multi-turn question-answering evaluation method of this application. Figure 3 is a schematic diagram of the hardware structure involved in the AI multi-turn question-answering evaluation device of this application.
[0019] The purpose, features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0020] It should be understood that the specific embodiments described herein are merely illustrative of the technical solutions of this application and are not intended to limit this application.
[0021] To better understand the technical solution of this application, a detailed description will be provided below in conjunction with the accompanying drawings and specific implementation methods.
[0022] AI question answering is developing rapidly, but because AI models are prone to hallucination, the generated results need to be evaluated. Current evaluations are static, post-hoc checks that cannot provide real-time, context-aware quality and innovation metrics for each round of AI output. Multi-round AI generation naturally exhibits probabilistic mean regression, so the output easily forms a uniform plateau that has linguistic structure but lacks substance. These problems cannot be predicted in advance and can only be discovered retrospectively.
[0023] The main solution of this application is as follows: In response to the input information of the task configuration control, a global context feature pool is constructed based on the global task obtained by parsing the input information and the baseline evaluation value of the domain to which the global task belongs; In response to the output text of the large language model in the nth round, the output text is evaluated by weighting based on the preset first model and the second model to obtain an absolute quality evaluation value; The output text and the global context feature pool corresponding to the (n-1)th round are compared based on a preset dimension to obtain a relative incremental evaluation value; The absolute quality evaluation value and the relative incremental evaluation value are weighted and fused according to the global task completion degree of the output text to obtain a text evaluation value.
[0024] This application achieves full-process quantitative identification of AI multi-turn generated content by combining global context benchmark construction, dual-model absolute quality assessment, global relative incremental comparison, and phased weighted fusion scoring. It overcomes the technical defect that AI multi-turn generation naturally has probability mean regression, resulting in worthless corpus. It accurately identifies homogeneous worthless corpus in AI multi-turn generation, and improves the quality identification accuracy and value screening efficiency of AI multi-turn question-answering generated content.
[0025] It should be noted that the executing entity in this embodiment can be an AI multi-turn question-answering evaluation system, a computing service device with data processing, network communication, and program execution functions, such as a tablet computer, personal computer, or mobile phone, or an AI multi-turn question-answering evaluation device capable of performing the above functions. This embodiment places no specific limitation on the executing entity. The following uses an AI multi-turn question-answering evaluation system as the executing entity to describe this embodiment and the following embodiments.
[0026] Based on this, Embodiment 1 of this application proposes an AI multi-turn question-answering evaluation method. Referring to Figure 1, the method includes steps S10 to S40: Step S10: In response to the input information from the task configuration control, construct a global context feature pool based on the global task parsed from the input information and the baseline evaluation value of the domain to which the global task belongs.
[0027] In this embodiment, the task configuration control is an interactive component used to receive user-generated configuration instructions, the global task is the overall text generation goal and constraints set by the user, the baseline evaluation value is the dual-model benchmark score of the standard text in the corresponding domain, and the global context feature pool is a structured collection that stores global evaluation benchmark data.
[0028] As an optional implementation, the global task type is mapped to a string identifier, the technical field label is mapped to a numerical code, the target text length is mapped to an integer value, the quality evaluation dimension is mapped to a Boolean array, and the domain baseline score is normalized to a floating-point number between 0 and 1. These are combined into a one-dimensional array in the order of "task identifier + domain code + target length + evaluation dimension + baseline score". The array length is fixed at 128 dimensions and is directly used as the global context feature pool.
[0029] As another optional implementation, a key-value pair structure feature pool is established, with the global task type as the first key, the technical field label as the second key, the target text length as the third key, the quality evaluation dimension as the fourth key, and the domain baseline score as the fifth key. Each key corresponds to a fixed length of numerical storage space, and the corresponding value is written in byte alignment to form a global context feature pool that can be directly read and called.
[0030] Step S20: In response to the output text of the large language model in the nth round, the output text is evaluated based on a weighted average of the first and second preset models to obtain an absolute quality evaluation value.
[0031] In this embodiment, the output text of the nth round is the text to be evaluated generated by the large language model in the nth round of interaction. The first model is a computational model used to evaluate the thought-dimension features of the text, the second model is a computational model used to evaluate the expression-dimension features of the text, and the absolute quality evaluation value is a quantitative score that characterizes the basic quality of the text.
[0032] As one optional implementation method, the first evaluation model segments the text in this round, statistically analyzes three indicators: the number of effective sentences, the proportion of complete sentences, and information density. After normalizing each indicator, the three indicators are weighted and summed to obtain a content quality score. The second evaluation model performs logical verification on the text in this round, statistically analyzing three indicators: the proportion of logical coherence, the proportion of sentences without contradictions, and the proportion of sentences with complete structure. After normalizing each indicator, the three indicators are weighted and summed to obtain a logical quality score. The absolute quality evaluation value is obtained by weighting the content quality score (0.6) and the logical quality score (0.4).
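The two-level weighting described above can be sketched in a few lines of Python. Only the outer weights (0.6 for content, 0.4 for logic) come from the text; the equal inner weights, the assumption that each indicator is already normalized to the range 0 to 1, and the function names are illustrative choices.

```python
def weighted_score(indicators, weights):
    """Weighted mean of indicators, each clipped to the range 0..1."""
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have the same length")
    clipped = [min(1.0, max(0.0, x)) for x in indicators]
    return sum(w * x for w, x in zip(weights, clipped)) / sum(weights)


def absolute_quality(content_indicators, logic_indicators,
                     content_weights=(1.0, 1.0, 1.0),
                     logic_weights=(1.0, 1.0, 1.0)):
    """Fuse the content score (weight 0.6) and the logic score (weight 0.4).

    content_indicators: effective-sentence count, complete-sentence
    proportion, information density (assumed pre-normalized).
    logic_indicators: coherence, non-contradiction, and structural
    completeness proportions.
    """
    content = weighted_score(content_indicators, content_weights)
    logic = weighted_score(logic_indicators, logic_weights)
    return 0.6 * content + 0.4 * logic
```

Since both sub-scores stay within 0 to 1 and the outer weights sum to 1, the fused value needs no further normalization.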
[0033] As another optional implementation, the current text is converted into a 768-dimensional feature vector, input into the first evaluation model to obtain the vector similarity score, input into the second evaluation model to obtain the vector regularity score, multiply the two scores and normalize them to between 0 and 1, and directly use them as the absolute quality evaluation value.
[0034] Step S30: Based on a preset dimension, compare the output text with the global context feature pool corresponding to the (n-1)th round to obtain a relative incremental evaluation value.
[0035] In this embodiment, the preset dimension is a quantitative dimension used to measure the global incremental value of the text, the global context feature pool corresponding to the (n-1)th round is the global evaluation benchmark data updated after the previous round of interaction, and the relative incremental evaluation value is a quantitative score that represents the added value of the text relative to the global context.
[0036] As an optional implementation, three parameters are extracted from the current text: the number of effective sentences, information density, and the proportion of logical coherence. These parameters are then compared with the corresponding parameters stored in the global context feature pool from the previous round. The differences are normalized to between 0 and 1, and then weighted and summed with weights of 0.3, 0.4, and 0.3 to obtain the relative incremental evaluation value.
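As a rough illustration of this variant, the sketch below applies the stated weights of 0.3, 0.4, and 0.3 to the three parameter differences. The normalization is not specified in the text; the sketch assumes each parameter is already scaled to 0..1, so a difference lies in -1..1 and is mapped linearly onto 0..1. The function name is hypothetical.

```python
WEIGHTS = (0.3, 0.4, 0.3)  # effective sentences, information density, logical coherence


def relative_increment(current, previous):
    """Weighted sum of normalized differences between this round and the pool.

    current, previous: triples of parameters, each assumed to be in 0..1.
    Under this mapping, 0.5 means "no change" relative to the previous round.
    """
    total = 0.0
    for weight, cur, prev in zip(WEIGHTS, current, previous):
        difference = cur - prev                     # in -1..1
        total += weight * (difference + 1.0) / 2.0  # mapped to 0..1
    return total
```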
[0037] As another optional implementation, the text in this round is compared character by character with the text in the global context feature pool of the previous round. The three indicators are counted: the number of newly added characters, the proportion of newly added effective information, and the proportion of repeated characters. The result is normalized to between 0 and 1 by adding the proportion of newly added effective information to the proportion of newly added characters and subtracting the proportion of repeated characters, and the result is obtained as a relative increment evaluation value.
[0038] Step S40: Based on the global task completion level of the output text, the absolute quality evaluation value and the relative incremental evaluation value are weighted and fused to obtain the text evaluation value.
[0039] In this embodiment, the global task completion rate is the percentage of progress of the global generation task corresponding to the output text, and the text evaluation value is the final quantitative score that characterizes the overall quality and global value of the output text.
[0040] As an optional implementation, the global task completion rate is divided into three intervals: 0-30%, 30-70%, and 70-100%. When the task completion rate is between 0-30%, the absolute quality evaluation value has a weight of 0.7, and the relative incremental evaluation value has a weight of 0.3. When the task completion rate is between 30%-70%, both scores have a weight of 0.5. When the task completion rate is between 70%-100%, the absolute quality evaluation value has a weight of 0.3, and the relative incremental evaluation value has a weight of 0.7. The text evaluation value is obtained by weighted summation according to the corresponding weights and then normalized.
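The interval-based weighting can be written as a small lookup. The weights are taken from the paragraph above; the handling of the shared boundaries (30% and 70% are assigned to the later interval here) is an assumption, since the stated intervals overlap at their endpoints. Function names are hypothetical.

```python
def stage_weights(completion):
    """Return (absolute weight, incremental weight) for a completion rate in 0..1."""
    if not 0.0 <= completion <= 1.0:
        raise ValueError("completion must be between 0 and 1")
    if completion < 0.3:
        return (0.7, 0.3)
    if completion < 0.7:
        return (0.5, 0.5)
    return (0.3, 0.7)


def text_evaluation(absolute, increment, completion):
    """Weighted fusion of the absolute and relative incremental values."""
    w_abs, w_inc = stage_weights(completion)
    return w_abs * absolute + w_inc * increment
```

Because each weight pair sums to 1, the result stays within 0 to 1 whenever both inputs do, so the final normalization step reduces to a no-op under these assumptions. The schedule shifts emphasis from basic quality early in the task toward new information late in the task, when repetition is most likely.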
[0041] As another optional implementation, the global task completion score is converted into a floating-point number between 0 and 1. The absolute quality evaluation value is weighted by 1 minus the task completion score, and the relative incremental evaluation value is weighted by the task completion score. The two scores are multiplied by their corresponding weights and then added together to directly obtain the text evaluation value.
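The continuous variant amounts to a linear interpolation between the two scores. A minimal sketch, with a hypothetical function name and the assumption that all three inputs are already in the range 0 to 1:

```python
def fused_evaluation(absolute, increment, completion):
    """Text evaluation value as (1 - c) * absolute + c * increment."""
    if not 0.0 <= completion <= 1.0:
        raise ValueError("completion must be between 0 and 1")
    return (1.0 - completion) * absolute + completion * increment
```

For example, with an absolute value of 0.8, an incremental value of 0.2, and a completion of 0.25, the result is 0.75 × 0.8 + 0.25 × 0.2 = 0.65.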
[0042] For example, a user inputs configuration information for generating an AI-era content ecosystem industry report through a task configuration control. The system parses the information to determine the global task: writing a 15,000-word industry report in the internet media field. It then retrieves a pre-built text library for that field to calculate a baseline evaluation value and initializes a global contextual feature pool containing data such as a global core concept set and global topic semantic vectors. After the large language model generates the first round of output text, the system calculates the text's thought dimension score using the first model and the text's expression dimension score using the second model, weighting and fusing them to obtain an absolute quality evaluation value. The system then compares this round of output text with the initial global contextual feature pool across multiple dimensions, calculating indicators such as the percentage of newly added core concepts and topic matching degree, fusing them to obtain a relative incremental evaluation value. Based on the completion level of the current report generation, corresponding weights are matched, and the absolute quality evaluation value and the relative incremental evaluation value are weighted and fused to obtain the text evaluation value of this round of output text.
[0043] This embodiment establishes a unified global evaluation benchmark by constructing a global context feature pool, relies on dual models to complete the quantitative evaluation of the text's basic quality, accurately identifies the added value of the text by comparing the global context, and obtains a comprehensive score by adaptive weighted fusion of the global task completion degree. It accurately identifies worthless corpus generated by probability mean regression in the AI multi-round generation process, effectively improving the accuracy of quality evaluation and value screening efficiency of AI multi-round question answering content.
[0044] Based on any of the above embodiments, in Embodiment 2 of this application, a global context feature pool is constructed according to the global task obtained by parsing the input information and the baseline evaluation value of the domain to which the global task belongs, including: Step S11: Parse the input information of the task configuration control to obtain the generation target, core theme, style type, length requirement, and domain of the global task.
[0045] In this embodiment, the input information for the task configuration control is structured configuration data or natural language configuration instructions submitted by the user. The generation target is the text output purpose of the global task. The core theme is the core argument around which the global task is organized. The style type is the style specification of the text to be generated. The length requirement is a numerical constraint on the text length. The domain is the professional category corresponding to the global task.
[0046] As an optional implementation, character cleaning and stop word filtering are performed on the input information. A preset keyword dictionary and rule templates are used to locate the description of the generation target. A hybrid TF-IDF and TextRank algorithm extracts high-frequency terms and clusters them to form the core theme. The style type is determined by matching the style feature library against sentence structure features, paragraph features, and conjunction distribution. Numbers and length units are extracted through regular expressions and normalized to thousands of characters to obtain the length requirement. The domain is determined by mapping domain classification labels and a professional knowledge graph.
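The length-extraction step lends itself to a short example. The pattern below is only a sketch for English-language instructions: the accepted units ("words", "characters", with an optional "k", "thousand", or "million" multiplier) and the decision to treat words and characters as equivalent are assumptions, not part of the specification.

```python
import re

# Hypothetical pattern: a number, an optional multiplier, then a length unit.
_LENGTH = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*(k|thousand|million)?[\s-]*(?:words?|characters?)",
    re.IGNORECASE,
)
_SCALE = {None: 1, "k": 1_000, "thousand": 1_000, "million": 1_000_000}


def length_in_thousands(instruction):
    """Return the requested length in thousands of units, or None if absent."""
    match = _LENGTH.search(instruction)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    multiplier = match.group(2).lower() if match.group(2) else None
    return value * _SCALE[multiplier] / 1000
```

With this pattern, an instruction such as "write a 15,000-word industry report" yields a length requirement of 15.0.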
[0047] As another optional implementation, the structured form fields of the task configuration control are traversed, the content of the target field is directly read, the checked and input content of the core theme is read, the drop-down option value of the style type is read, the numerical value and unit of the length requirement are read and converted into the standard length, the classification code of the field is read, the format validation and value range validation are performed on all data, and five standardized parameters are output after the validation passes.
[0048] Step S12: Based on the domain, retrieve the pre-built domain text library and calculate the baseline evaluation value, which includes a first model benchmark value and a second model benchmark value.
[0049] In this embodiment, the domain text library is a collection of standard texts stored according to domain classification, and each text has been quality-annotated. The first model benchmark is the average score of the standard texts within the domain on the first model. The second model benchmark is the average score of the standard texts within the domain on the second model. The baseline evaluation value is structured data composed of the benchmark values from both models.
[0050] As an optional implementation, the domain text library is retrieved based on the unique code of the domain, all labeled texts in the same domain are obtained, all texts are input into the first model in sequence to obtain the score for each text, all scores are accumulated and divided by the number of texts to obtain the baseline value of the first model, all texts are input into the second model in sequence to obtain the score for each text, all scores are accumulated and divided by the number of texts to obtain the baseline value of the second model, and the two baseline values are combined to form the baseline evaluation value.
[0051] As another optional implementation, a subset of high-quality texts is extracted from the domain text library according to the domain, and extreme data with abnormal scores are removed. The median of the first model score of the remaining data is calculated to obtain the first model baseline value, and the median of the second model score of the remaining data is calculated to obtain the second model baseline value. The two baseline values are normalized to the range of 0 to 1 and then packaged into a baseline evaluation value.
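The two baseline variants above (plain mean, and median after removing abnormal scores) can be compared in a short sketch. The outlier rule is not specified in the text; the 1.5 × interquartile-range fence used here is an assumption, as are the function names. Scores are assumed to be model outputs already on a 0..1 scale.

```python
from statistics import mean, median, quantiles


def mean_baseline(scores):
    """Benchmark value as the arithmetic mean of all domain text scores."""
    return mean(scores)


def robust_baseline(scores):
    """Benchmark value as the median after dropping scores outside the IQR fences."""
    q1, _, q3 = quantiles(scores, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    kept = [s for s in scores if low <= s <= high]
    # Clip to 0..1 so the packaged benchmark value is normalized.
    return min(1.0, max(0.0, median(kept)))


def baseline_evaluation(first_scores, second_scores, robust=False):
    """Package the first and second model benchmark values together."""
    pick = robust_baseline if robust else mean_baseline
    return {"first_model": pick(first_scores), "second_model": pick(second_scores)}
```

The median variant is less sensitive to a few badly scored texts in the domain library, which matters because the baseline later feeds the plateau determination threshold.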
[0052] Step S13: Based on the generation target, core theme, style type, and length requirements, and combined with the baseline evaluation value, initialize the global context feature pool. The global context feature pool includes at least a global core concept set, a global theme semantic vector, a global statistical baseline, a style adaptation threshold, a length adaptation threshold, and a plateau determination threshold.
[0053] In this embodiment, the global context feature pool serves as the structured data carrier for the global evaluation benchmark. The global core concept set is the terminology set corresponding to the core topic. The global topic semantic vector is a fixed-dimensional vector obtained by semantically encoding the core topic. The global statistical baseline is a reference benchmark composed of baseline evaluation values. The style adaptation threshold is the critical value for determining style matching. The length adaptation threshold is the critical value for determining length conformity. The plateau determination threshold is the critical value for determining low-increment degradation label recognition.
[0054] As an optional implementation, keyword expansion and synonym merging are performed on the core topic to obtain a global core concept set. A 768-dimensional pre-trained semantic model is used to encode the core topic to obtain a global topic semantic vector. The first model baseline value and the second model baseline value are combined to form a global statistical baseline. The value corresponding to the style type is looked up in the style threshold table and used as the style adaptation threshold. A range of 10% above and below the length requirement is set as the length adaptation threshold. The lowest of the baseline evaluation values is multiplied by 0.6 to obtain the plateau determination threshold. The six data items are encapsulated in a fixed structure to obtain a global context feature pool.
[0055] As another optional implementation, entities and attributes are extracted from the knowledge graph associated with the core topic to form a global core concept set. The generation target and the core topic are jointly encoded to obtain a global topic semantic vector. The baseline evaluation value is processed by a moving average to obtain a global statistical baseline. The style adaptation threshold is set according to the standard level of the style type. The interval-type length adaptation threshold is set according to the target value of the length requirement. The plateau determination threshold is determined according to the weighted average of the first model benchmark value and the second model benchmark value. All data are uniformly formatted and integrated to form a global context feature pool.
[0056] For example, a user inputs configuration information via a task configuration control to generate a 10,000-word academic paper in the field of artificial intelligence on the topic of a multi-round generated text quality assessment method. The system cleans, segments, and extracts keywords from the input information, obtaining the following parameters: the generation target is academic paper writing, the core topic is multi-round generated text quality assessment method, the style type is academic paper, the length requirement is 10,000 words, and the field is artificial intelligence. The system retrieves fifty standard academic texts from the domain text library based on the artificial intelligence field, inputs them into the first and second models respectively, calculates scores, and averages them to obtain the baseline evaluation value: 0.78 for the first model and 0.82 for the second model. The system then expands the core topic into a global core concept set, encodes the core topic to obtain a global topic semantic vector, uses the baseline evaluation value as the global statistical baseline, looks up the style adaptation threshold of 0.85 for academic papers, sets a length adaptation threshold of 10% above or below 10,000 words, multiplies the lower baseline value of 0.78 by 0.6 to obtain a plateau determination threshold of 0.468, and integrates these six data items to initialize the global context feature pool.
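The initialization in this example can be sketched as a small data structure. The sketch is illustrative only: the class and field names are hypothetical, the style threshold table contains just the one entry given in the example, and the concept set and topic vector are passed in as placeholders because keyword expansion and semantic encoding depend on models not described here.

```python
from dataclasses import dataclass

# Hypothetical style threshold table; only the 0.85 entry comes from the example.
STYLE_THRESHOLDS = {"academic paper": 0.85}


@dataclass
class GlobalContextFeaturePool:
    core_concepts: set
    topic_vector: list
    statistical_baseline: dict
    style_threshold: float
    length_range: tuple
    plateau_threshold: float


def init_feature_pool(core_concepts, topic_vector, baseline,
                      style_type, length_requirement):
    """Assemble the six pool items from task parameters and the baseline."""
    # Length adaptation threshold: 10% below and above the requirement.
    length_range = (length_requirement * 0.9, length_requirement * 1.1)
    # Plateau determination threshold: lowest baseline value times 0.6.
    plateau = min(baseline.values()) * 0.6
    return GlobalContextFeaturePool(
        core_concepts=set(core_concepts),
        topic_vector=list(topic_vector),
        statistical_baseline=dict(baseline),
        style_threshold=STYLE_THRESHOLDS[style_type],
        length_range=length_range,
        plateau_threshold=plateau,
    )


pool = init_feature_pool(
    core_concepts={"multi-round generation", "text quality assessment"},
    topic_vector=[0.0] * 768,  # placeholder for the encoder output
    baseline={"first_model": 0.78, "second_model": 0.82},
    style_type="academic paper",
    length_requirement=10_000,
)
```

With the baseline values of the example, `pool.plateau_threshold` evaluates to approximately 0.468 and `pool.length_range` to roughly 9,000 to 11,000 words.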
[0057] This embodiment extracts global task parameters completely by refining the task configuration information, accurately calculates the dual-model benchmark values by relying on the domain text library, and completes the initialization of the global context feature pool through engineering processes such as term expansion, semantic encoding, threshold setting, and data encapsulation. This provides a stable and reusable global benchmark for subsequent multi-round text evaluation, improving the consistency and reproducibility of the evaluation process.
[0058] Based on any of the above embodiments, in Embodiment 3 of this application, the output text is weighted and evaluated based on a preset first model and a second model to obtain an absolute quality evaluation value, including: Step S21: Process the output text based on the first model to obtain the first model evaluation value, and process the output text based on the second model to obtain the second model evaluation value. The first model includes a reward module, a penalty module, and a compensation module, and the second model includes a preset number of expression factor dimensions.
[0059] In this embodiment, the output text is the text content to be evaluated generated by the large language model. The first model is a computational model used to quantify the cognitive dimension of the text's ideas. The reward module is a computational unit that performs positive scoring on the text's high-quality ideas. The penalty module is a computational unit that performs negative scoring on the text's flawed ideas. The compensation module is a computational unit that performs score balancing on the text's unique ideas. The second model is a computational model used to quantify the text's structural expression dimension. The expression factor dimension is an independent computational dimension that measures the quality of text expression. The evaluation value of the first model is the quantitative score of the idea dimension output by the first model. The evaluation value of the second model is the quantitative score of the expression dimension output by the second model.
[0060] As an optional implementation, the output text is input into the first model. The reward module extracts the text's thought depth and concept density features, and accumulates reward scores according to a maximum score of 0.2 for each feature. The penalty module identifies logical contradictions and meaningless redundancy features in the text, and accumulates penalty scores according to a maximum score of 0.2 for each defect. The compensation module calculates compensation scores within 0.1 for the text's domain innovation and cross-paradigm expression features, based on the degree of innovation. The penalty score is subtracted from the reward score, and the compensation score is added. The result is normalized to the interval between 0 and 1 to obtain the evaluation value of the first model. The output text is input into the second model, which iterates through seven expression factor dimensions: syntactic complexity, symbol density, vector stability, logical depth, structural hierarchy clarity, concept dimension resolution, and cross-semantic domain transfer capability. Each dimension is standardized and scored within the interval between 0 and 1. The scores of the seven dimensions are summed with equal weights to obtain the evaluation value of the second model.
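A minimal sketch of the two score calculations follows. It takes the first model's raw score as reward minus penalty plus compensation, which matches the figures in this embodiment's worked example (0.72, 0.1, and 0.08 giving 0.7). Two points are assumptions: normalization is implemented as clamping to the range 0 to 1, and the equal-weight sum of the seven dimensions is implemented as their mean so that the result stays in that range. All identifiers are hypothetical.

```python
EXPRESSION_DIMENSIONS = (
    "syntactic_complexity",
    "symbol_density",
    "vector_stability",
    "logical_depth",
    "structural_hierarchy_clarity",
    "concept_dimension_resolution",
    "cross_domain_transfer",
)


def clamp01(x):
    """Limit a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def first_model_score(reward, penalty, compensation):
    """Reward minus penalty plus compensation, clamped to [0, 1]."""
    return clamp01(reward - penalty + compensation)


def second_model_score(dimension_scores):
    """Equal-weight combination (mean) of the seven expression dimensions."""
    missing = [d for d in EXPRESSION_DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total = sum(clamp01(dimension_scores[d]) for d in EXPRESSION_DIMENSIONS)
    return total / len(EXPRESSION_DIMENSIONS)
```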
[0061] As an alternative implementation, the first model performs threshold determination on the thought features of the output text. The reward module accumulates a fixed score for thought features that reach the preset threshold. The penalty module deducts a fixed score for features that exceed the defect threshold. The compensation module dynamically calculates compensation scores based on the domain attributes and expression novelty of the text, and linearly scales the results of the three score calculations to obtain the evaluation value of the first model. The second model constructs a feature weight matrix using a preset number of expression factor dimensions. After scoring each expression factor dimension of the output text, it performs a weighted operation with the feature weight matrix, and normalizes the weighted operation result to obtain the evaluation value of the second model.
[0062] Step S22: Based on the text type, fuse the first model evaluation value and the second model evaluation value to generate a stable field label and local features corresponding to the output text.
[0063] In this embodiment, text type is a text category categorized based on the length and genre of the output text. Fusion is the process of performing a weighted operation on the evaluation values of the first model and the second model according to preset rules. Stability field label is a classification identifier representing the overall stability of the text's ideas and expression. Local features are a set of quantitative parameters representing the core quality of the text.
[0064] As an optional implementation, a weighting ratio is matched according to the text type. For short texts, the first model evaluation value is weighted 0.4 and the second model evaluation value 0.6; for standard texts, the first model evaluation value is weighted 0.55 and the second model evaluation value 0.45; for long texts, the first model evaluation value is weighted 0.65 and the second model evaluation value 0.35. Weighted fusion is performed according to the weighting ratio to obtain a fusion score. Texts with a fusion score greater than or equal to 0.8 are marked as high stability, texts with a fusion score greater than or equal to 0.5 and less than 0.8 are marked as medium stability, and texts with a fusion score less than 0.5 are marked as low stability. Concept density, logical coherence, and structural adaptability parameters are extracted during the fusion process and combined to form local features.
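The weight lookup and labelling rule above can be sketched as follows. The dictionary keys and function names are hypothetical; the weights and cut-offs are those stated in the paragraph.

```python
# (first model weight, second model weight) per text type.
FUSION_WEIGHTS = {
    "short": (0.4, 0.6),
    "standard": (0.55, 0.45),
    "long": (0.65, 0.35),
}


def fuse(first_score, second_score, text_type):
    """Weighted fusion of the two model evaluation values."""
    w_first, w_second = FUSION_WEIGHTS[text_type]
    return w_first * first_score + w_second * second_score


def stability_label(fusion_score):
    """Map a fusion score to a stability label."""
    if fusion_score >= 0.8:
        return "high"
    if fusion_score >= 0.5:
        return "medium"
    return "low"


score = fuse(0.7, 0.84, "standard")
label = stability_label(score)
```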
[0065] As an alternative implementation, the fusion weights are adjusted based on the genre attributes of the text type. Academic texts receive a higher weight in the second model's evaluation value, creative texts receive a higher weight in the first model's evaluation value, and practical texts maintain a balanced weighting of both evaluation values. The scores are then fused according to the adjusted weights. A stable field label is determined based on the dispersion of the fused scores; lower dispersion indicates higher stability. Local thought features and local expression features are extracted from the output text, and the core quantification parameters of these two types of features are integrated to form a local feature set.
[0066] Step S23: Generate the absolute quality evaluation value of the output text based on the stable field label and local features.
[0067] In this embodiment, the absolute quality rating is the final quantitative score that characterizes the basic quality of the output text itself, and is used to measure the inherent quality level of the text independent of the context.
[0068] As an optional implementation, the baseline score range is matched according to the stability field label: high stability labels are matched with a baseline score range of 0.8 to 1, medium stability labels with a baseline score range of 0.5 to 0.8, and low stability labels with a baseline score range of 0 to 0.5. The concept density, logical coherence, and structural adaptability parameters in the local features are each scored on a standardized 0 to 1 scale, and the average of the three scores is used as a correction coefficient. The median value of the baseline score range is multiplied by the correction coefficient, and the result is normalized to the 0 to 1 range, yielding the absolute quality evaluation value.
[0069] As an alternative implementation, the stable field labels are converted into fixed base scores: 0.9 for high stability, 0.65 for medium stability, and 0.3 for low stability. Core parameters are extracted from local features, and a comprehensive score is calculated. This comprehensive score is divided by two to obtain a score correction margin. The correction margin is then added to the fixed base scores, limiting the calculation result to the range of zero to one, thus obtaining the absolute quality evaluation value.
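Both ways of deriving the absolute quality evaluation value can be sketched as below. The function names are hypothetical. The "median value of the baseline score range" is interpreted here as the midpoint of the range, and normalization is implemented as clamping; both are assumptions.

```python
BASE_RANGES = {"high": (0.8, 1.0), "medium": (0.5, 0.8), "low": (0.0, 0.5)}
FIXED_BASE = {"high": 0.9, "medium": 0.65, "low": 0.3}


def clamp01(x):
    """Limit a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def absolute_quality_range_variant(label, concept_density,
                                   logical_coherence, structural_fit):
    """Range midpoint scaled by the mean of the three local-feature scores."""
    low, high = BASE_RANGES[label]
    midpoint = (low + high) / 2
    correction = (concept_density + logical_coherence + structural_fit) / 3
    return clamp01(midpoint * correction)


def absolute_quality_fixed_variant(label, comprehensive_score):
    """Fixed base score plus half of the comprehensive local-feature score."""
    return clamp01(FIXED_BASE[label] + comprehensive_score / 2)
```

For instance, a medium stability label with a comprehensive local-feature score of 0.24 gives a correction margin of 0.12 and an absolute quality evaluation value of about 0.77 under the fixed-base variant.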
[0070] For example, a large language model generates paragraph text of an academic paper in the field of artificial intelligence. The system inputs this paragraph text into the first model. The reward module calculates the scores for thought depth and concept density, obtaining a reward score of 0.72. The penalty module identifies local redundancy features, obtaining a penalty score of 0.1. The compensation module provides a compensation score of 0.08 for academic innovation expression. After calculation (0.72 - 0.1 + 0.08), the first model's evaluation value is 0.7. The second model scores each of the seven expression factor dimensions, and after equal weighting and summing, obtains the second model's evaluation value of 0.84. This paragraph text is a standard academic text, so the two values are fused with weights of 0.55 and 0.45 to obtain a fusion score of 0.763, which is marked with a medium stability label. Concept density, logical coherence, and structural fit are extracted to form local features. The medium stability label corresponds to a fixed base score of 0.65, and a correction margin of 0.12 is calculated from the comprehensive score of the local features. After the two are added, the absolute quality evaluation value of the paragraph text is 0.77.
[0071] This embodiment achieves precise quantification of the thought dimension through the reward, punishment, and compensation modules of the first model, and comprehensive quantification of the expression dimension through the multiple expression factor dimensions of the second model. It also achieves adaptive fusion of the scores of the two models by combining text type, and completes the accurate calculation of the absolute quality evaluation value by relying on stable field labels and local features, thereby improving the comprehensiveness and accuracy of the text quality assessment.
[0072] Based on any of the above embodiments, in Embodiment 4 of this application, a relative incremental evaluation value is obtained by comparing the output text with the global context feature pool corresponding to the (n-1)th round based on a preset dimension, including: Step S31: Extract the core concepts, semantic vectors, stylistic features, and length features of the output text for this round.
[0073] In this embodiment, the core concept of this round is the set of technical terms and keywords that carry the core ideas in the output text. The semantic vector of this round is the vectorized representation of the overall semantics of the output text. The stylistic features of this round are the genre structure and expression style features of the output text. The length features of this round are the statistical data on the length of the output text.
[0074] As an optional implementation, the input is the output text generated by the large language model in the nth round. During processing, the TextRank algorithm combined with TF-IDF weights is used to extract high-frequency keywords and core terms from the output text, forming the core concepts for this round. A pre-trained semantic coding model is used to globally encode the output text, generating a fixed 768-dimensional floating-point array as the semantic vector for this round. Syntactic, structural, and stylistic features of the output text are extracted through syntactic tree depth statistics, sentence type distribution matching, and paragraph structure division rules, combined to form the stylistic features for this round. Character-level and vocabulary-level statistics are performed on the output text; after removing punctuation and stop words, the number of effective characters and words is calculated to obtain the length features for this round. After execution, the output results are the core concepts, semantic vectors, stylistic features, and length features corresponding to the output text for this round.
[0075] As an alternative implementation, the input is the output text generated by the large language model in the nth round. During processing, the output text is precisely matched with a domain-specific core vocabulary, and successfully matched terms are extracted as the core concepts for this round. A lightweight semantic coding model is used to encode the output text, generating a 512-dimensional vector as the semantic vector for this round. The genre attribute of the output text is determined by clustering on paragraph count, average sentence length, and connective frequency features, forming the stylistic features for this round. Based on the word segmentation results, the number of tokens in the output text is counted, and combined with a preset length conversion rule, the length features for this round are obtained. After execution, the output results are the core concepts, semantic vectors, stylistic features, and length features corresponding to the output text for this round.
[0076] Step S32: Compare the core concepts of this round with the set of global core concepts in the global context feature pool of the (n-1)th round, and calculate the proportion of newly added core concepts.
[0077] In this embodiment, the percentage of newly added core concepts is the ratio of the number of unique core concepts in this round to the total number of core concepts in this round. The global core concept set is the summary data of global task core terms stored in the feature pool of round n-1.
[0078] As an optional implementation, the input objects are the core concepts of the current round and the global core concept set of the (n-1)th round. During processing, a precise string match is performed between the core concepts of the current round and the global core concept set to filter out core concepts of the current round that are not present in the global core concept set, and the number of unique core concepts of the current round is calculated. The percentage of newly added core concepts is calculated by dividing the number of unique core concepts of the current round by the total number of core concepts of the current round and then multiplying by 100%. After execution, the output result is the percentage of newly added core concepts.
[0079] As an alternative implementation, the input objects are the core concepts of the current round and the global core concept set of the (n-1)th round. During processing, the core concepts of the current round and the global core concept set are converted into semantic vectors, and the cosine similarity between the vectors is calculated. Concepts with a similarity greater than 0.8 are considered existing concepts, while those with a similarity less than 0.8 are considered new concepts. After counting the number of new concepts in the current round, the percentage of new core concepts is calculated as the number of new core concepts in the current round divided by the total number of core concepts in the current round, multiplied by 100%. After execution, the output is the percentage of new core concepts.
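The two matching strategies can be sketched as follows. The function names are hypothetical, concept vectors are assumed to be supplied by an external encoder, and the functions return a fraction between 0 and 1 (multiply by 100 for a percentage). The text leaves a similarity of exactly 0.8 undefined; the sketch treats it as a new concept, which is an assumption.

```python
import math


def new_concept_ratio_exact(round_concepts, global_concepts):
    """Share of this round's concepts absent from the global set (string match)."""
    round_concepts = set(round_concepts)
    if not round_concepts:
        return 0.0
    new = round_concepts - set(global_concepts)
    return len(new) / len(round_concepts)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def new_concept_ratio_semantic(round_vectors, global_vectors, threshold=0.8):
    """Share of this round's concepts with no global concept above the threshold.

    round_vectors: dict mapping concept -> vector
    global_vectors: list of vectors for the global core concept set
    """
    if not round_vectors:
        return 0.0
    new = [
        concept for concept, vec in round_vectors.items()
        if all(cosine(vec, g) <= threshold for g in global_vectors)
    ]
    return len(new) / len(round_vectors)
```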
[0080] Step S33: Compare the semantic vector of this round with the global topic semantic vector in the global context feature pool of the (n-1)th round, and calculate the topic matching degree.
[0081] In this embodiment, the topic matching degree is a numerical value representing the similarity between the current round's semantic vector and the global topic semantic vector. The global topic semantic vector is the vectorized data of the core topics of the global task stored in the feature pool of the (n-1)th round.
[0082] As an optional implementation, the input objects are the semantic vector of the current round and the global topic semantic vector of the (n-1)th round. During processing, a cosine similarity algorithm is used to calculate the vector comparison. The formula is: topic matching degree equals the dot product of the current round semantic vector and the global topic semantic vector divided by the product of the magnitude of the current round semantic vector and the magnitude of the global topic semantic vector. The calculation result is normalized to the interval between zero and one. After execution, the output is the topic matching degree.
[0083] As an alternative implementation, the input objects are the current round semantic vector and the (n-1)th round global topic semantic vector. During processing, the global topic semantic vector is weighted and smoothed to obtain a corrected global vector. Then, the Euclidean distance between the current round semantic vector and the corrected global vector is calculated. The distance is then inversely normalized using the formula: topic matching degree = 1 / (1 + Euclidean distance). After execution, the output is the topic matching degree.
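Both topic matching formulas can be written compactly, as in the sketch below. The function names are hypothetical. Normalizing the cosine result to the interval between zero and one is implemented by clamping negative values to zero, which is an assumption; the weighting and smoothing of the global vector is not shown and the corrected vector is taken as an input.

```python
import math


def topic_match_cosine(round_vec, topic_vec):
    """Cosine similarity, clamped to [0, 1]."""
    dot = sum(a * b for a, b in zip(round_vec, topic_vec))
    norm = (math.sqrt(sum(a * a for a in round_vec))
            * math.sqrt(sum(b * b for b in topic_vec)))
    if norm == 0:
        return 0.0
    return max(0.0, min(1.0, dot / norm))


def topic_match_euclidean(round_vec, corrected_topic_vec):
    """Inverse normalization of Euclidean distance: 1 / (1 + d)."""
    distance = math.dist(round_vec, corrected_topic_vec)
    return 1.0 / (1.0 + distance)
```

Identical vectors give a matching degree of 1 under both formulas, and the Euclidean variant decays toward 0 as the distance grows.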
[0084] Step S34: Compare the current round of text style features and current round of length features with the text style adaptation threshold and length adaptation threshold in the (n-1)th round of global context feature pool, and calculate the feature adaptation degree.
[0085] In this embodiment, the feature fit is a comprehensive value representing the degree of matching between the current round's style features and length features and the corresponding threshold. The style fit threshold is a preset threshold for style feature judgment in the (n-1)th round feature pool. The length fit threshold is a preset threshold for length feature judgment in the (n-1)th round feature pool.
[0086] As an optional implementation, the specific input objects are the current round's style features, the current round's length features, the style adaptation threshold, and the length adaptation threshold. During processing, the current round's style features and the style adaptation threshold are subjected to Boolean matching. If they match, the style adaptation score is recorded as one; otherwise, it is recorded as zero. The deviation rate between the current round's length features and the length adaptation threshold is calculated. If the deviation rate is less than or equal to 10%, the length adaptation score is recorded as one. For every further 10% increase in the deviation rate, the length adaptation score decreases linearly by 0.1. The feature fit is calculated according to the formula: feature fit = (style adaptation score + length adaptation score) / 2. After execution, the output result is the feature fit.
[0087] As an alternative implementation, the specific input objects are the current round's style features, current round's length features, style adaptation threshold, and length adaptation threshold. During processing, feature similarity fitting is performed between the current round's style features and the style adaptation threshold to obtain a style adaptation score in the range of zero to one. The current round's length features are then mapped to the target range of the length adaptation threshold, yielding a length adaptation score in the range of zero to one. The style adaptation score and length adaptation score are then weighted and summed, with weights of 0.6 and 0.4 respectively, to obtain the feature adaptation degree. After execution, the output result is the feature adaptation degree.
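A sketch of the two feature fit calculations follows. The function names are hypothetical. The deviation rate is assumed to be measured relative to the target length, and the linear decrease is assumed to continue until the score reaches zero.

```python
def length_score(length, target):
    """1.0 within 10% of the target, then 0.1 lower per further 10% deviation."""
    deviation = abs(length - target) / target
    if deviation <= 0.10:
        return 1.0
    return max(0.0, 1.0 - (deviation - 0.10))


def feature_fit_boolean(style_matches, length, target):
    """Mean of a Boolean style score and the length score."""
    style_score = 1.0 if style_matches else 0.0
    return (style_score + length_score(length, target)) / 2


def feature_fit_weighted(style_score, length_score_value,
                         w_style=0.6, w_length=0.4):
    """Weighted sum of two scores that are already in [0, 1]."""
    return w_style * style_score + w_length * length_score_value
```

For a matching style and a length 5% away from a 10,000-word target, `feature_fit_boolean(True, 10500, 10000)` returns 1.0.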
[0088] Step S35: Based on the global statistical baseline in the (n-1)th round of global context feature pool, calculate the abstract level transition degree and text redundancy of the output text.
[0089] In this embodiment, the abstraction level leap is the increase in the abstraction level of the output text in this round relative to the global average abstraction level. Text redundancy is the proportion of repetitive semantic content in the output text. The global statistical baseline is the statistical reference value of the domain text features stored in the feature pool of round n-1.
[0090] As an optional implementation, the input objects are the output text and the global statistical baseline of the (n-1)th round. During processing, the output text is annotated with abstract levels, divided into four levels (phenomenon description, rule summary, principle abstraction, and paradigm innovation) and assigned scores from one to four. The abstract level transition degree is calculated according to the formula: the abstract level score of this round minus the average abstract level score in the global statistical baseline. The output text is segmented into sentences, and the mean cosine similarity of the semantic vectors of adjacent sentences is calculated, with the mean used as the text redundancy. After execution, the output results are the abstract level transition degree and the text redundancy.
[0091] As an alternative implementation, the input objects are the output text and the global statistical baseline of the (n-1)th round. During processing, the average abstraction score of the output text is calculated using a concept abstraction level graph. The abstraction level transition is equal to the average abstraction score of this round minus the abstraction benchmark score in the global statistical baseline. A sliding window is used to detect repeated semantic segments in the output text, and the proportion of repeated segment characters to the total number of characters is counted to obtain the text redundancy. After execution, the output results are the abstraction level transition and the text redundancy.
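The first variant of this step (level difference plus mean similarity of adjacent sentences) can be sketched as below. The function names are hypothetical, sentence vectors are assumed to come from an external encoder, and the abstract level score of the round is assumed to be supplied by a separate annotation step.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def abstraction_transition(round_level, baseline_level):
    """This round's abstract level score minus the global average."""
    return round_level - baseline_level


def adjacent_redundancy(sentence_vectors):
    """Mean cosine similarity between each sentence and the next one."""
    if len(sentence_vectors) < 2:
        return 0.0
    sims = [cosine(a, b)
            for a, b in zip(sentence_vectors, sentence_vectors[1:])]
    return sum(sims) / len(sims)
```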
[0092] Step S36: Combine the newly added proportion of core concepts, topic matching degree, feature adaptation degree, abstraction level leap degree, and text redundancy to obtain the relative incremental evaluation value.
[0093] In this embodiment, the relative incremental evaluation value is a comprehensive quantitative score of the global added value of the text obtained by integrating multiple incremental indicators. The integration calculation is a process of weighting multiple indicators to obtain a comprehensive score.
[0094] As an optional implementation, the specific input objects are the proportion of newly added core concepts, topic matching degree, feature adaptation degree, abstraction level transition degree, and text redundancy. During processing, all indicators are normalized to the range of zero to one, and fixed weights are set as follows: 0.3 for the proportion of newly added core concepts, 0.2 for topic matching degree, 0.15 for feature adaptation degree, 0.25 for abstraction level transition degree, and 0.1 for text redundancy. The relative incremental evaluation value is calculated as: relative incremental evaluation value = 0.3 × proportion of newly added core concepts + 0.2 × topic matching degree + 0.15 × feature adaptation degree + 0.25 × abstraction level transition degree - 0.1 × text redundancy. Text redundancy is the only indicator that is subtracted. After execution, the output is the relative incremental evaluation value.
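The fixed-weight fusion can be sketched as a single function. The function and parameter names are hypothetical, and every input is assumed to have been normalized to the range of zero to one beforehand; how the abstraction level transition degree is normalized is not specified in the text.

```python
def relative_increment(new_concept_ratio, topic_match, feature_fit,
                       abstraction_transition, redundancy):
    """Fixed-weight fusion; redundancy is subtracted rather than added."""
    return (0.3 * new_concept_ratio
            + 0.2 * topic_match
            + 0.15 * feature_fit
            + 0.25 * abstraction_transition
            - 0.1 * redundancy)


# Illustrative inputs only, not taken from the worked example.
value = relative_increment(0.5, 0.8, 1.0, 0.4, 0.2)
```

With these illustrative inputs the terms are 0.15, 0.16, 0.15, 0.10, and -0.02, giving about 0.54.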
[0095] As an alternative implementation, the specific input objects are the proportion of newly added core concepts, topic matching degree, feature fit, abstraction level leap degree, and text redundancy. During processing, the weights are dynamically adjusted based on the global task completion rate: early-stage weights raise the weight of feature fit, mid-stage weights raise the weight of the proportion of newly added core concepts, and late-stage weights raise the weight of the abstraction level leap degree. Each indicator is substituted into the dynamic weight formula to complete the fusion calculation, and the calculation result is normalized to obtain the relative incremental evaluation value. After execution, the output result is the relative incremental evaluation value. The dynamic weights are pre-configured weight settings obtained by looking up a table based on the default or user-defined model divergence of the large model.
[0096] For example, after the large language model generates the third round of output text for academic papers in the field of artificial intelligence, the system extracts the core concepts of this round using the TextRank algorithm, generates the semantic vector of this round using a pre-trained semantic encoding model, extracts the stylistic features of this round through syntactic structure analysis, and obtains the length features of this round through character statistics. The core concepts of this round are precisely matched with the global core concept set of the second round, and the newly added core concepts are calculated to be 35%. The cosine similarity of the semantic vector of this round with the global topic semantic vector of the second round is calculated, and the topic matching degree is 0.82. The stylistic features of this round are matched with the stylistic adaptation threshold, resulting in a stylistic adaptation score of 1. The length features of this round deviate from the length adaptation threshold by 5%, resulting in a length adaptation score of 1, and the feature fitting degree is calculated to be 1. Based on the global statistical baseline of the second round, the abstract level score of this round is 3, the global average abstract level score is 2, the abstract level transition degree is calculated to be 1, the average semantic similarity between adjacent sentences is 0.21, and the text redundancy is 0.21. Based on the five indicators weighted by a fixed weight, the relative incremental evaluation value of the output text in this round was finally calculated to be 0.68.
[0097] This embodiment accurately obtains the core representation data of the output text through multi-dimensional feature extraction, calculates the proportion of newly added core concepts by combining precise matching and semantic matching, quantifies the topic matching degree by relying on vector similarity algorithm, calculates the feature fit degree by threshold comparison and linear fitting, and completes the accurate calculation of the degree of abstraction level leap and text redundancy by combining global statistical baseline. It uses a weighted fusion formula to integrate multiple indicators to obtain the relative incremental evaluation value, which solves the technical problem of missing dimensions and fuzzy calculation of relative incremental evaluation of AI multi-round generated text, and improves the accuracy of global incremental value evaluation and the feasibility of engineering implementation.
[0098] Optionally, this embodiment suppresses the misleading influence of artificially inflated indicators on relative incremental evaluation values by introducing concept quality verification, abstract validity testing, multi-dimensional redundancy fusion, information gain, and structural change degree. This enhances the anti-misjudgment capability of multi-round generation evaluation, while maintaining compatibility with the technical terminology system of the original steps, as detailed below. As in Embodiment 4, extract the core concepts, semantic vectors, stylistic features, and length features of the output text for this round.
[0099] Calculate the percentage of new core concepts and simultaneously calculate the quality correction coefficient for new concepts.
[0100] The percentage of newly added core concepts will still be calculated in the original way: the number of newly added concepts divided by the total number of core concepts in this round.
[0101] For each new concept c, calculate the definition completeness Def(c) and the causal correlation degree Causal(c).
[0102] The definition completeness Def(c) is calculated as follows: In this round of text, it is checked whether there is a definition sentence with concept c as the main body. If there are patterns such as "c refers to", "c is", "c is defined as", "so-called c", etc., and the definition sentence is longer than 5 words, then Def(c) = 1, otherwise it is 0.
[0103] The causal correlation degree Causal(c) is calculated as follows: check whether the sentence in which concept c appears contains a causal conjunction that links c to any existing concept in the global core concept set. Causal conjunctions include words such as "cause", "due to", "therefore", and "make". If such a conjunction exists, Causal(c) = 1; otherwise, Causal(c) = 0.
[0104] The formula for calculating the newly added concept quality score is as follows: Q(c)=α×Def(c)+β×Causal(c).
[0105] The default weights are α = 0.6 and β = 0.4.
[0106] The new concept quality correction coefficient Q_concept is the average of the quality scores of all new concepts. If there are no new concepts, Q_concept is set to 1.0.
[0107] The final formula for calculating the percentage of newly added effective core concepts used for subsequent integration is as follows: The percentage of new effective core concepts = the percentage of new core concepts × Q_concept.
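The quality correction in [0101] to [0107] can be sketched in Python as follows. This is a minimal illustration, assuming that Def(c) and Causal(c) have already been produced as 0/1 flags by an upstream detector; the function names and the handling of an empty core concept set are assumptions, not part of the original description.

```python
from typing import Sequence, Tuple

ALPHA = 0.6  # default weight of definition completeness Def(c)
BETA = 0.4   # default weight of causal correlation degree Causal(c)


def concept_quality(def_flag: int, causal_flag: int) -> float:
    """Q(c) = alpha * Def(c) + beta * Causal(c), with 0/1 inputs."""
    return ALPHA * def_flag + BETA * causal_flag


def effective_new_concept_ratio(
    new_concept_flags: Sequence[Tuple[int, int]],
    total_core_concepts: int,
) -> float:
    """Percentage of new core concepts multiplied by Q_concept.

    new_concept_flags holds one (Def(c), Causal(c)) pair per new concept.
    """
    if total_core_concepts <= 0:
        return 0.0  # assumption: no core concepts means no increment
    raw_ratio = len(new_concept_flags) / total_core_concepts
    if not new_concept_flags:
        q_concept = 1.0  # per the text: no new concepts gives Q_concept = 1.0
    else:
        scores = [concept_quality(d, c) for d, c in new_concept_flags]
        q_concept = sum(scores) / len(scores)
    return raw_ratio * q_concept
```

For instance, three new concepts out of ten core concepts, of which one is both defined and causally linked, one is only defined, and one is neither, give a raw percentage of 0.30 and an effective percentage of about 0.16.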
[0108] The topic matching degree and the feature fit degree are calculated in the same way as in the original Embodiment 4.
[0109] When calculating the degree of abstraction level transition and the text redundancy, a validity check and a multi-dimensional redundancy analysis are embedded.
[0110] The effective abstraction level transition is calculated by first calculating the current abstraction level score L_round and the global average abstraction level L_global in the original way, and then obtaining the original transition degree: ΔL=L_round-L_global.
[0111] Based on this, the penalty coefficient P_fake for false abstraction is calculated as follows: iterate through all sentences in the current text that express an increase in the level of abstraction. These sentences are identified by detecting words from the abstraction-level vocabulary, which includes terms such as "essence," "paradigm," "bottom layer," "system," "mechanism," and "trend." For each such sentence, check whether it contains specific examples, data, citations, or a causal reasoning chain. A sentence with no such supporting evidence is marked as "false abstraction", and N_fake is the number of sentences so marked.
[0112] The formula for calculating the penalty coefficient P_fake for false abstraction is: P_fake=min(1,N_fake / N_abstract).
[0113] Here N_abstract is the total number of abstraction-raising sentences. If N_abstract = 0, then P_fake = 0.
[0114] The formula for calculating the effective abstraction level transition degree is: ΔL_valid = ΔL × (1 - P_fake).
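A minimal Python sketch of the false-abstraction penalty and the effective transition degree described above. It assumes that N_fake and N_abstract have already been counted by a sentence-level detector, which is not shown; the function names are hypothetical.

```python
def false_abstraction_penalty(n_fake: int, n_abstract: int) -> float:
    """P_fake = min(1, N_fake / N_abstract), and 0 when N_abstract is 0."""
    if n_abstract == 0:
        return 0.0
    return min(1.0, n_fake / n_abstract)


def effective_abstraction_transition(
    l_round: float, l_global: float, n_fake: int, n_abstract: int
) -> float:
    """Delta_L_valid = (L_round - L_global) * (1 - P_fake)."""
    delta_l = l_round - l_global
    return delta_l * (1.0 - false_abstraction_penalty(n_fake, n_abstract))
```

With L_round = 0.8, L_global = 0.5, and one unsupported sentence out of four abstraction-raising sentences, the raw transition of 0.3 is scaled by 0.75 to about 0.225.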
[0115] For the overall text redundancy, the redundancy is expanded into a weighted sum of four dimensions. Semantic redundancy R_sem: the result based on the mean cosine similarity of the semantic vectors of adjacent sentences, as in the original Embodiment 4, remains unchanged.
[0116] Argument structure redundancy R_struct: The current text is segmented into “claim-basis-reasoning” segments and compared with the historical argument structure templates stored in the global context feature pool. The argument structure templates are represented by argument role sequence encoding. The longest common subsequence similarity is calculated, and the maximum value is taken as the argument structure redundancy.
[0117] Sentence template redundancy R_pattern: Extract the syntax tree depth, clause nesting level, and main-subordinate conjunction sequence as sentence template features for each sentence in this round, match them with the historical sentence template library, and count the proportion of duplicate templates.
[0118] Semantic role redundancy R_role: Using semantic role annotation, extract the agent, patient, instrument, and other role sequences of each sentence, compare them with the role sequences in the preceding text, and calculate the repetition rate of the role sequences.
[0119] The formula for calculating overall redundancy is: R_total=w_sem×R_sem+w_struct×R_struct+w_pattern×R_pattern+w_role×R_role.
[0120] The default weights for each item are: w_sem = 0.25, w_struct = 0.35, w_pattern = 0.25, and w_role = 0.15. All redundancy metrics are normalized to the range of 0 to 1.
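The weighted sum for R_total can be sketched as below. The four component values are assumed to be computed elsewhere and already normalized to the range 0 to 1; the range check and the names used here are illustrative assumptions.

```python
from typing import Dict

# Default weights from the description; they sum to 1.0.
REDUNDANCY_WEIGHTS: Dict[str, float] = {
    "sem": 0.25,
    "struct": 0.35,
    "pattern": 0.25,
    "role": 0.15,
}


def total_redundancy(
    r_sem: float,
    r_struct: float,
    r_pattern: float,
    r_role: float,
    weights: Dict[str, float] = REDUNDANCY_WEIGHTS,
) -> float:
    """R_total as the weighted sum of the four redundancy dimensions."""
    components = {
        "sem": r_sem,
        "struct": r_struct,
        "pattern": r_pattern,
        "role": r_role,
    }
    for name, value in components.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"R_{name} must be normalized to [0, 1]")
    return sum(weights[name] * value for name, value in components.items())
```

For example, R_sem = 0.4, R_struct = 0.6, R_pattern = 0.2, and R_role = 0.0 give R_total of about 0.36.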
[0121] When the relative incremental evaluation value is obtained by fusion, marginal information gain and structural change degree are added, and the original indicators are replaced.
[0122] Marginal information gain G_marg: subtract the global topic semantic vector of round n-1, V_global_(n-1), from the current semantic vector V_current to obtain the information increment vector ΔV. The calculation formula is: ΔV = V_current - V_global_(n-1).
[0123] Calculate the ratio of the projected magnitude of ΔV in the global topic semantic direction to the original magnitude of ΔV, and then multiply it by the magnitude of ΔV to obtain the marginal information gain: G_marg=(‖proj_V_global(ΔV)‖ / ‖ΔV‖)×‖ΔV‖.
[0124] The formula for normalizing to the 0 to 1 interval is: G_marg_norm=tanh(G_marg).
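The following Python sketch illustrates the marginal information gain calculation using plain lists as vectors. It assumes that the "global topic semantic direction" is the direction of the round n-1 global vector, and it returns 0 for zero-length vectors; both choices are assumptions. Note that the ratio multiplied by the magnitude of ΔV reduces algebraically to the magnitude of the projection.

```python
import math
from typing import Sequence


def marginal_information_gain(
    v_current: Sequence[float], v_global_prev: Sequence[float]
) -> float:
    """Return G_marg_norm = tanh(G_marg) for one round."""
    # Information increment vector: current minus previous global vector.
    delta = [c - g for c, g in zip(v_current, v_global_prev)]
    delta_norm = math.sqrt(sum(d * d for d in delta))
    global_norm = math.sqrt(sum(g * g for g in v_global_prev))
    if delta_norm == 0.0 or global_norm == 0.0:
        return 0.0  # assumption: no increment or no reference direction
    # Magnitude of the projection of delta onto the global direction.
    dot = sum(d * g for d, g in zip(delta, v_global_prev))
    proj_norm = abs(dot) / global_norm
    # Written as in the description; equals proj_norm up to rounding.
    g_marg = (proj_norm / delta_norm) * delta_norm
    return math.tanh(g_marg)
```

With V_current = (3, 4) and V_global_(n-1) = (1, 0), ΔV = (2, 4), the projection magnitude is 2, and the normalized gain is tanh(2).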
[0125] The structural change factor S_change breaks down the output text into paragraphs or semantic blocks, abstracting them into a hierarchical node graph. It compares the current node graph with the chapter or argument topology graph already generated by the global task, using the normalized value of the graph edit distance as the structural change factor, with a value ranging from 0 to 1. If there is no historical structure, S_change is set to 1.0.
[0126] Fusion formula: the relative incremental evaluation value is obtained by weighted fusion of the following seven factors, where the percentage of new effective core concepts replaces the original percentage of new core concepts, the overall redundancy R_total replaces the original text redundancy, and the effective abstraction level transition degree ΔL_valid replaces the original abstraction level transition degree: Score_inc = w1 × percentage of new effective core concepts + w2 × topic matching degree + w3 × feature fit degree + w4 × ΔL_valid - w5 × R_total + w6 × G_marg_norm + w7 × S_change.
[0127] The default weights are configured as follows: the weight w1 for the percentage of new effective core concepts is 0.25; the weight w2 for topic matching degree is 0.15; the weight w3 for feature fit degree is 0.10; the weight w4 for effective abstraction level transition degree is 0.20; the weight w5 for overall redundancy is 0.20 (a penalty term, entering the formula with a negative sign); the weight w6 for marginal information gain is 0.05; and the weight w7 for structural change degree is 0.05. All indicators are normalized to the 0 to 1 range, and the final Score_inc is truncated to the 0 to 1 range. These weights are configurable parameters and can be fine-tuned according to the actual task type.
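As an illustration, the seven-factor fusion with the default weights might be written as follows. The function and parameter names are hypothetical, and all inputs are assumed to be already normalized as stated above.

```python
from typing import Sequence

# Default weights w1..w7 from the description.
DEFAULT_FUSION_WEIGHTS = (0.25, 0.15, 0.10, 0.20, 0.20, 0.05, 0.05)


def relative_increment_score(
    effective_concept_ratio: float,
    topic_match: float,
    feature_fit: float,
    delta_l_valid: float,
    r_total: float,
    g_marg_norm: float,
    s_change: float,
    weights: Sequence[float] = DEFAULT_FUSION_WEIGHTS,
) -> float:
    """Score_inc: weighted fusion with redundancy as a penalty term."""
    w1, w2, w3, w4, w5, w6, w7 = weights
    score = (
        w1 * effective_concept_ratio
        + w2 * topic_match
        + w3 * feature_fit
        + w4 * delta_l_valid
        - w5 * r_total  # penalty: higher redundancy lowers the score
        + w6 * g_marg_norm
        + w7 * s_change
    )
    return max(0.0, min(1.0, score))  # truncate to the 0 to 1 range
```

For inputs (0.16, 0.8, 0.7, 0.225, 0.36, 0.5, 1.0) the fused score is about 0.278; a text that is fully redundant and contributes nothing else is truncated to 0.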
[0128] This embodiment effectively suppresses the misleading influence of artificially inflated indicators on relative incremental evaluation values by introducing concept quality verification, abstract validity testing, multi-dimensional redundancy fusion, information gain, and structural change degree. This significantly enhances the anti-misjudgment capability of multi-round generation evaluation while maintaining compatibility with the technical terminology system of the original steps.
[0129] Based on any of the above embodiments, in Embodiment 5 of this application, after the text evaluation value is obtained by weighting and fusing the absolute quality evaluation value and the relative incremental evaluation value according to the global task completion degree of the output text, the method further includes: Step S51: Obtain the first model evaluation value, the second model evaluation value, and the relative incremental evaluation value corresponding to the output text.
[0130] In this embodiment, the first model is a shallow evaluation model oriented towards the basic normativity of the text, and the output is the first model evaluation value M1, with the value range normalized to [0,1]; the second model is a deep evaluation model oriented towards the semantic integrity and logical rationality of the text, and the output is the second model evaluation value M2, with the value range normalized to [0,1]; the relative increment evaluation value ΔM is the text relative increment quality score obtained in step S16, with the value range normalized to [0,1].
[0131] As an optional implementation, the deployed shallow evaluation model interface is called; the output text is input and processed by word segmentation, feature extraction, and fully connected layer inference, and the first model evaluation value M1 is output. The deep evaluation model interface is called at the same time; the output text is input and processed by semantic encoding and context interaction inference, and the second model evaluation value M2 is output. The calculated relative incremental evaluation value ΔM is read from the cache unit, completing the synchronous acquisition of the three evaluation values.
[0132] As another optional implementation, the output text is input to an end-side inference engine, which directly generates floating-point M1 and M2 through parallel inference of the quantized first model and second model, while ΔM is retrieved from the incremental calculation partition of the global context feature pool to achieve low-latency acquisition.
[0133] Step S52: Compare the first model evaluation value, the second model evaluation value, and the relative incremental evaluation value with the plateau determination threshold in the global context feature pool one by one.
[0134] In this embodiment, the global context feature pool is pre-configured with a plateau determination threshold group, including: a first model compliance threshold T1, a second model compliance threshold T2, and a relative increment compliance threshold T3. The default parameter configuration is T1 = 0.60, T2 = 0.70, and T3 = 0.50, which can be dynamically calibrated according to the business scenario.
[0135] The comparison rule uses direct numerical comparison: The comparison result of the first model evaluation value: If M1 ≥ T1, it is recorded as meeting the standard; if M1 < T1, it is recorded as not meeting the standard. The comparison result of the second model evaluation value: If M2 ≥ T2, it is recorded as meeting the standard; if M2 < T2, it is recorded as not meeting the standard. The comparison result of the relative increment evaluation value: If ΔM ≥ T3, it is recorded as meeting the standard; if ΔM < T3, it is recorded as not meeting the standard.
[0136] As an optional implementation, M1, M2, and ΔM are paired with T1, T2, and T3 respectively and input into a comparator, which generates a Boolean compliance status flag for each pair, completing the one-by-one comparison.
[0137] As another optional implementation, the difference between the three types of evaluation values and the corresponding thresholds is calculated through a threshold mapping table. If the difference ≥ 0, it is recorded as meeting the standard; if the difference < 0, it is recorded as not meeting the standard, and a standardized comparison result is generated.
[0138] Step S53, when the second model evaluation value meets the standard, the first model evaluation value does not meet the standard, and the relative increment evaluation value does not meet the standard, it is determined that the single-round low-increment degradation label of the output text is the first label; otherwise, it is determined as the second label.
[0139] In this embodiment, the first label is single-round plateau text (indicating that the text is only superficially smooth, meets the standard in depth but has no substantial increment, and is plateau output); the second label is non-single-round plateau text (indicating that the text has basic quality or substantial increment).
[0140] The determination logic formula: IF (M2 ≥ T2) AND (M1 < T1) AND (ΔM < T3) → label = the first label; ELSE → label = the second label.
[0141] As an optional implementation, the three compliance statuses obtained in step S52 are input into an AND logic unit. The first label is output only when the second model evaluation value meets the standard, the first model evaluation value does not meet the standard, and the relative incremental evaluation value does not meet the standard, all at the same time; in all other cases, the second label is output directly.
[0142] As another optional implementation, the three comparison results are encoded as binary status bits (meeting the standard = 1, not meeting the standard = 0), ordered as first model, second model, relative increment. When the status bit combination is 010, the first label is determined; all other combinations are determined as the second label.
[0143] Exemplarily, for a certain output text the calculation yields: first model evaluation value M1 = 0.52, second model evaluation value M2 = 0.78, and relative incremental evaluation value ΔM = 0.41. The global plateau determination thresholds are T1 = 0.60, T2 = 0.70, and T3 = 0.50. One-by-one comparison gives M1 < T1 (not meeting the standard), M2 ≥ T2 (meeting the standard), and ΔM < T3 (not meeting the standard). The combined determination of the three conditions is satisfied, so the single-round low-increment degradation label of this output text is determined to be the first label (single-round plateau text).
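The determination logic of steps S52 and S53 can be sketched in Python as follows. The label strings and the `PlateauThresholds` class are hypothetical placeholders; only the comparison rule and the default thresholds come from the description.

```python
from dataclasses import dataclass

FIRST_LABEL = "single-round plateau text"
SECOND_LABEL = "non-single-round plateau text"


@dataclass(frozen=True)
class PlateauThresholds:
    t1: float = 0.60  # first model compliance threshold
    t2: float = 0.70  # second model compliance threshold
    t3: float = 0.50  # relative increment compliance threshold


def single_round_label(
    m1: float,
    m2: float,
    delta_m: float,
    th: PlateauThresholds = PlateauThresholds(),
) -> str:
    """First label only if M2 meets, M1 fails, and delta_M fails."""
    m1_ok = m1 >= th.t1
    m2_ok = m2 >= th.t2
    delta_ok = delta_m >= th.t3
    if m2_ok and not m1_ok and not delta_ok:
        return FIRST_LABEL
    return SECOND_LABEL
```

Calling `single_round_label(0.52, 0.78, 0.41)` reproduces the worked example and returns the first label, while raising M1 to 0.65 yields the second label.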
[0144] In this embodiment, the standardized comparison of three evaluation values against three thresholds, combined with a hard-threshold joint determination rule, makes the determination of the single-round low-increment degradation label quantitative, reproducible, and implementable in engineering without subjective judgment. The parameters are configurable and can be calibrated, and the method identifies plateau text of the type "shallow non-compliance, deep compliance but no substantial increment", improving the degree of automation and the recognition accuracy of text quality control while adapting to both end-side and cloud inference scenarios.
[0145] Optionally, the plateau determination threshold is divided into the first model compliance threshold T1, the second model compliance threshold T2, and the relative increment compliance threshold T3. Each threshold can be obtained by statistical analysis of a domain text library, adapted to the text style type, dynamically calibrated according to historical task performance, configured by the user, or determined together with the warning levels using ROC, F1, or a manually annotated set.
[0146] Optionally, the plateau determination threshold is determined based on statistical analysis of the domain text library. In this implementation, the plateau determination threshold is obtained by statistical calculation over a pre-constructed domain text library, which is applicable to the system initialization stage or to scenarios where a domain benchmark has not yet been established.
[0147] First, retrieve the pre-constructed domain text library according to the domain to which the global task belongs. The domain text library contains several high-quality standard texts in this domain, and each text has been manually annotated or reviewed by experts to confirm that it is non-plateau text.
[0148] For each text in the domain text library, calculate its first model evaluation value M1_i, second model evaluation value M2_i, and relative incremental evaluation value ΔM_i respectively. The calculation method of the relative incremental evaluation value is: compare each text with the previous text. If this text is the first text in the domain text library, then compare it with the global topic semantic vector of the domain text library, and the calculation method is the same as steps S30 to S36.
[0149] Statistically analyze the first model evaluation value, second model evaluation value, and relative incremental evaluation value of all texts in the domain text library, and calculate the distribution characteristics of each index respectively.
[0150] The first model compliance threshold T1 is determined by taking the lower quartile of the first model evaluation values of all texts in the domain text library; that is, T1 equals the value at the 25th percentile after the first model evaluation values are sorted from smallest to largest.
[0151] The second model compliance threshold T2 is determined by taking the median of the second model evaluation values of all texts in the domain text library; that is, T2 equals the value at the 50th percentile after the second model evaluation values are sorted from smallest to largest.
[0152] The relative increment compliance threshold T3 is determined by taking the lower quartile of the relative incremental evaluation values of all texts in the domain text library; that is, T3 equals the value at the 25th percentile after the relative incremental evaluation values are sorted from smallest to largest.
[0153] The three thresholds above together constitute the plateau determination threshold group. This implementation can be re-executed periodically to keep the thresholds synchronized with the domain text quality standards.
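A minimal sketch of the statistical threshold derivation, using Python's standard `statistics` module. The "inclusive" interpolation method is an assumption, since the description does not specify how the quartile is interpolated; `statistics.quantiles` generally needs at least two data points.

```python
import statistics
from typing import Sequence, Tuple


def thresholds_from_library(
    m1_values: Sequence[float],
    m2_values: Sequence[float],
    delta_m_values: Sequence[float],
) -> Tuple[float, float, float]:
    """Derive (T1, T2, T3) from scores of a high-quality domain text library.

    T1 and T3 are lower quartiles; T2 is the median.
    """
    # quantiles(..., n=4) returns the three quartile cut points [Q1, Q2, Q3].
    t1 = statistics.quantiles(m1_values, n=4, method="inclusive")[0]
    t2 = statistics.median(m2_values)
    t3 = statistics.quantiles(delta_m_values, n=4, method="inclusive")[0]
    return t1, t2, t3
```

For a toy library with M1 values 0.5 to 0.9 in steps of 0.1, M2 values 0.6, 0.7, 0.8, and ΔM values 0.3 to 0.7 in steps of 0.1, this yields thresholds of approximately (0.6, 0.7, 0.4).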
[0154] Plateau determination threshold adapted to text style type: in this embodiment, the plateau determination threshold is adjusted differently according to the text style type specified by the global task, so that the threshold matches the expression characteristics and quality standards of different text style types.
[0155] First, retrieve the text style type specified in the global task configuration. Optional text style types include academic papers, technical reports, press releases, official documents, creative copywriting, and conversational texts.
[0156] A text style threshold offset table is constructed, which is obtained through prior statistical analysis of domain text libraries of different text styles. The table records a set of threshold offset coefficients for each text style, specifically including: first model offset coefficient k1_style, second model offset coefficient k2_style, and incremental offset coefficient k3_style. Each offset coefficient represents the adjustment ratio required to be multiplied relative to the default baseline threshold.
[0157] The default baseline thresholds T1_base, T2_base, and T3_base are obtained by the domain text library statistical method in Implementation A, or by the system's preset global default values. Examples of global default values are T1_base = 0.60, T2_base = 0.70, and T3_base = 0.50.
[0158] The construction logic of the style threshold offset table is as follows: For each style type, the median of the first model evaluation value, the second model evaluation value, and the relative incremental evaluation value of high-quality texts in the text library of that style type are calculated separately, and denoted as M1_style_median, M2_style_median, and ΔM_style_median, respectively. Then, the formula for calculating the offset coefficient of that style is: k1_style=M1_style_median / M1_all_median; k2_style=M2_style_median / M2_all_median; k3_style=ΔM_style_median / ΔM_all_median; Among them, M1_all_median, M2_all_median, and ΔM_all_median are the medians of the full-text mixed-genre domain text library.
[0159] After determining the style type of the current task, look up the style threshold offset table to obtain the corresponding offset coefficients k1_style, k2_style, and k3_style, and calculate the style-adapted plateau determination thresholds: T1 = T1_base × k1_style; T2 = T2_base × k2_style; T3 = T3_base × k3_style. Examples of typical offset coefficients for different text styles: for academic papers, which require a high degree of intellectual depth, k1_style is typically set to 1.10 to 1.20, meaning the first model threshold is raised by 10% to 20%; for official documents, which have high requirements for structural standardization but relatively low requirements for intellectual innovation, k1_style can be set to 0.85 to 0.95 and k2_style to 1.05 to 1.15; for creative copywriting, which requires a high degree of increment, k3_style can be set to 1.10 to 1.30.
[0160] If the current text type is not recorded in the offset table, the offset coefficient is set to 1.0, and the default baseline threshold is used.
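The style-adapted threshold lookup could look like the sketch below. The table entries are illustrative midpoints of the ranges quoted above, and coefficients the description does not mention are assumed to be 1.0; the dictionary keys are hypothetical identifiers.

```python
from typing import Dict, Tuple

Triple = Tuple[float, float, float]

# Illustrative (k1_style, k2_style, k3_style) values, not measured data.
STYLE_OFFSETS: Dict[str, Triple] = {
    "academic_paper": (1.15, 1.00, 1.00),
    "official_document": (0.90, 1.10, 1.00),
    "creative_copywriting": (1.00, 1.00, 1.20),
}

BASE_THRESHOLDS: Triple = (0.60, 0.70, 0.50)  # T1_base, T2_base, T3_base


def style_adapted_thresholds(
    style: str,
    base: Triple = BASE_THRESHOLDS,
    table: Dict[str, Triple] = STYLE_OFFSETS,
) -> Triple:
    """Multiply each baseline threshold by its style offset coefficient."""
    # Styles missing from the table fall back to coefficients of 1.0.
    k1, k2, k3 = table.get(style, (1.0, 1.0, 1.0))
    return base[0] * k1, base[1] * k2, base[2] * k3
```

An unrecorded style such as `"conversational_text"` returns the baseline group unchanged, while `"official_document"` yields approximately (0.54, 0.77, 0.50).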
[0161] Dynamic calibration based on historical task performance: in this embodiment, the plateau determination threshold is dynamically calibrated during system operation according to the evaluation results of historical tasks, so that the threshold gradually converges to the value best suited to actual use scenarios.
[0162] The system maintains a historical calibration dataset, recording the following information after each task is completed: the number of texts judged as plateau (first label) and non-plateau (second label) in each round of the task, the plateau determination threshold group used in the task, and the user's manual score or the system's comprehensive score for the overall generation quality after the task is completed.
[0163] After each task, the plateau detection rate P_plateau for this round of tasks is calculated using the following formula: P_plateau = Number of rounds in which the first label was determined / Total number of rounds.
[0164] At the same time, obtain the user feedback quality score Q_user and normalize it to the range of 0 to 1.
[0165] Based on the plateau detection rate and the user feedback quality score, the plateau determination threshold is incrementally calibrated. The calibration formulas are: T1_new = T1_old + λ × (P_plateau_target - P_plateau) × δ1; T2_new = T2_old + λ × (P_plateau_target - P_plateau) × δ2; T3_new = T3_old + λ × (P_plateau_target - P_plateau) × δ3. Here λ is the learning rate, default 0.05; P_plateau_target is the target plateau detection rate, preset by the system based on domain experience, default 0.15, meaning that about 15% of rounds are expected to be identified as plateau content; δ1, δ2, and δ3 are the adjustment step size coefficients of the thresholds, defaulting to 0.02, 0.02, and 0.03 respectively.
[0166] An additional user feedback correction is introduced: if Q_user is lower than 0.6, users are not satisfied with the overall quality and plateau content may be going undetected too often. In this case an additional threshold tightening operation is performed, and T1, T2, and T3 are each lowered to 95% of their original values. If Q_user is higher than 0.85, plateau content may be falsely detected too often, and T1, T2, and T3 are each raised to 105% of their original values.
[0167] After calibration, the updated threshold set is persisted and stored as the initial threshold for the next task.
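One calibration step can be sketched as follows. Applying the user feedback correction after the incremental update, and clamping the result to the range 0 to 1, are assumptions; the description does not fix the order or the clamping.

```python
from typing import Tuple

Triple = Tuple[float, float, float]


def calibrate_thresholds(
    thresholds: Triple,
    p_plateau: float,
    q_user: float,
    p_target: float = 0.15,  # target plateau detection rate
    learning_rate: float = 0.05,  # lambda
    deltas: Triple = (0.02, 0.02, 0.03),  # step size coefficients
) -> Triple:
    """Return (T1_new, T2_new, T3_new) after one completed task."""
    step = learning_rate * (p_target - p_plateau)
    updated = [t + step * d for t, d in zip(thresholds, deltas)]
    # User feedback correction, with Q_user normalized to 0..1.
    if q_user < 0.6:
        factor = 0.95
    elif q_user > 0.85:
        factor = 1.05
    else:
        factor = 1.0
    t1, t2, t3 = (min(1.0, max(0.0, t * factor)) for t in updated)
    return t1, t2, t3
```

When the observed plateau detection rate equals the target and Q_user lies between 0.6 and 0.85, the thresholds are returned unchanged.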
[0168] User-configured plateau determination threshold: in this embodiment, the plateau determination threshold can be configured by the user through the task configuration control or the settings interface, to meet different users' requirements for the strictness of plateau determination, different domain needs, and different usage preferences.
[0169] A plateau determination threshold configuration area is added to the task configuration control, providing the following three configuration modes for the user to choose from. Mode 1 is quick configuration by strictness level. The user selects a strictness level, and the system automatically sets the corresponding threshold group. The mapping between strictness levels and threshold groups is as follows: strict level: T1 = 0.70, T2 = 0.75, T3 = 0.60, suitable for high-quality generation scenarios with zero tolerance for plateau content; standard level: T1 = 0.60, T2 = 0.70, T3 = 0.50, which is the system default configuration; relaxed level: T1 = 0.50, T2 = 0.60, T3 = 0.40, suitable for creative and divergent tasks, allowing more superficially smooth content.
[0170] Mode 2 allows for custom numerical configuration. Users directly input specific values for T1, T2, and T3, all within the range of 0 to 1. The system validates the user input according to the following rules: T1 must be greater than 0 and less than or equal to T2, T2 must be greater than 0 and less than or equal to 1, and T3 must be greater than 0 and less than or equal to 1. If validation fails, the system prompts the user to correct the input and refuses to save.
[0171] Mode 3 is domain template configuration. The system pre-sets threshold templates for several typical domains. After the user selects a domain, the corresponding threshold group is automatically loaded. The construction method of the domain template is a combination of implementation methods A and B. That is, the basic threshold is first obtained by statistical analysis of the domain text library, and then adjusted by the style offset coefficient. The final result is then encapsulated as a domain template. Examples of domain templates: Academic paper template T1 is 0.68, T2 is 0.72, and T3 is 0.55; News release template T1 is 0.52, T2 is 0.65, and T3 is 0.45; Technical report template T1 is 0.62, T2 is 0.70, and T3 is 0.50.
[0172] After user configuration, the system writes the user-configured threshold group into the plateau determination threshold field in the global context feature pool, overriding the default value. During task execution, the user can pause the task and adjust the threshold configuration at any time; the adjustment takes effect from the next round.
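The Mode 1 presets and the Mode 2 validation rules can be sketched in Python as below. The preset keys, function name, and error messages are hypothetical; the numeric values and the three validation rules are taken from the description.

```python
from typing import Dict, List, Tuple

# Mode 1: strictness level mapped to (T1, T2, T3).
STRICTNESS_PRESETS: Dict[str, Tuple[float, float, float]] = {
    "strict": (0.70, 0.75, 0.60),
    "standard": (0.60, 0.70, 0.50),
    "relaxed": (0.50, 0.60, 0.40),
}


def validate_custom_thresholds(t1: float, t2: float, t3: float) -> List[str]:
    """Mode 2: return a list of rule violations; empty means valid."""
    errors: List[str] = []
    if not 0 < t1 <= t2:
        errors.append("T1 must be greater than 0 and at most T2")
    if not 0 < t2 <= 1:
        errors.append("T2 must be greater than 0 and at most 1")
    if not 0 < t3 <= 1:
        errors.append("T3 must be greater than 0 and at most 1")
    return errors
```

A caller would refuse to save the configuration and prompt the user whenever the returned list is non-empty. All three presets pass this validation.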
[0173] Plateau determination threshold determined by early warning level evaluation indicators: in this embodiment, the plateau determination threshold and the division of multi-round early warning levels are systematically optimized using labeled datasets and classification evaluation indicators, to ensure that plateau determination and the early warning system achieve the expected results in practical applications.
[0174] First, a plateau annotation dataset is constructed. Text data from historical multi-turn dialogue tasks is collected, and the text of each turn is annotated by human annotators. The annotation labels fall into two categories: the first label indicates that the text of that turn is plateau content, i.e., superficially smooth but without substantial increment; the second label indicates that the text of that turn is non-plateau content. In addition, the degradation trend over consecutive turns is annotated, recording the turn at which the first signs of plateauing appear and the turn at which severe degradation begins.
[0175] Then, multiple candidate threshold groups are applied to the labeled dataset, and the plateau determination result for each threshold group is calculated. The candidate threshold groups are generated as follows: within the value ranges T1 (0.40 to 0.80), T2 (0.50 to 0.85), and T3 (0.30 to 0.70), a grid search with a step size of 0.05 generates the candidate threshold combinations.
[0176] For each candidate threshold group, calculate its classification performance metrics on the labeled dataset, including precision, recall, and F1 score, using the following formulas: Precision = number of rounds correctly identified as the first label / total number of rounds the system identified as the first label; Recall = number of rounds correctly identified as the first label / total number of rounds manually labeled as the first label; F1 = 2 × Precision × Recall / (Precision + Recall).
[0177] Simultaneously, the area under the ROC curve (AUC) is calculated to measure the overall classification ability of the threshold group under different levels of judgment strictness.
[0178] The threshold group with the highest F1 score is selected as the optimal single-round plateau determination threshold group. If multiple threshold groups have the same highest F1 score, the group with the largest AUC value is selected.
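The grid search and F1 selection can be sketched as follows. This simplified illustration omits the AUC tie-break and keeps the first best group encountered; the sample layout `(M1, M2, ΔM, annotated_as_first_label)` and all names are assumptions.

```python
from itertools import product
from typing import List, Sequence, Tuple

Sample = Tuple[float, float, float, bool]  # (M1, M2, delta_M, is first label)
Thresholds = Tuple[float, float, float]


def candidate_grid() -> List[Thresholds]:
    """All (T1, T2, T3) at a 0.05 step; integer ranges avoid float drift."""
    t1s = [v / 100 for v in range(40, 81, 5)]
    t2s = [v / 100 for v in range(50, 86, 5)]
    t3s = [v / 100 for v in range(30, 71, 5)]
    return list(product(t1s, t2s, t3s))


def f1_score(th: Thresholds, samples: Sequence[Sample]) -> float:
    """F1 of the first-label determination against manual annotation."""
    t1, t2, t3 = th
    tp = fp = fn = 0
    for m1, m2, delta_m, is_plateau in samples:
        predicted = m2 >= t2 and m1 < t1 and delta_m < t3
        if predicted and is_plateau:
            tp += 1
        elif predicted:
            fp += 1
        elif is_plateau:
            fn += 1
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold_group(samples: Sequence[Sample]) -> Tuple[Thresholds, float]:
    """Return the first candidate group with the highest F1 and its score."""
    best = max(candidate_grid(), key=lambda th: f1_score(th, samples))
    return best, f1_score(best, samples)
```

The grid contains 9 × 8 × 9 = 648 candidate groups, so an exhaustive search remains cheap even on a large annotated set.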
[0179] For the classification of multiple warning levels, based on the statistical distribution of consecutive plateauing rounds in the labeled dataset, the probability of task quality degradation corresponding to different cumulative plateauing ratios is calculated. The analysis method is as follows: a sliding window is used to traverse the labeled data, with a window size of 5 rounds. The proportion of the first label within the window is counted, and whether subsequent tasks in that window show severe degradation is marked. The relationship curve between the plateauing ratio and the degradation probability is plotted.
[0180] Three key inflection points on the curve of degradation probability were selected as candidate values for warning thresholds: the proportion of plateauing corresponding to the first significant rise in degradation probability to about 20% was used as the first-level yellow warning threshold; the proportion of plateauing corresponding to the rise in degradation probability to about 50% was used as the second-level orange warning threshold; and the proportion of plateauing corresponding to the rise in degradation probability to about 80% was used as the third-level red warning threshold.
[0181] The selected single-round plateau determination threshold group and multi-round warning level threshold are encapsulated into an evaluation configuration scheme, which serves as the system default configuration and can be adjusted by the user according to implementation method D.
[0182] Based on any of the above embodiments, in Embodiment Six of this application, the method further includes: Step S61: Count the first label within a preset number of rounds and generate multiple rounds of low-increment degradation labels.
[0183] Extract the single-round low-increment degradation label determination results within the preset number of rounds, count the texts marked with the first label, and calculate the multi-round low-increment degradation label according to the formula: Multi-round low-increment degradation label = number of texts with the first label ÷ preset number of rounds × 100%. The preset number of rounds is the number of consecutive rounds over which the statistics are collected; the default configuration is 10 rounds.
[0184] Step S62: Compare the multiple rounds of low-incremental degradation labels with the preset alarm threshold.
[0185] The calculated multi-round low-increment degradation labels are numerically compared with preset alarm thresholds, which adopt a three-level gradient threshold: Level 1 warning threshold: 30%; Level 2 warning threshold: 50%; Level 3 warning threshold: 70%. The comparison rule is to directly compare the numerical values to determine the gradient interval in which the multi-round low-increment degradation labels are located.
[0186] Step S63: If an alarm is triggered, output the corresponding level of platform warning information.
[0187] When multiple rounds of low-incremental degradation labels exceed or exceed the corresponding gradient threshold, a corresponding level of alarm is triggered: ≥70% triggers a Level 3 red alarm, outputting "Terraced text proportion is too high, text quality is severely degraded"; ≥50% triggers a Level 2 orange alarm, outputting "Terraced text proportion is too high, text quality has significantly decreased"; ≥30% triggers a Level 1 yellow alarm, outputting "Terraced text proportion is too high, text quality is at risk of decline". If no gradient threshold is triggered, the output is "Text quality is normal, no terraced risk".
[0188] For example, after 10 consecutive rounds of text analysis, 6 rounds are identified as carrying the first label. The multi-round low-increment degradation label is 6 ÷ 10 × 100% = 60%. Compared against the three-level gradient thresholds, 50% ≤ 60% < 70%, so a Level 2 orange alarm is triggered and the corresponding warning information is output.
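The calculation in steps S61 to S63 can be sketched as follows. This is a minimal illustration, not the applicant's implementation; the function names and the boolean encoding of the first label are assumptions made for the example, while the 10-round window and the 30%/50%/70% thresholds come from the text.

```python
from typing import Sequence

# Three-level gradient thresholds from the text, checked from highest to lowest.
ALARM_LEVELS = [
    (0.70, "Level 3 (red)"),
    (0.50, "Level 2 (orange)"),
    (0.30, "Level 1 (yellow)"),
]


def multi_round_label(is_first_label: Sequence[bool], window: int = 10) -> float:
    """Number of first-label rounds in the last `window` rounds, divided by `window`."""
    recent = list(is_first_label)[-window:]
    return sum(recent) / window


def alarm_level(ratio: float) -> str:
    """Return the highest alarm level whose threshold the ratio reaches."""
    for threshold, name in ALARM_LEVELS:
        if ratio >= threshold:
            return name
    return "normal"


# Hypothetical per-round results: True means the round carries the first label.
rounds = [True, False, True, True, False, True, False, True, True, False]
ratio = multi_round_label(rounds)
print(ratio, alarm_level(ratio))
```

Dividing by the preset window rather than by the number of rounds seen so far follows the formula as written; a system with fewer than 10 rounds of history would therefore report a conservative (lower) ratio.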
[0189] This embodiment generates multiple rounds of low-increment degradation labels by quantitatively analyzing the first label in consecutive rounds. It uses both numerical comparison and hierarchical matching to complete the alarm determination and outputs the corresponding level of warning information based on the triggering result. This solves the technical problem that continuous plateau-like content cannot be warned in advance during the AI multi-round generation process, realizes multi-round trend monitoring and advance risk warning of plateau effect, and improves the dynamic control capability of AI multi-round question and answer content quality.
[0190] Based on any of the above embodiments, in Embodiment Seven of this application, the method further includes: Step S71: The relative incremental evaluation value, the first model evaluation value, and the topic matching degree are weighted and summed according to preset weights to obtain the global semantic incremental contribution value.
[0191] In this embodiment, the global semantic increment contribution value is a comprehensive quantitative score representing the innovative value and intellectual increment generated by the output text for the global task. The preset weights are the allocation ratios of various indicators in the novelty calculation pre-set by the system. Weighted summation is a numerical calculation method that accumulates the values obtained by multiplying each indicator by its corresponding weight.
[0192] As an optional implementation, the specific inputs are the relative incremental evaluation value, the first model evaluation value, and the topic matching degree. During processing, a preset weight configuration is invoked, setting the weight of the relative incremental evaluation value to 0.4, the weight of the first model evaluation value to 0.45, and the weight of the topic matching degree to 0.15. The global semantic incremental contribution value is calculated using the formula: 0.4 multiplied by the relative incremental evaluation value, plus 0.45 multiplied by the first model evaluation value, plus 0.15 multiplied by the topic matching degree. The result is then normalized to the range of zero to one. After execution, the output is the global semantic incremental contribution value corresponding to the output text.
[0193] As another optional implementation, the specific inputs are the relative incremental evaluation value, the first model evaluation value, and the topic matching degree. During processing, the preset weights are dynamically adjusted according to the generation stage of the global task. In the early stage of generation, the weight of the relative incremental evaluation value is set to 0.3, the weight of the first model evaluation value is set to 0.5, and the weight of the topic matching degree is set to 0.2. In the middle and later stages of generation, the weight of the relative incremental evaluation value is increased to 0.5, the weight of the first model evaluation value is set to 0.4, and the weight of the topic matching degree is set to 0.1. The weighted summation operation is performed using the weights corresponding to the corresponding stages, and the result is determined as the global semantic incremental contribution value. After execution, the output result is the global semantic incremental contribution value adapted to the generation stage.
[0194] As an optional implementation, a fixed weight configuration is adopted, with the relative incremental evaluation value weight set to 0.4, the first model evaluation value weight set to 0.45, and the topic matching degree weight set to 0.15. The global semantic incremental contribution value is calculated using a linear weighted formula: global semantic incremental contribution value = 0.4 × relative incremental evaluation value + 0.45 × first model evaluation value + 0.15 × topic matching degree. The calculation result is then clipped to the numerical range of 0 to 1 to obtain the final global semantic incremental contribution value.
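The fixed-weight and stage-dependent variants of step S71 can be sketched as follows. This is a minimal illustration; the function name, the stage keys "early" and "mid_late", and the assumption that all three inputs are already on a 0 to 1 scale are choices made for the example.

```python
from typing import Dict, Tuple

Weights = Tuple[float, float, float]  # (relative increment, first model, topic match)

FIXED_WEIGHTS: Weights = (0.4, 0.45, 0.15)
# Stage keys are hypothetical names for the generation stages described in the text.
STAGE_WEIGHTS: Dict[str, Weights] = {
    "early": (0.3, 0.5, 0.2),
    "mid_late": (0.5, 0.4, 0.1),
}


def semantic_increment(
    relative: float,
    model1: float,
    topic: float,
    weights: Weights = FIXED_WEIGHTS,
) -> float:
    """Weighted sum of the three indicators, clipped to [0, 1]."""
    w_rel, w_m1, w_topic = weights
    score = w_rel * relative + w_m1 * model1 + w_topic * topic
    return min(1.0, max(0.0, score))


print(semantic_increment(0.68, 0.72, 0.82))
print(semantic_increment(0.68, 0.72, 0.82, STAGE_WEIGHTS["early"]))
```

Each weight triple sums to 1, so inputs within 0 to 1 already yield a result within 0 to 1; the clipping only matters if an upstream score falls outside that range.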
[0195] Step S72: Determine the corresponding plateauing penalty coefficient based on the single-round low-increment degradation label, and obtain the global quality contribution based on the text evaluation value and the plateauing penalty coefficient.
[0196] In this embodiment, the plateauing penalty coefficient is a quality score correction coefficient set based on the single-round low-increment degradation label, used to reduce the quality score of the plateaued text. The global quality contribution is the true quality contribution score of the output text to the global task after removing the artificially inflated plateauing components. The single-round low-increment degradation label includes two classification states: a first label and a second label.
[0197] As an optional implementation, the input objects are single-round low-increment degradation labels and text evaluation values. During processing, the label type of the single-round low-increment degradation label is detected. If the single-round low-increment degradation label is the first label, the plateauing penalty coefficient is set to 0.5; if the single-round low-increment degradation label is the second label, the plateauing penalty coefficient is set to 1. The global quality contribution is calculated according to the formula: global quality contribution equals text evaluation value multiplied by plateauing penalty coefficient. After execution, the output result is the global quality contribution corresponding to the output text.
[0198] As an alternative implementation, the input objects are single-round low-increment degradation labels, text evaluation values, and multi-round low-increment degradation labels. During processing, the penalty coefficient is set in stages based on the multi-round low-increment degradation labels. If the single-round low-increment degradation label is the first label and the multi-round low-increment degradation labels are in the mild range, the penalty coefficient is set to 0.6. If the single-round low-increment degradation label is the first label and the multi-round low-increment degradation labels are in the moderate or higher range, the penalty coefficient is set to 0.4. If the single-round low-increment degradation label is the second label, the penalty coefficient is constant at 1. The text evaluation value is multiplied by the corresponding penalty coefficient to obtain the corrected global quality contribution. After execution, the output result is the global quality contribution after the staged penalty.
[0199] As an optional implementation, a mapping relationship between single-round low-increment degradation labels and penalty coefficients is established. When the single-round low-increment degradation label is the first label, the plateauing penalty coefficient is assigned a value of 0.5. When the single-round low-increment degradation label is the second label, the plateauing penalty coefficient is assigned a value of 1. The global quality contribution is calculated using a multiplication correction formula: global quality contribution = text evaluation value × plateauing penalty coefficient, thus directly obtaining the calibrated global quality contribution.
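The penalty mapping of step S72 can be sketched as follows. This is a minimal illustration under stated assumptions: the text does not define where the "mild" range ends, so the 50% boundary (`MILD_UPPER`) is hypothetical, as are the function names and the boolean encoding of the first label.

```python
from typing import Optional

# Assumed boundary between the "mild" range and the "moderate or higher" range.
MILD_UPPER = 0.50


def plateau_penalty(is_first_label: bool, multi_round_ratio: Optional[float] = None) -> float:
    """Penalty coefficient for one round.

    Without a multi-round ratio the fixed mapping applies (0.5 or 1).
    With a ratio, the staged variant applies (0.6 mild, 0.4 otherwise).
    """
    if not is_first_label:
        return 1.0
    if multi_round_ratio is None:
        return 0.5
    return 0.6 if multi_round_ratio < MILD_UPPER else 0.4


def global_quality(
    text_eval: float,
    is_first_label: bool,
    multi_round_ratio: Optional[float] = None,
) -> float:
    """Global quality contribution = text evaluation value x penalty coefficient."""
    return text_eval * plateau_penalty(is_first_label, multi_round_ratio)


print(global_quality(0.76, False))
print(global_quality(0.76, True))
```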
[0200] For example, the system obtains a relative incremental evaluation value of 0.68 for the output text in a certain round, a first model evaluation value of 0.72, and a topic matching degree of 0.82. Calculated using the weighted formula, the global semantic incremental contribution value is 0.4 × 0.68 + 0.45 × 0.72 + 0.15 × 0.82 = 0.272 + 0.324 + 0.123 = 0.719 ≈ 0.72. The single-round low-increment degradation label of the output text in this round is the second label, the plateauing penalty coefficient is 1, and the text evaluation value is 0.76. Calculated using the correction formula, the global quality contribution is 0.76 × 1 = 0.76. When the single-round low-increment degradation label of the output text in this round is the first label, the plateauing penalty coefficient is 0.5, and the global quality contribution is 0.76 × 0.5 = 0.38.
[0201] For example, the system obtains a relative incremental evaluation value of 0.68 for the output text in a certain round, a first model evaluation value of 0.72, and a topic matching degree of 0.82. It then performs a weighted summation using the fixed preset weights to calculate a global semantic incremental contribution value of approximately 0.72 (0.719). If the single-round low-increment degradation label of the output text in this round is the second label, the system sets the plateauing penalty coefficient to 1, and the text evaluation value of this text is 0.76. After multiplication, the global quality contribution is 0.76. If the single-round low-increment degradation label of the output text in this round is the first label, the system sets the plateauing penalty coefficient to 0.5, the text evaluation value is 0.76, and the global quality contribution is 0.38.
[0202] This embodiment calculates the global semantic increment contribution value using two weighting methods: fixed weight and stage dynamic weight. It accurately quantifies the global innovative value of the output text and corrects the text evaluation value to obtain the true global quality contribution by matching the single-round low-increment degradation label with the graded plateauing penalty coefficient. This solves the technical problems of the inability to quantify the novelty of AI-generated text in multiple rounds and the existence of plateauing inflated quality scores, thus improving the authenticity and accuracy of the global text value assessment.
[0203] Based on any of the above embodiments, in Embodiment 8 of this application, the method further includes: Step S81: Extract the core concepts, semantic vectors, statistical features, stylistic features, and length features of the output text for this round.
[0204] In this embodiment, the statistical features of this round are a set of quantified features obtained by statistically analyzing the abstract level score, concept density, first model score distribution, and second model score distribution of the output text. The core concepts, semantic vectors, stylistic features, and length features of this round are all defined features and will not be explained again here.
[0205] As an optional implementation, the input is the output text generated by the large language model in the nth round. During processing, the core terms of the output text are extracted using a fusion algorithm of TF-IDF and TextRank to form the core concepts of this round. A 768-dimensional pre-trained semantic coding model is used to globally encode the output text, generating the semantic vector for this round. The output text undergoes abstraction level assignment, concept density statistics, and dual-model score aggregation to obtain the statistical features of this round. The stylistic features of this round are extracted through syntactic tree distribution and paragraph structure recognition, and the length features of this round are obtained through the statistics of effective character count and token count. After execution, the output results are the core concepts, semantic vectors, statistical features, stylistic features, and length features of this round corresponding to the output text.
[0206] As an alternative implementation, the input is the output text generated by the large language model in the nth round. During processing, the core concepts for this round are obtained through domain-specific lexicon matching, and a lightweight 512-dimensional semantic model is used to generate the semantic vector for this round. The average abstraction level, the proportion of core concepts, and the score fluctuation range of the output text are statistically analyzed to form the statistical features for this round. The stylistic features for this round are determined based on the average sentence length and the frequency of conjunctions, and the length features for this round are determined based on the total number of tokens in the text. After execution, the output results are the core concepts, semantic vectors, statistical features, stylistic features, and length features for this round corresponding to the output text.
[0207] Optionally, the output text is preprocessed, and the core concepts of this round are extracted using a TF-IDF+TextRank fusion algorithm. The formula is: TextRank_score(w)=(1-d)+d×Σ(In(w)×Out(w) / Σ(Out(w))) where d=0.85, In(w) is the word in-degree weight, and Out(w) is the word out-degree weight. A 768-dimensional pre-trained semantic coding model is used to encode the output text, and the semantic vector V_current of this round is generated through global average pooling (GAP). The output text is assigned abstract hierarchical values (levels 1-4), and the concept density is calculated as the number of core concepts / number of effective characters. The scores M1 of the first model and M2 of the second model are collected and integrated to obtain the statistical features of this round. The stylistic features of this round are determined by the syntactic tree depth D and the frequency of connectives F, and the length features of this round are obtained by the number of effective characters L and the number of tokens T.
[0208] Step S82: Merge the current round core concept with the global core concept set to update and obtain the nth round global core concept set.
[0209] In this embodiment, the global core concept set for the nth round is the latest global terminology set used for the (n+1)th round evaluation after incorporating the newly added core concepts from this round. Incorporation refers to the process of deduplicating, merging, and standardizing the core concepts from this round and the original global core concept set.
[0210] As an optional implementation, the input objects are the current round's core concepts and the global core concept set. During processing, precise string matching and semantic similarity matching are performed on the current round's core concepts and the global core concept set to remove duplicate and synonymous concepts, retaining only unique core concepts. The retained unique core concepts are then appended to the global core concept set, sorted alphabetically and by importance to form the nth round's global core concept set. After execution, the output is the nth round's global core concept set that has undergone fusion and updating.
[0211] As an alternative implementation, the input objects are the current core concepts and the global core concepts set. During processing, the current core concepts and the global core concepts set are converted into concept vector groups. The cosine similarity between the vectors is calculated, and concepts with a similarity greater than 0.85 are identified as synonyms and removed. The remaining newly added concepts are merged with the original global core concepts set, and a weighted sort is applied based on concept importance to update and obtain the nth-round global core concepts set. After execution, the output is the nth-round global core concepts set after semantic deduplication and fusion.
[0212] Optionally, the current-round core concepts W_current are matched against the global core concept set W_global using exact string matching and semantic vector cosine similarity matching: completely duplicate concepts are directly eliminated; concepts with a cosine similarity > 0.85 are judged as synonyms and eliminated from the current round; unique concepts are added to the global set; finally, the global core concept set W_global_n for the nth round is obtained by sorting the concepts in descending order according to their importance weight Score = TF-IDF score × 0.6 + TextRank score × 0.4.
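The deduplication and ranking in step S82 can be sketched as follows. This is a minimal illustration; the dictionary layout (concept name mapped to a vector, a TF-IDF score, and a TextRank score) and the function names are assumptions made for the example, while the 0.85 similarity threshold and the 0.6/0.4 importance weights come from the text.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Hypothetical record layout: (concept vector, TF-IDF score, TextRank score).
Concept = Tuple[Sequence[float], float, float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def importance(tfidf: float, textrank: float) -> float:
    return 0.6 * tfidf + 0.4 * textrank


def merge_concepts(
    global_set: Dict[str, Concept],
    current: Dict[str, Concept],
    sim_threshold: float = 0.85,
) -> List[str]:
    """Merge current-round concepts into the global set and rank by importance."""
    merged = dict(global_set)
    for name, (vec, tfidf, textrank) in current.items():
        if name in merged:
            continue  # exact duplicate
        if any(cosine(vec, kept[0]) > sim_threshold for kept in merged.values()):
            continue  # synonym of a concept already kept
        merged[name] = (vec, tfidf, textrank)
    return sorted(merged, key=lambda n: importance(merged[n][1], merged[n][2]), reverse=True)


global_set = {"transformer": ([1.0, 0.0], 0.9, 0.8)}
current = {
    "transformer": ([1.0, 0.0], 0.5, 0.5),
    "attention model": ([0.99, 0.1], 0.7, 0.7),
    "plateau": ([0.0, 1.0], 0.95, 0.9),
}
print(merge_concepts(global_set, current))
```

Because each candidate is compared against everything kept so far, synonyms within the same round are also collapsed; whether the original method does this is not stated in the text.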
[0213] Step S83: Merge the current round semantic vector with the global topic semantic vector to update and obtain the nth round global topic semantic vector.
[0214] In this embodiment, the global topic semantic vector for the nth round is the latest global semantic vector used for topic matching calculations in subsequent rounds after fusing the semantics of the text in this round. The fusion adopts a vector operation method of sliding weighted average or exponential moving average.
[0215] As an optional implementation, the input objects are the current-round semantic vector and the global topic semantic vector. During processing, a sliding weighted average formula is used to perform vector fusion. The calculation formula is: the global topic semantic vector of the nth round equals the previous global topic semantic vector multiplied by (n minus one), plus the current-round semantic vector, with the sum divided by n. The calculated vector is normalized to ensure a constant vector magnitude. After execution, the output is the global topic semantic vector of the nth round updated using the sliding weighted average.
[0216] As an alternative implementation, the input objects are the current-round semantic vector and the global topic semantic vector. During processing, an exponential moving average algorithm is used for vector fusion. The calculation formula is: the nth-round global topic semantic vector equals the current-round semantic vector multiplied by a smoothing coefficient plus the global topic semantic vector multiplied by one minus a smoothing coefficient, where the smoothing coefficient is set to 0.3. The calculated vector is then standardized to obtain the nth-round global topic semantic vector. After execution, the output is the exponentially smoothed updated nth-round global topic semantic vector.
[0217] As an optional implementation, a sliding weighted average formula is used for vector fusion: V_global_n=(V_global×(n-1)+V_current) / n. The fused vector is normalized by modulus, so that ||V_global_n||=1, to obtain the global topic semantic vector of the nth round.
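Both fusion variants of step S83 can be sketched as follows. This is a minimal illustration using plain Python lists; the function names are assumptions, and L2 normalization is used as the reading of "normalized by modulus".

```python
import math
from typing import List, Sequence


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale the vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def fuse_running_mean(v_global: Sequence[float], v_current: Sequence[float], n: int) -> List[float]:
    """V_global_n = (V_global * (n - 1) + V_current) / n, then normalized."""
    fused = [(g * (n - 1) + c) / n for g, c in zip(v_global, v_current)]
    return l2_normalize(fused)


def fuse_ema(v_global: Sequence[float], v_current: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average with smoothing coefficient alpha, then normalized."""
    fused = [alpha * c + (1 - alpha) * g for g, c in zip(v_global, v_current)]
    return l2_normalize(fused)


v3 = fuse_running_mean([1.0, 0.0], [0.0, 1.0], n=3)
print(v3, math.sqrt(sum(x * x for x in v3)))
```

The running mean gives every round equal influence, so late rounds move the vector less and less; the exponential variant keeps a fixed 30% weight on the newest round, which makes it more responsive to topic drift.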
[0218] Step S84: Based on the statistical characteristics of this round, correct the global statistical baseline and update it to obtain the global statistical baseline for the nth round.
[0219] In this embodiment, the global statistical baseline for the nth round is the latest statistical reference value used for calculating the abstract hierarchy transition degree in subsequent rounds after integrating the text statistical features of this round. The correction uses a numerical calculation method based on iterative mean updating.
[0220] As an optional implementation, the specific inputs are the current round of statistical features and the global statistical baseline. During processing, the average abstraction level, concept density, and mean scores of the two models are extracted from the current round of statistical features. An arithmetic iterative correction formula is used, where the nth round global statistical baseline equals the global statistical baseline multiplied by n minus one, plus the current round's statistical feature value, and then divided by n. Iterative calculations are performed on each statistical indicator in the baseline, and the results are integrated to form the nth round global statistical baseline. After execution, the output is the arithmetically iteratively corrected nth round global statistical baseline.
[0221] As an alternative implementation, the input objects are the current round of statistical features and the global statistical baseline. During processing, outlier data in the current round of statistical features are removed. A weighted iterative correction method is used, assigning a weight of 0.2 to the effective statistical features of the current round and a weight of 0.8 to the original global statistical baseline. A weighted summation is then performed to update the global statistical baseline for the nth round. After execution, the output is the weighted corrected global statistical baseline for the nth round after removing outliers.
[0222] As an optional implementation, the abstract level mean A, concept density D, first model mean M1, and second model mean M2 are extracted from the statistical features of this round. The arithmetic iterative correction formulas are used: A_n=(A_global×(n-1)+A_current) / n, D_n=(D_global×(n-1)+D_current) / n, M1_n=(M1_global×(n-1)+M1_current) / n, and M2_n=(M2_global×(n-1)+M2_current) / n. The four indicators are integrated to obtain the global statistical baseline Base_global_n of the nth round.
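The baseline correction of step S84 can be sketched as follows, covering both the arithmetic iteration and the 0.2/0.8 weighted variant. This is a minimal illustration; the `Baseline` class and its field names are assumptions, and outlier removal for the weighted variant is omitted.

```python
from dataclasses import dataclass


@dataclass
class Baseline:
    abstraction: float  # abstraction level mean A
    density: float      # concept density D
    model1: float       # first model mean M1
    model2: float       # second model mean M2


def update_baseline(prev: Baseline, cur: Baseline, n: int) -> Baseline:
    """Arithmetic iteration: X_n = (X_global * (n - 1) + X_current) / n per indicator."""
    def step(g: float, c: float) -> float:
        return (g * (n - 1) + c) / n

    return Baseline(
        step(prev.abstraction, cur.abstraction),
        step(prev.density, cur.density),
        step(prev.model1, cur.model1),
        step(prev.model2, cur.model2),
    )


def update_baseline_weighted(prev: Baseline, cur: Baseline, w_cur: float = 0.2) -> Baseline:
    """Weighted variant: 0.2 on the current round, 0.8 on the previous baseline."""
    def step(g: float, c: float) -> float:
        return w_cur * c + (1 - w_cur) * g

    return Baseline(
        step(prev.abstraction, cur.abstraction),
        step(prev.density, cur.density),
        step(prev.model1, cur.model1),
        step(prev.model2, cur.model2),
    )


# Hypothetical round-2 baseline and round-3 features.
prev = Baseline(2.0, 0.010, 0.70, 0.80)
cur = Baseline(3.0, 0.012, 0.78, 0.82)
print(update_baseline(prev, cur, n=3))
```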
[0223] Step S85: Adjust the style adaptation threshold and length adaptation threshold according to the style characteristics and length characteristics of this round, respectively.
[0224] In this embodiment, adjustment refers to the process of calibrating and dynamically floating the original threshold based on the actual characteristics of the current text, so as to ensure that the threshold matches the actual generation of the global task.
[0225] As an optional implementation, the specific input objects are the current round of text style features, the current round of document length features, the text style adaptation threshold, and the document length adaptation threshold. During processing, the current round of text style features are matched with the text style adaptation threshold. If the matching deviation is less than 5%, the threshold remains unchanged; if the deviation is greater than 5%, the text style adaptation threshold is adjusted in the same direction according to the deviation magnitude. The deviation rate between the current round of document length features and the original document length adaptation threshold is calculated. If the deviation rate is within ±10%, no adjustment is made; if it exceeds this range, the document length adaptation threshold is updated to the target value of the current round of document length features. After execution, the output results are the dynamically adjusted text style adaptation threshold and document length adaptation threshold.
[0226] As an alternative implementation, the specific input objects are the current round's style features, current round's length features, style adaptation threshold, and length adaptation threshold. During processing, the average of the style features from multiple consecutive rounds is calculated, and the style adaptation threshold is recalibrated based on the average result. Based on the remaining length and number of rounds in the global task, the target length value for a single round is calculated, and this value is updated as the new length adaptation threshold. After execution, the output results are the style adaptation threshold and length adaptation threshold after multi-round mean calibration.
[0227] As an optional implementation, the style feature deviation rate is calculated as |current style feature - style adaptation threshold| / style adaptation threshold × 100%. If the deviation rate is ≤ 5%, the threshold remains unchanged; if the deviation rate is > 5%, the style adaptation threshold is adjusted in the same direction according to the deviation magnitude. The length feature deviation rate is calculated as |current length feature - length adaptation threshold| / length adaptation threshold × 100%. If the deviation rate is ≤ 10%, the threshold remains unchanged; if the deviation rate is > 10%, the length adaptation threshold is updated to the current length feature target value.
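The threshold adjustment of step S85 can be sketched as follows. This is a minimal illustration under stated assumptions: style and length features are treated as single positive numbers, and because the text does not say how far the style threshold moves "in the same direction", the `step` fraction of the gap is a hypothetical choice.

```python
def deviation_rate(feature: float, threshold: float) -> float:
    """|feature - threshold| / threshold, as a fraction rather than a percentage."""
    return abs(feature - threshold) / threshold


def adjust_style_threshold(
    feature: float,
    threshold: float,
    tolerance: float = 0.05,
    step: float = 0.5,  # assumed: move half of the gap toward the feature
) -> float:
    if deviation_rate(feature, threshold) <= tolerance:
        return threshold
    return threshold + step * (feature - threshold)


def adjust_length_threshold(feature: float, threshold: float, tolerance: float = 0.10) -> float:
    """Replace the threshold with the current length feature when deviation exceeds 10%."""
    if deviation_rate(feature, threshold) <= tolerance:
        return threshold
    return feature


print(adjust_style_threshold(1.03, 1.0))   # 3% deviation: unchanged
print(adjust_length_threshold(560, 500))   # 12% deviation: updated
```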
[0228] Step S86: Update the terrace determination threshold by combining the multi-round low-increment degradation labels.
[0229] In this embodiment, "updating" refers to the process of adjusting the sensitivity of the terrace determination threshold based on the degree of terrace formation in multiple rounds. The higher the degree of terrace formation in multiple rounds, the more stringent the threshold setting.
[0230] As an optional implementation, the input objects are the multi-round low-increment degradation labels and the original terrace determination threshold. During processing, if the multi-round low-increment degradation label value is greater than 60%, the terrace determination threshold is lowered by 10%; if the multi-round low-increment degradation label value is between 40% and 60%, the threshold remains unchanged; and if the multi-round low-increment degradation label value is less than 40%, the terrace determination threshold is increased by 5%. After execution, the output result is the terrace determination threshold after sensitivity adjustment.
[0231] As an alternative implementation, the input consists of multiple rounds of low-increment degradation labels and the original terrace determination threshold. During processing, a linear mapping formula is used to map the values of the multiple rounds of low-increment degradation labels to a threshold adjustment coefficient. The terrace determination threshold is updated by multiplying the original threshold by the adjustment coefficient, with the adjustment coefficient ranging from 0.9 to 1.1. After execution, the output is the terrace determination threshold adjusted by the linear mapping.
[0232] As an optional implementation, the multi-round low-increment degradation label P is divided into three intervals, and the terrace determination threshold T is updated according to the following rules: P>60%, T_new=T_old×0.9 (down by 10%); 40%≤P≤60%, T_new=T_old (remain unchanged); P<40%, T_new=T_old×1.05 (up by 5%).
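The two update rules of step S86 can be sketched as follows. This is a minimal illustration; the function names are assumptions, and for the linear variant the text only gives the coefficient range of 0.9 to 1.1, so the direction of the mapping (a higher P gives a lower coefficient, matching the interval rule) and its endpoints are hypothetical.

```python
def update_threshold_interval(t_old: float, p: float) -> float:
    """Three-interval rule: P > 60% lowers T by 10%, P < 40% raises T by 5%."""
    if p > 0.60:
        return t_old * 0.9
    if p < 0.40:
        return t_old * 1.05
    return t_old


def update_threshold_linear(t_old: float, p: float, lo: float = 0.9, hi: float = 1.1) -> float:
    """Assumed linear mapping: P = 0 gives coefficient hi, P = 1 gives coefficient lo."""
    p = min(1.0, max(0.0, p))
    coeff = hi - (hi - lo) * p
    return t_old * coeff


print(update_threshold_interval(0.5, 0.55))  # middle interval: unchanged
print(update_threshold_linear(0.5, 0.75))
```

Under this assumed mapping, a 5% reduction of the threshold corresponds to P = 75%.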
[0233] Step S87: Integrate the updated nth round global core concept set, nth round global topic semantic vector, nth round global statistical baseline, style adaptation threshold, length adaptation threshold, and terrace determination threshold to obtain the nth round global context feature pool.
[0234] In this embodiment, integration refers to the structured encapsulation and unified storage of all updated benchmark data to form a complete dataset that can be directly used for the next round of evaluation.
[0235] As an optional implementation, the input consists of all updated feature and threshold data. During processing, following a preset data structure template, the nth round global core concept set, the nth round global topic semantic vector, the nth round global statistical baseline, the adjusted style adaptation threshold, length adaptation threshold, and terrace determination threshold are sequentially filled into the corresponding fields, then encapsulated in a structured form and assigned a unique identifier. After execution, the output is the complete nth round global context feature pool.
[0236] As an alternative implementation, the input consists of all updated feature and threshold data. During processing, all data undergoes format standardization and normalization, redundant data is removed, and the data is stored according to data type, generating structured feature pool data containing index relationships, thus forming the nth round of the global context feature pool. After execution, the output is the standardized nth round of the global context feature pool.
[0237] As an optional implementation, the data is encapsulated according to a fixed structure: Global_Pool_n={W_global_n,V_global_n,Base_global_n,T_style,T_length,T_plateau}, where T_style is the updated style adaptation threshold, T_length is the updated length adaptation threshold, and T_plateau is the updated terrace determination threshold. Format validation and unique index binding are performed on the encapsulated data to generate the nth round global context feature pool, which can be directly used for the next round of evaluation.
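The fixed structure of step S87 can be sketched as follows. This is a minimal illustration; the `GlobalPool` class, the `build_pool` helper, the specific validation checks, and the use of the round number as the unique index are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GlobalPool:
    round_index: int               # assumed unique index: the round number n
    concepts: List[str]            # W_global_n
    topic_vector: List[float]      # V_global_n
    baseline: Dict[str, float]     # Base_global_n
    t_style: float                 # T_style
    t_length: float                # T_length
    t_plateau: float               # T_plateau


def build_pool(
    round_index: int,
    concepts: List[str],
    topic_vector: List[float],
    baseline: Dict[str, float],
    t_style: float,
    t_length: float,
    t_plateau: float,
) -> GlobalPool:
    """Validate the updated data and encapsulate it for the next round."""
    if round_index < 1:
        raise ValueError("round index must be at least 1")
    if not topic_vector:
        raise ValueError("topic vector must not be empty")
    if min(t_style, t_length, t_plateau) <= 0:
        raise ValueError("thresholds must be positive")
    return GlobalPool(
        round_index, list(concepts), list(topic_vector), dict(baseline),
        t_style, t_length, t_plateau,
    )


pool = build_pool(3, ["plateau", "transformer"], [0.6, 0.8], {"A": 2.33}, 1.0, 560.0, 0.5)
print(pool.round_index, pool.t_plateau)
```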
[0238] For example, after the system completes the third round of evaluation of academic papers in the field of artificial intelligence, it extracts the core concepts, semantic vectors, statistical features, stylistic features, and length features of this round. The newly added core concepts of this round are deduplicated and merged with the original global core concept set to update and obtain the third round global core concept set. A sliding weighted average algorithm is used to fuse the semantic vector of this round with the global topic semantic vector to generate the third round global topic semantic vector. Based on the statistical features of this round, the global statistical baseline is corrected through arithmetic iteration to obtain the third round global statistical baseline. The corresponding adaptation thresholds are dynamically adjusted according to the length and stylistic features of this round. Based on the multi-round low-increment degradation label value at this round, the terrace determination threshold is lowered by 5%. Finally, all updated data are structured and integrated to generate a third-round global context feature pool that can be used for the fourth round of evaluation.
[0239] For example, after completing the third round of output text evaluation, the system extracts the core concepts, a 768-dimensional semantic vector, an abstraction level of 3, a concept density of 0.012, a first model score of 0.78, and a second model score of 0.82. The core concepts of this round are deduplicated and fused with the global set, and sorted by importance weight to obtain the third round global core concept set. A sliding weighted average formula is used to fuse the semantic vectors: V_global_3 = (V_global_2 × 2 + V_current) / 3, and the result is normalized. The global statistical baseline is corrected using the arithmetic iteration formula to obtain the third round global statistical baseline. The style feature deviation rate is 3%, so the style adaptation threshold remains unchanged; the length feature deviation rate is 12%, so the length adaptation threshold is updated. The multi-round low-increment degradation label is 55%, so the terrace determination threshold remains unchanged. All data is encapsulated in the fixed structure to generate the third round global context feature pool, which is directly used for the fourth round of text evaluation.
[0240] This embodiment extracts multi-dimensional features from the current round of text and dynamically updates the core concepts, semantic vectors, statistical baselines, and various thresholds of the global context feature pool in all dimensions. It relies on engineering algorithms such as moving average, exponential smoothing, and linear mapping to achieve data fusion and threshold calibration. Combined with adjusting the terrace determination sensitivity according to the multi-round low-increment degradation label, it solves the technical problem that a static global evaluation benchmark cannot adapt to the dynamic changes of multi-round interactions. It forms a closed-loop iterative mechanism for multi-round question-and-answer evaluation, continuously improving the accuracy and reliability of subsequent rounds of text evaluation.
[0241] Optionally, within the framework of steps S81 to S87, a feature writing condition judgment based on the evaluation results of this round, anomaly feature isolation storage, quality-weighted update, and low-quality cycle rollback mechanism are introduced to prevent low-quality or terraced round features from polluting the global context feature pool, as detailed below.
[0242] Extract the core concepts, semantic vectors, statistical features, stylistic features, and length features of the output text for this round. At the same time, obtain the text evaluation value calculated in step S40, the single-round terrace feature label determined in step S53, and the topic matching degree calculated in step S33, as the basis for the subsequent write eligibility judgment.
[0243] The core concepts of this round are merged with the global core concept set to obtain the global core concept set of the nth round. During the fusion process, effective feature writing conditions are added.
[0244] First, write eligibility is determined. A round is eligible for writing only if both of the following conditions are met: Condition 1: the text evaluation value of this round is greater than or equal to the preset write quality threshold T_write, which is 0.5 by default; Condition 2: the single-round plateau feature label for this round is the second label, meaning that this round was not judged to be plateaued text.
[0245] If write eligibility is met, the core concepts of this round are deduplicated and merged with the global core concept set. The merging method is as follows: exact string matching and semantic vector cosine similarity matching are performed between the core concepts of this round and the global core concept set. Exact duplicates and concepts with a cosine similarity greater than a preset threshold T_sim are removed; T_sim is 0.85 by default. The retained new concepts are sorted in descending order of importance weight, calculated as: Score_import = TF_IDF score × 0.6 + TextRank score × 0.4.
[0246] The newly added concepts, after being sorted, are added to the global core concept set to form the nth round of the global core concept set.
[0247] If write eligibility is not met, the core concepts of this round are not written to the global core concept set but to the abnormal concept observation pool instead. The abnormal concept observation pool is an independent storage structure that records concepts introduced in low-quality rounds for subsequent manual review or automatic cleanup. In this case, the global core concept set for round n remains identical to that of round n-1.
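The write-eligibility check, deduplicated merge, and isolation path described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the dictionary layout of the concept set, and the tuple layout of a round's concepts are all assumptions introduced here; only T_write, T_sim, and the 0.6 / 0.4 importance weights come from the text.

```python
import math

T_WRITE = 0.5  # default write quality threshold from the text
T_SIM = 0.85   # default cosine-similarity dedup threshold from the text


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def importance(tfidf, textrank):
    """Score_import = TF_IDF score x 0.6 + TextRank score x 0.4."""
    return 0.6 * tfidf + 0.4 * textrank


def merge_concepts(global_set, observation_pool, round_concepts,
                   score_text, is_plateau):
    """Return the round-n concept set.

    global_set: dict mapping concept name -> vector (hypothetical layout).
    round_concepts: list of (name, vector, tfidf, textrank) tuples.
    """
    # Ineligible rounds are isolated; the global set stays as in round n-1.
    if score_text < T_WRITE or is_plateau:
        observation_pool.extend(name for name, *_ in round_concepts)
        return global_set
    kept = []
    for name, vec, tfidf, textrank in round_concepts:
        if name in global_set:
            continue  # exact string duplicate
        if any(cosine(vec, g) > T_SIM for g in global_set.values()):
            continue  # semantic near-duplicate
        kept.append((importance(tfidf, textrank), name, vec))
    kept.sort(key=lambda item: item[0], reverse=True)
    merged = dict(global_set)
    for _, name, vec in kept:
        merged[name] = vec
    return merged
```

The sketch compares new concepts only against the existing global set, as the text describes; deduplicating the new concepts against each other would be a further design choice.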
[0248] The semantic vector of this round is fused with the global topic semantic vector to obtain the global topic semantic vector of the nth round. A quality weighting mechanism is introduced during the fusion process.
[0249] First, the quality weight w_quality of the semantic vector in this round is calculated as a piecewise mapping of this round's text evaluation value Score_text: if Score_text is greater than or equal to 0.7, w_quality is set to 1.0; if Score_text is at least 0.5 and less than 0.7, w_quality is set to 0.6; if Score_text is at least 0.3 and less than 0.5, w_quality is set to 0.3; if Score_text is less than 0.3, w_quality is set to 0.
[0250] At the same time, the topic matching degree of this round is checked. If the topic matching degree is less than the preset topic deviation threshold T_topic (T_topic is 0.4 by default), then w_quality is forcibly set to 0, that is, the semantic vector of this round does not participate in the global topic semantic vector update.
[0251] After determining w_quality, a quality-weighted exponential moving average algorithm is used for vector fusion. The fusion formula is as follows: V_global_n = w_quality × γ × V_current + (1 - w_quality × γ) × V_global_{n-1}. Here, γ is the basic smoothing coefficient, which defaults to 0.3. When w_quality is 0, V_global_n = V_global_{n-1}, meaning that the global topic semantic vector is not updated in this round.
[0252] After merging, V_global_n is normalized to ensure that its magnitude is 1.
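The quality weight mapping and the quality-weighted exponential moving average can be sketched as follows. The function names are hypothetical, and treating each score band as inclusive of its lower bound is an assumption, since the text does not state how the band edges are handled.

```python
import math

GAMMA = 0.3    # base smoothing coefficient (default in the text)
T_TOPIC = 0.4  # topic deviation threshold (default in the text)


def quality_weight(score_text, topic_match):
    """Piecewise w_quality; an off-topic round is forced to 0."""
    if topic_match < T_TOPIC:
        return 0.0
    if score_text >= 0.7:
        return 1.0
    if score_text >= 0.5:  # assumed: lower bound inclusive
        return 0.6
    if score_text >= 0.3:  # assumed: lower bound inclusive
        return 0.3
    return 0.0


def fuse_vector(v_prev, v_current, w_quality, gamma=GAMMA):
    """V_n = k * V_current + (1 - k) * V_{n-1} with k = w_quality * gamma,
    followed by normalization to unit length."""
    k = w_quality * gamma
    mixed = [k * c + (1.0 - k) * p for c, p in zip(v_current, v_prev)]
    norm = math.sqrt(sum(x * x for x in mixed))
    return [x / norm for x in mixed] if norm else mixed
```

With w_quality = 0 the effective coefficient k is 0, so a previously normalized global vector passes through unchanged, matching the no-update case in the text.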
[0253] Based on the statistical characteristics of this round, the global statistical baseline is corrected and updated to obtain the global statistical baseline of the nth round. During the correction process, a quality coefficient is introduced to adjust the iteration step size.
[0254] The global statistical baseline includes four indicators: average abstraction level A_global, average concept density D_global, mean score of the first model M1_global, and mean score of the second model M2_global. The quality-weighted iterative correction formulas for each indicator are as follows: A_global_n = A_global_{n-1} + w_quality × (A_current - A_global_{n-1}) / n; D_global_n = D_global_{n-1} + w_quality × (D_current - D_global_{n-1}) / n; M1_global_n = M1_global_{n-1} + w_quality × (M1_current - M1_global_{n-1}) / n; M2_global_n = M2_global_{n-1} + w_quality × (M2_current - M2_global_{n-1}) / n.
[0255] Here, w_quality uses the quality weight calculated in step S83, and n is the current cumulative round number. When w_quality is 0, none of the indicators are updated.
[0256] The updated four indicators are integrated to form the global statistical baseline for the nth round.
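The four correction formulas share one form, so they can be sketched with a single helper. The dictionary keys and the function name are assumptions made for illustration.

```python
def update_baseline(baseline, current, w_quality, n):
    """Quality-weighted iterative correction applied to every indicator:
    X_n = X_{n-1} + w_quality * (X_current - X_{n-1}) / n
    """
    return {
        key: baseline[key] + w_quality * (current[key] - baseline[key]) / n
        for key in baseline
    }


# Hypothetical round-3 update; the previous-baseline values are invented.
previous = {"A": 2.0, "D": 0.010, "M1": 0.70, "M2": 0.80}
observed = {"A": 3.0, "D": 0.012, "M1": 0.78, "M2": 0.82}
round_3 = update_baseline(previous, observed, w_quality=1.0, n=3)
```

When w_quality is 1, the formula reduces to the standard incremental update of an arithmetic mean over n rounds; smaller weights shrink the step, and a weight of 0 leaves every indicator unchanged.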
[0257] The text style adaptation threshold and length adaptation threshold are adjusted according to the characteristics of the text style and length of this round, respectively, and an abnormal fluctuation protection mechanism is added during the adjustment process.
[0258] To adjust the style matching threshold, first calculate the deviation rate between the current style features and the current style matching threshold. The formula for calculating the deviation rate is: Deviation rate = |Current style features - Style adaptation threshold| / Style adaptation threshold × 100%.
[0259] When the deviation rate is less than or equal to 5%, the style adaptation threshold remains unchanged. When the deviation rate is greater than 5%, it is further determined whether the text evaluation value of this round is greater than or equal to the write quality threshold T_write. If so, the style adaptation threshold is adjusted in the same direction by 50% of the deviation amplitude; if not, the style adaptation threshold remains unchanged. The purpose of halving the adjustment amplitude is to prevent the threshold from drifting significantly due to abnormal style features in a single round.
[0260] To adjust the length adaptation threshold, the deviation rate between the current length feature and the length adaptation threshold is calculated. If the deviation rate is within ±10%, the length adaptation threshold remains unchanged. If the deviation rate exceeds ±10%, it is further determined whether the current text evaluation value is greater than or equal to T_write. If so, the length adaptation threshold is updated to the arithmetic mean of the current length feature value and the current length adaptation threshold; otherwise, the length adaptation threshold remains unchanged.
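The two threshold adjustments with their fluctuation guards can be sketched as follows. Reading "50% of the deviation amplitude" as half of the signed difference between the feature and the threshold is an assumption, as are the function names; the sketch also assumes positive thresholds.

```python
T_WRITE = 0.5  # write quality threshold (default in the text)


def deviation_rate(feature, threshold):
    """|feature - threshold| / threshold, as a fraction rather than a percent."""
    return abs(feature - threshold) / threshold


def adjust_style_threshold(threshold, feature, score_text):
    """Move halfway toward the feature only if deviation > 5% and the
    round passed the write quality threshold."""
    if deviation_rate(feature, threshold) <= 0.05 or score_text < T_WRITE:
        return threshold
    return threshold + 0.5 * (feature - threshold)


def adjust_length_threshold(threshold, feature, score_text):
    """Replace by the arithmetic mean only if deviation > 10% and the
    round passed the write quality threshold."""
    if deviation_rate(feature, threshold) <= 0.10 or score_text < T_WRITE:
        return threshold
    return (threshold + feature) / 2.0
```

Under this reading the two rules produce the same new value once triggered (halfway between the old threshold and the feature); they differ only in the trigger band, 5% for style and 10% for length.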
[0261] The plateau judgment threshold is updated based on the multi-round plateau feature, using the same update logic as in Embodiment 8.
[0262] When the multi-round plateau feature value is greater than 60%, the plateau judgment threshold is lowered to 90% of its original value; when the value is between 40% and 60%, the threshold remains unchanged; when the value is less than 40%, the threshold is raised to 105% of its original value.
[0263] By integrating the updated nth-round global core concept set, nth-round global topic semantic vector, nth-round global statistical baseline, style adaptation threshold, length adaptation threshold, and plateau judgment threshold, the nth-round global context feature pool is obtained.
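The three-band rule above can be sketched as a single function. The function name is hypothetical, and the ratio is passed as a fraction (0.55 for 55%).

```python
def update_plateau_threshold(threshold, plateau_ratio):
    """Lower the threshold to 90% when the multi-round ratio exceeds 60%,
    raise it to 105% when the ratio is below 40%, otherwise keep it."""
    if plateau_ratio > 0.60:
        return threshold * 0.90
    if plateau_ratio < 0.40:
        return threshold * 1.05
    return threshold
```

For the 55% case given in the earlier example, the function returns the threshold unchanged.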
[0264] Before integration and encapsulation, a consistency verification step is added: This step checks whether the data structures in the nth round global context feature pool are complete, whether the dimensions of each vector are consistent, and whether each threshold is within a preset valid value range. If the verification passes, the data is encapsulated according to a preset data structure template, bound with a unique identifier, and a nth round global context feature pool that can be directly used for the (n+1)th round evaluation is generated.
[0265] Building upon this, a rollback trigger mechanism is further introduced: when a preset number of consecutive low-quality outputs occur, a global context feature pool rollback is triggered, restoring the feature pool to the state of the most recent checkpoint before the low-quality output occurred. The criterion for consecutive low-quality outputs is: the text evaluation value for K consecutive rounds is lower than T_write, where K defaults to 3. The checkpoint generation strategy is: each time the global context feature pool is successfully updated and the text evaluation value for that round is greater than or equal to 0.7, a snapshot of the current feature pool is saved as a checkpoint. When a rollback is triggered, the most recent valid checkpoint is loaded, and the corresponding abnormal concept record in the abnormal concept observation pool is cleared.
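The checkpoint and rollback mechanism can be sketched as a small state holder. The class and attribute names are hypothetical. Two simplifications are assumptions of this sketch: the initial pool is taken as the first checkpoint, and a rollback clears the whole observation pool rather than only the records tied to the rolled-back rounds.

```python
import copy


class FeaturePoolHistory:
    """Tracks the live feature pool, the latest checkpoint, and the
    streak of consecutive low-quality rounds."""

    def __init__(self, pool, k=3, t_write=0.5, t_checkpoint=0.7):
        self.pool = pool
        self.checkpoint = copy.deepcopy(pool)  # assumed initial checkpoint
        self.observation_pool = []
        self.low_streak = 0
        self.k = k
        self.t_write = t_write
        self.t_checkpoint = t_checkpoint

    def commit(self, new_pool, score_text, quarantined=()):
        """Record one round; return True if a rollback was triggered."""
        self.pool = new_pool
        self.observation_pool.extend(quarantined)
        if score_text >= self.t_checkpoint:
            # Snapshot so later in-place edits cannot alter the checkpoint.
            self.checkpoint = copy.deepcopy(new_pool)
        if score_text < self.t_write:
            self.low_streak += 1
        else:
            self.low_streak = 0
        if self.low_streak >= self.k:
            self.pool = copy.deepcopy(self.checkpoint)
            self.observation_pool.clear()
            self.low_streak = 0
            return True
        return False
```

A deep copy is used for snapshots because the pool is expected to hold nested structures (concept sets, vectors, thresholds) that are updated in place between rounds.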
[0266] This embodiment effectively prevents low-quality rounds from polluting the global context by introducing write eligibility determination, isolated storage of abnormal concepts, quality-weighted semantic vector fusion, iteration step size adjustment, abnormal fluctuation protection, and rollback triggering mechanisms. This enhances the engineering credibility and long-term stability of the global context feature pool in multi-round interactions. Furthermore, for large-scale multi-round generation tasks, a global context feature pool that updates with each round is constructed. In each round, absolute quality evaluation values and relative incremental evaluation values are calculated simultaneously. Combined with dynamic fusion of task completion, the system further identifies single-round plateaued outputs through a combination of conditions: "high expression quality, low cognitive quality, and low relative increment." Multi-round plateaued output warnings are then provided through continuous round statistics.
[0267] Based on any of the above embodiments, in Embodiment 9 of this application, an initialized global context feature pool is first constructed based on the input information of the task configuration control. Then, for each round of output of the AI large model, its text evaluation value is obtained and the global context feature pool of the current round is updated to realize real-time result evaluation of AI dialogue.
[0268] Optionally, referring to Figure 2, an example of a multi-turn dialogue control is shown, displaying the nth-turn output text from a large language model. When the cursor selects the i-th-turn output text, a display control for the text evaluation value corresponding to that text is rendered, showing the text evaluation value, global novelty contribution, global quality contribution, and plateau warning.
[0269] Optionally, the solution comprises the following components. A dual-dimension anchored detail text evaluation value model: the text evaluation value of a single round of output is decomposed into an absolute quality score and a relative incremental score, which adapts to the dynamic characteristics of multi-round interaction and avoids the bias of evaluating each round in isolation. A progressive plateau trend detection mechanism: detection proceeds from single-round low-increment degradation label recognition to graded early warning of the multi-round cumulative plateau trend, moving from after-the-fact detection to advance prevention and capturing the risk of AI mean regression early. A global context dynamic baseline system: as multi-round interaction progresses, the feature baseline of the global generation task is dynamically updated, so that each round of evaluation is anchored to the latest global context. A global semantic incremental contribution value quantification algorithm: based on the core dimension of semantic incremental complexity, the cognitive increment that a single round of output brings to the global text is quantified, determining whether the round is novel with respect to the whole and heavily penalizing homogeneous and redundant content.
[0270] Optionally, the overall flow is as follows: the user initiates a global generation task; a global baseline is initialized; multiple rounds of interaction follow (the user inputs an instruction, the AI generates output, real-time calculation is triggered, evaluation feedback is output, the global baseline is updated, and the next round begins); and a full-process report is output upon task completion.
[0271] Step 1: Global generation task initialization and baseline construction.
[0272] When a user initiates a global text generation task (such as an industry report, paper, solution, or novel), a global baseline is first established to anchor subsequent rounds of evaluation and avoid vague judgments made without a benchmark. Global task information collection: the user's initial needs, generation goals, domain, style, length requirements, core themes, and core concept set are collected to define the boundaries and evaluation criteria of the global generation task. Domain benchmark library construction: based on the task's domain and theme, a high-quality text library for the corresponding domain is matched, and a domain benchmark text evaluation value (including the average domain semantic incremental complexity, average expression structure stability, extreme value thresholds, and variance benchmark) is calculated as an external reference for global quality. Global context feature pool initialization: a dynamically updated feature pool is created, whose initial state includes the core concept table of the user's initial needs, the theme semantic vector, and the target abstraction level; effective features output by the AI in each subsequent round are synchronously updated into this pool. Plateau prevention baseline initialization: a single-round low-increment degradation label judgment threshold and a multi-round cumulative plateau trend warning threshold are preset and adaptively adjusted by style and domain (e.g., a higher semantic incremental complexity threshold for academic papers and a higher expression structure stability threshold for official documents).
[0273] Step 2: Multi-round interactive core loop, each AI output triggers full-process calculation. After the AI completes its current output, the full-process calculation for this step is immediately triggered. This is the core of the solution, fully addressing the user's need for "calculation for every AI output," and is divided into 4 core sub-steps.
[0274] Sub-step 2.1 Refine the evaluation value of the detail text output by the AI in this round.
[0275] The detail text evaluation score is a refined comprehensive score for the content of a single round. It is a weighted fusion of the absolute quality score and the relative incremental score, with the weights adjusted adaptively according to the global generation stage so that the score suits the differing requirements of each stage of multi-round generation.
[0276] (1) Absolute quality score: The dual entropy quantification of the content itself in this round. For the independent text output by AI in this round, the refined semantic incremental complexity and expression structure stability score are calculated based on the basic theory of text evaluation value, normalized to the [0,1] interval, and the dimensions are optimized to adapt to the characteristics of short text in a single round: Semantic incremental complexity of this round PhilEnt_round (5 core dimensions, weights are initialized equally, and text style adaptation is supported): Core concept density: the number of effective topics / professional concepts in this round / the length of text in this round, eliminating meaningless concept stacking; Increment of thought depth: the degree of improvement of the abstract level of the content in this round relative to the user's instructions in this round, divided into 4 levels: phenomenon description, rule summary, principle abstraction, and paradigm innovation; Structural adaptability: The complexity of the syntax structure and the completeness of the logical chain in this round match the user's command requirements in this round; Cross-domain adaptability: The adaptability of the cross-domain concepts introduced in this round to the global theme, distinguishing effective innovation from irrelevant content; Logical self-consistency: The logical recursive depth and the completeness of the argument / narrative loop in this round of content.
[0277] This round's expression structure stability LogiEnt_round (7 core dimensions, initialized with equal weights): Syntactic compliance and complexity: syntactic tree depth, clause nesting level, syntactic error rate (error rate results in negative deduction); Symbolic information density: number of effective technical terms, data, and proprietary concepts per unit length; Semantic stability: semantic coherence of clauses in this round, without ambiguity or logical jumps; Logical depth: number of reasoning steps from premise to conclusion in this round, and completeness of argumentation; Structural hierarchy clarity: recognizability of the hierarchical structure of paragraphs and clauses in this round; Concept definition accuracy: boundary clarity and definition accuracy of core concepts in this round; Contextual coherence: logical smoothness of the connection between this round and the previous round's interactive content.
[0278] The formula for calculating the absolute quality score is: Score_abs = α × PhilEnt_round + β × LogiEnt_round.
[0279] Rule Explanation: AI outputs in each round are mostly short texts of less than 1,000 characters, with a default α=0.4 and β=0.6 (in accordance with the basic rule that short texts have a higher weight for structural stability). If the user's instructions in this round explicitly require innovation / deep thinking, α will adaptively increase to 0.6-0.7.
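The absolute quality score can be sketched as follows. The text gives α = 0.4 and β = 0.6 as defaults but does not state β when α is raised to 0.6-0.7; setting β = 1 - α is an assumption of this sketch, as are the function names and the equal-weight mean used to combine dimension scores.

```python
def dimension_mean(scores):
    """Equal-weight combination of per-dimension scores in [0, 1]."""
    return sum(scores) / len(scores)


def score_abs(phil_ent, logi_ent, alpha=0.4):
    """Score_abs = alpha * PhilEnt_round + beta * LogiEnt_round,
    with beta = 1 - alpha (assumed)."""
    beta = 1.0 - alpha
    return alpha * phil_ent + beta * logi_ent
```

With the default weights, PhilEnt_round = 0.72 and LogiEnt_round = 0.78 give 0.4 × 0.72 + 0.6 × 0.78 = 0.756.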
[0280] (2) Relative incremental score: The value increment of this round of content to the whole.
[0281] This is a core innovation adapted to multi-round interaction scenarios, addressing the pain point of "seemingly high-quality single rounds, but with no value to the whole." By comparing the content of this round with the global context feature pool, the added value is quantified and normalized to the [0,1] interval: Core calculation dimensions: Core concept addition rate: the number of effective core concepts not appearing in the global feature pool in this round / the total number of core concepts in this round; Abstraction level leap: the improvement of the abstraction level of this round's ideas relative to the average abstraction level of the global text; Logical structure innovation: whether the argument / narrative structure introduced in this round is an effective innovative structure not appearing globally; Topic matching degree: the semantic similarity between the content of this round and the core theme of the global generation task, avoiding invalid increments that deviate from the theme; Redundancy reverse index: the repetition rate and homogenization degree between this round and the globally generated text, reverse normalized (the higher the repetition rate, the lower the score).
[0282] The formula for calculating the relative incremental score is: Score_inc = γ × (concept addition rate × 0.3 + abstract leap degree × 0.3 + structural innovation degree × 0.2 + theme matching degree × 0.15 - redundancy degree × 0.35).
[0283] Rule Explanation: γ is the normalization coefficient, ensuring that Score_inc falls within the [0,1] interval; the core of the formula emphasizes new concepts and deep leaps, and severely punishes homogeneous and redundant content.
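The relative incremental score can be sketched as follows. The text does not give a value for γ here. Because the positive weights sum to 0.95 and the redundancy term can make the bracket negative, this sketch assumes γ = 1 / 0.95 and additionally clamps the result to [0, 1]; both choices are assumptions, not stated in the source.

```python
def score_inc(concept_rate, abstraction_leap, structure_innovation,
              topic_match, redundancy, gamma=1.0 / 0.95):
    """Weighted sum of the five dimensions (each in [0, 1]), scaled by
    gamma and clamped to [0, 1]."""
    raw = (0.30 * concept_rate
           + 0.30 * abstraction_leap
           + 0.20 * structure_innovation
           + 0.15 * topic_match
           - 0.35 * redundancy)
    return min(1.0, max(0.0, gamma * raw))
```

The redundancy weight of 0.35 is the largest single coefficient, so a fully redundant round loses more than any one positive dimension can contribute, which is how the formula penalizes homogeneous content.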
[0284] (3) Integration of detailed text evaluation values and comprehensive scores.
[0285] Based on the current completion rate of the global generation task, the weights of the two components are adjusted adaptively to match the core objective of each generation stage: Early generation (completion rate < 30%): absolute quality score weight 60%, relative incremental score weight 40%; the goal is to establish the basic framework and ensure basic content quality. Mid-generation (completion rate 30%-70%): absolute quality score weight 45%, relative incremental score weight 55%; the goal is to fill in content and deepen the theme while balancing quality and increment. Late generation (completion rate > 70%): absolute quality score weight 30%, relative incremental score weight 70%; the goal is to optimize, supplement, and innovate, with the focus on incremental value.
[0286] The final formula for calculating the detail text evaluation score is: Text evaluation score_detail = w_abs × Score_abs + w_inc × Score_inc.
[0287] Where w_abs is the absolute quality score weight, w_inc is the relative incremental score weight, and w_abs+w_inc=1.
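The stage-dependent fusion can be sketched as follows. The function names are hypothetical, and the completion rate is passed as a fraction (0.10 for 10%).

```python
def stage_weights(completion):
    """Return (w_abs, w_inc) for the current completion rate."""
    if completion < 0.30:
        return 0.60, 0.40  # early generation
    if completion <= 0.70:
        return 0.45, 0.55  # mid-generation
    return 0.30, 0.70      # late generation


def text_score_detail(score_abs, score_inc, completion):
    """Text evaluation score_detail = w_abs * Score_abs + w_inc * Score_inc."""
    w_abs, w_inc = stage_weights(completion)
    return w_abs * score_abs + w_inc * score_inc
```

For example, at a completion rate of 10% with Score_abs = 0.756 and Score_inc = 0.82, the fused value is 0.6 × 0.756 + 0.4 × 0.82 = 0.7816.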
[0288] Sub-step 2.2 Single-round uniform low-increment degradation label recognition and multi-round trend early warning.
[0289] For this round of AI output, we first identify low-incremental-rate degradation labels in a single round, and then combine historical data from multiple rounds to judge the overall trend of plateau formation, so as to achieve proactive prevention and control rather than reactive remediation.
[0290] (1) Determination of single-round plateau content.
[0291] The core judgment logic is that a round is considered single-round plateau content only if it meets three conditions simultaneously: high expression structure stability, low semantic incremental complexity, and insufficient relative increment. 1. LogiEnt_round ≥ the threshold for acceptable stability of the expression structure (default 0.7).
[0292] 2. PhilEnt_round ≤ semantic increment complexity threshold (default 0.4).
[0293] 3. Score_inc ≤ Incremental Insufficient Threshold (default 0.3).
[0294] Content that conforms to this rule refers to AI-generated content that users perceive as "seemingly smooth and compliant, but in reality lacking depth, originality, and being merely filler content."
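The three single-round conditions above combine into one conjunction, sketched here with the default thresholds from the text. The function and parameter names are hypothetical.

```python
def is_single_round_plateau(logi_ent, phil_ent, score_inc,
                            t_logi=0.7, t_phil=0.4, t_inc=0.3):
    """True only when expression stability is high AND semantic
    incremental complexity is low AND the relative increment is low."""
    return (logi_ent >= t_logi
            and phil_ent <= t_phil
            and score_inc <= t_inc)
```

All three conditions must hold at once: a fluent but deep round, or a shallow but clearly novel round, is not flagged.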
[0295] (2) Multi-round cumulative trend classification early warning of the plateau.
[0296] Based on the results of the last 5 rounds of calculations, cumulative statistical indicators are calculated to achieve a three-level early warning system and detect mean regression risks in advance: Red warning (severe plateauing): content is judged as plateaued for 3 or more consecutive rounds; or the proportion of plateauing in the last 5 rounds is ≥60%; or the semantic incremental complexity variance narrowing rate is ≥50%. AI has entered a severe mean regression, the content is highly homogenized, and a uniform plateau is formed, requiring mandatory intervention.
[0297] Orange alert (moderate plateauing): The proportion of plateauing in the last 5 rounds is ≥40%; or the semantic incremental complexity variance narrowing rate is ≥30%; or the average incremental score decrease rate is ≥40%. AI shows a clear mean reversion trend and is evolving towards a uniform plateau, which requires timely intervention.
[0298] Yellow alert (mild plateauing): a single round is judged to be plateau content; or the proportion of plateau content in the last 5 rounds is ≥20%. The AI is showing signs of plateauing, and subsequent trends need to be monitored.
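The three warning levels can be sketched as a cascade checked from most to least severe. Several readings are assumptions of this sketch: "consecutive" is taken as the streak ending at the current round, the yellow single-round condition is taken as the current round being flagged, and when fewer than 5 rounds exist the missing rounds are counted as non-plateau. The function name and the fractional inputs are hypothetical.

```python
WINDOW = 5  # the text evaluates the most recent 5 rounds


def warning_level(plateau_flags, variance_narrowing=0.0, inc_score_drop=0.0):
    """plateau_flags: list of booleans, oldest round first.
    variance_narrowing / inc_score_drop: fractions (0.5 means 50%)."""
    recent = plateau_flags[-WINDOW:]
    ratio = sum(recent) / WINDOW
    streak = 0
    for flag in reversed(plateau_flags):
        if not flag:
            break
        streak += 1
    if streak >= 3 or ratio >= 0.6 or variance_narrowing >= 0.5:
        return "red"
    if ratio >= 0.4 or variance_narrowing >= 0.3 or inc_score_drop >= 0.4:
        return "orange"
    if (bool(plateau_flags) and plateau_flags[-1]) or ratio >= 0.2:
        return "yellow"
    return "none"
```

Checking red first means a round that satisfies several levels is reported at the most severe one.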
[0299] Sub-step 2.3 Comprehensive evaluation of global novelty and quality fit.
[0300] Based on the above calculation results, we directly answer the user's core question: Is the content of this round novel and good enough for the whole? We quantify it into two practical core indicators: (1) Global semantic increment contribution value, which accurately quantifies the innovative increment brought by the output of this round to the global generation task. The formula is as follows: Novelty_global = Score_inc × PhilEnt_round × Topic matching degree.
[0301] Judgment criteria: Novelty_global≥0.7: High novelty contribution, bringing significant innovation increment to the overall system in this round, belonging to high-value core content; 0.4≤Novelty_global<0.7: Moderate novelty contribution, with some incremental value in this round, but insufficient innovation; Novelty_global < 0.4: Low / no novelty contribution, no effective innovation increment in this round, belonging to homogeneous / redundant content.
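The novelty contribution and its three bands can be sketched as follows; the function names and band labels are hypothetical.

```python
def novelty_global(score_inc, phil_ent, topic_match):
    """Novelty_global = Score_inc * PhilEnt_round * topic matching degree."""
    return score_inc * phil_ent * topic_match


def novelty_band(value):
    """Map a Novelty_global value to the three bands in the text."""
    if value >= 0.7:
        return "high"
    if value >= 0.4:
        return "moderate"
    return "low"
```

Because the value is a product of three factors in [0, 1], it can never exceed the smallest factor, so the high band is reachable only when all three factors are themselves well above 0.7.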
[0302] (2) Global quality fit.
[0303] The formula for comprehensively evaluating the fit and quality level of this round of content with respect to the overall task is as follows: Quality_global = Text evaluation value_detail × (1 - plateau penalty coefficient).
[0304] Plateau penalty coefficient rules: if a single round of content is judged as plateaued, the penalty coefficient is 0.5; if an orange or higher warning is triggered, the penalty coefficient is increased to 0.7; if there is no low-increment degradation label, the penalty coefficient is 0. Judgment criteria: Quality_global ≥ 0.8: excellent quality, fully meets global requirements and quality standards; 0.6 ≤ Quality_global < 0.8: good quality, basically meets global requirements, and can be slightly optimized; 0.4 ≤ Quality_global < 0.6: average quality, with obvious shortcomings, and needs to be modified; Quality_global < 0.4: unacceptable quality, cannot meet global requirements, and a rewrite is required.
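The penalty coefficient rules and quality bands above can be sketched as follows. The text does not say which rule applies when an orange or red warning is active but the current round itself carries no degradation label; giving the warning precedence is an assumption of this sketch, as are the function names and band labels.

```python
def plateau_penalty(is_plateau, warning="none"):
    """0.7 for an orange or red warning, 0.5 for a flagged single round,
    otherwise 0 (warning precedence is assumed)."""
    if warning in ("orange", "red"):
        return 0.7
    if is_plateau:
        return 0.5
    return 0.0


def quality_global(text_score, is_plateau, warning="none"):
    """Quality_global = text evaluation value * (1 - penalty coefficient)."""
    return text_score * (1.0 - plateau_penalty(is_plateau, warning))


def quality_band(value):
    """Map a Quality_global value to the four bands in the text."""
    if value >= 0.8:
        return "excellent"
    if value >= 0.6:
        return "good"
    if value >= 0.4:
        return "average"
    return "unacceptable"
```

For instance, a round with a text evaluation value of 0.39 that is flagged under a yellow alert scores 0.39 × 0.5 = 0.195 and falls in the lowest band.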
[0305] Sub-step 2.4 Real-time result output and user feedback.
[0306] The results of this round of calculation are fed back to the user in real time in a visual, actionable format, including: Key indicator cards: the detail text evaluation score, global semantic increment contribution, global quality fit, and plateau risk level; Detailed analysis: the scores for each dimension of semantic incremental complexity and expression structure stability in this round, along with an analysis of core strengths and weaknesses; Risk warning: if a plateau warning is triggered, the risk level, cause, and impact on the whole are clearly indicated; Actionable optimization suggestions: specific optimization directions are given for shortcomings and risks, such as "It is recommended to add cross-domain cases to improve semantic incremental complexity" and "It is recommended to reduce homogeneous expressions and add in-depth interpretations of core concepts"; One-click optimization instructions: optimization instructions that can be input directly into the AI are generated automatically, reducing the user's iteration cost.
[0307] Step 3: Dynamically update the global context feature pool.
[0308] After this round of calculation and feedback is complete, the global context feature pool is updated synchronously to provide the latest global anchor for the next round of evaluation. The updates include: Global core concept table: valid core concepts from this round are added and invalid or irrelevant concepts are removed; Global semantic vector: the overall semantic vector of all generated text is updated; Global statistical baseline: the average semantic incremental complexity, average expression structure stability, average concept density, average abstraction level, and other indicators of the global text are updated; Plateau statistics: the cumulative number of plateau rounds, the trend of the semantic incremental complexity variance, the trend of the incremental score, and similar figures are updated; Global generation completion rate: the current completion rate is updated against the user's initial requirements and is used to adapt the weights in the next round.
[0309] Step 4: Generate a full-process quality report after the task is completed.
[0310] After the user completes all rounds of interaction and the global generation task ends, a complete end-to-end quality report is output, including: Final text evaluation score: the overall score of the global text, the total semantic incremental complexity and expression structure stability scores, and the overall quality level; Entropy change curve for the whole process: the trend of the detail text evaluation value and of the semantic incremental complexity and expression structure stability scores in each round, marking entropy peaks (high-value innovation rounds) and entropy troughs (low-quality or plateau rounds); Full-process plateau risk analysis: the rounds in which early warnings were triggered, their causes, and their impact on the global text; Global distribution of novelty contribution: a ranking of the novelty contribution of each round of output and identification of the core innovation points; Global text optimization suggestions: specific optimization solutions for entropy troughs, weak links, and plateau content.
[0311] For example, users interact with AI in multiple rounds to write a 15,000-word industry report entitled "Innovation and Governance of Content Ecosystem in the AI Era," which is divided into 6 chapters, and this solution is used for real-time evaluation throughout the process.
[0312] Initialization phase: user requirements are collected and the benchmark library for the media / AI field is matched, yielding an average domain semantic incremental complexity of 0.68 and an average expression structure stability of 0.72; the global feature pool and the plateau warning thresholds are initialized.
[0313] First round of interaction (introduction generation, early generation): User input: "Please write a report introduction, 1500 words, introducing the overall impact of AI-generated content on the content ecosystem"; the AI outputs a 1500-word introduction, triggering real-time calculation: PhilEnt_round = 0.72, LogiEnt_round = 0.78, Score_abs = 0.756; 8 new core concepts were added without redundancy, Score_inc = 0.82; final text evaluation value_detail = 0.78; Plateau detection: no low-increment degradation label, no early warning; Overall assessment: novelty contribution 0.58 (moderate), quality fit 0.78 (good); Feedback to the user: the basic framework of the introduction is complete and the logic is sound. It is suggested that subsequent chapters add an interdisciplinary governance model to improve semantic incremental complexity and innovation. The global feature pool is updated; completion rate 10%.
[0314] Fifth round of interaction (refining the problem chapter, mid-generation): User input: "Please elaborate on the existing problems in the AI content ecosystem, focusing on the impact of homogenized content"; the AI outputs 1200 words of content that is fluent but lacks depth and new core concepts and is highly homogenized; real-time calculation is triggered: PhilEnt_round = 0.38, LogiEnt_round = 0.76, Score_abs = 0.588; core concept addition rate 0, redundancy 35%, Score_inc = 0.22; final text evaluation value_detail = 0.39; Plateau detection: the conditions for single-round plateau content are met, triggering a yellow alert; Overall assessment: novelty contribution 0.08 (no novelty), quality fit 0.19 (unacceptable); Feedback to the user: yellow plateau warning. This round is plateau content with high structural stability and low semantic incremental complexity; it lacks innovative additions and fails to meet quality standards. Optimization suggestions and a one-click optimization instruction are provided at the same time: "Based on this content, please supplement with quantitative data on the uniform plateau effect and comparison cases of cross-domain homogenization risk, improve the depth of thought and conceptual density, and avoid homogenized expressions." The global feature pool is updated, and this round is marked as a plateau round.
[0315] Task completion: a full-process report is output, marking the entropy change curves of the 12 rounds of interaction, identifying 3 high-value innovation rounds and 2 plateau rounds, and providing an overall optimization plan.
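The figures in the two example rounds can be re-derived from the stated formulas. The short check below assumes the default weights α = 0.4 and β = 0.6 for both rounds and the stage weights given for early and mid-generation; the variable and function names are hypothetical.

```python
ALPHA, BETA = 0.4, 0.6  # default weights for short single-round text


def score_abs(phil_ent, logi_ent):
    return ALPHA * phil_ent + BETA * logi_ent


def fuse(s_abs, s_inc, w_abs, w_inc):
    return w_abs * s_abs + w_inc * s_inc


# Round 1 (early generation, weights 0.60 / 0.40)
r1_abs = score_abs(0.72, 0.78)             # 0.756, as stated
r1_text = fuse(r1_abs, 0.82, 0.60, 0.40)   # 0.7816, reported as 0.78

# Round 5 (mid-generation, weights 0.45 / 0.55)
r5_abs_default = score_abs(0.38, 0.76)                    # 0.608
r5_text_stated = fuse(0.588, 0.22, 0.45, 0.55)            # 0.3856
r5_text_default = fuse(r5_abs_default, 0.22, 0.45, 0.55)  # 0.3946
r5_quality = r5_text_stated * (1 - 0.5)                   # 0.1928
```

Round 1 reproduces exactly. For round 5, the default weights give Score_abs = 0.608 rather than the stated 0.588, so the example may have used slightly different weights for that round; the source does not say. Either value yields a text evaluation value that rounds to 0.39, and the stated quality fit of 0.19 follows from 0.3856 × 0.5.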
[0316] This application provides an AI multi-turn question answering evaluation device, which includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the AI multi-turn question answering evaluation method in the above embodiment 1.
[0317] Reference is now made to Figure 3, which shows a structural schematic suitable for implementing the AI multi-turn question-answering evaluation device in the embodiments of this application. The AI multi-turn question-answering evaluation device in the embodiments of this application may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, personal digital assistants (PDAs), tablets, and in-vehicle terminals, as well as fixed terminals such as digital TVs and desktop computers. The AI multi-turn question-answering evaluation device shown in Figure 3 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.
[0318] As shown in Figure 3, the AI multi-turn question-answering evaluation device may include a processing unit 1001 (e.g., a central processing unit, a graphics processing unit, etc.), which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 1002 or a program loaded from storage device 1003 into random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the AI multi-turn question-answering evaluation device. The processing unit 1001, ROM 1002, and RAM 1004 are interconnected via a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus. Typically, the following devices can be connected to the I/O interface 1006: input devices 1007 including, for example, touchscreens, touchpads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, etc.; output devices 1008 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 1003 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1009. The communication device 1009 allows the AI multi-turn question-answering evaluation device to communicate with other devices, wirelessly or by wire, to exchange data. Although the figure shows an AI multi-turn question-answering evaluation device with various devices, it should be understood that it is not required to implement or possess all the devices shown. Alternatively, more or fewer devices may be implemented or provided.
[0319] Specifically, according to the embodiments disclosed in this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via the communication device 1009, or installed from storage device 1003, or installed from ROM 1002. When the computer program is executed by processing unit 1001, it performs the functions defined in the methods of the embodiments disclosed in this application.
[0320] The AI multi-turn question-answering evaluation device provided in this application, employing the AI multi-turn question-answering evaluation method in the above embodiments, can solve the technical problem that AI multi-turn generation naturally exhibits probability mean regression and thus produces worthless corpora. Compared with the prior art, the beneficial effects of the AI multi-turn question-answering evaluation device provided in this application are the same as those of the AI multi-turn question-answering evaluation method provided in the above embodiments, and the other technical features of this device are the same as those disclosed in the method of the previous embodiment, and will not be repeated here.
[0321] It should be understood that the various parts disclosed in this application can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more embodiments or examples.
[0322] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0323] This application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon, the computer-readable program instructions being used to execute the AI multi-turn question-answering evaluation method in the above embodiments.
[0324] The computer-readable storage medium provided in this application may be, for example, a USB flash drive, and more generally may be, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this embodiment, the computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. The program code contained on the computer-readable storage medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, radio frequency (RF), or any suitable combination thereof.
[0325] The aforementioned computer-readable storage medium may be included in the AI multi-turn question-answering evaluation device; or it may exist independently and not be assembled into the AI multi-turn question-answering evaluation device.
[0326] The aforementioned computer-readable storage medium carries one or more programs. When these programs are executed by an AI multi-turn question-answering evaluation device, they cause the device to: in response to the input information of the task configuration control, construct a global context feature pool based on the global task parsed from the input information and the baseline evaluation value of the domain to which the global task belongs; in response to the output text of the large language model in the nth round, evaluate the output text by weighting a preset first model and a preset second model to obtain an absolute quality evaluation value; compare the output text, based on a preset dimension, with the global context feature pool corresponding to the (n-1)th round to obtain a relative incremental evaluation value; and weight and fuse the absolute quality evaluation value and the relative incremental evaluation value according to the global task completion degree of the output text to obtain a text evaluation value.
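As a minimal sketch of the two weighting steps in this flow, written in Python: the weight `w_first`, the bounds `w_abs_start` and `w_abs_end`, and the linear dependence of the fusion weight on the completion degree are all illustrative assumptions. The text states only that the fusion is weighted according to the global task completion degree, not how.

```python
def absolute_quality(first_eval: float, second_eval: float,
                     w_first: float = 0.5) -> float:
    """Weighted average of the first and second model evaluation values.

    w_first is a hypothetical weight; the source does not fix it here.
    """
    if not 0.0 <= w_first <= 1.0:
        raise ValueError("w_first must lie in [0, 1]")
    return w_first * first_eval + (1.0 - w_first) * second_eval


def fuse(score_abs: float, score_inc: float, completion: float,
         w_abs_start: float = 0.4, w_abs_end: float = 0.6) -> float:
    """Fuse absolute and incremental values into a text evaluation value.

    Assumption: the weight on the absolute value moves linearly from
    w_abs_start (task just begun) to w_abs_end (task complete).
    """
    if not 0.0 <= completion <= 1.0:
        raise ValueError("completion must lie in [0, 1]")
    w_abs = w_abs_start + (w_abs_end - w_abs_start) * completion
    return w_abs * score_abs + (1.0 - w_abs) * score_inc
```

Any monotone schedule could replace the linear one; the point of the sketch is only that the balance between the two values is a function of how far the global task has progressed.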
[0327] Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a local area network (LAN) or a wide area network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0328] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0329] The modules described in the embodiments of this application can be implemented in software or hardware. The name of a module does not, in itself, limit the functionality of that module.
[0330] The readable storage medium provided in this application is a computer-readable storage medium that stores computer-readable program instructions (i.e., a computer program) for executing the above-described AI multi-turn question-answering evaluation method. This addresses the technical problem of AI multi-turn generation resulting in naturally occurring probability mean regression and thus creating worthless corpora. Compared to the prior art, the beneficial effects of the computer-readable storage medium provided in this application are the same as those of the AI multi-turn question-answering evaluation method provided in the above embodiments, and will not be elaborated upon here.
[0331] This application provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the AI multi-turn question-answering evaluation method described above.
[0332] The computer program product provided in this application can solve the technical problem that AI multi-turn generation naturally exhibits probability mean regression and thus produces worthless corpora. Compared with the prior art, the beneficial effects of the computer program product provided in the embodiments of this application are the same as those of the AI multi-turn question-answering evaluation method provided in the above embodiments, and will not be repeated here.
[0333] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent scope of this application.
Claims
1. An AI multi-turn question-answering evaluation method, characterized in that the AI multi-turn question-answering evaluation method includes: in response to the input information from the task configuration control, constructing a global context feature pool based on the global task parsed from the input information and the baseline evaluation value of the domain to which the global task belongs; in response to the output text of the large language model in the nth round, evaluating the output text by weighting a preset first model and a preset second model to obtain an absolute quality evaluation value; based on a preset dimension, comparing the output text with the global context feature pool corresponding to the (n-1)th round to obtain a relative incremental evaluation value; and, based on the global task completion degree of the output text, weighting and fusing the absolute quality evaluation value and the relative incremental evaluation value to obtain the text evaluation value.
2. The AI multi-turn question-answering evaluation method as described in claim 1, characterized in that the step of constructing a global context feature pool based on the global task parsed from the input information and the baseline evaluation value of the domain to which the global task belongs includes: parsing the input information of the task configuration control to obtain the generation target, core theme, text type, length requirement, and domain of the global task; calculating the baseline evaluation value based on a pre-built domain text library, the baseline evaluation value including a first model benchmark value and a second model benchmark value; and initializing the global context feature pool based on the generation target, core theme, text type, and length requirement, combined with the baseline evaluation value; wherein the global context feature pool includes at least a global core concept set, a global theme semantic vector, a global statistical baseline, a text style adaptation threshold, a length adaptation threshold, and a plateau determination threshold.
3. The AI multi-turn question-answering evaluation method as described in claim 1, characterized in that, The process of weighting the output text based on a preset first model and a second model to obtain an absolute quality evaluation value includes: The output text is processed based on the first model to obtain a first model evaluation value, and the output text is processed based on the second model to obtain a second model evaluation value. The first model includes a reward module, a penalty module, and a compensation module, and the second model includes a preset number of expression factor dimensions. Based on the text type, the evaluation values of the first model and the second model are fused to generate stable field labels and local features corresponding to the input text; The absolute quality evaluation value of the output text is generated based on the stable field label and local features.
4. The AI multi-turn question-answering evaluation method as described in claim 1, characterized in that comparing the output text, based on a preset dimension, with the global context feature pool corresponding to the (n-1)th round to obtain a relative incremental evaluation value includes: extracting the core concepts, semantic vector, text style features, and length features of the output text for this round; comparing the core concepts of this round with the global core concept set in the global context feature pool of the (n-1)th round, and calculating the proportion of newly added core concepts; comparing the semantic vector of this round with the global topic semantic vector in the global context feature pool of the (n-1)th round to calculate the topic matching degree; comparing the text style features and length features of this round with the text style adaptation threshold and length adaptation threshold in the global context feature pool of the (n-1)th round, respectively, and calculating the feature adaptation degree; calculating the abstraction-level leap degree and text redundancy of the output text based on the global statistical baseline in the global context feature pool of the (n-1)th round; and obtaining the relative incremental evaluation value by integrating the proportion of newly added core concepts, the topic matching degree, the feature adaptation degree, the abstraction-level leap degree, and the text redundancy.
5. The AI multi-turn question-answering evaluation method as described in claim 1, characterized in that, After obtaining the text evaluation value by weighted fusion of the absolute quality evaluation value and the relative incremental evaluation value based on the global task completion degree of the output text, the process includes: Obtain the first model evaluation value, the second model evaluation value, and the relative incremental evaluation value corresponding to the output text; The evaluation values of the first model, the second model, and the relative incremental evaluation values are compared one by one with the plateau determination threshold in the global context feature pool. When the evaluation value of the second model meets the standard, the evaluation value of the first model does not meet the standard, and the relative incremental evaluation value does not meet the standard, the single-round low incremental degradation label of the output text is determined to be the first label; Otherwise, the single-round low-increment degradation label of the output text is determined to be the second label.
6. The AI multi-turn question-answering evaluation method as described in claim 5, characterized in that the method further includes: counting the occurrences of the first label within a preset number of rounds and generating a multi-round low-increment degradation label; comparing the multi-round low-increment degradation label with a preset alarm threshold; and, if the multi-round low-increment degradation label triggers an alarm, outputting plateau warning information of the corresponding level.
7. The AI multi-turn question-answering evaluation method as described in claim 6, characterized in that, The method further includes: The relative incremental evaluation value, the first model evaluation value, and the topic matching degree are weighted and summed according to preset weights to obtain the global semantic incremental contribution value. The corresponding plateauing penalty coefficient is determined based on the single-round low-increment degradation label, and the global quality contribution is obtained based on the text evaluation value and the plateauing penalty coefficient.
8. The AI multi-turn question-answering evaluation method as described in claim 7, characterized in that the method further includes: extracting the core concepts, semantic vector, statistical features, style features, and length features of the output text for this round; merging the core concepts of this round with the global core concept set to obtain the nth round global core concept set; fusing the semantic vector of this round with the global topic semantic vector to obtain the nth round global topic semantic vector; correcting and updating the global statistical baseline based on the statistical features of this round to obtain the nth round global statistical baseline; adjusting the style adaptation threshold and length adaptation threshold according to the style features and length features of this round, respectively; updating the plateau determination threshold in combination with the multi-round low-increment degradation label; and integrating the nth round global core concept set, the nth round global topic semantic vector, the nth round global statistical baseline, the style adaptation threshold, the length adaptation threshold, and the plateau determination threshold to obtain the nth round global context feature pool.
9. An AI multi-turn question-answering evaluation device, characterized in that, The AI multi-turn question-answering evaluation device includes: a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being configured to implement the steps of the AI multi-turn question-answering evaluation method as described in any one of claims 1 to 8.
10. A storage medium, characterized in that, The storage medium is a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, it implements the steps of the AI multi-turn question-answering evaluation method as described in any one of claims 1 to 8.
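By way of illustration only, and not as part of the claims, two of the comparisons in claim 4 and two of the updates in claim 8 can be sketched in Python. The function names, the dictionary layout of the pool, cosine similarity as the topic matching measure, and the exponential moving average with factor `alpha` are all assumptions introduced for this sketch; the claims do not specify these formulas.

```python
from __future__ import annotations

import math


def new_concept_ratio(round_concepts: set, global_concepts: set) -> float:
    """Share of this round's core concepts that are absent from the pool."""
    if not round_concepts:
        return 0.0
    return len(round_concepts - global_concepts) / len(round_concepts)


def topic_match(round_vec: list, global_vec: list) -> float:
    """Cosine similarity, used here as an assumed topic matching degree."""
    dot = sum(r * g for r, g in zip(round_vec, global_vec))
    norm_r = math.sqrt(sum(r * r for r in round_vec))
    norm_g = math.sqrt(sum(g * g for g in global_vec))
    if norm_r == 0.0 or norm_g == 0.0:
        return 0.0
    return dot / (norm_r * norm_g)


def update_pool(pool: dict, round_concepts: set, round_vec: list,
                alpha: float = 0.2) -> dict:
    """Return the round-n pool built from the round-(n-1) pool.

    Concepts are merged by set union; the theme vector is fused with an
    exponential moving average (alpha is a hypothetical smoothing factor).
    The input pool is left unchanged.
    """
    fused = [(1.0 - alpha) * g + alpha * r
             for g, r in zip(pool["theme_vector"], round_vec)]
    return {
        "concepts": pool["concepts"] | round_concepts,
        "theme_vector": fused,
    }
```

Returning a new pool rather than mutating the old one keeps the round (n-1) pool available for the comparison step while the round n pool is being built.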