Agent optimization method based on prompt word engineering and test set management

By using a closed-loop optimization method for agents, the problems of management randomness and insufficient coverage in the agent optimization process are solved, achieving efficient and accurate optimization results and sustainable agent improvement.

CN122242503A · Pending · Publication Date: 2026-06-19 · BOYA TRIZ (TIANJIN) TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BOYA TRIZ (TIANJIN) TECH CO LTD
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

The current intelligent agent optimization process suffers from crude prompt word management, fragmented testing and verification, experience-based problem diagnosis, and inefficient optimization iteration. It lacks systematic correlation and closed-loop management, resulting in high randomness, insufficient coverage, inaccurate problem localization, and low iteration efficiency in intelligent agent optimization.

Method used

By establishing a prompt word template configuration, structured test set management, automated batch testing, and cross-matrix diagnostic analysis, a closed loop is formed: "prompt word template → structured test set management → automated batch testing → cross-matrix diagnostic analysis → targeted optimization iteration," thereby achieving systematic, quantifiable, and sustainable improvement in agent optimization.

Benefits of technology

It has achieved the engineering of prompt word management, improved test coverage, enhanced the accuracy of problem localization, increased optimization efficiency by 3 to 5 times, and improved the accuracy of intelligent agents by 15% to 30%.

✦ Generated by Eureka AI based on patent content.

Figure CN122242503A_ABST

Abstract

This invention discloses an intelligent agent optimization method based on prompt word engineering and test set management, belonging to the field of artificial intelligence technology. The method establishes a closed loop of "prompt word template configuration → structured test set management → automated testing → cross-matrix diagnosis → targeted optimization." Prompt word management introduces a five-level engineering decomposition (system role, task instruction, and so on) together with an effect fingerprint mechanism; the test set relies on multi-dimensional annotation and quantitative coverage evaluation over a three-dimensional space. The core innovation is a "sample-task cross matrix" (STCM) diagnostic method, which automatically identifies combinations of failure factors through decision tree attribution and quantifies the optimization effect using the net gain index (NIS). This invention addresses problems such as high randomness in the optimization process, insufficient test coverage, and inaccurate problem localization. In practical applications, optimization efficiency is reported to improve by 3 to 5 times, and problem localization accuracy to exceed 85%.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence technology, specifically relating to an agent optimization method based on prompt word engineering and test set management.

Background Technology

[0002] With the rapid development of large language model technology, LLM-based agent systems are becoming increasingly popular in enterprise applications, covering various scenarios such as intelligent customer service, knowledge-based question answering, document processing, and workflow automation. However, the actual effectiveness of agents depends heavily on the design quality of prompts and the rationality of system configuration.

[0003] Currently, the optimization process of intelligent agents mainly faces the following problems:

[0004] Prompt word management is rudimentary: Most development teams write prompt words as free text, lacking a structured template system and version control mechanism. Modifications to prompt words are often based on personal experience, without systematic engineering methodology, resulting in a high degree of randomness and poor reproducibility in the optimization process.

[0005] Fragmented testing and validation: The validation of agent performance often relies on manual sampling tests, lacking a complete test dataset management system. The selection of test samples lacks scientific coverage design, failing to systematically expose the weaknesses of agents under different task types and data characteristics.

[0006] Problem diagnosis is based on experience: When an agent performs poorly, developers usually rely on experience to locate the cause of the problem. They lack quantitative diagnostic tools and cannot accurately distinguish whether the error is caused by flawed prompt word design, insufficient knowledge base coverage, model capability limitations, or abnormal data features.

[0007] Inefficient optimization iteration: Due to the lack of a closed-loop management mechanism of "test-diagnosis-optimization-regression verification", each optimization requires a lot of manpower to repeatedly verify, and it is difficult to evaluate the incremental effect brought by a single optimization.

[0008] Existing prompt optimization tools (such as LangSmith and PromptLayer) primarily focus on prompt log tracking and basic analysis, failing to establish a systematic connection between prompt engineering, test set management, and problem diagnosis. Among existing publicly available technical literature, CN120336515A discloses a prompt optimization method for large language models, CN119396735A discloses a method for generating test cases based on large models, CN120994544A discloses a method for evaluating assertion error attribution ability based on large models, CN120278153A discloses a prompt optimization method, CN120542583A discloses an automatic prompt optimization method for large language models, CN121166507A discloses an agent evaluation method, and CN120045901A discloses a method for evaluating the self-criticism ability of large language models. None of these existing technologies achieve a systematic integration of prompt template management, structured test set annotation, and sample-task cross-matrix diagnosis.

Summary of the Invention

[0009] Technical problems to be solved

[0010] To address the aforementioned shortcomings of existing technologies, this invention aims to provide an agent optimization method based on prompt word engineering and test set management. By establishing a complete closed loop of "prompt word template configuration → test set structured management → automated batch testing → cross-matrix diagnostic analysis → targeted optimization iteration," this method solves the problems of high randomness in the agent optimization process, insufficient test coverage, inaccurate problem localization, and low iteration efficiency in existing technologies, thereby achieving a systematic, quantifiable, and sustainable improvement in agent accuracy.

[0011] Technical solution

[0012] To achieve the above objectives, the present invention provides the following technical solution:

[0013] An agent optimization method based on prompt word engineering and test set management includes the following steps:

[0014] S1. Prompt word template construction steps: Decompose the prompt words of the intelligent agent into five levels: system role layer, task instruction layer, constraint rule layer, output format layer and dynamic context layer. Each level is configured and managed independently. And associate each prompt word version with an effect fingerprint. The effect fingerprint records the score vector of each indicator on the standard test set for that version.

[0015] S2. Test dataset construction step: Construct a structured test dataset, label each test sample with task type, difficulty level, knowledge domain classification, and data feature labels, and calculate the coverage index $C$ over the three-dimensional space of task type, difficulty level, and knowledge domain. The coverage index is $C = (N_{\text{covered}} / N_{\text{total}}) \times W_{\text{balance}}$, where $N_{\text{covered}}$ is the number of cells of the three-dimensional space that are covered, $N_{\text{total}}$ is the total number of cells defined by business requirements, and $W_{\text{balance}}$ is a weighting coefficient based on the balance of sample sizes across cells;

[0016] S3, Automated batch testing steps: The test dataset is submitted to the agent for inference in batches. The evaluation score of each sample is obtained through the multi-dimensional automatic evaluation module, and a result snapshot associated with the version number of the prompt word template constructed in step S1 is generated. Each batch test result generates an immutable snapshot, supports incremental comparison analysis between any two batch test results, and automatically identifies samples with improved effect, degraded effect, and unchanged effect.

[0017] S4. Sample-Task Cross Matrix Diagnostic Steps: Construct a sample-task cross matrix with test samples as rows and task dimension attributes as columns. Based on multidimensional aggregation analysis and decision tree attribution analysis algorithms, automatically identify the key factor combinations that lead to the agent's failure and output the attribution analysis results.

[0018] S5. Targeted optimization and validation step: Based on the attribution analysis results of step S4, targeted optimization suggestions are generated by matching key failure factors with preset optimization rule templates at each level of the prompt words. The prompt word templates are then modified accordingly. Targeted validation is first performed on the subset of failure samples, followed by regression testing on the full test set. The optimization effect is evaluated with the net gain index $\mathrm{NIS} = (N_{\text{fixed}} - N_{\text{regressed}}) / N_{\text{total}} \times 100\%$, and a new version is released when NIS is greater than 0 and statistically significant;

[0019] S6. Closed-loop iteration step: Form an iterative closed loop from steps S1 to S5 and continue to execute until the agent's accuracy reaches the target threshold; at the same time, record each successful optimization operation as a structured optimization case and build an optimization knowledge base. In subsequent optimizations, perform vector similarity matching between the current combination of failure factors and the combination of failure factors of historical cases in the optimization knowledge base. When the similarity exceeds a preset threshold, automatically recommend the corresponding historical solution.

[0020] Furthermore, in step S1, the version management includes: automatically generating a version snapshot each time a prompt word is modified, recording the modification timestamp, the modifier, the difference in modified content, and the modification reason tag, supporting difference comparison between any two versions and one-click rollback based on the version number; when the system detects that the effect fingerprint of the new version has changed beyond a preset threshold, the preset threshold being that the change in the score of any indicator in the effect fingerprint exceeds twice the standard deviation of the historical mean of that indicator, an alarm or rollback suggestion is automatically triggered to prevent effect degradation during the optimization process.

[0021] Furthermore, step S1 also includes a parameterized variable injection mechanism: parameterized placeholders (such as business scenarios, knowledge domains, output constraints, etc.) are defined in the prompt word template, and corresponding parameter values are dynamically injected at runtime according to the specific task type, so that the same basic template can be adapted to multiple business scenarios, greatly improving the reusability of prompt words.

[0022] Furthermore, step S2 also includes sample coverage analysis: the system automatically calculates the sample distribution density of the test set in each region of the three-dimensional space, identifies coverage blind spots (i.e., areas with business needs but lacking test samples), and automatically generates sample supplementation suggestions; and annotation quality control: an annotation consistency verification mechanism is introduced, multiple annotators are assigned to the same sample for cross-annotation, the consistency coefficient (Cohen's Kappa) between annotators is calculated, and the review process is automatically triggered when the consistency is lower than the threshold.

[0023] Furthermore, in step S3, the multi-dimensional automatic evaluation module integrates the following evaluation methods: Exact Match evaluation, semantic similarity evaluation based on cosine similarity of embedded vectors, structured output format verification (JSON Schema / regular expression), and evaluation based on Large Language Model Judge (LLM-as-Judge), supporting weighted comprehensive scoring of multiple evaluation dimensions.

[0024] Furthermore, in step S4, the decision tree attribution analysis algorithm uses the evaluation results of the test samples as the target variable and the task type, difficulty level, knowledge domain, data features, and prompt word version as feature variables. By training the decision tree model, it automatically identifies the combination of key factors that lead to failure and their importance ranking, and outputs an interpretable report containing Top-K key failure factors and their information gain ranking, interaction effect analysis between factors, and targeted optimization suggestions for each key factor.

[0025] Furthermore, step S4 also includes time-series analysis based on the cross matrix to automatically detect the following abnormal patterns: sudden degradation in a certain dimension (such as a sudden drop in scores for a specific task type due to a new version of the prompt word); systematic bias (such as consistently low scores for all samples containing table features); and periodic fluctuations (such as unstable inference results during a specific period, which may indicate nondeterministic behavior of the model API).

[0026] Beneficial effects

[0027] The present invention has the following beneficial effects:

[0028] (1) This invention proposes a multi-level template configuration and effect fingerprint management method for prompt words, which upgrades prompt words from free text to engineered multi-level configurable templates, and precisely associates each prompt word version with its effect indicators through the effect fingerprint mechanism. This solves the problems of crude prompt word management and difficult version tracing in the prior art. A single optimization iteration cycle is shortened from 3 to 5 working days with the traditional method to 0.5 to 1 working day, improving optimization efficiency by 3 to 5 times.

[0029] (2) This invention proposes a sample-task cross matrix (STCM) diagnostic analysis method, which for the first time constructs a cross matrix with test samples and multi-dimensional task attributes, and automatically identifies the combination of key factors that cause the agent to fail based on the decision tree attribution analysis algorithm. This improves the traditional experience-based problem diagnosis to data-driven quantitative analysis, and increases the problem location accuracy from about 50% of experience-based diagnosis to more than 85%.

[0030] (3) This invention proposes a three-dimensional coverage quantification evaluation method for test datasets. Through the coverage index and an automatic identification algorithm for coverage blind spots, the business scenario coverage of the test dataset is increased from an initial range of 40% to 60% up to more than 90%, which solves the problem that existing test solutions have insufficient coverage that cannot be quantified.

[0031] (4) This invention achieves precise quantitative evaluation of optimization effect through net gain index NIS and statistical significance test, and combines effect fingerprint to achieve automatic rollback protection, ensuring that each iteration is a positive optimization. The overall accuracy of the agent on the standard test set can be improved by 15% to 30% after 3 to 5 iterations.

[0032] (5) This invention realizes an automatic accumulation and recommendation mechanism for the optimization knowledge base: the entire process of each optimization is recorded in a structured way to form a searchable and recommendable optimization knowledge base, enabling the accumulation and reuse of organization-level optimization experience and significantly reducing the manpower cost of subsequent optimization.

Attached Figure Description

[0033] Figure 1 is a flowchart of the overall closed-loop process of the intelligent agent optimization method based on prompt word engineering and test set management according to the present invention.

[0034] Figure 2 is a schematic diagram of the multi-level prompt word template architecture and version management of the present invention.

[0035] Figure 3 is a schematic diagram of the Sample-Task Cross Matrix (STCM) structure of the present invention.

[0036] Figure 4 is a schematic diagram of the decision tree attribution analysis process of the present invention.

Detailed Implementation

[0037] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. As those skilled in the art will understand, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0038] Referring to Figures 1 to 4, this invention provides an agent optimization method based on prompt word engineering and test set management, comprising the following steps:

[0039] Specifically, the process of S1 step—prompt word template construction and version management—is as follows:

[0040] This invention proposes a method for configuring prompt words using templates, which decomposes traditional free-text prompt words into structured, configurable templates. As shown in Figure 2, it includes the following hierarchical design:

[0041] System Role Layer: Defines the agent's identity, area of expertise, and behavioral boundaries, such as "You are a professional LNG receiving terminal safety inspection expert." This layer typically remains relatively stable; changes to it signify a fundamental adjustment to the agent's role.

[0042] Task Instruction Layer: Describes the core tasks that the agent needs to complete, using a structured, step-by-step description method, and supports parameterized placeholders (such as ...).

[0043] Constraint Rule Layer: Specifies the constraints that the agent must adhere to when performing tasks, including output restrictions, prohibited behaviors, and boundary conditions. This is the layer in which prompt word optimization occurs most frequently.

[0044] Output Format Layer: Defines the structured format requirements for the agent's response, such as JSON Schema, Markdown structure, table format, etc., and is used in conjunction with the format validation of the automated evaluation module.

[0045] Dynamic Context Layer: Context information dynamically injected at runtime based on specific business scenarios, such as relevant document fragments, historical dialogue records, and real-time data.

[0046] Each level can be configured and versioned independently, supporting cross-level combination and reuse. Parameterized placeholders are defined in the prompt word template, and corresponding parameter values are dynamically injected at runtime based on the specific task type, allowing the same basic template to adapt to multiple business scenarios.
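The five-layer decomposition and runtime placeholder injection can be illustrated with a minimal Python sketch. The class name `PromptTemplate`, the `$name` placeholder syntax, and all layer texts other than the system role example are assumptions made for illustration; the source does not specify a concrete placeholder syntax.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class PromptTemplate:
    """Hypothetical container for the five independently managed layers."""
    system_role: str
    task_instruction: str
    constraint_rules: str
    output_format: str
    dynamic_context: str

    def render(self, **params: str) -> str:
        """Inject runtime parameters into every layer and join the layers."""
        layers = (
            self.system_role,
            self.task_instruction,
            self.constraint_rules,
            self.output_format,
            self.dynamic_context,
        )
        # Template.substitute raises KeyError when a placeholder has no value,
        # so a missing parameter fails loudly instead of leaking "$name".
        return "\n\n".join(Template(layer).substitute(**params) for layer in layers)


template = PromptTemplate(
    system_role="You are a professional $domain expert.",
    task_instruction="Answer the user's question about $scenario step by step.",
    constraint_rules="Do not answer questions outside $domain.",
    output_format="Reply as a JSON object with the keys 'answer' and 'sources'.",
    dynamic_context="Relevant documents:\n$documents",
)

prompt = template.render(
    domain="LNG receiving terminal safety inspection",
    scenario="daily inspection",
    documents="(retrieved fragments go here)",
)
```

Because each layer is a separate field, one layer can be swapped (for example a revised constraint rule layer) while the others are reused unchanged.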

[0047] Version management mechanism: Each time the prompt word is modified, a version snapshot is automatically generated, recording the modification timestamp, the modifier, the difference in modified content (Diff), and the modification reason tag (such as "Fixed logical reasoning failure" or "Optimized format compliance"). It supports difference comparison between any two versions and one-click rollback based on version number.
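A minimal sketch of such a version history is shown below, assuming an in-memory store and a line-based unified diff from the standard library. The names `VersionSnapshot` and `PromptHistory` are hypothetical, and rollback is modeled here as committing the old text as a new version so that the history itself stays append-only; the source does not state how rollback is recorded.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VersionSnapshot:
    version: int
    timestamp: datetime
    modifier: str
    reason_tag: str
    text: str
    diff: str  # unified diff against the previous version


class PromptHistory:
    def __init__(self) -> None:
        self._snapshots: list[VersionSnapshot] = []

    def commit(self, text: str, modifier: str, reason_tag: str) -> VersionSnapshot:
        """Record a new snapshot together with its diff to the previous one."""
        previous = self._snapshots[-1].text if self._snapshots else ""
        diff = "\n".join(difflib.unified_diff(
            previous.splitlines(), text.splitlines(),
            fromfile="previous", tofile="current", lineterm="",
        ))
        snapshot = VersionSnapshot(
            version=len(self._snapshots) + 1,
            timestamp=datetime.now(timezone.utc),
            modifier=modifier,
            reason_tag=reason_tag,
            text=text,
            diff=diff,
        )
        self._snapshots.append(snapshot)
        return snapshot

    def rollback(self, version: int, modifier: str) -> VersionSnapshot:
        """Restore an earlier version by committing its text as a new version."""
        target = self._snapshots[version - 1]
        return self.commit(target.text, modifier, f"Rollback to v{version}")
```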

[0048] Effect Fingerprint Mechanism: Each prompt word version is associated with an effect fingerprint, which records the score vector of each metric (such as accuracy, format compliance rate, response length distribution, etc.) on the standard test set for that version. When the system detects that the effect fingerprint of a new version has changed beyond a preset threshold (i.e., the change in any metric score exceeds twice the standard deviation of the historical mean of that metric), it automatically triggers an alarm or rollback suggestion to prevent effect degradation during the optimization process.
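The threshold rule can be sketched as follows. This is one possible reading of the text: the "change" of a metric is taken as the difference between the new score and the mean of its historical scores, compared against twice the sample standard deviation of that history. The function name `fingerprint_alerts` and the dictionary layout are assumptions.

```python
from statistics import mean, stdev


def fingerprint_alerts(
    history: dict[str, list[float]],
    new_fingerprint: dict[str, float],
    k: float = 2.0,
) -> dict[str, float]:
    """Return metrics whose new score deviates from the historical mean
    by more than k sample standard deviations, with the signed deviation."""
    alerts: dict[str, float] = {}
    for metric, scores in history.items():
        # At least two historical scores are needed for a standard deviation.
        if len(scores) < 2 or metric not in new_fingerprint:
            continue
        deviation = new_fingerprint[metric] - mean(scores)
        if abs(deviation) > k * stdev(scores):
            alerts[metric] = deviation
    return alerts


history = {
    "accuracy": [0.70, 0.72, 0.71, 0.73],
    "format_compliance": [0.95, 0.96, 0.94, 0.95],
}
alerts = fingerprint_alerts(history, {"accuracy": 0.60, "format_compliance": 0.95})
```

In this example only `accuracy` is flagged, with a negative deviation, which is the case in which an alarm or rollback suggestion would be raised.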

[0049] Specifically, the S2 step—the process of constructing and structurally labeling the test dataset—is as follows:

[0050] This invention proposes a test dataset management method for intelligent agent quality assessment, employing a multi-dimensional annotation system: each test sample not only includes the input question and expected output (standard answer), but also includes the following structured metadata:

[0051] Task type tags, such as "fact-based question and answer," "logical reasoning," "format generation," and "multi-turn dialogue," are used to distinguish tasks with different cognitive difficulties and ability dimensions.

[0052] Difficulty levels: L1 to L5, based on a comprehensive assessment of the number of reasoning steps required for the task, the depth of knowledge, and the difficulty of semantic understanding.

[0053] Knowledge domain classification: This corresponds to the domain division in the knowledge base and is used for correlation analysis between the retrieval effect and the coverage of the knowledge base.

[0054] Data feature labels, such as "long text input", "containing tables", "fuzzy expression", "polysemous words", etc., are used to identify data features that trigger specific failure modes.

[0055] Expected evaluation dimensions, such as "accuracy", "completeness", "format compliance", and "response timeliness", are used to guide the weight configuration of the multi-dimensional automatic evaluation module.

[0056] Coverage Analysis Algorithm: This invention proposes a coverage quantification and evaluation method based on a three-dimensional space of task type, difficulty level, and knowledge domain, as shown in Figure 3. The system automatically calculates the sample distribution density of the test set in each region of the three-dimensional space, identifies coverage blind spots (i.e., areas with business needs but lacking test samples), and automatically generates sample supplementation suggestions. The coverage index is calculated as $C = (N_{\text{covered}} / N_{\text{total}}) \times W_{\text{balance}}$, where $N_{\text{covered}}$ is the number of cells of the three-dimensional space that are covered, $N_{\text{total}}$ is the total number of cells defined by business requirements, and $W_{\text{balance}}$ is a weighting coefficient based on the balance of sample sizes across cells, with a value ranging from 0 to 1. $W_{\text{balance}}$ is calculated as follows: let the numbers of samples in the covered cells be $n_1, n_2, \ldots, n_k$ (with $k = N_{\text{covered}}$), and compute the sample share of each cell, $p_i = n_i / \sum_j n_j$. Then $W_{\text{balance}} = -\sum_{i=1}^{k} p_i \log_k p_i$, the normalized information entropy of the sample distribution across cells. $W_{\text{balance}} = 1$ when the samples are distributed completely uniformly over the covered cells, and $W_{\text{balance}}$ approaches 0 when the samples are concentrated in a single cell.
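A minimal sketch of the coverage index follows. Cells are represented as `(task_type, difficulty, knowledge_domain)` tuples, which is an assumed encoding. The case of exactly one covered cell is handled by returning a balance weight of 0, since a logarithm to base 1 is undefined; that choice follows the stated limiting behavior but is an assumption, not something the source spells out.

```python
import math
from collections import Counter

Cell = tuple[str, str, str]  # (task_type, difficulty, knowledge_domain)


def coverage_index(sample_cells: list[Cell], required_cells: set[Cell]) -> float:
    """C = (N_covered / N_total) * W_balance, with W_balance the normalized
    entropy of the sample distribution over the covered cells."""
    counts = Counter(cell for cell in sample_cells if cell in required_cells)
    k = len(counts)  # N_covered
    if not required_cells or k == 0:
        return 0.0
    if k == 1:
        w_balance = 0.0  # assumption: log base 1 is undefined, treat as 0
    else:
        total = sum(counts.values())
        w_balance = -sum((n / total) * math.log(n / total, k) for n in counts.values())
    return (k / len(required_cells)) * w_balance


def blind_spots(sample_cells: list[Cell], required_cells: set[Cell]) -> set[Cell]:
    """Cells required by the business that contain no test sample."""
    return required_cells - set(sample_cells)


required = {
    ("fact_qa", "L1", "safety"), ("fact_qa", "L2", "safety"),
    ("reasoning", "L4", "safety"), ("reasoning", "L5", "safety"),
}
balanced = [("fact_qa", "L1", "safety")] * 5 + [("reasoning", "L4", "safety")] * 5
skewed = [("fact_qa", "L1", "safety")] * 9 + [("reasoning", "L4", "safety")] * 1
```

With two of four cells covered, the balanced sample set yields a coverage index of 0.5, while the skewed one scores lower because its entropy weight is below 1.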

[0057] Dynamic annotation quality control: An annotation consistency verification mechanism is introduced, which assigns multiple annotators to cross-annotate the same sample and calculates the consistency coefficient (Cohen's Kappa) among annotators. When the consistency is lower than a preset threshold (usually 0.7), the review process is automatically triggered to ensure the annotation quality of the test dataset.
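For reference, Cohen's Kappa for a pair of annotators can be computed as in the sketch below. The function name and the label data are illustrative; with more than two annotators, a pairwise average or a multi-rater statistic would be needed, which the source does not specify.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's Kappa for two annotators labeling the same samples."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample list")
    n = len(labels_a)
    # Observed agreement: share of samples with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


annotator_a = ["L1", "L2", "L2", "L3", "L1", "L2"]
annotator_b = ["L1", "L2", "L3", "L3", "L1", "L1"]
kappa = cohens_kappa(annotator_a, annotator_b)
needs_review = kappa < 0.7  # threshold value taken from the text
```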

[0058] Specifically, the S3 step—automated batch testing and result collection—is as follows:

[0059] This invention designs an automated batch testing engine that supports submitting test datasets to the agent for inference in batches, automatically managing request queues, concurrency control, timeout retries, and exception handling. Each batch test is associated with a specific prompt word version number to ensure the traceability of test results.

[0060] The multi-dimensional automatic evaluation module integrates multiple automatic evaluation methods: Exact Match evaluation is used for question-answering tasks with standard answers; semantic similarity evaluation is based on cosine similarity of embedded vectors and is suitable for open-ended generation tasks; structured output format validation verifies the compliance of the output format through JSON Schema or regular expressions; and LLM-as-Judge evaluation is used for tasks with high subjectivity, scoring the output through another large language model. These methods support weighted comprehensive scoring across multiple evaluation dimensions.
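The weighted composite scoring can be sketched as below. Embedding vectors and the LLM-as-Judge score are assumed to come from external models and are passed in as plain numbers; the function names, the regular expression, and the weights are illustrative only.

```python
import math
import re


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def cosine_similarity(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def format_ok(output: str, pattern: str) -> float:
    """Regular-expression format check; JSON Schema validation would be analogous."""
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


def composite_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over the evaluation dimensions present in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


scores = {
    "exact": exact_match("Answer: 42 m", "Answer: 42"),
    "semantic": cosine_similarity([1.0, 0.0], [0.8, 0.6]),
    "format": format_ok("Answer: 42 m", r"Answer: \d+( m)?"),
    "judge": 0.6,  # placeholder for a score returned by an LLM judge
}
weights = {"exact": 0.2, "semantic": 0.3, "format": 0.2, "judge": 0.3}
overall = composite_score(scores, weights)
```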

[0061] Results snapshot and incremental comparison mechanism: Each batch test result generates an immutable snapshot, supporting incremental comparison analysis between any two batch test results. The system automatically identifies samples with improved performance, deteriorated performance, and unchanged performance, and calculates the proportion and statistical significance of each category.
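A minimal sketch of the snapshot comparison is given below, assuming each snapshot stores a pass/fail flag per sample ID and using a read-only mapping to approximate immutability. The helper names are hypothetical, and the significance calculation mentioned in the text is not included here.

```python
from types import MappingProxyType
from typing import Mapping


def make_snapshot(prompt_version: str, passed: dict[str, bool]) -> Mapping:
    """Freeze one batch result as a read-only mapping tied to a prompt version."""
    return MappingProxyType({
        "prompt_version": prompt_version,
        "passed": MappingProxyType(dict(passed)),
    })


def compare_snapshots(old: Mapping, new: Mapping) -> dict[str, list[str]]:
    """Classify samples present in both snapshots by how their result changed."""
    result: dict[str, list[str]] = {"improved": [], "degraded": [], "unchanged": []}
    for sample_id in sorted(old["passed"].keys() & new["passed"].keys()):
        before, after = old["passed"][sample_id], new["passed"][sample_id]
        if after and not before:
            result["improved"].append(sample_id)
        elif before and not after:
            result["degraded"].append(sample_id)
        else:
            result["unchanged"].append(sample_id)
    return result


old = make_snapshot("v1", {"s1": False, "s2": True, "s3": True})
new = make_snapshot("v2", {"s1": True, "s2": False, "s3": True})
delta = compare_snapshots(old, new)
```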

[0062] Specifically, the process of step S4—constructing the sample-task cross matrix and diagnosing the problem—is as follows:

[0063] The core innovation of this invention lies in proposing a "Sample-Task Cross Matrix" (STCM) diagnostic analysis method, as shown in Figure 3.

[0064] Cross-matrix construction: A two-dimensional matrix with test samples as rows and task dimension attributes as columns. Column dimensions include, but are not limited to, task type, difficulty level, knowledge domain, data features, prompt word template version, etc. Each cell in the matrix records the evaluation score of the sample under the corresponding dimension.

[0065] Multi-dimensional aggregation analysis engine: Based on cross matrices, it performs multi-dimensional aggregation statistics, automatically calculating the average score, pass rate, and failure rate distribution for each dimension. For example, it can quickly identify that the pass rate is significantly lower than other combinations when the combination of "task type = logical reasoning and difficulty level = L4" is used, thus accurately locating the agent's weaknesses.
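The aggregation step can be sketched as a group-by over the rows of the cross matrix. Each row is modeled as a dictionary of dimension attributes plus a `passed` flag; this layout and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def pass_rates(rows: list[dict], dims: tuple[str, ...]) -> dict[tuple, float]:
    """Pass rate for every observed combination of the given dimensions."""
    totals: dict[tuple, list[int]] = defaultdict(lambda: [0, 0])  # [passed, total]
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key][0] += int(row["passed"])
        totals[key][1] += 1
    return {key: passed / total for key, (passed, total) in totals.items()}


rows = [
    {"task_type": "logical_reasoning", "difficulty": "L4", "passed": False},
    {"task_type": "logical_reasoning", "difficulty": "L4", "passed": False},
    {"task_type": "logical_reasoning", "difficulty": "L4", "passed": True},
    {"task_type": "logical_reasoning", "difficulty": "L2", "passed": True},
    {"task_type": "fact_qa", "difficulty": "L4", "passed": True},
    {"task_type": "fact_qa", "difficulty": "L2", "passed": True},
]
by_combo = pass_rates(rows, ("task_type", "difficulty"))
weakest = min(by_combo, key=by_combo.get)
```

Here `weakest` is the combination with the lowest pass rate, which corresponds to the kind of weakness the aggregation engine is meant to surface.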

[0066] Attribution analysis algorithm: As shown in Figure 4, this invention proposes a multi-factor attribution analysis method based on decision trees. The evaluation result of the sample (pass / fail) is used as the target variable, and task type, difficulty level, knowledge domain, data characteristics, and prompt word version are used as feature variables. A decision tree model is trained to automatically identify the combination of key factors leading to failure and their importance ranking. The attribution analysis results are output in the form of an interpretable report, including: Top-K key failure factors and their information gain ranking; interaction effect analysis between factors; and targeted optimization suggestions for each key factor (e.g., "It is recommended to add a logical reasoning guidance step to the constraint rule layer of the prompt words").
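The information gain ranking that underlies such a report can be sketched as follows. This is only the root-level split criterion of an ID3-style decision tree, computed per feature; it is not a full tree and does not capture interaction effects, which would require recursive splitting or a library implementation. Function names and data are illustrative.

```python
import math
from collections import Counter, defaultdict


def entropy(labels: list[bool]) -> float:
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows: list[dict], feature: str, target: str = "passed") -> float:
    """Reduction in entropy of the pass/fail label after splitting on `feature`."""
    groups: dict[str, list[bool]] = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def rank_factors(rows: list[dict], features: list[str], top_k: int = 3) -> list[tuple[str, float]]:
    """Top-K features ordered by information gain, highest first."""
    gains = {f: information_gain(rows, f) for f in features}
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)[:top_k]


rows = [
    {"task_type": "reasoning", "difficulty": "L4", "passed": False},
    {"task_type": "reasoning", "difficulty": "L2", "passed": False},
    {"task_type": "fact_qa", "difficulty": "L4", "passed": True},
    {"task_type": "fact_qa", "difficulty": "L2", "passed": True},
]
ranking = rank_factors(rows, ["task_type", "difficulty"])
```

In this toy data the task type fully separates passes from failures, so it ranks first, while difficulty carries no information.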

[0067] Anomaly Pattern Recognition: Based on time-series analysis of the cross matrix, the system automatically detects the following anomaly patterns. Sudden degradation detection: the system calculates a sliding-window mean (the window defaults to the previous 5 versions) over the score sequence of each dimension (e.g., a specific task type or knowledge domain) in the cross matrix; sudden degradation is identified when the score of the latest version deviates from the sliding mean by more than twice the standard deviation (e.g., a sudden drop in the score for a specific task type caused by a new prompt word version). Systematic bias detection: one-way ANOVA is performed on sample groups sharing the same data feature label in the cross matrix; systematic bias is identified when the F-test p-value of the score difference between groups is less than 0.05 (e.g., all samples with the "table" feature score consistently low). Periodic fluctuation detection: the coefficient of variation CV = σ / μ is calculated over the results of multiple batches run with the same prompt word version at different points in time; periodic fluctuation is identified when CV exceeds 0.1 (e.g., unstable inference results during a specific period, which may indicate non-deterministic behavior of the model API or a temperature parameter setting issue).
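Two of the three detectors can be sketched with the standard library alone. The sketch treats only downward deviations as degradation and uses the sample standard deviation of the window, both of which are assumptions; the ANOVA-based bias check is omitted because it needs an F-distribution, which typically comes from a statistics library.

```python
from statistics import mean, pstdev, stdev


def sudden_degradation(scores: list[float], window: int = 5, k: float = 2.0) -> bool:
    """`scores` is ordered oldest to newest; the last entry is the latest version.
    Flags a drop of more than k standard deviations below the sliding mean."""
    history = scores[-(window + 1):-1]  # up to `window` versions before the latest
    if len(history) < 2:
        return False  # not enough history for a standard deviation
    return mean(history) - scores[-1] > k * stdev(history)


def fluctuating(batch_scores: list[float], cv_threshold: float = 0.1) -> bool:
    """Coefficient of variation (sigma / mu) across repeated batches of one version."""
    mu = mean(batch_scores)
    if mu == 0:
        return False  # CV is undefined for a zero mean
    return pstdev(batch_scores) / mu > cv_threshold


reasoning_scores = [0.70, 0.72, 0.71, 0.73, 0.72, 0.55]  # latest version drops
stable_scores = [0.70, 0.72, 0.71, 0.73, 0.72, 0.71]
repeat_runs = [0.70, 0.50, 0.90, 0.60]  # same prompt version, different times
```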

[0068] Specifically, the S5 step—directed optimization and regression validation—is as follows:

[0069] Based on the diagnostic results of step S4, this invention implements a targeted optimization strategy:

[0070] Optimization suggestion generation mechanism: The system matches the key failure factors identified by the attribution analysis against a pre-defined optimization rule template library. Each rule template predefines the mapping between a failure factor type and the corresponding prompt word level modification scheme, thereby generating structured optimization suggestions, including: the prompt word level and specific module that needs to be modified, the suggested modification direction and a reference template fragment, the expected range of affected samples, and the evaluation dimensions. For example, if the attribution analysis finds that "the main reason for logical reasoning failure is the lack of step guidance in the constraint rule layer," the system automatically suggests: "Add a chain-of-thought reasoning instruction to the constraint rule layer, requiring the model to list the reasoning steps before making a final judgment."

[0071] Gray-scale validation based on A/B testing: The optimized prompt word version is first validated on the subset of failed samples to confirm that the target issue has been fixed; regression testing is then performed on the full test set to ensure that no new degradation is introduced. The system automatically calculates the net gain index (NIS): $\mathrm{NIS} = (N_{\text{fixed}} - N_{\text{regressed}}) / N_{\text{total}} \times 100\%$, where $N_{\text{fixed}}$ is the number of repaired samples, $N_{\text{regressed}}$ is the number of newly introduced degraded samples, and $N_{\text{total}}$ is the total number of test samples. Statistical significance is determined using the McNemar test: a 2×2 contingency table of the paired classification results of the old and new versions on the full test set is constructed, and the McNemar statistic $\chi^2 = (N_{\text{fixed}} - N_{\text{regressed}})^2 / (N_{\text{fixed}} + N_{\text{regressed}})$ is computed. The optimization effect is considered statistically significant when the p-value corresponding to $\chi^2$ is less than 0.05. The new version is allowed for official release only when NIS is greater than 0 and the McNemar test p-value is less than 0.05.

[0072] Taking a practical application as an example: During the optimization of a knowledge-based question-answering agent, STCM analysis revealed that the pass rate for the combination "logical reasoning + L4 difficulty + domain-specific knowledge domain" was only 31%, significantly lower than the average of 68% for other combinations. Decision tree attribution analysis identified the key failure factor as "the lack of clear multi-step reasoning guidance in the prompt word constraint rule layer" (information gain 0.43). To address this factor, the constraint rule layer was modified by adding the guiding instruction "Please first decompose the problem into several sub-problems, then deduce step by step, and finally synthesize the conclusion." Targeted validation showed that the pass rate for the target combination increased from 31% to 74%. The full regression test showed NIS = (47 - 3) / 200 × 100% = 22%, which was statistically significant, so the new version was released.
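The release gate can be checked numerically with a short sketch that applies the NIS formula and the McNemar statistic (without continuity correction, as in the formula given in the text) to the figures from this example. The p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(χ²/2)); the function names are illustrative.

```python
import math


def net_gain_index(n_fixed: int, n_regressed: int, n_total: int) -> float:
    """NIS in percent: (N_fixed - N_regressed) / N_total * 100."""
    return (n_fixed - n_regressed) / n_total * 100.0


def mcnemar_p_value(n_fixed: int, n_regressed: int) -> float:
    """p-value of the McNemar chi-square statistic (1 degree of freedom)."""
    discordant = n_fixed + n_regressed
    if discordant == 0:
        return 1.0  # no sample changed its result, so there is no evidence of change
    chi2 = (n_fixed - n_regressed) ** 2 / discordant
    return math.erfc(math.sqrt(chi2 / 2.0))


def release_allowed(n_fixed: int, n_regressed: int, n_total: int, alpha: float = 0.05) -> bool:
    return (net_gain_index(n_fixed, n_regressed, n_total) > 0
            and mcnemar_p_value(n_fixed, n_regressed) < alpha)


nis = net_gain_index(47, 3, 200)   # figures from the example: 47 fixed, 3 regressed
p_value = mcnemar_p_value(47, 3)   # chi-square = 44**2 / 50 = 38.72
```

With 47 fixed and 3 regressed samples the statistic is 38.72, far beyond the 0.05 significance level, so both release conditions hold. A change that fixes 5 samples and breaks 3 has a positive NIS but would not pass the significance check.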

[0073] Specifically, step S6 (closed-loop iteration and accumulation of the optimization knowledge base) is as follows:

[0074] Steps S1 to S5 are formed into an iterative closed loop and executed continuously until the agent's accuracy reaches the target threshold. Through this closed-loop optimization mechanism, the agent's overall accuracy on the standard test set can be improved by 15% to 30% after 3 to 5 iterations.

[0075] Optimization Knowledge Base Accumulation: Each successful optimization operation is recorded as an "Optimization Case," containing four fields: problem phenomenon (failure mode description and STCM visualization screenshot), root cause analysis (attribution analysis report), optimization measures (prompt word difference records), and effect data (NIS value, comparison of effects across dimensions). The system automatically builds an optimization knowledge base, supporting vector similarity-based retrieval. Specifically, the failure factors identified in the current attribution analysis are encoded into feature vectors, and cosine similarity is calculated between these vectors and the failure factor feature vectors of historical cases in the knowledge base. When the similarity exceeds a preset threshold (e.g., 0.8), the system automatically recommends the corresponding historical solution, achieving the organized accumulation and reuse of optimization experience.
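The retrieval step can be illustrated with a short Python sketch. It assumes the failure factors have already been encoded as numeric feature vectors; the encoding itself, the `OptimizationCase` class, and the function names are hypothetical and only mirror the four case fields listed above.

```python
import math
from dataclasses import dataclass


@dataclass
class OptimizationCase:
    problem_phenomenon: str
    root_cause_analysis: str
    optimization_measures: str
    effect_data: str
    factor_vector: list[float]  # hypothetical encoding of the failure factors


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend_cases(query_vector, cases, threshold=0.8):
    """Return (similarity, case) pairs above the threshold, most similar first."""
    scored = [(cosine_similarity(query_vector, c.factor_vector), c) for c in cases]
    hits = [pair for pair in scored if pair[0] > threshold]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)
```

The default threshold of 0.8 follows the example value given in the paragraph; in practice it would be tuned to the chosen vector encoding.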

[0076] The method of this invention is applicable to various intelligent agent systems based on large language models, including but not limited to knowledge-based question-answering agents, task-execution agents, dialogue-interaction agents, and multi-agent collaborative systems. The large language model in this invention can be any mainstream LLM, including but not limited to the GPT series, the Claude series, Wenxin Yiyan, Tongyi Qianwen, and Zhipu GLM. The method does not depend on a specific model architecture.

[0077] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0078] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. As those skilled in the art will understand, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. An agent optimization method based on prompt word engineering and test set management, characterized in that it includes the following steps:
S1. Prompt word template construction step: decompose the prompt words of the intelligent agent into five levels, namely the system role layer, task instruction layer, constraint rule layer, output format layer, and dynamic context layer, each level being configured and managed independently; associate each prompt word version with an effect fingerprint to generate a prompt word template, the effect fingerprint recording the score vector of that version for each indicator on the standard test set;
S2. Test dataset construction step: construct a structured test dataset, label each test sample with task type, difficulty level, knowledge domain classification, and data feature labels, and calculate the coverage index C over the three-dimensional space of task type, difficulty level, and knowledge domain, where $C = \frac{N_{\text{covered}}}{N_{\text{total}}} \times W_{\text{balance}}$, $N_{\text{covered}}$ is the number of cells of the three-dimensional space covered by the test set, $N_{\text{total}}$ is the total number of cells defined by business requirements, and $W_{\text{balance}}$ is a weighting coefficient based on the balance of sample counts across the cells;
S3. Automated batch testing step: submit the test dataset to the agent for inference in batches, obtain the evaluation score of each test sample through the multi-dimensional automatic evaluation module, and generate a result snapshot associated with the version number of the prompt word template constructed in step S1; each batch test produces an immutable snapshot, which supports incremental comparison between any two batch test results and automatic identification of samples whose effect has improved, degraded, or remained unchanged;
S4. Sample-task cross matrix diagnostic step: construct a sample-task cross matrix with the test samples of the test dataset as rows and task dimension attributes as columns, the task dimension attributes including task type, difficulty level, knowledge domain, data features, and prompt word template version; automatically identify the key factor combinations that cause the agent to fail based on multidimensional aggregation analysis and a decision tree attribution analysis algorithm, and output the attribution analysis results;
S5. Targeted optimization and validation step: based on the attribution analysis results of step S4, generate targeted optimization suggestions by matching the key factors leading to agent failure against the preset optimization rule templates of each prompt word level; modify the prompt word template; first perform targeted validation on the subset of failed samples, then perform regression testing on the full test set; evaluate the optimization effect with the net gain index $\mathrm{NIS} = \frac{N_{\text{fixed}} - N_{\text{regressed}}}{Z_{\text{total}}} \times 100\%$, where $N_{\text{fixed}}$ is the number of repaired samples, $N_{\text{regressed}}$ is the number of newly introduced degraded samples, and $Z_{\text{total}}$ is the total number of test samples; and release the new version when NIS is greater than 0 and the improvement is statistically significant;
S6. Closed-loop iteration step: form an iterative closed loop from steps S1 to S5 and execute it continuously until the agent's accuracy reaches the target threshold; at the same time, record each successful optimization operation as a structured optimization case and build an optimization knowledge base; in subsequent optimizations, perform vector similarity matching between the current combination of failure factors and the failure factor combinations of historical cases in the optimization knowledge base, and automatically recommend the corresponding historical solution when the similarity exceeds a preset threshold.
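For illustration only, the coverage index C of step S2 might be computed as in the following Python sketch. The claim does not define the balance coefficient in this passage, so the sketch assumes, purely as an example, that it is the normalized Shannon entropy of the sample counts over the covered cells (1.0 when the covered cells hold equal numbers of samples). The function name and input shape are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import product


def coverage_index(samples, task_types, difficulties, domains) -> float:
    """samples: iterable of (task_type, difficulty, domain) label triples."""
    required_cells = set(product(task_types, difficulties, domains))
    counts = Counter(s for s in samples if s in required_cells)
    n_total = len(required_cells)
    n_covered = len(counts)
    if n_total == 0 or n_covered == 0:
        return 0.0
    # ASSUMPTION: balance weight = normalized Shannon entropy of per-cell counts.
    if n_covered == 1:
        w_balance = 1.0
    else:
        n_samples = sum(counts.values())
        entropy = -sum((c / n_samples) * math.log(c / n_samples) for c in counts.values())
        w_balance = entropy / math.log(n_covered)
    return n_covered / n_total * w_balance
```

Under this assumed weighting, a test set that covers half of the required cells with equal sample counts scores 0.5, and the score drops further as the counts become more uneven.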

2. The method according to claim 1, characterized in that, in step S1, version management includes: automatically generating a version snapshot each time a prompt word is modified, recording the modification timestamp, the modifier, the difference in modified content, and the modification reason tag, and supporting difference comparison between any two versions and one-click rollback based on the version number; when the system detects that the effect fingerprint of the new version has changed beyond a preset threshold, the preset threshold being that the change in the score of any indicator in the effect fingerprint exceeds twice the standard deviation of the historical mean of that indicator, an alarm or rollback suggestion is automatically triggered to prevent effect degradation during the optimization process.
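For illustration only, the fingerprint check might look like the following Python sketch. It assumes one reading of the threshold: the new score of an indicator is compared with the mean of that indicator over earlier versions, and an alert is raised when the difference exceeds twice the sample standard deviation of those earlier scores. The function name and the dictionary layout are hypothetical.

```python
from statistics import mean, stdev


def fingerprint_alerts(history, new_fingerprint, k: float = 2.0) -> dict:
    """history: list of {indicator: score} dicts for earlier versions.

    Returns {indicator: change} for every indicator whose new score deviates
    from its historical mean by more than k standard deviations.
    """
    alerts = {}
    for indicator, score in new_fingerprint.items():
        past = [fp[indicator] for fp in history if indicator in fp]
        if len(past) < 2:
            continue  # not enough history to estimate a standard deviation
        mu, sigma = mean(past), stdev(past)
        if abs(score - mu) > k * sigma:
            alerts[indicator] = score - mu
    return alerts
```

A negative change in the returned dictionary would correspond to a candidate for the alarm or rollback suggestion; a positive one indicates an unusually large improvement.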

3. The method according to claim 1, characterized in that the S1 step also includes a parameterized variable injection mechanism: parameterized placeholders are defined in the prompt word template, and the corresponding parameter values are dynamically injected at runtime according to the specific task type, so that the same basic template can be adapted to multiple business scenarios, improving the reusability of prompt words.
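For illustration only, placeholder injection can be sketched with Python's standard `string.Template`. The template text, the scenario names, and the parameter values below are invented for the example and do not come from the specification.

```python
from string import Template

# Hypothetical base template with parameterized placeholders.
BASE_TEMPLATE = Template(
    "You are a $role.\n"
    "Task: $task_instruction\n"
    "Output format: $output_format"
)

# Hypothetical per-task-type parameter sets injected at runtime.
SCENARIO_PARAMS = {
    "contract_review": {
        "role": "legal analyst",
        "task_instruction": "List the risky clauses in the contract.",
        "output_format": "numbered list",
    },
    "faq": {
        "role": "customer support assistant",
        "task_instruction": "Answer the question using the product manual.",
        "output_format": "one short paragraph",
    },
}


def render_prompt(task_type: str) -> str:
    """Fill the shared template with the parameters of the given task type."""
    # substitute() raises KeyError if any placeholder has no value.
    return BASE_TEMPLATE.substitute(SCENARIO_PARAMS[task_type])
```

Using `substitute` rather than `safe_substitute` makes a missing parameter fail loudly instead of leaving a raw placeholder in the prompt.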

4. The method according to claim 1, characterized in that the S2 step also includes sample coverage analysis and annotation quality control; the sample coverage analysis process is as follows: traverse each cell of the three-dimensional space, count the number of samples in each cell, calculate the sample distribution density of the test set in each region of the three-dimensional space based on the calculation formula of the coverage index C, identify coverage blind spots, and automatically generate sample supplementation suggestions; the annotation quality control process is as follows: an annotation consistency verification mechanism is introduced, multiple annotators are assigned to cross-annotate the same samples, the consistency coefficient between annotators is calculated, and a review process is automatically triggered when the consistency is lower than a threshold.
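For illustration only, the annotation consistency check could be sketched as follows. The claim does not name a specific consistency coefficient; this example assumes Cohen's kappa for two annotators, and the review threshold of 0.6 is an arbitrary placeholder.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b) -> float:
    """Cohen's kappa for two annotators labelling the same samples in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def needs_review(labels_a, labels_b, threshold: float = 0.6) -> bool:
    """Trigger a review when agreement falls below the (assumed) threshold."""
    return cohens_kappa(labels_a, labels_b) < threshold
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.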

5. The method according to claim 1, characterized in that, in step S3, the multi-dimensional automatic evaluation module defines a unified evaluation interface protocol, encapsulates each evaluation method as an independent evaluator, and integrates the following evaluation methods through a weighted fusion mechanism: exact-match evaluation, semantic similarity evaluation based on the cosine similarity of embedding vectors, structured output format verification, and evaluation by a large language model judge, thereby supporting weighted comprehensive scoring across multiple evaluation dimensions.
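For illustration only, a unified evaluator interface with weighted fusion might be sketched as below. The interface shape (a callable taking the agent output and the reference and returning a score in [0, 1]) and the weights are assumptions. Only two simple evaluators are implemented here; an embedding-similarity evaluator or an LLM-judge evaluator would plug in through the same signature.

```python
import json
from typing import Callable

# Assumed interface: (agent_output, reference) -> score in [0, 1].
Evaluator = Callable[[str, str], float]


def exact_match(output: str, reference: str) -> float:
    """1.0 if the output equals the reference after trimming whitespace."""
    return 1.0 if output.strip() == reference.strip() else 0.0


def json_format_valid(output: str, reference: str) -> float:
    """1.0 if the output parses as JSON (the reference is not used)."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def fused_score(output: str, reference: str,
                evaluators: dict[str, tuple[Evaluator, float]]) -> float:
    """Weighted average of all evaluator scores."""
    total_weight = sum(weight for _, weight in evaluators.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weight * fn(output, reference) for fn, weight in evaluators.values())
    return weighted / total_weight
```

Normalizing by the total weight keeps the fused score in [0, 1] even when the configured weights do not sum to one.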

6. The method according to claim 1, characterized in that, in step S4, the decision tree attribution analysis algorithm uses the evaluation results of the test samples as the target variable and the task type, difficulty level, knowledge domain, data features, and prompt word version as feature variables; by training a decision tree model, the algorithm constructs a decision tree that splits on the feature variables layer by layer according to the information gain criterion, thereby identifying the combination of key factors that lead to failure and their importance ranking, and outputs an interpretable report that includes the Top-K key failure factors and their information gain ranking, the interaction effect analysis between the factors, and targeted optimization suggestions for each key factor.
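For illustration only, the information gain criterion can be shown with a small Python sketch. It ranks features by the gain of a single split at the root; it does not build the full layer-by-layer tree or the interaction analysis described in the claim. The row layout (one dictionary per test sample with a boolean `passed` field) is a hypothetical choice.

```python
import math
from collections import Counter, defaultdict


def entropy(labels) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, feature: str, target: str = "passed") -> float:
    """Reduction in target entropy obtained by splitting the rows on one feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


def rank_factors(rows, features, target: str = "passed"):
    """Return (feature, gain) pairs sorted from most to least informative."""
    gains = [(f, information_gain(rows, f, target)) for f in features]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)
```

A full tree would apply the same computation recursively within each branch, which is how combinations of factors, rather than single factors, are surfaced.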

7. The method according to claim 1, characterized in that the S4 step also includes time series analysis based on the cross matrix, which is used to automatically detect the following abnormal patterns: sudden degradation detection, in which the sliding-window mean of the score sequence of each dimension value in the cross matrix is calculated, and sudden degradation is determined when the score of the latest version deviates from the sliding mean of the previous N versions by more than a preset multiple of the standard deviation; systematic bias detection, in which one-way ANOVA is performed on sample groups with the same data feature labels in the cross matrix, and systematic bias is determined to exist if the F-test p-value of the score difference between groups is less than the significance threshold; and periodic fluctuation detection, in which the coefficient of variation is calculated over multiple batch results of the same prompt word version, and periodic fluctuation is identified when the coefficient of variation exceeds a preset threshold.
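For illustration only, two of the three detectors can be sketched with the Python standard library. The window size, the standard deviation multiple, and the coefficient-of-variation threshold are placeholder values, and the sketch flags only downward deviations as degradation. The ANOVA-based bias check is omitted because it needs an F-distribution p-value, for which a statistics library (for example `scipy.stats.f_oneway`) would normally be used.

```python
from statistics import mean, pstdev


def sudden_degradation(scores, window: int = 5, k: float = 2.0) -> bool:
    """scores: chronological scores of one dimension value; the last entry is the latest version."""
    if len(scores) < 3:
        return False  # need at least two earlier versions plus the latest one
    previous = scores[-(window + 1):-1]  # up to `window` versions before the latest
    mu, sigma = mean(previous), pstdev(previous)
    return scores[-1] < mu - k * sigma


def coefficient_of_variation(batch_scores) -> float:
    """Population standard deviation divided by the mean of repeated batch results."""
    mu = mean(batch_scores)
    if mu == 0:
        raise ValueError("mean score is zero; coefficient of variation is undefined")
    return pstdev(batch_scores) / mu


def fluctuating(batch_scores, cv_threshold: float = 0.1) -> bool:
    """Flag a prompt word version whose repeated batch results vary too much."""
    return coefficient_of_variation(batch_scores) > cv_threshold
```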

8. The method according to claim 1, characterized in that, in step S6, the structured optimization case includes four fields: problem phenomenon, root cause analysis, optimization measures, and effect data. The optimization knowledge base supports historical solution recommendations based on similarity retrieval, enabling the organized accumulation and reuse of optimization experience.