Practical training behavior analysis method based on multi-agent cooperation

By using a multi-agent collaborative training behavior analysis method, time-series alignment and aggregation of multi-source training data are performed to assess the margin of recoverable operations and generate a recoverability decay index. This solves the problem of difficulty in distinguishing between efficiency and safety in existing technologies and achieves stable assessment and accurate evaluation in high-risk simulation scenarios.

CN121860504B · Active · Publication Date: 2026-06-26 · NANJING XUHANG INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: NANJING XUHANG INFORMATION TECH CO LTD
Filing Date: 2026-03-17
Publication Date: 2026-06-26

IPC: G06Q10/0639; G06Q50/20; G06F18/213; G06F18/25; G06F18/15; G06N3/006; G06N3/084

CPC: G06Q10/06393; G06Q50/205; G06F18/213; G06F18/253; G06F18/15; G06N3/006; G06N3/084

Abstract

The application relates to the technical field of intelligent agent collaboration and discloses a practical training behavior analysis method based on multi-agent collaboration, comprising the following steps: first, performing time-series alignment and aggregation on multi-source practical training data to generate a time-series operation sequence containing scene states and operation instructions; using a behavior recognition agent to establish a scene evolution model and performing multi-step deduction of the virtual environment state based on the sequence; for a preset reference instruction set, evaluating the reachability of the instruction set relative to risk constraints and task targets in the deduced state, and calculating a decay index reflecting the degree to which the recoverable space has contracted; then, using a target matching agent to calculate a task completion degree that is coupled with the decay index, and combining it with resource consumption data from the training process to generate a comprehensive score; finally, mapping the comprehensive score to a qualified or unqualified evaluation result based on an offline-calibrated qualification benchmark value. The application can accurately identify high-risk operation trends and standardize practical training evaluation.

Description

Technical Field

[0001] This invention relates to the field of intelligent agent collaboration technology, and more specifically, to a training behavior analysis method based on multi-agent collaboration.

Background Technology

[0002] With the development of virtual reality, augmented reality, and digital twin technologies, skills training in simulation environments has been widely applied in high-risk fields such as power emergency response, surgery, industrial safety, and aerospace. The value of such training systems lies in allowing trainees to experience dangerous situations without real consequences and to master the handling of complex working conditions. Current training evaluation systems typically determine a trainee's skill level by collecting operation sequences, tool usage records, and task completion times, and then applying statistical methods or rule matching. However, existing training behavior analysis and evaluation methods have significant limitations in complex scenarios that involve coupling between working conditions and operations and irreversible state transitions. First, existing evaluation systems often treat safety, efficiency, and quality indicators as parallel evaluation dimensions and calculate the total score as a weighted average. In high-fidelity simulations involving irreversible consequences (such as surgical procedures or hazardous chemical handling), this linear aggregation masks the logical incompatibilities between dimensions. For example, a trainee might drastically shorten operation time (raising efficiency) through extremely dangerous violations (sacrificing safety) and still obtain a passing weighted total score. This offsetting of safety by efficiency violates the fundamental requirement that the safety baseline must not be crossed in high-risk operations, and it fails to reflect the trainee's risk control capability when approaching a critical accident state. Second, in a dynamically evolving simulation environment, the operating condition variables are not a static background but continuous quantities that change in real time with the trainee's control actions. Existing behavioral analysis methods are mostly based on discrete step matching, which makes it difficult to quantify how much margin remains between the current operating conditions and an accident state or irreversible state. Although some existing technologies introduce reachability analysis to assess the safety boundary, in high-dimensional, continuous simulation systems with numerous uncertainties and disturbances, accurately calculating the system's safe survival domain suffers from extremely high computational complexity and error accumulation, making it difficult to use as a stable quantitative indicator for online evaluation. Therefore, how to construct an evaluation mechanism for high-dimensional, continuous training scenarios that avoids the mutual cancellation of efficiency and safety and accurately characterizes the reversibility of the operation process is an urgent technical problem to be solved.

Summary of the Invention

[0003] This invention provides a training behavior analysis method based on multi-agent collaboration, which solves the technical problems mentioned in the background art.

[0004] This invention provides a training behavior analysis method based on multi-agent collaboration, including:

[0005] The collected multi-source training data is time-series aligned and aggregated to generate a time-series operation sequence containing scene state data and operation instruction data.

[0006] A scene evolution model is established by calling a behavior recognition intelligent agent, and the state changes of the virtual training environment are deduced in multiple steps based on the time sequence of operations to obtain the state of the virtual training environment.

[0007] For a preset reference instruction set, the reachability of the reference instruction set relative to risk constraints and task objective conditions is evaluated in the state of the simulated virtual training environment. The recoverable operation margin at the current moment is calculated, and a recoverability decay index is generated based on the decreasing trend of the recoverable operation margin in the time-series operation sequence.

[0008] The target matching agent is invoked to calculate a task achievement degree that incorporates the recoverability decay index, and a comprehensive training score is generated by combining it with the resource consumption data of the training process.

[0009] Based on a preset pass / fail benchmark, the comprehensive training score is mapped to pass or fail.

[0010] The beneficial effects of this invention include: by constructing a recoverability decay index that reflects the trend of operational space contraction, the cumulative risk of irreversible operations in high-risk scenarios is effectively quantified, fundamentally overcoming the defect of high efficiency masking low safety in traditional weighted evaluation. The use of a computational mechanism based on deterministic sampling and rolling deduction solves the problem of difficulty in converging or reproducing safety boundary calculations in complex dynamic environments, achieving stable evaluation of high-dimensional continuous operating conditions. Furthermore, by fusing the unidimensional efficiency index of the decay index with the judgment logic based on boundary regression, the randomness and ambiguity of the evaluation results are eliminated, ensuring absolute consistency and traceability of evaluation results in the same training process, significantly improving the rigor and reliability of training assessments.

Description of the Drawings

[0011] Figure 1 is a flowchart of the training behavior analysis method based on multi-agent collaboration of the present invention.

Detailed Implementation

[0012] The subject matter described herein will now be discussed with reference to exemplary embodiments. It should be understood that these embodiments are discussed only to enable those skilled in the art to better understand and implement the subject matter described herein, and changes may be made to the function and arrangement of the elements discussed without departing from the scope of this specification. Various processes or components may be omitted, substituted, or added as needed in the examples. Furthermore, features described in some examples may be combined in other examples.

[0013] As shown in Figure 1, the training behavior analysis method based on multi-agent collaboration includes:

[0014] The collected multi-source training data is time-series aligned and aggregated to generate a time-series operation sequence containing scene state data and operation instruction data.

[0015] A scene evolution model is established by calling a behavior recognition intelligent agent, and the state changes of the virtual training environment are deduced in multiple steps based on the time sequence of operations to obtain the state of the virtual training environment.

[0016] For a preset reference instruction set, the reachability of the reference instruction set relative to risk constraints and task objective conditions is evaluated in the state of the simulated virtual training environment. The recoverable operation margin at the current moment is calculated, and a recoverability decay index is generated based on the decreasing trend of the recoverable operation margin in the time-series operation sequence.

[0017] The target matching agent is invoked to calculate a task achievement degree that incorporates the recoverability decay index, and a comprehensive training score is generated by combining it with the resource consumption data of the training process.

[0018] Based on a preset pass / fail benchmark, the comprehensive training score is mapped to pass or fail.

[0019] Preferably, the collected multi-source training data undergoes time-series alignment and aggregation processing to generate a time-series operation sequence containing scene state data and operation instruction data, including:

[0020] Set a fixed sampling time interval , define the first The discrete time windows are ,in The moment the practical training begins. For time indexing;

[0021] The scene state data within each discrete time window is calculated using the following formula. The operation instruction data Resource consumption data and operating cost data And aggregate to generate the time-series operation sequence. :

[0022] ;

[0023] in, Indicates the first A discrete time window; These represent the original environment interaction events, original operation events, original resource call events, and original system performance events within the window, respectively. This indicates retrieving the value of the last event in the sequence; This indicates that the event count is retrieved; Indicates the duration of the event; This represents the vector concatenation operation; This indicates taking the average of the values; This represents the total length of the sequence.

[0024] The sampling time interval is a fixed time length used to divide the discrete time window. A value of 200 milliseconds is preferred to balance data timeliness and computational efficiency; values from 100 to 500 milliseconds are also applicable.

[0025] The start time of the training is the specific point in time when the training system starts up and begins recording data. It can be obtained through the timestamp module of the training system.

[0026] Scene state data is a dataset that reflects the state of the virtual training environment and related objects, aggregated within a discrete time window. It can be obtained through the environmental sensors and state monitoring modules of the virtual training system.

[0027] Operation command data is a dataset of control commands issued by the trainee to the virtual training environment, aggregated within a discrete-time window. It can be acquired through the command acquisition module of the training operation equipment, such as the command recording unit of a joystick, keyboard, or touchscreen.

[0028] Resource consumption data is a dataset aggregated within a discrete time window, showing the consumption of training-related resources by trainees. It can be obtained through the resource usage monitoring module of the training system.

[0029] The runtime cost data is a dataset of performance metrics aggregated within a discrete-time window during the operation of the training system. It can be obtained through the performance monitoring tools of the training system.

[0030] The raw environmental interaction events within the current time window are the set of unprocessed interaction events that occur in the virtual training environment within a specific discrete time window. These can be captured in real-time using the training system's environmental event recording module.

[0031] The raw operation events within the current time window are the unprocessed set of events generated by the trainee performing operations within a specific discrete time window. These can be obtained through the event log module of the operation device.

[0032] The raw resource call events within the current time window are the unprocessed set of events generated when trainees call training resources within a specific discrete time window. These can be obtained through the call logs recorded by the resource management module.

[0033] The raw system performance events within the current time window are the set of unprocessed performance-related events generated during the runtime of the training system within a specific discrete time window. These can be obtained through the logs of system performance monitoring tools.

[0034] The t-th discrete time window is divided based on the sampling time interval and contains a segment of the training process within a specific time period. Its time range is from the start time of the training plus t minus 1 multiplied by the sampling time interval to the start time of the training plus t multiplied by the sampling time interval.

[0035] Scene state data aggregation includes selecting the state value of the last recorded environmental interaction event within the current time window, because the last state reflects the actual environmental situation at the end of the window. When data is missing, the state value of the previous window is used to avoid interruptions in subsequent calculations due to data gaps. For example, if a device pressure value is not recorded in a certain window, the pressure value from the previous window is used.

[0036] Operation instruction data aggregation includes retrieving the value of the last operation event instruction within the current window, as the last operation reflects the trainee's final operational intent within the window. Forward filling is used when there are no records to ensure the continuity of the operation instruction sequence. For example, if the trainee has not performed any operation in a certain window, the operation instructions from the previous window are used.

[0037] Resource consumption data aggregation includes calculating cumulative frequency and cumulative duration because these two indicators comprehensively reflect the intensity of resource usage and, compared to other statistics such as the mean, better reflect the actual situation of resource consumption. For example, if a trainee accesses a help document 3 times within a window, with a cumulative duration of 60 seconds, the resource consumption data records these two values.

[0038] The aggregation of runtime cost data includes using the arithmetic mean, because the arithmetic mean can balance performance indicators at different time points within a window, accurately reflecting the overall operating status of the system within the window. For example, if the system frame rates within a window are 30, 32, and 28 respectively, the arithmetic mean of 30 is the runtime cost-related indicator value for that window.

[0039] Multi-source data time-series alignment includes dividing discrete windows based on fixed sampling time intervals. This is to unify data from different sources and with different timestamps to the same time dimension, ensuring the accuracy of subsequent multi-source data fusion calculations. For example, with a sampling time interval of 200 milliseconds, all data are aggregated according to this window to achieve time-series alignment.

[0040] The optimal sampling time interval is between 100 and 500 milliseconds, with a default value of 200 milliseconds. This range balances data granularity and system computational load; too short an interval will increase computational pressure, while too long an interval will result in the loss of critical details.

[0041] The specific fields of the scenario state data include simulation object state, risk state variables, task condition variables, external disturbance and fault injection parameters, constraint boundary variables, and task stage variables. These fields can comprehensively characterize the state of the training environment and meet the needs of subsequent analysis.

[0042] Operation command data is divided into continuous control quantities and discrete action groups. Continuous control quantities include knob rotation angle and joystick displacement, while discrete action groups include opening, closing, resetting, and confirmation. Discrete action groups need to be encoded into numerical form and included in the data set.

[0043] The resource consumption data covers resource types including help documents, operation tools, and reference materials. Each resource is recorded with the number of times it is called and the duration of use, ensuring the comprehensiveness of resource consumption statistics.

[0044] The system performance metrics corresponding to the operational cost data include network latency in milliseconds, packet loss rate percentage, frame rate in frames per second, client CPU utilization percentage, and critical interface response time in milliseconds. These metrics comprehensively reflect the system's operational status.

[0045] The specific implementation of vector concatenation is to concatenate the scene state data, operation instruction data, resource consumption data, and running cost data in that order. During concatenation, it is ensured that the dimensions of each data vector match; if the dimensions do not match, zero-padding is performed to unify the dimensions.

[0046] The rule for handling outliers in raw event data is to prune values that exceed the physically reasonable range to the maximum or minimum value within that range. For example, if the reasonable range for network latency is 0 to 1000 milliseconds, values exceeding this range are uniformly set to 1000 milliseconds.

[0047] The logic for correcting out-of-order original events is to sort them by their own timestamps from smallest to largest, and then include them in the corresponding time window to ensure the accuracy of the events in the time dimension.

[0048] Preferably, the behavior recognition intelligent agent includes:

[0049] State transition fitting network and uncertainty estimation network ;

[0050] The state transition fitting network The formula for calculating a single-step state transition used to fit the scene evolution model is as follows:

[0051]

[0052] in, This refers to the current state data of the scene. This refers to the operation instruction data at the current moment. For the predicted scene state data at the next moment, For network parameters;

[0053] The uncertainty estimation network The formula used to estimate the local prediction error magnitude of the scene evolution model is as follows:

[0054] ;

[0055] in, The predicted value is the non-negative magnitude of local uncertainty. This is the original output from the network. For network parameters, For smoothing nonlinear activation functions;

[0056] The behavior recognition agent determines the optimal parameters by optimizing the following objective function. and :

[0057] ;

[0058] in, The Euclidean norm is represented by the summation symbol, which iterates over the historical training sample set.

[0059] The parameters of the state transition fitting network are the set of learnable parameters used in the state transition fitting network to fit the state transition of a scenario in one step.

[0060] Uncertainty estimation network parameters are the set of learnable parameters used in uncertainty estimation networks to estimate the magnitude of prediction error.

[0061] The next moment's scene state prediction is the prediction result of the scene state at the next moment, output by the state transition fitting network after receiving the current scene state data and operation instruction data.

[0062] The predicted value of the local uncertainty magnitude is a non-negative estimate of the state prediction error magnitude output by the uncertainty estimation network after processing by a smoothing nonlinear activation function.

[0063] The raw output of the uncertainty estimation network is the original computation result of the uncertainty estimation network before it has been processed by the activation function.

[0064] The behavior recognition agent employs a dual-network collaborative design: a state transition fitting network learns the evolution of scene states under operational commands and outputs specific state predictions; an uncertainty estimation network simultaneously learns the magnitude of prediction errors and outputs uncertainty estimates. Working together, they provide both the core prediction of state evolution and a reliable reference for the prediction results, avoiding the problem of a single prediction network lacking error quantification. For example, given the current equipment pressure value and adjustment commands, the state transition fitting network predicts the pressure value at the next moment, while the uncertainty estimation network outputs the range of error magnitude for that prediction.

[0065] The activation function used in uncertainty estimation networks is a smooth nonlinear activation function (soft-addition function) because this function maps the network's original output to a non-negative value, and the output curve is smooth, avoiding gradient vanishing or numerical abrupt changes caused by step activation. Compared to the potential neuron death problem of the Corrected Linear Unit (ReLU) and the gradient saturation problem of the Logistic Function (Sigmoid), the soft-addition function is more suitable for scenarios requiring continuous non-negative outputs, such as uncertainty magnitude.

[0066] The training objectives for the dual networks are as follows: The state transition fitting network aims to minimize the sum of squared Euclidean distance errors in the state predictions, ensuring the predicted values are as close as possible to the true state values and guaranteeing prediction accuracy. The uncertainty estimation network aims to minimize the sum of squared differences between the prediction error and the uncertainty magnitude, ensuring the uncertainty estimate accurately matches the actual prediction error and avoiding underestimation or overestimation of the error. For example, if the actual prediction error is 5, the uncertainty estimate should be as close as possible to 5, rather than 3 or 7.

[0067] The dual-network joint training based on historical training data samples involves first training the state transition fitting network until convergence, and then training the uncertainty estimation network after fixing its parameters. This step-by-step joint training method ensures that the state transition fitting network first grasps the core state evolution laws, providing reliable error samples for the uncertainty estimation network, avoiding optimization conflicts caused by training the two networks simultaneously, and improving overall training efficiency and model performance.

[0068] The multilayer perceptron's structure includes two hidden layers for both the state transition fitting network and the uncertainty estimation network. The first hidden layer contains 256 neurons, and the second hidden layer contains 256 neurons. The input layer dimension is the sum of the scene state data dimension and the operation command data dimension. The output layer dimensions are the scene state data dimension (state transition fitting network) and 1 (uncertainty estimation network), respectively. The network does not include dropout layers and uses fully connected layers to connect the neurons.

[0069] The training hyperparameters include: the optimizer is the adaptive momentum estimation algorithm (Adam); the learning rate is preferably 0.001, with a range of 0.0001 to 0.01; the batch size is fixed at 1024; the number of training epochs is fixed at 50; and the weight decay coefficient is fixed at 0.00001 to suppress overfitting.

[0070] The size of the historical training data sample set includes: the sample set must contain at least 1000 valid training trajectory data, and the time step length of each trajectory must be no less than 100. The sample set must cover training data from trainees with different training scenarios and different operational levels to ensure the diversity of data distribution and avoid model overfitting to a single scenario or operation mode.

[0071] The numerical stability thresholds for soft addition functions include: when the input value is greater than 20, it automatically switches to linear calculation, that is, the output value equals the input value, avoiding numerical overflow caused by exponential operations; when the input value is less than or equal to 20, it calculates according to the standard formula to ensure the smoothness of the output.

[0072] The network parameters are initialized using the Xavier initialization method to initialize the weight parameters of each layer of the network, ensuring that the variance of the input and output is consistent and avoiding training difficulties caused by initial weights being too large or too small. Bias parameters are uniformly initialized to 0.

[0073] The early stopping rules during training include: setting a validation set, which comprises 20% of the total sample set; calculating the loss value on the validation set after each training epoch; stopping training if the validation set loss value does not decrease for five consecutive epochs and using the current optimal network parameters; and ending training after 50 epochs if the validation set loss value continues to decrease.

[0074] Preferably, a scene evolution model is established by invoking a behavior recognition intelligent agent, and the state changes of the virtual training environment are deduced in multiple steps based on the time-series operation sequence to obtain the deduced state of the virtual training environment, including:

[0075] Define the current time. The scene state data extracted from the time-series operation sequence is , get included The default reference instruction set for each instruction ;

[0076] For each reference instruction in the reference instruction set The state transition fitting network is used Perform according to the following formula Step-by-step rolling prediction generates a set of state sequences for simulating the virtual training environment. :

[0077] ;

[0078] in, This is the prediction step index, with a value range of 0 to... The preset number of deduction steps; Indicates the application of reference instructions Under the condition of the first The state of the step-by-step deduction; This refers to the state transition fitting network with fixed parameters.

[0079] The preset number of prediction steps is a fixed number of prediction steps set during multi-step prediction. It is preferably 10, so that combined with a sampling time interval of 200 milliseconds, 10 steps correspond to a prediction window of 2 seconds, which can fully cover the impact of short-term operations on the scene state, while avoiding error accumulation caused by too many steps.

[0080] The reference instruction set is a pre-defined set containing multiple control instructions used to traverse the impact of different operations on the scene state. Preferably, it contains a set of 256 instructions, generated using a low-difference sequence to uniformly cover the operation space, balancing computational efficiency and coverage integrity.

[0081] The i-th reference instruction in the reference instruction set is a single control instruction in the reference instruction set, and its format and dimensions are consistent with those of the training operation instructions.

[0082] The state of the l-th step under the i-th reference instruction is the virtual training environment state obtained by applying the i-th reference instruction and predicting iteratively for l steps.

[0083] Multi-step rolling prediction logic based on fixed reference instructions: A single reference instruction runs through all prediction steps, rather than changing the instruction at each step, in order to focus on the continuous impact of the instruction on the scenario state. For example, if a reference instruction is to increase the valve opening, then all 10 prediction steps are based on this instruction to observe the continuous trend of state changes.

[0084] The selection of the initial prediction state includes extracting the current scenario state data from the time-series operation sequence, because this data can truly reflect the environmental conditions at the starting point of the simulation, ensuring that the simulation results are consistent with the actual training process. For example, if the equipment pressure in the current scenario state is 1.2 MPa, this value is the initial pressure value for the simulation.

[0085] The iterative prediction process includes: the predicted state at step N is generated from the predicted state at step N-1 and the same reference instruction, in order to simulate the continuous effect of the operation instruction. For example, if the first prediction step yields a pressure of 1.3 MPa, the second prediction step, based on this pressure and the same instruction, may yield 1.4 MPa, which conforms to the continuous influence law of actual operation.

[0086] The generation of the reference instruction set includes: using a low-discrepancy sequence (Sobol sequence), which covers the operation space more uniformly than a pseudo-random sequence. During generation, the sequence is not shuffled and the seed is set to 0, so that the results are reproducible. The instruction dimension is consistent with the dimension of the operation instruction data.
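
The generation step described above can be sketched in a few lines. This is an illustrative sketch only: to stay dependency-free it uses a Halton sequence as a stand-in for the Sobol sequence named in the text, and the function names, the `PRIMES` table, and the unit-cube output range are assumptions, not part of the patent.

```python
from typing import List

# First primes, used as per-dimension bases of the Halton sequence (assumption).
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]


def radical_inverse(index: int, base: int) -> float:
    """Van der Corput radical inverse of `index` in `base`, a value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def reference_instruction_set(dim: int, count: int = 256) -> List[List[float]]:
    """Deterministic, unshuffled low-discrepancy points in [0, 1)^dim."""
    if not 1 <= dim <= len(PRIMES):
        raise ValueError("dim must be between 1 and %d" % len(PRIMES))
    return [
        [radical_inverse(i, PRIMES[d]) for d in range(dim)]
        for i in range(count)
    ]


# Hypothetical 3-dimensional operation space.
instructions = reference_instruction_set(dim=3)
```

Because nothing here is random, repeated calls return the same 256 points, which mirrors the reproducibility requirement. Continuous control quantities would still need rescaling from the unit interval to their original numerical range.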

[0087] The number of reference instructions in the reference instruction set is preferably 256, which balances coverage completeness against computational load: too many instructions increase the computational burden, while too few cannot fully traverse the operation space.

[0088] The preset number of simulation steps is preferably 10, but can be adjusted to 15 if the operation response in the training scenario is slow, or to 8 if the operation response is fast. The key is to cover the main short-term impact of the operation.

[0089] The handling of abnormal states during the simulation includes: when the simulated state exceeds the physically reasonable range, numerical clipping is used to limit the state value to a preset maximum or minimum value. For example, if the reasonable temperature range for the equipment is -20 to 300 degrees Celsius, and the simulation result exceeds this range, it will be clipped to the boundary of that range.
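
The fixed-instruction rolling prediction and the clipping of abnormal states described above might look like the following sketch. The `toy_model` dynamics, the state key `pressure_mpa`, and the bounds are hypothetical and exist only to make the example runnable; the patent's actual prediction model is not specified here.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Bounds = Dict[str, Tuple[float, float]]


def clip_state(state: State, bounds: Bounds) -> State:
    """Limit every bounded variable to its preset [minimum, maximum] range."""
    clipped = dict(state)
    for name, (low, high) in bounds.items():
        if name in clipped:
            clipped[name] = min(max(clipped[name], low), high)
    return clipped


def rollout(initial: State, instruction: float,
            step_model: Callable[[State, float], State],
            bounds: Bounds, steps: int = 10) -> List[State]:
    """Predict `steps` states, reusing the same instruction at every step."""
    states, current = [], dict(initial)
    for _ in range(steps):
        current = clip_state(step_model(current, instruction), bounds)
        states.append(current)
    return states


def toy_model(state: State, instruction: float) -> State:
    """Hypothetical dynamics: pressure rises 0.1 MPa per unit of instruction."""
    return {"pressure_mpa": state["pressure_mpa"] + 0.1 * instruction}


# Start from the current scenario state (1.2 MPa) and hold one instruction.
trajectory = rollout({"pressure_mpa": 1.2}, 1.0, toy_model,
                     bounds={"pressure_mpa": (0.0, 1.8)})
```

With these made-up numbers the pressure rises step by step from the initial 1.2 MPa and is then held at the upper bound once the raw prediction would exceed it.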

[0090] The format consistency requirement between reference instructions and actual operation instructions is as follows: discrete action groups need to be encoded as score vectors in the range of [0,1], and continuous control quantities should maintain their original numerical range to ensure that the dimensions and numerical ranges of reference instructions and actual operation instructions are completely consistent, thus avoiding logical conflicts in deduction.

[0091] Preferably, for a preset reference instruction set, the attainability of the reference instruction set relative to risk constraints and task objective conditions is evaluated in the simulated virtual training environment, and the recoverable operational margin at the current moment is calculated, including:

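[0092] Construct a risk assessment function $R(s)$ to characterize the risk constraints and an objective evaluation function $G(s)$ to characterize the task objective conditions:

[0093] $R(s) = \sum_{j=1}^{J_R} \alpha_j \, r_j(s), \qquad G(s) = \sum_{j=1}^{J_G} \beta_j \, g_j(s)$;

[0094] where $s$ is the scene state data; $r_j(s)$ is the $j$-th normalized risk feature; $g_j(s)$ is the $j$-th normalized target feature; $\alpha_j$ and $\beta_j$ are the corresponding weighting coefficients; and $J_R$ and $J_G$ are the numbers of risk features and target features, respectively;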

[0095] For each reference instruction in the reference instruction set, the recoverable weight of that instruction is calculated from the state sequence of its corresponding virtual training environment using the following formula:

[0096] ;

[0097] The formula uses the preset safety threshold and the preset target threshold; the penalty weights; and the mean of the predicted local uncertainty magnitude over the future k prediction steps, which is calculated using the following formula: ;

[0098] Calculate the number of valid recoverable instructions and the recoverable operation margin at the current moment using the following formula:

[0099] ;

[0100] The formula also uses the total number of reference instructions in the reference instruction set.

[0101] The weighting coefficients of the risk assessment function are non-negative coefficients used to adjust the importance of each risk feature in the risk assessment function. Preferably, they are uniformly distributed, meaning that the sum of all weighting coefficients is 1 and they are all equal, thus ensuring that each risk feature participates in the assessment fairly when there is no explicit priority.

[0102] The weighting coefficients of the objective evaluation function are non-negative coefficients used to adjust the importance of each objective feature in the objective evaluation function. Preferably, they are uniformly distributed, that is, the sum of all weighting coefficients is 1 and they are all equal, thus ensuring that each objective feature participates in the evaluation fairly when there is no explicit priority.

[0103] The normalized risk feature function is a function obtained by normalizing the original risk-related features in the scenario state data. Features related to alarm level, danger distance, violation count, and accident chain progress are preferred for construction, as these features comprehensively reflect the risk status of the training scenario.

[0104] The normalized target feature function is a function obtained by normalizing the original target-related features in the scene state data. Features related to step completion, quality error, and key parameter deviation are preferred for construction, as these features comprehensively reflect the progress of the training task.

[0105] The dimension of risk features is the total number of weighted coefficients in the risk assessment function, i.e., the number of risk features. A value of 4 is preferred to balance comprehensiveness of the assessment with computational efficiency; too many can lead to redundancy, while too few can result in overlooking key risk points.

[0106] The dimension of the target features is the total number of weight coefficients in the target evaluation function, i.e., the number of target features. A value of 3 is preferred to balance comprehensive evaluation with computational efficiency; too many dimensions can lead to redundancy, while too few can result in missing key target points.

[0107] The preset safety threshold is a critical value used to determine whether the risk assessment results meet safety requirements. A value of 0.7 is preferred because this balances the stringency of safety constraints with the feasibility of practical training; values below this threshold indicate excessive risk.

[0108] The preset target threshold is a critical value used to determine whether the target evaluation results have reached the expected progress. A value of 0.6 is preferred because this value reasonably defines the minimum requirements for target progress; a value below this indicates insufficient target achievement.

[0109] The weight of the risk penalty item is a positive number used to adjust the degree of influence of the risk penalty item in the calculation of the recoverable weight. A value of 2 is preferred because in high-risk training, the importance of risk constraints needs to be emphasized, and this weight can effectively amplify the penalty caused by exceeding the risk limit.

[0110] The weight of the target penalty term is a positive number used to adjust the degree of influence of the target penalty term in the calculation of the recoverable weight. It is preferably 1, which forms a reasonable balance with the risk penalty term and emphasizes both goal achievement and risk constraints.

[0111] The weight of the uncertainty penalty term is a positive number used to adjust the degree of influence of the uncertainty penalty term in the calculation of the recoverable weight. It is preferably 1 to balance the impact of prediction error and avoid evaluation distortion due to excessive uncertainty.

[0112] The recoverable weight of the i-th reference instruction is a value that quantifies the degree to which the i-th reference instruction satisfies risk constraints and objective requirements within a future simulation step.

[0113] The mean of the local uncertainty amplitude in the next k steps is the arithmetic mean of the predicted local uncertainty amplitude values of the k-step deduced state corresponding to the i-th reference instruction.

[0114] The number of effective recoverable instructions is the sum of the recoverable weights of all reference instructions in the reference instruction set, reflecting the total number of effective safe instructions available at the current moment.

[0115] The recoverable margin at the current moment is a value obtained by logarithmically transforming the number of valid recoverable instructions, and is used to quantify the size of the system's recoverable space at the current moment.

[0116] The weighted continuous function of the risk assessment function and the target assessment function includes: integrating multiple normalized features into a single continuous function using a weighted summation form, rather than a discrete function or a single feature function, in order to achieve continuous quantitative assessment of risk and target status. For example, the risk assessment function integrates features such as alarm level and danger distance, while the target assessment function integrates features such as step completion degree and quality error, which can comprehensively reflect the scenario status.

[0117] The triple penalty fusion for recoverable weighting includes simultaneously incorporating risk penalty, target penalty, and uncertainty penalty, rather than a single penalty dimension. This is to comprehensively consider the impact of instructions on risk, targets, and prediction reliability. For example, although a reference instruction can advance the target, it may trigger high risk and high uncertainty; the triple penalty will reduce its recoverable weight.

[0118] The activation processing of the penalty term includes: using a smooth nonlinear activation function (the softplus function) to process the difference between the indicator value and the threshold, instead of directly taking the absolute value or squaring it. This is because the function gives a smooth transition of the penalty and avoids abrupt changes near the threshold. For example, when the risk assessment value is slightly lower than the safety threshold, the penalty term increases slowly rather than surging instantaneously.

[0119] The reversible negative exponential mapping of weights includes converting the weighted penalty sum into weights in the range of 0 to 1. This aims to achieve a monotonic mapping where the larger the penalty, the smaller the weight, and the weight values are concentrated within a reasonable range. For example, when the weighted penalty sum is 0, the weight is 1, and the weight gradually decreases as the penalty sum increases, intuitively reflecting the effectiveness of the instruction.
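
The triple-penalty fusion, the smooth activation, and the negative exponential mapping described above can be combined as in the sketch below. It is a hedged illustration, not the patent's formula: it assumes each penalty is the softplus of (threshold minus score), that the caller has already reduced the predicted state sequence to a single risk score and a single target score, and that the mean uncertainty is non-negative. All function and parameter names are hypothetical; the defaults are the preferred values stated in the text.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def recoverable_weight(risk_score: float, target_score: float,
                       mean_uncertainty: float,
                       safety_threshold: float = 0.7,
                       target_threshold: float = 0.6,
                       risk_weight: float = 2.0,
                       target_weight: float = 1.0,
                       uncertainty_weight: float = 1.0) -> float:
    """Map the weighted sum of three penalties to a weight in (0, 1)."""
    # Assumption: a penalty grows as a score falls below its threshold.
    risk_penalty = softplus(safety_threshold - risk_score)
    target_penalty = softplus(target_threshold - target_score)
    penalty_sum = (risk_weight * risk_penalty
                   + target_weight * target_penalty
                   + uncertainty_weight * mean_uncertainty)
    # Negative exponential: the larger the penalty, the smaller the weight.
    return math.exp(-penalty_sum)


comfortable = recoverable_weight(0.9, 0.8, 0.05)
marginal = recoverable_weight(0.5, 0.8, 0.30)
```

Under these assumptions the instruction that stays above both thresholds with low uncertainty receives the larger weight, and the weight changes smoothly as a score crosses its threshold. Because softplus is strictly positive, the weight approaches but does not reach 1.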

[0120] The effective recoverable instruction count includes the sum of the recoverable weights of all reference instructions, rather than a count or taking the maximum value, because the summation comprehensively reflects the cumulative recoverability of all effective instructions. For example, if 10 instructions each have a recoverable weight of 0.8, the summation reflects the overall recoverable space, while the count only reflects the number of effective instructions.

[0121] The logarithmic transformation of the recoverable operation margin includes: compressing the scale of the effective recoverable instruction count through a logarithmic transformation, to avoid excessive differences in margin values caused by differences in instruction set size. For example, effective recoverable instruction counts of 10 and 100 differ far less after the logarithmic transformation, which facilitates subsequent trend analysis of the time series.
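
As a small numeric illustration of the summation and the logarithmic compression described above, the sketch below sums the recoverable weights and takes the natural logarithm of the result. Whether the original formula adds an offset inside the logarithm is not recoverable from the text, so the plain logarithm used here is an assumption, as are the function names.

```python
import math
from typing import Sequence


def effective_recoverable_count(weights: Sequence[float]) -> float:
    """Sum of the recoverable weights of all reference instructions."""
    return math.fsum(weights)


def recoverable_margin(weights: Sequence[float]) -> float:
    """Natural logarithm of the effective recoverable instruction count."""
    count = effective_recoverable_count(weights)
    if count <= 0.0:
        raise ValueError("effective count must be positive to take a logarithm")
    return math.log(count)


# Ten instructions, each with a recoverable weight of 0.8.
count = effective_recoverable_count([0.8] * 10)
margin = recoverable_margin([0.8] * 10)

# Counts of 10 and 100 differ by 90, but their logarithms differ by ln(10).
gap_after_log = math.log(100.0) - math.log(10.0)
```

Here the ten weights of 0.8 give an effective count of about 8, and the gap between counts of 10 and 100 shrinks from 90 to roughly 2.3 after the transformation.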

[0122] The risk feature function includes: risk features such as alarm level, danger distance, violation count, and accident chain progress. Normalization uses min-max normalization, which subtracts the minimum value from the original feature value and divides by the difference between the maximum and minimum values, mapping the feature to the interval 0 to 1. For example, if the original danger distance range is 0 to 10 meters, and the distance at a certain moment is 5 meters, then the normalized value is 0.5.

[0123] The target feature function includes: target features such as step completion degree, quality error, and key parameter deviation. Normalization uses min-max normalization; step completion degree is directly mapped to 0 to 1, while quality error and key parameter deviation are normalized after taking their reciprocals. For example, if the original range of quality error is 0 to 0.2, and the error at a certain moment is 0.1, then the normalized value is 0.5.

[0124] Weighting coefficients include: if the job competency model explicitly provides weights, they are used directly; otherwise, uniform weights are used. For example, when the risk feature dimension is 4, each αj is 0.25; when the target feature dimension is 3, each βj is 0.333.

[0125] The preset safety threshold and target threshold values are set as follows: the preset safety threshold is preferably set to 0.7, and the preset target threshold is preferably set to 0.6. These values can be adjusted according to the training scenario, with an adjustment range of 0.5 to 0.8. For example, in low-risk training scenarios, the safety threshold can be lowered to 0.6, while in high-requirement scenarios, it can be raised to 0.8.

[0126] The range of penalty weights includes: risk penalty weights ranging from 1 to 3, and target penalty and uncertainty penalty weights ranging from 0.5 to 2, which can be adjusted according to the training priority. For example, in purely safety-oriented training, the risk penalty weight can be adjusted to 3, and in target-oriented training, the target penalty weight can be adjusted to 2.

[0127] The base of the logarithmic transformation includes the use of the natural logarithm, i.e., the base is the natural constant (approximately 2.718), which makes the natural logarithm smoother during scaling and conforms to common practices in mathematical modeling.

[0128] The normalization function includes: the minimum-maximum normalization formula, where the normalized value equals the original value minus the minimum value, divided by the difference between the maximum and minimum values. If the original feature has no clearly defined range, the boundary is determined based on the statistical maximum and minimum values of historical training data. For example, if the maximum dangerous distance in historical data is 15 meters and the minimum is 0, then at a certain moment, when the distance is 7.5 meters, the normalized value is 0.5.
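
The min-max formula above is short enough to write out directly. In this sketch the clamping of out-of-range values to the interval 0 to 1 is an added assumption (the text does not say how values outside the historical range are handled), and the function name is illustrative.

```python
def min_max_normalize(value: float, minimum: float, maximum: float) -> float:
    """(value - minimum) / (maximum - minimum), clamped to [0, 1]."""
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    scaled = (value - minimum) / (maximum - minimum)
    # Assumption: values outside the known range are clamped to the boundary.
    return min(max(scaled, 0.0), 1.0)


# Danger distance of 5 m in a 0 to 10 m range.
fixed_range = min_max_normalize(5.0, 0.0, 10.0)

# Danger distance of 7.5 m with a historical range of 0 to 15 m.
historical_range = min_max_normalize(7.5, 0.0, 15.0)
```

Both calls reproduce the worked examples in the text and evaluate to 0.5.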

[0129] Preferably, a recoverability decay index is generated based on the decreasing trend of the recoverable operation margin in the time-series operation sequence, including:

[0130] For each moment $t$ in the time-series operation sequence, apply three-point moving average smoothing to the recoverable operational margin $M_t$ to obtain the smoothed recoverable margin $M'_t$:

[0131] $M'_t = \dfrac{M_{t-1} + M_t + M_{t+1}}{3}$;

[0132] where the boundary points at the beginning and end of the sequence are handled by numerical copying;

[0133] Calculate the single-step decay $d_t$ at each time step using the following formula:

[0134] $d_t = \left[ M'_{t-1} - M'_t \right]_+ = \max\left(0,\; M'_{t-1} - M'_t\right)$;

[0135] where $[\cdot]_+$ denotes taking the non-negative part, so that decay is accumulated only when the margin decreases.

[0136] The recoverability decay index $D$ is calculated using the following formula:

[0137] $D = \dfrac{1}{T} \sum_{t=1}^{T} d_t$;

[0138] where $T$ is the total length of the time steps in the time-series operation sequence.

[0139] The smoothed recoverable margin is the value obtained by applying a three-point moving average filter to the original recoverable margin, which is used to reduce noise interference in the original data.

[0140] The single-step decay amount at the current moment is the difference between the smoothed value at the previous moment and the smoothed value at the current moment in the smoothed recoverable operation margin sequence. Only non-negative values are retained, which are used to quantify the degree of contraction of the recoverable space at a single moment.

[0141] The recoverability decay index is the arithmetic mean of the single-step decay at all times during the entire time-series operation sequence, used to comprehensively reflect the overall shrinkage trend of recoverable space during the training process.

[0142] The total length of the time steps in the time-series operation sequence is the total number of discrete time windows obtained by dividing the sequence at the sampling time interval, reflecting the time span of the training process.

[0143] Smoothing of recoverable operational margins: Three-point moving average filtering is chosen over other methods such as five-point filtering and exponential smoothing because three-point filtering can effectively suppress noise while preserving the temporal trend of the data to the greatest extent. For example, if the original margin data has occasional fluctuations, the three-point mean can smooth out the fluctuations and more accurately reflect the decay pattern.

[0144] The single-step decay amount includes: only accumulating the positive difference between the smoothed value of the previous time step and that of the current time step, and counting negative differences as zero, because decay is only counted when the recoverable space shrinks. For example, if the smoothed value of the previous time step is 5 and that of the current time step is 4, the difference of 1 is included in the decay; if the current value is instead 6, the difference of -1 is not included, which conforms to the core definition of decay.

[0145] The recoverability decay index uses the arithmetic mean of the single-step decay amounts, rather than other statistics such as the sum or median, because the mean reflects the decay intensity evenly over the entire training process. For example, if the decay is significant in the early stages of training and stabilizes in the later stages, the mean reflects the overall trend and, unlike the sum, is not affected by the duration of the training.

[0146] Smoothing of sequence boundaries includes using numerical duplication at the beginning and end of the sequence, instead of zero-padding or extrapolation, to avoid filtering distortion caused by a lack of adjacent data at boundary points. For example, the first value of the sequence is duplicated as the preceding dummy value, and the last value is duplicated as the following dummy value, ensuring that the filtering results at boundary points are reasonable.

[0147] The calculation of the moving average filter includes: the smoothed recoverable margin is equal to the sum of the original margins at the previous time step, the current time step, and the next time step, divided by 3. Boundary point processing is as follows: the previous dummy value at the first time step is equal to itself, and the next dummy value at the last time step is equal to itself. For example, for the sequence M1, M2, M3, M4, after smoothing, M1'=(M1+M1+M2) / 3, M2'=(M1+M2+M3) / 3, M3'=(M2+M3+M4) / 3, and M4'=(M3+M4+M4) / 3.
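
The smoothing, single-step decay, and averaging steps described above can be traced end to end with a short sketch. Two details are assumptions made so the example is well defined: the decay at the first time step is taken as 0 (there is no earlier smoothed value), and the mean is taken over all time steps. The function names are illustrative.

```python
from typing import List, Sequence


def smooth_three_point(margins: Sequence[float]) -> List[float]:
    """Three-point moving average with the first and last values duplicated."""
    if not margins:
        return []
    padded = [margins[0], *margins, margins[-1]]
    return [(padded[i - 1] + padded[i] + padded[i + 1]) / 3.0
            for i in range(1, len(padded) - 1)]


def single_step_decay(smoothed: Sequence[float]) -> List[float]:
    """Non-negative drop from the previous smoothed value; first step is 0."""
    decay = [0.0] if smoothed else []
    for previous, current in zip(smoothed, smoothed[1:]):
        decay.append(max(0.0, previous - current))
    return decay


def decay_index(margins: Sequence[float]) -> float:
    """Arithmetic mean of the single-step decay over the whole sequence."""
    decay = single_step_decay(smooth_three_point(margins))
    return sum(decay) / len(decay) if decay else 0.0


shrinking = [6.0, 6.0, 3.0, 3.0]
smoothed = smooth_three_point(shrinking)   # [6.0, 5.0, 4.0, 3.0]
index = decay_index(shrinking)             # (0 + 1 + 1 + 1) / 4 = 0.75
```

A margin sequence that only rises yields an index of 0, since every negative difference is discarded.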

[0148] The reasonable range for the recoverability decay index includes: typically 0 to 1, but varies depending on the training scenario. The reasonable range for low-risk training scenarios is 0 to 0.3, and the reasonable range for high-risk training scenarios is 0 to 0.5. Exceeding this range indicates that the trainee's operation is at higher risk.

[0149] The requirements for smoothing to suppress outliers include: after three-point moving average filtering, the impact of outliers on the overall sequence should be reduced by more than 50%. For example, if an outlier appears at a certain moment in the original data that is twice the normal range, the value at that point should be close to the normal range after filtering.

[0150] Preferably, the target matching agent includes:

[0151] The indicator statistics module is used to analyze the time-series operation sequence, extract key indicator data that reflects the completion of the practical training task, and generate the corresponding training result indicator vector $y$;

[0152] The benchmark comparison module is used to calculate the weighted Euclidean distance $d_W$ between the training result indicator vector $y$ and the preset ideal target threshold vector $y^{*}$. The calculation formula is:

[0153] $d_W = \left\lVert W \left( y - y^{*} \right) \right\rVert_2$;

[0154] where $W$ is the preset diagonal matrix of indicator weights and $\lVert \cdot \rVert_2$ denotes the Euclidean norm;

[0155] The matching degree generation module is used to generate the task achievement degree $A$ from the weighted Euclidean distance $d_W$ and the aforementioned recoverability decay index $D$. The calculation formula is:

[0156] $A = \exp\left(-d_W\right) \cdot \exp\left(-\gamma D\right)$;

[0157] where $\exp(\cdot)$ is the exponential function and $\gamma$ is a preset positive strength coefficient that incorporates the recoverability decay index into the task achievement degree.

[0158] The training result index vector is a vector composed of key indicators extracted from the time-series operation sequence, which can reflect the completion of the training task.

[0159] The ideal target threshold vector is a preset vector composed of indicator thresholds that embody the ideal completion standard of the training task. Preferably, it is a set of indicator thresholds determined based on job competency standards or industry norms to ensure the reasonableness and achievability of the target.

[0160] The indicator weight diagonal matrix is preset and used to adjust the importance of each key indicator in the distance calculation. It is preferably a matrix with diagonal elements between 0.1 and 1.0 to highlight the influence of core indicators and avoid secondary indicators interfering with the evaluation results.

[0161] The weighted Euclidean distance is the Euclidean distance between the training result index vector and the ideal target threshold vector after being weighted by the index weight diagonal matrix. It is used to quantify the difference between the actual result and the ideal target.

[0162] The task achievement degree is a comprehensive indicator that reflects the quality and safety level of the training task, obtained by combining the basic matching score and the safety reduction coefficient.

[0163] The strength coefficient of the recoverability decay index is a preset positive number used to adjust the impact of the recoverability decay index on the task achievement degree. A value of 1 is preferred to balance the safety reduction against the basic matching score, preventing a single factor from dominating the result.

[0164] The two-factor fusion of task achievement includes multiplying the basic matching score by a safety reduction coefficient, rather than adding or combining them, to ensure that the safety level directly reduces the task achievement. For example, in a training exercise, even if the basic matching score is high, if the recoverability decay index is large, the safety reduction coefficient will lower the final task achievement, thus avoiding using goal achievement as the sole evaluation criterion.

[0165] The calculation of the base matching score includes: taking the negative value of the weighted Euclidean distance and then performing an exponential mapping. This is because a smaller distance indicates a closer match to the ideal target, and the exponential mapping can convert the distance into a score in the range of 0 to 1, achieving a reasonable mapping where the smaller the distance, the higher the score. For example, when the distance is 0, the base matching score is 1, and the score gradually decreases as the distance increases.

[0166] The calculation of the safety reduction factor includes: weighting the recoverability decay index and taking the negative value for exponential mapping. This is to ensure that the larger the recoverability decay index is, the smaller the reduction factor is, thus reflecting the weakening of the safety risk on the degree of task achievement. For example, when the recoverability decay index is 0, the reduction factor is 1, and the reduction factor gradually decreases as the index increases.
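
The two-factor fusion described above (a base matching score multiplied by a safety reduction factor) can be sketched as follows. The sketch assumes the diagonal weights multiply each indicator difference before the Euclidean norm is taken; the function names and the example indicator values are hypothetical.

```python
import math
from typing import Sequence


def weighted_distance(result: Sequence[float], target: Sequence[float],
                      weights: Sequence[float]) -> float:
    """Euclidean norm of the per-indicator differences scaled by the weights."""
    if not (len(result) == len(target) == len(weights)):
        raise ValueError("all three vectors must have the same length")
    return math.sqrt(sum((w * (r - t)) ** 2
                         for r, t, w in zip(result, target, weights)))


def task_achievement(result: Sequence[float], target: Sequence[float],
                     weights: Sequence[float], decay_index: float,
                     strength: float = 1.0) -> float:
    """Base matching score times safety reduction factor, both exp(-x)."""
    base_score = math.exp(-weighted_distance(result, target, weights))
    safety_factor = math.exp(-strength * decay_index)
    return base_score * safety_factor


# Hypothetical indicators: key-step completion rate and violation rate.
ideal = [1.0, 0.0]
diag_weights = [0.9, 0.2]
perfect_and_safe = task_achievement([1.0, 0.0], ideal, diag_weights, 0.0)
perfect_but_risky = task_achievement([1.0, 0.0], ideal, diag_weights, 0.5)
```

A run that matches the ideal target with no decay scores 1, while the same result with a decay index of 0.5 is reduced by the factor exp(-0.5), so goal achievement alone cannot produce a top score.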

[0167] The indicator weight diagonal matrix is used so that the weighted Euclidean distance, rather than an equal-weighted distance, highlights the impact of key indicators, because different indicators matter to different degrees for task completion. For example, in surgical training, operational accuracy has a higher weight than operational speed, ensuring that core competencies are given priority.

[0168] The training results indicator vector consists of indicators such as: completion rate of key steps, pass rate of operation quality, control rate of violation frequency, and progress towards target achievement. For example, in power emergency training, indicators include fault diagnosis completion rate, pass rate of operation procedures, and compliance rate of emergency response time.

[0169] The ideal target threshold vector is formulated based on industry standards, job competency requirements, or expert review results. For example, in industrial safety training, the ideal threshold for the completion rate of key steps is set at 100%, and the ideal threshold for the violation control rate is set at 0%.

[0170] The rules for the element values in the indicator weight diagonal matrix are as follows: the weight elements of core indicators range from 0.7 to 1.0, the weight elements of secondary indicators range from 0.1 to 0.3, and the weight elements of general indicators range from 0.4 to 0.6. For example, in surgical training, the weight of the operation accuracy indicator is set to 0.9, and the weight of the operation time indicator is set to 0.2.

[0171] The intensity coefficient can be set to a value of 1, which is preferred but can be adjusted from 0.5 to 2.0 depending on the training scenario. For high-risk training scenarios, the intensity coefficient can be adjusted to 1.5 to 2.0 to enhance the impact of safety factors; for ordinary scenarios, it can be adjusted to 0.5 to 1.0.

[0172] The base of the exponential mapping is the natural constant (approximately 2.718), because the natural exponential mapping distributes the scores smoothly over the interval between 0 and 1 and avoids extreme fluctuations in the scores.

[0173] Preferably, the target matching agent is invoked to calculate the task achievement degree by fusing the recoverability decay index, and a comprehensive training score is generated by combining the resource consumption data of the training process, including:

[0174] Calculate the normalized time cost, the resource cost, and the system operating cost using the following formulas, respectively:

[0175] ;

[0176] The formulas use the total length of the time-series operation sequence, the sampling time interval, and the reference duration; the norm of the resource consumption data at each moment; the weights corresponding to the latency, packet loss rate, and frame rate at each moment; and the normalization function;

[0177] The comprehensive training score is generated using the following formula:

[0178] ;

[0179] The formula uses the task achievement degree; the recoverability decay index; the positive weighting coefficients; the numerical stability constant; and the sigmoid function.

[0180] The normalized time cost is calculated based on the total duration of the time sequence operation and the preset reference duration, reflecting the degree to which the training time deviates from the standard duration.

[0181] The normalized resource cost is calculated based on the cumulative total resource consumption and total duration in the time-series operation sequence, reflecting the intensity of resource consumption per unit time.

[0182] The normalized system operating cost is calculated based on the average value of the operating cost data of the training system in the time-series operation sequence, reflecting the stability of the system operation during the training process.

[0183] The reference duration is preset and serves as a standard duration for measuring whether the training time is reasonable. A preferred duration is 600 seconds, thus balancing task completion requirements with efficiency demands, taking into account the standard task durations for most high-risk training exercises.

[0184] The weighting coefficients for system operating costs are preset and used to adjust the importance of latency, packet loss rate, and frame rate in the calculation of system operating costs. A uniform distribution is preferred, with each of the three coefficients being one-third, thus ensuring fair participation of each performance indicator in the evaluation when there is no explicit priority.

[0185] The weighting coefficients for the overall score are preset positive numbers used to adjust the impact of the task achievement degree, the recoverability decay index, and the three types of costs on the overall score. Preferably, the positive weighting coefficient of the task achievement degree is 3, the negative weighting coefficient of the recoverability decay index is 3, and the negative weighting coefficients of the three types of costs are all 1, which highlights the core importance of task achievement and safety risk while still accounting for efficiency and resource consumption.

[0186] The numerical stability constant is a preset, extremely small positive number used to avoid anomalies in the logarithmic calculation when the task achievement degree is zero. It is preferably 10 to the power of -6, as this value is small enough not to affect the calculation results while effectively preventing numerical errors.

[0187] The comprehensive training score is a score ranging from zero to one obtained by weighting and aggregating the task achievement rate, the recoverability decay index, and the three types of costs, and mapping it through an S-shaped function. It reflects the overall effect of the training.

[0188] Latency is the time delay of data transmission during the operation of the training system, measured in milliseconds. It can be obtained through the latency detection module of a system performance monitoring tool.

[0189] Packet loss rate is the percentage of data transmission lost during the operation of the training system. It can be obtained using network performance monitoring tools.

[0190] The frame rate is the number of image frames displayed per second during the operation of the training system, measured in frames per second. It can be obtained through the performance monitoring module of the graphics processing unit.

[0191] The normalization calculation of the three types of costs includes: Time cost, normalized based on the reference duration, to unify the comparison standard for different training durations and avoid distortion in time consumption assessment caused by differences in the duration of the tasks themselves; Resource cost, normalized based on the total duration, to reflect the intensity of resource consumption per unit time, rather than the total consumption, thus better reflecting resource utilization efficiency; and System operation cost, normalized based on a weighted average of multiple indicators, to comprehensively evaluate the multi-dimensional performance of the system and avoid a single performance indicator reflecting the system state in a one-sided way. For example, if Training A has a total duration of 900 seconds and the reference duration is 600 seconds, the time cost is normalized to 1.5, intuitively showing that the time consumption exceeds the standard.
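
The time-cost example above, together with the per-unit-time reading of the resource cost, can be checked with a short sketch. The exact formulas are not recoverable from the text, so both functions are assumed forms: the time cost is total duration over reference duration, and the resource intensity is the summed consumption norms divided by total duration, before any further normalization.

```python
from typing import Sequence


def normalized_time_cost(num_steps: int, sampling_interval_s: float,
                         reference_duration_s: float = 600.0) -> float:
    """Total training duration divided by the reference duration."""
    if reference_duration_s <= 0:
        raise ValueError("reference duration must be positive")
    return num_steps * sampling_interval_s / reference_duration_s


def resource_intensity(consumption_norms: Sequence[float],
                       sampling_interval_s: float) -> float:
    """Cumulative resource consumption per second of training (assumed form)."""
    total_duration_s = len(consumption_norms) * sampling_interval_s
    if total_duration_s <= 0:
        raise ValueError("sequence must be non-empty with a positive interval")
    return sum(consumption_norms) / total_duration_s


# Training A: 4500 windows of 200 ms give a total duration of 900 seconds.
time_cost = normalized_time_cost(num_steps=4500, sampling_interval_s=0.2)
```

With the 200 ms sampling interval mentioned earlier, 4500 windows correspond to the 900-second run in the example, giving a time cost of about 1.5 against the 600-second reference.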

[0192] The weighted aggregation of the comprehensive training score includes: a positively weighted logarithmic value of task achievement, a negatively weighted logarithmic value of the recoverability decay index, and logarithmic values of the three types of costs. The non-linear weighting is used to compress the numerical scale through logarithmic transformation, while simultaneously reflecting the evaluation logic that higher achievement is better, and lower decay and cost are better. For example, when the task achievement is close to 1, the logarithmic transformation still maintains effective discrimination; however, as the decay index increases, the negative weighting will significantly lower the comprehensive score.

[0193] The numerical stabilization strategy adds a numerical stability constant inside the logarithm of the task achievement. The task achievement may be zero, and the logarithm of zero is undefined; the constant keeps the calculation well defined. For example, when the task achievement is zero, adding the constant gives 10 to the power of -6, whose logarithm is a valid finite value.

[0194] The S-shaped function maps the weighted sum to the interval between zero and one. This standardizes the comprehensive score, making the scores of different training sessions comparable while avoiding extreme values. For example, whatever the value of the weighted sum, the S-shaped function maps it to a value between zero and one, which facilitates setting and comparing the passing benchmark value in subsequent steps.

[0195] The recommended reference duration is 600 seconds, which can be adjusted according to the training scenario within the range of 300 to 1200 seconds. For short training scenarios such as simple equipment operation, 300 seconds can be used; for long training scenarios such as complex surgical simulation, 1200 seconds can be used.

[0196] The normalization function is implemented using min-max normalization, which maps the original data to the interval between zero and one. For example, if the original range of latency is 0 milliseconds to 1000 milliseconds and the latency at a certain moment is 500 milliseconds, the normalized value is 0.5; if the original range of frame rate is 1 frame per second to 60 frames per second and the frame rate at a certain moment is 30 frames per second, the normalized value is (30 - 1) / (60 - 1), approximately 0.49.
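
A minimal sketch of min-max normalization follows. Clamping out-of-range inputs to the interval from zero to one is an assumption added here and is not stated in the text; the function name is hypothetical. Note that with a lower bound of 1, a frame rate of 30 maps to 29/59, slightly below one half.

```python
def min_max_normalize(value: float, lower: float, upper: float) -> float:
    """Map value from [lower, upper] to [0, 1]; out-of-range inputs are clamped (assumption)."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    scaled = (value - lower) / (upper - lower)
    return min(1.0, max(0.0, scaled))


# Latency of 500 ms in the range 0 to 1000 ms.
latency_norm = min_max_normalize(500.0, 0.0, 1000.0)
# Frame rate of 30 fps in the range 1 to 60 fps: (30 - 1) / (60 - 1).
frame_rate_norm = min_max_normalize(30.0, 1.0, 60.0)
```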

[0197] Each system operation cost weighting coefficient can be set within the range of 0.1 to 0.5, preferably with equal weights across the indicators. The weights can be adjusted to the system performance requirements of the training exercise. For latency-sensitive training such as flight simulation, the latency weight can be set to 0.5 and the weights of the other two indicators to 0.25 each.

[0198] The value ranges of the comprehensive scoring weighting coefficients are: 2 to 4 for the positive weighting coefficient of the task achievement, 2 to 4 for the negative weighting coefficient of the recoverability decay index, and 0.5 to 1.5 for the negative weighting coefficients of the three types of costs. For scenarios with high safety requirements, the weighting coefficient of the recoverability decay index can be set to 4; for scenarios with high efficiency requirements, the weighting coefficient of the time cost can be set to 1.5.

[0199] The numerical stability constant is preferably 10 to the power of -6; 10 to the power of -5 or 10 to the power of -7 can also be used. The key is that the value is small enough not to affect the calculation results.

[0200] The normalization bounds of the metrics in the system operation cost are: for latency, an upper bound of 1000 milliseconds and a lower bound of 0 milliseconds; for packet loss rate, an upper bound of 100% and a lower bound of 0%; for frame rate, an upper bound of 60 frames per second and a lower bound of 1 frame per second.
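
Combining these bounds with the weighting coefficients, the system operation cost could be computed along the following lines. This is a sketch under stated assumptions: the dictionary keys, the function name, and the sample indicator means are hypothetical, and the text does not say whether frame rate (where higher is better) is inverted before weighting, so no inversion is applied here.

```python
from typing import Dict

# Normalization bounds from the text, as (lower, upper) per indicator.
BOUNDS = {
    "latency_ms": (0.0, 1000.0),
    "packet_loss_pct": (0.0, 100.0),
    "frame_rate_fps": (1.0, 60.0),
}


def system_operation_cost(means: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of the min-max normalized indicator means."""
    total_weight = sum(weights[name] for name in BOUNDS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    cost = 0.0
    for name, (lower, upper) in BOUNDS.items():
        norm = (means[name] - lower) / (upper - lower)
        norm = min(1.0, max(0.0, norm))  # clamp to [0, 1] (assumption)
        cost += weights[name] * norm
    return cost / total_weight


# Latency-sensitive weighting from the text: 0.5 for latency, 0.25 for the others.
weights = {"latency_ms": 0.5, "packet_loss_pct": 0.25, "frame_rate_fps": 0.25}
# Hypothetical per-session indicator means.
means = {"latency_ms": 500.0, "packet_loss_pct": 2.0, "frame_rate_fps": 30.0}
sys_cost = system_operation_cost(means, weights)
```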

[0201] The S-shaped function is the standard sigmoid function, whose output equals 1 divided by the sum of 1 and the natural constant e raised to the negative of the input, that is, $\sigma(x) = 1 / (1 + e^{-x})$. This form gives a smooth numerical mapping and a natural transition in scoring.
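
The weighted aggregation and S-shaped mapping of the comprehensive training score can be sketched as below. Several points are assumptions rather than statements from the text: the natural logarithm is used, the stability constant is added inside every logarithm (the text states it only for task achievement), and the default weights are arbitrary values inside the stated ranges. All names are hypothetical.

```python
import math

EPS = 1e-6  # numerical stability constant (preferred value in the text)


def sigmoid(x: float) -> float:
    """Standard S-shaped function 1 / (1 + e^(-x)), evaluated without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def comprehensive_score(achievement: float, decay_index: float, time_cost: float,
                        resource_cost: float, system_cost: float,
                        w_ach: float = 3.0, w_decay: float = 3.0, w_time: float = 1.0,
                        w_res: float = 1.0, w_sys: float = 1.0) -> float:
    """Positively weight log(achievement); negatively weight the other logs; squash to (0, 1)."""
    weighted_sum = (
        w_ach * math.log(achievement + EPS)
        - w_decay * math.log(decay_index + EPS)   # EPS here is an assumption
        - w_time * math.log(time_cost + EPS)      # EPS here is an assumption
        - w_res * math.log(resource_cost + EPS)   # EPS here is an assumption
        - w_sys * math.log(system_cost + EPS)     # EPS here is an assumption
    )
    return sigmoid(weighted_sum)


# Hypothetical inputs: same session scored with a low and a high decay index.
score_low_decay = comprehensive_score(0.9, 0.05, 1.5, 0.05, 0.38)
score_high_decay = comprehensive_score(0.9, 0.20, 1.5, 0.05, 0.38)
```

With all other inputs fixed, the higher decay index yields the lower score, which matches the evaluation logic that lower decay is better.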

[0202] Preferably, based on a preset pass / fail benchmark, the comprehensive training score is mapped to pass or fail, including:

[0203] Obtain the preset passing benchmark value $\tau$. This value is calculated from $M$ historical training samples using the following boundary regression model:

[0204] $$\tau = \operatorname{clip}\left(a^{*} r_{c} + b^{*},\ 0,\ 1\right), \qquad (a^{*}, b^{*}) = \arg\min_{a,\,b} \sum_{j=1}^{M} \left(S_{j} - (a\, r_{j} + b)\right)^{2};$$

[0205] where $S_{j}$ is the comprehensive training score of the $j$-th historical training sample; $r_{j}$ is the instructor score of the $j$-th historical training sample; $r_{c}$ is the preset critical pass score; $a$ and $b$ are the regression coefficients; $\operatorname{clip}$ is the numerical truncation function; and $\arg\min$ yields the optimal regression coefficients $a^{*}$ and $b^{*}$;

[0206] Using the step function $H$, generate the binary result $y$ by the following formula:

[0207] $$y = H(S - \tau);$$

[0208] where $S$ is the currently calculated comprehensive training score; $H$ outputs 1 when its input is greater than or equal to zero, and outputs 0 otherwise.

[0209] When $y = 1$, the output is qualified; when $y = 0$, the output is unqualified.

[0210] The passing benchmark value is obtained by linear regression mapping based on historical training samples and numerical truncation. It is used to determine whether the comprehensive training score is qualified.

[0211] The historical training sample size is the total number of valid historical training samples used to calculate the qualified benchmark value. Ideally, it should be no less than 200 to ensure the reliability of the regression model fit; too few samples can easily lead to distorted benchmark values.

[0212] The comprehensive training score of the j-th historical sample is the comprehensive score result of the j-th historical training sample.

[0213] The instructor's score for the j-th historical sample is the score given by the instructor after manually assessing the j-th historical training sample. It can be obtained through the manual scoring input module of the training assessment system.

[0214] The preset critical pass score is the manual scoring threshold for determining whether the practical training is satisfactory. A value of 2 is preferred to align with the instructor scoring system of 1 to 5 points, clearly defining the critical pass standard.

[0215] The linear regression coefficients are regression parameters obtained by fitting the historical sample's comprehensive training score and instructor score using the least squares method.

[0216] The binary result is a 0 or 1 obtained by mapping the difference between the comprehensive training score and the passing benchmark value through a step function, and is used to directly determine whether it is qualified or not.

[0217] The boundary regression calculation of the passing benchmark includes: establishing a mapping relationship between the comprehensive training scores and instructor scores of historical samples using linear regression, rather than directly setting a fixed threshold. This is to ensure that the benchmark adapts to the evaluation standards of different training scenarios. For example, by fitting a large number of historical samples, the benchmark can reflect the passing standard in actual assessments, rather than being subjectively set.

[0218] The numerical truncation strategy projects the regression coefficients to the range of 0.01 to 1.0 and projects the passing benchmark value to the range of 0 to 1. This prevents outliers produced by the regression fit from pushing the benchmark value outside a reasonable range. For example, if the benchmark value obtained from the regression is 1.2, it is truncated to 1 so that the benchmark value stays within the range of the comprehensive score.

[0219] The binary mapping rule uses a step function: a score is judged qualified if the difference between the comprehensive training score and the passing benchmark value is greater than or equal to zero, and unqualified otherwise. This gives a clear binary judgment and avoids ambiguity. For example, if the comprehensive score is 0.6 and the benchmark value is 0.5, the difference is 0.1 and the result is qualified; a score of 0.4 gives a difference of -0.1 and the result is unqualified.

[0220] The number of historical training samples is preferably 200 and must be no fewer than 100. The samples need to cover trainees of different skill levels and a variety of training scenarios, so that the data are evenly distributed and the regression model generalizes better.

[0221] The threshold score can be set as follows: preferably 2, suitable for instructor rating systems ranging from 1 to 5 points, where 1 point is failure, 2 points is borderline pass, and 3 to 5 points is pass or above. If other rating systems are used, the threshold value can be adjusted proportionally.

[0222] The projection of the linear regression coefficients is as follows: the slope term is projected to the range of 0.01 to 1.0; the intercept term has no additional projection requirement, but the final benchmark value must lie between 0 and 1 after truncation. For example, if the fitted slope is 1.2, it becomes 1.0 after projection; if the fitted slope is 0.005, it becomes 0.01 after projection.

[0223] The instructor evaluation criteria include a 1-5 point system, with 1 point indicating that the goal was not achieved at all and there are serious safety issues, 2 points indicating that the goal was basically achieved but there are safety hazards, 3 points indicating that the goal was achieved and safety compliance was met, 4 points indicating good, and 5 points indicating excellent.

[0224] The requirements for the fitting quality of linear regression include: the coefficient of determination (R-squared) after fitting should not be less than 0.7, ensuring a strong linear correlation between the comprehensive training score and the instructor's score, and avoiding unreliable benchmark values due to poor fitting.
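
The benchmark calculation, including slope projection, truncation, and the R-squared check, might look like the sketch below. It assumes the comprehensive score is regressed on the instructor score and the critical pass score is then substituted into the fitted line; it also assumes the intercept is kept as fitted after the slope is projected. The function names are hypothetical, and the five-sample data set is a toy example far below the recommended sample size.

```python
def fit_line(x, y):
    """Ordinary least squares fit y ~ slope * x + intercept; also returns R-squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("instructor scores must not all be equal")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
    return slope, intercept, r_squared


def clip(value, lower, upper):
    """Numerical truncation to [lower, upper]."""
    return min(upper, max(lower, value))


def passing_benchmark(instructor_scores, training_scores, critical_score=2.0, min_r2=0.7):
    slope, intercept, r_squared = fit_line(instructor_scores, training_scores)
    if r_squared < min_r2:
        raise ValueError("fit quality below the required R-squared")
    slope = clip(slope, 0.01, 1.0)  # project the slope term
    return clip(slope * critical_score + intercept, 0.0, 1.0)


# Toy data: instructor scores (1 to 5) and comprehensive scores (0 to 1).
benchmark = passing_benchmark([1, 2, 3, 4, 5], [0.2, 0.4, 0.5, 0.7, 0.9])
```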

[0225] The determination of the boundary points of the step function includes: when the overall training score equals the passing benchmark value, it is judged as passing. For example, if the overall score is 0.5, the benchmark value is 0.5, the difference is 0, and a passing output command is triggered.
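
The step-function mapping, including the boundary case, can be sketched in a few lines. The function names are hypothetical.

```python
def step(x: float) -> int:
    """Step function: 1 when the input is greater than or equal to zero, else 0."""
    return 1 if x >= 0 else 0


def is_qualified(score: float, benchmark: float) -> bool:
    """Qualified when the score minus the benchmark is non-negative."""
    return step(score - benchmark) == 1


above = is_qualified(0.6, 0.5)     # score above the benchmark
below = is_qualified(0.4, 0.5)     # score below the benchmark
boundary = is_qualified(0.5, 0.5)  # score equal to the benchmark counts as passing
```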

[0226] The embodiments of this example have been described above. However, this example is not limited to the specific implementation methods described above. The specific implementation methods described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms based on the guidance of this example, and all of them are within the protection scope of this example.

Claims

1. A training behavior analysis method based on multi-agent collaboration, characterized in that it includes: performing time-series alignment and aggregation on the collected multi-source training data to generate a time-series operation sequence containing scene state data and operation instruction data; invoking a behavior recognition intelligent agent to establish a scene evolution model, and deducing the state changes of the virtual training environment in multiple steps based on the time-series operation sequence to obtain the deduced state of the virtual training environment; for a preset reference instruction set, evaluating the achievability of the reference instruction set relative to risk constraints and task objective conditions in the deduced virtual training environment, and calculating the recoverable operation margin at the current moment, including: constructing a risk assessment function to characterize the risk constraints and a target assessment function to characterize the task objective conditions, both of which are weighted continuous functions of the scene state data; for each reference instruction in the reference instruction set, calculating the recoverable weight of the reference instruction based on the deduced states of the virtual training environment at the set of future moments corresponding to that reference instruction, wherein the recoverable weight is obtained by applying a negative exponential mapping to the weighted sum of the risk penalty term, the target penalty term, and the uncertainty penalty term over all future deduction steps; the risk penalty term is obtained by applying a smooth nonlinear activation function to the difference between the value of the risk assessment function at the deduced state of the virtual training environment and a preset safety threshold; the target penalty term is obtained by applying the smooth nonlinear activation function to the difference between the value of the target assessment function at the deduced state of the virtual training environment and a preset target threshold; and the uncertainty penalty term is the mean of the local uncertainty magnitudes predicted by the uncertainty estimation network for the deduced states of the virtual training environment; summing the recoverable weights of all reference instructions in the reference instruction set to obtain the number of effective recoverable instructions; and logarithmically transforming the number of effective recoverable instructions to obtain the recoverable operation margin at the current moment; generating a recoverability decay index based on the decreasing trend of the recoverable operation margin over the time-series operation sequence; invoking a target matching agent to calculate the task achievement degree by fusing the recoverability decay index, and generating a comprehensive training score by combining the resource consumption data of the training process; and mapping the comprehensive training score to qualified or unqualified based on a comprehensive evaluation agent.
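
As a rough illustration of the recoverable operation margin in claim 1, the sketch below takes precomputed per-step values of the risk and target assessment functions and of the predicted uncertainty. Much of it is assumed rather than taken from the claim: softplus stands in for the smooth nonlinear activation function, the risk and target penalties are summed over steps, the target function is treated as larger-is-worse, a small constant is added before the logarithm, and all names and default weights are hypothetical.

```python
import math


def softplus(x: float) -> float:
    """Smooth approximation of max(0, x); assumed activation, numerically stable form."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def recoverable_weight(risk_values, target_values, uncertainties,
                       safety_threshold, target_threshold,
                       w_risk=1.0, w_target=1.0, w_unc=1.0):
    """Negative exponential mapping of the weighted penalty sum for one reference instruction."""
    risk_penalty = sum(softplus(r - safety_threshold) for r in risk_values)
    target_penalty = sum(softplus(t - target_threshold) for t in target_values)
    uncertainty_penalty = sum(uncertainties) / len(uncertainties)  # mean over steps
    total = w_risk * risk_penalty + w_target * target_penalty + w_unc * uncertainty_penalty
    return math.exp(-total)


def recoverable_margin(weights, eps=1e-6):
    """Log of the effective number of recoverable instructions (eps is an assumption)."""
    return math.log(sum(weights) + eps)


# Two hypothetical reference instructions, each rolled out for two future steps.
safe = recoverable_weight([0.1, 0.2], [0.1, 0.1], [0.05, 0.05], 0.5, 0.5)
risky = recoverable_weight([0.9, 1.2], [0.1, 0.1], [0.05, 0.05], 0.5, 0.5)
margin = recoverable_margin([safe, risky])
```

An instruction whose predicted states exceed the safety threshold receives a smaller weight, so it contributes less to the effective count and to the margin.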

2. The training behavior analysis method based on multi-agent collaboration according to claim 1, characterized in that performing time-series alignment and aggregation on the collected multi-source training data to generate a time-series operation sequence containing scene state data and operation instruction data includes: setting a sampling time interval, and dividing the entire training process into continuous discrete time windows based on the sampling time interval; for the scene state data, selecting the state value of the last environmental interaction event recorded within the current discrete time window as the state at that moment, and, if no record exists within the current discrete time window, using the state value of the previous discrete time window; for the operation instruction data, selecting the instruction value of the last operation event recorded within the current discrete time window as the instruction at that moment; based on the resource consumption data of the training process, calculating the cumulative number and duration of resource call events within the current discrete time window; based on the operation cost data of the training system, calculating the arithmetic mean of the system performance indicators within the current discrete time window; and aggregating the scene state data, operation instruction data, resource consumption data, and operation cost data within the same discrete time window and arranging them in chronological order to generate the time-series operation sequence.
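
The last-value selection and gap filling in claim 2 can be illustrated with the sketch below. The event representation as (timestamp, value) pairs, the function name, and the sample events are hypothetical; resource and performance aggregation are omitted for brevity.

```python
def align_last_value(events, interval, num_windows, fill_forward):
    """Return one value per discrete time window, taking the last event in each window.

    events: iterable of (timestamp_seconds, value) pairs.
    fill_forward: if True, an empty window inherits the previous window's value.
    """
    aligned = [None] * num_windows
    for timestamp, value in sorted(events, key=lambda e: e[0]):
        index = int(timestamp // interval)
        if 0 <= index < num_windows:
            aligned[index] = value  # later events in a window overwrite earlier ones
    if fill_forward:
        for i in range(1, num_windows):
            if aligned[i] is None:
                aligned[i] = aligned[i - 1]
    return aligned


# Scene states: window 1 has no record, so it inherits the value of window 0.
states = align_last_value([(0.2, "s0"), (0.8, "s1"), (2.5, "s2")], 1.0, 4, True)
# Instructions: the claim states no fill rule, so empty windows stay None here.
instructions = align_last_value([(0.1, "open"), (0.9, "close"), (3.2, "stop")], 1.0, 4, False)
```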

3. The training behavior analysis method based on multi-agent collaboration according to claim 2, characterized in that the behavior recognition intelligent agent includes: a state transition fitting network, used to construct the prediction part of the scene evolution model, wherein the state transition fitting network adopts a multilayer perceptron and is configured to receive the scene state data and the operation instruction data at the current moment and to output the predicted value of the scene state data at the next moment; and an uncertainty estimation network, used to construct the deviation magnitude part of the scene evolution model, wherein the uncertainty estimation network adopts a multilayer perceptron and is configured to receive the scene state data and the operation instruction data at the current moment and to output a non-negative local uncertainty magnitude prediction value through a smooth nonlinear activation function; wherein the state transition fitting network and the uncertainty estimation network are trained on a sample set of historical training data, with the respective training objectives of minimizing the sum of squared Euclidean distance errors of the state prediction and minimizing the sum of squared differences between the Euclidean distance errors and the predicted local uncertainty magnitudes.

4. The training behavior analysis method based on multi-agent collaboration according to claim 3, characterized in that invoking the behavior recognition intelligent agent to establish the scene evolution model, and deducing the state changes of the virtual training environment in multiple steps based on the time-series operation sequence to obtain the deduced state of the virtual training environment, includes: extracting the scene state data at the current moment from the time-series operation sequence and using it as the initial prediction state of the deduction process; obtaining the preset reference instruction set and, for each reference instruction in the set, performing iterative prediction using the state transition fitting network, wherein in the first prediction step the initial prediction state and the current reference instruction are input into the state transition fitting network to obtain the deduced state of the first step, and in the N-th prediction step, where N≥2, the deduced state obtained in the previous step and the current reference instruction are input into the state transition fitting network to obtain the deduced state of the current step; and repeating the iterative prediction for the preset number of deduction steps to generate the set of virtual training environment states at future moments corresponding to each reference instruction.
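
The iterative prediction in claim 4 amounts to feeding each predicted state back into the transition model. The sketch below shows that loop only; `toy_transition` is a hypothetical stand-in for the trained state transition fitting network, and the scalar instruction encoding is likewise assumed.

```python
from typing import Callable, List


def rollout(transition: Callable[[List[float], float], List[float]],
            initial_state: List[float], instruction: float, num_steps: int) -> List[List[float]]:
    """Apply the transition model repeatedly, holding the reference instruction fixed."""
    states = []
    state = initial_state
    for _ in range(num_steps):
        state = transition(state, instruction)  # previous output becomes the next input
        states.append(state)
    return states


def toy_transition(state: List[float], instruction: float) -> List[float]:
    """Hypothetical stand-in for the trained network: a simple linear drift."""
    return [s + 0.1 * instruction for s in state]


# One set of future states per reference instruction in a hypothetical instruction set.
reference_instructions = [1.0, -1.0]
futures = {u: rollout(toy_transition, [0.0, 1.0], u, 3) for u in reference_instructions}
```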

5. The training behavior analysis method based on multi-agent collaboration according to claim 4, characterized in that generating the recoverability decay index based on the decreasing trend of the recoverable operation margin over the time-series operation sequence includes: applying moving average filtering to the recoverable operation margin calculated at each time step of the time-series operation sequence to generate a smoothed recoverable operation margin sequence; for each moment in the smoothed recoverable operation margin sequence, calculating the difference obtained by subtracting the smoothed value of the current moment from the smoothed value of the previous moment, and, if the difference is greater than zero, using the difference as the single-step decay amount of the current moment, or, if the difference is less than or equal to zero, setting the single-step decay amount of the current moment to zero; and calculating the arithmetic mean of the single-step decay amounts at all moments of the time-series operation sequence, and using this arithmetic mean as the recoverability decay index.
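
A compact sketch of the decay index in claim 5 follows. The trailing form of the moving average, the default window length, and averaging over the moments that have a predecessor are assumptions; the function names are hypothetical.

```python
def moving_average(values, window):
    """Trailing moving average; early entries average over the samples available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def recoverability_decay_index(margins, window=3):
    """Mean of the positive drops (previous minus current) in the smoothed margin sequence."""
    smoothed = moving_average(margins, window)
    decays = [max(0.0, prev - cur) for prev, cur in zip(smoothed, smoothed[1:])]
    return sum(decays) / len(decays) if decays else 0.0


# With window=1 the smoothing is the identity: drops of 0.5 and 0.8, one rise counted as 0.
index_mixed = recoverability_decay_index([2.0, 1.5, 1.8, 1.0], window=1)
# A margin that never decreases gives a decay index of zero.
index_rising = recoverability_decay_index([1.0, 1.2, 1.4, 1.6])
```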

6. The training behavior analysis method based on multi-agent collaboration according to claim 5, characterized in that the target matching agent includes: an indicator statistics module, used to extract from the time-series operation sequence the key indicator data that reflect the completion status of the training task and to generate the corresponding training result indicator vector; a benchmark comparison module, used to calculate the weighted Euclidean distance between the training result indicator vector and a preset ideal target threshold vector; and a matching degree generation module, used to generate the task achievement degree based on the weighted Euclidean distance and the recoverability decay index; wherein the task achievement degree is generated as follows: the negative of the weighted Euclidean distance is exponentially mapped to obtain a basic matching score; the negative of the recoverability decay index is exponentially mapped to obtain a safety reduction coefficient; and the basic matching score is multiplied by the safety reduction coefficient to obtain the final task achievement degree.
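
The task achievement degree of claim 6 reduces to a product of two exponentials. In the sketch below the weighted Euclidean distance is taken in its common form, the square root of the weighted sum of squared differences, which is an assumption; the function name and sample vectors are hypothetical.

```python
import math


def task_achievement(result, target, weights, decay_index):
    """exp(-weighted distance) multiplied by exp(-decay index)."""
    distance = math.sqrt(sum(w * (r - t) ** 2 for w, r, t in zip(weights, result, target)))
    base_match = math.exp(-distance)        # basic matching score
    safety_factor = math.exp(-decay_index)  # safety reduction coefficient
    return base_match * safety_factor


# Hypothetical two-indicator result against an ideal target of [1.0, 1.0].
achievement = task_achievement([0.8, 0.6], [1.0, 1.0], [1.0, 1.0], 0.1)
# A perfect match with no decay gives the maximum value of 1.
perfect = task_achievement([1.0, 1.0], [1.0, 1.0], [1.0, 1.0], 0.0)
```

Because both factors lie between zero and one, the achievement degree is already bounded and a larger decay index can only reduce it.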

7. The training behavior analysis method based on multi-agent collaboration according to claim 6, characterized in that invoking the target matching agent to calculate the task achievement degree by fusing the recoverability decay index, and generating the comprehensive training score by combining the resource consumption data of the training process, includes: calculating the normalized time cost based on the total duration of the time-series operation sequence and a preset reference duration; calculating the normalized resource cost based on the cumulative total and the total duration of the resource consumption data in the time-series operation sequence; calculating the normalized system operation cost based on the average value of the operation cost data of the training system in the time-series operation sequence; and performing weighted aggregation of the task achievement degree, the recoverability decay index, the time cost, the resource cost, and the system operation cost to generate the comprehensive training score; wherein the weighted aggregation positively weights the logarithm of the task achievement degree and negatively weights the logarithm of the recoverability decay index, the logarithm of the time cost, the logarithm of the resource cost, and the logarithm of the system operation cost, and the weighted sum is then input into an S-shaped function to obtain the comprehensive training score with a value between zero and one.

8. The training behavior analysis method based on multi-agent collaboration according to claim 7, characterized in that mapping the comprehensive training score to qualified or unqualified based on the comprehensive evaluation agent includes: using the comprehensive evaluation agent to compare the passing benchmark value with the comprehensive training score, wherein the passing benchmark value is calculated by establishing a linear regression mapping between the comprehensive training scores and the manual assessment scores of the historical training sample set and then substituting the preset critical passing score into the mapping; obtaining the difference by subtracting the passing benchmark value from the comprehensive training score; if the difference is greater than or equal to zero, outputting a pass instruction by the comprehensive evaluation agent; and if the difference is less than zero, outputting a fail instruction by the comprehensive evaluation agent.

Citation Information

Patent Citations

Safety practical training method and system based on digital cloud platform
CN120598744A
New energy large model intelligent centralized control and data platform system
CN121643226A
