A talent incentive strategy generation method based on deep reinforcement learning
By constructing a multi-source heterogeneous data perception layer and an improved GAIL model, combined with a digital twin simulation environment, accurate incentive strategies are generated. This solves the problems of instability in incentive strategy generation and insufficient long-term strategy optimization in existing methods, achieving accurate simulation and controllable risk screening of incentive strategies, and improving the robustness and adaptability of the incentive model.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- THIRD INSTITUTE OF OCEANOGRAPHY STATE OCEANIC ADMINISTRATION
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Existing methods for generating talent incentive strategies based on deep reinforcement learning face challenges such as the implicit and unpredictable psychological state of employees, the dynamic evolution of the organizational collaboration environment, and the difficulty of balancing incentive costs and benefits. They struggle to achieve comprehensive modeling and long-term strategy optimization, and they lack pre-run verification in a virtual simulation environment, which undermines the credibility and usability of the resulting decisions.
A multi-source heterogeneous data perception layer is constructed, and an improved GAIL model based on PPO-penalty is adopted. Combined with digital twin simulation and strategy selection, precise incentive strategies are generated through state feature encoding, dual constraint factor control and multi-step forward return prediction, so as to achieve long-term return assessment and risk controllable screening.
It significantly improves the adaptability of incentive strategies and the long-term effectiveness of decision-making, solves the problems of strategy blindness and model training instability, and enhances the robustness of incentive models and their adaptability in complex management scenarios.
Figure CN122243431A_ABST
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and deep reinforcement learning, and in particular to a method for generating talent incentive strategies based on deep reinforcement learning.
Background Technology
[0002] Deep reinforcement learning technology, with its ability to adaptively optimize strategies and maximize long-term benefits in complex systems, has been widely applied in recent years in fields such as intelligent decision support, dynamic resource allocation, and behavior prediction, becoming an important development direction for the digital transformation of talent management. However, in practical applications, talent incentive management scenarios face many challenges, such as the implicit and unpredictable psychological state of employees, the dynamic evolution of organizational collaboration environments, and the difficulty in balancing incentive costs and benefits. The deployment effectiveness of incentive strategy generation methods based on deep reinforcement learning is still constrained by many factors.
[0003] Currently, most talent incentive methods rely on single-dimensional performance data input, making it difficult to fully utilize multi-source heterogeneous information such as the frequency of interaction in collaborative networks, workload intensity, and project milestone achievement, resulting in a lack of comprehensiveness in modeling talent career status. Some systems only use static rules or simple linear weighted models to generate incentive strategies, ignoring the cumulative benefits of incentive strategies over long-term time spans, the combined effects of individual risk preference differences, and the internal pressure transmission paths within the organization, thus limiting the adaptive optimization capabilities of incentive strategies. Furthermore, the generation process of incentive decisions lacks pre-testing and verification in virtual simulation environments, making it difficult to provide managers with deterministic predictions and risk assessments after strategy implementation, affecting the credibility and usability of incentive decisions.
[0004] Furthermore, the reward mechanisms of existing intelligent models in talent incentives are mostly based on the immediate feedback of a single time step, failing to combine tree search algorithms to forward extrapolate future benefits over multiple steps. This leads to the model potentially ignoring long-term performance growth and cost overruns when pursuing short-term retention rate improvements. It is difficult to adapt to scenarios where talent value is constantly changing and business objectives are dynamically adjusted, seriously affecting the model's practical value and stability in real work environments.
[0005] Therefore, how to provide a method for generating talent incentive strategies based on deep reinforcement learning is a problem that urgently needs to be solved by those skilled in the art.
Summary of the Invention
[0006] One objective of this invention is to propose a talent incentive strategy generation method based on deep reinforcement learning. This invention fully integrates key steps such as multi-source heterogeneous data perception, state feature encoding, improved GAIL model decision-making based on PPO-penalty, digital twin simulation and strategy selection, and constructs an intelligent incentive generation process with data tensor standardization, deep mining of latent features, policy distribution deviation constraints, collaborative control of first- and second-order dual constraint factors, and multi-step forward return prediction. This enables accurate simulation, long-term benefit evaluation, and risk-controllable selection of talent incentive strategies in dynamic organizational environments. This invention possesses advantages such as expert-like strategy generation, stable updates under dual constraints, strong forward-looking simulation, and balanced risk-reward considerations. It can significantly improve the adaptability of incentive strategies in complex management scenarios, the robustness of long-term decision-making and incentive models, thereby effectively solving problems such as blind strategy exploration, unstable model training, and neglect of long-term cumulative returns in existing methods.
[0007] A method for generating talent incentive strategies based on deep reinforcement learning according to an embodiment of the present invention includes the following steps:
[0008] S1. Construct a multi-source heterogeneous data perception layer for research institutes, collect behavioral and performance data of talents, and fuse and map them into a high-dimensional tensor state space to form a real-time state vector.
[0009] S2. Establish a dynamic reward mechanism, define the agent's action space as a discrete incentive strategy, construct a nonlinear instantaneous reward function, and quantitatively evaluate the effectiveness of the incentive strategy in a short period.
[0010] S3. Using the preset historical incentive decision sequence of outstanding managers as prior knowledge, an improved GAIL model based on PPO-penalty is adopted to generate simulated incentive strategies and calculate the trust region constraint penalty term. A dual constraint factor is introduced when the total objective function is constructed in combination with the immediate reward function, so that the morphological distribution and update direction of the incentive strategy are both regulated, and pre-training and iterative optimization are carried out.
[0011] S4. Input the real-time state vector into the pre-trained improved PPO-penalty-based GAIL model, mine the centrality features and pressure transmission paths of talents in the organizational collaboration network, fuse behavioral data and performance data, output state codes and preliminary policy probability distributions, and sample to generate candidate policy sets.
[0012] S5. Input the state code into the digital twin simulation environment, combine it with the initial strategy probability distribution, and use the Monte Carlo tree search algorithm to perform multi-step forward inference to simulate the future behavior trajectory and performance change curve of talents after receiving different incentive strategies, and predict the cumulative reward return.
[0013] S6. Based on the predicted cumulative reward return and the preset risk threshold, construct screening indicators to select the optimal incentive strategy from the candidate strategy set that maximizes long-term returns and has controllable risks.
[0014] S7. Based on the selected optimal incentive strategy, apply changes or modifications to employee compensation incentives and training policies.
[0015] Optionally, S1 specifically includes:
[0016] S11. Call the data acquisition interface to read the talent's behavioral data and performance data. The behavioral data includes the frequency of interaction in the collaborative network and the length of service. The performance data includes the KPI completion rate and the achievement of project milestones. Map the behavioral data and performance data to a pre-defined multi-dimensional tensor structure and fill each data item into the corresponding dimension position of the tensor.
[0017] S12. Perform linear normalization on the filled multidimensional tensor structure to scale the values to a preset range, generate a high-dimensional tensor state space, and arrange the high-dimensional tensor state space into a one-dimensional data sequence in a preset order to form a real-time state vector.
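Steps S11 and S12 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the feature names, their raw value ranges, and the target range [0, 1] are all assumptions, since the text only refers to "preset" ranges and orders.

```python
from typing import Dict, List, Tuple

# Hypothetical feature order and raw value ranges (the source only says "preset").
FEATURE_RANGES: Dict[str, Tuple[float, float]] = {
    "interaction_frequency": (0.0, 50.0),  # assumed unit: interactions per month
    "years_of_service": (0.0, 40.0),
    "kpi_completion_rate": (0.0, 1.0),
    "milestones_achieved": (0.0, 10.0),
}


def build_state_vector(records: List[Dict[str, float]]) -> List[float]:
    """Min-max scale every feature to [0, 1] and flatten row by row (S11-S12)."""
    state: List[float] = []
    for record in records:
        for name, (low, high) in FEATURE_RANGES.items():
            scaled = (record[name] - low) / (high - low)
            state.append(min(1.0, max(0.0, scaled)))  # clip out-of-range readings
    return state


example = build_state_vector([
    {"interaction_frequency": 25.0, "years_of_service": 10.0,
     "kpi_completion_rate": 0.8, "milestones_achieved": 5.0},
])
```

Flattening in a fixed feature order keeps each position of the state vector tied to the same quantity across time steps, which is what a downstream network would rely on.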
[0018] Optionally, S2 specifically includes:
[0019] S21. A pre-defined action space set for intelligent agents is used to convert salary adjustment range, achievement reward ratio, job promotion opportunities and graduate and postdoctoral quota allocation into pre-defined unique integer identifiers, and the integer identifiers are stored in the action space set.
[0020] S22. Read the talent status records in the talent management system. If the talent is employed, set the talent retention rate to 1. If the talent has left, set the talent retention rate to 0. Read the performance appraisal score of the current period and the performance appraisal score of the previous period, calculate the difference between the two and record it as the performance gain.
[0021] S23. Read the corresponding salary adjustment amount, performance bonus amount, cost increase amount due to job promotion and training expense amount from the financial database, and calculate the sum as the total incentive cost.
[0022] S24. Multiply the talent retention rate and performance gain by the preset first weight coefficient and second weight coefficient respectively and sum them to obtain the total benefit. Subtract the total incentive cost from the total benefit to calculate the net value of the immediate reward. Input the net value of the immediate reward into the Sigmoid nonlinear activation function for calculation and output the immediate reward value to establish the immediate reward function.
[0023] S25. When the preset short period of time after the implementation of the incentive strategy is reached, read the employee status and performance appraisal score of the talent management system, recalculate the performance gain and total incentive cost, and input the recalculated talent retention rate, performance gain and total incentive cost into the instant reward function, and output the specific value as the quantitative evaluation result.
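The immediate reward of steps S22 to S24 can be written compactly. The sketch below is illustrative only: the default weight values and the `cost_scale` factor (used to bring monetary costs onto the same scale as the benefit terms) are assumptions that do not appear in the text.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def immediate_reward(retained: bool, score_now: float, score_prev: float,
                     costs: Sequence[float], w_retention: float = 1.0,
                     w_performance: float = 0.5, cost_scale: float = 1.0) -> float:
    """Sigmoid of (weighted benefit - total incentive cost), following S22-S24."""
    retention = 1.0 if retained else 0.0                  # S22: 1 if employed, 0 if left
    performance_gain = score_now - score_prev             # S22: period-over-period difference
    total_cost = cost_scale * sum(costs)                  # S23: hypothetical rescaling
    total_benefit = w_retention * retention + w_performance * performance_gain
    return sigmoid(total_benefit - total_cost)            # S24
```

For instance, `immediate_reward(True, 0.0, 0.0, [1.0])` has a net value of zero and therefore returns 0.5. Because the Sigmoid saturates, very large raw costs push the reward toward 0 regardless of the benefit, which is one reason some rescaling of costs is likely needed in practice.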
[0024] Optionally, S3 specifically includes:
[0025] S31. Call the database interface to read the pre-stored historical incentive decision sequence of outstanding managers, decompose each incentive decision record in the incentive decision sequence into state data and action data, combine them to generate expert state-action pairs, and input the expert state-action pairs into the discriminator network as positive sample labels.
[0026] S32. Input the real-time state vector of the current multidimensional tensor state space into the generator network. In each hidden layer of the preset multi-layer neural network, multiply by the preset weight matrix, add the bias, and apply the ReLU activation function. Then apply the Softmax function to output the predicted probability value of each action dimension, and combine these values to form the simulated incentive strategy.
[0027] S33. Read the incentive decision sequence of historical outstanding managers and count the frequency of each action. Divide that frequency by the total frequency of actions in the current state to obtain the actual occurrence probability of the corresponding action in the expert state-action pair. Input this probability into the discriminator network together with the predicted probability value of the simulated incentive strategy, and calculate the cross-entropy loss between the two. Record the cross-entropy loss value as the strategy distribution deviation, and use the strategy distribution deviation as the trust region constraint penalty term.
[0028] S34. Construct the overall objective function of the improved GAIL model based on PPO-penalty as the sum of an environmental reward term and a penalty term. The environmental reward term takes the output value of the immediate reward function, and the penalty term takes the trust region constraint penalty term.
[0029] S35. Extract the current preset weight parameter vector of the generator network, calculate the first and second derivatives of the total objective function formula with respect to the weight parameter vector of the generator network, obtain the loss function by calculating the expected value of the second derivative matrix, calculate the first derivative matrix of the loss function with respect to the weight parameter vector to obtain the Fisher information matrix, perform eigenvalue decomposition on the Fisher information matrix, and record the largest eigenvalue as the second-order curvature constraint parameter.
[0030] S36. Read the predicted probability value of each action output by the generator network and the actual occurrence probability of the corresponding action in the expert state-action pair. For each action, calculate the ratio of the actual occurrence probability to the predicted probability value and take the natural logarithm of that ratio. Multiply the logarithm by the actual occurrence probability to obtain the relative entropy component of the action. Sum the relative entropy components of all actions to obtain the KL divergence value, and record it as the first-order distribution constraint parameter.
[0031] S37. Set the first preset coefficient and the second preset coefficient, multiply the first-order distribution constraint parameter by the first preset coefficient to obtain the first-order constraint component, multiply the second-order curvature constraint parameter by the second preset coefficient to obtain the second-order constraint component, and add the first-order constraint component and the second-order constraint component to obtain the composite constraint value.
[0032] S38. Calculate the original gradient value of the overall objective function with respect to the weight parameters of the generator network. Divide the original gradient value by the composite constraint value to obtain the corrected gradient value. Use the corrected gradient value to perform addition and subtraction operations on the weight parameters of the generator network to complete one parameter update. Repeat the forward propagation and parameter update steps until the preset number of iterations is reached to complete the pre-training and iterative optimization.
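The dual-constraint gradient correction of steps S35 to S38 can be illustrated on a single categorical policy. The sketch below makes several simplifying assumptions that are not in the text: the KL divergence is taken in its standard form KL(expert || predicted); the Fisher information matrix is taken with respect to the Softmax logits, where it has the closed form diag(p) - p pᵀ, rather than with respect to the full network weights; the largest eigenvalue is found by power iteration instead of a full decomposition; and a step size `step` and a small `eps` guard are added.

```python
import math
from typing import List, Tuple


def kl_divergence(expert: List[float], predicted: List[float], eps: float = 1e-12) -> float:
    """KL(expert || predicted) = sum_a p_E(a) * ln(p_E(a) / p_G(a))  (S36)."""
    return sum(p * math.log(p / max(q, eps)) for p, q in zip(expert, predicted) if p > 0.0)


def largest_eigenvalue(matrix: List[List[float]], iterations: int = 200) -> float:
    """Power iteration for a symmetric positive semi-definite matrix (stand-in for S35)."""
    n = len(matrix)
    # Non-uniform start: the all-ones vector lies in the null space of diag(p) - p p^T.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(v, w))  # Rayleigh quotient of the unit vector v


def corrected_update(weights: List[float], gradient: List[float],
                     expert: List[float], predicted: List[float],
                     c1: float = 1.0, c2: float = 1.0,
                     step: float = 0.1, eps: float = 1e-8) -> Tuple[List[float], float]:
    """Divide the raw gradient by the composite constraint value and take one ascent step."""
    fisher = [[(p if i == j else 0.0) - p * q for j, q in enumerate(predicted)]
              for i, p in enumerate(predicted)]
    composite = c1 * kl_divergence(expert, predicted) + c2 * largest_eigenvalue(fisher)  # S37
    corrected = [g / (composite + eps) for g in gradient]                                # S38
    return [w + step * g for w, g in zip(weights, corrected)], composite
```

With `expert = predicted = [0.5, 0.5]`, the KL term vanishes and the largest Fisher eigenvalue is 0.5, so the raw gradient is doubled before the step. When the policy drifts far from the expert distribution or the curvature is high, the composite value grows and the effective step shrinks.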
[0033] Optionally, S4 specifically includes:
[0034] S41. Input the real-time state vector into the pre-trained improved GAIL model based on PPO-penalty, read the collaborative network interaction frequency collected in step S1, construct the organizational collaborative network adjacency matrix with talent identifiers as nodes and interaction frequency values as edge weights, calculate the algebraic sum of all values in each row of the adjacency matrix, obtain the total connection weight between each node and other nodes, compare the total connection weight with the preset centrality threshold, and if the value is greater than the threshold, mark it as a centrality feature.
[0035] S42. In the adjacency matrix of the organizational collaboration network, identify the directed edges corresponding to the non-zero elements, traverse and search along the direction of the directed edges from the starting node to the ending node, record all node sequences with path length greater than the preset number of layers, count the nodes in the node sequence whose in-degree value is greater than the preset convergence value, and sum the in-degree value and the node layer number by weight to obtain the pressure transmission path.
[0036] S43. Input the working years data collected in step S1 into the preset occupational status judgment function, calculate the ratio of working years to standard years, determine the numerical label of occupational status type according to the preset interval of the ratio value, read the historical data sequence of project milestone achievement, calculate the variance of the sequence value, and mark the variance value as risk preference tendency.
[0037] S44. Read the interaction frequency value of the collaborative network. If the interaction frequency value is less than the preset lower limit and the working years value is greater than the preset upper limit, then assign the implicit turnover probability to 1; otherwise, assign it to 0. Concatenate the numerical label of the occupational status type, the variance value, and the assignment of the implicit turnover probability, and output the status code.
[0038] S45. Input the state code into the generator network, process it through the preset fully connected layer matrix operation and activation function, and output the predicted probability value corresponding to each integer identifier in the action space. Sort all the predicted probability values in descending order, take the target probability values in the first preset number of positions, read the integer identifier corresponding to each target probability value, convert each integer identifier into an incentive strategy, and combine them to form the candidate strategy set.
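Parts of step S4 reduce to simple array operations. The sketch below covers the row-sum centrality test (S41), the three-part state code (S43 and S44), and the top-k candidate selection (S45); it omits the pressure transmission path of S42. The interval boundaries for the occupational status label and the use of population variance are assumptions, since the text only calls these "preset" and "variance".

```python
from statistics import pvariance
from typing import List


def central_nodes(adjacency: List[List[float]], threshold: float) -> List[int]:
    """S41: nodes whose total connection weight (row sum) exceeds the threshold."""
    return [i for i, row in enumerate(adjacency) if sum(row) > threshold]


def state_code(years: float, standard_years: float, milestone_history: List[float],
               interaction_freq: float, freq_lower: float, years_upper: float) -> List[float]:
    """S43-S44: [status label, risk preference (variance), implicit turnover flag]."""
    ratio = years / standard_years
    status_label = 0 if ratio < 0.5 else (1 if ratio < 1.0 else 2)  # hypothetical intervals
    risk_preference = float(pvariance(milestone_history))
    implicit_turnover = 1 if (interaction_freq < freq_lower and years > years_upper) else 0
    return [float(status_label), risk_preference, float(implicit_turnover)]


def top_k_actions(probabilities: List[float], k: int) -> List[int]:
    """S45: integer identifiers of the k most probable actions, in descending order."""
    ranked = sorted(range(len(probabilities)), key=lambda a: probabilities[a], reverse=True)
    return ranked[:k]
```

Note that the implicit turnover probability defined in S44 is a binary flag rather than a graded probability: it fires only when low interaction coincides with long tenure.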
[0039] Optionally, S5 specifically includes:
[0040] S51. Input the state code into the initial state register of the digital twin simulation environment, set the simulation time step counter to zero, read the candidate strategy set, use each integer identifier contained in the candidate strategy set as a child node of the root node of the Monte Carlo tree search algorithm, initialize the access count variable of each child node to 0, and initialize the reward accumulation variable of each child node to 0.
[0041] S52. For each child node, read the integer identifier corresponding to the child node, query the action space set in step S21, convert the integer identifier into specific numerical parameters of salary adjustment range, achievement reward ratio, job promotion opportunity and postgraduate and postdoctoral quota allocation, input the specific numerical parameters into the input interface of the digital twin simulation environment, call the state update function of the digital twin simulation environment, calculate the state data of the next moment after the execution of the incentive strategy, extract the talent's employment status and performance appraisal score from the state data of the next moment, and use the instant reward function formula in step S24 to calculate the single-step reward value of the first moment.
[0042] S53. Starting from the obtained state data of the next moment, a random number between 0 and 1 is generated, and an action is randomly selected from the action space set as the subsequent action. The subsequent action is input into the digital twin simulation environment to obtain new state data. The reward value in the new state is calculated using the instant reward function, and the simulation time step counter is incremented by one.
[0043] S54. Repeatedly execute the operations of randomly selecting actions, inputting the environment, calculating rewards and updating counters until the simulation time step counter equals the preset maximum step size. Then, add up all the single-step reward values from the first moment to the maximum step size to obtain the total reward value for a single simulation.
[0044] S55. Repeat the random selection and extrapolation calculation process of step S54 until the number of repetitions reaches the preset total number of simulations. Sum the total reward values of the single simulations to obtain the total reward accumulation value. Divide the total reward accumulation value by the total number of simulations, and record the resulting value as the predicted cumulative reward return.
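As described, steps S51 to S55 amount to repeated random rollouts from each candidate action at the root, averaged over a fixed number of simulations. The following sketch shows that loop under stated assumptions: `toy_twin` is a purely hypothetical stand-in for the digital twin's state update function, and its one-dimensional "morale" state and reward are invented for illustration.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

State = Tuple[float, ...]
StepFn = Callable[[State, int, random.Random], Tuple[State, float]]


def toy_twin(state: State, action: int, rng: random.Random) -> Tuple[State, float]:
    """Hypothetical environment: larger action ids raise morale; reward equals morale."""
    (morale,) = state
    morale = min(1.0, max(0.0, morale + 0.05 * action + rng.uniform(-0.02, 0.02)))
    return (morale,), morale


def rollout_totals(initial_state: State, candidate: int, actions: List[int],
                   step: StepFn, max_steps: int, n_simulations: int,
                   rng: random.Random) -> List[float]:
    """Total reward of each simulation that starts with `candidate` (S52-S55)."""
    totals: List[float] = []
    for _ in range(n_simulations):
        state, total = step(initial_state, candidate, rng)      # S52: first step is the candidate
        for _ in range(max_steps - 1):                          # S53-S54: random follow-up actions
            state, reward = step(state, rng.choice(actions), rng)
            total += reward
        totals.append(total)
    return totals


rng = random.Random(0)
actions = [0, 1, 2]
predicted = {a: mean(rollout_totals((0.5,), a, actions, toy_twin, 5, 200, rng))
             for a in actions}
```

Keeping the per-simulation totals, rather than only their mean, is useful because the later risk screening needs their spread as well.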
[0045] Optionally, S6 specifically includes:
[0046] S61. Read the predicted cumulative reward return, compare all the predicted cumulative reward return values, select the predicted cumulative reward return with the largest value, record the corresponding incentive strategy as the maximum return candidate strategy, extract the calculation results of each calculation when calculating the total reward value of a single simulation in step S54, calculate the standard deviation of all the total reward values of a single simulation for each incentive strategy, and use the standard deviation value as a risk volatility indicator.
[0047] S62. Compare the risk volatility index with the preset risk threshold, eliminate incentive strategies whose risk volatility index value is greater than the preset risk threshold, and retain incentive strategies whose risk volatility index value is less than or equal to the preset risk threshold, and record them as the set of strategies to be screened.
[0048] S63. Find the target strategy with the largest predicted cumulative reward value from the set of strategies to be screened. If there are multiple target strategies with the same value, randomly select one of them and mark the selected target strategy as the optimal incentive strategy.
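The screening in steps S61 to S63 can be sketched as a filter followed by an argmax. In this illustration the population standard deviation is used as the risk volatility indicator, and returning `None` when no strategy passes the threshold is an assumption, because the text does not say what happens in that case.

```python
import random
from statistics import mean, pstdev
from typing import Dict, List, Optional


def select_strategy(simulated_totals: Dict[int, List[float]], risk_threshold: float,
                    rng: random.Random) -> Optional[int]:
    """Pick the highest-return strategy among those whose volatility is within the threshold."""
    stats = {a: (mean(t), pstdev(t)) for a, t in simulated_totals.items()}       # S61
    admissible = {a: m for a, (m, s) in stats.items() if s <= risk_threshold}    # S62
    if not admissible:
        return None  # assumed behaviour; not specified in the source
    best = max(admissible.values())
    ties = sorted(a for a, m in admissible.items() if m == best)
    return rng.choice(ties)                                                      # S63


choice = select_strategy(
    {0: [1.0, 1.0, 1.0], 1: [0.0, 5.0, 10.0], 2: [2.0, 2.0, 2.0]},
    risk_threshold=1.0,
    rng=random.Random(0),
)
```

In this example strategy 1 has the highest mean return (5.0) but is eliminated for its volatility, so strategy 2 is selected. This shows how the risk threshold can override the maximum return candidate recorded in S61.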
[0049] Optionally, S7 specifically includes: applying changes or modifications to employee compensation incentives and training policies based on the selected optimal incentive strategy.
[0050] The beneficial effects of this invention are:
[0051] This invention addresses the challenges of cold start difficulties, unstable policy updates, and susceptibility to local optima in traditional reinforcement learning for talent incentives by constructing a multi-source heterogeneous data perception layer for research institutions and deploying an improved PPO-penalty-based GAIL model. It utilizes the incentive decision sequences of historically successful managers as prior knowledge, generates simulated incentive policies through a generator, and calculates the policy distribution deviation from the prior knowledge. The trust region constraint penalty term of the improved PPO-penalty-based GAIL model is combined with the immediate reward function to construct the overall objective function. During model training, a dual constraint factor is introduced: a first-order distribution constraint based on KL divergence and a second-order curvature constraint based on the eigenvalues of the Fisher information matrix. This dual constraint factor corrects the generator network's original gradient, adaptively adjusting the parameter update step size and direction to ensure the morphological distribution of the incentive policy conforms to expert experience and the update process remains stable. The output candidate policy set is input into a digital twin simulation environment for Monte Carlo tree search forward inference to predict cumulative reward returns and select the optimal policy based on a risk threshold. Ultimately, this approach enables the effective inheritance of expert management wisdom and iterative optimization of incentive strategies, thereby significantly improving model convergence speed, the robustness of strategy output, and decision-making adaptability in complex organizational environments.
Attached Figure Description
[0052] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:
[0053] Figure 1 is a flowchart of a talent incentive strategy generation method based on deep reinforcement learning proposed in this invention.
[0054] Figure 2 is a flowchart of the strategy distribution bias calculation and dual constraint factor gradient correction for the GAIL model based on the improved PPO-penalty proposed in this invention.
[0055] Figure 3 is a flowchart of the multi-step forward extrapolation and risk volatility index screening process for digital twin simulation based on Monte Carlo tree search proposed in this invention.
Detailed Implementation
[0056] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.
[0057] Referring to Figures 1-3, a method for generating talent incentive strategies based on deep reinforcement learning includes the following steps:
[0058] S1. Construct a multi-source heterogeneous data perception layer for research institutes, and synchronously collect behavioral data and performance data of talents. The behavioral data includes the frequency of interaction in the collaborative network and the length of service. The performance data includes the KPI completion rate and the achievement of project milestones. These data are then fused and mapped into a high-dimensional tensor state space to form a real-time state vector.
[0059] S2. Establish a dynamic reward mechanism and define the action space of the agent as a set of discrete incentive strategies, which include salary adjustment range, achievement reward ratio, job promotion opportunities and allocation of graduate and postdoctoral positions. Construct a nonlinear immediate reward function that integrates talent retention rate and performance gain, and quantitatively evaluate the effectiveness of the incentive strategy in the short period.
[0060] S3. Using the preset historical incentive decision sequence of outstanding managers as prior knowledge, an improved GAIL model based on PPO-penalty is adopted. A generator generates simulated incentive strategies, and a discriminator calculates the policy distribution deviation between the simulated incentive strategies and the prior knowledge as a trust region constraint penalty term. When constructing the total objective function in combination with the immediate reward function, a dual constraint factor is introduced, which includes a first-order distribution constraint based on KL divergence and a second-order curvature constraint based on the eigenvalues of the Fisher information matrix. In this way the morphological distribution and update direction of the incentive strategy are both regulated, and pre-training and iterative optimization are performed.
[0061] S4. Input the real-time state vector into the pre-trained improved GAIL model based on PPO-penalty, and mine the centrality features and pressure transmission paths of talents in the organizational collaboration network through a temporal graph attention network. Integrate and process behavioral data and performance data to deeply encode the current career state type, risk preference and implicit turnover probability of talents. Output the state code containing multi-dimensional implicit features and the preliminary strategy probability distribution based on the current state. Sample and generate a candidate strategy set containing a preset number of incentive actions.
[0062] S5. Input the state code into the digital twin simulation environment, combine it with the initial strategy probability distribution, and use the Monte Carlo tree search algorithm to perform multi-step forward inference to simulate the future behavior trajectory and performance change curve of talents after receiving different incentive strategies, and predict the cumulative reward return.
[0063] S6. Based on the predicted cumulative reward return and the preset risk threshold, construct screening indicators to select the optimal incentive strategy from the candidate strategy set that maximizes long-term returns and has controllable risks.
[0064] S7. Based on the selected optimal incentive strategy, apply changes or modifications to employee compensation incentives and training policies.
[0065] This invention significantly improves the accuracy of talent incentives and the intelligence level of human resource management in research institutions. By constructing a multi-source heterogeneous data perception layer for research institutions, it achieves deep fusion and mapping of behavioral and performance data, enhancing the ability to express employees' professional status and risk propensity. An improved GAIL model based on PPO-penalty is introduced, utilizing the decision sequences of historically high-performing managers as prior knowledge and combining dual constraint factors to simultaneously regulate the morphological distribution and update direction of incentive strategies, effectively solving the instability and bias problems in traditional strategy learning. By mining the centrality characteristics and pressure transmission paths of talent in the collaborative network through a temporal graph attention network, it can deeply identify implicit turnover probabilities. Combining a digital twin simulation environment with a Monte Carlo tree search algorithm for multi-step forward extrapolation, it predicts future performance changes and cumulative returns, selecting from the candidate strategies the optimal strategy that maximizes long-term benefits while maintaining controllable risk. This method realizes the transformation of incentive strategies from post-event response to pre-event prediction, significantly improving talent retention and organizational efficiency, reducing incentive costs and the risk of core talent loss, and possessing extremely high practical value.
[0066] In this embodiment, S1 specifically includes:
[0067] S11. Call the data acquisition interface to read the talent's behavioral data and performance data. The behavioral data includes the frequency of interaction in the collaborative network and the length of service. The performance data includes the KPI completion rate and the achievement of project milestones. Map the behavioral data and performance data to a pre-defined multi-dimensional tensor structure and fill each data item into the corresponding dimension position of the tensor.
[0068] S12. Perform linear normalization on the filled multidimensional tensor structure to scale the values to a preset range, generate a high-dimensional tensor state space, and arrange the high-dimensional tensor state space into a one-dimensional data sequence in a preset order to form a real-time state vector.
[0069] In this embodiment, S2 specifically includes:
[0070] S21. Pre-define the action space set of the agent: convert the salary adjustment range, achievement reward ratio, job promotion opportunities, and graduate student and postdoctoral quota allocation into pre-defined unique integer identifiers, and store the integer identifiers in the action space set.
[0071] S22. Read the talent status records in the talent management system. If the talent is employed, set the talent retention rate to 1. If the talent has left, set the talent retention rate to 0. Read the performance appraisal score of the current period and the performance appraisal score of the previous period, calculate the difference between the two and record it as the performance gain.
[0072] S23. Read the corresponding salary adjustment amount, performance bonus amount, cost increase due to job promotion and training expense amount from the financial database, and calculate their sum as the total incentive cost.
[0073] S24. Multiply the talent retention rate and performance gain by the preset first weight coefficient and second weight coefficient respectively and sum them to obtain the total benefit. Subtract the total incentive cost from the total benefit to calculate the net value of the immediate reward. Input the net value of the immediate reward into the Sigmoid nonlinear activation function for calculation and output the immediate reward value to establish the immediate reward function.
[0074] S25. When the preset short period after the implementation of the incentive strategy has elapsed, read the talent's employment status and performance appraisal score from the talent management system, recalculate the talent retention rate, performance gain and total incentive cost, input the recalculated values into the immediate reward function, and output the resulting value as the quantitative evaluation result.
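The immediate reward function of steps S22 to S24 can be sketched as follows. The weight values and the assumption that the total incentive cost has already been expressed in the same normalized units as the benefit terms are illustrative only; the embodiment leaves the weight coefficients as presets.

```python
import math

def immediate_reward(retained, perf_gain, total_cost,
                     w_retention=1.0, w_performance=0.1):
    """Sigmoid of (weighted benefit minus total incentive cost).

    retained:   1 if the talent is still employed, 0 otherwise (S22)
    perf_gain:  current appraisal score minus previous score (S22)
    total_cost: summed incentive cost (S23), assumed pre-normalized
    The weights are hypothetical presets, not values from the source.
    """
    total_benefit = w_retention * retained + w_performance * perf_gain
    net_value = total_benefit - total_cost
    return 1.0 / (1.0 + math.exp(-net_value))

# Hypothetical case: talent retained, score up by 4 points, normalized cost 0.6
reward = immediate_reward(retained=1, perf_gain=4.0, total_cost=0.6)
```

Because the Sigmoid output lies strictly between 0 and 1, rewards from talents with very different cost levels remain comparable, although unscaled monetary amounts would saturate the function.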
[0075] In this embodiment, S3 specifically includes:
[0076] S31. Call the database interface to read the pre-stored historical incentive decision sequence of outstanding managers, decompose each incentive decision record in the incentive decision sequence into state data and action data, combine them to generate expert state-action pairs, and input the expert state-action pairs into the discriminator network as positive sample labels.
[0077] S32. Input the real-time state vector of the current multidimensional tensor state space into the generator network. In each hidden layer of the preset multi-layer neural network, perform the preset matrix multiplication, add the bias, and apply the ReLU activation function. Then apply the Softmax function to output the predicted probability value of each action dimension, and combine these values to form a simulated incentive strategy.
[0078] S33. Read the incentive decision sequence of historical outstanding managers, count the frequency of each action data, divide the frequency by the total frequency of actions in the current state and normalize it to obtain the actual occurrence probability of the corresponding action in the expert state-action pair, and input it into the discriminator network at the same time as the predicted probability value of the simulated incentive strategy, calculate the cross-entropy loss function value between the two, mark the cross-entropy loss function value as the strategy distribution deviation, and use the strategy distribution deviation as the trust domain constraint penalty term.
[0079] S34. Construct an improved overall objective function formula for the GAIL model based on PPO-penalty. In the overall objective function formula, set the sum of environmental reward terms and penalty terms. The environmental reward terms are substituted with the output value of the immediate reward function, and the penalty terms are substituted with the trust domain constraint penalty terms.
[0080] S35. Extract the current preset weight parameter vector of the generator network, calculate the first and second derivatives of the total objective function formula with respect to the weight parameter vector of the generator network, obtain the loss function by calculating the expected value of the second derivative matrix, calculate the first derivative matrix of the loss function with respect to the weight parameter vector to obtain the Fisher information matrix, perform eigenvalue decomposition on the Fisher information matrix, obtain the eigenvalue with the largest value among all eigenvalues, and record the largest eigenvalue as the second-order curvature constraint parameter.
[0081] S36. Read the predicted probability value of each action output by the generator network and the actual occurrence probability of the corresponding action in the expert state-action pair. For each action, calculate the ratio of the actual occurrence probability to the predicted probability value and take the natural logarithm of the ratio. Multiply the logarithm by the actual occurrence probability to obtain the relative entropy component of the action. Sum the relative entropy components of all actions to obtain the KL divergence value, and record it as the first-order distribution constraint parameter.
[0082] S37. Set the first preset coefficient and the second preset coefficient, multiply the first-order distribution constraint parameter by the first preset coefficient to obtain the first-order constraint component, multiply the second-order curvature constraint parameter by the second preset coefficient to obtain the second-order constraint component, and add the first-order constraint component and the second-order constraint component to obtain the composite constraint value.
[0083] S38. Calculate the original gradient value of the overall objective function with respect to the weight parameters of the generator network. Divide the original gradient value by the composite constraint value to obtain the corrected gradient value. Use the corrected gradient value to perform addition and subtraction operations on the weight parameters of the generator network to complete one parameter update. Repeat the forward propagation and parameter update steps until the preset number of iterations is reached to complete the pre-training and iterative optimization.
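The probability calculations of steps S33 and S36 can be sketched as below. The sketch uses the standard definitions of cross-entropy and KL divergence between the expert action distribution and the generator output; the function names, the example action sequence and the small constant `eps` are assumptions for illustration.

```python
import numpy as np
from collections import Counter

def expert_action_probs(expert_actions, n_actions):
    """S33: empirical action frequencies observed for one state."""
    counts = Counter(expert_actions)
    freq = np.array([counts.get(a, 0) for a in range(n_actions)], dtype=float)
    return freq / freq.sum()

def cross_entropy(p_expert, q_policy, eps=1e-12):
    """S33: strategy distribution deviation, -sum(p * ln q)."""
    return float(-np.sum(p_expert * np.log(q_policy + eps)))

def kl_divergence(p_expert, q_policy, eps=1e-12):
    """S36: sum(p * ln(p / q)) over actions the expert actually took."""
    mask = p_expert > 0  # terms with p = 0 contribute nothing
    ratio = p_expert[mask] / (q_policy[mask] + eps)
    return float(np.sum(p_expert[mask] * np.log(ratio)))

# Hypothetical expert decisions (integer action identifiers) in one state
p_expert = expert_action_probs([0, 0, 1, 2, 0, 1], n_actions=4)
# Hypothetical Softmax output of the generator network
q_policy = np.array([0.4, 0.3, 0.2, 0.1])
ce = cross_entropy(p_expert, q_policy)
kl = kl_divergence(p_expert, q_policy)
```

The two quantities differ only by the entropy of the expert distribution, which does not depend on the generator, so both shrink as the generator output approaches the expert frequencies.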
[0084] This embodiment introduces an improved GAIL model based on PPO-penalty, combined with a dual constraint mechanism, to achieve pre-training and iterative optimization of incentive strategies. The decision sequences of historically high-performing managers are transformed into expert state-action pairs that serve as positive samples for the discriminator. The generator network outputs simulated incentive strategies, together with their predicted probability values, from the real-time state vector. The cross-entropy loss quantifies the strategy distribution deviation, which serves as the trust domain constraint penalty term, and a total objective function incorporating the environmental reward term and the penalty term is constructed. The second-order curvature constraint is obtained through eigenvalue decomposition of the Fisher information matrix, the first-order distribution constraint is calculated from the KL divergence, and the two are combined into a composite constraint value used to correct the gradient. By regulating the morphological distribution and update direction of incentive strategies, the method addresses instability during strategy training, keeps the generated incentive strategies consistent with expert decision-making logic, and improves the model's convergence and long-term decision-making performance in complex human resource scenarios.
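The second-order curvature constraint parameter can be illustrated with a simplified sketch. It assumes a single linear layer followed by Softmax in place of the multi-layer generator network, and it computes the Fisher information matrix as the expected outer product of the score function, which is the standard definition rather than the exact derivation sequence of step S35. The weights, state and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the maximum for numerical stability
    e = np.exp(z)
    return e / e.sum()

def fisher_max_eigenvalue(W, state):
    """Largest eigenvalue of the Fisher matrix of a linear-Softmax policy.

    W has shape (n_actions, n_features); the Fisher matrix is taken with
    respect to the flattened weights for one fixed state.
    """
    q = softmax(W @ state)
    n_actions = W.shape[0]
    fisher = np.zeros((W.size, W.size))
    for a in range(n_actions):
        one_hot = np.eye(n_actions)[a]
        # Gradient of log pi(a | state) with respect to W, flattened row-major
        grad = np.outer(one_hot - q, state).reshape(-1)
        fisher += q[a] * np.outer(grad, grad)
    # The Fisher matrix is symmetric, so eigvalsh applies; eigenvalues ascend
    return float(np.linalg.eigvalsh(fisher)[-1])

# Hypothetical weights (3 actions, 2 features) and state vector
W = np.array([[0.2, -0.1], [0.0, 0.3], [-0.4, 0.1]])
state = np.array([0.5, 1.0])
lam_max = fisher_max_eigenvalue(W, state)
```

For a full network the matrix would normally be estimated from sampled states and actions, since forming it exactly becomes impractical as the number of weights grows.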
[0085] The improved GAIL model based on PPO-penalty proposed in this invention maintains consistency with the traditional model in terms of basic architecture. Both adopt a generative adversarial imitation learning framework, where a generator network simulates expert policies, a discriminator network distinguishes between genuine and false policies, and gradient updates are performed using the proximal policy optimization mechanism of the PPO algorithm. Furthermore, both introduce a penalty term to constrain the policy update magnitude, thereby ensuring the stability of the training process.
[0086] The difference lies in that this invention introduces a dual constraint mechanism based on the Fisher information matrix and KL divergence in steps S35 and S36, breaking through the limitations of traditional models that rely solely on a single KL divergence constraint or a simple trust domain. Building upon the traditional model's calculation of the policy distribution bias, step S35 of this invention delves into calculating the second-order derivative matrix, extracting the largest eigenvalue through eigenvalue decomposition as the second-order curvature constraint parameter; step S36 simultaneously calculates the first-order distribution constraint parameter. Step S37 weights and synthesizes the two into a composite constraint value, which is then used in step S38 to correct the original gradient.
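A minimal sketch of steps S37 and S38 follows. The coefficient values, the learning rate `lr`, the guard constant `eps` and the choice of gradient ascent are assumptions; the embodiment only states that the original gradient is divided by the composite constraint value before the weights are updated.

```python
import numpy as np

def composite_constraint(kl_value, max_eigenvalue, c1=1.0, c2=1.0):
    """S37: weighted sum of the first-order and second-order constraints."""
    return c1 * kl_value + c2 * max_eigenvalue

def corrected_update(weights, raw_grad, constraint, lr=0.1, eps=1e-8):
    """S38: scale the raw gradient by 1 / constraint, then take one step.

    lr and eps are hypothetical; ascent on the objective is assumed.
    """
    corrected_grad = raw_grad / (constraint + eps)
    return weights + lr * corrected_grad

# Hypothetical values for one update
constraint = composite_constraint(kl_value=0.2, max_eigenvalue=0.25,
                                  c1=0.5, c2=2.0)
new_weights = corrected_update(np.zeros(2), np.array([0.6, -1.2]), constraint)
```

Under this rule the step shrinks when the policy has drifted far from the expert distribution or when the curvature is high, and grows when both constraints are small, which is why a lower bound on the divisor is used here.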
[0087] The beneficial effect of this improvement lies in the fact that by integrating the dual constraints of first-order distribution and second-order curvature, the model can more accurately perceive the geometric shape of the loss function. The second-order curvature constraint effectively suppresses drastic oscillations in the parameter update direction, avoiding policy collapse caused by excessive gradient updates. This design, while maintaining the accuracy of imitating the decisions of historically successful managers, significantly improves the stability and convergence speed of incentive strategy generation, ensuring the output of high-quality, executable decision-making solutions in complex and ever-changing talent management scenarios.
[0088] In this embodiment, S4 specifically includes:
[0089] S41. Input the real-time state vector into the pre-trained improved GAIL model based on PPO-penalty, read the collaborative network interaction frequency collected in step S1, construct the organizational collaborative network adjacency matrix with talent identifiers as nodes and interaction frequency values as edge weights, calculate the algebraic sum of all values in each row of the adjacency matrix, obtain the total connection weight between each node and other nodes, compare the total connection weight with the preset centrality threshold, and if the value is greater than the threshold, mark it as a centrality feature.
[0090] S42. In the adjacency matrix of the organizational collaboration network, identify the directed edges corresponding to the non-zero elements, traverse and search along the direction of the directed edges from the starting node to the ending node, record all node sequences with path length greater than the preset number of layers, count the nodes in the node sequence whose in-degree value is greater than the preset convergence value, and sum the in-degree value and the node layer number by weight to obtain the pressure transmission path.
[0091] S43. Input the working years data collected in step S1 into the preset occupational status judgment function, calculate the ratio of working years to standard years, determine the numerical label of occupational status type according to the preset interval of the ratio value, read the historical data sequence of project milestone achievement, calculate the variance of the sequence value, and mark the variance value as risk preference tendency.
[0092] S44. Read the interaction frequency value of the collaborative network. If the interaction frequency value is less than the preset lower limit and the working years value is greater than the preset upper limit, then assign the implicit turnover probability to 1; otherwise, assign it to 0. Concatenate the numerical label of the occupational status type, the variance value, and the assignment of the implicit turnover probability to output a status code containing multidimensional implicit features.
[0093] S45. Input the state code into the generator network, process it through the preset fully connected layer matrix operation and activation function, output the predicted probability value corresponding to each integer identifier in the action space, sort all the predicted probability values in descending order, extract the target probability value in the first preset number of positions of the sort, read the integer identifier corresponding to the target probability value, convert the integer identifier into an incentive policy, and combine them to form a candidate policy set containing a preset number of incentive actions.
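Steps S41, S43, S44 and S45 can be sketched with plain array operations. All thresholds, bin edges and input values below are hypothetical presets, and the generator output is replaced by a fixed probability vector; the pressure transmission path of step S42 is omitted.

```python
import numpy as np

def centrality_flags(adjacency, threshold):
    """S41: row sums of the weighted adjacency matrix, flagged by threshold."""
    totals = np.asarray(adjacency, dtype=float).sum(axis=1)
    return totals, totals > threshold

def state_code(years, standard_years, status_bins, milestones,
               interaction_freq, freq_floor, years_ceiling):
    """S43-S44: concatenate status label, milestone variance, turnover flag."""
    # Occupational status label from the ratio of years to standard years
    status_label = int(np.digitize(years / standard_years, status_bins))
    # Variance of milestone achievement, recorded as risk preference tendency
    risk_preference = float(np.var(milestones))
    # Implicit turnover probability: low interaction combined with long service
    turnover = 1 if (interaction_freq < freq_floor and years > years_ceiling) else 0
    return np.array([status_label, risk_preference, turnover], dtype=float)

def top_k_actions(action_probs, k):
    """S45: identifiers of the k actions with the highest probability."""
    order = np.argsort(action_probs)[::-1]  # indices in descending order
    return order[:k].tolist()

adjacency = [[0, 5, 2], [5, 0, 1], [2, 1, 0]]  # interaction frequencies
totals, is_central = centrality_flags(adjacency, threshold=5)
code = state_code(years=12, standard_years=15, status_bins=[0.5, 1.0],
                  milestones=[1, 0, 1, 1], interaction_freq=1.0,
                  freq_floor=2.0, years_ceiling=10)
candidates = top_k_actions(np.array([0.1, 0.4, 0.2, 0.3]), k=2)
```

In this example the first two talents exceed the centrality threshold, and the candidate set contains action identifiers 1 and 3.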
[0094] In this embodiment, S5 specifically includes:
[0095] S51. Input the state code into the initial state register of the digital twin simulation environment, set the simulation time step counter to zero, read the candidate strategy set, use each integer identifier contained in the candidate strategy set as a child node of the root node of the Monte Carlo tree search algorithm, initialize the access count variable of each child node to 0, and initialize the reward accumulation variable of each child node to 0.
[0096] S52. For each child node, read the integer identifier corresponding to the child node, query the action space set of step S21, and convert the integer identifier into specific numerical parameters of the salary adjustment range, achievement reward ratio, job promotion opportunity, and graduate student and postdoctoral quota allocation. Input the specific numerical parameters into the input interface of the digital twin simulation environment, call the state update function of the digital twin simulation environment, and calculate the state data of the next moment after the incentive strategy is executed. Extract the talent's employment status and performance appraisal score from the state data of the next moment, and use the immediate reward function of step S24 to calculate the single-step reward value of the first moment.
[0097] S53. Starting from the obtained state data of the next moment, generate a random number between 0 and 1 and use it to randomly select an action from the action space set as the subsequent action. Input the subsequent action into the digital twin simulation environment to obtain new state data, calculate the reward value in the new state using the immediate reward function, and increment the simulation time step counter by one.
[0098] S54. Repeat the operations of randomly selecting an action, inputting it into the environment, calculating the reward and updating the counter until the simulation time step counter equals the preset maximum number of steps. Then add up all the single-step reward values from the first moment to the last step to obtain the total reward value of a single simulation.
[0099] S55. Repeat the random selection and extrapolation process of step S54 until the number of repetitions reaches the preset total number of simulations. Sum the total reward values of the individual simulations to obtain the total reward accumulation value, divide the total reward accumulation value by the total number of simulations, and record the result as the predicted cumulative reward return.
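The rollout procedure of steps S51 to S55 can be sketched as below. As described, each candidate action is evaluated by repeated random rollouts of fixed length, so the sketch implements exactly that and does not include the selection and backpropagation phases of a full Monte Carlo tree search. The functions `toy_step` and `toy_reward` are hypothetical stand-ins for the digital twin state update function and the immediate reward function.

```python
import random

def toy_step(state, action):
    """Hypothetical stand-in for the digital twin's state update function."""
    return min(1.0, max(0.0, state + 0.1 * (action - 1)))

def toy_reward(state):
    """Hypothetical stand-in for the immediate reward function of step S24."""
    return state

def rollout_totals(first_action, init_state, step_fn, reward_fn,
                   action_space, max_steps, n_sims, rng):
    """Total reward of each simulation that starts with first_action."""
    totals = []
    for _ in range(n_sims):
        # S52: apply the candidate action and score the first moment
        state = step_fn(init_state, first_action)
        total = reward_fn(state)
        # S53-S54: random follow-up actions until max_steps is reached
        for _ in range(max_steps - 1):
            state = step_fn(state, rng.choice(action_space))
            total += reward_fn(state)
        totals.append(total)
    return totals

def predicted_return(totals):
    """S55: mean of the per-simulation totals."""
    return sum(totals) / len(totals)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
returns = {a: rollout_totals(a, 0.5, toy_step, toy_reward,
                             [0, 1, 2], 5, 200, rng)
           for a in (0, 2)}
predictions = {a: predicted_return(t) for a, t in returns.items()}
```

Keeping the per-simulation totals, rather than only their mean, makes the later standard deviation calculation of step S61 possible without rerunning the simulations.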
[0100] In this embodiment, S6 specifically includes:
[0101] S61. Read the predicted cumulative reward returns of all incentive strategies, compare them, select the one with the largest value, and record the corresponding incentive strategy as the maximum return candidate strategy. For each incentive strategy, retrieve the total reward values of the individual simulations calculated in step S54, calculate the standard deviation of these values, and use the standard deviation as the risk volatility indicator.
[0102] S62. Compare the risk volatility indicator with the preset risk threshold, eliminate incentive strategies whose risk volatility indicator is greater than the preset risk threshold, retain incentive strategies whose risk volatility indicator is less than or equal to the preset risk threshold, and record the retained strategies as the set of strategies to be screened.
[0103] S63. Find the target strategy with the largest predicted cumulative reward value from the set of strategies to be screened. If there are multiple target strategies with the same value, randomly select one of them and mark the selected target strategy as the optimal incentive strategy.
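The screening of steps S61 to S63 can be sketched as follows. The use of the population standard deviation, the example reward values, the risk threshold and the function name `select_optimal` are assumptions; the embodiment does not state which form of standard deviation is used.

```python
import random
import statistics

def select_optimal(returns_by_action, risk_threshold, rng=random):
    """Pick the highest mean return among actions within the risk threshold.

    returns_by_action maps an action identifier to the list of total
    reward values from its individual simulations.
    """
    # S61: mean return and standard deviation (risk volatility) per action
    stats = {a: (statistics.mean(t), statistics.pstdev(t))
             for a, t in returns_by_action.items()}
    # S62: keep only actions whose volatility does not exceed the threshold
    screened = {a: m for a, (m, sd) in stats.items() if sd <= risk_threshold}
    if not screened:
        return None  # no strategy passes the risk screen
    # S63: highest predicted return; ties are broken at random
    best = max(screened.values())
    ties = [a for a, m in screened.items() if m == best]
    return rng.choice(ties)

# Hypothetical per-simulation totals for three candidate actions
simulated = {0: [1.0, 1.2, 0.8], 1: [2.5, 0.1, 3.4], 2: [1.4, 1.5, 1.6]}
optimal = select_optimal(simulated, risk_threshold=0.5)
```

In this example action 1 has the highest mean return but is eliminated for excessive volatility, so action 2 is selected. This shows that the maximum return candidate of step S61 is not necessarily the final choice.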
[0104] In this embodiment, S7 specifically includes: applying the selected optimal incentive strategy to changes or modifications to employee compensation incentive and training policies.
[0105] Example 1: To verify the feasibility of this invention for the incentive and retention of high-level talents in national-level research institutes, the invention was applied to the talent management and research performance digital platform of a comprehensive marine science research institution directly under a national ministry. This institution, a core force in China's marine science and technology innovation, mainly undertakes research tasks in key areas such as deep-sea biological research, global change and regional marine response, and marine biodiversity and ecosystem protection. It has over 430 high-level researchers, including academicians of the Chinese Academy of Engineering, recipients of the National Science Fund for Distinguished Young Scholars, and members of provincial and ministerial-level talent programs. It also has the authority to grant master's degrees and jointly trains doctoral students with top domestic universities, with over 360 master's and doctoral students currently enrolled. The researchers generally hold advanced degrees and senior professional titles, and frequently undertake deep-sea scientific expeditions, working in challenging environments under high research pressure. Traditional scientific research performance evaluation relies mainly on annual paper statistics and project completion acceptance. The incentive methods are limited in variety and the feedback cycle is long, which makes it difficult to respond to the complex psychological changes of researchers during critical periods of project breakthroughs, periods of fatigue on voyages, and bottlenecks in professional title promotion. This leads to increased professional burnout among some young key researchers, and even to the risk of personnel loss at key project milestones. Moreover, the traditional job allocation model cannot accurately reflect the implicit contributions of researchers in team collaboration, and it struggles to stimulate the innovative vitality of scientific research teams.
[0106] In practical applications, the method of this invention first integrates multiple business systems within the institution, including research project management, financial accounting, personnel files, and research vessel operation scheduling, to comprehensively collect multi-source heterogeneous data on researchers' ongoing project progress, funding execution rate, academic output, at-sea operation time, number of students supervised, and academic affiliations. It also incorporates operational environment parameters from deep-sea scientific expeditions as stress correction factors. The system cleans and maps this data into a high-dimensional tensor state space, constructing a digital twin of each researcher's professional status. Subsequently, the system calls a pre-trained, improved GAIL model based on PPO-penalty. This model uses the institution's historical incentive decision-making paths for "Outstanding Contribution Experts" and "Excellent Mentors" as prior knowledge, combined with real-time status characteristics such as whether researchers are currently in a rest period or facing a critical period for professional title evaluation, to generate a draft of a personalized incentive strategy encompassing dynamic adjustments to performance bonuses, preferential research subsidies, optimization of postgraduate enrollment quotas, and recommendations for domestic and international academic exchange opportunities. To ensure the scientific validity and adaptability of the strategy, the generated strategy was input into a digital twin simulation environment to simulate changes in researchers' output, team collaboration stability, and turnover intention fluctuations over a future period after the strategy's implementation. The system utilizes a Monte Carlo tree search algorithm for multi-step forward inference to predict the cumulative benefits and potential risks under different strategy paths. 
Ultimately, it selects the optimal incentive plan that ensures the smooth implementation of major national scientific research projects while maximizing researchers' professional fulfillment, and pushes it to the human resources and research management departments for collaborative implementation. Table 1 below shows the comparison data between the method of this invention and traditional incentive methods based on annual static assessments in the management and incentive effects on key core scientific research personnel:
[0107] Table 1. Comparison of the effectiveness of the present invention and traditional methods in the incentive management of scientific research personnel.
[0108]
[0109] Based on the comparative data shown in Table 1, it can be seen that the talent incentive optimization method based on reinforcement learning and digital twins proposed in this invention demonstrates excellent management efficiency in the specific scenario of research institutes. The effect is particularly significant in controlling the turnover rate of core talent. For young applied oceanography researchers facing significant research pressure, the six-month turnover rate under the traditional assessment model is as high as 5.2%. This is mainly due to the lack of timely psychological guidance and resource support for young researchers facing pressure in project applications and bottlenecks in professional title promotion. However, after applying this invention, this indicator dropped to 1.1%. The system automatically recommends personalized incentive combinations such as "visiting scholar exchange opportunities" or "appropriate reduction of teaching tasks" by identifying the "professional burnout" state of young talents in advance, effectively stabilizing the young research team and ensuring the sustainability of research team building.
[0110] Regarding the quality of scientific research output, this invention significantly improves the completion rate of national-level projects and the average output of high-level papers per researcher through a dynamic incentive mechanism. Taking deep-sea basic research experts as an example, the completion rate of projects with excellent or good results using traditional methods is 78.5%, while the method of this invention raises it to 92.4%. This is thanks to the system's ability to intelligently match incentive measures such as "data computing resource subsidies" or "research assistant support" based on key milestones in research projects, such as the data processing period after a voyage. This alleviates the non-research workload of researchers, allowing them to focus more on tackling core scientific problems and thus produce more high-quality results.
[0111] Furthermore, this invention represents a qualitative leap in the timeliness and accuracy of management response. Traditional human resource assessments often rely on annual statistics, which are lengthy and lagging, while this invention can shorten the generation time of incentive strategies to the minute level, achieving "on-demand incentives" and "precision drip irrigation." Researcher satisfaction scores have jumped from an average of over 70 points to over 90 points, fully demonstrating the practical value of this method in respecting the laws of scientific research and caring for researchers. The significant improvement in the input-output ratio of human resources also proves that, within a limited research funding budget, optimizing the allocation of incentive resources through scientific algorithms can unleash greater potential for scientific innovation. Overall, this invention provides intelligent and quantifiable technical support for solving the problems of "difficult evaluation, difficult incentives, and difficult retention" in talent management in national-level research institutions, and powerfully promotes the innovation-driven development of marine scientific research.
Claims
1. A method for generating talent incentive strategies based on deep reinforcement learning, characterized in that it includes the following steps: S1. Construct a multi-source heterogeneous data perception layer for research institutes, collect behavioral and performance data of talents, and fuse and map them into a high-dimensional tensor state space to form a real-time state vector. S2. Establish a dynamic reward mechanism, define the agent's action space as discrete incentive strategies, construct a nonlinear immediate reward function, and quantitatively evaluate the effectiveness of the incentive strategy over a short period. S3. Using the preset historical incentive decision sequence of outstanding managers as prior knowledge, adopt an improved GAIL model based on PPO-penalty to generate simulated incentive strategies, calculate the trust domain constraint penalty term, and introduce dual constraint factors when constructing the total objective function in combination with the immediate reward function, so as to regulate both the morphological distribution and the update direction of the incentive strategy, and carry out pre-training and iterative optimization. S4. Input the real-time state vector into the pre-trained improved GAIL model based on PPO-penalty, mine the centrality features and pressure transmission paths of talents in the organizational collaboration network, fuse behavioral data and performance data, output state codes and a preliminary policy probability distribution, and sample to generate a candidate policy set. S5. Input the state code into the digital twin simulation environment, combine it with the preliminary policy probability distribution, and use the Monte Carlo tree search algorithm to perform multi-step forward inference to simulate the future behavior trajectory and performance change curve of talents after receiving different incentive strategies, and predict the cumulative reward return. S6. Based on the predicted cumulative reward return and the preset risk threshold, construct screening indicators to select from the candidate strategy set the optimal incentive strategy that maximizes long-term returns and has controllable risks. S7. Based on the selected optimal incentive strategy, apply changes or modifications to employee compensation incentives and training policies.
2. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that, S1 specifically includes: S11. Call the data acquisition interface to read the talent's behavioral data and performance data. The behavioral data includes the frequency of interaction in the collaborative network and the length of service. The performance data includes the KPI completion rate and the achievement of project milestones. Map the behavioral data and performance data to a pre-defined multi-dimensional tensor structure and fill each data item into the corresponding dimension position of the tensor. S12. Perform linear normalization on the filled multidimensional tensor structure to scale the values to a preset range, generate a high-dimensional tensor state space, and arrange the high-dimensional tensor state space into a one-dimensional data sequence in a preset order to form a real-time state vector.
3. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that, S2 specifically includes: S21. Pre-define the action space set of the agent: convert the salary adjustment range, achievement reward ratio, job promotion opportunities, and graduate student and postdoctoral quota allocation into pre-defined unique integer identifiers, and store the integer identifiers in the action space set. S22. Read the talent status records in the talent management system. If the talent is employed, set the talent retention rate to 1. If the talent has left, set the talent retention rate to 0. Read the performance appraisal score of the current period and the performance appraisal score of the previous period, calculate the difference between the two and record it as the performance gain. S23. Read the corresponding salary adjustment amount, performance bonus amount, cost increase due to job promotion and training expense amount from the financial database, and calculate their sum as the total incentive cost. S24. Multiply the talent retention rate and performance gain by the preset first weight coefficient and second weight coefficient respectively and sum them to obtain the total benefit. Subtract the total incentive cost from the total benefit to calculate the net value of the immediate reward. Input the net value of the immediate reward into the Sigmoid nonlinear activation function for calculation and output the immediate reward value to establish the immediate reward function. S25. When the preset short period after the implementation of the incentive strategy has elapsed, read the talent's employment status and performance appraisal score from the talent management system, recalculate the talent retention rate, performance gain and total incentive cost, input the recalculated values into the immediate reward function, and output the resulting value as the quantitative evaluation result.
4. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that, S3 specifically includes: S31. Call the database interface to read the pre-stored historical incentive decision sequence of outstanding managers, decompose each incentive decision record in the incentive decision sequence into state data and action data, combine them to generate expert state-action pairs, and input the expert state-action pairs into the discriminator network as positive sample labels. S32. Input the real-time state vector of the current multidimensional tensor state space into the generator network. In each hidden layer of the preset multi-layer neural network, perform the preset matrix multiplication, add the bias, and apply the ReLU activation function. Then apply the Softmax function to output the predicted probability value of each action dimension, and combine these values to form a simulated incentive strategy. S33. Read the incentive decision sequence of historical outstanding managers, count the frequency of each action data, divide the frequency by the total frequency of actions in the current state and normalize it to obtain the actual occurrence probability of the corresponding action in the expert state-action pair, and input it into the discriminator network at the same time as the predicted probability value of the simulated incentive strategy, calculate the cross-entropy loss function value between the two, mark the cross-entropy loss function value as the strategy distribution deviation, and use the strategy distribution deviation as the trust domain constraint penalty term. S34. Construct an improved overall objective function formula for the GAIL model based on PPO-penalty. In the overall objective function formula, set the sum of environmental reward terms and penalty terms. 
The environmental reward terms are substituted with the output value of the immediate reward function, and the penalty terms are substituted with the trust domain constraint penalty terms. S35. Extract the current preset weight parameter vector of the generator network, calculate the first and second derivatives of the total objective function formula with respect to the weight parameter vector of the generator network, obtain the loss function by calculating the expected value of the second derivative matrix, calculate the first derivative matrix of the weight parameter vector with respect to the loss function to obtain the Fisher information matrix, perform eigenvalue decomposition on the Fisher information matrix, obtain the eigenvalue with the largest value among all eigenvalues, and record the largest eigenvalue as the second curvature constraint parameter; S36. Read the predicted probability value of the action output by the generator network and the actual occurrence probability of the corresponding action in the expert state-action pair. Take the natural logarithm of the predicted probability value of each action and calculate the ratio with the actual occurrence probability. Multiply the ratio by the actual occurrence probability to obtain the relative entropy component of each action. Perform an accumulation and summation operation on the relative entropy components of all actions to obtain the KL divergence value and record it as a first-order distribution constraint parameter. S37. Set the first preset coefficient and the second preset coefficient, multiply the first-order distribution constraint parameter by the first preset coefficient to obtain the first-order constraint component, multiply the second-order curvature constraint parameter by the second preset coefficient to obtain the second-order constraint component, and add the first-order constraint component and the second-order constraint component to obtain the composite constraint value. S38. 
Calculate the original gradient value of the overall objective function with respect to the weight parameters of the generator network. Divide the original gradient value by the composite constraint value to obtain the corrected gradient value. Use the corrected gradient value to perform addition and subtraction operations on the weight parameters of the generator network to complete one parameter update. Repeat the forward propagation and parameter update steps until the preset number of iterations is reached to complete the pre-training and iterative optimization.
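Steps S36 to S38 can be illustrated with a minimal sketch. It assumes that the KL divergence is taken from the expert distribution to the generator distribution, that the largest Fisher eigenvalue from S35 is already available as a number, and that the update is a gradient-ascent step; the function names, the `eps` floor and the `step_size` parameter are hypothetical and do not come from the claims.

```python
import math


def kl_divergence(expert_probs, predicted_probs, eps=1e-12):
    """First-order distribution constraint: sum of p_E * ln(p_E / p_G)."""
    total = 0.0
    for p_e, p_g in zip(expert_probs, predicted_probs):
        if p_e > 0.0:  # actions the expert never takes contribute nothing
            # eps is an assumed floor that avoids division by zero
            total += p_e * math.log(p_e / max(p_g, eps))
    return total


def composite_constraint(kl_value, max_eigenvalue, coeff1, coeff2):
    """S37: weighted sum of the first-order and second-order components."""
    return coeff1 * kl_value + coeff2 * max_eigenvalue


def corrected_update(weights, gradient, constraint, step_size=1.0):
    """S38: divide the raw gradient by the composite constraint, then update.

    Ascent on the objective is an assumption; the claims only state that
    the corrected gradient is added to or subtracted from the weights.
    """
    if constraint <= 0.0:
        raise ValueError("composite constraint must be positive")
    return [w + step_size * g / constraint for w, g in zip(weights, gradient)]


if __name__ == "__main__":
    kl = kl_divergence([0.5, 0.5], [0.25, 0.75])
    c = composite_constraint(kl, max_eigenvalue=4.0, coeff1=0.5, coeff2=0.25)
    print(corrected_update([1.0, 2.0], [0.3, -0.1], c))
```

Dividing by the composite constraint shrinks the step whenever the generator drifts from the expert distribution or the loss surface is sharply curved, which is the stabilizing effect the dual constraint factors are meant to provide.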
5. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that S4 specifically includes:
S41. Input the real-time state vector into the pre-trained improved GAIL model based on PPO-penalty, read the collaborative network interaction frequency collected in step S1, construct the organizational collaborative network adjacency matrix with talent identifiers as nodes and interaction frequency values as edge weights, calculate the algebraic sum of all values in each row of the adjacency matrix to obtain the total connection weight between each node and the other nodes, compare the total connection weight with the preset centrality threshold, and if it is greater than the threshold, mark the node as having a centrality feature.
S42. In the adjacency matrix of the organizational collaborative network, identify the directed edges corresponding to the non-zero elements, traverse along the direction of the directed edges from the starting node to the ending node, record all node sequences whose path length is greater than the preset number of layers, identify the nodes in each node sequence whose in-degree value is greater than the preset convergence value, and compute the weighted sum of the in-degree value and the node layer number to obtain the pressure transmission path.
S43. Input the working years data collected in step S1 into the preset occupational status judgment function, calculate the ratio of working years to standard years, determine the numerical label of the occupational status type according to the preset interval in which the ratio falls, read the historical data sequence of project milestone achievement, calculate the variance of the sequence values, and record the variance value as the risk preference tendency.
S44. Read the interaction frequency value of the collaborative network. If the interaction frequency value is less than the preset lower limit and the working years value is greater than the preset upper limit, assign the implicit turnover probability the value 1; otherwise, assign it the value 0. Concatenate the numerical label of the occupational status type, the variance value, and the assigned implicit turnover probability, and output the state code.
S45. Input the state code into the generator network, process it through the preset fully connected layer matrix operations and activation functions, output the predicted probability value corresponding to each integer identifier in the action space, sort all the predicted probability values in descending order, extract the target probability values in the first preset number of positions of the sorted order, read the integer identifier corresponding to each target probability value, convert each integer identifier into an incentive strategy, and combine the incentive strategies to form the candidate strategy set.
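As a rough illustration of S41, S44 and S45, the sketch below computes row sums of an adjacency matrix, builds the three-part state code and picks the top-ranked action identifiers. All function names are hypothetical, the adjacency matrix is a plain list of lists, and the use of the population variance (rather than the sample variance) is an assumption, since the claims do not specify which one is meant.

```python
def central_nodes(adjacency, centrality_threshold):
    """S41: nodes whose row sum (total connection weight) exceeds the threshold."""
    return [i for i, row in enumerate(adjacency) if sum(row) > centrality_threshold]


def population_variance(values):
    """Variance of the milestone sequence (population form, an assumption)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def implicit_turnover(interaction_freq, working_years, freq_lower, years_upper):
    """S44: 1 if interaction is low and tenure is long, otherwise 0."""
    return 1 if (interaction_freq < freq_lower and working_years > years_upper) else 0


def state_code(status_label, milestone_history, turnover_flag):
    """S44: concatenate status label, risk preference variance and turnover flag."""
    return [float(status_label), population_variance(milestone_history), float(turnover_flag)]


def top_k_actions(predicted_probs, k):
    """S45: integer identifiers of the k actions with the highest probability."""
    ranked = sorted(range(len(predicted_probs)),
                    key=lambda i: predicted_probs[i], reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    adjacency = [[0, 3, 1], [3, 0, 0], [1, 0, 0]]
    print(central_nodes(adjacency, centrality_threshold=2))
    flag = implicit_turnover(1, 12, freq_lower=2, years_upper=10)
    print(state_code(2, [1.0, 2.0, 3.0], flag))
    print(top_k_actions([0.1, 0.5, 0.15, 0.25], k=2))
```

In the claimed method the probabilities passed to `top_k_actions` would come from the generator network; here they are supplied directly so the ranking step can be read in isolation.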
6. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that S5 specifically includes:
S51. Input the state code into the initial state register of the digital twin simulation environment, set the simulation time step counter to zero, read the candidate strategy set, use each integer identifier contained in the candidate strategy set as a child node of the root node of the Monte Carlo tree search algorithm, initialize the visit count variable of each child node to 0, and initialize the reward accumulation variable of each child node to 0.
S52. For each child node, read the integer identifier corresponding to the child node, query the action space set in step S21, convert the integer identifier into specific numerical parameters of salary adjustment range, achievement reward ratio, job promotion opportunity, and postgraduate and postdoctoral quota allocation, input the specific numerical parameters into the input interface of the digital twin simulation environment, call the state update function of the digital twin simulation environment, calculate the state data of the next moment after the execution of the incentive strategy, extract the talent's employment status and performance appraisal score from the state data of the next moment, and use the instant reward function formula in step S24 to calculate the single-step reward value of the first moment.
S53. Starting from the obtained state data of the next moment, generate a random number between 0 and 1 and use it to randomly select an action from the action space set as the subsequent action. Input the subsequent action into the digital twin simulation environment to obtain new state data, calculate the reward value in the new state using the instant reward function, and increment the simulation time step counter by one.
S54. Repeat the operations of randomly selecting an action, inputting it into the environment, calculating the reward and updating the counter until the simulation time step counter equals the preset maximum number of steps. Then add up all the single-step reward values from the first moment to the final step to obtain the total reward value of a single simulation.
S55. Repeat the random selection and simulation process of step S54 until the number of repetitions reaches the preset total number of simulations. Sum the total reward values of the single simulations obtained from all executions to obtain the total reward accumulation value. Divide the total reward accumulation value by the total number of simulations, and record the resulting value as the predicted cumulative reward return.
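The rollout procedure of S52 to S55 can be sketched as flat Monte Carlo evaluation of each candidate. This is a simplified illustration, not the claimed implementation: it omits the tree bookkeeping (visit counts, reward accumulators) of S51, it assumes the horizon counts the candidate's first step plus random follow-up steps up to the preset maximum step count, and `step_fn` and `reward_fn` are hypothetical stand-ins for the digital twin state update function and the instant reward function.

```python
import random


def rollout_return(state, first_action, actions, step_fn, reward_fn, max_steps, rng):
    """One simulation: apply the candidate action, then act randomly (S52 to S54)."""
    state = step_fn(state, first_action)
    total = reward_fn(state, first_action)  # single-step reward of the first moment
    for _ in range(max_steps - 1):
        action = rng.choice(actions)  # random subsequent action (S53)
        state = step_fn(state, action)
        total += reward_fn(state, action)
    return total


def simulate_candidates(state, candidates, actions, step_fn, reward_fn,
                        max_steps, n_sims, seed=0):
    """Run n_sims rollouts per candidate and keep every single-simulation total."""
    rng = random.Random(seed)
    return {
        c: [rollout_return(state, c, actions, step_fn, reward_fn, max_steps, rng)
            for _ in range(n_sims)]
        for c in candidates
    }


def predicted_returns(returns_by_candidate):
    """S55: total reward accumulation divided by the number of simulations."""
    return {c: sum(v) / len(v) for c, v in returns_by_candidate.items()}


if __name__ == "__main__":
    # Toy environment (assumption): the state is a number, actions add to it,
    # and the reward equals the new state.
    toy_step = lambda s, a: s + a
    toy_reward = lambda s, a: s
    sims = simulate_candidates(0, [1, 2], [0, 1], toy_step, toy_reward,
                               max_steps=3, n_sims=100)
    print(predicted_returns(sims))
```

Keeping the full list of single-simulation totals, rather than only their mean, makes the same data available for the standard-deviation risk measure used in the later screening step.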
7. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that S6 specifically includes:
S61. Read the predicted cumulative reward returns, compare them, select the one with the largest value, and record the corresponding incentive strategy as the maximum return candidate strategy. Retrieve the total reward value of each single simulation calculated in step S54, calculate, for each incentive strategy, the standard deviation of all of its single-simulation total reward values, and use the standard deviation as the risk volatility index.
S62. Compare the risk volatility index with the preset risk threshold, eliminate incentive strategies whose risk volatility index is greater than the preset risk threshold, retain incentive strategies whose risk volatility index is less than or equal to the preset risk threshold, and record the retained strategies as the set of strategies to be screened.
S63. Find the target strategy with the largest predicted cumulative reward return in the set of strategies to be screened. If multiple target strategies share the same value, randomly select one of them, and mark the selected target strategy as the optimal incentive strategy.
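The screening in S61 to S63 amounts to a risk-filtered argmax, sketched below. The function name and the input layout (a mapping from strategy identifier to its list of single-simulation total rewards) are hypothetical, the population standard deviation is an assumed choice, and returning `None` when every strategy exceeds the risk threshold is an assumption, since the claims do not cover that case.

```python
import random
import statistics


def select_optimal_strategy(returns_by_strategy, risk_threshold, rng=None):
    """Filter strategies by return volatility, then pick the highest mean return."""
    rng = rng or random.Random(0)
    mean_return = {s: statistics.fmean(r) for s, r in returns_by_strategy.items()}
    # Risk volatility index: standard deviation of single-simulation totals (S61)
    risk = {s: statistics.pstdev(r) for s, r in returns_by_strategy.items()}

    # S62: keep only strategies at or below the risk threshold
    retained = [s for s in returns_by_strategy if risk[s] <= risk_threshold]
    if not retained:
        return None  # assumed behavior when nothing passes the screen

    # S63: highest predicted return among the retained strategies, random tie-break
    best = max(mean_return[s] for s in retained)
    tied = [s for s in retained if mean_return[s] == best]
    return rng.choice(tied)


if __name__ == "__main__":
    simulated = {
        1: [10.0, 10.0, 10.0],   # steady, mean 10
        2: [0.0, 30.0, 15.0],    # highest mean, but volatile
        3: [8.0, 9.0, 10.0],     # steady, mean 9
    }
    print(select_optimal_strategy(simulated, risk_threshold=5.0))
```

In this toy data, strategy 2 has the largest mean return and would be the maximum return candidate of S61, but its volatility removes it in S62, so the steadier strategy 1 is selected.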
8. The method for generating talent incentive strategies based on deep reinforcement learning according to claim 1, characterized in that S7 specifically includes: applying changes or modifications to employee compensation incentives and training policies based on the selected optimal incentive strategy.