An asynchronous reinforcement learning training method and system based on distribution matching sample selection
By employing a distribution-matching sample selection and policy-level weighting mechanism, the problem of large variance in importance sampling in asynchronous reinforcement learning is solved, thereby improving the stability and efficiency of training. This approach is suitable for asynchronous and off-policy reinforcement learning scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- HEFEI BAIZE ARTIFICIAL INTELLIGENCE TECHNOLOGY CO LTD
- Filing Date
- 2026-04-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing asynchronous reinforcement learning methods tend to introduce large estimation variances during importance sampling, leading to instability in the training process and affecting the model's convergence speed and final performance.
A distribution matching sample selection method is adopted. By constructing a reference distribution, samples with small deviations are selected as high-quality training data. A policy-level weighting mechanism is introduced to optimize the policy using the weighted multi-behavior policy sample data.
It significantly reduces the variance in the importance sampling process, improves the stability and efficiency of training, increases the convergence speed and final performance of the model, is suitable for large-scale model training, and reduces computing power and time costs.
Figure CN122241235A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of reinforcement learning, and specifically to an asynchronous reinforcement learning training method and system based on distribution matching sample selection.
Background Technology
[0002] Existing reinforcement learning methods mainly adopt a synchronous training process, in which data generation and policy optimization are performed sequentially. In this type of method, data is usually generated by the interaction between the current policy and the environment, and then the policy is updated based on the collected data. There is a tight coupling between the two. Therefore, the policy update can only be performed after the data collection is completed, which leads to waiting and blocking in the overall training process, thereby limiting the training throughput and causing problems such as insufficient utilization of computing resources such as GPUs.
[0003] To address the aforementioned issues, existing research has proposed asynchronous reinforcement learning methods. These methods decouple the data generation process of the behavioral policy from the optimization process of the target policy, allowing them to be executed in parallel. This improves the overall training efficiency and scalability of the system. Specifically, the asynchronous mechanism allows multiple data acquisition and policy update modules to run simultaneously, effectively improving resource utilization and accelerating the reinforcement learning training process.
[0004] Existing asynchronous reinforcement learning methods typically use data generated from historical policies directly for model training. Since the data distribution deviates from the current policy, mechanisms such as importance sampling are needed for probability correction. However, these methods are prone to introducing large estimation variances during importance sampling, leading to instability in the training process and affecting the model's convergence speed and final performance.
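For reference, the variance problem described above can be written out with the standard importance sampling estimator. The notation here is an assumption introduced for illustration and does not appear in the patent text: $\pi_\theta$ is the current target policy, $\mu$ is the historical behavior policy that generated trajectory $\tau$, and $R(\tau)$ is the trajectory reward.

```latex
\hat{J}_{\mathrm{IS}} = \frac{1}{N}\sum_{i=1}^{N} \rho_i \, R(\tau_i),
\qquad
\rho_i = \frac{\pi_\theta(\tau_i)}{\mu(\tau_i)},
\qquad
\tau_i \sim \mu

\mathrm{Var}_{\mu}\!\left[\hat{J}_{\mathrm{IS}}\right]
= \frac{1}{N}\left( \mathbb{E}_{\mu}\!\left[\rho^{2} R^{2}\right] - J^{2} \right),
\qquad
J = \mathbb{E}_{\pi_\theta}\!\left[R\right]
```

The estimator is unbiased when $\mu$ covers the support of $\pi_\theta$, but the term $\mathbb{E}_{\mu}[\rho^{2}R^{2}]$ grows quickly as $\mu$ drifts away from $\pi_\theta$, which is the source of the instability.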
[0005] To address the above issues, an asynchronous reinforcement learning training method and system based on distribution matching sample selection is proposed.
Summary of the Invention
[0006] The purpose of this invention is to provide an asynchronous reinforcement learning training method and system based on distribution matching sample selection. The invention addresses the problem described in the background above: existing asynchronous reinforcement learning methods tend to introduce large estimation variance during importance sampling, which destabilizes training and degrades the model's convergence speed and final performance.
[0007] To achieve the above objective, the present invention provides the following technical solution: an asynchronous reinforcement learning training method based on distribution matching sample selection, comprising the following steps:
Determine the current target policy, then construct an experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies, the trajectory data including trajectory reward information;
Based on the current target policy and the trajectory reward information, construct a reference distribution for reducing the variance of importance sampling estimation;
Calculate the deviation of each sample in the experience replay buffer from the reference distribution, sort the samples, and select the samples with smaller deviations as high-quality training data;
Model the high-quality training data as an off-policy optimization problem with multiple behavior policies, and introduce a policy-level weighting mechanism that weights and fuses the data contributions of each behavior policy to obtain multi-behavior-policy sample data;
Construct an optimization objective from the weighted multi-behavior-policy sample data, and update the current target policy using the policy gradient method.
[0008] Furthermore, the current target policy is determined as follows: in each iteration of asynchronous reinforcement learning training, the target policy to be optimized in that iteration is identified and set as the baseline for subsequent sample selection, reference distribution construction, and policy optimization.
The experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies is constructed as follows: an empty experience replay buffer is created to hold historical policy trajectory data; multiple historical behavior policies interact with the environment at different times to generate trajectory data containing trajectory reward information; and the collected trajectory data is written into the empty buffer, yielding an experience replay buffer that stores trajectory data from multiple historical behavior policies at different times.
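The buffer described in this paragraph could be sketched as below. This is a minimal illustration, not the patented implementation: the names `Trajectory`, `ReplayBuffer`, `policy_version`, and `by_policy`, as well as the bounded first-in-first-out capacity, are assumptions that the patent text does not specify.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List


@dataclass
class Trajectory:
    """One stored rollout. All field names are illustrative assumptions."""
    policy_version: int      # identifies the historical behavior policy
    tokens: List[int]        # actions (for example, generated token ids)
    behavior_logprob: float  # log-probability under the behavior policy
    reward: float            # trajectory-level reward


@dataclass
class ReplayBuffer:
    """Bounded FIFO store for trajectories from several behavior policies."""
    capacity: int = 10_000   # assumed; the patent does not state a capacity
    _items: Deque[Trajectory] = field(default_factory=deque)

    def add(self, traj: Trajectory) -> None:
        if len(self._items) >= self.capacity:
            self._items.popleft()  # evict the oldest trajectory
        self._items.append(traj)

    def by_policy(self) -> Dict[int, List[Trajectory]]:
        """Group stored trajectories by the policy that generated them."""
        groups: Dict[int, List[Trajectory]] = {}
        for t in self._items:
            groups.setdefault(t.policy_version, []).append(t)
        return groups

    def __len__(self) -> int:
        return len(self._items)


# Usage: four rollouts from three policy versions, capacity of three.
buffer = ReplayBuffer(capacity=3)
for version, reward in [(0, 1.0), (0, 0.0), (1, 1.0), (2, 0.5)]:
    buffer.add(Trajectory(version, tokens=[1, 2], behavior_logprob=-1.0, reward=reward))
```

Recording the policy version with each trajectory keeps the later policy-level grouping cheap, since no lookup against the generating policy is needed.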
[0009] Furthermore, the reference distribution for reducing the variance of importance sampling estimation is constructed from the current target policy and the trajectory reward information as follows: extract the current target policy and the trajectory reward information contained in the trajectory data in the experience replay buffer; taking these as input, model the behavior policy distribution according to the principle of minimizing the variance of the importance sampling estimate, and characterize the form of the optimal behavior distribution theoretically to obtain the optimal behavior distribution modeling result; and, based on that result, construct the reference distribution.
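The patent text does not state the closed form of the optimal behavior distribution. As a hedged illustration, the classical variance-minimizing proposal for importance sampling is shown below; using it as the reference distribution is an assumption. The symbols are also assumptions: $\pi_\theta$ is the current target policy, $R(\tau)$ the trajectory reward, and $q$ any candidate sampling distribution.

```latex
q^{*}(\tau)
= \frac{\pi_\theta(\tau)\,\lvert R(\tau)\rvert}
       {\int \pi_\theta(\tau')\,\lvert R(\tau')\rvert \, d\tau'}

\mathbb{E}_{q}\!\left[\left(\frac{\pi_\theta(\tau)\,R(\tau)}{q(\tau)}\right)^{2}\right]
\;\ge\;
\left(\mathbb{E}_{q}\!\left[\frac{\pi_\theta(\tau)\,\lvert R(\tau)\rvert}{q(\tau)}\right]\right)^{2}
= \left(\int \pi_\theta(\tau)\,\lvert R(\tau)\rvert \, d\tau\right)^{2}
```

The inequality is Jensen's, and it holds with equality exactly when $\pi_\theta\lvert R\rvert / q$ is constant, that is, when $q = q^{*}$. This matches the paragraph's use of both the target policy and the reward information as inputs.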
[0010] Furthermore, the deviation of each sample in the experience replay buffer from the reference distribution is calculated as follows: samples to be evaluated are extracted one at a time from the experience replay buffer that stores the historical behavior policy trajectory data; using the reference distribution as the evaluation standard, the degree of deviation of the extracted sample from the reference distribution is calculated, giving the deviation value of that sample; and the extraction and calculation are repeated for every sample in the buffer, giving the set of deviation values for all samples in the experience replay buffer.
[0011] Furthermore, the samples are sorted and those with smaller deviations are selected as high-quality training data as follows: based on the set of deviation values, all samples in the experience replay buffer are sorted into a complete sequence ordered from smallest to largest deviation; and from this sequence a subset of samples with smaller deviations is selected to form the candidate sample set that serves as high-quality training data.
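The deviation calculation and sorting steps above can be sketched as follows. The patent does not define the deviation measure, so everything specific here is an assumption: the reference distribution is taken proportional to target probability times absolute reward, both distributions are normalized over the buffer contents, and the deviation is the absolute log-ratio between them. The function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def log_normalize(logits):
    """Convert unnormalized log-weights into a log-distribution over the buffer."""
    m = np.max(logits)
    return logits - (m + np.log(np.sum(np.exp(logits - m))))


def select_low_deviation(target_logp, behavior_logp, rewards,
                         keep_fraction=0.5, eps=1e-8):
    """Return indices of the lowest-deviation samples and all deviation values."""
    target_logp = np.asarray(target_logp, dtype=float)
    behavior_logp = np.asarray(behavior_logp, dtype=float)
    rewards = np.asarray(rewards, dtype=float)

    # Assumed reference: q*(x) proportional to pi(x) * |R(x)|.
    log_ref = log_normalize(target_logp + np.log(np.abs(rewards) + eps))
    # Distribution that actually generated the stored samples.
    log_beh = log_normalize(behavior_logp)

    # Assumed deviation measure: absolute log-ratio per sample.
    deviation = np.abs(log_ref - log_beh)

    # Sort ascending by deviation and keep the first k samples.
    order = np.argsort(deviation, kind="stable")
    k = max(1, int(np.ceil(keep_fraction * len(order))))
    return order[:k], deviation


# Usage: the last sample is far less likely under the target policy.
kept, deviation = select_low_deviation(
    target_logp=[-1.0, -1.0, -1.0, -5.0],
    behavior_logp=[-1.0, -1.0, -1.0, -1.0],
    rewards=[1.0, 1.0, 1.0, 1.0],
)
```

With this assumed reference form, zero-reward samples receive almost no reference mass and are ranked last; a practical variant might use a baseline-subtracted reward instead of the raw reward.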
[0012] Furthermore, the high-quality training data is modeled as an off-policy optimization problem with multiple behavior policies as follows: identify the historical behavior policies from which the high-quality training data originates, and associate each item of high-quality training data with the historical behavior policy that generated it, giving the correspondence between the data and the policies; and, based on this correspondence, establish the model of the off-policy optimization problem with multiple behavior policies.
[0013] Furthermore, the policy-level weighting mechanism that weights and fuses the data contributions of each behavior policy to obtain multi-behavior-policy sample data is introduced as follows: the sample count and data stability of each historical behavior policy are taken as the basis for allocating policy-level weights; based on this, a policy-level weight is calculated for each historical behavior policy; and, using these weights, the high-quality training data corresponding to each behavior policy is weighted and fused to obtain the weighted multi-behavior-policy sample data.
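One plausible reading of the weighting rule above is inverse-variance weighting, sketched below. The patent states only that weights depend on sample count and data stability; measuring stability by the variance of each policy's (importance-weighted) returns, and the function name `policy_level_weights`, are assumptions made for illustration.

```python
import numpy as np


def policy_level_weights(returns_by_policy, eps=1e-6):
    """Weight each behavior policy by n_k / (var_k + eps), normalized to sum to 1.

    returns_by_policy maps a policy id to the returns of its selected samples.
    The ratio n_k / var_k is the inverse of the variance of that policy's
    sample mean, so noisier or smaller groups contribute less.
    """
    keys = sorted(returns_by_policy)
    raw = []
    for k in keys:
        r = np.asarray(returns_by_policy[k], dtype=float)
        raw.append(len(r) / (np.var(r) + eps))
    raw = np.asarray(raw)
    return dict(zip(keys, raw / raw.sum()))


# Usage: policy 0 has more samples and lower variance than policy 1.
weights = policy_level_weights({
    0: [1.0, 1.1, 0.9, 1.0],
    1: [0.0, 2.0],
})
```

In this example policy 0 receives nearly all of the weight, because it has both more samples and much lower variance than policy 1.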
[0014] Furthermore, the optimization objective is constructed from the weighted multi-behavior-policy sample data as follows: the multi-behavior-policy sample data is taken as input, and the policy optimization objective of asynchronous reinforcement learning is constructed following the optimization logic of the policy gradient method.
The current target policy is updated using the policy gradient method as follows: the policy optimization objective constructed from the multi-behavior-policy sample data is taken as the basis for the update; the policy gradient method is invoked to iteratively compute the gradient of the current target policy with respect to this objective; and, based on the computed gradient, the current target policy is iteratively optimized to obtain the updated target policy.
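A toy version of the update step is sketched below for a tabular softmax policy, where the score function has a closed form. It is not the patented objective: the truncated importance ratio (treated as a fixed coefficient), the `clip` and `lr` parameters, and the function names are assumptions. The gradient it uses is the sum over behavior policies of w_k times the mean of rho * R * grad log pi(a).

```python
import numpy as np


def softmax(z):
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def weighted_pg_step(theta, batches, weights, lr=0.1, clip=5.0):
    """One gradient-ascent step using policy-level weighted off-policy data.

    batches[k] = (actions, behavior_probs, rewards) for behavior policy k.
    weights[k] is the policy-level weight of behavior policy k.
    """
    pi = softmax(theta)
    grad = np.zeros_like(theta)
    for k, (actions, behavior_probs, rewards) in batches.items():
        actions = np.asarray(actions)
        behavior_probs = np.asarray(behavior_probs, dtype=float)
        rewards = np.asarray(rewards, dtype=float)
        # Truncated importance ratio between target and behavior policy.
        rho = np.minimum(pi[actions] / behavior_probs, clip)
        # Score function of a softmax policy: onehot(a) - pi.
        score = np.eye(len(theta))[actions] - pi
        g_k = np.mean((rho * rewards)[:, None] * score, axis=0)
        grad += weights[k] * g_k
    return theta + lr * grad


# Usage: action 0 is rewarded in data from behavior policy 0.
theta = np.zeros(3)
batches = {
    0: ([0, 0, 1], [0.5, 0.5, 0.25], [1.0, 1.0, 0.0]),
    1: ([2], [0.25], [0.0]),
}
new_theta = weighted_pg_step(theta, batches, weights={0: 0.8, 1: 0.2})
```

Truncating the ratio trades a small bias for lower variance; a PPO-style or GRPO-style surrogate could replace this estimator without changing how the policy-level weights enter.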
[0015] The present invention also proposes another technical solution: an asynchronous reinforcement learning training system based on distribution matching sample selection, comprising an experience replay buffer module, a distribution matching sample screening module, a multi-behavior-policy weighted optimization module, and a policy update module:
Experience replay buffer module: builds the experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies;
Distribution matching sample screening module: constructs the reference distribution based on the current target policy and the trajectory reward information, calculates the degree of deviation between each sample and the reference distribution, and completes the sample screening;
Multi-behavior-policy weighted optimization module: models the multi-behavior-policy optimization problem from the correspondence between the high-quality training data and the historical behavior policies that generated it, and completes the policy-level weighted fusion;
Policy update module: constructs the optimization objective from the weighted multi-behavior-policy sample data and updates the current target policy using the policy gradient method.
[0016] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. This invention constructs a reference distribution based on the principle of minimizing the variance of importance sampling through a distribution matching sample screening mechanism, and screens samples that are close to the target strategy to avoid noise accumulation. This can significantly reduce the high variance problem caused by distribution shift and improve the training convergence stability.
[0017] 2. The sample utilization efficiency of this invention is significantly improved, and the same performance can be achieved with fewer samples; the number of training steps and the time consumption are reduced at the same accuracy, which can be adapted to large-scale model training and reduce computing power and time costs.
[0018] 3. This invention adopts a policy-level adaptive weighting mechanism that allocates weights based on sample count and data stability, overcoming the limitation of relying on information from a single policy and making fuller use of historical experience data.
[0019] 4. This invention does not rely on a specific policy gradient algorithm and can be adapted to mainstream optimization methods such as PPO and GRPO, making it suitable for various asynchronous and off-policy reinforcement learning scenarios.
Attached Figure Description
[0020] Figure 1 is a schematic diagram of the method steps of the present invention;
Figure 2 is a schematic diagram of the modules of the present invention;
Figure 3 is a performance comparison between the distribution-matching asynchronous reinforcement learning framework of the present invention and benchmark methods;
Figure 4 is a schematic diagram of the architecture of the present invention;
Figure 5 is a comparison of sample efficiency and training time between the distribution-matching asynchronous reinforcement learning framework of the present invention and the benchmark method.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0022] Example 1: Referring to Figure 1 and Figure 4, an asynchronous reinforcement learning training method based on distribution matching sample selection includes the following steps:
S1: Determine the current target policy, then construct an experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies, the trajectory data including trajectory reward information. Specifically:
The current target policy is determined as follows: in each iteration of asynchronous reinforcement learning training, the target policy to be optimized in that iteration is identified and set as the baseline for subsequent sample selection, reference distribution construction, and policy optimization.
The experience replay buffer is constructed as follows: an empty experience replay buffer is created to hold historical policy trajectory data; multiple historical behavior policies interact with the environment at different times to generate trajectory data containing trajectory reward information; and the collected trajectory data is written into the empty buffer, yielding an experience replay buffer that stores trajectory data from multiple historical behavior policies at different times.
[0023] S2: Based on the current target policy and the trajectory reward information, construct a reference distribution for reducing the variance of importance sampling estimation; calculate the deviation of each sample in the experience replay buffer from the reference distribution, sort the samples, and select the samples with smaller deviations as high-quality training data. Specifically:
The reference distribution is constructed as follows: extract the current target policy and the trajectory reward information contained in the trajectory data in the experience replay buffer; taking these as input, model the behavior policy distribution according to the principle of minimizing the variance of the importance sampling estimate, and characterize the form of the optimal behavior distribution theoretically to obtain the optimal behavior distribution modeling result; and, based on that result, construct the reference distribution.
The deviation of each sample from the reference distribution is calculated as follows: samples to be evaluated are extracted one at a time from the experience replay buffer that stores the historical behavior policy trajectory data; using the reference distribution as the evaluation standard, the degree of deviation of the extracted sample from the reference distribution is calculated, giving the deviation value of that sample; and the extraction and calculation are repeated for every sample in the buffer, giving the set of deviation values for all samples in the experience replay buffer.
The samples are sorted and selected as follows: based on the set of deviation values, all samples in the experience replay buffer are sorted into a complete sequence ordered from smallest to largest deviation; and from this sequence a subset of samples with smaller deviations is selected to form the candidate sample set. This candidate sample set serves as high-quality training data, while noise and invalid data are removed.
[0024] S3: Model the high-quality training data as an off-policy optimization problem with multiple behavior policies, and introduce a policy-level weighting mechanism that weights and fuses the data contributions of each behavior policy to obtain multi-behavior-policy sample data. Specifically:
The off-policy optimization problem is modeled as follows: identify the historical behavior policies from which the high-quality training data originates, and associate each item of high-quality training data with the historical behavior policy that generated it, giving the correspondence between the data and the policies; and, based on this correspondence, establish the model of the off-policy optimization problem with multiple behavior policies.
The policy-level weighting mechanism is introduced as follows: the sample count and data stability of each historical behavior policy are taken as the basis for allocating policy-level weights; based on this, a policy-level weight is calculated for each historical behavior policy; and, using these weights, the high-quality training data corresponding to each behavior policy is weighted and fused to obtain the weighted multi-behavior-policy sample data.
[0025] S4: Construct an optimization objective from the weighted multi-behavior-policy sample data, and update the current target policy using the policy gradient method. Specifically:
The optimization objective is constructed as follows: the multi-behavior-policy sample data is taken as input, and the policy optimization objective of asynchronous reinforcement learning is constructed following the optimization logic of the policy gradient method.
The current target policy is updated as follows: the policy optimization objective constructed from the multi-behavior-policy sample data is taken as the basis for the update; the policy gradient method is invoked to iteratively compute the gradient of the current target policy with respect to this objective; and, based on the computed gradient, the current target policy is iteratively optimized to obtain the updated target policy, thus achieving stable and efficient asynchronous training.
[0026] This invention introduces a distribution-matching asynchronous sample screening mechanism to evaluate and screen the asynchronous trajectory data in the experience replay pool. This effectively avoids the problem of training noise accumulation caused by directly using all historical asynchronous data in traditional methods, thereby improving the overall reliability and effectiveness of the training data.
[0027] Meanwhile, based on the principle of minimizing the variance of importance sampling estimation, this invention models the behavior policy distribution and theoretically characterizes the form of the optimal behavior distribution, so that the selected samples are closer, at the distribution level, to the optimal sampling structure for the target policy. This significantly reduces the high variance caused by distribution shift during importance sampling, improves estimation stability, and thus enhances the convergence stability of asynchronous reinforcement learning training. Together, these mechanisms improve the training outcome.
[0028] As shown in Figure 3, experimental results demonstrate that on the Qwen3-1.7B model, the method of this invention improves average reasoning performance across all test benchmarks by approximately 12.2% compared with the synchronous reinforcement learning method GRPO, and by approximately 6.4% compared with CISPO, the current state-of-the-art asynchronous reinforcement learning method. On the larger Qwen3-4B model, the invention also achieves stable improvements, with average accuracy gains of approximately 10.6% and 5.6% over GRPO and the advanced asynchronous method, respectively. These results indicate that the invention stably improves reasoning ability across model scales and generalizes well.
[0029] Secondly, regarding training efficiency, this invention significantly improves sample utilization efficiency, enabling the model to reach the same performance level with fewer training samples, which reduces training cost and improves data utilization. It also shortens actual training time: as shown in Figure 5, at the same target accuracy, the method of the present invention reduces the number of training steps on the Qwen3-1.7B model by an average of about 34.7% compared with existing advanced asynchronous methods. In addition, under the same experimental conditions, it reduces overall training time by an average of about 37.2% compared with existing advanced asynchronous methods, which improves the time efficiency of training and makes it especially suitable for large-scale model training scenarios.
[0030] Furthermore, this invention proposes a joint optimization mechanism over multiple behavior policies, which models data generated by multiple behavior policies in a unified way and introduces an aggregation method with adaptive weighting based on variance and sample count. This lets information from policies of different origin take part in the overall optimization with more reasonable weights. The mechanism alleviates the information limitation caused by depending on only the single most recent policy and makes deeper use of historical experience.
[0031] Finally, the overall framework of this invention has good versatility and scalability. It does not depend on a specific policy gradient algorithm and can be effectively adapted when combined with mainstream policy optimization methods such as PPO and GRPO. Therefore, it has strong engineering application value and transferability and can be widely applied in asynchronous reinforcement learning and off-policy reinforcement learning scenarios.
[0032] In summary, this invention has achieved significant technical effects in reducing importance sampling variance, improving asynchronous data utilization efficiency, enhancing multi-behavior strategy fusion capabilities, and improving training stability.
[0033] Example 2: Referring to Figure 2 and Figure 4, the present invention also discloses another embodiment: an asynchronous reinforcement learning training system based on distribution matching sample selection, comprising an experience replay buffer module, a distribution matching sample screening module, a multi-behavior-policy weighted optimization module, and a policy update module:
Experience replay buffer module: builds the experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies;
Distribution matching sample screening module: constructs the reference distribution based on the current target policy and the trajectory reward information, calculates the degree of deviation between each sample and the reference distribution, and completes the sample screening;
Multi-behavior-policy weighted optimization module: models the multi-behavior-policy optimization problem from the correspondence between the high-quality training data and the historical behavior policies that generated it, and completes the policy-level weighted fusion;
Policy update module: constructs the optimization objective from the weighted multi-behavior-policy sample data and updates the current target policy using the policy gradient method.
[0034] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.
[0035] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. An asynchronous reinforcement learning training method based on distribution matching sample selection, characterized in that it includes the following steps:
Determine the current target policy, then construct an experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies, the trajectory data including trajectory reward information;
Based on the current target policy and the trajectory reward information, construct a reference distribution for reducing the variance of importance sampling estimation;
Calculate the deviation of each sample in the experience replay buffer from the reference distribution, sort the samples, and select the samples with smaller deviations as high-quality training data;
Model the high-quality training data as an off-policy optimization problem with multiple behavior policies, and introduce a policy-level weighting mechanism that weights and fuses the data contributions of each behavior policy to obtain multi-behavior-policy sample data;
Construct an optimization objective from the weighted multi-behavior-policy sample data, and update the current target policy using the policy gradient method.
2. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 1, characterized in that:
The current target policy is determined as follows: in each iteration of asynchronous reinforcement learning training, the target policy to be optimized in that iteration is identified and set as the baseline for subsequent sample selection, reference distribution construction, and policy optimization;
The experience replay buffer that stores trajectory data generated at different times by multiple historical behavior policies is constructed as follows: an empty experience replay buffer is created to hold historical policy trajectory data; multiple historical behavior policies interact with the environment at different times to generate trajectory data containing trajectory reward information; and the collected trajectory data is written into the empty buffer, yielding an experience replay buffer that stores trajectory data from multiple historical behavior policies at different times.
3. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 2, characterized in that the specific steps for constructing, based on the current target policy and the trajectory reward information, the reference distribution for reducing the variance of importance sampling estimation are as follows:
extracting the current target policy and the trajectory reward information contained in the trajectory data in the experience replay buffer;
taking the current target policy and the trajectory reward information as input, modeling the behavior policy distribution according to the principle of minimizing the variance of importance sampling estimation, and theoretically characterizing the form of the optimal behavior distribution, to obtain an optimal behavior distribution modeling result;
constructing, based on the optimal behavior distribution modeling result, the reference distribution for reducing the variance of importance sampling estimation.
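The claim does not give a closed form for the reference distribution. A standard result in importance sampling is that the proposal minimizing the variance of an estimate of E_p[f] is proportional to p(x)|f(x)|. Assuming the reference distribution follows that form, with the target policy probability as p and the trajectory reward as f, a minimal sketch normalized over a finite batch could look like this. The function name, the `eps` smoothing term, and the batch-level normalization are all assumptions.

```python
import math


def reference_log_probs(target_logps, rewards, eps=1e-8):
    """Log-probabilities of q_ref(i) proportional to pi_target(tau_i) * |R_i|.

    target_logps: log pi_target(tau_i) for each buffered trajectory.
    rewards:      trajectory rewards R_i.
    eps:          assumed smoothing so that zero rewards keep a finite log.
    """
    unnorm = [lp + math.log(abs(r) + eps) for lp, r in zip(target_logps, rewards)]
    # Log-sum-exp normalization for numerical stability.
    m = max(unnorm)
    log_z = m + math.log(sum(math.exp(u - m) for u in unnorm))
    return [u - log_z for u in unnorm]
```

Working in log space avoids underflow when trajectory probabilities are products of many per-step probabilities.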
4. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 3, characterized in that the specific steps for calculating the deviation of each sample in the experience replay buffer from the reference distribution are as follows:
extracting, one by one, each sample to be evaluated from the experience replay buffer storing the historical behavior policy trajectory data;
taking the reference distribution for reducing the variance of importance sampling estimation as the evaluation standard, calculating the degree of deviation of the currently extracted sample from the reference distribution, to obtain the deviation value of that sample;
repeating the sample extraction and calculation for all samples in the experience replay buffer, to obtain the set of deviation values of all samples in the experience replay buffer.
5. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 4, characterized in that the specific steps for sorting the samples and selecting those with smaller deviations as high-quality training data are as follows:
sorting all samples in the experience replay buffer based on the set of deviation values, to obtain a complete sample sequence ordered from smallest to largest deviation;
selecting, from the complete sample sequence, a subset of samples with smaller deviations to form a candidate sample set, which serves as the high-quality training data.
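The deviation calculation of claim 4 and the sort-and-select step of claim 5 can be sketched together. The claims do not fix a deviation measure or a selection threshold, so the absolute log-ratio used here and the `keep_ratio` parameter are assumptions chosen for illustration.

```python
def deviations_from_reference(behavior_logps, reference_logps):
    """Assumed deviation measure: |log mu(tau_i) - log q_ref(tau_i)| per sample."""
    return [abs(b - q) for b, q in zip(behavior_logps, reference_logps)]


def select_low_deviation(samples, deviations, keep_ratio=0.5):
    """Sort samples by ascending deviation and keep the leading fraction."""
    order = sorted(range(len(samples)), key=lambda i: deviations[i])
    k = max(1, int(len(samples) * keep_ratio))  # keep at least one sample
    return [samples[i] for i in order[:k]]


# Usage: four samples, keep the half closest to the reference distribution.
devs = deviations_from_reference([-1.0, -2.0, -3.0, -0.5], [-1.1, -4.0, -3.0, -2.5])
selected = select_low_deviation(["a", "b", "c", "d"], devs, keep_ratio=0.5)
```

A fixed keep ratio is only one option; a deviation threshold would serve the same role if the number of retained samples should vary with data quality.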
6. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 5, characterized in that the specific steps for modeling the high-quality training data as an off-policy optimization problem with multiple behavior policies are as follows:
identifying the multiple historical behavior policies from which the high-quality training data originates, and associating each item of high-quality training data with the historical behavior policy that generated it, to obtain the correspondence between the high-quality training data and the respective historical behavior policies;
establishing, based on that correspondence, an off-policy optimization problem model involving multiple behavior policies.
7. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 6, characterized in that the specific steps for introducing the policy-level weighting mechanism to weight and fuse the data contributions of each behavior policy, to obtain the multi-behavior policy sample data, are as follows:
taking the sample size and the data stability of each historical behavior policy as the basis for allocating policy-level weights;
calculating, on that basis, the policy-level weight corresponding to each historical behavior policy;
weighting and fusing, according to the policy-level weights, the high-quality training data corresponding to each behavior policy, to obtain weighted and fused multi-behavior policy sample data.
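The claim names sample size and data stability as the basis for the policy-level weights but gives no formula. One possible reading, shown purely as a sketch, scores each behavior policy by n_k / (1 + var_k), where n_k is its sample count and var_k is the variance of its importance ratios, and then normalizes the scores. The scoring rule and the function name are assumptions.

```python
from collections import defaultdict
from statistics import pvariance


def policy_level_weights(samples):
    """samples: list of (policy_id, importance_ratio) pairs.

    Returns a dict mapping policy_id to a weight; the weights sum to 1.
    Assumed score: sample count divided by (1 + variance of the ratios),
    so larger and more stable groups receive more weight.
    """
    groups = defaultdict(list)
    for policy_id, ratio in samples:
        groups[policy_id].append(ratio)
    scores = {pid: len(r) / (1.0 + pvariance(r)) for pid, r in groups.items()}
    total = sum(scores.values())
    return {pid: s / total for pid, s in scores.items()}


# Usage: two policies with equal sample counts but different stability.
weights = policy_level_weights([("p0", 1.0), ("p0", 1.0), ("p1", 0.0), ("p1", 2.0)])
```

In the fusion step, each selected sample would then carry the weight of the behavior policy it came from.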
8. The asynchronous reinforcement learning training method based on distribution matching sample selection according to claim 7, characterized in that:
the specific steps for constructing the optimization objective using the weighted multi-behavior policy sample data are as follows:
taking the weighted multi-behavior policy sample data as input, and constructing the policy optimization objective of asynchronous reinforcement learning according to the optimization logic of the policy gradient method;
the specific steps for updating the current target policy using the policy gradient method are as follows:
taking the policy optimization objective of asynchronous reinforcement learning, constructed from the multi-behavior policy sample data, as the basis for the update;
invoking the policy gradient method and iteratively calculating the gradient of the current target policy with respect to the policy optimization objective, to obtain the gradient update calculation result of the current target policy;
iteratively optimizing the current target policy based on the gradient update calculation result, to obtain the updated target policy.
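The claim leaves the exact objective open. A minimal sketch, assuming a single-step softmax policy over discrete actions and an objective of the form J = (1/W) * sum_i w_i * rho_i * A_i, where rho_i is the importance ratio between the target and behavior policies, A_i is an advantage, w_i is the policy-level weight, and W is the sum of the weights, could look like this. The single-step setting, the objective, and every name below are assumptions.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def policy_gradient_step(logits, batch, lr=0.1):
    """One gradient-ascent step on the weighted off-policy objective.

    logits: parameters of the target policy (one logit per action).
    batch:  list of (action, behavior_prob, advantage, policy_weight).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    total_w = sum(w for _, _, _, w in batch) or 1.0
    for action, behavior_prob, advantage, w in batch:
        ratio = probs[action] / behavior_prob  # importance ratio rho_i
        coef = w * ratio * advantage / total_w
        for j in range(len(logits)):
            # Gradient of log softmax: indicator(j == action) - probs[j].
            grad[j] += coef * ((1.0 if j == action else 0.0) - probs[j])
    return [x + lr * g for x, g in zip(logits, grad)]


# Usage: one sample with positive advantage raises the logit of its action.
new_logits = policy_gradient_step([0.0, 0.0], [(0, 0.5, 1.0, 1.0)], lr=0.1)
```

The gradient uses the identity that the derivative of rho_i equals rho_i times the derivative of log pi, which is why the importance ratio appears as a multiplier on the log-probability gradient.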
9. An asynchronous reinforcement learning training system based on distribution matching sample selection, applied to the asynchronous reinforcement learning training method based on distribution matching sample selection according to any one of claims 1-8, characterized in that it comprises an experience replay buffer module, a distribution matching sample selection module, a multi-behavior policy weighted optimization module, and a policy update module, wherein:
the experience replay buffer module is used to construct an experience replay buffer to store trajectory data generated by multiple historical behavior policies at different times;
the distribution matching sample selection module is used to construct a reference distribution based on the current target policy and the trajectory reward information, calculate the degree of deviation of each sample from the reference distribution, and complete the sample selection;
the multi-behavior policy weighted optimization module is used to model a multi-behavior policy optimization problem from the correspondence between the high-quality training data and the respective historical behavior policies, and to complete the policy-level weighted fusion;
the policy update module is used to construct an optimization objective using the weighted multi-behavior policy sample data, and to update the current target policy using the policy gradient method.