A method and system for generating chemical descriptors based on reinforcement learning
A reinforcement learning-based chemical descriptor generation method progressively reduces the dimensionality of the input and uses property prediction feedback for optimization, addressing the problem that features generated by existing technologies lack clear chemical meaning. The method automatically optimizes the descriptor generation process, reduces computational complexity, and is suited to efficient chemical descriptor generation. It also reduces the reliance on human experience or high-cost computation found in existing technologies, thereby improving prediction performance and adaptability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANKAI UNIV
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-19
AI Technical Summary
Existing chemical descriptor generation methods rely on manual experience or high-cost computation, making it difficult to adaptively optimize. Furthermore, the generated features lack clear chemical meaning, failing to meet the needs of high-throughput screening and rapid iteration.
A reinforcement learning-based chemical descriptor generation method is adopted, which maps high-dimensional input data into one-dimensional vector descriptors through a stepwise dimensionality reduction process, and optimizes the method using property prediction feedback.
It achieves automatic optimization of descriptor generation strategy, reduces computational complexity, is suitable for high-throughput applications, improves prediction performance, and reduces reliance on human experience and high-cost computation.
Figure CN122245495A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computational chemistry, cheminformatics, and artificial intelligence, and more specifically to a method and system for generating chemical descriptors based on reinforcement learning.
Background Technology
[0002] Chemical descriptors are a set of characteristic parameters used to quantitatively characterize the structure, composition, and physicochemical properties of molecules or materials, serving as a crucial bridge between microscopic chemical structures and macroscopic properties. In computational chemistry, quantitative structure-activity relationship (QSAR), reaction performance prediction, and materials and drug screening, chemical descriptors are widely used as inputs to machine learning and statistical models, and their quality directly affects the accuracy and generalization ability of prediction results.
[0003] Existing methods for generating chemical descriptors mainly include those based on two-dimensional molecular topology, geometric descriptors based on three-dimensional spatial conformations, and electronic structure descriptors based on quantum chemical calculations. Two-dimensional descriptors have low computational cost but limited ability to characterize spatial and electronic effects. While three-dimensional and quantum chemical descriptors possess strong physical meaning, they typically rely on high-cost computational methods such as conformational search or density functional theory, resulting in low computational efficiency and difficulty in meeting the application requirements of high-throughput screening and rapid iteration.
[0004] In recent years, some studies have introduced deep learning models to learn the representation of chemical structures in order to reduce reliance on manually designed descriptors. However, the features generated by such methods are usually implicit high-dimensional representations that lack explicit chemical meaning, and the descriptor generation process itself cannot be explicitly optimized according to the specific prediction task, resulting in insufficient interpretability and poor controllability.
[0005] Therefore, existing technologies generally suffer from shortcomings such as fixed descriptor generation strategies, strong reliance on human experience or high-cost computation, and difficulty in adaptive optimization for specific task objectives. There is an urgent need for a new descriptor generation method that can reduce computational costs while ensuring expressive power, and can continuously optimize the descriptor construction process based on feedback from downstream tasks.
Summary of the Invention
[0006] In view of the above problems, the present invention is proposed to provide a reinforcement learning-based chemical descriptor generation method and system that overcomes or at least partially solves the above problems, and can be used for molecular or material property prediction, quantitative structure-activity relationship modeling, and data-driven chemical research and design.
[0007] To achieve the above objectives, the present invention adopts the following technical solution:
[0008] In a first aspect, embodiments of the present invention provide a chemical descriptor generation method based on reinforcement learning, comprising the following steps:
Step S1. Obtain the environmental input information of the chemical system, including the three-dimensional coordinates and attribute parameters of each atom, and perform structured representation and encoding of the environmental input information to obtain the initial state vector for reinforcement learning decision-making;
Step S2. Input the initial state vector obtained in step S1 into the reinforcement learning decision model, execute the reinforcement learning decision, and output the action and hyperparameter set used to reduce the dimensionality of the current environment input information; the action consists of α and β, which together form the chain of dimensionality reduction operators, where α determines the selection information of the dimensionality reduction environment function from three dimensions to two dimensions, and β determines the selection information of the dimensionality reduction environment function from two dimensions to one dimension;
Step S3. Use the action and hyperparameter set output from step S2 to configure the dimensionality reduction environment function, progressively reducing the dimensionality of the environmental input information of the chemical system to generate a chemical descriptor in one-dimensional vector form;
Step S4. Input the chemical descriptor generated in step S3 into the property prediction model to predict the properties and obtain the corresponding chemical property prediction results;
Step S5. Generate a reward function based on the error between the chemical property prediction results and the actual annotations, and use the reward function to optimize the reinforcement learning decision model.
[0009] Furthermore, in step S1, the initial feature information of each atom includes the three-dimensional coordinates and attribute parameters of each atom, wherein the attribute parameters include one or more of atomic number, atomic radius, electronegativity, and number of valence electrons.
[0010] Furthermore, the method also includes updating the state of the reinforcement learning decision model. The state update includes: after the reinforcement learning decision model outputs an action and drives the execution of the dimensionality reduction environment function, obtaining an intermediate representation or a generated chemical descriptor from the dimensionality reduction environment function; performing vectorized summarization on the intermediate representation or chemical descriptor to obtain a fixed-dimensional summary vector; and fusing the summary vector with the state vector obtained by encoding the environment input to form an updated state vector for subsequent reinforcement learning decisions. The intermediate representation includes at least one of the following or a combination thereof:
Aligned 3D coordinate representation or projection parameters, including projection basis, normal vector and projection origin;
Neighborhood structure information, including a neighborhood list and local statistics within the neighborhood radius;
Three-dimensional continuous field representation or three-dimensional voxel field representation;
Two-dimensional projection point set representation;
Two-dimensional feature map representation or two-dimensional field representation;
Two-dimensional partition statistics, including the mean, maximum or energy statistics after binning or partitioning.
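The state update described above can be sketched as follows. This is only an illustrative sketch: the function names `summarize` and `update_state`, the choice of four global statistics as the summary, and the use of plain concatenation as the fusion step are all assumptions, since the text does not fix the summarization or fusion operators.

```python
import numpy as np


def summarize(representation) -> np.ndarray:
    """Reduce an intermediate representation of any shape to a
    fixed-length summary vector (assumed statistics: mean, max, min, energy)."""
    flat = np.asarray(representation, dtype=float).ravel()
    if flat.size == 0:
        raise ValueError("empty representation")
    return np.array([flat.mean(), flat.max(), flat.min(), float(np.sum(flat ** 2))])


def update_state(encoded_input, representation) -> np.ndarray:
    """Fuse the encoded environment input with the summary vector.
    Concatenation is an assumed fusion operator."""
    encoded = np.asarray(encoded_input, dtype=float)
    return np.concatenate([encoded, summarize(representation)])


# Usage: an 8-dimensional encoded state and a 3x4 two-dimensional feature map.
s0 = np.zeros(8)
s1 = update_state(s0, np.arange(12).reshape(3, 4))
```

Because the summary has a fixed length regardless of the shape of the intermediate representation, the updated state vector keeps the same dimension whether a 3D field, a 2D feature map, or a 1D descriptor is summarized.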
[0011] Further, in step S4, the property prediction model includes a regression prediction model or a classification prediction model, wherein the regression prediction model is used to output the continuous properties of the chemical system, including solubility, band gap and formation energy; the classification prediction model is used to output the discrete labels of the chemical system, including activity and inactivity.
[0012] Furthermore, step S5 specifically includes: Obtain the prediction error between the chemical property prediction results and the actual annotations. Based on the preset mapping relationship between the prediction error and the reward value, convert the prediction error of the current iteration into the corresponding reward value. The preset mapping relationship includes a strictly monotonically decreasing function of the prediction error, so that the smaller the error, the higher the reward value. The obtained reward value is used as environmental feedback in the reinforcement learning decision model and input into the policy optimization algorithm to optimize the action and hyperparameter set by maximizing the expected reward.
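One possible form of the preset mapping between prediction error and reward value is sketched below. The exponential form and the `scale` parameter are assumptions chosen only because they are strictly monotonically decreasing in the error; the text does not prescribe a particular function, and a simple negated error would satisfy the same condition.

```python
import math


def reward_from_error(error: float, scale: float = 1.0) -> float:
    """Map a non-negative prediction error to a reward in (0, 1].
    exp(-error / scale) is strictly decreasing, so a smaller error
    always yields a higher reward."""
    if error < 0:
        raise ValueError("prediction error must be non-negative")
    if scale <= 0:
        raise ValueError("scale must be positive")
    return math.exp(-error / scale)


# Usage: compare the rewards of two iterations.
r_good = reward_from_error(0.1)
r_bad = reward_from_error(0.5)
```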
[0013] In a second aspect, embodiments of the present invention provide a chemical descriptor generation system based on reinforcement learning, comprising:
Data input and feature processing module: used to acquire the initial feature information of each atom in the chemical system, structurally represent the initial feature information, and input it into the pre-trained large model for encoding to obtain the overall feature vector representing the chemical system, which serves as the initial state vector of the reinforcement learning decision model;
Reinforcement learning decision module: used to receive the initial state vector, perform reinforcement learning decisions, and output the action and hyperparameter set for dimensionality reduction of the current environment input information; the action consists of α and β, which together form the chain of dimensionality reduction operators, where α determines the selection information of the dimensionality reduction environment function from three dimensions to two dimensions, and β determines the selection information of the dimensionality reduction environment function from two dimensions to one dimension;
Dimensionality reduction environment module: used to configure the dimensionality reduction environment function according to the action and the determined hyperparameter setting information, and to perform stepwise dimensionality reduction on the initial feature information of each atom in the chemical system to generate a chemical descriptor z in one-dimensional vector form;
Property prediction module: used to input the chemical descriptor into the property prediction model to predict the property and obtain the corresponding chemical property prediction results;
Reward optimization module: used to generate reward values based on the prediction error of the chemical property prediction results, and to optimize and update the reinforcement learning decision model using the reward values.
[0014] Preferably, in the data input and feature processing module, the initial feature information of each atom acquired includes the three-dimensional coordinates and attribute parameters of each atom, wherein the attribute parameters include one or more of atomic number, atomic radius, electronegativity, and valence electron number.
[0015] Preferably, the reinforcement learning decision module further includes a state update unit. The state update unit is used to obtain the intermediate representation or generated chemical descriptor output by the dimensionality reduction environment function after the dimensionality reduction environment module performs dimensionality reduction based on the action, and to perform vectorized summary on it. Then, it is fused with the state vector obtained by encoding the environment input to generate an updated state vector for subsequent reinforcement learning decision-making. The intermediate representation includes at least one of the following or a combination thereof: aligned three-dimensional coordinate representation or projection parameters, neighborhood structure information, three-dimensional continuous field or voxel field representation, two-dimensional projection point set representation, two-dimensional feature map or two-dimensional field representation, and two-dimensional partition statistics.
[0016] Preferably, in the property prediction module, the property prediction model for predicting the properties of the generated chemical descriptor includes a regression prediction model or a classification prediction model, wherein the regression prediction model is used to output the continuous properties of the chemical system, including solubility, band gap and formation energy; the classification prediction model is used to output the discrete labels of the chemical system, including activity and inactivity.
[0017] Preferably, the reward optimization module specifically performs the following steps: Obtain the prediction error between the chemical property prediction results and the actual annotations. Based on the preset mapping relationship between the prediction error and the reward value, convert the prediction error of the current iteration into the corresponding reward value. The preset mapping relationship includes a strictly monotonically decreasing function of the prediction error, so that the smaller the error, the higher the reward value. The obtained reward value is used as environmental feedback in the reinforcement learning decision model and input into the policy optimization algorithm to optimize the action and hyperparameter set by maximizing the expected reward.
[0018] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a chemical descriptor generation method based on reinforcement learning, which has the following beneficial effects: This invention views the chemical descriptor generation process as a stepwise decision-making process in a dimensionality reduction environment. A reinforcement learning agent controls the selection of the dimensionality reduction function and its hyperparameter settings through output actions, gradually reducing the high-dimensional input data, which contains three-dimensional spatial information and atomic property information, into chemical descriptors in one-dimensional vector form. The generated descriptors are input into a machine learning model for property prediction, and the reinforcement learning agent is rewarded or penalized based on changes in prediction performance, thereby achieving continuous optimization of the descriptor generation strategy. Specifically, this invention can achieve the following effects:
(1) The chemical descriptor generation process is incorporated into the reinforcement learning optimization framework to achieve automatic optimization of the descriptor generation strategy;
(2) By gradually reducing the dimensionality, high-dimensional structural information is mapped to low-dimensional descriptors, thereby reducing computational complexity;
(3) The descriptor construction method can be dynamically adjusted according to the specific chemical properties of the prediction task to improve prediction performance;
(4) Reliance on human experience and high-cost quantum chemical calculations is reduced, making the method suitable for high-throughput applications;
(5) The descriptor generation process is controllable and scalable, and has good engineering application value.
Description of the Drawings
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0020] Figure 1 is a schematic diagram of the overall process of the method provided in the embodiments of the present invention; Figure 2 is a schematic diagram of the overall system structure provided in an embodiment of the present invention.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0022] For ease of understanding, the symbols and object definitions in this invention are given first: The environmental input information of a chemical system is denoted as ε and is composed of N atoms, where the entry for the i-th atom contains the three-dimensional coordinates of that atom and a vector of attribute parameters for that atom.
[0023] 3D coordinates: a three-component real vector giving the position of the atom in space.
[0024] Attribute parameter vector: This includes one or more of the following: element type, charge, valence state, atomic radius, mass, number of bonds, and local environment category.
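The environmental input ε defined above can be held in a small container such as the following sketch. The class name `ChemicalSystem`, its field names, and the water-like example values are hypothetical and serve only as illustration; the coordinates are approximate and the two attribute columns (atomic number, number of valence electrons) are one possible choice among the attribute parameters listed in the text.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ChemicalSystem:
    """Environmental input: N atoms, each with 3D coordinates and attributes."""
    coords: np.ndarray      # shape (N, 3)
    attributes: np.ndarray  # shape (N, d)

    def __post_init__(self):
        # Accept nested lists and validate that both arrays describe the same atoms.
        self.coords = np.asarray(self.coords, dtype=float)
        self.attributes = np.asarray(self.attributes, dtype=float)
        if self.coords.ndim != 2 or self.coords.shape[1] != 3:
            raise ValueError("coords must have shape (N, 3)")
        if self.attributes.ndim != 2 or self.attributes.shape[0] != self.coords.shape[0]:
            raise ValueError("attributes must have shape (N, d)")

    @property
    def n_atoms(self) -> int:
        return self.coords.shape[0]


# Illustrative water-like system (approximate geometry, for demonstration only).
water = ChemicalSystem(
    coords=[[0.000, 0.000, 0.117],
            [0.000, 0.757, -0.469],
            [0.000, -0.757, -0.469]],
    attributes=[[8, 6], [1, 1], [1, 1]],  # atomic number, valence electrons
)
```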
[0025] The goal is to convert ε into a one-dimensional chemical descriptor z using a "dimensionality reduction operator chain" controlled by a reinforcement learning policy, and then input it into a property prediction model to obtain chemical property prediction results.
[0026] The state vector of the reinforcement learning decision model is denoted as s, and the initial state vector is denoted as S. It is the "encoded representation" of the environmental input of the chemical system and is used for policy network decision-making.
[0027] The chemical descriptor is denoted as z. It is a one-dimensional fixed-length vector obtained by performing three-dimensional to two-dimensional and two-dimensional to one-dimensional processing on the atomic input through the dimensionality reduction environment function chain, and is used as the input for the property prediction model.
[0028] The action consists of α and β. It is used to select and configure the dimensionality reduction environment function chain, and to generate the corresponding hyperparameters.
[0029] Referring to Figure 1, this invention discloses a method for generating chemical descriptors based on reinforcement learning, comprising the following steps:
Step S1. Obtain the environmental input information of the chemical system, including the three-dimensional coordinates and attribute parameters of each atom, and perform structured representation and encoding of the environmental input information to obtain the initial state vector for reinforcement learning decision-making;
Step S2. Input the initial state vector obtained in step S1 into the reinforcement learning decision model, execute the reinforcement learning decision, and output the action and hyperparameter set used to reduce the dimensionality of the current environment input information;
Step S3. Use the action and hyperparameter set output from step S2 to configure the dimensionality reduction environment function, progressively reducing the dimensionality of the environmental input information of the chemical system to generate a chemical descriptor in one-dimensional vector form;
Step S4. Input the chemical descriptor generated in step S3 into the property prediction model to predict the properties and obtain the corresponding chemical property prediction results;
Step S5. Generate a reward function based on the error between the chemical property prediction results and the actual annotations, and use the reward function to optimize the reinforcement learning decision model.
[0030] Each step is explained in detail below.
[0031] Step S1. Obtain the initial feature information of each atom in the chemical system, represent the initial feature information in a structured manner, and encode the structured representation using a pre-trained large model to obtain the overall feature vector representing the chemical system, which serves as the initial state vector S of the reinforcement learning decision model.
[0032] Although the S encoded by the pre-trained large model is a one-dimensional vector, it is a state "used for decision-making" and is not equivalent to a "chemical descriptor used for the prediction task." The core of this invention lies in learning a policy that selects and configures a chain of dimensionality reduction operators so as to generate a z that is better suited to downstream prediction.
[0033] Step S2. Input the initial state vector S into the reinforcement learning decision model, execute the reinforcement learning decision, and output the action, which consists of α and β, where α is used to configure the dimensionality reduction function from 3D to 2D.
[0034] β is used to configure the dimensionality reduction function from two dimensions to one dimension.
[0035] These actions collectively constitute a chain of dimensionality reduction operators, in which the 3D-to-2D function configured by α is applied first and the 2D-to-1D function configured by β is applied to its output.
[0036] This enables the dimensionality reduction environment module to perform geometric dimensionality reduction and statistical mapping from three-dimensional to two-dimensional and from two-dimensional to one-dimensional on the environment input ε according to the action.
[0037] Specifically, in one implementation, the selection information for the 3D-to-2D dimensionality reduction function includes at least one or a combination of the following: (1) Projection method: used to map three-dimensional coordinates to two-dimensional plane coordinates, such as orthogonal projection; principal axis projection, which projects along the PCA principal directions or inertial principal axes; local tangent plane projection, which establishes a local coordinate system at the central atom or centroid; and learned projection, in which the policy outputs the projection basis vectors or projection matrix.
[0038] (2) Coordinate alignment method: used to ensure consistency of rotation / translation processing, such as centroid translation alignment, PCA principal axis alignment, alignment with specific atoms or specific bond directions, etc.
[0039] (3) Spatial partitioning strategy: the discretization / organization method of the two-dimensional plane, such as regular grid partitioning, polar coordinate partitioning, and adaptive partitioning.
[0040] (4) Fusion methods from point to two-dimensional feature map, such as point cloud rasterization counting / occupancy mapping, kernel density estimation (KDE) to generate density field, and neighborhood aggregation followed by mapping (i.e., first aggregate attributes by neighborhood, then project rasterization).
[0041] (5) Field representation and integral projection method: This method is used to first map a discrete set of atoms into a three-dimensional continuous field representation, and then obtain a two-dimensional field through integration or thickness projection along a specified direction. For example: First, the set of atomic points is expanded into a three-dimensional scalar field, such as a density field, charge field, potential energy field, or vector field, such as a force field or gradient field; then, a two-dimensional representation is generated by using line integral projection (Ray / Line integral) along the projection direction, thickness integration (slab integration) along the normal direction, or Radon transform integral projection; multiple channels can be selected, and fields are constructed and integrally projected separately according to element type / attribute channel.
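Among the options listed above, centroid translation alignment followed by principal axis projection can be sketched as below. The function name `pca_projection` and the use of an unweighted centroid are assumptions; a mass-weighted inertial axis would be an equally valid reading of the text.

```python
import numpy as np


def pca_projection(coords):
    """Centroid-align the atoms and project them onto the plane spanned by
    the two leading PCA directions.

    Returns (uv, basis, normal, origin):
      uv     -- (N, 2) in-plane coordinates
      basis  -- (3, 2) matrix whose columns span the projection plane
      normal -- (3,) unit normal of the plane (least-variance direction)
      origin -- (3,) projection origin (unweighted centroid, an assumption)
    """
    coords = np.asarray(coords, dtype=float)
    origin = coords.mean(axis=0)
    centered = coords - origin
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered)
    basis = vt[:2].T
    normal = vt[2]
    return centered @ basis, basis, normal, origin


# Usage: four atoms lying in the z = 0 plane.
planar = [[0, 0, 0], [2, 0, 0], [0, 1, 0], [2, 1, 0]]
uv, basis, normal, origin = pca_projection(planar)
```

For a planar arrangement the projection loses no information: the in-plane distances between atoms are preserved and the recovered normal is perpendicular to the plane of the atoms.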
[0042] In one specific implementation, the hyperparameter settings of the 3D-to-2D dimensionality reduction function include at least one or a combination of the following: (1) Projection plane parameters: projection plane normal vector n or projection basis.
[0043] (2) Projection origin: o, which can be the centroid, a specified atom, or a policy output.
[0044] (3) Clipping range: clipping radius R or projection plane range.
[0045] (4) Spatial grid resolution and grid step size Δ.
[0046] (5) Neighborhood radius: used for neighborhood selection or local statistics.
[0047] (6) Kernel function type and its parameters: kernel function types such as Gaussian kernels, and the bandwidth / smoothing parameter h.
[0048] (7) Field type selection: density field / charge field / potential energy field / custom attribute field (can be defined by channel).
[0049] (8) Integration method: discrete summation or numerical integration (trapezoidal rule, etc.), together with the integration thickness and step size.
[0050] (9) Attribute mapping parameters: parameters used to map atomic attributes to channel weights.
[0051] In one specific implementation, the selection information for the two-dimensional to one-dimensional dimensionality reduction function includes at least one of the following or a combination thereof: (1) Statistical aggregation methods: such as global average pooling, global max pooling, partition pooling, and multi-scale pyramid pooling.
[0052] (2) Binning / partitioning method: such as statistics by grid region, by radial bin, by angular bin, or by joint (r-θ) bin.
[0053] (3) Spectral decomposition / transformation methods: such as two-dimensional FFT / DCT, low-frequency coefficients, spectral feature statistics, wavelet multi-scale energy, etc.
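As one illustration of the spectral option in item (3), the sketch below keeps the magnitudes of the lowest-frequency two-dimensional FFT coefficients as the one-dimensional descriptor. The function name `lowfreq_fft_descriptor` and the choice of a k-by-k block of non-negative frequencies are assumptions; DCT or wavelet energies could be substituted.

```python
import numpy as np


def lowfreq_fft_descriptor(feature_map, k: int = 3) -> np.ndarray:
    """Return the magnitudes of the k x k lowest non-negative-frequency
    2D FFT coefficients of a feature map as a vector of length k * k."""
    fm = np.asarray(feature_map, dtype=float)
    if fm.ndim != 2 or k > min(fm.shape):
        raise ValueError("feature_map must be 2D and at least k x k")
    spectrum = np.fft.fft2(fm)
    # In NumPy's layout, index 0 is the zero frequency, so the top-left
    # block holds the lowest non-negative frequencies.
    return np.abs(spectrum[:k, :k]).ravel()


# Usage: a constant map has all its energy in the zero-frequency term.
desc = lowfreq_fft_descriptor(np.ones((8, 8)), k=3)
```

Using magnitudes rather than complex coefficients makes the descriptor insensitive to circular shifts of the feature map, at the cost of discarding phase information.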
[0054] In one specific implementation, the hyperparameter settings of the two-dimensional to one-dimensional dimensionality reduction function include at least one of the following or a combination thereof: (1) Number of bins / number of partitions.
[0055] (2) Smoothing parameters: histogram smoothing coefficient λ, quadratic kernel width, etc.
[0056] (3) Normalization methods: L1 / L2 normalization, normalization by the number of atoms N, normalization by energy / mass, etc.
[0057] (4) Number of retained features: number of spectral coefficients retained K or number of spectral features retained k.
[0058] (5) Number of multi-scale layers and window parameters: number of scale layers L, window size and step size of each scale, etc.
[0059] The specific process of action encoding and hyperparameter generation in step S2 is as follows: from the action, the index of the selected dimensionality reduction method and its intra-segment parameter are obtained, and a corresponding hyperparameter set is generated by the hyperparameter mapping function. The dimensionality reduction environment function is then configured to perform progressive dimensionality reduction on the environment input. In one embodiment, to select among candidate dimensionality reduction methods and jointly represent their hyperparameters, the number of candidate methods for 3D-to-2D dimensionality reduction is denoted as M, and the method index and intra-segment parameter are determined from the normalized parameter output by the policy network according to the following rules:
[0060]
[0061] Here, the method index selects one of the M candidate methods for dimensionality reduction from three dimensions to two dimensions, and the intra-segment parameter adjusts the hyperparameters of the selected method: the hyperparameter mapping function corresponding to that method generates the hyperparameter set from the intra-segment parameter. Interval mapping, such as linear mapping or logarithmic mapping, can be used to map the intra-segment parameter to a preset hyperparameter range.
[0062] Let N be the number of candidate methods for dimensionality reduction from 2D to 1D. The method index and intra-segment parameter are determined from the normalized parameter output by the policy network according to the following rules:
[0063]
[0064] Here, the method index selects one of the N candidate methods for dimensionality reduction from two dimensions to one dimension, and the intra-segment parameter adjusts the hyperparameters of the selected method: the hyperparameter mapping function corresponding to that method generates the hyperparameter set from the intra-segment parameter. Interval mapping, such as linear mapping or logarithmic mapping, can be used to map the intra-segment parameter to a preset hyperparameter range.
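One plausible reading of the segment-based action encoding is sketched below; it is an assumption, not a reproduction of the rules in the original equations. The normalized policy output in [0, 1] is split into equal segments, one per candidate method: the segment that the value falls into gives a zero-based method index, and the position within that segment gives the intra-segment parameter. The names `decode_action`, `linear_map` and `log_map` are hypothetical.

```python
def decode_action(u: float, n_methods: int):
    """Split a normalized policy output u in [0, 1] into
    (zero-based method index, intra-segment parameter in [0, 1])."""
    if not 0.0 <= u <= 1.0:
        raise ValueError("u must lie in [0, 1]")
    if n_methods < 1:
        raise ValueError("n_methods must be positive")
    scaled = u * n_methods
    # Clamp so that u == 1.0 selects the last method instead of overflowing.
    index = min(int(scaled), n_methods - 1)
    return index, scaled - index


def linear_map(t: float, low: float, high: float) -> float:
    """Linear interval mapping of t in [0, 1] onto [low, high]."""
    return low + t * (high - low)


def log_map(t: float, low: float, high: float) -> float:
    """Logarithmic interval mapping of t in [0, 1] onto [low, high], low > 0."""
    return low * (high / low) ** t


# Usage: 4 candidate 3D-to-2D methods; the policy outputs 0.625.
index, t = decode_action(0.625, 4)
bandwidth = log_map(t, 0.01, 1.0)   # e.g. a kernel bandwidth spanning two decades
```

A single continuous output thus encodes both a discrete choice and a continuous hyperparameter, which keeps the action space low-dimensional; logarithmic mapping is the natural choice for scale-like hyperparameters such as a bandwidth.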
[0065] In this invention, the dimensionality reduction function operates on the environmental input information of the chemical system. To aid understanding and implementation, one commonly used class of dimensionality reduction function implementations is presented.
1. 3D to 2D mapping: In this invention, the transformation from three dimensions to two dimensions is divided into two types: the point method and the field method. An example is given for each method.
[0066] 1.1 3D to 2D mapping using projection and kernel density estimation: First, align and project:
[0067]
[0068] Constructing a 2D feature map:
[0069] to obtain the two-dimensional feature map.
[0070] Here, the basis matrix maps two-dimensional coordinates to the three-dimensional plane; the attribute mapping parameters map atomic attributes to channel weights; and the indicator function returns 1 if its condition is true and 0 if it is false. An atom that lies within the clipping radius R contributes to the two-dimensional feature map, whereas an atom outside the clipping range does not contribute.
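The point method described above (align, project, then accumulate a kernel density on a grid) might look like the following sketch for a single channel. Several details are assumptions, because the original equations are not available here: the clipping radius is measured as the 3D distance from the projection origin, the kernel is an unnormalized Gaussian, the grid is square and centered on the origin, and the per-atom channel weights are passed in directly. The name `kde_feature_map` is hypothetical.

```python
import numpy as np


def kde_feature_map(coords, weights, origin, basis, radius,
                    grid_size, extent, bandwidth):
    """Project atoms onto a plane and build a 2D feature map by
    Gaussian kernel density estimation.

    coords  -- (N, 3) atomic coordinates
    weights -- (N,) channel weight of each atom
    origin  -- (3,) projection origin
    basis   -- (3, 2) matrix with orthonormal columns spanning the plane
    radius  -- clipping radius (atoms farther from origin are ignored)
    """
    coords = np.asarray(coords, dtype=float)
    weights = np.asarray(weights, dtype=float)
    rel = coords - np.asarray(origin, dtype=float)     # align to the origin
    uv = rel @ np.asarray(basis, dtype=float)          # in-plane coordinates
    inside = np.linalg.norm(rel, axis=1) <= radius     # indicator function
    axis = np.linspace(-extent, extent, grid_size)
    gx, gy = np.meshgrid(axis, axis, indexing="ij")
    fmap = np.zeros((grid_size, grid_size))
    for (u, v), w, keep in zip(uv, weights, inside):
        if not keep:
            continue                                    # outside clipping range
        d2 = (gx - u) ** 2 + (gy - v) ** 2
        fmap += w * np.exp(-d2 / (2.0 * bandwidth ** 2))
    return fmap


# Usage: one atom at the origin and one far outside the clipping radius.
fmap = kde_feature_map(
    coords=[[0, 0, 0], [10, 0, 0]], weights=[2.0, 5.0],
    origin=[0, 0, 0], basis=[[1, 0], [0, 1], [0, 0]],
    radius=3.0, grid_size=5, extent=2.0, bandwidth=0.5,
)
```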
[0071] 1.2 3D to 2D mapping via field expansion and integral projection: First, construct a three-dimensional continuous field:
[0072] Then, by performing a thickness integral along the normal direction, we obtain the two-dimensional field:
[0073] Here, the three-dimensional continuous field is built from the atoms using the attribute mapping parameters, which map atomic attributes to channel weights, and a kernel function (such as a Gaussian kernel) with bandwidth / smoothing parameter h. For a point given by its coordinates on the two-dimensional plane, o is the reference origin of the projection plane in three-dimensional space, such as the center of mass or the location of a specified atom; the basis matrix maps two-dimensional coordinates to the three-dimensional plane, so that its product with the two-dimensional coordinates represents the three-dimensional displacement within the plane; n is the normal vector of the plane; the "thickness coordinate" runs along the normal direction; and the integral is taken over the integration thickness.
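The field method can be sketched in the same spirit. The code below evaluates a Gaussian field at points displaced from the plane along the normal and integrates over the thickness coordinate with the trapezoidal rule. The symmetric integration range around the plane, the unnormalized Gaussian kernel, and the function name `slab_projected_field` are assumptions made for illustration.

```python
import numpy as np


def slab_projected_field(coords, weights, origin, basis, normal,
                         bandwidth, thickness, n_steps=41,
                         grid_size=16, extent=4.0):
    """Build a 3D Gaussian field from atoms and integrate it along the
    plane normal over [-thickness/2, thickness/2] to get a 2D field."""
    coords = np.asarray(coords, dtype=float)
    weights = np.asarray(weights, dtype=float)
    origin = np.asarray(origin, dtype=float)
    basis = np.asarray(basis, dtype=float)
    normal = np.asarray(normal, dtype=float)
    if n_steps < 2:
        raise ValueError("n_steps must be at least 2")
    axis = np.linspace(-extent, extent, grid_size)
    gx, gy = np.meshgrid(axis, axis, indexing="ij")
    # 3D position of every in-plane grid point: origin + basis @ (gx, gy).
    plane = origin + gx[..., None] * basis[:, 0] + gy[..., None] * basis[:, 1]
    ts = np.linspace(-thickness / 2.0, thickness / 2.0, n_steps)
    values = np.empty((n_steps, grid_size, grid_size))
    for k, t in enumerate(ts):
        pts = plane + t * normal                        # shift along the normal
        d2 = np.sum((pts[:, :, None, :] - coords) ** 2, axis=-1)
        values[k] = np.sum(weights * np.exp(-d2 / (2.0 * bandwidth ** 2)), axis=-1)
    dt = ts[1] - ts[0]
    # Trapezoidal rule along the thickness coordinate.
    return dt * (values[0] / 2.0 + values[1:-1].sum(axis=0) + values[-1] / 2.0)


# Usage: a single atom at the origin, projected along z.
field = slab_projected_field(
    coords=[[0, 0, 0]], weights=[1.0], origin=[0, 0, 0],
    basis=[[1, 0], [0, 1], [0, 0]], normal=[0, 0, 1],
    bandwidth=0.5, thickness=8.0, n_steps=161, grid_size=3, extent=1.0,
)
```

For a slab much thicker than the bandwidth, the value directly above the atom approaches the analytic line integral of the Gaussian, which is the bandwidth times the square root of 2π.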
[0074] 2. Two-dimensional to one-dimensional mapping: The two-dimensional to one-dimensional transformation uses a partitioned statistical aggregation method, dividing the two-dimensional plane into regions and computing statistics within each region:
[0075] Concatenating the statistics of all regions yields the fixed-dimensional descriptor z.
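Partitioned statistical aggregation can be sketched as follows, assuming a regular grid partition and the three statistics named elsewhere in the text (mean, maximum, energy). The function name `partition_pool` and the optional L2 normalization are illustrative choices.

```python
import numpy as np


def partition_pool(feature_map, n_rows: int, n_cols: int,
                   normalize: bool = True) -> np.ndarray:
    """Split a 2D feature map into n_rows x n_cols regions and concatenate
    (mean, max, energy) of every region into a 1D descriptor of length
    3 * n_rows * n_cols."""
    fm = np.asarray(feature_map, dtype=float)
    feats = []
    for row_block in np.array_split(fm, n_rows, axis=0):
        for block in np.array_split(row_block, n_cols, axis=1):
            feats.extend([block.mean(), block.max(), float(np.sum(block ** 2))])
    z = np.array(feats)
    if normalize:
        norm = np.linalg.norm(z)
        if norm > 0:
            z = z / norm                                # L2 normalization
    return z


# Usage: a 4x4 map split into 2x2 regions gives a 12-dimensional descriptor.
z_raw = partition_pool(np.arange(16).reshape(4, 4), 2, 2, normalize=False)
z_unit = partition_pool(np.arange(16).reshape(4, 4), 2, 2)
```

The descriptor length depends only on the number of regions and statistics, not on the grid resolution, so chemical systems of different sizes map to vectors of the same dimension.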
[0076] Step S3. Based on the dimensionality reduction operators and hyperparameter configuration determined by the action output in step S2, the input ε is subjected to a stepwise dimensionality reduction process in the dimensionality reduction environment module: from three dimensions to two dimensions to generate a two-dimensional intermediate representation M (or an equivalent two-dimensional field), and then from two dimensions to one dimension to generate a one-dimensional chemical descriptor z.
[0077] Step S4. Input the chemical descriptor z generated in step S3 into the property prediction model to predict the properties and obtain the corresponding chemical property prediction results. The property prediction model can be any supervised learning model or neural network model, including but not limited to multilayer perceptrons, convolutional networks, graph neural network readout heads, or ensemble learning models.
[0078] The representative property prediction models in this invention are as follows: (1) Gradient Boosting Decision Tree (GBDT) models (e.g., XGBoost, LightGBM): When the learning objective is a regression loss (such as mean squared error or Huber loss), the GBDT acts as a regression model and outputs continuous properties (such as solubility, band gap, or formation energy).
[0079] When the learning objective is a classification loss (such as log loss / cross-entropy), the GBDT acts as a classification model and outputs class probabilities or class labels (such as active / inactive).
[0080] In this embodiment, to emphasize determinism and feasibility, a specific form of GBDT is preferred: for regression tasks, RMSE or MAE is used as the evaluation metric; for classification tasks, AUC or F1 is used as the evaluation metric.
[0081] (2) Multilayer Perceptron (MLP) Neural Network Model: The Multilayer Perceptron (MLP) used for property prediction in this invention is a general structure, and whether it belongs to regression or classification depends on the form of the output layer and the loss function: If the output layer is a linear unit and uses regression loss (MSE / MAE), then it is a regression model; If the output layer is sigmoid / softmax and uses cross-entropy loss, then it is a classification model.
[0082] In this embodiment, a representative setting is preferred: for example, regression tasks use a 2-4 layer MLP to output continuous values, and classification tasks use a sigmoid output layer to output probabilities.
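The distinction drawn in [0081] between a linear output unit (regression) and a sigmoid output unit (classification) can be shown with a small forward pass. This is a minimal sketch with hand-set weights, not a trained model. The ReLU hidden activation is an assumption (the text only requires a nonlinear activation), and the function name `mlp_forward` and the `task` argument are hypothetical.

```python
import math


def mlp_forward(z, layers, task):
    """Forward pass of a small MLP on descriptor z.

    layers: list of (weights, biases), one pair per layer, where weights
            holds one row per output unit.
    task:   "regression" gives a linear output unit,
            "classification" gives a sigmoid output unit.
    Hidden layers use ReLU (an assumed choice of nonlinearity).
    """
    h = list(z)
    for idx, (weights, biases) in enumerate(layers):
        h = [sum(w * x for w, x in zip(row, h)) + b
             for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:          # hidden layer: apply ReLU
            h = [max(0.0, v) for v in h]
    out = h[0]
    if task == "regression":
        return out
    return 1.0 / (1.0 + math.exp(-out))    # sigmoid probability


# Two-layer toy network with illustrative weights.
layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # hidden layer
    ([[2.0, 3.0]], [-2.0]),                   # output layer
]
value = mlp_forward([1.0, -2.0], layers, "regression")
prob = mlp_forward([1.0, -2.0], layers, "classification")
```

The same network body serves both tasks; only the output unit and the loss used in training differ.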
[0083] Step S5. Construct a reward function based on the error between the predicted chemical properties and the actual labels, and use the reward function to optimize and update the reinforcement learning decision model.
[0084] In this invention, the input for a single descriptor generation corresponds to one chemical system (a molecule, a unit cell, or a sample). The policy parameters are iterated during reinforcement learning training, over multiple training rounds on the dataset. Training phase: the policy network continuously updates its parameters across multiple training rounds, enabling it to output better actions when faced with any input chemical system. Inference phase: for a given chemical system, only one forward decision output is required to obtain the action and generate z.
[0085] The reward function is determined as follows. For regression tasks, define the validation / evaluation error:
[0086] The reward function can be defined as the negative of the loss or a monotonic transformation thereof.
[0087] If it is necessary to normalize the rewards and avoid differences in scale across different tasks, a linear mapping onto a preset interval can be used:
[0088] Here, the calibration constants are obtained through prior statistical analysis, and the upper and lower limits of the reward are preset.
[0089] For classification tasks, the loss can be taken as the cross-entropy loss, or the reward can be set directly equal to a metric such as AUC or F1.
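The expressions in [0085] to [0088] are not recoverable from the text. One set of forms consistent with the description is sketched below, under assumed notation: L for the validation error (mean squared error is used here as one possible choice), r for the reward, L_min and L_max for the calibration constants, and r_min and r_max for the preset reward limits. The exponential transformation is only one example of a monotonic transformation.

```latex
\[
L \;=\; \frac{1}{n}\sum_{j=1}^{n}\bigl(\hat{y}_j - y_j\bigr)^{2},
\qquad
r \;=\; -L
\quad\text{or}\quad
r \;=\; e^{-L},
\]
\[
r \;=\; r_{\max} \;-\; \bigl(r_{\max}-r_{\min}\bigr)\,
        \frac{L - L_{\min}}{L_{\max}-L_{\min}} .
\]
```

In the linear mapping, L = L_min gives r = r_max and L = L_max gives r = r_min, so the reward is strictly decreasing in the error whenever L_max > L_min and r_max > r_min.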
[0090] In this invention, chemical descriptors serve as input to the property prediction model. Their prediction error characterizes the sufficiency and effectiveness of the descriptor in representing the target property. The prediction error is a direct consequence of the action executed by the environment. The action determines how the dimensionality reduction chain transforms the atomic input ε into the descriptor z, while the prediction error measures whether z represents the target property y sufficiently. Therefore, it essentially evaluates the quality of the descriptor generated by this action, thereby indirectly evaluating the quality of the action (dimensionality reduction selection and hyperparameters).
[0091] Since descriptors are uniquely determined by actions (given the input), the reward value constructed from the prediction error is equivalent to an evaluation of the quality of the descriptors generated by that action: a smaller error indicates that the dimensionality reduction chain configured by the action retains more information relevant to the target property and suppresses irrelevant noise, and thus corresponds to a higher reward; a larger error corresponds to a lower reward or a penalty. By maximizing the expected reward, reinforcement learning increases the probability of actions that generate high-quality descriptors, thereby achieving adaptive optimization of the combination of dimensionality reduction functions and the hyperparameter configuration.
Specific implementation method one:
[0092] S1. Encode the input chemical system using a pre-trained large model (in a preferred embodiment, the pre-trained large model embedding interface is an embedding model interface called through the OpenAI API, used to convert the structured representation into a fixed-dimensional vector representation). The large model receives the initial feature information of each atom in the chemical system and outputs a vector representing the overall features of the chemical system. This vector serves as the initial state S of the reinforcement learning environment.
[0093] S2. Input the initial state S into the reinforcement learning decision module, and the reinforcement learning decision module outputs an action, which includes the selection information of the dimensionality reduction function and its corresponding hyperparameter setting information.
[0094] S3. Based on the aforementioned action, process the input features in the dimensionality reduction environment module. Perform stepwise dimensionality reduction, including 3D to 2D and 2D to 1D operations, to generate chemical descriptors in 1D vector form.
[0095] S4. Input the chemical descriptor into the property prediction module to obtain the corresponding chemical property prediction results.
[0096] S5. Generate a reward signal based on the change in prediction performance relative to the baseline prediction result and relative to the previous round of descriptor generation. A positive reward is generated when the prediction performance improves, and a negative reward is generated when it declines.
[0097] S6. Feed the reward signal back to the reinforcement learning decision module to update its policy parameters, thereby optimizing the subsequent chemical descriptor generation process. Steps S2 to S6 are executed N times in a loop.
Specific implementation method two:
[0098] The initial state S is not only generated in step S1 but can also be periodically updated during reinforcement learning training. Specifically, after completing a preset number of iterations of steps S2 to S6, the state representation is re-encoded using the large model, based on the intermediate representation of the current dimensionality reduction environment or the generated chemical descriptor, to obtain a new initial state S for subsequent reinforcement learning iterations.
[0099] The remaining steps are the same as in implementation method one.
Specific implementation method three:
[0100] In another embodiment, the reward signal in step S5 considers not only the improvement or decline in prediction performance, but also the stability or generalization performance of the prediction result. Specifically, the reward signal can be constructed based on at least one of the following: the change in prediction error on the validation set; the average improvement in prediction performance over multiple consecutive rounds; constraints on descriptor dimensions or computational cost.
[0101] The remaining steps S1 to S4 and step S6 are the same as in implementation method one.
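The loop over steps S2 to S6 in implementation method one can be sketched as a REINFORCE-style update on a toy problem. This is a simplified illustration under several assumptions that go beyond the text: the policy is a stateless softmax over a small discrete set of candidate operator chains (the state S is ignored), the reward is the negative error with a running-average baseline, and `error_of_action` is a hypothetical stand-in for running the dimensionality reduction and the property prediction model.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(error_of_action, n_actions, n_rounds=2000, lr=0.5, seed=0):
    """Repeat S2-S6: sample an action, score it, update the policy."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    baseline = 0.0
    for _ in range(n_rounds):
        probs = softmax(logits)
        # S2: the policy samples an action (index of an operator chain).
        action = rng.choices(range(n_actions), weights=probs)[0]
        # S3-S5: reduce, predict, and turn the error into a reward.
        reward = -error_of_action(action)
        advantage = reward - baseline
        baseline += 0.1 * (reward - baseline)
        # S6: policy-gradient step on the softmax logits.
        for a in range(n_actions):
            grad = (1.0 if a == action else 0.0) - probs[a]
            logits[a] += lr * advantage * grad
    return softmax(logits)


# Hypothetical validation errors of three candidate operator chains.
toy_errors = [0.9, 0.1, 0.7]
final_probs = train_policy(lambda a: toy_errors[a], n_actions=3)
```

In this toy setting the probability mass moves toward the chain with the smallest error. A full implementation would condition the policy on the state S and would also output continuous hyperparameters.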
[0102] Based on the same inventive concept, embodiments of the present invention also provide a chemical descriptor generation system based on reinforcement learning (see Figure 2). The system of the present invention includes at least the following modules: (1) Data Input and Feature Processing Module: Used to acquire the initial feature information of each atom in the chemical system. The initial feature information includes the three-dimensional coordinates of the atom and atomic attribute parameters, wherein the attribute parameters may include atomic number, electronegativity, atomic radius, number of valence electrons, or a combination thereof. The input can be organized in the form of a point set. This preserves both geometric spatial information and attribute information, making it easier for dimensionality reduction operators to be applied directly to geometric point cloud structures with attributes.
[0103] This module is also used to structure the input and call the pre-trained large model encoding to output the overall feature vector of the chemical system as the state vector S of the reinforcement learning decision model.
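The point-set organization and the structuring step of [0102] to [0103] might look like the following minimal sketch. The `Atom` class, the `to_structured_text` function and the one-line-per-atom text format are hypothetical, the water coordinates are illustrative values in ångströms, and the call to the embedding model itself is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atom:
    """One point of the attributed point set: coordinates plus attributes."""
    x: float
    y: float
    z: float
    atomic_number: int
    electronegativity: float


def to_structured_text(atoms):
    """Serialize the point set into one line per atom, a hypothetical
    structured form that could be handed to an embedding model."""
    return "\n".join(
        f"Z={a.atomic_number} en={a.electronegativity:.2f} "
        f"xyz=({a.x:.3f}, {a.y:.3f}, {a.z:.3f})"
        for a in atoms
    )


# Illustrative water geometry (approximate coordinates, Pauling electronegativities).
water = [
    Atom(0.000, 0.000, 0.117, 8, 3.44),
    Atom(0.000, 0.757, -0.467, 1, 2.20),
    Atom(0.000, -0.757, -0.467, 1, 2.20),
]
text = to_structured_text(water)
```

Keeping coordinates and attributes together in one record is what lets the same point set feed both the encoder (for the state S) and the dimensionality reduction operators.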
[0104] (2) Reinforcement Learning Decision Module: This module receives the state vector S and outputs the action (α, β). The action is used to select the type and combination order of the dimensionality reduction environment function and to generate or determine the corresponding hyperparameter configuration. This module can be implemented using policy gradient, Actor-Critic, or other reinforcement learning methods. In a preferred embodiment, the policy network outputs the action distribution parameters and samples (α, β), and the reward optimization module updates the policy parameters based on the reward signal.
[0105] To achieve joint control of discrete operator selection and continuous hyperparameter adjustment, a continuous parameter encoding method can be adopted: the index of the selected dimensionality reduction method and its intra-segment parameter are obtained from the action, a corresponding hyperparameter set is generated by the hyperparameter mapping function, and the dimensionality reduction environment function is then configured to perform progressive dimensionality reduction on the environment input. In one embodiment, to jointly represent the selection among candidate dimensionality reduction methods and their hyperparameters, the number of candidate methods for 3D to 2D dimensionality reduction is denoted as M, and the policy network outputs a normalized parameter from which the method index and intra-segment parameter are determined according to the following rule:
[0106]
[0107] Here, the method index selects which of the M candidate methods is used for dimensionality reduction from three dimensions to two dimensions, and the intra-segment parameter adjusts the hyperparameters of the selected method: the hyperparameter mapping function corresponding to that method generates the hyperparameter set from the intra-segment parameter. An interval mapping, such as a linear or logarithmic mapping, can be used to map the intra-segment parameter to a preset hyperparameter range.
[0108] Similarly, let N be the number of candidate methods for dimensionality reduction from 2D to 1D, and let the policy network output a normalized parameter. The method index and intra-segment parameter are determined according to the following rule:
[0109]
[0110] Here, the method index selects which of the N candidate methods is used for dimensionality reduction from two dimensions to one dimension, and the intra-segment parameter adjusts the hyperparameters of the selected method: the hyperparameter mapping function corresponding to that method generates the hyperparameter set from the intra-segment parameter. An interval mapping, such as a linear or logarithmic mapping, can be used to map the intra-segment parameter to a preset hyperparameter range.
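The rules in [0106] and [0109] are not recoverable from the text. One common way to realize the segment encoding described in [0105] to [0110] is to split the unit interval into equal segments, one per candidate method. The sketch below assumes that scheme: the normalized parameter is taken to lie in [0, 1], and the names `decode_action`, `linear_map` and `log_map` are hypothetical.

```python
def decode_action(u, n_methods):
    """Map a normalized parameter u in [0, 1] to
    (method index, intra-segment parameter in [0, 1])."""
    u = min(max(u, 0.0), 1.0)                 # guard against out-of-range output
    scaled = u * n_methods
    index = min(int(scaled), n_methods - 1)   # u == 1.0 falls in the last segment
    return index, scaled - index


def linear_map(s, low, high):
    """Linear interval mapping of s in [0, 1] onto [low, high]."""
    return low + s * (high - low)


def log_map(s, low, high):
    """Logarithmic interval mapping of s in [0, 1] onto [low, high], low > 0."""
    return low * (high / low) ** s


# Example: 4 candidate 3D-to-2D methods, policy output u = 0.625.
index, s = decode_action(0.625, 4)
bandwidth = log_map(s, 0.01, 1.0)   # e.g. a kernel bandwidth h on a log scale
```

A logarithmic mapping suits hyperparameters such as a bandwidth that span several orders of magnitude, while a linear mapping suits bounded quantities such as an integral thickness.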
[0111] (3) Dimensionality Reduction Environment Module: Used to implement the actions output by the reinforcement learning decision module. The input features are subjected to progressive dimensionality reduction processing to generate a chemical descriptor z in the form of a one-dimensional vector. The dimensionality reduction environment module includes at least: a set of dimensionality reduction functions from three dimensions to two dimensions and a set of dimensionality reduction functions from two dimensions to one dimension.
[0112] In the specific implementation process, the dimensionality reduction environment module receives continuous action parameters. According to the predefined action-operator mapping rules, the action parameters are mapped to specific combinations of dimensionality reduction operators and their hyperparameter configurations, and dimensionality reduction operations from three dimensions to two dimensions and from two dimensions to one dimension are executed in sequence to form a complete chain of dimensionality reduction operators.
[0113] The 3D-to-2D dimensionality reduction function maps the 3D spatial information in the input features to a 2D representation, and the 2D-to-1D dimensionality reduction function further compresses the 2D representation into a 1D vector representation. The dimensionality reduction environment module explicitly retains intermediate dimensionality reduction results as interpretable intermediate representations during the dimensionality reduction process and supports monitoring, reuse, or analysis of the dimensionality reduction results at each stage, thereby enhancing the interpretability and controllability of the descriptor generation process.
[0114] (4) Property Prediction Module: This module takes the one-dimensional chemical descriptor output by the dimensionality reduction environment module and inputs it into the machine learning model, outputting the corresponding chemical property prediction results. The machine learning model can be any supervised learning model. In this embodiment, a representative model is preferred: for regression tasks, a GBDT regression model (such as LightGBM regression) is preferred for regression prediction of continuous properties (such as solubility, band gap, formation energy, etc.); for classification tasks, a GBDT classification model (such as XGBoost classification) is preferred for classification prediction of discrete labels (such as activity / inactivity). In another embodiment, the machine learning model can be a multilayer perceptron (MLP) neural network model, where the input layer receives a one-dimensional chemical descriptor vector z, the hidden layer uses a nonlinear activation function, and the output layer outputs regression values or classification probabilities according to the task type. The property prediction module can train the model parameters using a training set and calculate prediction indicators for feedback evaluation under validation set or cross-validation conditions.
[0115] The property prediction module and the reinforcement learning decision-making module form a closed-loop interaction through a feedback evaluation mechanism.
[0116] (5) Reward optimization module: It is used to construct a reward function based on the error between the property prediction result and the real label, and feed the reward back to the reinforcement learning decision module to update the policy parameters, so that the policy gradually learns a better combination of dimensionality reduction function and hyperparameter configuration, thereby improving the performance of the descriptor in the downstream prediction task. The reward function is used to measure the predictability of the chemical descriptor generated by the dimensionality reduction operator chain configured by the action for the target property. The smaller the error, the more sufficient the task-related information retained by the descriptor is, and therefore the better the corresponding action is.
[0117] In a preferred implementation, the reward function is defined as follows. For regression tasks, define the validation / evaluation error:
[0118] The reward function can be defined as the negative of the loss or a monotonic transformation thereof.
[0119] If it is necessary to normalize the rewards and avoid differences in scale across different tasks, a linear mapping onto a preset interval can be used:
[0120] Here, the calibration constants are obtained through prior statistical analysis, and the upper and lower limits of the reward are preset.
[0121] For classification tasks, the loss can be taken as the cross-entropy loss, or the reward can be set directly equal to a metric such as AUC or F1.
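A minimal sketch of the reward constructions in [0117] to [0121] follows. The function names and default limits are hypothetical, `err_low` and `err_high` play the role of the calibration constants, and the clipping of errors outside the calibrated range is an added assumption.

```python
def negative_error_reward(error):
    """Reward as the negative of the regression error."""
    return -error


def normalized_reward(error, err_low, err_high, r_min=-1.0, r_max=1.0):
    """Linearly map an error in [err_low, err_high] onto [r_max, r_min].

    Smaller error gives larger reward. Errors outside the calibrated
    range are clipped (an assumption, not stated in the text).
    """
    frac = (error - err_low) / (err_high - err_low)
    frac = min(max(frac, 0.0), 1.0)
    return r_max - frac * (r_max - r_min)


def metric_reward(metric_value):
    """Classification: reward set directly equal to a metric such as AUC or F1."""
    return metric_value


r_best = normalized_reward(0.0, 0.0, 2.0)
r_mid = normalized_reward(1.0, 0.0, 2.0)
r_worst = normalized_reward(5.0, 0.0, 2.0)
```

Within the calibrated range the mapping is strictly decreasing in the error. Because of the clipping, errors beyond `err_high` all receive `r_min`, which bounds the penalty at the cost of no longer distinguishing very poor descriptors from each other.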
[0122] After the reward signal is fed back to the reinforcement learning decision module, the module updates the policy parameters based on the reward signal, thereby achieving iterative optimization of the dimensionality reduction function combination and its hyperparameters. The reward signal is calculated from measurement parameters, which are used to quantify the effectiveness of chemical descriptors in the property prediction task and serve as the optimization objective for reinforcement learning.
[0123] In this method, the measurement parameters specifically include one or more of the following: (1) Chemical property labeling parameters (target values): used as the actual label y for prediction, including but not limited to continuous properties such as solubility, band gap, and formation energy, or discrete labels such as active / inactive.
[0124] (2) Error measurement parameters for regression tasks: used to measure the deviation of the predicted values from the true label y and to construct a reward accordingly, such as mean squared error (MSE), root mean square error (RMSE), mean absolute error (MAE), or a combination thereof.
[0125] (3) Performance metrics for classification tasks: used to measure the classification prediction effect and construct rewards accordingly, such as cross-entropy loss, accuracy, AUC, F1 score or a combination thereof.
[0126] (4) Optional constraint-type metric parameters: used to introduce constraints on descriptor complexity or computational overhead in the reward, such as descriptor dimension D, dimensionality reduction computation time, memory consumption or dimensionality reduction operator chain length, etc., to form weighted rewards or constraint terms.
[0127] Based on the above measurement parameters, a continuous reward function can be constructed, for example, by setting the reward to negative error for a regression task. For classification tasks, the reward is set to an index such as AUC / F1 or its monotonic transformation value. The reinforcement learning decision module then updates the policy parameters using policy gradient or value function updates, so that actions that produce better measurement parameter performance (e.g., smaller error or higher AUC / F1) under the same or similar state inputs have a higher output probability, thereby continuously improving the performance of the generated chemical descriptors in property prediction tasks.
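The optional constraint-type parameters in [0126] can be folded into the reward as weighted penalty terms. The sketch below shows one such weighted form. The additive structure, the weights `dim_weight` and `time_weight` and the function name `constrained_reward` are assumptions made for illustration.

```python
def constrained_reward(error, descriptor_dim, compute_seconds,
                       dim_weight=0.001, time_weight=0.01):
    """Negative error minus weighted penalties on descriptor dimension D
    and on dimensionality reduction computation time."""
    return (-error
            - dim_weight * descriptor_dim
            - time_weight * compute_seconds)


# Two hypothetical descriptors with the same error but different cost.
r_compact = constrained_reward(error=0.5, descriptor_dim=100, compute_seconds=2.0)
r_large = constrained_reward(error=0.5, descriptor_dim=400, compute_seconds=2.0)
```

At equal prediction error the more compact descriptor receives the higher reward, so the policy is nudged toward shorter operator chains and smaller descriptors only when accuracy is not sacrificed. The weights set how much error one is willing to trade for lower cost.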
[0128] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the apparatus disclosed in the embodiments, since they correspond to the methods disclosed in the embodiments, the description is relatively simple; relevant parts can be referred to the method section.
[0129] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A chemical descriptor generation method based on reinforcement learning, characterized in that it includes the following steps:
Step S1. Obtain the environmental input information of the chemical system, including the three-dimensional coordinates and attribute parameters of each atom, and perform structured representation and encoding of the environmental input information to obtain the initial state vector for reinforcement learning decision-making;
Step S2. Input the initial state vector obtained in step S1 into the reinforcement learning decision model, execute the reinforcement learning decision, and output the action and hyperparameter set used to reduce the dimensionality of the current environmental input information; the action comprises a chain of dimensionality reduction operators formed by α and β, where α is selection information used to determine the dimensionality reduction environment function that reduces the environmental input information from three dimensions to two dimensions, and β is selection information used to determine the dimensionality reduction environment function that reduces it from two dimensions to one dimension;
Step S3. Use the action and hyperparameter set output in step S2 to configure the dimensionality reduction environment functions, progressively reducing the dimensionality of the environmental input information of the chemical system to generate a chemical descriptor in one-dimensional vector form;
Step S4. Input the chemical descriptor generated in step S3 into the property prediction model to predict the properties and obtain the corresponding chemical property prediction results;
Step S5. Generate a reward value based on the prediction error between the chemical property prediction results and the actual annotations, and use the reward value to optimize the reinforcement learning decision model.
2. The chemical descriptor generation method as described in claim 1, characterized in that, In step S1, the property parameters of each atom include one or more of the following: atomic number, atomic radius, electronegativity, and number of valence electrons.
3. The chemical descriptor generation method as described in claim 1, characterized in that, Step S2 further includes updating the state of the reinforcement learning decision model; the state update includes: after the reinforcement learning decision model outputs an action and drives the execution of the dimensionality reduction environment function, obtaining the intermediate representation or generated chemical descriptor of the dimensionality reduction environment function, performing vectorized summarization on the intermediate representation or chemical descriptor to obtain a fixed-dimensional summary vector, and fusing the summary vector with the state vector obtained by encoding the environment input to form an updated state vector for subsequent reinforcement learning decisions, wherein the intermediate representation includes at least one of the following or a combination thereof: aligned three-dimensional coordinate representation or projection parameters, neighborhood structure information, three-dimensional continuous field or voxel field representation, two-dimensional projection point set representation, two-dimensional feature map or two-dimensional field representation, and two-dimensional partition statistics.
4. The chemical descriptor generation method as described in claim 1, characterized in that, in step S4, the property prediction model includes a regression prediction model or a classification prediction model. The regression prediction model is used to output continuous properties of the chemical system, including solubility, band gap, and formation energy. The classification prediction model is used to output discrete labels of the chemical system, including active and inactive.
5. The chemical descriptor generation method as described in claim 1, characterized in that, Step S5 specifically includes: Obtain the prediction error between the chemical property prediction results and the actual annotations. Based on the preset mapping relationship between the prediction error and the reward value, convert the prediction error of the current iteration into the corresponding reward value. The preset mapping relationship includes a strictly monotonically decreasing function of the prediction error, so that the smaller the error, the higher the reward value. The obtained reward value is used as environmental feedback in the reinforcement learning decision model and input into the policy optimization algorithm to optimize the action and hyperparameter set by maximizing the expected reward.
6. A chemical descriptor generation system based on reinforcement learning, characterized in that it includes:
Data input and feature processing module: used to acquire environmental input information of the chemical system, including the three-dimensional coordinates and attribute parameters of each atom, and to encode the environmental input information using a pre-trained large model to obtain an initial state vector for reinforcement learning decision-making;
Reinforcement learning decision module: used to receive the initial state vector, perform reinforcement learning decisions, and output the action and hyperparameter set for dimensionality reduction of the current environmental input information; the action comprises a chain of dimensionality reduction operators formed by α and β, where α is selection information used to determine the dimensionality reduction environment function from three dimensions to two dimensions, and β is selection information used to determine the dimensionality reduction environment function from two dimensions to one dimension;
Dimensionality reduction environment module: used to configure the dimensionality reduction environment functions according to the action, and to perform stepwise dimensionality reduction on the initial feature information of each atom in the chemical system to generate a chemical descriptor in the form of a one-dimensional vector;
Property prediction module: used to predict properties from the generated chemical descriptor using a property prediction model, and to obtain the corresponding chemical property prediction results;
Reward optimization module: used to generate reward values based on the prediction error of the chemical property prediction results, and to optimize and update the reinforcement learning decision model using the reward values.
7. The chemical descriptor generation system as described in claim 6, characterized in that, In the data input and feature processing module, the initial feature information of each atom is acquired, including the three-dimensional coordinates and attribute parameters of each atom. The attribute parameters include one or more of atomic number, atomic radius, electronegativity, and number of valence electrons.
8. The chemical descriptor generation system as described in claim 6, characterized in that, The reinforcement learning decision module further includes a state update unit. The state update unit is used to obtain the intermediate representation or generated chemical descriptor output by the dimensionality reduction environment function after the dimensionality reduction environment module performs dimensionality reduction based on the action, and after vectorizing and summarizing it, it is fused with the state vector obtained by encoding the environment input to generate an updated state vector for subsequent reinforcement learning decision-making. The intermediate representation includes at least one of the following or a combination thereof: aligned three-dimensional coordinate representation or projection parameters, neighborhood structure information, three-dimensional continuous field or voxel field representation, two-dimensional projection point set representation, two-dimensional feature map or two-dimensional field representation, and two-dimensional partition statistics.
9. The chemical descriptor generation system as described in claim 6, characterized in that, in the property prediction module, the property prediction model for predicting properties from the generated chemical descriptors includes a regression prediction model or a classification prediction model. The regression prediction model is used to output continuous properties of the chemical system, including solubility, band gap, and formation energy. The classification prediction model is used to output discrete labels of the chemical system, including active and inactive.
10. The chemical descriptor generation system as described in claim 6, characterized in that, The reward optimization module specifically performs the following steps: Obtain the prediction error between the chemical property prediction results and the actual annotations. Based on the preset mapping relationship between the prediction error and the reward value, convert the prediction error of the current iteration into the corresponding reward value. The preset mapping relationship includes a strictly monotonically decreasing function of the prediction error, so that the smaller the error, the higher the reward value. The obtained reward value is used as environmental feedback in the reinforcement learning decision model and input into the policy optimization algorithm to optimize the action and hyperparameter set by maximizing the expected reward.