Training method and device of prediction model of adverse drug reactions, electronic device and computer program product

By oversampling and undersampling adverse drug reaction samples and combining this with a spectral clustering algorithm for subgrouping, the model training bias caused by uneven sample distribution is mitigated, thereby improving the model's prediction accuracy and generalization ability.

CN122245837A · Pending · Publication Date: 2026-06-19 · PEKING UNIV CHONGQING RES INST OF BIG DATA +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: PEKING UNIV CHONGQING RES INST OF BIG DATA
Filing Date: 2026-03-06
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies suffer from performance bias during model training due to the uneven distribution of adverse drug reaction samples.

Method used

By oversampling and undersampling the training sample cluster, and combining the spectral clustering algorithm to divide the samples into subgroups, a drug adverse reaction prediction model is constructed.

Benefits of technology

It improves the model's sensitivity and specificity, enhances its generalization ability and prediction accuracy, ensures a balanced sample distribution, and reduces model bias.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention discloses a training method, apparatus, electronic device, and computer program product for a predictive model of adverse drug reactions. The method includes: acquiring a training sample cluster, wherein the training sample cluster includes multiple positive samples and multiple negative samples; oversampling the multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes multiple candidate positive samples; undersampling the multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes multiple candidate negative samples, and the difference in the number of candidate negative samples and candidate positive samples does not exceed a preset threshold; and training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster. This invention solves the technical problem in existing technologies where uneven distribution of adverse drug reaction samples leads to potential performance bias during model training.

Description

Technical Field

[0001] This invention relates to the field of machine learning, and more specifically, to a training method, apparatus, electronic device, and computer program product for a predictive model of adverse drug reactions.

Background Technology

[0002] Adverse drug reaction (ADR) prediction is a crucial aspect of drug safety monitoring and clinical medication decision-making. From a clinical pharmacology perspective, because of the combined influence of factors such as drug-metabolizing enzyme polymorphism, the patient's physiological and pathological state, and drug interactions, the same drug may show different safety responses in different individuals. This inter-individual variability limits the accuracy of traditional ADR risk assessment methods, which are based on population average parameters, when they are applied to individual medication risk prediction, making it difficult to meet the needs of precision medicine.

[0003] In existing technologies, statistical models such as logistic regression typically assume linear relationships between variables, making it difficult to fully characterize the complex nonlinear mechanisms underlying drug responses. Similarly, traditional prediction models based on homogeneity assumptions struggle to identify risk characteristics in different subgroups when the data exhibit significant individual differences. Machine learning methods such as random forests possess strong nonlinear modeling capabilities; however, in real-world medical data, ADR events are often severely class-imbalanced, which can bias prediction performance during training, for example by making it difficult to balance specificity and sensitivity. Deep learning models such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) can extract features automatically, but their training stability and generalization performance are limited by small sample sizes, which restricts their use in personalized ADR risk prediction.

[0004] Therefore, how to effectively mitigate the decline in model performance caused by uneven distribution of ADR samples while characterizing individual differences in adverse drug reactions has become a key challenge that needs to be addressed by existing technologies.

[0005] There is currently no effective solution to the technical problem that the uneven distribution of adverse drug reaction samples in existing technologies may lead to performance bias during model training.

Summary of the Invention

[0006] This invention provides a training method, apparatus, electronic device, and computer program product for a predictive model of adverse drug reactions, in order to at least solve the technical problem that the uneven distribution of adverse drug reaction samples in the prior art may lead to performance bias during model training.

[0007] According to one aspect of the present invention, a method for training a prediction model for adverse drug reactions is provided, comprising: acquiring a training sample cluster, wherein the training sample cluster includes: multiple positive samples and multiple negative samples, wherein the positive samples represent sample features that produce adverse drug reactions, and the negative samples represent sample features that do not produce adverse drug reactions; oversampling the multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes: multiple candidate positive samples; undersampling the multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes: multiple candidate negative samples, wherein the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and training a prediction model for adverse drug reactions based on the candidate positive sample cluster and the candidate negative sample cluster.

[0008] Optionally, after obtaining the training sample cluster, the method further includes: classifying the training samples in the training sample cluster to obtain multiple candidate sample clusters, wherein the training samples are either the positive samples or the negative samples, and the candidate samples in the same candidate sample cluster are either all positive samples or all negative samples; and using a spectral clustering algorithm to divide each candidate sample cluster into subgroups to obtain multiple candidate subgroups corresponding to each candidate sample cluster.

[0009] Optionally, using the spectral clustering algorithm to divide each candidate sample cluster into subgroups to obtain multiple candidate subgroups corresponding to each candidate sample cluster includes: constructing the Laplacian matrix of the candidate sample cluster based on the similarity between any two candidate samples in the same candidate sample cluster, wherein the Laplacian matrix is constructed based on a degree matrix and an adjacency matrix, the degree matrix is a diagonal matrix, each diagonal element of the degree matrix is the sum of the similarities between one candidate sample and the other candidate samples in the candidate sample cluster, and each element of the adjacency matrix represents the similarity between two candidate samples; performing eigenvalue decomposition on the Laplacian matrix to obtain an eigenvalue sequence sorted in ascending order and an eigenvector matrix corresponding to the eigenvalue sequence, wherein the eigenvalue sequence includes multiple eigenvalues and the eigenvector matrix includes the eigenvector corresponding to each eigenvalue; selecting, from the eigenvector matrix, a feature matrix to be clustered corresponding to a preset number of clusters, wherein the feature matrix to be clustered includes the eigenvectors corresponding to candidate eigenvalues, the number of candidate eigenvalues equals the preset number of clusters, the candidate eigenvalues are taken from the eigenvalue sequence in ascending order, and each eigenvector in the feature matrix to be clustered represents the feature projection of the candidate samples onto the dimensions given by the preset number of clusters; and clustering, based on the feature matrix to be clustered, the multiple candidate samples in the candidate sample cluster using the preset number of clusters as the cluster-number parameter to obtain the multiple candidate subgroups.
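In symbols, one common reading of the construction in the preceding paragraph is the unnormalized graph Laplacian. The notation below (w_ij for the pairwise similarity, W, D, L, n for the number of candidate samples, k for the preset number of clusters) is introduced here for illustration only, and the unnormalized form L = D - W is an assumption, since the text states only that the Laplacian is built from the degree and adjacency matrices.

```latex
W = (w_{ij})_{i,j=1}^{n}, \qquad
D = \operatorname{diag}(d_1, \dots, d_n), \quad d_i = \sum_{j \ne i} w_{ij}, \qquad
L = D - W

L u_m = \lambda_m u_m, \quad \lambda_1 \le \lambda_2 \le \dots \le \lambda_n, \qquad
U_k = [\, u_1 \;\; u_2 \;\; \cdots \;\; u_k \,] \in \mathbb{R}^{n \times k}
```

Under this reading, row i of U_k is the projection of candidate sample i that is passed to the final clustering step.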

[0010] Optionally, before selecting the feature matrix to be clustered corresponding to the preset number of clusters from the eigenvector matrix, the method further includes: determining a calculated number of clusters based on the maximum difference between two adjacent eigenvalues in the eigenvalue sequence; determining the upper bound of an interval as the larger of the calculated number of clusters and the preset number of clusters, and using a preset minimum number of clusters as the lower bound of the interval, to obtain a candidate cluster-number interval that includes multiple candidate numbers of clusters; selecting, from the eigenvector matrix, a candidate feature matrix corresponding to each candidate number of clusters, and performing pre-clustering with that candidate number of clusters as the cluster-number parameter to obtain multiple clustering results, wherein each clustering result is the set of candidate subgroups obtained with a different candidate number of clusters; and evaluating the clustering effect corresponding to each candidate number of clusters based on the multiple clustering results, and determining the candidate number of clusters with the best clustering effect as the preset number of clusters.
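The eigengap step and the interval construction in the preceding paragraph can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `candidate_cluster_counts`, the default `min_k=2`, and the reading of the "calculated number of clusters" as the count of eigenvalues that precede the largest gap are all assumptions. The text does not name the metric used to evaluate the clustering effect, so that step is omitted here.

```python
import numpy as np

def candidate_cluster_counts(eigenvalues, preset_k, min_k=2):
    """Return (calculated_k, candidate interval) from a Laplacian spectrum.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    vals = np.sort(np.asarray(eigenvalues, dtype=float))  # ascending sequence
    gaps = np.diff(vals)                  # gaps[i] = vals[i + 1] - vals[i]
    # Number of eigenvalues lying before the largest adjacent gap.
    calculated_k = int(np.argmax(gaps)) + 1
    upper = max(calculated_k, preset_k)   # upper bound of the interval
    return calculated_k, list(range(min_k, upper + 1))

calculated_k, candidates = candidate_cluster_counts(
    [0.0, 0.01, 0.02, 0.9, 1.0, 1.1], preset_k=2
)
print(calculated_k, candidates)
```

Each value in the returned interval would then be used for one pre-clustering run, and the run judged best by whatever quality measure is chosen fixes the final number of clusters.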

[0011] Optionally, before training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster, the method further includes: merging the candidate positive sample cluster and the candidate negative sample cluster into a balanced training cluster, wherein the balanced samples in the balanced training cluster are either the candidate positive samples or the candidate negative samples; determining the nearest neighbor sample corresponding to each candidate negative sample based on the sample characteristics of the candidate negative samples; and removing the candidate negative sample from the candidate negative sample cluster if the nearest neighbor sample corresponding to the candidate negative sample is a candidate positive sample.
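The cleaning rule in the preceding paragraph, which drops a candidate negative sample when its nearest neighbor in the balanced training cluster is a candidate positive sample, can be sketched as below. Euclidean distance and the function name `drop_boundary_negatives` are assumptions of this sketch; the text does not specify the distance measure.

```python
import numpy as np

def drop_boundary_negatives(X_pos, X_neg):
    """Remove negatives whose nearest neighbour (Euclidean, assumed) is positive."""
    X_pos = np.asarray(X_pos, dtype=float)
    X_neg = np.asarray(X_neg, dtype=float)
    X_all = np.vstack([X_pos, X_neg])            # balanced training cluster
    is_pos = np.array([True] * len(X_pos) + [False] * len(X_neg))
    keep = []
    for i, x in enumerate(X_neg):
        d = np.linalg.norm(X_all - x, axis=1)
        d[len(X_pos) + i] = np.inf               # a sample is not its own neighbour
        nearest = int(np.argmin(d))
        keep.append(not is_pos[nearest])         # keep only if neighbour is negative
    return X_neg[np.array(keep, dtype=bool)]

X_pos = [[0.0, 0.0], [1.0, 0.0]]
X_neg = [[0.1, 0.0], [5.0, 5.0], [5.0, 6.0]]
print(drop_boundary_negatives(X_pos, X_neg))
```

In this toy input the first negative sits next to a positive sample and is removed, while the two distant negatives are each other's nearest neighbor and are kept.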

[0012] Optionally, oversampling multiple positive samples to obtain a candidate positive sample cluster includes: selecting any one of the positive samples in each preset sample cluster as a first sample; selecting any one of the nearest neighbor samples of the first sample as a second sample in the same preset sample cluster, wherein the nearest neighbor sample is the positive sample that is the nearest neighbor of the first sample; performing linear interpolation based on the first sample and the second sample to generate a synthetic sample; and generating the candidate positive sample cluster based on the positive sample and the synthetic sample.

[0013] Optionally, selecting any one of the nearest neighbor samples of the first sample as the second sample within the same preset sample cluster includes: determining the distance between each positive sample and the first sample within the same preset sample cluster; selecting a preset number of positive samples as the nearest neighbor samples in ascending order of distance to obtain a nearest neighbor cluster; and selecting any one of the nearest neighbor samples from the nearest neighbor cluster as the second sample.

[0014] According to another aspect of the present invention, a training apparatus for a prediction model of adverse drug reactions is also provided, comprising: an acquisition module for acquiring a training sample cluster, wherein the training sample cluster includes: multiple positive samples and multiple negative samples, the positive samples representing sample features that produce adverse drug reactions, and the negative samples representing sample features that do not produce adverse drug reactions; a first sampling module for oversampling the multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes: multiple candidate positive samples; a second sampling module for undersampling the multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes: multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and a training module for training a prediction model of adverse drug reactions based on the candidate positive sample cluster and the candidate negative sample cluster.

[0015] According to another aspect of the present invention, an electronic device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute a training method for a prediction model of adverse drug reactions through the computer program.

[0016] According to another aspect of the present invention, a computer program product is also provided, including computer instructions that, when executed by a processor, implement the steps of a training method for a prediction model of adverse drug reactions.

[0017] The embodiments described above in this application alleviate the scarcity of positive samples by oversampling the positive samples in the training sample cluster, increasing the amount of information in the dataset that reflects adverse drug reaction characteristics and improving the sufficiency of model training. At the same time, the undersampling strategy applied to the negative samples preserves their structural representativeness, reduces the class imbalance of the dataset, and avoids excessive reliance on negative instances during model training, thereby reducing model bias. Ensuring that the difference between the number of candidate positive samples and the number of candidate negative samples does not exceed a preset threshold achieves a balance between positive and negative samples. This helps improve the model's sensitivity and specificity, making it more accurate in predicting adverse drug reactions. Training the adverse drug reaction prediction model on the balanced samples allows the model to learn feature associations on a more balanced data distribution, enhancing its generalization ability and prediction accuracy. This alleviates the technical problem of performance bias that may arise during model training due to the uneven distribution of adverse drug reaction samples in existing technologies.

Attached Figure Description

[0018] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0019] Figure 1 is a flowchart of a method for training a predictive model for adverse drug reactions according to an embodiment of the present invention;

[0020] Figure 2 is a schematic diagram of a spectral clustering and hybrid sampling process according to an embodiment of the present invention;

[0021] Figure 3 is a schematic diagram comparing the data balance effect before and after spectral clustering and hybrid sampling optimization according to an embodiment of the present invention;

[0022] Figure 4 is a schematic diagram comparing the predictive performance of an optimized prediction model for adverse drug reaction subgroups according to an embodiment of the present invention;

[0023] Figure 5 is a schematic diagram of a training device for a predictive model of adverse drug reactions according to an embodiment of the present invention;

[0024] Figure 6 is a structural block diagram of a computer terminal according to an embodiment of the present invention.

Detailed Implementation

[0025] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0026] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this invention are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of the invention described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0027] According to an embodiment of the present invention, a method for training a predictive model of adverse drug reactions is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0028] Figure 1 is a flowchart of a training method for a predictive model of adverse drug reactions according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:

[0029] Step S102: Obtain a training sample cluster, wherein the training sample cluster includes: multiple positive samples and multiple negative samples, where positive samples represent sample features that produce adverse drug reactions and negative samples represent sample features that do not produce adverse drug reactions.

[0030] Step S104: Oversample multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes multiple candidate positive samples;

[0031] Step S106: Undersample multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold.

[0032] Step S108: Train the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.

[0033] The embodiments described above in this application alleviate the scarcity of positive samples by oversampling the positive samples in the training sample cluster, increasing the amount of information in the dataset that reflects adverse drug reaction characteristics and improving the sufficiency of model training. At the same time, the undersampling strategy applied to the negative samples preserves their structural representativeness, reduces the class imbalance of the dataset, and avoids excessive reliance on negative instances during model training, thereby reducing model bias. Ensuring that the difference between the number of candidate positive samples and the number of candidate negative samples does not exceed a preset threshold achieves a balance between positive and negative samples. This helps improve the model's sensitivity and specificity, making it more accurate in predicting adverse drug reactions. Training the adverse drug reaction prediction model on the balanced samples allows the model to learn feature associations on a more balanced data distribution, enhancing its generalization ability and prediction accuracy. This solves the technical problem in existing technologies where uneven distribution of adverse drug reaction samples can lead to performance bias during model training.

[0034] In step S102 above, the training sample cluster can be raw data of the target population obtained from an authorized electronic health record (EHR), clinical trial database or drug monitoring system, which includes demographic data, clinical characteristics and laboratory test data, as well as outcome information of specific adverse reactions after drug use.

[0035] Optionally, demographic data include the patient's biological sex, age, ethnicity, etc.; clinical characteristics include medical history and behavior (such as smoking, drinking, and past medical history) and physical and functional status (such as body mass index (BMI) and ideal body weight (IBW)); laboratory test data include liver function indicators (such as alanine aminotransferase (ALT) and aspartate aminotransferase (AST)), kidney function indicators (such as creatinine (Cr)), complete blood count indicators (such as white blood cell count and hemoglobin concentration), and urinalysis indicators (such as urine protein (PRO) and urine glucose (GLU)); medication data include drug type, dosage, route of administration, and information on adverse reactions after medication (such as whether hepatotoxicity or an allergic reaction has occurred).

[0036] As an optional implementation, demographic data, clinical characteristics data, and laboratory test data are used as the input variables, and whether an adverse reaction occurs after medication is used as the binary target variable (positive sample: adverse reaction occurred; negative sample: no adverse reaction occurred). The associated notation denotes the feature dimension, the sample dimension, the sample number, and the feature number.

[0037] As an optional example, the missing value ratio is calculated for each variable, and an appropriate imputation strategy is selected based on that ratio. For variables with a missing value ratio greater than 5%, the k-nearest neighbor (KNN) imputation method is used. For variables with a missing value ratio less than or equal to 5%, the data type is determined: a categorical variable is imputed once with the mode; for a continuous variable, its distribution is first tested and the imputation method is selected based on the normality test result: if the test result supports the normality hypothesis (…), the mean is used for single imputation; otherwise, the median is used for imputation.

[0038] Optionally, when performing a normality test on a continuous variable, the Shapiro–Wilk test is used if the sample size is less than 5000; the Kolmogorov–Smirnov test is used if the sample size is greater than or equal to 5000.
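The decision logic of the two preceding paragraphs can be sketched as a function that only returns the name of the strategy to apply. This is a hedged illustration: the function names, the use of `None` to mark missing values, the significance level `alpha=0.05` (the original condition is elided), and standardizing the data with the sample mean and standard deviation before the Kolmogorov–Smirnov test are all assumptions. It relies on `scipy.stats.shapiro` and `scipy.stats.kstest`.

```python
import numpy as np
from scipy import stats

def looks_normal(values, alpha=0.05):
    """Normality check: Shapiro-Wilk below 5000 samples, KS otherwise."""
    values = np.asarray(values, dtype=float)
    if len(values) < 5000:
        p = stats.shapiro(values).pvalue
    else:
        # Assumption: compare standardised data against the standard normal.
        z = (values - values.mean()) / values.std(ddof=1)
        p = stats.kstest(z, "norm").pvalue
    return bool(p > alpha)

def choose_imputation(column, is_categorical, alpha=0.05):
    """Return 'knn', 'mode', 'mean' or 'median' for one variable."""
    observed = [v for v in column if v is not None]
    n_missing = len(column) - len(observed)
    if n_missing / len(column) > 0.05:   # more than 5% missing
        return "knn"
    if is_categorical:
        return "mode"
    return "mean" if looks_normal(observed, alpha) else "median"

skewed = list(np.exp(np.linspace(0, 10, 200)))  # strongly right-skewed values
print(choose_imputation(skewed, is_categorical=False))
print(choose_imputation([1.0] * 90 + [None] * 10, is_categorical=False))
```

Returning a label rather than imputing in place keeps the sketch independent of any particular KNN imputation library.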

[0039] It should be noted that the Shapiro–Wilk test is a commonly used normality test, primarily applied to small and medium-sized samples (typically fewer than 2000 data points). It assesses how close the data are to a theoretical normal distribution by computing a linear combination of the ordered sample values, weighted according to the expected values of the theoretical distribution. The closer the test statistic W is to 1, the closer the sample data are to a normal distribution. The test provides a p-value; if the p-value is greater than the chosen significance level (e.g., 0.05), the null hypothesis cannot be rejected, and the data can be treated as originating from a normal distribution.

[0040] It should be noted that the Kolmogorov–Smirnov test (KS test) is a non-parametric test that can be applied to any continuous distribution, not just the normal distribution. It assesses whether data originate from a specific theoretical distribution by comparing the sample cumulative distribution function (CDF) with the theoretical CDF. The test statistic D is the maximum distance between the sample CDF and the theoretical CDF. The KS test also provides a p-value; when the p-value is greater than the significance level, the null hypothesis that the data originate from the specified theoretical distribution cannot be rejected.
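For reference, the statistic D described above has the standard form below. The symbols F_n (empirical CDF of the n observations x_1, ..., x_n) and F (theoretical CDF under test) are introduced here for illustration and do not appear in the original text.

```latex
D_n = \sup_{x} \left| F_n(x) - F(x) \right|,
\qquad
F_n(x) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{ x_i \le x \}
```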

[0041] As an optional example, correlation and variance analysis are performed on the input variables. The correlation coefficient matrix between all pairs of variables is calculated, with a correlation threshold of 0.8. For feature groups where the absolute value of the correlation coefficient between any two variables exceeds this threshold, their respective correlations with the target variable are further calculated. Features with a higher correlation to the target variable are retained, while other highly correlated features within the group are removed to eliminate the effects of multicollinearity. The variance of all variables is calculated, with a variance threshold of 0.01. Variables with variances below this threshold are considered invalid features containing very little information and having too low a degree of variation, and are directly deleted, further reducing the data dimensionality while retaining valid information.
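The correlation and variance screening in the preceding paragraph can be sketched as follows, using the stated thresholds of 0.8 and 0.01. The function name `select_features`, the use of the Pearson coefficient, and the pairwise greedy handling of correlated groups (for each highly correlated pair, drop the feature less correlated with the target) are simplifying assumptions of this sketch.

```python
import numpy as np

def select_features(X, y, corr_threshold=0.8, var_threshold=0.01):
    """Return indices of features kept after correlation and variance filters."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    # Constant columns give NaN correlations; silence the resulting warnings.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.corrcoef(X, rowvar=False)
        target_corr = np.array(
            [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
        )
    target_corr = np.nan_to_num(target_corr)      # NaN -> 0 for constant columns
    kept = np.ones(n_features, dtype=bool)
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if kept[i] and kept[j] and abs(corr[i, j]) > corr_threshold:
                # Keep the feature more strongly correlated with the target.
                loser = j if target_corr[i] >= target_corr[j] else i
                kept[loser] = False
    kept &= X.var(axis=0) >= var_threshold        # drop near-constant features
    return np.flatnonzero(kept)

X = np.column_stack([
    [0, 1, 2, 3, 4, 5, 6, 7],        # informative feature
    [0, 1, 2, 3, 4, 5, 6, 8],        # nearly a copy of the first
    [1, -1, 1, -1, 1, -1, 1, -1],    # weakly related feature
    [3, 3, 3, 3, 3, 3, 3, 3],        # constant feature
])
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(select_features(X, y))
```

In this toy input the near-duplicate column is removed by the correlation filter and the constant column by the variance filter.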

[0042] As an optional example, after completing the variable selection, each type of variable is re-encoded: for numerical variables, Z-score standardization is used to eliminate the differences in units and value ranges between variables; for categorical variables, one-hot encoding is used to convert discrete category information into numerical features that can be used for model training.
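A minimal sketch of the two encodings named above is shown here. The helper names `zscore` and `one_hot`, the alphabetical ordering of categories, and the example values are illustrative assumptions.

```python
import numpy as np

def zscore(values):
    """Standardise a numerical variable to zero mean and unit variance."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def one_hot(values):
    """One-hot encode a categorical variable; returns (categories, matrix)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    matrix = np.zeros((len(values), len(categories)), dtype=int)
    for row, v in enumerate(values):
        matrix[row, index[v]] = 1
    return categories, matrix

print(zscore([50.0, 60.0, 70.0]))
print(one_hot(["oral", "iv", "oral"]))
```

In practice the mean and standard deviation would be estimated on the training data and reused for new patients, so that the model sees consistently scaled inputs.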

[0043] In step S104 above, multiple positive samples are oversampled, which can be done using the SMOTE algorithm.

[0044] It should be noted that SMOTE (Synthetic Minority Over-sampling Technique) is a widely used preprocessing method for imbalanced datasets, primarily used to increase the representation of the minority class (in this context, positive samples, i.e., samples in which an adverse drug reaction occurred). Applying it to each positive subgroup means that the SMOTE process is performed separately on the positive samples within each subgroup, rather than uniformly across the entire positive sample set. This subgroup-wise oversampling strategy makes use of the subgroup structure identified by spectral clustering, allowing finer adjustment of the dataset while maintaining the specificity of each subgroup.

[0045] As an optional example, the basic process of the SMOTE algorithm includes:

[0046] 1) Calculate the distance between samples: For each sample point in a positive subgroup, calculate the distance between it and other sample points.

[0047] 2) Selecting nearest neighbor samples: Randomly select a sample point from the nearest neighbors of each sample point (usually k nearest neighbors are selected, where k is a parameter).

[0048] 3) Generate synthetic samples: Based on the linear interpolation of the original sample points and the nearest neighbor sample points, new synthetic sample points are generated. The interpolation ratio λ is randomly selected from a uniform distribution from 0 to 1.

[0049] 4) Add synthetic samples: Add the generated synthetic sample points to the training dataset to increase the number of positive subgroup samples, thereby reducing the class imbalance problem.

[0050] In the embodiments described above, SMOTE oversampling is performed independently for each positive subgroup in adverse drug reaction prediction. This ensures that while increasing the number of positive samples, the intrinsic characteristics and local structure of each subgroup are preserved, avoiding the subgroup feature ambiguity problem that may occur when traditional SMOTE operates on the overall dataset. In this way, the number of positive samples in each subgroup is increased, while the differences and diversity between subgroups are preserved, which helps to train a more personalized and accurate prediction model.
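The four SMOTE steps above, applied to a single positive subgroup, can be sketched as follows. This is a simplified illustration rather than a reference implementation: the function name `smote_subgroup`, the Euclidean distance, the default `k=5`, and the fixed random seed are assumptions. Each synthetic point is x_i + λ(x_j − x_i) with λ drawn uniformly from [0, 1].

```python
import numpy as np

def smote_subgroup(X, n_new, k=5, seed=0):
    """Oversample one positive subgroup with n_new synthetic samples."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    k = min(k, len(X) - 1)                     # cannot have more neighbours than peers
    # Step 1: pairwise distances within the subgroup.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # exclude each sample itself
    # Step 2: indices of the k nearest neighbours of every sample.
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = np.empty((n_new, X.shape[1]))
    for s in range(n_new):
        i = rng.integers(len(X))               # first sample
        j = neighbours[i, rng.integers(k)]     # second sample, a random neighbour
        lam = rng.uniform(0.0, 1.0)            # interpolation ratio
        # Step 3: linear interpolation between the two samples.
        synthetic[s] = X[i] + lam * (X[j] - X[i])
    # Step 4: original samples followed by the synthetic ones.
    return np.vstack([X, synthetic])

subgroup = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(smote_subgroup(subgroup, n_new=6, k=2).shape)
```

Calling the function once per positive subgroup, instead of once on the pooled positive samples, is what keeps synthetic points inside the local structure of each subgroup.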

[0051] In step S106 above, multiple negative samples are undersampled, and the undersampling can be performed according to a preset undersampling ratio.
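Ratio-based undersampling together with the balance condition from step S106 can be sketched as below. Random selection without replacement, the function names, and the example ratio and threshold are assumptions; the text does not state how the retained negative samples are chosen.

```python
import numpy as np

def undersample(X_neg, ratio, seed=0):
    """Keep a fraction `ratio` of the negative samples, chosen at random."""
    rng = np.random.default_rng(seed)
    X_neg = np.asarray(X_neg, dtype=float)
    n_keep = max(1, int(round(ratio * len(X_neg))))
    idx = rng.choice(len(X_neg), size=n_keep, replace=False)
    return X_neg[np.sort(idx)]                 # preserve the original ordering

def is_balanced(n_pos, n_neg, threshold):
    """Check that the class counts differ by at most the preset threshold."""
    return abs(n_neg - n_pos) <= threshold

negatives = np.arange(200, dtype=float).reshape(100, 2)
kept = undersample(negatives, ratio=0.3)
print(len(kept), is_balanced(n_pos=28, n_neg=len(kept), threshold=5))
```

If the check fails, the ratio (or the amount of oversampling on the positive side) would need to be adjusted until the difference falls within the threshold.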

[0052] Optionally, after training the adverse drug reaction prediction model based on candidate positive sample clusters and candidate negative sample clusters, the clinical data after the user's authorized use of the drug can be input into the adverse drug reaction prediction model. The model can then predict whether the user will experience adverse drug reactions, facilitating early intervention before adverse drug reactions occur, alleviating the user's suffering, and reducing the adverse effects of adverse drug reactions.

[0053] The above embodiments of this application utilize a drug adverse reaction prediction model trained with balanced positive and negative samples, which can provide clinicians with a more reliable drug safety assessment tool, contributing to personalized medication guidance and ensuring patient safety.

[0054] As an optional embodiment, after obtaining the training sample cluster, the method further includes: classifying the training samples in the training sample cluster to obtain multiple candidate sample clusters, wherein the training samples are positive samples or negative samples, and the candidate samples in the same candidate sample cluster are all positive samples or all negative samples; and using a spectral clustering algorithm to divide each candidate sample cluster into subgroups to obtain multiple candidate subgroups corresponding to each candidate sample cluster.

[0055] In the embodiments described above, after obtaining the training sample cluster, the training samples in the cluster are further classified to obtain multiple candidate sample clusters. Next, a spectral clustering algorithm is used to divide each candidate sample cluster into subgroups, resulting in multiple candidate subgroups corresponding to each candidate sample cluster. This can identify and characterize the subgroup structure within positive and negative samples, realizing the division of subgroups with similar phenotypic features. Based on these candidate subgroups, a drug adverse reaction prediction model is trained, enabling the drug adverse reaction prediction model to capture more accurate features of positive and negative samples and make predictions. This enhances the model's generalization ability and prediction accuracy, solving the technical problem in the prior art where the uneven distribution of drug adverse reaction samples leads to potential performance bias during model training.

[0056] Optionally, the candidate sample clusters are divided into: a positive sample cluster that includes only positive samples and a negative sample cluster that includes only negative samples, wherein the candidate subgroup corresponding to the positive sample cluster is a positive subgroup and the candidate subgroup corresponding to the negative sample cluster is a negative subgroup.

[0057] Optionally, when oversampling positive samples in a positive subgroup, oversampling can be performed on positive samples in multiple positive subgroups according to a pre-set oversampling ratio.

[0058] Optionally, when undersampling negative samples in a negative subgroup, undersampling can be performed on negative samples in multiple negative subgroups according to a pre-set undersampling ratio.

[0059] As an optional implementation, using a spectral clustering algorithm to divide each candidate sample cluster into subgroups to obtain multiple candidate subgroups corresponding to each candidate sample cluster includes: constructing a Laplacian matrix for the candidate sample cluster based on the similarity between any two candidate samples within the same candidate sample cluster, wherein the Laplacian matrix is constructed from a degree matrix and an adjacency matrix, the degree matrix is a diagonal matrix in which each diagonal element is the sum of the similarities between one candidate sample and the other candidate samples in the cluster, and each element of the adjacency matrix represents the similarity between two candidate samples; performing eigenvalue decomposition on the Laplacian matrix to obtain an ascending sequence of eigenvalues and the corresponding eigenvector matrix, wherein the eigenvalue sequence includes multiple eigenvalues and the eigenvector matrix includes the eigenvector corresponding to each eigenvalue; selecting, from the eigenvector matrix, the feature matrix to be clustered corresponding to the preset number of clusters, wherein the feature matrix to be clustered includes the eigenvectors corresponding to the candidate eigenvalues, the candidate eigenvalues are taken from the eigenvalue sequence in ascending order and their number equals the preset number of clusters, and each row vector of the feature matrix to be clustered represents the projection of one candidate sample onto a space whose dimension equals the preset number of clusters; and clustering the candidate samples in the candidate sample cluster according to the feature matrix to be clustered, using the preset number of clusters as the clustering number parameter, to obtain multiple candidate subgroups.

[0060] In the embodiments described above, a method based on spectral clustering is used to divide the candidate sample clusters into subgroups. A Laplacian matrix is constructed by calculating the similarity between any two candidate samples; it is built from a degree matrix and an adjacency matrix. The diagonal elements of the degree matrix represent the sum of similarities between each sample and the other samples, while the elements of the adjacency matrix reflect the direct similarity between samples. The Laplacian matrix is then subjected to eigenvalue decomposition to obtain an ascending sequence of eigenvalues and the corresponding eigenvector matrix. Next, the eigenvectors corresponding to the preset number of clusters are selected to form the feature matrix to be clustered. This matrix contains the eigenvectors of the smallest eigenvalues, taken in ascending order, and represents the projection of the samples into a low-dimensional space. Finally, a clustering algorithm is applied to the feature matrix to be clustered to partition the candidate sample cluster, resulting in multiple candidate subgroups whose members have similar features. By using spectral clustering to find structure within the data, a refined division of sample subgroups is achieved. This helps to preserve the characteristics of the subgroups during subsequent sampling optimization, thereby improving the accuracy and robustness of the adverse drug reaction prediction model. This solves the technical problem that existing technologies may have performance biases during model training due to the uneven distribution of adverse drug reaction samples.

[0061] As an optional example, suppose the features of any two positive (or negative) samples are denoted $x_i$ and $x_j$. The similarity between samples is calculated with a Gaussian kernel function and used as the weight, giving a similarity matrix $W$ that characterizes the local similarity between samples: $W_{ij} = \exp\left(-\lVert x_i - x_j \rVert^2 / (2\sigma^2)\right)$, where $\sigma$ is the kernel bandwidth;

[0062] Calculate the degree matrix $D = \mathrm{diag}(d_1, \dots, d_n)$, where $d_i = \sum_j W_{ij}$; $D$ is a diagonal matrix whose diagonal elements represent the sample degrees. Construct the normalized Laplacian matrix: $L_{\mathrm{sym}} = I - D^{-1/2} W D^{-1/2}$;

[0063] Perform eigenvalue decomposition on the normalized Laplacian matrix $L_{\mathrm{sym}}$ to obtain an ascending sequence of eigenvalues $\lambda_1 \le \lambda_2 \le \dots \le \lambda_n$ and the corresponding eigenvector matrix $U = [u_1, \dots, u_n]$.

[0064] Optionally, after the optimal number of clusters $k$ (i.e., the preset number of clusters) has been obtained, take the $k$ smallest non-zero eigenvalues from the eigenvalue sequence, form the matrix $U_k \in \mathbb{R}^{n \times k}$ from the eigenvectors corresponding to these eigenvalues, and perform L2 normalization on the row vector $y_i$ of each sample: $\tilde{y}_i = y_i / \lVert y_i \rVert_2$.

[0065] In the spectral embedding space formed by the vectors $\tilde{y}_i$, the Euclidean distance between samples is calculated and K-means clustering is performed (using K-means++ to initialize the centroids and iterating until the centroids converge) to divide the samples into $k$ subgroups, thereby obtaining the subgroup structure of the positive (or negative) samples at the optimal number of clusters.

[0066] It should be noted that, in spectral clustering, the normalized eigenvector matrix provides a new representation of each original data point, namely its coordinates in a newly derived feature space often referred to as the "spectral embedding space." When K-means or another clustering algorithm is run on the eigenvector matrix, clustering operates on its row vectors (i.e., the coordinates of the data points in the spectral embedding space) and seeks the optimal grouping of the data points in that space for the predetermined number of clusters.
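
As an illustration of the spectral embedding described above, the following is a minimal NumPy sketch, not the implementation of this application. The function name `spectral_embedding`, the bandwidth `sigma`, the zeroed diagonal of the similarity matrix, and the toy data are assumptions made for the example. For simplicity it takes the first k eigenvectors in ascending eigenvalue order and does not filter out zero eigenvalues.

```python
import numpy as np

def spectral_embedding(X, k, sigma=1.0):
    """Return ascending eigenvalues and row-normalized spectral coordinates (n x k)."""
    X = np.asarray(X, dtype=float)
    # Adjacency (similarity) matrix from a Gaussian kernel on pairwise distances.
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # assumption: no self-similarity
    # Degree of each sample: sum of its similarities to the other samples.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)
    U = eigvecs[:, :k]
    # L2-normalize each row so every sample lies on the unit sphere.
    norms = np.linalg.norm(U, axis=1, keepdims=True)
    return eigvals, U / np.maximum(norms, 1e-12)

# Toy data (hypothetical): two well-separated groups of three samples each.
X = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
eigvals, emb = spectral_embedding(X, k=2, sigma=1.0)
```

In this toy case the rows of `emb` that belong to the same group are nearly identical and rows from different groups are nearly orthogonal, which is what makes a subsequent K-means step on the rows straightforward.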

[0067] Optionally, the preset number of clusters and the similarity calculation method can be adjusted to suit different application scenarios. This series of operations not only reveals potential group heterogeneity but also provides a basis for the subsequent SMOTE oversampling and stratified undersampling, supporting the training quality of the prediction model.

[0068] As an optional embodiment, before selecting from the eigenvector matrix the feature matrix to be clustered corresponding to the preset number of clusters, the method further includes: determining a calculated number of clusters based on the maximum difference between two adjacent eigenvalues in the eigenvalue sequence; taking the larger of the calculated number of clusters and the preset number of clusters as the upper bound of an interval, and a preset minimum number of clusters as the lower bound of the interval, to obtain a candidate cluster number interval, wherein the candidate cluster number interval includes multiple candidate numbers of clusters; selecting from the eigenvector matrix a candidate feature matrix corresponding to each candidate number of clusters, and performing pre-clustering with that candidate number of clusters as the clustering number parameter to obtain multiple clustering results, wherein each clustering result consists of the candidate subgroups obtained with a different candidate number of clusters; and evaluating the clustering effect corresponding to each candidate number of clusters based on the multiple clustering results, and determining the candidate number of clusters with the best clustering effect as the preset number of clusters.

[0069] In the embodiments described above, when a spectral clustering algorithm is used for subgroup division, a number of clusters is calculated by analyzing the maximum difference between two adjacent eigenvalues in the eigenvalue sequence. This method, known as the eigenvalue gap method, aims to automatically select a number of clusters that reflects the inherent structure of the data. The calculated number of clusters is then compared with a preset upper limit for the number of clusters, and the larger of the two values is taken as the upper bound of the candidate cluster number interval. Within this interval, for each candidate number of clusters, the corresponding eigenvector matrix is selected and clustered to generate candidate subgroups. This avoids the bias that may be introduced by setting the number of clusters subjectively, supports a reasonable and accurate division of the positive subgroups, and provides a foundation for the subsequent subgroup-specific SMOTE oversampling. This improves the classification performance and clinical applicability of the adverse drug reaction prediction model on balanced datasets, and solves the technical problem that the uneven distribution of adverse drug reaction samples in the prior art may lead to performance bias during model training.
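
The construction of the candidate cluster number interval can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `candidate_cluster_numbers`, the default lower bound `k_min=2`, and the eigenvalue sequence are hypothetical, and positions in the eigenvalue sequence are counted from zero so that a largest gap at position m gives m + 1 clusters.

```python
import numpy as np

def candidate_cluster_numbers(eigvals, k_preset, k_min=2):
    """Return (calculated cluster number, list of candidate cluster numbers)."""
    eigvals = np.sort(np.asarray(eigvals, dtype=float))  # ascending order
    gaps = np.diff(eigvals)        # gaps[i] = eigvals[i + 1] - eigvals[i]
    m = int(np.argmax(gaps))       # zero-based position of the largest gap
    k_gap = m + 1                  # calculated number of clusters
    upper = max(k_gap, k_preset)   # upper bound of the candidate interval
    return k_gap, list(range(k_min, upper + 1))

# Hypothetical eigenvalue sequence: three small eigenvalues, then a jump.
eigvals = [0.0, 0.02, 0.05, 0.71, 0.80, 0.93]
k_gap, candidates = candidate_cluster_numbers(eigvals, k_preset=4)
```

Here the largest gap lies between the third and fourth eigenvalues, so the calculated number of clusters is 3, and with a preset number of 4 the candidates are 2, 3 and 4.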

[0070] Optionally, the clustering effect of each candidate cluster number can be evaluated by combining multiple indicators. The evaluation indicators include, but are not limited to, the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index. The clustering effect of each candidate cluster number is quantitatively evaluated, and the candidate cluster number with the best clustering effect (or the highest comprehensive index) is taken as the final preset cluster number.

[0071] As an optional embodiment, to avoid subjectivity in selecting the number of clusters, this embodiment adopts a multi-criteria joint decision-making strategy to determine the number of clusters $k$. The details are as follows:

[0072] (1) Preset the upper limit of the number of clusters $K_{\max}$ (i.e., the preset number of clusters).

[0073] (2) Calculate the difference between adjacent eigenvalues, $\delta_i = \lambda_{i+1} - \lambda_i$, and determine the number of clusters corresponding to the largest gap (i.e., the calculated number of clusters), $k_{\mathrm{gap}}$.

[0074] It should be noted that calculating the difference $\delta_i$ between adjacent eigenvalues amounts to calculating the eigenvalue gap; a large eigenvalue gap often implies a natural segmentation of the data points in the feature space, meaning there may be a cluster boundary.

[0075] Optionally, determining the number of clusters corresponding to the maximum gap (i.e., the calculated number of clusters) includes: finding the position $m$ in the eigenvalue sequence at which the eigenvalue gap reaches its maximum value, and setting the calculated number of clusters to $k_{\mathrm{gap}} = m + 1$.

[0076] (3) Take the larger of the preset number of clusters $K_{\max}$ and the calculated number of clusters $k_{\mathrm{gap}}$ as the upper bound for the number of clusters (i.e., the upper bound of the interval of candidate cluster numbers): $K_{\mathrm{up}} = \max(K_{\max}, k_{\mathrm{gap}})$.

[0077] (4) For each candidate number of clusters $k$, select the eigenvectors corresponding to the $k$ smallest eigenvalues to form a matrix $U_k$ (i.e., the candidate feature matrix); after normalizing the row vectors of $U_k$, perform K-means clustering in the spectral embedding space to obtain $k$ positive-sample subgroups.

[0078] (5) Calculate the silhouette coefficient (Silhouette), the Calinski–Harabasz index (CH) and the Davies–Bouldin index (DB) for each clustering result.

[0079] in, .

[0080] (6) After normalization, each indicator is combined with equal weight to obtain a comprehensive score, thereby evaluating the clustering effect of each candidate cluster number.

[0081] Optionally, the overall score is the equal-weight combination of the normalized indicators; a higher overall score indicates a better clustering effect.

[0082] (7) Select the candidate number of clusters with the highest overall score as the final number of clusters (i.e., the preset number of clusters).
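
Steps (5) to (7) can be sketched as below. The exact scoring formula is not legible in the text above, so this example makes its own assumptions: each indicator is min-max normalized across the candidate cluster numbers, the Davies–Bouldin index (where lower is better) is inverted after normalization, and the three terms are averaged. The function names and all indicator values are hypothetical.

```python
import numpy as np

def min_max(values):
    """Scale values to [0, 1]; constant inputs map to 0.5 (assumed convention)."""
    values = np.asarray(values, dtype=float)
    span = values.max() - values.min()
    if span == 0:
        return np.full_like(values, 0.5)
    return (values - values.min()) / span

def composite_scores(sil, ch, db):
    """Equal-weight score per candidate; higher means better clustering."""
    # Silhouette and Calinski-Harabasz: higher is better.
    # Davies-Bouldin: lower is better, so it is inverted after normalization.
    return (min_max(sil) + min_max(ch) + (1.0 - min_max(db))) / 3.0

candidates = [2, 3, 4]          # candidate cluster numbers
sil = [0.41, 0.58, 0.36]        # made-up silhouette coefficients
ch = [22.0, 35.0, 30.0]         # made-up Calinski-Harabasz indices
db = [1.10, 0.70, 0.95]         # made-up Davies-Bouldin indices
scores = composite_scores(sil, ch, db)
best_k = candidates[int(np.argmax(scores))]
```

With these made-up values, the candidate k = 3 is best on all three indicators and is therefore selected.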

[0083] It should be noted that, to ensure medical significance and sampling feasibility, the sample size for each subgroup is set to be no less than a predetermined threshold (e.g., 5 cases), and adjacent clusters may be merged or adjusted as necessary to meet clinical interpretability.

[0084] As an optional embodiment, before training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster, the method further includes: removing the candidate negative samples that are the nearest neighbors of the candidate positive samples in the candidate negative sample cluster.

[0085] As an optional embodiment, before training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster, the method further includes: merging the candidate positive sample cluster and the candidate negative sample cluster into a balanced training cluster, wherein the balanced samples in the balanced training cluster are either candidate positive samples or candidate negative samples; determining the nearest neighbor sample corresponding to each candidate negative sample based on the sample characteristics of the candidate negative samples; and removing the candidate negative sample from the candidate negative sample cluster if the nearest neighbor sample corresponding to the candidate negative sample is a candidate positive sample.

[0086] In the embodiments described above, the positive sample cluster obtained through oversampling and the negative sample cluster obtained through undersampling are merged into a balanced training cluster. During the merging process, the Tomek Links algorithm is introduced to identify and remove potential boundary noise samples in the negative sample cluster, optimizing the decision boundary and reducing inter-class noise interference. Specifically, based on the sample characteristics of the candidate negative samples, the nearest neighbor sample corresponding to each candidate negative sample is determined. If the nearest neighbor sample and the candidate negative sample belong to different categories (i.e., one is a candidate positive sample and the other is a candidate negative sample), the candidate negative sample is identified as a noise sample on the inter-class boundary and removed from the negative sample cluster in subsequent operations. This achieves a balanced distribution of class samples and eliminates noise interference on the decision boundary, laying the foundation for building a more stable and accurate adverse drug reaction prediction model. It significantly improves the model's classification performance, especially when dealing with highly imbalanced real-world data, better capturing the subtle characteristics of adverse drug reactions, thereby improving the accuracy and reliability of predictions. This solves the technical problem in existing technologies where uneven distribution of adverse drug reaction samples leads to potential performance bias during model training.

[0087] As an alternative example, cleaning boundary noise with the Tomek Links algorithm includes: if a sample pair $(x_i, x_j)$ satisfies the conditions that $x_i$ belongs to the positive category (e.g., a candidate positive sample), $x_j$ belongs to the negative category (e.g., a candidate negative sample), and $x_i$ and $x_j$ are each other's nearest neighbors, then the sample pair is determined to be located at the inter-class boundary. In this embodiment, only the negative-class member of such a pair is removed, that is, the candidate negative sample is removed, so as to preserve the integrity of the positive samples and reduce the inter-class overlap, thereby optimizing the classification decision boundary.
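
The negative-only Tomek Links cleaning described above can be sketched with NumPy as follows. This is an illustrative sketch, not the implementation of this application: the function name `remove_tomek_negatives`, the label convention (1 for positive, 0 for negative), the Euclidean distance, and the toy data are assumptions.

```python
import numpy as np

def remove_tomek_negatives(X, y):
    """Drop negatives (label 0) that are mutual nearest neighbors of a positive (label 1)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nn = dist.argmin(axis=1)  # nearest neighbor index of each sample
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        j = nn[i]
        is_mutual = nn[j] == i
        # Only the negative member of a cross-class mutual pair is removed.
        if y[i] == 0 and y[j] == 1 and is_mutual:
            keep[i] = False
    return X[keep], y[keep]

# Toy one-dimensional data (hypothetical): the negative at 1.0 sits next to a positive at 1.1.
X = [[0.0], [0.2], [1.0], [1.1], [2.0], [2.3]]
y = [0, 0, 0, 1, 1, 1]
X_clean, y_clean = remove_tomek_negatives(X, y)
```

In the toy data only the negative sample at 1.0 is removed, and all three positive samples are kept.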

[0088] As an optional embodiment, oversampling multiple positive samples to obtain a candidate positive sample cluster includes: selecting any one positive sample as a first sample from among multiple positive samples in each preset sample cluster; selecting any nearest neighbor sample of the first sample as a second sample within the same preset sample cluster, wherein the nearest neighbor sample is a positive sample that is the nearest neighbor of the first sample; performing linear interpolation based on the first sample and the second sample to generate a synthetic sample; and generating a candidate positive sample cluster based on the positive sample and the synthetic sample.

[0089] The above embodiments of this application further refine the SMOTE oversampling strategy implemented for positive samples, aiming to generate synthetic samples that more closely resemble the actual distribution, thereby enhancing the model's generalization ability. Specifically, firstly, a positive sample is randomly selected from each positive subgroup as the base sample, i.e., the first sample. Next, the k nearest neighbor samples of this sample are identified within the same subgroup as potential second samples, where k is an adaptively selected neighbor number parameter to avoid over-synthesizing noisy samples. Then, linear interpolation is performed based on the first sample and the randomly selected second sample to generate a new synthetic sample. This process is repeated until the target multiple of the number of samples in each subgroup is met. Through this subgroup-by-subgroup, sample-by-sample oversampling method, not only is the number of positive samples effectively increased, alleviating the class imbalance problem of training samples, but it also ensures that the generated synthetic samples can better reflect the characteristic distribution of the original positive samples, avoiding the risk of model overfitting that may be caused by over-synthesis. This solves the technical problem in the prior art where the uneven distribution of adverse drug reaction samples leads to potential performance bias during model training.

[0090] Optionally, the candidate positive samples in the candidate positive sample cluster include: positive samples and synthetic samples.

[0091] Optionally, by adjusting the k value to adapt to the sample density of different subgroups, the quality and diversity of the synthesized samples are ensured. Furthermore, compared to traditional indiscriminate oversampling, the strategy in this embodiment significantly enhances the model's ability to predict individualized adverse drug reaction risks while preserving the internal structure of positive samples, thus improving prediction accuracy and reliability.

[0092] As an optional embodiment, selecting any nearest neighbor sample of the first sample as the second sample within the same preset sample cluster includes: determining the distance relationship between each positive sample and the first sample within the same preset sample cluster; selecting a preset number of positive samples as nearest neighbor samples in ascending order of distance values to obtain a nearest neighbor cluster; and selecting any nearest neighbor sample from the nearest neighbor cluster as the second sample.

[0093] In the embodiments described above, when SMOTE oversampling is applied to positive samples, the distance relationship between each positive sample and the first sample is first determined within the same preset sample cluster. Then, a preset number of positive samples are selected as the nearest neighbors of the first sample in ascending order of distance values, thereby constructing a nearest neighbor cluster. Next, any nearest neighbor sample is randomly selected from the nearest neighbor cluster as the second sample, and a synthetic sample is generated based on this. This strategy ensures that the generation process of the synthetic sample fully considers the local similarity within the subgroup, avoids the increase in noise samples caused by oversampling, and effectively improves the quality and representativeness of the synthetic sample. Through this specific oversampling strategy, the problem of scarce positive samples is alleviated, while ensuring that the training dataset can more accurately reflect the feature structure within different subgroups, thereby improving the prediction accuracy and stability of the model. This solves the technical problem in existing technologies where uneven distribution of adverse drug reaction samples leads to potential performance bias during model training.
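
The selection of the first and second samples and the linear interpolation can be sketched as follows for a single positive subgroup. This is a minimal sketch under stated assumptions: the function name `smote_subgroup`, the Euclidean distance, the default of 5 neighbors, the uniform interpolation weight, and the toy data are not specified in this form by the text above.

```python
import numpy as np

def smote_subgroup(X_sub, n_new, k=5, rng=None):
    """Return the original subgroup samples followed by n_new synthetic samples."""
    rng = np.random.default_rng(rng)
    X_sub = np.asarray(X_sub, dtype=float)
    n = len(X_sub)
    if n < 2:
        raise ValueError("at least two samples are needed to interpolate")
    k = min(k, n - 1)  # a sample has at most n - 1 neighbors in its subgroup
    dist = np.linalg.norm(X_sub[:, None, :] - X_sub[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    # Nearest neighbor cluster: the k closest samples in ascending distance order.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                # first sample
        j = neighbors[i, rng.integers(k)]  # second sample, drawn from the neighbors
        lam = rng.random()                 # interpolation weight in [0, 1)
        synthetic.append(X_sub[i] + lam * (X_sub[j] - X_sub[i]))
    synthetic = np.array(synthetic).reshape(-1, X_sub.shape[1])
    return np.vstack([X_sub, synthetic])

# Toy subgroup (hypothetical): six positive samples with two features each.
X_sub = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                  [1.0, 1.0], [0.5, 0.2], [0.3, 0.8]])
X_aug = smote_subgroup(X_sub, n_new=6, k=5, rng=0)
```

Because every synthetic sample lies on the line segment between two existing samples of the same subgroup, it stays inside the region already covered by that subgroup.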

[0094] The embodiments described above in this application also enhance the model's ability to learn edge cases, which helps to improve the clarity of classification boundaries and reduce the misclassification rate. In particular, it shows better performance when processing highly heterogeneous clinical data.

[0095] Figure 2 is a schematic diagram of a process based on spectral clustering and hybrid sampling according to an embodiment of the present invention. As shown in Figure 2, for cases with a severely imbalanced category distribution, a hybrid sampling strategy combining spectral clustering, subgroup-specific SMOTE oversampling, and Tomek Links boundary cleaning can be adopted. The specific implementation process is as follows:

[0096] Step S31, sample subgroup division. For the positive sample set and the negative sample set, spectral clustering algorithm is used to divide them into subgroups respectively.

[0097] Step S32, positive subgroup-specific SMOTE oversampling.

[0098] Optionally, for each positive subgroup, the SMOTE algorithm is applied independently for subgroup-specific oversampling. For any sample $x_i$ within a given subgroup, calculate its set of 5 nearest neighbor samples $N_5(x_i)$, randomly select one nearest neighbor $x_{nn}$ from this set, and generate a synthetic sample: $x_{\mathrm{new}} = x_i + \lambda (x_{nn} - x_i)$, where $\lambda$ is a random number in $[0, 1]$.

[0099] Repeat the above process until the sample size of each subgroup reaches the set balance target. For example, if the original sample sizes of the subgroups are [5, 10, 6], each subgroup can be oversampled to a set multiple of its original sample size. This approach aims to mitigate the overall imbalance while preserving the distribution differences among subgroups.

[0100] Step S33, stratified undersampling of negative samples.

[0101] Optionally, for negative samples, stratified undersampling can be performed within each subgroup. For example, if the original sample sizes of the subgroups are [35, 90, 60], each subgroup can be undersampled to a set fraction of its original sample size. This reduces the overall sample size while maintaining the representativeness of the internal structure of the negative samples.
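
Stratified undersampling of the negative subgroups can be sketched as below, using the subgroup sizes from the example above. The function name `stratified_undersample`, the random selection without replacement, and the rule of keeping at least one sample per subgroup are assumptions; the fraction 0.4 is borrowed from the validation described in paragraph [0104] purely for illustration.

```python
import numpy as np

def stratified_undersample(subgroup_labels, fraction, rng=None):
    """Return sorted indices kept after sampling the same fraction from every subgroup."""
    rng = np.random.default_rng(rng)
    subgroup_labels = np.asarray(subgroup_labels)
    kept = []
    for g in np.unique(subgroup_labels):
        idx = np.flatnonzero(subgroup_labels == g)
        # Keep a fixed fraction of each subgroup, but never fewer than one sample.
        n_keep = max(1, int(round(fraction * len(idx))))
        kept.extend(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.array(kept))

# Three negative subgroups with 35, 90 and 60 samples.
labels = np.repeat([0, 1, 2], [35, 90, 60])
kept = stratified_undersample(labels, fraction=0.4, rng=0)
counts = np.bincount(labels[kept])
```

Each subgroup shrinks by the same fraction (here to 14, 36 and 24 samples), so the relative sizes of the negative subgroups are preserved.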

[0102] Step S34, Tomek Links cleaning. After positive oversampling and negative undersampling, the sample sets are combined into a balanced training set, and the Tomek Links method is further used to clean up boundary noise.

[0103] As an alternative example, a machine learning classification model (i.e., a drug adverse reaction prediction model) is trained based on the final balanced training dataset (e.g., constructed based on candidate positive sample clusters and candidate negative sample clusters). This model can be trained using various classification algorithms such as Random Forest, Gradient Boosting, Support Vector Machine (SVM), and Naive Bayes. Five-fold cross-validation is used to train each model and optimize hyperparameters to ensure the stability and generalization ability of the model.

[0104] The above embodiments of this application were validated on a clinical dataset from a hospital. The original sample consisted of 101 cases, including 83 negative samples and 18 positive samples. Using the mixed sampling strategy described in this application, positive samples were oversampled by 2x and negative samples were undersampled by 0.4x, resulting in a balanced training set (negative:positive ratio of 36:34).

[0105] Figure 3 is a schematic diagram comparing the data balance before and after spectral clustering and mixed sampling optimization according to an embodiment of the present invention. As shown in Figure 3, the lower part shows the ratios of negative to positive samples before and after applying the method provided in this application, and the upper part compares the data structures obtained by principal component analysis (PCA) before and after applying the method provided in this application.

[0106] Figure 4 is a schematic diagram comparing the predictive performance of the optimized prediction models for adverse drug reaction subgroups according to an embodiment of the present invention. As shown in Figure 4, four classification models are built on the balanced training set and on the original dataset, respectively, and their performance is evaluated on an independent test set. The upper part of Figure 4 compares, on the discovery set and the test set, the AUC of the XGBoost, Naive Bayes, Support Vector Machine, and Random Forest models trained on the original data and on the training data balanced with the method of this application when predicting adverse reactions in subgroups. The lower part of Figure 4 compares the ROC curves of the same models on the test set. It can be seen that the performance of the XGBoost model improves after the data balancing strategy of this application is adopted, with the lower bound of the confidence interval of the area under the ROC curve increasing from 0.77 to over 0.8. The prediction performance of the other three models is significantly improved, with the area under the ROC curve increasing from 0.79-0.85 to 0.94-0.95. All four models trained on the balanced training set exhibit strong classification performance: the Naive Bayes model has an AUC of 0.95, the Random Forest and Support Vector Machine models have AUCs of 0.94, and XGBoost has an AUC of 0.89, all better than the corresponding models trained on the original dataset. These results indicate that the hybrid sampling method proposed in this application can mitigate the class imbalance problem, thereby enhancing the predictive ability of the classification model on real clinical data and providing a basis for clinical decision support.

[0107] It should be noted that AUC is a metric used to evaluate the performance of binary classification models, representing the area under the ROC curve. The ROC curve is plotted with the true positive rate (TPR) on the vertical axis and the false positive rate (FPR) on the horizontal axis.

[0108] TPR (True Positive Rate) = Sensitivity = TP / (TP + FN);

[0109] FPR (False Positive Rate) = 1 - Specificity = FP / (FP + TN);

[0110] Among them, TP, or True Positive, means that the actual sample is positive and the model also predicts it as positive, indicating that the model correctly identified the positive sample; FN, or False Negative, means that the actual sample is positive, but the model incorrectly predicts it as negative, indicating that a positive sample was missed or misclassified as a negative sample; FP, or False Positive, means that the actual sample is negative, but the model incorrectly predicts it as positive, indicating a misdiagnosis or false alarm, which may lead to unnecessary examinations or anxiety; TN, or True Negative, means that the actual sample is negative and the model also correctly predicts it as negative, indicating that the model correctly identified the negative sample.
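
The two rates defined above can be computed from predicted and actual labels as in the following sketch. The function names and the example labels are hypothetical; labels use 1 for positive and 0 for negative.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FN, FP, TN) for binary labels, with 1 meaning positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fn, fp, tn

def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR); a rate is reported as 0.0 when its denominator is zero."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0  # sensitivity
    fpr = fp / (fp + tn) if (fp + tn) else 0.0  # 1 - specificity
    return tpr, fpr

# Hypothetical example: 3 actual positives and 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)
```

In this example two of the three positives are found (TPR = 2/3) and one of the five negatives is flagged by mistake (FPR = 0.2).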

[0111] Optionally, AUC values can be interpreted as follows:

[0112] AUC=1.0: Perfect classifier, all positive examples are ranked before negative examples;

[0113] AUC=0.5: Equivalent to random guessing (no discriminative ability);

[0114] AUC < 0.5: The model performs worse than random guessing (its predictions may be inverted);

[0115] AUC > 0.7: The model is generally considered to have acceptable discriminative ability;

[0116] AUC > 0.8: Good;

[0117] AUC>0.9: Excellent.
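
The ranking interpretation of AUC in the list above can be made concrete with a short sketch: AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as one half. The function name `auc_by_ranking` and the example scores are hypothetical, and this pairwise form is meant for small illustrative inputs.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the share of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked before negative
            elif p == n:
                wins += 0.5   # tie
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
auc_perfect = auc_by_ranking(y_true, [0.9, 0.8, 0.2, 0.1])   # every positive first
auc_mixed = auc_by_ranking(y_true, [0.9, 0.4, 0.6, 0.1])     # one pair out of order
auc_random = auc_by_ranking(y_true, [0.5, 0.5, 0.5, 0.5])    # no discrimination
```

The three calls correspond to a perfect classifier, a classifier with one misordered pair out of four, and a classifier whose scores carry no information.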

[0118] According to an embodiment of the present invention, a training device for a prediction model of adverse drug reactions is also provided. It should be noted that the training device for the prediction model of adverse drug reactions can be used to execute the training method for the prediction model of adverse drug reactions in the embodiments of the present invention, and the training method for the prediction model of adverse drug reactions in the embodiments of the present invention can be executed in the training device for the prediction model of adverse drug reactions.

[0119] Figure 5 is a schematic diagram of a training device for a predictive model of adverse drug reactions according to an embodiment of the present invention. As shown in Figure 5, the device may include: an acquisition module 52 for acquiring a training sample cluster, wherein the training sample cluster includes multiple positive samples and multiple negative samples, where positive samples represent sample features that produce adverse drug reactions and negative samples represent sample features that do not produce adverse drug reactions; a first sampling module 54 for oversampling multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes multiple candidate positive samples; a second sampling module 56 for undersampling multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and a training module 58 for training an adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.

[0120] It should be noted that the acquisition module 52 in this embodiment can be used to execute step S102 in this application embodiment, the first sampling module 54 in this embodiment can be used to execute step S104 in this application embodiment, the second sampling module 56 in this embodiment can be used to execute step S106 in this application embodiment, and the training module 58 in this embodiment can be used to execute step S108 in this application embodiment. The examples and application scenarios implemented by the above modules and corresponding steps are the same, but are not limited to the content disclosed in the above embodiments.

[0121] The embodiments described above in this application alleviate the problem of scarce positive samples by oversampling the positive samples in the training sample cluster, increasing the amount of information reflecting adverse drug reaction characteristics in the dataset and improving the sufficiency of model training. At the same time, the undersampling strategy applied to the negative samples preserves their structural representativeness, reduces the class imbalance of the dataset, and avoids excessive reliance on negative instances during model training, thereby reducing model bias. The difference between the number of candidate positive samples and the number of candidate negative samples is kept within a preset threshold, achieving a balance between positive and negative samples. This helps improve the model's sensitivity and specificity, making it more accurate in predicting adverse drug reactions. Training the adverse drug reaction prediction model on the balanced samples allows the model to learn feature associations on a more balanced data distribution, enhancing its generalization ability and prediction accuracy. This solves the technical problem in existing technologies where uneven distribution of adverse drug reaction samples can lead to performance bias during model training.

[0122] As an optional embodiment, the apparatus further includes: a classification submodule, used to classify the training samples in the training sample cluster after the training sample cluster is obtained, to obtain multiple candidate sample clusters, wherein the training samples are positive samples or negative samples, and the candidate samples in the same candidate sample cluster are all positive samples or all negative samples; and a partitioning submodule, used to partition each candidate sample cluster into subgroups using a spectral clustering algorithm, to obtain multiple candidate subgroups corresponding to each candidate sample cluster.

[0123] As an optional embodiment, the partitioning submodule includes: a matrix construction unit, used to construct the Laplacian matrix of the candidate sample cluster based on the similarity between any two candidate samples in the same candidate sample cluster, wherein the Laplacian matrix is constructed based on the degree matrix and the adjacency matrix, the degree matrix is a diagonal matrix, each diagonal element of the degree matrix is the sum of the similarities between a given candidate sample and the other candidate samples in the candidate sample cluster, and each element in the adjacency matrix represents the similarity between two candidate samples; a matrix decomposition unit, used to perform eigenvalue decomposition on the Laplacian matrix to obtain an ascending sequence of eigenvalues and the corresponding eigenvector matrix, wherein the eigenvalue sequence includes multiple eigenvalues and the eigenvector matrix includes an eigenvector corresponding to each eigenvalue; a matrix selection unit, used to select, from the eigenvector matrix, a feature matrix to be clustered corresponding to a preset number of clusters, wherein the feature matrix to be clustered includes the eigenvectors corresponding to the candidate eigenvalues, the candidate eigenvalues are taken from the eigenvalue sequence in ascending order and are equal in number to the preset number of clusters, and each eigenvector in the feature matrix to be clustered represents the feature projection of the candidate samples on the dimensions of the preset number of clusters; and a clustering unit, used to cluster the multiple candidate samples in the candidate sample cluster according to the feature matrix to be clustered, using the preset number of clusters as the cluster-number parameter, to obtain multiple candidate subgroups.
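
The steps performed by these units can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the Gaussian similarity kernel, the `sigma` parameter, the unnormalized Laplacian L = D - W, and the small k-means routine with farthest-point initialization are all choices made for the example, and every function name is hypothetical.

```python
import numpy as np


def spectral_embedding(X, k, sigma=1.0):
    """Return ascending eigenvalues and the first k eigenvectors of L = D - W."""
    # Adjacency matrix W: pairwise Gaussian similarity (assumed kernel).
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Degree matrix D: diagonal, each entry sums one sample's similarities.
    D = np.diag(W.sum(axis=1))
    L = D - W  # unnormalized Laplacian (assumed form)
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns ascending eigenvalues
    return eigvals, eigvecs[:, :k]


def kmeans(points, k, n_iter=100):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        d = np.min([((points - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(points[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


def spectral_subgroups(X, k, sigma=1.0):
    """Cluster the rows of the k-column eigenvector matrix into k subgroups."""
    _, embedding = spectral_embedding(X, k, sigma)
    return kmeans(embedding, k)


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (10, 2)), rng.normal(5.0, 0.1, (10, 2))])
print(spectral_subgroups(X, k=2))
```

Each row of the selected eigenvector matrix is the low-dimensional representation of one candidate sample, so clustering those rows yields the candidate subgroups.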

[0124] As an optional embodiment, the apparatus further includes: a first determining unit, configured to determine, before the feature matrix to be clustered corresponding to the preset number of clusters is selected from the eigenvector matrix, a computed number of clusters based on the maximum difference between two adjacent eigenvalues in the eigenvalue sequence; a second determining unit, configured to determine the upper bound of an interval based on the larger of the computed number of clusters and the preset number of clusters, and to use the preset minimum number of clusters as the lower bound of the interval, to obtain a candidate cluster-number interval, wherein the candidate cluster-number interval includes multiple candidate numbers of clusters; a pre-clustering unit, configured to select, from the eigenvector matrix, the candidate feature matrix corresponding to each candidate number of clusters, and to perform pre-clustering with that same candidate number of clusters as the cluster-number parameter to obtain multiple clustering results, wherein each clustering result consists of the candidate subgroups obtained for a different candidate number of clusters; and a third determining unit, configured to evaluate the clustering effect corresponding to each candidate number of clusters based on the multiple clustering results, and to determine the candidate number of clusters with the best clustering effect as the preset number of clusters.
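
The cluster-number selection described above can be sketched as follows. This is an illustrative reading only: the text does not name the metric used to evaluate the clustering effect, so the sketch takes a caller-supplied scoring function (`score_fn`, hypothetical) in which a higher score means a better clustering. All function names are assumptions.

```python
def eigengap_cluster_number(eigenvalues):
    """Return k such that the gap between the k-th and (k+1)-th ascending
    eigenvalues is the largest gap between adjacent eigenvalues."""
    if len(eigenvalues) < 2:
        raise ValueError("need at least two eigenvalues")
    gaps = [b - a for a, b in zip(eigenvalues, eigenvalues[1:])]
    return gaps.index(max(gaps)) + 1


def candidate_cluster_numbers(computed_k, preset_k, min_k=2):
    """Interval from the preset minimum up to max(computed_k, preset_k)."""
    upper = max(computed_k, preset_k)
    if min_k > upper:
        raise ValueError("lower bound exceeds upper bound")
    return list(range(min_k, upper + 1))


def best_cluster_number(candidates, score_fn):
    """Pick the candidate whose pre-clustering result scores highest."""
    return max(candidates, key=score_fn)


eigenvalues = [0.0, 0.01, 0.02, 0.90, 1.00]
computed = eigengap_cluster_number(eigenvalues)
candidates = candidate_cluster_numbers(computed, preset_k=4, min_k=2)
# Made-up scores standing in for an evaluation of each pre-clustering result.
scores = {2: 0.4, 3: 0.7, 4: 0.5}
print(computed, candidates, best_cluster_number(candidates, scores.get))
```

In practice `score_fn` would run the pre-clustering for the given candidate number and return a quality measure for the resulting subgroups.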

[0125] As an optional embodiment, the apparatus further includes: a merging submodule, configured to merge the candidate positive sample cluster and the candidate negative sample cluster into a balanced training cluster before training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster, wherein the balanced samples in the balanced training cluster are either candidate positive samples or candidate negative samples; a determining submodule, configured to determine the nearest neighbor sample corresponding to each candidate negative sample based on the sample characteristics of the candidate negative samples; and a removal submodule, configured to remove the candidate negative sample from the candidate negative sample cluster if the nearest neighbor sample corresponding to the candidate negative sample is a candidate positive sample.
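
The removal step described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance over numeric feature tuples and assuming that nearest neighbors are looked up in the merged cluster as it stood before any removal; neither detail is stated in the text. The function name `remove_boundary_negatives` is hypothetical.

```python
import math


def remove_boundary_negatives(positives, negatives):
    """Drop each negative whose nearest neighbor in the merged cluster is positive."""
    # Merge both clusters into one balanced training cluster with labels.
    merged = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    kept = []
    for i_neg, neg in enumerate(negatives):
        self_index = len(positives) + i_neg
        nearest_label, nearest_dist = None, math.inf
        for j, (other, label) in enumerate(merged):
            if j == self_index:
                continue  # a sample is not its own neighbor
            d = math.dist(neg, other)
            if d < nearest_dist:
                nearest_dist, nearest_label = d, label
        if nearest_label != 1:
            kept.append(neg)
    return kept


positives = [(0.0, 0.0), (0.1, 0.0)]
negatives = [(0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(remove_boundary_negatives(positives, negatives))
```

Here the negative at (0.2, 0.0) lies closest to a positive sample and is removed, while the two negatives near (5, 5) are each other's nearest neighbors and are kept.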

[0126] As an optional embodiment, the first sampling module includes: a first selection unit, configured to select any one positive sample as a first sample from among multiple positive samples in each preset sample cluster; a second selection unit, configured to select any one nearest neighbor sample of the first sample as a second sample within the same preset sample cluster, wherein the nearest neighbor sample is a positive sample that is the nearest neighbor of the first sample; a first generation unit, configured to perform linear interpolation operation based on the first sample and the second sample to generate a synthetic sample; and a second generation unit, configured to generate a candidate positive sample cluster based on the positive sample and the synthetic sample.

[0127] As an optional embodiment, the second selection unit includes: a determining subunit, used to determine the distance relationship between each positive sample and the first sample within the same preset sample cluster; a first selection subunit, used to select a preset number of positive samples as nearest neighbor samples in ascending order of distance values to obtain a nearest neighbor cluster; and a second selection subunit, used to select any nearest neighbor sample from the nearest neighbor cluster as the second sample.
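
The oversampling performed by the first sampling module and its second selection unit can be sketched as follows. This is a hedged illustration: Euclidean distance, a uniformly random interpolation coefficient, and the parameters `k`, `n_synthetic`, and `seed` are assumptions, and all function names are hypothetical.

```python
import math
import random


def k_nearest_positives(first_index, positives, k):
    """Return the k positives closest to positives[first_index] (itself excluded)."""
    first = positives[first_index]
    others = [p for j, p in enumerate(positives) if j != first_index]
    others.sort(key=lambda p: math.dist(first, p))  # ascending distance
    return others[:k]


def interpolate(first, second, rng):
    """Linear interpolation: a random point on the segment from first to second."""
    gap = rng.random()  # coefficient in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(first, second))


def oversample_subgroup(positives, n_synthetic, k=3, seed=0):
    """Return the original positives plus n_synthetic interpolated samples."""
    if len(positives) < 2:
        raise ValueError("need at least two positive samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        first_index = rng.randrange(len(positives))  # any positive as first sample
        neighbours = k_nearest_positives(first_index, positives, k)
        second = rng.choice(neighbours)  # any nearest neighbor as second sample
        synthetic.append(interpolate(positives[first_index], second, rng))
    return list(positives) + synthetic


positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
candidates = oversample_subgroup(positives, n_synthetic=6, k=2)
print(len(candidates))
```

Because each synthetic sample lies on a segment between two positives from the same subgroup, it stays inside the region that the subgroup already occupies.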

[0128] Embodiments of the present invention can provide an electronic device, which can be a computer terminal, and the computer terminal can be any one of a group of computer terminal devices. Optionally, in this embodiment, the computer terminal can also be replaced by a mobile terminal or other terminal device.

[0129] Optionally, in this embodiment, the computer terminal may be located in at least one of a plurality of network devices in a computer network.

[0130] In this embodiment, the computer terminal described above can execute the program code for the following steps in the training method of the adverse drug reaction prediction model: obtaining a training sample cluster, wherein the training sample cluster includes: multiple positive samples and multiple negative samples, where positive samples represent sample features that produce adverse drug reactions and negative samples represent sample features that do not produce adverse drug reactions; oversampling multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes: multiple candidate positive samples; undersampling multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes: multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.

[0131] Figure 6 is a structural block diagram of a computer terminal according to an embodiment of the present invention. As shown in Figure 6, the computer terminal 60 may include one or more processors 62 (only one is shown in the figure) and a memory 64.

[0132] The memory can be used to store software programs and modules, such as the program instructions / modules corresponding to the training method and apparatus for the adverse drug reaction prediction model in this embodiment of the invention. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory, thereby realizing the aforementioned training method for the adverse drug reaction prediction model. The memory may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory may further include memory remotely located relative to the processor, and these remote memories can be connected to the terminal 60 via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0133] The processor can access information and applications stored in the memory via a transmission device to perform the following steps: acquiring a training sample cluster, wherein the training sample cluster includes multiple positive samples and multiple negative samples, where positive samples represent sample characteristics that produce adverse drug reactions and negative samples represent sample characteristics that do not produce adverse drug reactions; oversampling multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes multiple candidate positive samples; undersampling multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset threshold; and training an adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.

[0134] Optionally, the processor may also execute program code for the following steps: classifying the training samples in the training sample cluster to obtain multiple candidate sample clusters, wherein the training samples are positive or negative samples, and the candidate samples in the same candidate sample cluster are all positive samples or all negative samples; and using a spectral clustering algorithm to divide each candidate sample cluster into subgroups to obtain multiple candidate subgroups corresponding to each candidate sample cluster.

[0135] Optionally, the processor may also execute program code for the following steps: constructing the Laplacian matrix of the candidate sample cluster based on the similarity between any two candidate samples in the same candidate sample cluster, wherein the Laplacian matrix is constructed based on the degree matrix and the adjacency matrix, the degree matrix is a diagonal matrix, each diagonal element of the degree matrix is the sum of the similarities between a given candidate sample and the other candidate samples in the candidate sample cluster, and each element in the adjacency matrix represents the similarity between two candidate samples; performing eigenvalue decomposition on the Laplacian matrix to obtain an ascending sequence of eigenvalues and the corresponding eigenvector matrix, wherein the eigenvalue sequence includes multiple eigenvalues and the eigenvector matrix includes an eigenvector corresponding to each eigenvalue; selecting, from the eigenvector matrix, a feature matrix to be clustered corresponding to a preset number of clusters, wherein the feature matrix to be clustered includes the eigenvectors corresponding to the candidate eigenvalues, the candidate eigenvalues are taken from the eigenvalue sequence in ascending order and are equal in number to the preset number of clusters, and each eigenvector in the feature matrix to be clustered represents the feature projection of the candidate samples on the dimensions of the preset number of clusters; and clustering the multiple candidate samples in the candidate sample cluster according to the feature matrix to be clustered, using the preset number of clusters as the cluster-number parameter, to obtain multiple candidate subgroups.

[0136] Optionally, the processor may also execute program code for the following steps: determining a computed number of clusters based on the maximum difference between two adjacent eigenvalues in the eigenvalue sequence; determining the upper bound of an interval based on the larger of the computed number of clusters and the preset number of clusters, and using the preset minimum number of clusters as the lower bound of the interval, to obtain a candidate cluster-number interval, wherein the candidate cluster-number interval includes multiple candidate numbers of clusters; selecting, from the eigenvector matrix, the candidate feature matrix corresponding to each candidate number of clusters, and performing pre-clustering with that same candidate number of clusters as the cluster-number parameter to obtain multiple clustering results, wherein each clustering result consists of the candidate subgroups obtained for a different candidate number of clusters; and evaluating the clustering effect corresponding to each candidate number of clusters based on the multiple clustering results, and determining the candidate number of clusters with the best clustering effect as the preset number of clusters.

[0137] Optionally, the processor may also execute program code for the following steps: merging the candidate positive sample cluster and the candidate negative sample cluster into a balanced training cluster, wherein the balanced samples in the balanced training cluster are either candidate positive samples or candidate negative samples; determining the nearest neighbor sample corresponding to each candidate negative sample based on the sample characteristics of the candidate negative samples; and removing the candidate negative sample from the candidate negative sample cluster if the nearest neighbor sample corresponding to the candidate negative sample is a candidate positive sample.

[0138] Optionally, the processor may also execute program code for the following steps: selecting any one positive sample as the first sample from among multiple positive samples in each preset sample cluster; selecting any one of the nearest neighbor samples of the first sample as the second sample within the same preset sample cluster, wherein the nearest neighbor sample is a positive sample that is the nearest neighbor of the first sample; performing linear interpolation based on the first sample and the second sample to generate a synthetic sample; and generating a candidate positive sample cluster based on the positive sample and the synthetic sample.

[0139] Optionally, the processor may also execute program code for the following steps: within the same preset sample cluster, determining the distance relationship between each positive sample and the first sample; selecting a preset number of positive samples as nearest neighbor samples in ascending order of distance values to obtain the nearest neighbor cluster; and selecting any nearest neighbor sample from the nearest neighbor cluster as the second sample.

[0140] Those skilled in the art will understand that the structure shown in Figure 6 is for illustrative purposes only. The computer terminal can also be a smartphone (such as an Android phone or an iOS phone), a tablet computer, a mobile internet device (MID), a PAD, or another terminal device. Figure 6 does not limit the structure of the aforementioned electronic device. For example, the computer terminal 60 may include more or fewer components than shown in Figure 6 (such as network interfaces and display devices), or have a configuration different from that shown in Figure 6.

[0141] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a computer program instructing the hardware related to the terminal device. The computer program can be stored in a non-volatile medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), magnetic disk or optical disk, etc.

[0142] Embodiments of the present invention also provide a non-volatile storage medium. Optionally, in this embodiment, the aforementioned non-volatile storage medium can be used to store the program code executed by the training method of the adverse drug reaction prediction model provided in the above embodiments.

[0143] Optionally, in this embodiment, the non-volatile storage medium may be located in any computer terminal in a group of computer terminals in a computer network, or in any mobile terminal in a group of mobile terminals.

[0144] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: obtaining a training sample cluster, wherein the training sample cluster includes: multiple positive samples and multiple negative samples, the positive samples representing sample features that produce adverse drug reactions, and the negative samples representing sample features that do not produce adverse drug reactions; oversampling the multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes: multiple candidate positive samples; undersampling the multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes: multiple candidate negative samples, the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and training an adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.

[0145] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: classifying the training samples in the training sample cluster to obtain multiple candidate sample clusters, wherein the training samples are positive samples or negative samples, and the candidate samples in the same candidate sample cluster are all positive samples or all negative samples; and using a spectral clustering algorithm to divide each candidate sample cluster into subgroups to obtain multiple candidate subgroups corresponding to each candidate sample cluster.

[0146] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: constructing a Laplacian matrix of the candidate sample cluster based on the similarity between any two candidate samples in the same candidate sample cluster, wherein the Laplacian matrix is constructed based on a degree matrix and an adjacency matrix, the degree matrix is a diagonal matrix, each diagonal element of the degree matrix is the sum of the similarities between a given candidate sample and the other candidate samples in the candidate sample cluster, and each element in the adjacency matrix represents the similarity between two candidate samples; performing eigenvalue decomposition on the Laplacian matrix to obtain an ascending sequence of eigenvalues and the corresponding eigenvector matrix, wherein the eigenvalue sequence includes multiple eigenvalues and the eigenvector matrix includes an eigenvector corresponding to each eigenvalue; selecting, from the eigenvector matrix, a feature matrix to be clustered corresponding to a preset number of clusters, wherein the feature matrix to be clustered includes the eigenvectors corresponding to the candidate eigenvalues, the candidate eigenvalues are taken from the eigenvalue sequence in ascending order and are equal in number to the preset number of clusters, and each eigenvector in the feature matrix to be clustered represents the feature projection of the candidate samples on the dimensions of the preset number of clusters; and clustering the multiple candidate samples in the candidate sample cluster according to the feature matrix to be clustered, using the preset number of clusters as the cluster-number parameter, to obtain multiple candidate subgroups.

[0147] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: determining a computed number of clusters based on the maximum difference between two adjacent eigenvalues in the eigenvalue sequence; determining the upper bound of an interval based on the larger of the computed number of clusters and the preset number of clusters, and using the preset minimum number of clusters as the lower bound of the interval, to obtain a candidate cluster-number interval, wherein the candidate cluster-number interval includes multiple candidate numbers of clusters; selecting, from the eigenvector matrix, the candidate feature matrix corresponding to each candidate number of clusters, and performing pre-clustering with that same candidate number of clusters as the cluster-number parameter to obtain multiple clustering results, wherein each clustering result consists of the candidate subgroups obtained for a different candidate number of clusters; and evaluating the clustering effect corresponding to each candidate number of clusters based on the multiple clustering results, and determining the candidate number of clusters with the best clustering effect as the preset number of clusters.

[0148] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: merging the candidate positive sample cluster and the candidate negative sample cluster into a balanced training cluster, wherein the balanced samples in the balanced training cluster are either candidate positive samples or candidate negative samples; determining the nearest neighbor sample corresponding to each candidate negative sample based on the sample characteristics of the candidate negative samples; and removing the candidate negative sample from the candidate negative sample cluster if the nearest neighbor sample corresponding to the candidate negative sample is a candidate positive sample.

[0149] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: selecting any one positive sample as a first sample from among multiple positive samples in each preset sample cluster; selecting any one of the nearest neighbor samples of the first sample as a second sample within the same preset sample cluster, wherein the nearest neighbor sample is a positive sample that is the nearest neighbor of the first sample; performing linear interpolation based on the first sample and the second sample to generate a synthetic sample; and generating a candidate positive sample cluster based on the positive sample and the synthetic sample.

[0150] Optionally, in this embodiment, the non-volatile storage medium is configured to store program code for performing the following steps: within the same preset sample cluster, determining the distance relationship between each positive sample and the first sample; selecting a preset number of positive samples as nearest neighbor samples in ascending order of distance values to obtain the nearest neighbor cluster; and selecting any nearest neighbor sample from the nearest neighbor cluster as the second sample.

[0151] Embodiments of the present invention also provide a computer program product, including a computer program. Optionally, in this embodiment, when the computer program is executed by a processor, it implements the steps of the training method for the adverse drug reaction prediction model provided in the above embodiments.

[0152] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0153] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0154] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.

[0155] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0156] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0157] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a non-volatile storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a non-volatile storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned non-volatile storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0158] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A method of training a prediction model of adverse drug reactions, characterized in that the method includes: obtaining a training sample cluster, wherein the training sample cluster includes multiple positive samples and multiple negative samples, the positive samples represent sample features that produce adverse drug reactions, and the negative samples represent sample features that do not produce adverse drug reactions; oversampling the multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes multiple candidate positive samples; undersampling the multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and training an adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.

2. The method of claim 1, wherein, after obtaining the training sample cluster, the method further includes: classifying the training samples in the training sample cluster to obtain multiple candidate sample clusters, wherein the training samples are either the positive samples or the negative samples, and the candidate samples in the same candidate sample cluster are either all positive samples or all negative samples; and using a spectral clustering algorithm to divide each candidate sample cluster into subgroups, to obtain multiple candidate subgroups corresponding to each candidate sample cluster.

3. The method of claim 2, wherein using the spectral clustering algorithm to divide each candidate sample cluster into subgroups, to obtain multiple candidate subgroups corresponding to each candidate sample cluster, includes: constructing a Laplacian matrix of the candidate sample cluster based on the similarity between any two candidate samples in the same candidate sample cluster, wherein the Laplacian matrix is constructed based on a degree matrix and an adjacency matrix, the degree matrix is a diagonal matrix, each diagonal element of the degree matrix is the sum of the similarities between a given candidate sample and the other candidate samples in the candidate sample cluster, and each element in the adjacency matrix represents the similarity between two candidate samples; performing eigenvalue decomposition on the Laplacian matrix to obtain an ascending sequence of eigenvalues and an eigenvector matrix corresponding to the eigenvalue sequence, wherein the eigenvalue sequence includes multiple eigenvalues, and the eigenvector matrix includes an eigenvector corresponding to each eigenvalue; selecting, from the eigenvector matrix, a feature matrix to be clustered corresponding to a preset number of clusters, wherein the feature matrix to be clustered includes the eigenvectors corresponding to the candidate eigenvalues, the candidate eigenvalues are taken from the eigenvalue sequence in ascending order and are equal in number to the preset number of clusters, and each eigenvector in the feature matrix to be clustered represents the feature projection of the candidate samples on the dimensions of the preset number of clusters; and clustering the multiple candidate samples in the candidate sample cluster according to the feature matrix to be clustered, using the preset number of clusters as the cluster-number parameter, to obtain multiple candidate subgroups.

4. The method of claim 3, wherein, before selecting the feature matrix to be clustered corresponding to the preset number of clusters from the eigenvector matrix, the method further includes: determining a computed number of clusters based on the maximum difference between two adjacent eigenvalues in the eigenvalue sequence; determining the upper bound of an interval based on the larger of the computed number of clusters and the preset number of clusters, and using the preset minimum number of clusters as the lower bound of the interval, to obtain a candidate cluster-number interval, wherein the candidate cluster-number interval includes multiple candidate numbers of clusters; selecting, from the eigenvector matrix, a candidate feature matrix corresponding to each candidate number of clusters, and performing pre-clustering with that same candidate number of clusters as the cluster-number parameter to obtain multiple clustering results, wherein each clustering result consists of the candidate subgroups obtained for a different candidate number of clusters; and evaluating the clustering effect corresponding to each candidate number of clusters based on the multiple clustering results, and determining the candidate number of clusters with the best clustering effect as the preset number of clusters.

5. The method of claim 1, wherein, before training the adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster, the method further comprises: merging the candidate positive sample cluster and the candidate negative sample cluster into a balanced training cluster, wherein each balanced sample in the balanced training cluster is either a candidate positive sample or a candidate negative sample; determining, based on the sample features of the candidate negative samples, the nearest neighbor sample corresponding to each candidate negative sample; and, if the nearest neighbor sample of a candidate negative sample is a candidate positive sample, removing that candidate negative sample from the candidate negative sample cluster.
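The cleaning step of claim 5 can be sketched in a few lines of Python. This is an illustration under assumptions, not the claimed implementation: Euclidean distance is assumed as the measure of closeness, samples are assumed to be numeric feature tuples, and the function name is hypothetical.

```python
import math

def remove_borderline_negatives(positives, negatives):
    """Drop each negative whose nearest neighbor in the merged set is a positive."""
    # Merge into one balanced training cluster, remembering each sample's label.
    merged = [(p, 1) for p in positives] + [(n, 0) for n in negatives]
    if len(merged) < 2:
        return list(negatives)          # no neighbor exists to compare against
    kept = []
    for i, (x, label) in enumerate(merged):
        if label == 1:
            continue                    # only negatives are candidates for removal
        # Nearest neighbor among all other balanced samples (Euclidean: assumption).
        j = min(
            (j for j in range(len(merged)) if j != i),
            key=lambda j: math.dist(x, merged[j][0]),
        )
        if merged[j][1] == 0:           # neighbor is negative, so keep this sample
            kept.append(x)
    return kept
```

For example, a negative at `(0.1, 0)` lying next to a positive at `(0, 0)` is removed, while negatives whose closest sample is another negative are kept.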

6. The method of claim 1, wherein oversampling the multiple positive samples to obtain the candidate positive sample cluster comprises: selecting, in each preset sample cluster, any one of the positive samples as a first sample; selecting, within the same preset sample cluster, any one of the nearest neighbor samples of the first sample as a second sample, wherein a nearest neighbor sample is a positive sample that is a nearest neighbor of the first sample; generating a synthetic sample by linear interpolation between the first sample and the second sample; and generating the candidate positive sample cluster based on the positive samples and the synthetic samples.
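The interpolation-based oversampling of claim 6 resembles the well-known SMOTE technique and can be sketched as below. The sketch handles the positive samples of a single preset sample cluster. Euclidean distance, the uniformly random interpolation factor, the `k` and `seed` parameters, and all function names are assumptions for illustration.

```python
import math
import random

def nearest_neighbor_cluster(first_index, positives, k):
    """The k positives closest to the first sample, excluding the sample itself."""
    first = positives[first_index]
    others = [i for i in range(len(positives)) if i != first_index]
    others.sort(key=lambda i: math.dist(first, positives[i]))   # ascending distance
    return [positives[i] for i in others[:k]]

def interpolate(first, second, t):
    """Point on the segment from first to second at fraction t in [0, 1]."""
    return tuple(a + t * (b - a) for a, b in zip(first, second))

def oversample_positives(positives, n_synthetic, k=3, seed=0):
    """Return the original positives plus n_synthetic interpolated samples."""
    if len(positives) < 2:
        raise ValueError("at least two positive samples are needed to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        first_index = rng.randrange(len(positives))        # any positive as first sample
        neighbors = nearest_neighbor_cluster(first_index, positives, k)
        second = rng.choice(neighbors)                     # any nearest neighbor as second
        synthetic.append(interpolate(positives[first_index], second, rng.random()))
    return list(positives) + synthetic
```

Each synthetic sample lies on the line segment between a positive sample and one of its nearest positive neighbors, so it stays inside the region already occupied by positives.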

7. The method of claim 6, wherein selecting, within the same preset sample cluster, any one of the nearest neighbor samples of the first sample as the second sample comprises: determining, within the same preset sample cluster, the distance between each positive sample and the first sample; selecting, in ascending order of distance, a preset number of positive samples as the nearest neighbor samples to obtain a nearest neighbor cluster; and selecting any one of the nearest neighbor samples from the nearest neighbor cluster as the second sample.

8. A device for training a prediction model of adverse drug reactions, characterized by comprising: an acquisition module, configured to acquire a training sample cluster, wherein the training sample cluster includes multiple positive samples and multiple negative samples, the positive samples represent sample features that produce adverse drug reactions, and the negative samples represent sample features that do not produce adverse drug reactions; a first sampling module, configured to oversample the multiple positive samples to obtain a candidate positive sample cluster, wherein the candidate positive sample cluster includes multiple candidate positive samples; a second sampling module, configured to undersample the multiple negative samples to obtain a candidate negative sample cluster, wherein the candidate negative sample cluster includes multiple candidate negative samples, and the difference between the number of candidate negative samples and the number of candidate positive samples does not exceed a preset number threshold; and a training module, configured to train an adverse drug reaction prediction model based on the candidate positive sample cluster and the candidate negative sample cluster.
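The balance constraint placed on the second sampling module can be illustrated with a short sketch. Random undersampling without replacement is an assumption here, since the claim does not state how negatives are chosen. The function names and the `seed` parameter are hypothetical.

```python
import random

def undersample_negatives(negatives, n_positive, seed=0):
    """Randomly keep at most n_positive negatives (random choice: an assumption)."""
    rng = random.Random(seed)
    target = min(len(negatives), n_positive)
    return rng.sample(list(negatives), target)     # sampling without replacement

def is_balanced(n_negative, n_positive, threshold):
    """True if the two counts differ by no more than the preset number threshold."""
    return abs(n_negative - n_positive) <= threshold
```

If fewer negatives than positives are available, `undersample_negatives` returns all of them, and `is_balanced` reports whether the result still satisfies the threshold.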

9. An electronic device comprising a memory and a processor, characterized in that the memory stores a computer program, and the processor is configured to execute, through the computer program, the training method for the prediction model of adverse drug reactions according to any one of claims 1 to 7.

10. A computer program product comprising computer instructions, characterized in that, when the computer instructions are executed by a processor, they implement the steps of the training method for the prediction model of adverse drug reactions according to any one of claims 1 to 7.