Time series data generation method based on clinical knowledge and differential diffusion model
A method based on clinical knowledge and a differential diffusion model combines forward and backward diffusion to generate time-series data, addressing the problem of data scarcity in rare diseases. The method improves the reliability and plausibility of the generated time-series data, making the data suitable for actual diagnosis and treatment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- XIANGYA HOSPITAL CENT SOUTH UNIV
- Filing Date
- 2026-05-20
- Publication Date
- 2026-06-19
Figure CN122241011A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data synthesis technology, and in particular to a method for generating time-series data based on clinical knowledge and a differential diffusion model.
Background Technology
[0002] In some rare diseases or in intensive care units, the number of patients is usually small. For example, for patients with severe infections or sepsis, the disease progression often exhibits high heterogeneity, rapid physiological deterioration, and complex inter-organ interactions. However, high-quality time-series data are extremely scarce in this clinical setting, and the evolution of many critical diseases can only be observed a few times. In addition, due to the need for early intervention, ethical constraints, and incomplete monitoring methods, small sample time-series data of critically ill patients with similar symptoms are often difficult to record completely.
[0003] Therefore, using machine learning techniques for time-series data synthesis has become a natural strategy. This technology aims to automatically synthesize or predict continuous state evolution sequences that conform to medical principles, even with extremely limited training data, so as to reconstruct the patient's disease progression. However, when the original data is extremely limited, existing generation methods often fail due to training instability or mode collapse. In particular, when time-series data is synthesized, abrupt, clinically illogical stage transitions can occur. Although these purely data-driven generators may appear smooth in statistical metrics, they often produce unconvincing small-sample time-series data, which directly limits their application in clinical practice.
Summary of the Invention
[0004] Therefore, it is necessary to address the aforementioned technical problems by providing a time-series data generation method based on clinical knowledge and a differential diffusion model. This method can improve the reliability of the generated time-series data.
[0005] A method for generating time-series data based on clinical knowledge and a differential diffusion model, the method comprising:
[0006] S1. Acquire multiple medical data points and extract multiple medical constraints from a multi-source clinical knowledge base; the medical data includes data from at least two time steps.
[0007] S2. Determine the difference between the data at the two time steps of each medical data;
[0008] S3. Based on the preset noise sequence and the difference result, the noise addition result for multiple time steps is determined by forward diffusion; the noise sequence includes the added noise for each time step.
[0009] S4. The noise addition results at each time step are denoised by backward diffusion to obtain the denoised results at each time step, and a penalty value is obtained based on the denoised results and the medical constraints.
[0010] S5. When the penalty value is greater than the penalty threshold, update the noise sequence in S3, and repeat S3 and S4 based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold, and output the target difference result at each time step.
[0011] S6. Based on the data of the first time step in each of the medical data and the corresponding target difference results, obtain the time series data corresponding to each of the medical data; the time series data includes synthetic medical data of multiple time steps.
[0012] In this application, when generating time series data, the difference results of real medical data can be combined with forward diffusion, backward diffusion, and medical constraints, so that the generated time series data not only takes into account the real situation, but also meets the various constraints in the medical constraints, thereby improving the reliability of the time series data.
[0013] In one embodiment, the extraction process of the medical constraints in step S1 includes:
[0014] Text content related to the target disease is extracted from a multi-source clinical knowledge base, and the text content is preprocessed to obtain multiple structured text blocks;
[0015] Based on a preset indicator range extraction template, first information including the indicator range is extracted from each of the text blocks. Based on a preset trend direction extraction template, second information representing the indicator change trend of the target disease at each clinical stage is extracted from each of the text blocks. Based on a preset covariation pattern extraction template, related indicator pairs and the covariation direction of the indicator pairs are extracted from each of the text blocks. The related indicator pairs and the covariation direction of the indicator pairs constitute third information.
[0016] The first information, the second information, and the third information are deduplicated and conflict-resolved to obtain the processed first information, the second information, and the third information;
[0017] Based on the processed first information, second information, and third information, medical constraints are determined.
[0018] In this application, by deduplicating and resolving conflicts in the first, second, and third information, the processed first, second, and third information are obtained. This can eliminate redundant information and obtain first, second, and third information with high recall.
[0019] In one embodiment, determining the medical constraints based on the processed first information, second information, and third information includes:
[0020] In response to the annotation operations on each of the first information, the second information, and the third information, at least one first annotation result for each of the first information, at least one second annotation result for each of the second information, and at least one third annotation result for each of the third information are obtained; the first annotation result is used to characterize whether the first information is valid, the second annotation result is used to characterize whether the second information is valid, and the third annotation result is used to characterize whether the third information is valid.
[0021] Based on each of the first annotation results, the second annotation results, and the third annotation results, invalid first target information is determined from the first information, invalid second target information is determined from the second information, and invalid third target information is determined from the third information;
[0022] In response to the correction operation on the first target information, the second target information, and the third target information, correction information is obtained;
[0023] Based on the corrected information, the valid first information, the valid second information, and the valid third information, medical constraints are determined.
[0024] In this application, corrected information is obtained by responding to correction operations on the first target information, the second target information, and the third target information. Based on the corrected information, the valid first information, the valid second information, and the valid third information, medical constraints are determined. This can correct information in the multi-source clinical knowledge base that deviates from actual clinical observation, thereby making the medical constraints more consistent with the actual situation.
[0025] In one embodiment, determining the medical constraints based on the corrected information and the valid first information, the valid second information, and the valid third information includes:
[0026] In response to the information supplementation operation for the evolution pattern of the target disease, supplementary information characterizing the evolution pattern of the target disease is obtained; the supplementary information includes baseline drift information characterizing the baseline shift of physiological indicators caused by immune aging and underlying diseases, delayed response information characterizing the delayed or weakened response of treatment interventions, and organ coupling information characterizing cross-organ effect constraints.
[0027] Natural language parsing is performed on the supplementary information, the corrected information, and the valid first information, the valid second information, and the valid third information to obtain multiple parsed texts;
[0028] Entities are extracted from each of the parsed texts, and triples corresponding to each parsed text are constructed based on the entities extracted from each of the parsed texts.
[0029] Each of the triples is subjected to conflict resolution processing, and the triples obtained after conflict resolution are used as medical constraints.
[0030] In this application, supplementary information characterizing the evolution of the target disease is obtained through information supplementation operations in response to the evolution of the target disease. The supplementary information includes baseline drift information characterizing the baseline shift of physiological indicators caused by immune aging and underlying diseases, delayed response information characterizing the delayed or weakened response of treatment interventions, and organ coupling information characterizing cross-organ constraints. This can supplement the information on disease evolution patterns that are missing in the multi-source clinical knowledge base but are specific to the target disease patient, thereby making the final determined medical constraints more comprehensive.
[0031] In one embodiment, step S3 includes:
[0032] For each piece of medical data, based on a preset noise sequence and the corresponding difference result, the noise addition results at multiple time steps are determined through $x_k = \sqrt{1-\beta_k}\,x_{k-1} + \sqrt{\beta_k}\,\epsilon$ or $x_k = \sqrt{\bar{\alpha}_k}\,x_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon$;
[0033] where $x_0$ is the difference result, $x_k$ is the noise addition result at time step k, $\beta_k$ is the added noise at time step k, $x_{k-1}$ is the noise addition result at time step k-1, $\bar{\alpha}_k$ is the cumulative coefficient at time step k, $\bar{\alpha}_k = \prod_{i=1}^{k}(1-\beta_i)$, $\epsilon$ is standard Gaussian noise, and k is a non-negative integer.
[0034] In this application, for each piece of medical data, the noise addition results at multiple time steps are determined from the preset noise sequence and the corresponding difference result through either of the two formulas above. This can effectively smooth the complex distribution of high-dimensional clinical data and reduce the difficulty of generating the noise addition results.
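As an illustration of the forward diffusion in step S3, the following Python sketch applies noise to a difference result both step by step and in closed form. It assumes a standard DDPM-style parameterization; the linear schedule, its endpoints, and the function names `make_schedule`, `forward_step`, and `forward_closed_form` are hypothetical choices made for this example and are not taken from the application.

```python
import numpy as np

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule: per-step noise and cumulative coefficients."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bars = np.cumprod(1.0 - betas)
    return betas, alpha_bars

def forward_step(x_prev, beta_k, eps):
    """Step-by-step form: noise addition result at step k from the result at step k-1."""
    return np.sqrt(1.0 - beta_k) * x_prev + np.sqrt(beta_k) * eps

def forward_closed_form(x0, alpha_bar_k, eps):
    """Closed form: noise addition result at step k directly from the difference result."""
    return np.sqrt(alpha_bar_k) * x0 + np.sqrt(1.0 - alpha_bar_k) * eps

rng = np.random.default_rng(0)
betas, alpha_bars = make_schedule(100)

# Hypothetical difference result for two indicators (illustrative values only).
delta0 = np.array([8.0, 0.6])

# Iterative noising over all steps.
x = delta0.copy()
for beta in betas:
    x = forward_step(x, beta, rng.standard_normal(delta0.shape))

# Direct noising to the last step.
x_direct = forward_closed_form(delta0, alpha_bars[-1], rng.standard_normal(delta0.shape))
```

The two forms draw different random noise here, so `x` and `x_direct` differ in value; they follow the same distribution, and they coincide exactly when the noise is set to zero.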
[0035] In one embodiment, the medical constraints include a first constraint corresponding to the range of the indicator, a second constraint corresponding to the trend of the indicator's change, and a third constraint corresponding to the jump threshold. Step S4 includes:
[0036] For each piece of medical data, the noise addition result of each time step corresponding to the medical data, the clinical stage of each time step, and the medical constraints are input into a pre-trained denoising network to obtain the noise to be removed from each piece of noise addition result.
[0037] Based on the noise to be removed from each noise addition result, the noise addition result at each time step is denoised through $x_{k-1} = \frac{1}{\sqrt{1-\beta_k}}\left(x_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta\right) + \sigma_k z$ to obtain the denoised result at each time step.
[0038] The denoised results and the medical constraints are input into a pre-trained language model to obtain the indicator range penalty value corresponding to the first constraint, the change penalty value corresponding to the second constraint, and the jump penalty value corresponding to the third constraint.
[0039] The penalty value is obtained as the weighted sum of the indicator range penalty value, the change penalty value, and the jump penalty value.
[0040] where $x_{k-1}$ is the denoised result at time step k-1, $\beta_k$ is the added noise at time step k, $x_k$ is the noise addition result at time step k, $\bar{\alpha}_k$ is the cumulative coefficient at time step k, $\bar{\alpha}_k = \prod_{i=1}^{k}(1-\beta_i)$, $\epsilon_\theta$ is the noise to be removed from the noise addition result at time step k, $\sigma_k^2$ is the variance at time step k, and z is random noise.
[0041] In this application, for each piece of medical data, the noise addition result at each time step, the clinical stage of each time step, and the medical constraints are input into a pre-trained denoising network to obtain the noise to be removed from each noise addition result. Based on the noise to be removed, the noise addition result at each time step is denoised through the denoising formula above to obtain the denoised result at each time step. The denoised results and the medical constraints are input into the pre-trained language model to obtain the indicator range penalty value corresponding to the first constraint, the change penalty value corresponding to the second constraint, and the jump penalty value corresponding to the third constraint, and these three values are weighted and summed to obtain the penalty value. In this way, the medical constraints guide the update of the noise sequence, so that the final denoised result meets the biological constraints while maintaining sample diversity.
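The denoising step and the weighted penalty of step S4 can be sketched as follows. This is a minimal illustration under stated assumptions: the reverse step uses a standard DDPM-style update, and the three penalty functions are simple numeric stand-ins for the values that the application obtains from a pre-trained language model. All function names, weights, and sample values are hypothetical.

```python
import numpy as np

def reverse_step(x_k, eps_pred, beta_k, alpha_bar_k, sigma_k, z):
    """One DDPM-style reverse step: subtract the predicted noise, then add sigma_k * z."""
    mean = (x_k - beta_k / np.sqrt(1.0 - alpha_bar_k) * eps_pred) / np.sqrt(1.0 - beta_k)
    return mean + sigma_k * z

def range_penalty(values, lower, upper):
    """Stand-in for the first constraint: total distance outside [lower, upper]."""
    return float(np.sum(np.maximum(lower - values, 0.0) + np.maximum(values - upper, 0.0)))

def trend_penalty(deltas, direction):
    """Stand-in for the second constraint: changes opposing the expected direction (+1 or -1)."""
    return float(np.sum(np.maximum(-direction * deltas, 0.0)))

def jump_penalty(deltas, jump_threshold):
    """Stand-in for the third constraint: magnitude of changes beyond the jump threshold."""
    return float(np.sum(np.maximum(np.abs(deltas) - jump_threshold, 0.0)))

def total_penalty(p_range, p_trend, p_jump, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three penalty values."""
    w_range, w_trend, w_jump = weights
    return w_range * p_range + w_trend * p_trend + w_jump * p_jump

# Hypothetical values for one indicator.
values = np.array([1.0, 5.5])
deltas = np.array([0.5, -0.2, 3.0])
penalty = total_penalty(
    range_penalty(values, lower=0.0, upper=4.0),
    trend_penalty(deltas, direction=+1),
    jump_penalty(deltas, jump_threshold=2.0),
    weights=(1.0, 0.5, 2.0),
)
```

Hinge-style penalties are used here because they are zero whenever a constraint is satisfied, which makes the comparison against a penalty threshold easy to interpret; the application itself does not prescribe this form.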
[0042] In one embodiment, step S6 includes:
[0043] For each piece of medical data, based on the data of the first time step in the medical data and the target difference results, the time-series data corresponding to the medical data is obtained through $\hat{x}_{k+1} = \hat{x}_k + \Delta_k$; the time-series data includes synthetic medical data of multiple time steps.
[0044] where $\hat{x}_{k+1}$ is the synthetic medical data at time step k+1 and $\Delta_k$ is the target difference result at time step k; when k is 0, $\hat{x}_k$ is the data of the first time step in the medical data, and when k is greater than 0, $\hat{x}_k$ is the synthetic medical data at time step k.
[0045] In this application, for each piece of medical data, the time-series data corresponding to the medical data is obtained by cumulatively adding the target difference results to the data of the first time step as described above, which simplifies the process of generating time-series data.
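Step S6 amounts to a cumulative sum of the target difference results starting from the first recorded time step. The Python sketch below shows this under the assumption that the data is stored as a NumPy array with one row per time step; the function name `reconstruct_series` and the sample values are hypothetical.

```python
import numpy as np

def reconstruct_series(first_step, target_deltas):
    """Rebuild a series from its first time step and the per-step target differences.

    first_step: shape (D,), the real data of the first time step.
    target_deltas: shape (K, D), the target difference result at each step.
    Returns shape (K + 1, D): the first step followed by K synthetic steps.
    """
    first_step = np.asarray(first_step, dtype=float)
    target_deltas = np.asarray(target_deltas, dtype=float)
    synthetic = first_step + np.cumsum(target_deltas, axis=0)
    return np.vstack([first_step, synthetic])

# Hypothetical example with two indicators and two target differences.
series = reconstruct_series([88.0, 1.8], [[8.0, 0.6], [8.0, 0.7]])
```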
[0046] In one embodiment, the time-series data includes time-series data of the target disease at at least one clinical stage, and the method further includes:
[0047] Based on the time-series data of each clinical stage, the reliability value of the time-series data is calculated using a formula [formula not reproduced]. In the formula, S_n denotes clinical stage n; the other terms are the weight of clinical stage n, the conditional probability distribution of the real data of the target disease at clinical stage n, the conditional probability distribution of the time-series data of the target disease at clinical stage n, and the target difference result; KL(·||·) denotes the KL divergence;
[0048] When the reliability value of the time-series data of a target stage is greater than or equal to the reliability threshold, the time-series data of the target stage is determined to be reliable; the target stage is one of the clinical stages.
[0049] In this application, the reliability value of the time-series data is calculated by this formula based on the time-series data of each clinical stage. This value can be used to determine whether the generated time-series data conforms to the clinical characteristics of the corresponding clinical stage, and therefore whether the time-series data should be retained or applied to downstream tasks.
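To illustrate how a stage-weighted reliability value based on KL divergence might be computed, the following Python sketch compares histograms of real and synthetic data for each clinical stage. The aggregation used here, a weighted sum of exp(-KL), is an assumption chosen so that a higher value means closer agreement; it is not the formula of the application. The histogram representation, the smoothing constant, and the function names are likewise hypothetical.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as non-negative histograms."""
    p = np.asarray(p, dtype=float) + eps  # smoothing avoids log(0) and division by zero
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def reliability(real_hists, synth_hists, weights):
    """ASSUMED aggregation: sum over stages of weight * exp(-KL(real || synthetic))."""
    scores = [
        w * np.exp(-kl_divergence(p, q))
        for p, q, w in zip(real_hists, synth_hists, weights)
    ]
    return float(np.sum(scores))

# Hypothetical histograms for two clinical stages, with stage weights summing to 1.
real = [[10, 30, 60], [50, 30, 20]]
synthetic = [[12, 28, 60], [20, 30, 50]]
value = reliability(real, synthetic, weights=[0.5, 0.5])
is_reliable = value >= 0.8  # hypothetical reliability threshold
```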
[0050] In one embodiment, the time-series data includes time-series data of the target disease at at least one clinical stage, and the method further includes:
[0051] For each piece of medical data, each time step in each clinical stage is taken as a node, and the synthetic medical data at each time step is taken as the node feature. The nodes are connected to obtain a time series diagram of each piece of medical data. The edge weight between two connected nodes in the time series diagram is determined based on the time difference value, which is the time difference between the two connected nodes.
[0052] A time-weighted graph neural network is trained based on each of the time-series graphs to obtain the trained time-weighted graph neural network;
[0053] The preset multiple sets of verification medical data are input into the trained time-series weighted graph neural network to obtain the prediction result for each set of verification medical data.
[0054] Based on the prediction results and the actual results of the verification medical data, the evaluation index of the trained time-weighted graph neural network is determined.
[0055] The validity of the time series data is determined based on the evaluation metrics.
[0056] In this application, for each piece of medical data, each time step in each clinical stage is taken as a node, and the synthetic medical data of each time step is taken as the node feature. By connecting each node, a time series graph of each piece of medical data is obtained. The edge weight between two connected nodes in the time series graph is determined based on the time difference, which is the time difference between the two connected nodes. A time series weighted graph neural network is trained based on each time series graph to obtain a trained time series weighted graph neural network. Multiple preset validation medical data are input into the trained time series weighted graph neural network to obtain the prediction result of each validation medical data. Based on each prediction result and the actual result of each validation medical data, an evaluation index of the trained time series weighted graph neural network is determined. In this way, the validity of the time series data can be determined based on the evaluation index.
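The construction of the time-series graph can be sketched in Python as follows. The sketch assumes that consecutive time steps are connected and that the edge weight decays exponentially with the time difference; the application only states that the weight is determined from the time difference, so the decay rule, the parameter `tau`, and the function name are hypothetical. Training the time-weighted graph neural network itself is outside the scope of this sketch.

```python
import numpy as np

def build_time_series_graph(timestamps, features, tau=1.0):
    """Build a chain graph: one node per time step, edges between consecutive steps.

    timestamps: shape (T,), the time of each step.
    features: shape (T, D), synthetic medical data used as node features.
    tau: ASSUMED decay scale; edge weight = exp(-time_difference / tau).
    """
    timestamps = np.asarray(timestamps, dtype=float)
    features = np.asarray(features, dtype=float)
    if timestamps.shape[0] != features.shape[0]:
        raise ValueError("timestamps and features must have the same length")
    edges = [(i, i + 1) for i in range(len(timestamps) - 1)]
    edge_weights = [float(np.exp(-(timestamps[j] - timestamps[i]) / tau)) for i, j in edges]
    return {"node_features": features, "edges": edges, "edge_weights": edge_weights}

# Hypothetical series: three time steps at hours 0, 1, and 3, two indicators each.
graph = build_time_series_graph(
    timestamps=[0.0, 1.0, 3.0],
    features=[[88.0, 1.8], [96.0, 2.4], [104.0, 3.1]],
)
```

With this rule, steps that are closer in time receive a larger edge weight, so a network that aggregates neighbours by edge weight gives more influence to recent observations.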
[0057] The above-described time-series data generation method based on clinical knowledge and a differential diffusion model comprises the following steps: S1. Acquiring multiple pieces of medical data and extracting multiple medical constraints from a multi-source clinical knowledge base; the medical data includes data from at least two time steps; S2. Determining the difference result between the data at the two time steps of each piece of medical data; S3. Determining the noise addition results for multiple time steps through forward diffusion based on a preset noise sequence and the difference result; the noise sequence includes the added noise at each time step; S4. Denoising the noise addition results at each time step through backward diffusion to obtain the denoised results for each time step, and obtaining a penalty value based on the denoised results and the medical constraints; S5. When the penalty value is greater than the penalty threshold, updating the noise sequence in S3, and repeating S3 and S4 based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold, and outputting the target difference results for each time step; S6. Obtaining the time-series data corresponding to each piece of medical data based on the data of the first time step in each piece of medical data and the corresponding target difference results; the time-series data includes synthetic medical data from multiple time steps. When generating time-series data, this method can combine the difference results of real medical data with forward diffusion, backward diffusion, and medical constraints, so that the generated time-series data not only takes into account the real situation, but also meets the various constraints in the medical constraints, thus improving the reliability of the time-series data.
Attached Figure Description
[0058] Figure 1 is an application environment diagram of a time-series data generation method based on clinical knowledge and a differential diffusion model in one embodiment;
[0059] Figure 2 is a flowchart of a time-series data generation method based on clinical knowledge and a differential diffusion model in one embodiment;
[0060] Figure 3 is a schematic diagram of the processing flow of a time-weighted graph neural network in one embodiment;
[0061] Figure 4 is a schematic diagram of the overall process of a time-series data generation method based on clinical knowledge and a differential diffusion model in another embodiment;
[0062] Figure 5 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0063] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0064] The time-series data generation method based on clinical knowledge and a differential diffusion model provided in this application can be applied in the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a wired or wireless channel. A data storage system can store the data that server 104 needs to process. The processing steps of server 104 include: S1, acquiring multiple pieces of medical data and extracting multiple medical constraints from a multi-source clinical knowledge base; the medical data includes data from at least two time steps; S2, determining the difference result between the data at the two time steps of each piece of medical data; S3, determining the noise addition results of multiple time steps through forward diffusion based on a preset noise sequence and the difference result; the noise sequence includes the added noise at each time step; S4, denoising the noise addition results of each time step through backward diffusion to obtain the denoised results of each time step, and obtaining a penalty value based on the denoised results and the medical constraints; S5, when the penalty value is greater than the penalty threshold, updating the noise sequence in S3, and repeating S3 and S4 based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold, and outputting the target difference result of each time step; S6, obtaining the time-series data corresponding to each piece of medical data based on the data of the first time step in each piece of medical data and the corresponding target difference results; the time-series data includes synthetic medical data from multiple time steps. The terminal 102 can be, but is not limited to, a personal computer, laptop, smartphone, tablet, or IoT device. The server 104 can be a single server, a server cluster consisting of multiple servers, or a cloud computing center consisting of multiple servers.
[0065] In one embodiment, as shown in Figure 2, a method for generating time-series data based on clinical knowledge and a differential diffusion model is provided. Taking the application of this method to server 104 in Figure 1 as an example, the method includes the following steps:
[0066] S1. Acquire multiple medical data points and extract multiple medical constraints from a multi-source clinical knowledge base; the medical data includes data from at least two time steps.
[0067] Medical data refers to clinical data recorded during the provision of medical services, health management, or medical research. A single medical data entry includes clinical data collected from a patient with a target disease at at least one clinical stage. A target disease can refer to a disease with limited actual clinical data, or it can refer to a disease with limited clinical data during the critical stage. A clinical stage refers to the stage of disease development, such as early stage, intermediate stage, recovery stage, or critical stage. Data from the same patient at different clinical stages is included in the same medical data entry.
[0068] Medical data is represented as $X = \{x_{t_1}, x_{t_2}, \ldots, x_{t_n}\}$, where $x_{t_n}$ denotes the medical data collected during clinical stage n. $x_{t_n}$ is multidimensional; for example, $x_{t_n}$ includes dimensions such as vital signs, laboratory test data, organ function indicators, and inflammation and immune indicators.
[0069] Each piece of medical data is associated with an outcome label, which represents the current status of the patient to whom the medical data belongs. The current status includes both survival and death.
[0070] It should be noted that the medical data and verification medical data used in this application are all data authorized by the patients.
[0071] A multi-source clinical knowledge base refers to clinical knowledge bases of multiple types, including, but not limited to, medical patent literature, medical guidelines, textbooks, drug instructions, and clinical record documents. For example, these sources include journal articles, reviews, and clinical research reports in the target disease area, the Surviving Sepsis Campaign guidelines, and intensive care treatment guidelines.
[0072] Medical constraints refer to the rules that define the reasonable scope, evolution trend, and state transition path of the generated synthetic medical data.
[0073] S2. Determine the difference between the data at two time steps for each medical data point;
[0074] The difference result is obtained by subtracting the data from two consecutive time steps. The two time steps are adjacent time steps.
[0075] When calculating the difference results, the data from the two time steps are from the same clinical phase. A difference result can be obtained from the data from both time steps within each clinical phase.
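The difference computation of step S2 can be sketched in Python as follows, assuming the data of one clinical stage is stored as an array with one row per time step. The function name `difference_results` and the sample values are hypothetical, and the sketch takes each difference as the later step minus the earlier step, which is an assumption about the direction of subtraction.

```python
import numpy as np

def difference_results(stage_data):
    """Differences between adjacent time steps within one clinical stage.

    stage_data: shape (T, D) with T >= 2 time steps and D indicators.
    Returns shape (T - 1, D); row k is step k+1 minus step k (ASSUMED direction).
    """
    stage_data = np.asarray(stage_data, dtype=float)
    if stage_data.ndim != 2 or stage_data.shape[0] < 2:
        raise ValueError("expected a (T, D) array with at least two time steps")
    return np.diff(stage_data, axis=0)

# Hypothetical stage with three time steps and two indicators.
stage = np.array([[88.0, 1.8], [96.0, 2.4], [104.0, 3.1]])
deltas = difference_results(stage)
```

When a piece of medical data spans several clinical stages, the same function would be called once per stage so that no difference crosses a stage boundary.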
[0076] Furthermore, if the target medical data includes data collected from patients with the target disease at multiple clinical stages, and the collected data has at least two time steps in each clinical stage, then S2, S3, S4 and S5 are executed on the data of each clinical stage respectively to obtain synthetic medical data of multiple time steps in each clinical stage, and the synthetic medical data of each time step in each clinical stage is used as the time series data of the target medical data, where the target medical data is any data in the medical data.
[0077] S3. Based on the preset noise sequence and difference results, the noise addition results for multiple time steps are determined by forward diffusion; the noise sequence includes the added noise for each time step.
[0078] S4. The noise addition results at each time step are denoised by backward diffusion to obtain the denoised results at each time step. Based on the denoised results and medical constraints, the penalty value is obtained.
[0079] The penalty value can be obtained through a large language model. Specifically, by inputting the denoising results and medical constraints at each time step into the large language model, the penalty value corresponding to each denoising result can be output.
[0080] S5. When the penalty value is greater than the penalty threshold, update the noise sequence in S3, and repeat S3 and S4 based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold, and output the target difference results at each time step.
[0081] The penalty threshold is a preset value. The noise sequence in S3 is updated based on the penalty value.
[0082] The target difference result is the denoised result obtained when the penalty value is less than or equal to the penalty threshold. For example, if the penalty value obtained in step S4 in the 1000th iteration is the first to be less than or equal to the penalty threshold, then the denoised result of each time step obtained in step S4 in the 1000th iteration is the target difference result of each time step.
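The control flow of steps S3 to S5 can be sketched in Python as a loop that noises, denoises, scores, and updates the noise sequence until the penalty threshold is met. Every component below is a toy stand-in and an assumption: the denoiser simply rescales its input instead of calling a trained denoising network, the penalty is a numeric jump penalty instead of a language-model output, and the update rule halves the noise sequence because the application does not state how the update is computed. The iteration cap `max_iters` is also an addition for safety.

```python
import numpy as np

def refine_until_valid(delta0, betas, denoise, penalty, update, threshold,
                       max_iters=1000, seed=0):
    """Repeat S3 and S4 with an updated noise sequence until penalty <= threshold (S5)."""
    rng = np.random.default_rng(seed)
    betas = np.asarray(betas, dtype=float)
    for iteration in range(1, max_iters + 1):
        alpha_bar = float(np.prod(1.0 - betas))
        eps = rng.standard_normal(delta0.shape)
        # S3: forward diffusion of the difference result (closed form).
        noised = np.sqrt(alpha_bar) * delta0 + np.sqrt(1.0 - alpha_bar) * eps
        # S4: denoise, then score against the constraints.
        denoised = denoise(noised, betas)
        value = penalty(denoised)
        # S5: stop when the penalty is acceptable, otherwise update the noise sequence.
        if value <= threshold:
            return denoised, value, iteration
        betas = update(betas, value)
    raise RuntimeError("penalty threshold not reached within max_iters")

def toy_denoise(noised, betas):
    """Toy stand-in for the denoising network: undo only the signal scaling."""
    return noised / np.sqrt(np.prod(1.0 - betas))

def toy_penalty(deltas, jump_threshold=1.5):
    """Toy stand-in for the penalty: total change magnitude beyond a jump threshold."""
    return float(np.sum(np.maximum(np.abs(deltas) - jump_threshold, 0.0)))

def halve_noise(betas, penalty_value):
    """ASSUMED update rule: shrink the added noise whenever the penalty is too high."""
    return betas * 0.5

delta0 = np.array([0.5, -0.2, 1.0])  # hypothetical difference result
target, value, iterations = refine_until_valid(
    delta0, np.linspace(1e-4, 0.02, 100), toy_denoise, toy_penalty, halve_noise,
    threshold=0.0,
)
```

Passing the denoiser, penalty, and update rule as functions keeps the loop independent of how each one is implemented, so the toy stand-ins could be swapped for a trained network and a model-based scorer without changing the control flow.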
[0083] S6. Based on the data of the first time step in each piece of medical data and the corresponding target difference results, obtain the time-series data corresponding to each piece of medical data; the time-series data includes synthetic medical data from multiple time steps.
[0084] Here, the time-series data maintains statistical consistency with real-world medical data, is clinically valid under the medical constraints, and can improve downstream tasks by supplying otherwise scarce time-series data. For example, predictive models trained on the synthesized time-series data can be more accurate and reliable.
[0085] In the aforementioned time series data generation method based on clinical knowledge and differential diffusion model, this method can combine the differential results of real medical data with forward diffusion, backward diffusion, and medical constraints when generating time series data. This ensures that the generated time series data not only takes into account the real situation but also meets the various constraints in the medical constraints, thereby improving the reliability of the time series data.
[0086] In one embodiment, the extraction process of medical constraints in step S1 includes:
[0087] Text content related to the target disease is extracted from a multi-source clinical knowledge base, and the text content is preprocessed to obtain multiple structured text blocks;
[0088] Based on the preset indicator range extraction template, the first information including the indicator range is extracted from each text block. Based on the preset trend direction extraction template, the second information representing the indicator change trend of the target disease at each clinical stage is extracted from each text block. Based on the preset covariation pattern extraction template, the related indicator pairs and the covariation direction of the indicator pairs are extracted from each text block. The related indicator pairs and the covariation direction of the indicator pairs constitute the third information.
[0089] The first, second, and third pieces of information are deduplicated and conflict-resolved to obtain the processed first, second, and third pieces of information.
[0090] Based on the processed first, second, and third information, medical constraints are determined.
[0091] The target disease is the disease to which the desired time-series data belongs. For example, if we need to generate time-series data for disease A, we need to extract text content related to disease A from a multi-source clinical knowledge base.
[0092] Preprocessing of text content includes cleaning, deduplication, and paragraph segmentation to obtain multiple structured text blocks.
[0093] The template for extracting indicator ranges is: "Please extract the clinical normal range, warning range, and critical range of physiological and laboratory indicators related to the target disease from the following text block, and output the indicator name, range type, lower limit, upper limit, and unit in JSON format."
[0094] The trend direction extraction template is "Please extract the changing trends of key indicators of the target disease at each clinical stage from the following text block. The output format is: [Indicator] shows an [rising / falling / rising then falling / fluctuating] trend at [stage transition]".
[0095] The covariation pattern extraction template is: "Please extract the related indicator pairs and the covariation direction of the indicator pairs from the following text block. The output format is: When [Indicator A] [increases / decreases], [Indicator B] [increases / decreases / changes in opposite directions], and label the physiological mechanism basis in the text block."
[0096] The extraction of the first, second, and third information can be achieved through a language model or an API (Application Programming Interface). For example, the text blocks, the indicator range extraction template, the trend direction extraction template, and the covariation pattern extraction template can be input into the language model. The language model extracts and outputs the first information using the indicator range extraction template, the second information using the trend direction extraction template, and the third information using the covariation pattern extraction template.
[0097] There may be zero, one, or multiple pieces of each of the first, second, and third information. Each piece of information has a confidence level and a clinical knowledge base to which it belongs, namely the clinical knowledge base containing the text block from which it was extracted. For example, if first information A is extracted from a text block in clinical knowledge base B, then first information A belongs to clinical knowledge base B. The same applies to the second information and the third information.
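The template-driven extraction described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `build_prompt`, `extract_first_info`, and `fake_llm` are hypothetical names, the model call is replaced by a stub, and the indicator values are placeholders.

```python
import json
from typing import Callable

# Indicator range extraction template, quoted from the description.
RANGE_TEMPLATE = (
    "Please extract the clinical normal range, warning range, and critical range of "
    "physiological and laboratory indicators related to the target disease from the "
    "following text block, and output the indicator name, range type, lower limit, "
    "upper limit, and unit in JSON format."
)


def build_prompt(template: str, text_block: str) -> str:
    """Combine one extraction template with one structured text block."""
    return f"{template}\n\nText block:\n{text_block}"


def extract_first_info(text_block: str, kb_name: str, llm: Callable[[str], str]) -> list:
    """Query the model and tag every extracted item with its clinical knowledge base."""
    raw = llm(build_prompt(RANGE_TEMPLATE, text_block))
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return []  # zero pieces of information is an allowed outcome
    if isinstance(items, dict):
        items = [items]
    return [{**item, "knowledge_base": kb_name} for item in items]


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a model call; the numbers are placeholders."""
    return json.dumps([{"indicator": "indicator A", "range_type": "normal",
                        "lower": 1.0, "upper": 2.0, "unit": "unit"}])


first_info = extract_first_info("Indicator A is normally 1.0 to 2.0 units.",
                                "clinical knowledge base B", fake_llm)
print(first_info)
```

The second and third information could be extracted the same way by swapping in the trend direction and covariation pattern templates.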
[0098] Deduplication refers to removing duplicate information from the first, second, and third pieces of information. The confidence level of the retained information after deduplication is determined based on the confidence level of the duplicate information. For example, if second information A states that "hemoglobin levels are significantly lower in the severe stage than in the early stage," and second information B states that "hemoglobin levels usually show a decreasing trend as the disease progresses," then these two pieces of information are duplicates. The confidence level of both second information A and second information B is 1. Therefore, the confidence level of the retained second information A or second information B after deduplication is (1+1) / 2 = 1.
[0099] Conflict resolution refers to, when two pieces of information contradict each other, either removing the piece whose clinical knowledge base has the lower priority or merging the indicator ranges. For example, second information A indicates that "hemoglobin levels are significantly lower in the severe stage than in the early stage," and second information A comes from Medical Guideline 1. Second information B indicates that "hemoglobin levels are abnormally high in the severe stage, exceeding early levels," and second information B comes from Medical Guideline 2. Since Medical Guideline 1 has a higher priority than Medical Guideline 2, second information B is removed. Merging indicator ranges refers to taking the union of the indicator ranges of the two contradictory pieces of information as the indicator range of the new information.
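The deduplication and conflict resolution rules can be illustrated with a short sketch. It assumes that duplicate items have already been mapped to a shared `key` (how duplicates are detected is not specified here) and that a larger priority number means a higher-priority clinical knowledge base; all field names are hypothetical.

```python
from statistics import mean


def deduplicate(items):
    """Keep one item per key; its confidence is the mean of the duplicates' confidences."""
    groups = {}
    for item in items:
        groups.setdefault(item["key"], []).append(item)
    return [{**group[0], "confidence": mean(i["confidence"] for i in group)}
            for group in groups.values()]


def resolve_by_priority(a, b, priority):
    """Keep the item whose clinical knowledge base has the higher priority."""
    return a if priority[a["source"]] >= priority[b["source"]] else b


def merge_ranges(a, b):
    """Take the union of two contradictory indicator ranges."""
    return {**a, "lower": min(a["lower"], b["lower"]), "upper": max(a["upper"], b["upper"])}


duplicates = [
    {"key": "hemoglobin:falling", "confidence": 1.0, "source": "Medical Guideline 1"},
    {"key": "hemoglobin:falling", "confidence": 1.0, "source": "Medical Guideline 2"},
]
kept = deduplicate(duplicates)  # one item with confidence (1 + 1) / 2 = 1
```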
[0100] The processed first, second, and third information includes the threshold range, trend direction, and correlation pattern of various indicators related to the target disease, which provides medical knowledge constraints for the subsequent generation of small-sample time-series data.
[0101] In this embodiment, by deduplicating and resolving conflicts in the first, second, and third information, the processed first, second, and third information are obtained. This can eliminate redundant information and obtain first, second, and third information with high recall.
[0102] In one embodiment, determining medical constraints based on the processed first information, second information, and third information includes:
[0103] In response to the annotation operations for each first piece of information, second piece of information, and third piece of information, at least one first annotation result for each first piece of information, at least one second annotation result for each second piece of information, and at least one third annotation result for each third piece of information are obtained; the first annotation result is used to characterize whether the first piece of information is valid, the second annotation result is used to characterize whether the second piece of information is valid, and the third annotation result is used to characterize whether the third piece of information is valid.
[0104] Based on the first annotation results, the second annotation results, and the third annotation results, invalid first target information is determined from the first information, invalid second target information is determined from the second information, and invalid third target information is determined from the third information;
[0105] In response to the correction operations on the first target information, the second target information, and the third target information, correction information is obtained;
[0106] Based on the correction information, the valid first information, the valid second information, and the valid third information, medical constraints are determined.
[0107] While the processed first, second, and third information covers a wide range of clinical knowledge, real clinical data is also affected by immune aging, multiple comorbidities, and heterogeneous recovery processes. This can lead to discrepancies between the standard rules described in the multi-source clinical knowledge base and actual clinical observations. Therefore, it is necessary to perform annotation and correction operations on the processed first, second, and third information to reduce the deviation between medical constraints and actual conditions.
[0108] Both annotation and correction operations are triggered by the expert panel through the human-computer interaction interface. The expert panel includes at least one expert, and for each piece of first, second, or third information, the number of annotation results equals the number of experts who triggered an annotation operation on it.
[0109] Validity refers to whether a piece of first / second / third information is clinically applicable. If an expert deems a piece of information inapplicable, an invalid annotation operation is triggered; if an expert deems a piece of information applicable, a valid annotation operation is triggered. Both invalid and valid annotation operations are annotation operations.
[0110] Determining invalid first target information from the first information includes: if there are one or more annotation results in the first annotation results of the first information that represent invalidity, then the first information is determined to be invalid first target information; or, if all the first annotation results of the first information represent invalidity, then the first information is determined to be invalid first target information.
[0111] Determining invalid second target information from the second information includes: if there are one or more annotation results in the second annotation results of the second information that represent invalidity, then the second information is determined to be invalid second target information; or, if all the second annotation results of the second information represent invalidity, then the second information is determined to be invalid second target information.
[0112] Determining invalid third target information from third information includes: if there are one or more annotation results in the third annotation results of the third information that represent invalidity, then the third information is determined to be invalid third target information; or, if all the third annotation results of the third information represent invalidity, then the third information is determined to be invalid third target information.
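The two alternative decision rules above (at least one invalid annotation, or all annotations invalid) can be expressed compactly. The function name and the boolean encoding of annotation results are assumptions made for illustration.

```python
def is_invalid(annotations, rule="any"):
    """Decide whether a piece of information is invalid target information.

    annotations: one boolean per expert, True meaning the expert marked it valid.
    rule: "any" flags the item if at least one expert marked it invalid;
          "all" flags it only if every expert marked it invalid.
    """
    if not annotations:
        raise ValueError("at least one annotation result is required")
    invalid_votes = [not valid for valid in annotations]
    return any(invalid_votes) if rule == "any" else all(invalid_votes)
```

With three experts voting valid, valid, invalid, the "any" rule flags the item for correction while the "all" rule keeps it.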
[0113] The correction operation is mainly used to correct invalid first, second, and third target information. For example, the first target information indicates that the hemoglobin content is between X1 and X2, but the actual hemoglobin content should be between X3 and X4, so correction is needed.
[0114] Furthermore, reliability scoring is performed on each piece of correction information, valid first information, valid second information, and valid third information to obtain a score result for each of them.
[0115] Furthermore, the first, second, and third information are displayed in a structured annotation table.
[0116] In this embodiment, correction information is obtained by responding to correction operations on the first target information, the second target information, and the third target information. Based on the correction information, the valid first information, the valid second information, and the valid third information, medical constraints are determined. This can correct information in the multi-source clinical knowledge base that deviates from actual clinical observation, thereby making the medical constraints more consistent with the actual situation.
[0117] In one embodiment, determining medical constraints based on the correction information, the valid first information, the valid second information, and the valid third information includes:
[0118] In response to the information supplementation operation targeting the evolution pattern of the target disease, supplementary information characterizing the evolution pattern of the target disease is obtained. The supplementary information includes baseline drift information characterizing the baseline deviation of physiological indicators caused by immune aging and underlying diseases, delayed response information characterizing the delayed or weakened response of treatment interventions, and organ coupling information characterizing cross-organ effect constraints.
[0119] Natural language parsing is performed on the supplementary information, the correction information, and the valid first, second, and third information to obtain multiple parsed texts.
[0120] Entities are extracted from each parsed text, and triples are constructed for each parsed text based on the entities extracted from it.
[0121] Each triple is subjected to conflict resolution, and the triples obtained after conflict resolution are used as the medical constraints.
[0122] The information supplementation operation can be performed by experts through a human-computer interaction interface. The supplementary information consists of disease evolution patterns that are specific to patients with the target disease but missing from the multi-source clinical knowledge base. This information supplementation process can also reduce the discrepancy between medical constraints and actual conditions.
[0123] Baseline drift information can be understood as indicator ranges for specific age groups that are missing from the existing information (the first information in the correction information and the valid first information). For example, older adults may have age-related inflammation, so their white blood cell count can be much higher than that of younger adults. If the existing information only includes the white blood cell count range for older adults, then the white blood cell count range for younger adults should be added.
[0124] Delayed response information can be understood as indicator trend information for specific age groups that is missing from the existing information (the second information in the correction information and the valid second information). For example, if the existing information only indicates that indicator A changes significantly in older adults on the first day of treatment and on the sixth day after treatment, but lacks information on the trend of indicator A in younger adults, then information should be added indicating that indicator A changes significantly in younger adults a few days later.
[0125] Organ coupling information can be understood as third information for specific age groups that is missing from the existing information (the third information in the correction information and the valid third information). For example, if the existing information only includes the third information corresponding to older adults, then the third information for younger adults should be supplemented.
[0126] Natural language parsing breaks down natural language (the supplementary information, the correction information, and the valid first, second, and third information), analyzes its grammatical structure and semantic logic, and ultimately transforms it into parsed text that computers can understand and execute.
[0127] The parsed text includes a first parsed text concerning the range of indicators, a second parsed text concerning the direction of indicator changes, and a third parsed text concerning the transfer path between disease stages.
[0128] Entity extraction can be achieved using rule templates and a language model. Specifically, the rule template and parsed text are input into the language model, which outputs the entities in each parsed text. For example, entities extracted from the first parsed text include "indicator name", "lower limit", "upper limit", "unit", and "applicable conditions"; entities extracted from the second parsed text include "indicator name", "trend direction", "trigger condition", and "time window"; and entities extracted from the third parsed text include "source stage", "target stage", "allowed state", "prohibited state", and "time constraint".
[0129] The extracted entities are structured and encoded according to a predefined JSON Schema, resulting in triples corresponding to each parsed text. Examples of triples include [hemoglobin, lower bound, upper bound].
[0130] The conflict resolution process for each triple includes: labeling each triple with the priority and applicable population of the clinical knowledge base to which it belongs; and, for two contradictory triples, either removing the triple whose clinical knowledge base has the lower priority or merging the indicator ranges, to obtain the triples after conflict resolution. The clinical knowledge base to which a triple belongs is determined as follows: taking any triple as the target triple, the target parsed text of the target triple is determined from the parsed texts; the fourth target information corresponding to the target parsed text is determined from among the supplementary information, the correction information, and the valid first, second, and third information; and the clinical knowledge base to which the fourth target information belongs is taken as the clinical knowledge base to which the target triple belongs. Further, if the fourth target information is supplementary information or correction information, then the clinical knowledge base of the triple is the clinical record document.
[0131] The conflict resolution process for each triple includes: labeling the confidence level of each triple, and, for two contradictory triples, either removing the triple with the lower confidence level or merging the indicator ranges, to obtain the triples after conflict resolution. The confidence level of a triple can be determined based on the priority of the clinical knowledge base to which the triple belongs.
[0132] The conflict resolution process for each triple includes: when one of two contradictory triples is derived from valid first / second / third information, removing the triple derived from the valid first / second / third information to obtain the triples after conflict resolution.
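One way to combine the triple-level rules above is sketched below. The `Triple` fields and the order in which the rules are applied are assumptions for illustration: a triple derived from supplementary or correction information is preferred over one derived from valid first / second / third information, and otherwise the higher-priority clinical knowledge base wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    indicator: str
    lower: float
    upper: float
    source: str        # clinical knowledge base the triple is attributed to
    priority: int      # priority of that knowledge base (larger means higher)
    from_expert: bool  # True for supplementary or correction information


def resolve(a: Triple, b: Triple) -> Triple:
    """Pick one of two contradictory triples."""
    if a.from_expert != b.from_expert:
        return a if a.from_expert else b          # expert-derived triple is kept
    return a if a.priority >= b.priority else b   # otherwise compare priorities


# Placeholder values, not clinical reference ranges.
kb_triple = Triple("indicator A", 1.0, 2.0, "Medical Guideline 1", 2, False)
expert_triple = Triple("indicator A", 1.5, 2.5, "clinical record document", 1, True)
chosen = resolve(kb_triple, expert_triple)
```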
[0133] In this embodiment, supplementary information characterizing the evolution of the target disease is obtained by responding to information supplementation operations based on the evolution of the target disease. The supplementary information includes baseline drift information characterizing the baseline shift of physiological indicators caused by immune aging and underlying diseases, delayed response information characterizing the delayed or weakened response of treatment interventions, and organ coupling information characterizing cross-organ constraints. This can supplement the information on disease evolution patterns that are missing in the multi-source clinical knowledge base but are specific to the target disease patient, thereby making the final determined medical constraints more comprehensive.
[0134] In one embodiment, multiple rounds of expert discussion are organized to reach a consensus on disagreements concerning the supplementary information, the correction information, and the valid first, second, and third information. Specifically, a modified Delphi method is employed, involving anonymous voting and feedback loops on the disputed information until a predetermined consensus percentage (e.g., 80%) is reached, yielding the final information used to determine the medical constraints.
[0135] In one embodiment, step S3 includes:
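[0136] For each piece of medical data, based on a preset noise sequence and the difference result corresponding to the medical data, the noise addition results at multiple time steps are determined through

$$\Delta x_k = \sqrt{1-\beta_k}\,\Delta x_{k-1} + \sqrt{\beta_k}\,\epsilon \quad \text{or} \quad \Delta x_k = \sqrt{\bar{\alpha}_k}\,\Delta x_0 + \sqrt{1-\bar{\alpha}_k}\,\epsilon;$$

[0137] where $\Delta x_0$ is the difference result, $\Delta x_k$ is the noise addition result at time step $k$, $\beta_k$ is the noise added at time step $k$, $\Delta x_{k-1}$ is the noise addition result at time step $k-1$, $\bar{\alpha}_k$ is the cumulative coefficient for time step $k$, $\bar{\alpha}_k = \prod_{i=1}^{k}(1-\beta_i)$, $\epsilon$ is standard Gaussian noise, and $k$ is a non-negative integer.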
[0138] Step S3 is performed separately for each piece of medical data.
[0139] In this embodiment, for each piece of medical data, the noise addition results at multiple time steps are determined from the preset noise sequence and the difference result corresponding to the medical data, using either of the two noise addition formulas of step S3. This can effectively smooth the complex distribution of high-dimensional clinical data and reduce the difficulty of generating the noise addition results.
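A minimal numerical sketch of the forward noising in step S3 follows, assuming the standard DDPM parameterization in which the cumulative coefficient is the running product of one minus the per-step noise. The schedule values, the example difference vector, and all function names are illustrative assumptions.

```python
import numpy as np


def cumulative_coefficients(betas):
    """Cumulative coefficient per step: running product of (1 - beta_i)."""
    return np.cumprod(1.0 - np.asarray(betas, dtype=float))


def forward_step(prev, beta_k, rng):
    """Step-by-step form: noise the result of step k-1 to obtain step k."""
    eps = rng.standard_normal(prev.shape)
    return np.sqrt(1.0 - beta_k) * prev + np.sqrt(beta_k) * eps


def forward_closed_form(diff, alpha_bar_k, rng):
    """Closed form: noise the difference result directly to step k."""
    eps = rng.standard_normal(diff.shape)
    return np.sqrt(alpha_bar_k) * diff + np.sqrt(1.0 - alpha_bar_k) * eps


betas = np.linspace(1e-4, 0.02, 50)      # assumed preset noise sequence
alpha_bar = cumulative_coefficients(betas)
rng = np.random.default_rng(0)
diff = np.array([0.5, -1.2, 0.3])        # placeholder difference result
noised = [forward_closed_form(diff, ab, rng) for ab in alpha_bar]
```

The closed form avoids iterating through every intermediate step when only the result at a particular step is needed.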
[0140] In one embodiment, the medical constraints include a first constraint corresponding to the range of the indicator, a second constraint corresponding to the trend of the indicator's change, and a third constraint corresponding to the jump threshold. Step S4 includes:
[0141] For each piece of medical data, the noise addition results of each time step, the clinical stage of each time step, and the medical constraints are input into a pre-trained denoising network to obtain the noise to be removed from each noise addition result.
[0142] Based on the noise to be removed from each noise addition result, the noise addition result at each time step is denoised through

$$\Delta\hat{x}_{k-1} = \frac{1}{\sqrt{1-\beta_k}}\left(\Delta x_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\,\epsilon_\theta\right) + \sigma_k z$$

to obtain the denoising result at each time step;
[0143] The denoising results and the medical constraints are input into the pre-trained language model to obtain the indicator range penalty value corresponding to the first constraint, the change penalty value corresponding to the second constraint, and the jump penalty value corresponding to the third constraint.
[0144] The indicator range penalty value, the change penalty value, and the jump penalty value are weighted and summed to obtain the penalty value.
[0145] where $\Delta\hat{x}_{k-1}$ is the denoising result at time step $k-1$, $\beta_k$ is the noise added at time step $k$, $\Delta x_k$ is the noise addition result at time step $k$, $\bar{\alpha}_k$ is the cumulative coefficient for time step $k$, $\epsilon_\theta$ is the noise to be removed from the noise addition result at time step $k$, $\sigma_k^2$ is the variance at time step $k$, and $z$ is random noise.
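A single denoising step can be sketched as below, assuming the standard DDPM reverse update. The denoising network itself is not modeled: its output (the noise to be removed) is passed in as `eps_hat`, and all names are hypothetical.

```python
import numpy as np


def reverse_step(x_k, eps_hat, beta_k, alpha_bar_k, sigma_k, rng):
    """Denoise the step-k result into the step-(k-1) result.

    x_k:         noise addition result at step k
    eps_hat:     noise to be removed, as predicted by the denoising network
    beta_k:      noise added at step k
    alpha_bar_k: cumulative coefficient at step k
    sigma_k:     standard deviation of the random noise injected at step k
    """
    z = rng.standard_normal(x_k.shape)
    mean = (x_k - beta_k / np.sqrt(1.0 - alpha_bar_k) * eps_hat) / np.sqrt(1.0 - beta_k)
    return mean + sigma_k * z


# Sanity check: one forward step followed by one reverse step with the true
# noise and sigma_k = 0 recovers the starting value.
rng = np.random.default_rng(0)
beta = 0.1
x0 = np.array([1.0, -2.0])
eps = np.array([0.3, 0.7])
x1 = np.sqrt(1.0 - beta) * x0 + np.sqrt(beta) * eps
recovered = reverse_step(x1, eps, beta, 1.0 - beta, 0.0, rng)
```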
[0146] The first medical constraint is a triple constructed from the first parsed text concerning the range of indicators; the second medical constraint is a triple constructed from the second parsed text concerning the direction of indicator change; and the third medical constraint is a triple constructed from the third parsed text concerning the transfer path between clinical stages.
[0147] The jump threshold refers to the maximum jump of a certain indicator between two clinical stages.
[0148] The first medical constraint enforces, for each indicator $j$, the medically valid boundary $l_j \le \hat{x}_j \le u_j$, where $\hat{x}_j$ is the generated value of indicator $j$, and $l_j$ and $u_j$ are the lower and upper limits of indicator $j$. The first medical constraint prevents invalid output. For example, it prevents negative counts or indicators exceeding lethal physiological limits from appearing in the time-series data.
[0149] The second medical constraint applies a sign consistency constraint to the generated target difference result $\Delta\hat{x}$: $\mathrm{sign}(\Delta\hat{x}_j) = g(S_n, j; K)$, where $g(\cdot)$ returns, under the guidance of $K$, the expected direction of change of indicator $j$ in clinical stage $S_n$, $K$ represents the first constraint condition, and $\mathrm{sign}(\cdot)$ is the sign function.
[0150] The third medical constraint prevents abnormal jumps between clinical stages, and its expression is $|\Delta\hat{x}_j| \le \tau_j(S_n)$, where the stage-related threshold $\tau_j(S_n)$ is given by the third medical constraint. The third medical constraint penalizes unreasonable single-step jumps, thereby ensuring that the generated small-sample time-series data are temporally continuous and clinically reliable.
[0151] The formula for the weighted summation of the indicator range penalty value, the change penalty value, and the jump penalty value is as follows: $P = \lambda_1 P_{range} + \lambda_2 P_{change} + \lambda_3 P_{jump}$, where $P$ is the penalty value, $\lambda_1$ is the weight of the indicator range penalty value, $P_{range}$ is the indicator range penalty value, $\lambda_2$ is the weight of the change penalty value, $P_{change}$ is the change penalty value, $\lambda_3$ is the weight of the jump penalty value, and $P_{jump}$ is the jump penalty value.
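The description obtains the three penalty values from a pre-trained language model. As a purely numeric stand-in, the sketch below assumes hinge-style penalties that are zero when a constraint is satisfied and grow with the size of the violation; the function names, penalty shapes, and example values are assumptions, and only the weighted sum mirrors the text directly.

```python
import numpy as np


def range_penalty(values, lower, upper):
    """Total distance by which values fall outside [lower, upper]."""
    values = np.asarray(values, dtype=float)
    below = np.maximum(lower - values, 0.0)
    above = np.maximum(values - upper, 0.0)
    return float(np.sum(below + above))


def change_penalty(diffs, expected_sign):
    """Total magnitude of changes that move against the expected direction.

    expected_sign holds +1 (rising), -1 (falling), or 0 (no expectation).
    """
    diffs = np.asarray(diffs, dtype=float)
    return float(np.sum(np.maximum(-np.asarray(expected_sign) * diffs, 0.0)))


def jump_penalty(diffs, threshold):
    """Total amount by which single-step jumps exceed the jump threshold."""
    diffs = np.asarray(diffs, dtype=float)
    return float(np.sum(np.maximum(np.abs(diffs) - threshold, 0.0)))


def total_penalty(p_range, p_change, p_jump, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three penalty values."""
    w_range, w_change, w_jump = weights
    return w_range * p_range + w_change * p_change + w_jump * p_jump


p = total_penalty(
    range_penalty([5.0, 12.0], 0.0, 10.0),   # 12 exceeds the upper limit by 2
    change_penalty([1.0, -3.0], [1, 1]),     # the second change falls by 3
    jump_penalty([1.0, -3.0], 2.0),          # the second jump exceeds 2 by 1
    weights=(1.0, 0.5, 2.0),
)
```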
[0152] When medical data for the target disease is scarce, purely data-driven denoising may produce hallucinations, such as unreasonable spikes or sudden stage transitions. Therefore, this application injects structured medical constraints as guidance during the noise addition and denoising processes to update the noise sequence, ensuring that the denoising results meet biological constraints while maintaining sample diversity.
[0153] In this embodiment, for each piece of medical data, the noise addition results at each time step, the clinical stage at each time step, and the medical constraints are input into a pre-trained denoising network to obtain the noise to be removed from each noise addition result. Based on the noise to be removed, the noise addition result at each time step is denoised through the denoising formula of step S4 to obtain the denoising result at each time step. The denoising results and the medical constraints are input into the pre-trained language model to obtain the indicator range penalty value corresponding to the first constraint, the change penalty value corresponding to the second constraint, and the jump penalty value corresponding to the third constraint. These three penalty values are weighted and summed to obtain the penalty value. In this way, the medical constraints guide the update of the noise sequence, so that the final denoising result meets the biological constraints while maintaining sample diversity.
[0154] In one embodiment, step S6 includes:
[0155] For each piece of medical data, based on the data from the first time step in the medical data and each target difference result, the time-series data corresponding to the medical data is obtained through

$$\hat{x}_{k+1} = \hat{x}_k + \Delta\hat{x}_k;$$

the time-series data includes synthetic medical data from multiple time steps.
[0156] where $\hat{x}_{k+1}$ is the synthetic medical data at time step $k+1$, and $\Delta\hat{x}_k$ is the target difference result at time step $k$; when $k$ is 0, $\hat{x}_k$ is the data from the first time step in the medical data, and when $k$ is greater than 0, $\hat{x}_k$ is the synthetic medical data at time step $k$.
[0157] In this embodiment, for each piece of medical data, the time-series data corresponding to the medical data is obtained from the data from the first time step in the medical data and each target difference result through the formula of step S6, which simplifies the process of generating time-series data.
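Because each synthetic value is the previous value plus a target difference result, step S6 amounts to a cumulative sum. The sketch below assumes the differences are stacked as one row per time step; the function name and numbers are illustrative.

```python
import numpy as np


def reconstruct_series(first_step, target_diffs):
    """Rebuild a series from its first time step and per-step target differences.

    first_step:   shape (d,), data from the first time step
    target_diffs: shape (K, d), target difference result for steps 0..K-1
    Returns an array of shape (K + 1, d) whose first row is first_step.
    """
    first_step = np.asarray(first_step, dtype=float)
    target_diffs = np.asarray(target_diffs, dtype=float)
    return np.vstack([first_step, first_step + np.cumsum(target_diffs, axis=0)])


series = reconstruct_series([10.0, 1.0], [[1.0, 0.0], [2.0, -1.0]])
# rows: [10, 1], [11, 1], [13, 0]
```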
[0158] In one embodiment, the time-series data includes time-series data of the target disease at at least one clinical stage, and the method further includes:
[0159] Based on the time-series data of each clinical stage, the reliability value of the time-series data is calculated using a formula whose terms are: $S_n$, which denotes clinical stage $n$; the weight of clinical stage $n$; the conditional probability distribution of the real data of the target disease at clinical stage $n$; the conditional probability distribution of the time-series data of the target disease at clinical stage $n$; $\mathrm{KL}(\cdot\,\|\,\cdot)$, which denotes the KL divergence; and the target difference result;
[0160] The time-series data of a target stage is considered reliable if the reliability value is greater than or equal to the reliability threshold; the target stage is any one of the clinical stages.
[0161] The formula above is the phase-aware manifold calibration formula, which complements the clinical constraint guidance. Phase-aware manifold calibration enforces population-level distribution fidelity and coverage, while the medical constraints ensure physiological rationality at the sample level. Together, they stabilize diffusion sampling and significantly reduce the likelihood of collapse to a few dominant modes.
[0162] Reliability values can characterize whether the time series data conforms to the clinical characteristics of the corresponding clinical stage, and can also represent the "distance" or "difference" between the time series data distribution and the actual data distribution. If the generated time series data is determined to be reliable, downstream tasks are performed using the generated time series data. If the generated time series data is determined to be unreliable, the time series data is discarded.
[0163] In this embodiment, the reliability value of the time-series data is calculated by formula based on the time-series data of each clinical stage. This value can be used to determine whether the generated time-series data conforms to the clinical characteristics of the corresponding clinical stage, and therefore whether the time-series data should be retained or applied to downstream tasks.
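The stage-weighted KL term that the reliability value builds on can be sketched as follows. The sketch assumes each stage's conditional distribution is approximated by a histogram over shared bins, and it stops at the weighted divergence: how that divergence is mapped to a reliability value and compared with the threshold is not reproduced here. All names and numbers are illustrative.

```python
import numpy as np


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two histograms over the same bins."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))


def weighted_stage_divergence(real_hists, synth_hists, weights):
    """Sum over clinical stages of weight times KL(real || synthetic)."""
    return sum(w * kl_divergence(p, q)
               for p, q, w in zip(real_hists, synth_hists, weights))


real = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]    # placeholder per-stage histograms
synth = [[0.2, 0.5, 0.3], [0.3, 0.4, 0.3]]
divergence = weighted_stage_divergence(real, synth, weights=[0.5, 0.5])
```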
[0164] In one embodiment, the time-series data includes time-series data of the target disease at at least one clinical stage, and the method further includes:
[0165] For each piece of medical data, each time step of each clinical stage is used as a node, and the synthetic medical data of each time step is used as the node feature. By connecting each node, a time series graph of each piece of medical data is obtained. The edge weight between two connected nodes in the time series graph is determined based on the time difference, which is the time difference between the two connected nodes.
[0166] A time-weighted graph neural network is trained based on each time-series graph to obtain the trained time-weighted graph neural network.
[0167] Multiple sets of pre-set validation medical data are input into the trained time-weighted graph neural network to obtain the prediction result for each set of validation medical data.
[0168] Based on the prediction results and the actual results of the validation medical data, the evaluation index of the trained time-weighted graph neural network is determined.
[0169] The validity of time series data is determined based on evaluation indicators.
[0170] The edge weight $e_{ij}$ is calculated as $e_{ij} = \exp(-\lambda\,\Delta t_{ij})$, where $\Delta t_{ij}$ is the time difference and $\lambda$ is the preset attenuation coefficient.
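Building the time-series graph can be sketched as below. Two assumptions are made that the text does not fix: edges connect consecutive time steps only, and the edge weight decays exponentially with the time difference. The function name and timestamps are illustrative.

```python
import numpy as np


def temporal_adjacency(timestamps, decay):
    """Weighted adjacency matrix of a chain graph over time steps.

    Each pair of consecutive time steps is connected by an undirected edge
    whose weight is exp(-decay * time difference).
    """
    t = np.asarray(timestamps, dtype=float)
    n = len(t)
    adj = np.zeros((n, n))
    for i in range(n - 1):
        weight = np.exp(-decay * abs(t[i + 1] - t[i]))
        adj[i, i + 1] = adj[i + 1, i] = weight
    return adj


adj = temporal_adjacency([0.0, 1.0, 3.0], decay=0.5)
```

Under this weighting, observations that are far apart in time influence each other less during message passing.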
[0171] The time-weighted graph neural network uses a three-layer GCN (Graph Convolutional Network) with cross-layer residual connections, as shown in Figure 3. The details are as follows:
[0172] First layer: maps the node features of the time-series graph to the hidden space, as shown in the following expression: $H^{(1)} = \mathrm{GCN}_1(X, A, E)$, where $X$ represents the node features in the time-series graph, $A$ is the adjacency matrix, $E$ is the edge weight, and $H^{(1)}$ is the result of feature mapping.
[0173] Second layer: further extracts higher-order neighborhood information, as shown in the following expression: $H^{(2)} = \mathrm{GCN}_2(H^{(1)}, A, E)$, where $H^{(2)}$ is the result of extracting higher-order neighborhood information.
[0174] Third layer: further refines the feature representation, as shown in the following expression: $H^{(3)} = \mathrm{GCN}_3(H^{(2)}, A, E)$, where $H^{(3)}$ is the refined feature representation.
[0175] Cross-layer residual connections: $H^{(1)}$, $H^{(2)}$, and $H^{(3)}$ are fused to alleviate the vanishing gradient problem in deep networks while preserving semantic information at different levels, resulting in the fused feature $H_{concat}$. The fusion expression is $H_{concat} = \mathrm{Concat}(H^{(1)}, H^{(2)}, H^{(3)})$.
[0176] Structured pruning layer: learnable architecture parameters $\alpha_i$ are introduced, and the gating weights are given by $g_i = \sigma(\alpha_i / \tau)$, where $\sigma$ is the sigmoid function, $\tau$ is the temperature parameter, and $g_i$ is the gating weight of the $i$-th layer of the network. The pruned feature representation of each layer is determined through $\tilde{H}^{(i)} = g_i \cdot H^{(i)}$. The pruned feature representations of the layers are concatenated to obtain the concatenated feature $H_{pruned}$, as follows: $H_{pruned} = \mathrm{Concat}(\tilde{H}^{(1)}, \tilde{H}^{(2)}, \tilde{H}^{(3)})$.
[0177] Global pooling and classification: global average pooling is applied to the concatenated feature $H_{pruned}$ to obtain the graph-level representation $h_{graph}$, with the expression $h_{graph} = \frac{1}{N}\sum_{i=1}^{N} H_{pruned}[i]$, where $N$ is the number of nodes in the time-series graph and $H_{pruned}[i]$ represents the concatenated feature of node $i$. A classifier consisting of a fully connected layer and a sigmoid function outputs the prediction result of the time-series graph, $\hat{y} = \sigma(W h_{graph} + b)$, where $W$ and $b$ are model parameters of the time-weighted graph neural network. Training with the following loss function yields the trained time-weighted graph neural network: $L_{total} = L_{task}(\hat{y}, y) + R(g)$, where $L_{total}$ is the total loss used to train the time-weighted graph neural network, $L_{task}$ is the main task loss, $R(g)$ is the sparse regularization term, and $y$ is the synthetic result corresponding to the time-series graph. The synthetic result is either survival or death and can be determined from the synthetic medical data of the last time step in the time-series data.
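A forward pass through the network described above can be sketched with NumPy. Several details are assumptions rather than statements of the method: symmetric normalization with self-loops, ReLU activations, the gate form sigmoid(alpha / temperature), binary cross-entropy as the main task loss, and an L1 penalty on the gates as the sparse regularization term. Training is omitted, and all names and sizes are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def normalize(adj):
    """Symmetrically normalize the weighted adjacency after adding self-loops."""
    a = adj + np.eye(adj.shape[0])
    d = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d[:, None] * d[None, :]


def gcn_layer(a_norm, h, w):
    """One graph convolution followed by ReLU (activation is an assumption)."""
    return np.maximum(a_norm @ h @ w, 0.0)


def forward(x, adj, params, temperature=1.0):
    a_norm = normalize(adj)
    h1 = gcn_layer(a_norm, x, params["w1"])            # feature mapping
    h2 = gcn_layer(a_norm, h1, params["w2"])           # higher-order neighborhoods
    h3 = gcn_layer(a_norm, h2, params["w3"])           # refined representation
    gates = sigmoid(params["alpha"] / temperature)     # one gate per layer
    h_pruned = np.concatenate(
        [g * h for g, h in zip(gates, (h1, h2, h3))], axis=1)
    h_graph = h_pruned.mean(axis=0)                    # global average pooling
    y_hat = sigmoid(h_graph @ params["w_out"] + params["b_out"])
    return float(y_hat), gates


def total_loss(y_hat, y, gates, reg_weight=1e-3):
    """Binary cross-entropy plus an L1 sparsity term on the gates (both assumed)."""
    tiny = 1e-12
    bce = -(y * np.log(y_hat + tiny) + (1 - y) * np.log(1 - y_hat + tiny))
    return float(bce + reg_weight * np.sum(np.abs(gates)))


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                 # 4 time-step nodes, 3 indicators
adj = np.zeros((4, 4))
for i in range(3):
    adj[i, i + 1] = adj[i + 1, i] = 0.5     # placeholder edge weights
params = {
    "w1": rng.normal(scale=0.1, size=(3, 8)),
    "w2": rng.normal(scale=0.1, size=(8, 8)),
    "w3": rng.normal(scale=0.1, size=(8, 8)),
    "alpha": np.zeros(3),
    "w_out": rng.normal(scale=0.1, size=24),
    "b_out": 0.0,
}
y_hat, gates = forward(x, adj, params)
loss = total_loss(y_hat, 1, gates)
```

A gate driven toward zero during training effectively removes its layer's contribution, which is the pruning effect the gating weights are meant to provide.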
[0178] The prediction result for each piece of validation medical data is either survival or death. Likewise, the actual result for each piece of validation medical data is either survival or death.
[0179] Evaluation metrics include, but are not limited to, AUC (Area Under Curve), F1 Score, Acc (accuracy), Recall (recall rate), and Precision (precision rate).
[0180] When there are multiple types of evaluation indicators, determining the validity of the time-series data based on these indicators includes: determining that the time-series data is valid when at least one evaluation indicator is greater than its corresponding preset threshold; or, determining that the time-series data is valid when every evaluation indicator is greater than its corresponding preset threshold. Validity refers to whether the time-series data is usable in a real clinical setting, that is, whether the time-series data conforms to statistical laws, meets medical constraints, and accurately reflects the disease progression.
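The validity decision can be sketched as follows. The sketch assumes binary labels with 1 as the positive class and covers accuracy, precision, recall, and F1 only (AUC needs predicted scores rather than labels); the thresholds and function names are illustrative.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


def is_valid(metrics, thresholds, rule="all"):
    """Apply the 'every indicator' or 'at least one indicator' threshold rule."""
    checks = [metrics[name] > threshold for name, threshold in thresholds.items()]
    return all(checks) if rule == "all" else any(checks)


metrics = classification_metrics([1, 0, 1, 1], [1, 0, 0, 1])
thresholds = {"accuracy": 0.7, "recall": 0.7}   # placeholder preset thresholds
```

In this example the accuracy is 0.75 and the recall is 2/3, so the data passes under the "at least one" rule but not under the "every indicator" rule.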
[0181] Furthermore, if the target medical data includes clinical data collected from patients with the target disease during the target clinical stage, and each target clinical stage has data from at least two time steps, then synthetic medical data for each time step in the target clinical stage and subsequent clinical stages can be generated using the data from the two time steps collected during the target clinical stage, the medical constraints corresponding to the target clinical stage, and the medical constraints corresponding to subsequent clinical stages. The target clinical stage refers to any clinical stage, and subsequent clinical stages refer to clinical stages following the target clinical stage in the course of disease development. The medical constraints corresponding to the target clinical stage refer to medical constraints used to constrain the range of various indicators, the direction of indicator change, and the path of metastasis during the target clinical stage. Similarly, the medical constraints corresponding to subsequent clinical stages refer to medical constraints used to constrain the range of various indicators, the direction of indicator change, and the path of metastasis during subsequent clinical stages. Specifically, by executing S2, S3, S4, S5, and S6 based on the data collected at two time steps in the target clinical phase and the corresponding medical constraints, synthetic medical data for multiple time steps in the target clinical phase can be obtained; similarly, by executing S2, S3, S4, S5, and S6 based on the data collected at two time steps in the target clinical phase and the corresponding medical constraints in subsequent clinical phases, synthetic medical data for multiple time steps in subsequent clinical phases can be obtained.
[0182] In this embodiment, for each piece of medical data, each time step in each clinical stage is taken as a node, and the synthetic medical data at each time step is taken as the node feature. Connecting the nodes yields a time series graph for each piece of medical data. The edge weight between two connected nodes in the time series graph is determined based on the time difference between those two nodes. A time series weighted graph neural network is trained on the time series graphs to obtain a trained time series weighted graph neural network. Multiple preset verification medical data are input into the trained time series weighted graph neural network to obtain a prediction result for each verification medical data. Based on each prediction result and the actual result of each verification medical data, the evaluation indicator of the trained time series weighted graph neural network is determined. In this way, the validity of the time-series data can be determined based on the evaluation indicator.
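The graph construction in this paragraph can be sketched as below. The sketch makes several assumptions that are not stated in this application: nodes are connected in a chain between consecutive time steps, the edge weight is `exp(-dt / tau)` so that closer time steps are linked more strongly, and the names `TimeSeriesGraph`, `build_graph`, and `tau` are hypothetical. Training the graph neural network itself is left out.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class TimeSeriesGraph:
    node_features: list[list[float]]  # one feature vector per time step (node)
    edges: list[tuple[int, int, float]] = field(default_factory=list)  # (src, dst, weight)


def build_graph(times: list[float], features: list[list[float]],
                tau: float = 1.0) -> TimeSeriesGraph:
    """Connect consecutive time steps and weight each edge by the time gap.

    Assumed weighting: exp(-dt / tau), where dt is the time difference
    between the two connected nodes.
    """
    if len(times) != len(features):
        raise ValueError("times and features must have the same length")
    graph = TimeSeriesGraph(node_features=[list(f) for f in features])
    for i in range(len(times) - 1):
        dt = times[i + 1] - times[i]
        graph.edges.append((i, i + 1, math.exp(-dt / tau)))
    return graph


# Hypothetical synthetic record: three time steps, two indicators per step.
g = build_graph([0.0, 1.0, 3.0], [[98.0, 1.2], [104.0, 1.8], [112.0, 2.9]])
for src, dst, w in g.edges:
    print(src, dst, round(w, 4))
```

Any monotonically decreasing function of the time difference would serve the same illustrative purpose; the exponential is chosen here only for simplicity.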
[0183] This application also provides an application scenario in which the aforementioned time-series data generation method based on clinical knowledge and a differential diffusion model is applied. Specifically, the application of this time-series data generation method based on clinical knowledge and a differential diffusion model in this scenario is as follows:
[0184] As shown in Figure 4, multiple medical data points are acquired, and multiple medical constraints are extracted from a multi-source clinical knowledge base; each medical data point includes data from at least two time steps. The difference between the data at the two time steps of each medical data point is determined. Based on a preset noise sequence and the difference results, forward diffusion is used to determine the noise addition results for multiple time steps; the noise sequence includes the added noise for each time step. Backward diffusion is used to denoise the noise addition result at each time step, yielding a denoised result for each time step, and a penalty value is obtained based on the denoised results and the medical constraints. When the penalty value is greater than the penalty threshold, the noise sequence is updated, and the noise addition results, denoised results, and penalty value are recalculated based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold; the target difference results for each time step are then output. Based on the data from the first time step of each medical data point and the corresponding target difference results, the time-series data corresponding to each medical data point is obtained; the time-series data includes synthetic medical data from multiple time steps. Based on the time-series data of each clinical stage, the reliability value of the time-series data is calculated using the formula ..., whose terms denote: clinical stage n (S_n); the weight of clinical stage n; the conditional probability distribution of the real data for the target disease at clinical stage n; the conditional probability distribution of the time-series data for the target disease at clinical stage n; the KL divergence, KL(||); and the target difference result. When the reliability value of the time-series data in the target stage is greater than or equal to the reliability threshold, the time-series data in the target stage is determined to be reliable; the target stage is one of the clinical stages. For each medical data point, each time step in each clinical stage is used as a node, and the synthetic medical data at each time step is used as the node feature. These nodes are connected to obtain a time series graph for each medical data point. The edge weight between two connected nodes in the time series graph is determined based on the time difference between those two nodes. A time series weighted graph neural network is trained on the time series graphs to obtain the trained time series weighted graph neural network. Multiple preset verification medical data points are input into the trained time series weighted graph neural network to obtain the prediction result for each verification medical data point. Based on each prediction result and the actual result of each verification medical data point, an evaluation indicator for the trained time series weighted graph neural network is determined. Based on the evaluation indicator, the validity of the time-series data is determined.
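The reliability formula itself is not legible in this text, so the following is only a sketch of how a stage-weighted, KL-based reliability value could be computed. It assumes one plausible form, `exp(-sum_n w_n * KL(P_n || Q_n))`, which equals 1 when the synthetic and real distributions match and decreases as they diverge; this form, the discrete (histogram) representation, and the names `kl_divergence` and `reliability` are assumptions, not the formula of this application.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions defined over the same bins."""
    ps = [x + eps for x in p]  # smoothing avoids log(0) and division by zero
    qs = [x + eps for x in q]
    zp, zq = sum(ps), sum(qs)
    return sum((a / zp) * math.log((a / zp) / (b / zq)) for a, b in zip(ps, qs))


def reliability(real_by_stage, synth_by_stage, weights):
    """Assumed form: exp(-sum_n w_n * KL(real_n || synthetic_n))."""
    total = sum(
        weights[stage] * kl_divergence(real_by_stage[stage], synth_by_stage[stage])
        for stage in weights
    )
    return math.exp(-total)


# Hypothetical per-stage histograms of one indicator, for illustration only.
real = {"stage_1": [0.2, 0.5, 0.3], "stage_2": [0.1, 0.3, 0.6]}
synth = {"stage_1": [0.25, 0.45, 0.3], "stage_2": [0.3, 0.4, 0.3]}
weights = {"stage_1": 0.5, "stage_2": 0.5}

score = reliability(real, synth, weights)
threshold = 0.9  # hypothetical reliability threshold
print(score, score >= threshold)
```

Under this assumed form, the comparison against the reliability threshold works as the paragraph describes: a value at or above the threshold marks the stage's time-series data as reliable.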
[0185] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0186] Based on the same inventive concept, this application also provides a time-series data generation apparatus based on clinical knowledge and a differential diffusion model for implementing the time-series data generation method based on clinical knowledge and a differential diffusion model as described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method. Therefore, the specific limitations of one or more embodiments of the time-series data generation apparatus based on clinical knowledge and a differential diffusion model provided below can be found in the limitations of the time-series data generation method based on clinical knowledge and a differential diffusion model described above, and will not be repeated here.
[0187] In one embodiment, a time-series data generation apparatus based on clinical knowledge and a differential diffusion model is provided, which is used to perform the following steps:
[0188] S1. Acquire multiple medical data points and extract multiple medical constraints from a multi-source clinical knowledge base; each medical data point includes data from at least two time steps.
[0189] S2. Determine the difference between the data at the two time steps of each medical data point.
[0190] S3. Based on a preset noise sequence and the difference results, determine the noise addition results for multiple time steps by forward diffusion; the noise sequence includes the added noise for each time step.
[0191] S4. Denoise the noise addition result at each time step by backward diffusion to obtain the denoised result at each time step, and obtain a penalty value based on the denoised results and the medical constraints.
[0192] S5. When the penalty value is greater than the penalty threshold, update the noise sequence in S3 and repeat S3 and S4 based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold, then output the target difference result for each time step.
[0193] S6. Based on the data of the first time step in each medical data point and the corresponding target difference results, obtain the time-series data corresponding to each medical data point; the time-series data includes synthetic medical data from multiple time steps.
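The control flow of S2 to S6 can be sketched for a single scalar indicator as follows. Everything beyond that control flow is an assumption made for illustration: the forward step uses the standard closed-form DDPM expression; a placeholder that predicts zero noise stands in for the pre-trained denoising network; hand-written range, direction, and jump rules with unit weights stand in for the penalty scoring; and halving the noise sequence is an assumed update rule. All function and parameter names are hypothetical.

```python
import math
import random


def forward_diffusion(delta, betas, rng):
    """S3: noise the difference result at every step (assumed DDPM closed form)."""
    noised, abar = [], 1.0
    for beta in betas:
        abar *= 1.0 - beta                   # cumulative coefficient
        eps = rng.gauss(0.0, 1.0)            # standard Gaussian noise
        noised.append(math.sqrt(abar) * delta + math.sqrt(1.0 - abar) * eps)
    return noised


def denoise(noised, betas, predict_noise):
    """S4: remove the predicted noise from each noised result (simplified)."""
    denoised, abar = [], 1.0
    for k, (x_k, beta) in enumerate(zip(noised, betas)):
        abar *= 1.0 - beta
        eps_hat = predict_noise(x_k, k)      # placeholder for the trained network
        denoised.append((x_k - math.sqrt(1.0 - abar) * eps_hat) / math.sqrt(abar))
    return denoised


def penalty(start, deltas, lo, hi, direction, max_jump):
    """S4: sum of range, change-direction, and jump violations (unit weights)."""
    total, value = 0.0, start
    for d in deltas:
        value += d
        total += max(0.0, lo - value) + max(0.0, value - hi)  # indicator range
        total += max(0.0, -direction * d)                      # change direction
        total += max(0.0, abs(d) - max_jump)                   # jump threshold
    return total


def generate(x_first, x_second, betas, constraints, threshold, seed=0, max_rounds=50):
    """Run S2 to S6; returns (synthetic series, final penalty value)."""
    rng = random.Random(seed)
    delta = x_second - x_first                                 # S2
    for _ in range(max_rounds):
        noised = forward_diffusion(delta, betas, rng)          # S3
        deltas = denoise(noised, betas, lambda x_k, k: 0.0)    # S4
        pen = penalty(x_first, deltas, **constraints)          # S4
        if pen <= threshold:                                   # S5: stop condition
            break
        betas = [0.5 * b for b in betas]                       # S5: assumed update
    series = [x_first]                                         # S6: cumulative sum
    for d in deltas:
        series.append(series[-1] + d)
    return series, pen


# Hypothetical heart-rate example: two observed values, five generated steps.
constraints = {"lo": 60.0, "hi": 130.0, "direction": 1.0, "max_jump": 8.0}
series, pen = generate(90.0, 94.0, [0.5] * 5, constraints, threshold=0.1)
print([round(v, 2) for v in series], pen)
```

In this toy version, shrinking the noise sequence pulls each denoised difference back toward the observed difference, so the loop ends once the trajectory satisfies the three rule-based constraints.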
[0194] The modules in the aforementioned time-series data generation device based on clinical knowledge and differential diffusion models can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0195] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 5. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores various types of data. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a time-series data generation method based on clinical knowledge and a differential diffusion model.
[0196] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0197] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0198] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.
[0199] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0200] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties.
[0201] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments described above. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0202] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0203] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A method for generating time-series data based on clinical knowledge and a differential diffusion model, characterized in that the method includes:
S1. Acquire multiple medical data points and extract multiple medical constraints from a multi-source clinical knowledge base; each medical data point includes data from at least two time steps;
S2. Determine the difference between the data at the two time steps of each medical data point;
S3. Based on a preset noise sequence and the difference result, determine the noise addition results for multiple time steps by forward diffusion; the noise sequence includes the added noise for each time step;
S4. Denoise the noise addition result at each time step by backward diffusion to obtain the denoised result at each time step, and obtain a penalty value based on the denoised results and the medical constraints;
S5. When the penalty value is greater than the penalty threshold, update the noise sequence in S3 and repeat S3 and S4 based on the updated noise sequence until the penalty value is less than or equal to the penalty threshold, then output the target difference result at each time step;
S6. Based on the data of the first time step in each medical data point and the corresponding target difference results, obtain the time-series data corresponding to each medical data point; the time-series data includes synthetic medical data of multiple time steps.
2. The method according to claim 1, characterized in that the extraction process of the medical constraints described in step S1 includes:
text content related to the target disease is extracted from the multi-source clinical knowledge base, and the text content is preprocessed to obtain multiple structured text blocks;
based on a preset indicator range extraction template, first information including the indicator range is extracted from each of the text blocks;
based on a preset trend direction extraction template, second information representing the indicator change trend of the target disease at each clinical stage is extracted from each of the text blocks;
based on a preset covariation pattern extraction template, related indicator pairs and the covariation direction of the indicator pairs are extracted from each of the text blocks, the related indicator pairs and the covariation direction of the indicator pairs constituting third information;
the first information, the second information, and the third information are deduplicated and conflict-resolved to obtain the processed first information, second information, and third information;
based on the processed first information, second information, and third information, the medical constraints are determined.
3. The method according to claim 2, characterized in that determining the medical constraints based on the processed first information, second information, and third information includes:
in response to annotation operations on each of the first information, the second information, and the third information, at least one first annotation result for each of the first information, at least one second annotation result for each of the second information, and at least one third annotation result for each of the third information are obtained; the first annotation result characterizes whether the first information is valid, the second annotation result characterizes whether the second information is valid, and the third annotation result characterizes whether the third information is valid;
based on each of the first annotation results, the second annotation results, and the third annotation results, invalid first target information is determined from the first information, invalid second target information is determined from the second information, and invalid third target information is determined from the third information;
in response to a correction operation on the first target information, the second target information, and the third target information, corrected information is obtained;
based on the corrected information, the valid first information, the valid second information, and the valid third information, the medical constraints are determined.
4. The method according to claim 3, characterized in that determining the medical constraints based on the corrected information, the valid first information, the valid second information, and the valid third information includes:
in response to an information supplementation operation for the evolution pattern of the target disease, supplementary information characterizing the evolution pattern of the target disease is obtained; the supplementary information includes baseline drift information characterizing the baseline shift of physiological indicators caused by immune aging and underlying diseases, delayed response information characterizing the delayed or weakened response to treatment interventions, and organ coupling information characterizing cross-organ effect constraints;
natural language parsing is performed on the supplementary information, the corrected information, the valid first information, the valid second information, and the valid third information to obtain multiple parsed texts;
entities are extracted from each of the parsed texts, and the triples corresponding to each parsed text are constructed based on the entities extracted from that parsed text;
each of the triples is subjected to conflict resolution processing, and the triples obtained after conflict resolution are used as the medical constraints.
5. The method according to claim 1, characterized in that step S3 includes:
for each piece of medical data, based on the preset noise sequence and the corresponding difference result, the noise addition results at multiple time steps are determined through ... or ...;
wherein the terms of the formulas denote: the difference result; the noise addition result at time step k; the noise added at time step k; the noise addition result at time step k-1; the cumulative coefficient for time step k; and standard Gaussian noise; k is a non-negative integer.
6. The method according to claim 1, characterized in that the medical constraints include a first constraint corresponding to the indicator range, a second constraint corresponding to the indicator change trend, and a third constraint corresponding to the jump threshold, and step S4 includes:
for each piece of medical data, the noise addition result at each time step corresponding to the medical data, the clinical stage of each time step, and the medical constraints are input into a pre-trained denoising network to obtain the noise to be removed from each noise addition result;
based on the noise to be removed from each noise addition result, the noise addition result at each time step is denoised through ... to obtain the denoised result at each time step;
the denoised results and the medical constraints are input into a pre-trained language model to obtain the indicator range penalty value corresponding to the first constraint, the change penalty value corresponding to the second constraint, and the jump penalty value corresponding to the third constraint;
the penalty value is obtained as the weighted sum of the indicator range penalty value, the change penalty value, and the jump penalty value;
wherein the terms of the formula denote: the denoised result at time step k-1; the noise added at time step k; the noise addition result at time step k; the cumulative coefficient for time step k; the noise to be removed from the noise addition result at time step k; the variance at time step k; and z, which represents random noise.
7. The method according to claim 1, characterized in that step S6 includes:
for each piece of medical data, based on the data of the first time step in the medical data and the target difference results, the time-series data corresponding to the medical data is obtained through ...; the time-series data includes synthetic medical data of multiple time steps;
wherein the terms of the formula denote: the synthetic medical data at time step k+1; the target difference result at time step k; and a starting value which, when k is 0, is the data of the first time step in the medical data and, when k is greater than 0, is the synthetic medical data at time step k.
8. The method according to claim 1, characterized in that the time-series data includes time-series data of the target disease at at least one clinical stage, and the method further includes:
based on the time-series data of each clinical stage, the reliability value of the time-series data is calculated using the formula ...;
wherein the terms of the formula denote: clinical stage n (S_n); the weight of clinical stage n; the conditional probability distribution of the real data for the target disease at clinical stage n; the conditional probability distribution of the time-series data for the target disease at clinical stage n; the KL divergence, KL(||); and the target difference result;
when the reliability value of the time-series data in the target stage is greater than or equal to the reliability threshold, the time-series data in the target stage is determined to be reliable; the target stage is one of the clinical stages.
9. The method according to claim 1 or 8, characterized in that the time-series data includes time-series data of the target disease at at least one clinical stage, and the method further includes:
for each piece of medical data, each time step in each clinical stage is taken as a node, and the synthetic medical data at each time step is taken as the node feature; the nodes are connected to obtain a time series graph of each piece of medical data; the edge weight between two connected nodes in the time series graph is determined based on the time difference between those two nodes;
a time series weighted graph neural network is trained based on each of the time series graphs to obtain the trained time series weighted graph neural network;
multiple preset sets of verification medical data are input into the trained time series weighted graph neural network to obtain the prediction result for each set of verification medical data;
based on the prediction results and the actual results of the verification medical data, the evaluation indicator of the trained time series weighted graph neural network is determined;
the validity of the time-series data is determined based on the evaluation indicator.
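The formulas referenced in claims 5 to 7 are not legible in this text. For orientation only, the standard denoising diffusion forms that fit the term descriptions in those claims are written out below. The notation ($\Delta$ for the difference result, $x_k$ for the noise addition result, $\beta_k$ for the added noise, $\bar{\alpha}_k$ for the cumulative coefficient, $\epsilon_\theta$ for the noise to be removed, $\sigma_k^2$ for the variance, $\hat{y}_k$ for the synthetic medical data, $\Delta_k$ for the target difference result) is assumed here and is not taken from the application; the claimed formulas may differ.

```latex
% Forward diffusion (claim 5): step-wise form or closed form, with x_0 = \Delta
x_k = \sqrt{1-\beta_k}\, x_{k-1} + \sqrt{\beta_k}\, \epsilon
\quad\text{or}\quad
x_k = \sqrt{\bar{\alpha}_k}\, \Delta + \sqrt{1-\bar{\alpha}_k}\, \epsilon,
\qquad
\bar{\alpha}_k = \prod_{i=1}^{k} (1-\beta_i), \quad \epsilon \sim \mathcal{N}(0, I)

% Backward diffusion step (claim 6)
x_{k-1} = \frac{1}{\sqrt{1-\beta_k}}
\left( x_k - \frac{\beta_k}{\sqrt{1-\bar{\alpha}_k}}\, \epsilon_\theta(x_k, k) \right)
+ \sigma_k z

% Reconstruction of the time series (claim 7), with \hat{y}_0 the first observed time step
\hat{y}_{k+1} = \hat{y}_k + \Delta_k
```

The two forward forms are equivalent in distribution: applying the step-wise rule k times to $x_0 = \Delta$ yields the closed form, because the product of the per-step factors $(1-\beta_i)$ is exactly $\bar{\alpha}_k$.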