Machine learning based early identification of septic shock

By combining the Mondrian forest model and the random survival forest model, a joint risk score sequence is generated, which addresses the shortcomings of existing septic shock prediction methods in modeling mechanism and risk score generation. This enables efficient identification and early warning of septic shock and improves the prediction accuracy and robustness of the model.

CN120913837B · Active · Publication Date: 2026-06-16 · XUANWU HOSPITAL OF CAPITAL UNIV OF MEDICAL SCI

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XUANWU HOSPITAL OF CAPITAL UNIV OF MEDICAL SCI
Filing Date: 2025-07-22
Publication Date: 2026-06-16

Abstract

The application discloses a machine learning-based method for early identification of septic shock, comprising the following steps: acquiring physiological data and medical record information of a patient and constructing a multidimensional feature sequence arranged in chronological order; performing missing value filling, standardization, and sliding window division to extract feature subsequences with time continuity; generating a preliminary risk score using a Mondrian forest model, analyzing the score change trend, and identifying time periods in which an abnormal change may exist; training a random survival forest model in combination with the actual disease annotation of the patient to generate a second group of risk scores; and fusing the two groups of scores, fitting and analyzing the trend of the score sequence, and judging whether a significant upward trend exists, so as to realize early warning of septic shock. The application helps improve the timeliness and accuracy of septic shock identification.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a method for early identification of septic shock based on machine learning.

Background Technology

[0002] With the development of medical informatization and intelligent diagnosis and treatment systems, patient risk prediction based on real-time physiological parameters and electronic medical record data has become an important research direction for assisting clinical decision-making. As a highly fatal clinical emergency, early identification of septic shock is crucial for improving the success rate of resuscitation. Currently, clinical practice often employs rule-based threshold discrimination, scoring systems, or traditional machine learning models for preliminary identification and early warning of septic shock.

[0003] Existing methods for predicting septic shock still have significant shortcomings in terms of modeling mechanisms, utilization of temporal features, and generation of risk scores. On the one hand, traditional models often rely on static features or single-time-slice data, making it difficult to capture the high-dimensional dynamic patterns of patient status evolution over time. They lack in-depth modeling of risk change trends over continuous time periods, resulting in insufficient model sensitivity and timeliness. On the other hand, existing scoring mechanisms often use the output of a single model, failing to adaptively adjust the training window or improve local prediction accuracy based on local fluctuations in predicted risk. They also lack mechanisms for jointly fusing outputs from multiple models, which is detrimental to improving prediction robustness and the ability to capture abnormal changes.

[0004] Therefore, how to provide a machine learning-based method for early identification of septic shock is a problem that urgently needs to be solved by those skilled in the art.

Summary of the Invention

[0005] One objective of this invention is to propose an early identification method for septic shock based on machine learning. This invention fully integrates time series physiological data processing, the Mondrian forest model online scoring mechanism, and the random survival forest local modeling method. It describes in detail the entire process of risk score generation, identification of score change mutations, model fusion, and early warning signal output. It has the advantages of strong identification timeliness, high dynamic prediction accuracy, and sensitive response to sudden risks.

[0006] The machine learning-based early identification method for septic shock according to an embodiment of the present invention includes the following steps:

[0007] Acquire real-time physiological parameter data and electronic medical record data of the target patient, and generate a raw multidimensional feature sequence arranged in chronological order;

[0008] The original multidimensional feature sequence is subjected to missing value completion processing, numerical normalization processing, and fixed-length sliding window segmentation processing to obtain continuous time-segment feature subsequences;

[0009] The feature subsequence of each time period is input into the Mondrian Forest model for online training to obtain the corresponding first predicted risk score sequence.

[0010] The rate of change of the first predicted risk score sequence is calculated, the interval of score mutation is identified, and the corresponding time period feature subsequence is extracted as a candidate modeling window.

[0011] Label the survival status and time of the current sample within the candidate modeling window, construct training samples with censoring information, input them into the random survival forest model for local training, and output the second risk score sequence.

[0012] Perform risk score fusion processing on the first predicted risk score sequence and the second risk score sequence within the candidate modeling window to generate a joint risk score sequence;

[0013] The trend change is fitted based on the joint risk score sequence, and the inflection point of rapid score increase is identified to output an early warning signal for septic shock.

[0014] Optionally, the generation of the original multidimensional feature sequence arranged in chronological order includes:

[0015] Acquire real-time physiological parameter data and electronic medical record structured data of the target patient. The real-time physiological parameter data includes continuously monitored indicators recorded at fixed sampling periods, and the electronic medical record structured data includes clinical diagnostic information, test results and treatment records with time tags.

[0016] Time-stamping is performed on real-time physiological parameter data and electronic medical record structured data. Based on a unified time benchmark, the two data sources are time-aligned to construct data synchronization units for corresponding time points.

[0017] For each data synchronization unit, field filtering and structure standardization are performed to remove null fields and entries that do not meet the unified format requirements, and the data is converted into a fixed-length feature structure.

[0018] All eligible data synchronization units are arranged in chronological order to generate an original multidimensional feature sequence with a continuous time structure.
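
The alignment and ordering steps above can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the field names in `FIELDS`, the 60-second grid used as the unified time benchmark, and the choice to keep absent fields as `None` placeholders (so that a later completion step can fill them) are all assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical fixed-length feature layout; the patent does not name the fields.
FIELDS = ("heart_rate", "map", "lactate")

def build_feature_sequence(vitals, emr, step=60):
    """Align two timestamped sources on a shared grid and order the units by time.

    vitals, emr: iterables of (timestamp_seconds, {field: value}) records.
    Returns a list of (grid_time, [value or None for each entry of FIELDS]).
    """
    units = defaultdict(dict)
    for ts, record in list(vitals) + list(emr):
        grid_time = (ts // step) * step  # snap to the unified time benchmark
        for name, value in record.items():
            # Field filtering: drop unknown fields and null values.
            if name in FIELDS and value is not None:
                units[grid_time][name] = value
    # Fixed-length structure per synchronization unit, in chronological order.
    return [(t, [units[t].get(f) for f in FIELDS]) for t in sorted(units)]
```

For example, a vitals record at 65 s and a laboratory record at 70 s would both fall into the synchronization unit at 60 s under this assumed grid.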

[0019] Optionally, obtaining the continuous time-segment feature subsequence includes:

[0020] For each feature field in the original multidimensional feature sequence arranged in chronological order, missing values are identified, the time point where each missing value is located is recorded, and the adjacent valid data points before and after it are located.

[0021] After missing values are identified, missing value completion is performed on the feature fields that contain them: each missing value is interpolated from the located valid data points before and after it, generating the original multidimensional feature sequence after missing value completion.

[0022] Numerical normalization is performed on the original multidimensional feature sequence after missing value completion: the numerical range of each feature field in the sequence is determined, and all fields are then normalized according to a unified normalization rule to generate the normalized original multidimensional feature sequence.

[0023] Set a fixed-length sliding time window and sliding step size parameter, take the normalized original multidimensional feature sequence as input, divide the window according to time order, and extract the set of continuous feature fields covered by each time window.

[0024] The results of each window segmentation are organized, and the set of feature fields corresponding to the time window is organized into a feature subsequence for a time period.
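
The missing value completion and window division described above might look like the following sketch. It assumes linear interpolation between the neighboring valid points and, for gaps at the edges of the sequence, copying the nearest valid value; the patent only states that interpolation uses the valid points before and after the gap, so both choices are assumptions. The function names are hypothetical.

```python
def interpolate_missing(values):
    """Fill None entries by linear interpolation between neighboring valid points.

    Edge gaps (no valid point on one side) copy the nearest valid value.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return filled  # nothing to interpolate from
    for i, v in enumerate(values):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:
            filled[i] = values[nxt]
        elif nxt is None:
            filled[i] = values[prev]
        else:
            weight = (i - prev) / (nxt - prev)
            filled[i] = values[prev] + weight * (values[nxt] - values[prev])
    return filled

def sliding_windows(rows, length, step):
    """Cut a time-ordered sequence into fixed-length windows with a fixed step."""
    return [rows[s:s + length] for s in range(0, len(rows) - length + 1, step)]
```

Each returned window corresponds to one time-period feature subsequence; overlapping windows result whenever `step` is smaller than `length`.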

[0025] Optionally, obtaining the first predicted risk score sequence includes:

[0026] The time period feature subsequences are input into the Mondrian Forest model in chronological order. The Mondrian Forest model includes multiple decision trees that support dynamic structural expansion.

[0027] For each decision tree, a feature subsequence of a time period is received. Starting from the root node of the decision tree, the feature subsequence of the current time period is judged layer by layer to determine whether it meets the feature partitioning boundary set by the current node.

[0028] When the time period feature subsequence is not fully received by the current path structure, an incremental partitioning operation is performed at the current node according to the partitioning mechanism of the Mondrian Forest model, expanding the new child node, and sending the time period feature subsequence into the newly added child node structure;

[0029] When the feature subsequence of a time period satisfies a certain partitioning path condition, the path is traversed downwards to the leaf node of the decision tree;

[0030] In the leaf nodes of the decision tree, update the statistical information related to the feature subsequence of the current time period. The statistical information includes the total number of samples, the cumulative survival time, and the event status label.

[0031] Based on the statistical information in the leaf nodes of the decision tree, the survival probability estimate of the feature subsequence in the current time period in the decision tree is calculated.

[0032] The survival probability estimates computed by all decision trees for the feature subsequence of the current time period are weighted and averaged to generate the corresponding first predicted risk score.

[0033] All time period feature subsequences are processed sequentially in chronological order. Multiple first predicted risk scores are generated using the Mondrian Forest model, and all first predicted risk scores are arranged in chronological order to form a first predicted risk score sequence.

[0034] Optionally, the generation of the candidate modeling window includes:

[0035] Receive a first predicted risk score sequence, and construct a score change rate sequence between adjacent score points in the first predicted risk score sequence in chronological order. The score change rate is the difference between any two adjacent first predicted risk scores divided by the score time interval.

[0036] Based on the rate of change sequence of scores, the mean of the local rate of change within the sliding time window is calculated to obtain the mean of the local rate of change of scores at each scoring time point.

[0037] For each scoring time point in the scoring change rate sequence, compare the scoring change rate with the corresponding local scoring change rate mean. When the scoring change rate is greater than the local scoring change rate mean multiplied by a preset abnormal threshold multiple, record the corresponding scoring time point as the mutation start point.

[0038] Centered on the mutation initiation point, a fixed-length time range is extended forward and backward respectively. The corresponding time period feature subsequences within the covered time range are extracted to generate the initial mutation interval fragment.

[0039] Determine whether the rate of change of scores at each scoring time point in the initial mutation interval segment is continuously greater than the global average rate of change of the score change rate sequence. If the continuity condition is met, the set of time period feature subsequences corresponding to the initial mutation interval segment is confirmed as a candidate modeling window.
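
The candidate-window logic above can be illustrated with the sketch below. Several details are assumptions made only for illustration: the local mean is taken over a trailing window that includes the current point, the extension is measured in scoring steps rather than in time units, and the parameter names and default values are hypothetical.

```python
def candidate_windows(times, scores, local_len=3, multiple=1.5, half_width=1):
    """Return (start, end) index ranges into the change-rate sequence.

    Rate i describes the change between score points i and i + 1.
    """
    rates = [(scores[i + 1] - scores[i]) / (times[i + 1] - times[i])
             for i in range(len(scores) - 1)]
    if not rates:
        return []
    global_mean = sum(rates) / len(rates)
    windows = []
    for i, rate in enumerate(rates):
        recent = rates[max(0, i - local_len + 1): i + 1]  # trailing local window
        local_mean = sum(recent) / len(recent)
        if rate > multiple * local_mean:  # mutation start point
            lo = max(0, i - half_width)
            hi = min(len(rates) - 1, i + half_width)
            # Continuity check: every rate in the segment exceeds the global mean.
            if all(r > global_mean for r in rates[lo:hi + 1]):
                windows.append((lo, hi))
    return windows
```

With scores `[10, 10, 11, 11, 20, 32, 45, 46]` on a hypothetical 0 to 100 scale at unit time steps, the isolated small rise early in the sequence is flagged as a mutation start point but rejected by the continuity check, while the sustained rise is kept as a candidate window.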

[0040] Optionally, the generation of the second risk score sequence includes:

[0041] Extract the feature subsequence of each time period within the candidate modeling window;

[0042] The system retrieves septic shock diagnosis records with time tags from the structured data of electronic medical records, and performs survival status labeling operations based on the time points corresponding to the time period feature subsequences. If there are septic shock diagnosis records before the current time point, the current time point is marked as the endpoint event; otherwise, it is marked as a censoring event.

[0043] Based on the time interval between the first collection time of physiological parameter data of the target patient and the current time point, calculate the survival time corresponding to the feature subsequence of each time period;

[0044] The time-period feature subsequences, survival status labels, and survival times are combined into training sample data.

[0045] Input the training sample data into the random survival forest model to generate the trained random survival forest model;

[0046] All time-segment feature subsequences in the candidate modeling window are sequentially input into the trained random survival forest model, and the survival probability estimate corresponding to each time-segment feature subsequence is calculated in each survival decision tree.

[0047] The survival probability estimates obtained for each time-period feature subsequence across all survival decision trees are weighted and integrated to generate a second predicted risk score; all the second predicted risk scores are arranged in chronological order to form a second risk score sequence.
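
The labeling of survival status and survival time described above can be sketched as follows. This is a minimal illustration under stated assumptions: timestamps are plain numbers, a diagnosis recorded exactly at the current time point counts as already present, and the function name is hypothetical.

```python
def label_samples(window_times, first_collection_time, diagnosis_time=None):
    """Return one (event, survival_time) pair per time point in the window.

    event is 1 (endpoint event) when a septic shock diagnosis exists at or
    before the time point, and 0 (censored) otherwise. survival_time is the
    interval since the first collection of physiological data.
    """
    samples = []
    for t in window_times:
        diagnosed = diagnosis_time is not None and diagnosis_time <= t
        samples.append((1 if diagnosed else 0, t - first_collection_time))
    return samples
```

A patient with no diagnosis record yields only censored samples, which is the case the censoring information in the training data is meant to represent.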

[0048] Optionally, the random survival forest model includes:

[0049] The feature subsequences, survival status labels, and survival times of all time periods contained in the candidate modeling window are combined to form training sample data.

[0050] The training sample data is subjected to feature perturbation-guided stratified sampling, which includes: dividing the training sample data into multiple risk level strata based on indicators reflecting the severity of clinical conditions in the structured data of electronic medical records; and, within each risk level stratum, introducing feature perturbation into the training sample data before performing Bootstrap sampling, so as to generate multiple risk-sensitive sampling subsets.

[0051] Train a survival decision tree for each risk-sensitive sampling subset, and repeat the above training process to construct a random survival forest model containing multiple survival decision trees.

[0052] At each leaf node of the survival decision tree, the survival probability of the feature subsequence of the input time period is estimated using the following survival probability estimation formula:

[0053] ;

[0054] where the terms of the formula denote, respectively: the estimated survival probability of the time-period feature subsequence at a given moment; the number of endpoint events that occurred at each event time; the number of samples for which the endpoint event had not yet occurred at that time; the risk propensity adjustment factor; and the regularization constant;

[0055] Each time period feature subsequence is input into all survival decision trees, and the survival probability estimate calculated by each survival decision tree is obtained. All survival probability estimates are then weighted and integrated to generate the corresponding second predicted risk score.
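
The feature perturbation-guided sampling step described above could be sketched as below. The patent does not specify the perturbation or the stratification rule, so this sketch assumes two strata split at a severity cut-off and small Gaussian noise added to each feature; `severity`, `cut`, `noise`, and the function name are all hypothetical. Training of the survival trees themselves is not shown.

```python
import random

def risk_sensitive_subsets(samples, severity, n_subsets=3, noise=0.01, cut=0.5, seed=0):
    """Build bootstrap subsets that preserve the size of each risk stratum.

    samples: list of (features, event, survival_time) tuples.
    severity: one severity indicator in [0, 1] per sample (assumed input).
    """
    rng = random.Random(seed)
    strata = {"low": [], "high": []}
    for sample, sev in zip(samples, severity):
        strata["high" if sev >= cut else "low"].append(sample)
    subsets = []
    for _ in range(n_subsets):
        subset = []
        for members in strata.values():
            if not members:
                continue
            # Perturb features first, then bootstrap within the stratum.
            perturbed = [([x + rng.gauss(0.0, noise) for x in feats], event, time)
                         for feats, event, time in members]
            subset.extend(rng.choices(perturbed, k=len(members)))
        subsets.append(subset)
    return subsets
```

Sampling within each stratum keeps high-severity samples from being crowded out of a subset, which is one plausible reading of "risk-sensitive" here.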

[0056] Optionally, the generation of the joint risk score sequence includes:

[0057] Receive the first predicted risk score sequence and the second risk score sequence;

[0058] For each scoring time point in the first predicted risk scoring sequence and the second risk scoring sequence, a one-to-one time alignment process is performed to construct the mapping relationship between the feature subsequence of each time period and the two scores.

[0059] For each time period feature subsequence, a linear fusion operation is performed between the first predicted risk score and the second predicted risk score. The two risk scores are weighted and combined using a fixed fusion coefficient to generate a joint risk score for the current time period feature subsequence.

[0060] The joint risk scores corresponding to the feature subsequences of all time periods are arranged in chronological order to construct a joint risk score sequence.
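
The time alignment and linear fusion above reduce to a few lines, sketched here. The fusion coefficient `alpha` and its default of 0.5 are assumptions; the patent only states that a fixed coefficient is used. Scores are represented as dictionaries keyed by scoring time point, which is also an illustrative choice.

```python
def fuse_scores(first, second, alpha=0.5):
    """Linearly fuse two score series on their shared time points.

    first, second: dicts mapping scoring time point -> risk score.
    Returns a chronologically ordered list of (time, joint_score).
    """
    shared = sorted(set(first) & set(second))  # one-to-one time alignment
    return [(t, alpha * first[t] + (1 - alpha) * second[t]) for t in shared]
```

Because the second score sequence exists only inside candidate modeling windows, intersecting the time points restricts the joint sequence to those windows.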

[0061] Optionally, the generation of early warning signals for septic shock includes:

[0062] Receive the joint risk score sequence, arrange the joint risk score values corresponding to the feature subsequences of each time period in the joint risk score sequence in chronological order, and construct the time series structure of the joint risk score;

[0063] On the time series structure of the joint risk score, the sliding fitting window parameters are set, and local fitting processing is performed according to the preset sliding step size. First-order linear fitting is performed on the joint risk score value sequence in each sliding fitting window to obtain the fitting slope corresponding to each time window.

[0064] Arrange the fitted slopes in time sequence to construct a fitted slope sequence;

[0065] In the fitted slope sequence, for each fitted slope point, it is determined whether the increase of the fitted slope at the current time point over the fitted slope at the previous time point is greater than the overall mean of the fitted slope sequence multiplied by a preset multiple threshold. If the condition is met, the corresponding scoring time point is marked as a risk mutation inflection point.

[0066] The scoring time points marked as risk mutation inflection points are output as early warning signals for septic shock.
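
The sliding linear fit and inflection test above can be sketched as follows. It assumes ordinary least squares for the first-order fit, assigns each slope to the last time point of its window, and uses hypothetical parameter names and defaults; none of these details are fixed by the text.

```python
def fitted_slope(ts, ys):
    """Least-squares slope of ys against ts (first-order linear fit)."""
    n = len(ts)
    mean_t, mean_y = sum(ts) / n, sum(ys) / n
    denom = sum((t - mean_t) ** 2 for t in ts)
    return sum((t - mean_t) * (y - mean_y) for t, y in zip(ts, ys)) / denom

def warning_points(times, scores, fit_len=3, step=1, multiple=1.0):
    """Return the time points marked as risk mutation inflection points."""
    slopes = []
    for s in range(0, len(scores) - fit_len + 1, step):
        window_t, window_y = times[s:s + fit_len], scores[s:s + fit_len]
        slopes.append((window_t[-1], fitted_slope(window_t, window_y)))
    if len(slopes) < 2:
        return []
    mean_slope = sum(v for _, v in slopes) / len(slopes)
    # Flag a point when the slope rises by more than multiple * mean slope.
    return [slopes[i][0] for i in range(1, len(slopes))
            if slopes[i][1] - slopes[i - 1][1] > multiple * mean_slope]
```

For a flat joint score that then accelerates, such as `[1, 1, 1, 1, 2, 5, 9]` at unit time steps, the flagged points are the ones where the fitted slope jumps, not merely where the score is high.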

[0067] The beneficial effects of this invention are:

[0068] (1) This invention constructs a global-local dual assessment framework for septic shock risk score by combining the Mondrian forest model and the random survival forest model. Based on the original multidimensional time series data, it realizes the whole process from real-time online training, local mutation score identification to candidate modeling window extraction, which enhances the model's ability to identify risk outbreak patterns and improves the sensitivity and stability of early warning.

[0069] (2) By constructing a joint risk score sequence that integrates the first and second predicted risk scores, and by fitting the score trend and detecting its inflection points, this invention effectively improves the trend identification ability of the model in the pre-onset stage of septic shock, ensures accurate early warning timing, and reduces the false alarm rate and the missed alarm rate.

[0070] (3) By introducing time-stamped diagnostic data from structured electronic medical records, this invention achieves high-quality annotation of training samples. At the same time, combined with the scoring mutation interval extraction mechanism, it ensures that local model training is carried out during the risk-sensitive period, improves the targeting of model training and the timeliness of prediction, and has higher risk response efficiency and clinical application value.

Attached Figure Description

[0071] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used in conjunction with embodiments of the invention to explain the invention and do not constitute a limitation thereof. In the drawings:

[0072] Figure 1 is a flowchart of the machine learning-based early identification method for septic shock proposed in this invention.

Detailed Implementation

[0073] The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic diagrams, illustrating only the basic structure of the invention, and therefore only show the components relevant to the invention.

[0074] Referring to Figure 1, a machine learning-based method for early identification of septic shock includes the following steps:

[0075] Acquire real-time physiological parameter data and electronic medical record data of the target patient, and generate a raw multidimensional feature sequence arranged in chronological order;

[0076] The original multidimensional feature sequence is subjected to missing value completion processing, numerical normalization processing, and fixed-length sliding window segmentation processing to obtain continuous time-segment feature subsequences;

[0077] The feature subsequence of each time period is input into the Mondrian Forest model for online training to obtain the corresponding first predicted risk score sequence.

[0078] The rate of change of the first predicted risk score sequence is calculated, the interval of score mutation is identified, and the corresponding time period feature subsequence is extracted as a candidate modeling window.

[0079] Label the survival status and time of the current sample within the candidate modeling window, construct training samples with censoring information, input them into the random survival forest model for local training, and output the second risk score sequence.

[0080] Perform risk score fusion processing on the first predicted risk score sequence and the second risk score sequence within the candidate modeling window to generate a joint risk score sequence;

[0081] The trend change is fitted based on the joint risk score sequence, and the inflection point of rapid score increase is identified to output an early warning signal for septic shock.

[0082] In this embodiment, the generation of the original multidimensional feature sequence arranged in chronological order includes:

[0083] Acquire real-time physiological parameter data and electronic medical record structured data of the target patient. The real-time physiological parameter data includes continuously monitored indicators recorded at fixed sampling periods, and the electronic medical record structured data includes clinical diagnostic information, test results and treatment records with time tags.

[0084] Time-stamping is performed on real-time physiological parameter data and electronic medical record structured data. Based on a unified time benchmark, the two data sources are time-aligned to construct data synchronization units for corresponding time points.

[0085] For each data synchronization unit, field filtering and structure standardization are performed to remove null fields and entries that do not meet the unified format requirements, and the data is converted into a fixed-length feature structure.

[0086] All eligible data synchronization units are arranged in chronological order to generate an original multidimensional feature sequence with a continuous time structure.

[0087] In this embodiment, obtaining the continuous time-segment feature subsequence includes:

[0088] For each feature field in the original multidimensional feature sequence arranged in chronological order, missing values are identified, the time point where each missing value is located is recorded, and the adjacent valid data points before and after it are located.

[0089] After missing values are identified, missing value completion is performed on the feature fields that contain them: each missing value is interpolated from the located valid data points before and after it, generating the original multidimensional feature sequence after missing value completion.

[0090] Numerical normalization is performed on the original multidimensional feature sequence after missing value completion: the numerical range of each feature field in the sequence is determined, and all fields are then normalized according to a unified normalization rule to generate the normalized original multidimensional feature sequence.

[0091] Set a fixed-length sliding time window and sliding step size parameter, take the normalized original multidimensional feature sequence as input, divide the window according to time order, and extract the set of continuous feature fields covered by each time window.

[0092] The results of each window segmentation are organized, and the set of feature fields corresponding to the time window is organized into a time period feature subsequence.

[0093] The normalization rule performs numerical normalization on each feature field in the original multidimensional feature sequence after missing value completion. For continuous numerical feature fields, normalization is performed according to the linear ratio between the minimum and maximum values in the original multidimensional feature sequence, so that the numerical range of the corresponding feature field is converted to a preset normalization interval. For event count fields or discrete coding fields, the original numerical values are converted into standardized representation values within a specified interval based on a fixed mapping table, so as to maintain the comparability of the normalized feature fields on the numerical scale.
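
The two branches of the normalization rule can be sketched as follows. The target interval of 0 to 1, the handling of a constant column, and the `VENTILATION_CODES` table are assumptions introduced for illustration; the patent specifies only a preset interval and a fixed mapping table.

```python
def min_max_normalize(column, low=0.0, high=1.0):
    """Map a continuous field linearly from [min, max] onto [low, high]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [low for _ in column]  # constant field: no spread to rescale
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]

# Hypothetical fixed mapping table for a discrete coding field.
VENTILATION_CODES = {"none": 0.0, "oxygen": 0.5, "mechanical": 1.0}

def map_discrete(column, table):
    """Convert discrete codes to standardized values via a fixed mapping table."""
    return [table[v] for v in column]
```

Using the same target interval for both branches keeps continuous and discrete fields on a comparable numerical scale, as the paragraph above requires.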

[0094] In this embodiment, obtaining the first predicted risk score sequence includes:

[0095] The time period feature subsequences are input into the Mondrian Forest model in chronological order. The Mondrian Forest model includes multiple decision trees that support dynamic structural expansion.

[0096] For each decision tree, a feature subsequence of a time period is received. Starting from the root node of the decision tree, the feature subsequence of the current time period is judged layer by layer to determine whether it meets the feature partitioning boundary set by the current node.

[0097] When the time period feature subsequence is not fully received by the current path structure, an incremental partitioning operation is performed at the current node according to the partitioning mechanism of the Mondrian Forest model, expanding the new child node, and sending the time period feature subsequence into the newly added child node structure;

[0098] When the feature subsequence of a time period satisfies a certain partitioning path condition, the path is traversed downwards to the leaf node of the decision tree;

[0099] In the leaf nodes of the decision tree, update the statistical information related to the feature subsequence of the current time period. The statistical information includes the total number of samples, the cumulative survival time, and the event status label.

[0100] Based on the statistical information in the leaf nodes of the decision tree, the survival probability estimate of the feature subsequence in the current time period in the decision tree is calculated.

[0101] The survival probability estimates computed by all decision trees for the feature subsequence of the current time period are weighted and averaged to generate the corresponding first predicted risk score.

[0102] All time period feature subsequences are processed sequentially according to time order. Multiple first predicted risk scores are generated using the Mondrian Forest model. All first predicted risk scores are then arranged in chronological order to form a first predicted risk score sequence.

[0103] The feature partitioning boundary set by the current node is defined as follows: in the Mondrian Forest model, each internal node of a decision tree sets the upper and lower limits of the value range of each feature field based on the feature field values in the time period feature subsequences it has received, and these limits define the partitioning condition represented by the node. When a new time period feature subsequence is input, whether the value of the feature field falls within the value range determines whether traversal continues down the path of the current node.

[0104] The criteria for determining whether the current path structure has been fully received include: during the recursive traversal of the decision tree, the values of all feature fields of the input time period feature subsequence satisfy the partitioning conditions of each level node, that is, the value of each feature field is within the range of values defined by the feature partitioning boundary set by the corresponding node; and during the entire traversal from the root node to the leaf node, if the values of the time period feature subsequence on all splitting features continuously satisfy the partitioning conditions of the corresponding node, then it is considered to have been fully received by the current path structure.

[0105] The path condition for a certain partitioning includes a path formed by a series of partitioning nodes from the root node down in each decision tree. Each partitioning node corresponds to a feature field and a judgment rule for its value range. The path condition for a certain partitioning is composed of the judgment rules of all partitioning nodes on the corresponding path. It means that if the value of the feature field corresponding to each partitioning node on the path falls within the value range set by the corresponding node, the input time period feature subsequence is considered to meet the path condition, and the path can continue to be traversed down to the next level node until the leaf node is reached.
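
The "fully received" test defined above amounts to a range check at every node on a root-to-leaf path, which the sketch below illustrates. It assumes each time period feature subsequence has been flattened into a single numeric vector and that each node stores per-feature lower and upper limits; the incremental partitioning that follows a failed check is not shown. Function names are hypothetical.

```python
def within_bounds(sample, lower, upper):
    """True when every feature value lies inside the node's value range."""
    return all(lo <= x <= hi for x, lo, hi in zip(sample, lower, upper))

def fully_received(sample, path):
    """Check a sample against every node on a root-to-leaf path.

    path: list of (lower, upper) limit pairs, ordered from root to leaf.
    """
    return all(within_bounds(sample, lower, upper) for lower, upper in path)
```

A sample that fails the check at some node is the case in which the text calls for an incremental partitioning operation at that node.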

[0106] In this embodiment, the generation of the candidate modeling window includes:

[0107] Receive a first predicted risk score sequence, and construct a score change rate sequence between adjacent score points in the first predicted risk score sequence in chronological order. The score change rate is the difference between any two adjacent first predicted risk scores divided by the score time interval.

[0108] Based on the rate of change sequence of scores, the mean of the local rate of change within the sliding time window is calculated to obtain the mean of the local rate of change of scores at each scoring time point.

[0109] For each scoring time point in the scoring change rate sequence, compare the scoring change rate with the corresponding local scoring change rate mean. When the scoring change rate is greater than the local scoring change rate mean multiplied by a preset abnormal threshold multiple, record the corresponding scoring time point as the mutation start point.

[0110] Centered on the mutation initiation point, a fixed-length time range is extended forward and backward respectively. The corresponding time period feature subsequences within the covered time range are extracted to generate the initial mutation interval fragment.

[0111] Determine whether the rate of change of scores at each scoring time point in the initial mutation interval segment is continuously greater than the global average rate of change of the score change rate sequence. If the continuity condition is met, the set of time period feature subsequences corresponding to the initial mutation interval segment is confirmed as a candidate modeling window.

[0112] In this embodiment, the generation of the second risk scoring sequence includes:

[0113] Extract the feature subsequence of each time period within the candidate modeling window;

[0114] The system retrieves septic shock diagnosis records with time tags from the structured data of electronic medical records, and performs survival status labeling operations based on the time points corresponding to the time period feature subsequences. If there are septic shock diagnosis records before the current time point, the current time point is marked as the endpoint event; otherwise, it is marked as a censoring event.

[0115] Based on the time interval between the first collection time of physiological parameter data of the target patient and the current time point, calculate the survival time corresponding to the feature subsequence of each time period;

[0116] The time-period feature subsequences, survival status labels, and survival times are combined into training sample data.

[0117] Input the training sample data into the random survival forest model to generate the trained random survival forest model;

[0118] All time-segment feature subsequences in the candidate modeling window are sequentially input into the trained random survival forest model, and the survival probability estimate corresponding to each time-segment feature subsequence is calculated in each survival decision tree.

[0119] The survival probability estimates obtained for each time-period feature subsequence across all survival decision trees are weighted and integrated to generate a second predicted risk score; all the second predicted risk scores are arranged in chronological order to form a second risk score sequence.
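The labeling rules of [0114] and [0115] can be illustrated with the short sketch below. The function name and the choice of hours as the unit of survival time are assumptions; the strict comparison follows the wording "before the current time point".

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def label_survival(
    window_times: List[datetime],
    first_collection: datetime,
    diagnosis_time: Optional[datetime],
) -> List[Tuple[float, int]]:
    """Return (survival time in hours, status) per time-period subsequence.

    Status 1 marks an endpoint event, status 0 marks a censoring event.
    """
    labeled = []
    for current in window_times:
        # Endpoint event only if a diagnosis record precedes this time point.
        event = int(diagnosis_time is not None and diagnosis_time < current)
        # Survival time is measured from the first physiological data collection.
        survival_hours = (current - first_collection).total_seconds() / 3600.0
        labeled.append((survival_hours, event))
    return labeled


start = datetime(2025, 1, 1, 0, 0)
windows = [start + timedelta(hours=h) for h in (6, 7, 8)]
diagnosis = start + timedelta(hours=6, minutes=30)
print(label_survival(windows, start, diagnosis))
```

Passing `None` as the diagnosis time models a patient with no septic shock record, in which case every subsequence is labeled as censored.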

[0120] In this embodiment, the random survival forest model includes:

[0121] The feature subsequences, survival status labels, and survival times of all time periods contained in the candidate modeling window are combined to form training sample data.

[0122] The training sample data is subjected to feature perturbation-guided hierarchical sampling processing, which includes: dividing the training sample data into multiple risk level strata based on indicators reflecting the severity of clinical conditions in the structured data of electronic medical records; introducing feature perturbation into the training sample data in each risk level stratum before performing Bootstrap sampling to generate multiple risk-sensitive sampling subsets.

[0123] Train a survival decision tree for each risk-sensitive sampling subset, and repeat the above training process to construct a random survival forest model containing multiple survival decision trees.

[0124] At each leaf node of the survival decision tree, the survival probability of the feature subsequence of the input time period is estimated using the following survival probability estimation formula:

[0125] $$\hat{S}(t \mid x) = \prod_{t_i \le t} \left( 1 - \alpha \cdot \frac{d_i}{n_i + \lambda} \right);$$

[0126] where $\hat{S}(t \mid x)$ represents the estimated survival probability of the time-period feature subsequence $x$ at time $t$, $d_i$ indicates the number of endpoint events that occurred at time $t_i$, $n_i$ indicates the number of samples for which the endpoint event has not yet occurred at time $t_i$, $\alpha$ indicates the risk propensity adjustment factor, and $\lambda$ represents the regularization constant;

[0127] This formula originates from the Kaplan-Meier survival function estimator of survival analysis, whose basic form is $\hat{S}(t) = \prod_{t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)$, where $d_i$ indicates the number of events that occurred at time point $t_i$ and $n_i$ represents the number of samples that had survived up to that time point. This application builds upon this formula and introduces two adjustment terms: first, a risk propensity adjustment factor $\alpha$, which enables dynamic calibration of risk sensitivity for different samples; second, a regularization constant $\lambda$, which applies a suppression correction to the denominator term $n_i$ to enhance the model's stability and generalization ability when the sample size is small. The result maintains dimensional consistency with the original formula: $d_i / n_i$ is a dimensionless ratio and the coefficients are dimensionless, so the entire term within the parentheses is dimensionless, and the final estimate $\hat{S}(t \mid x)$, which represents the survival probability, is also a dimensionless value. This conforms to the principle of consistency of physical dimensions, has a clear statistical basis and practical operability, and reflects the technical contribution of this application to survival probability estimation.

[0128] Each time period feature subsequence is input into all survival decision trees, and the survival probability estimate calculated by each survival decision tree is obtained. All survival probability estimates are then weighted and integrated to generate the corresponding second predicted risk score.
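The leaf-level estimate can be sketched as a product-limit estimator with the two adjustment terms described in [0126] and [0127]. The sketch assumes that the risk propensity adjustment factor (`alpha`) scales the event ratio and that the regularization constant (`lam`) is added to the at-risk count in the denominator; with `alpha=1.0` and `lam=0.0` it reduces to the standard Kaplan-Meier estimator. The function and parameter names are hypothetical.

```python
from typing import List, Tuple


def adjusted_survival(
    samples: List[Tuple[float, int]],
    t: float,
    alpha: float = 1.0,  # assumed placement: scales the event ratio
    lam: float = 0.0,    # assumed placement: added to the at-risk count
) -> float:
    """Estimate survival probability at time t from one leaf's samples.

    samples holds (survival time, status) pairs, status 1 = endpoint event.
    """
    event_times = sorted({time for time, status in samples
                          if status == 1 and time <= t})
    survival = 1.0
    for t_i in event_times:
        # d_i: endpoint events occurring exactly at t_i
        d_i = sum(1 for time, status in samples if status == 1 and time == t_i)
        # n_i: samples still at risk (event not yet occurred) at t_i
        n_i = sum(1 for time, _ in samples if time >= t_i)
        survival *= 1.0 - alpha * d_i / (n_i + lam)
    return survival


leaf = [(2.0, 1), (3.0, 0), (4.0, 1), (5.0, 0)]
print(adjusted_survival(leaf, 4.0))           # plain Kaplan-Meier
print(adjusted_survival(leaf, 4.0, lam=1.0))  # with denominator regularization
```

A positive `lam` enlarges each denominator, so a single event in a sparsely populated leaf pulls the estimate down less sharply, which matches the small-sample stabilizing role attributed to the regularization constant.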

[0129] In this embodiment, the generation of the joint risk scoring sequence includes:

[0130] Receive the first predicted risk score sequence and the second risk score sequence;

[0131] For each scoring time point in the first predicted risk scoring sequence and the second risk scoring sequence, a one-to-one time alignment process is performed to construct the mapping relationship between the feature subsequence of each time period and the two scores.

[0132] For each time period feature subsequence, a linear fusion operation is performed between the first predicted risk score and the second predicted risk score. The two risk scores are weighted and combined using a fixed fusion coefficient to generate a joint risk score for the current time period feature subsequence.

[0133] Arrange the joint risk scores corresponding to the feature subsequences of all time periods in chronological order to construct a joint risk score sequence;

[0134] The fixed fusion coefficient refers to the weight assigned to each model's score when fusing the risk scores generated by the Mondrian Forest model and the Random Survival Forest model. This fixed fusion coefficient is typically determined in advance through cross-validation or performance evaluation on historical training sets, representing the relative contribution of the two models to the overall recognition performance. For example, setting the fusion coefficient to 0.6 means that the first predicted risk score is assigned a weight of 0.6 and the second predicted risk score is assigned a weight of 0.4 during the fusion process, and the two are added together to form the joint risk score. This fusion coefficient remains unchanged throughout the model inference process and does not dynamically change with individual samples, thus ensuring the reproducibility and verifiability of the score fusion process.
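The alignment and fixed-coefficient fusion of [0131] to [0134] can be sketched as below, using the example coefficient of 0.6. Representing each score sequence as a mapping from scoring time point to score, and keeping only the time points present in both sequences, are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def fuse_scores(
    first: Dict[int, float],
    second: Dict[int, float],
    coeff: float = 0.6,  # weight of the first predicted risk score
) -> List[Tuple[int, float]]:
    """Return the joint risk score sequence as (time point, score) pairs."""
    # One-to-one time alignment: keep time points scored by both models.
    shared = sorted(set(first) & set(second))
    # Linear fusion with a fixed coefficient, in chronological order.
    return [(t, coeff * first[t] + (1.0 - coeff) * second[t]) for t in shared]


first_seq = {1: 0.5, 2: 0.7, 3: 0.9}
second_seq = {2: 0.2, 3: 0.4}
print(fuse_scores(first_seq, second_seq))
```

Because the coefficient is a constant argument and not a per-sample quantity, repeated runs on the same inputs give the same joint scores, which is the reproducibility property described above.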

[0135] In this embodiment, the generation of early warning signals for septic shock includes:

[0136] Receive the joint risk score sequence, arrange the joint risk score values corresponding to the feature subsequences of each time period in the joint risk score sequence in chronological order, and construct the time series structure of the joint risk score;

[0137] On the time series structure of the joint risk score, the sliding fitting window parameters are set, and local fitting processing is performed according to the preset sliding step size. First-order linear fitting is performed on the joint risk score value sequence in each sliding fitting window to obtain the fitting slope corresponding to each time window.

[0138] Arrange the fitted slopes in time sequence to construct a fitted slope sequence;

[0139] In the fitted slope sequence, for each fitted slope point, it is determined whether the increase from the fitted slope at the previous time point to the fitted slope at the current time point is greater than the overall mean of the fitted slope sequence multiplied by a preset multiple threshold. If the condition is met, the corresponding scoring time point is marked as a risk mutation inflection point.

[0140] The scoring time points marked as risk mutation inflection points are output as early warning signals for septic shock.
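The trend-fitting steps of [0136] to [0140] can be sketched as follows. The sketch assumes that the slope comparison is made between the current and the previous fitting window, and that each slope is attributed to the last scoring time point of its window; the window width, step, and multiple are placeholder values.

```python
from statistics import mean
from typing import List, Sequence


def fit_slope(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of a first-order linear fit."""
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = sum((x - mx) ** 2 for x in xs)
    return numerator / denominator


def risk_inflection_points(
    times: Sequence[float],
    scores: Sequence[float],
    width: int = 3,         # assumed sliding fitting window length
    step: int = 1,          # assumed sliding step size
    multiple: float = 1.5,  # assumed preset multiple threshold
) -> List[float]:
    """Return scoring time points marked as risk mutation inflection points."""
    end_times, slopes = [], []
    for start in range(0, len(scores) - width + 1, step):
        xs = times[start:start + width]
        ys = scores[start:start + width]
        end_times.append(xs[-1])
        slopes.append(fit_slope(xs, ys))
    threshold = mean(slopes) * multiple
    # Flag a point when the slope rose by more than the threshold since the
    # previous fitting window.
    return [
        end_times[k]
        for k in range(1, len(slopes))
        if slopes[k] - slopes[k - 1] > threshold
    ]


hours = [0, 1, 2, 3, 4, 5, 6, 7]
joint_scores = [0.2, 0.2, 0.2, 0.2, 0.2, 0.5, 0.9, 1.0]
print(risk_inflection_points(hours, joint_scores))
```

Raising `multiple` makes the warning more conservative, since only sharper accelerations of the joint risk score are flagged.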

[0141] Example 1:

[0142] To verify the feasibility of this invention in practice, it was applied to the intensive care unit of a provincial people's hospital. To improve the early identification and timely intervention of septic shock, a machine learning-based early identification method for septic shock was introduced. This method is deployed in the hospital's HIS system and data integration platform. It dynamically collects, processes, and predicts real-time physiological parameter data and structured electronic medical record data from hospitalized patients, assisting clinicians in making early warning decisions.

[0143] The system collects real-time physiological parameter data including heart rate, systolic blood pressure, diastolic blood pressure, body temperature, respiratory rate, and SpO2, with a data sampling frequency of once per minute. Electronic medical record structured data includes clinical laboratory items such as lactate concentration, C-reactive protein, white blood cell count, renal function indicators, liver function, infectious disease diagnosis records, and antibiotic medication records, which are updated daily. All data is processed using a unified time tag after collection, aligned along a timeline, and a continuous and standardized data synchronization unit is constructed.

[0144] The system preprocesses data using missing value interpolation and normalization techniques, and constructs time-period feature subsequences using a sliding window mechanism (window length set to 6 hours, sliding step size of 1 hour). Each time-period feature subsequence is sequentially input into the Mondrian Forest model for online training to generate the first predicted risk score sequence. Regions in the risk score sequence with significantly increased rate of change are identified as abrupt change intervals and further extracted as candidate modeling windows.
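With one sample per minute, the 6-hour window and 1-hour step above correspond to 360 samples per window and a stride of 60 samples. The following sketch enumerates the resulting index ranges; dropping a trailing partial window is an assumption, since the text does not say how incomplete windows are handled.

```python
from typing import List, Tuple


def sliding_windows(
    n_samples: int,
    window: int = 360,  # 6 hours at one sample per minute
    step: int = 60,     # 1 hour at one sample per minute
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges of complete windows."""
    return [
        (start, start + window)
        for start in range(0, n_samples - window + 1, step)
    ]


one_day = sliding_windows(24 * 60)
print(len(one_day), one_day[0], one_day[-1])
```

Each returned range selects the rows of the normalized multidimensional feature sequence that make up one time-period feature subsequence.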

[0145] Within the candidate modeling window, the system calculates survival time by comparing the interval between the time point and the time of the patient's first physiological parameter collection, and labels endpoint events and censored events based on the presence or absence of a septic shock diagnosis record. Subsequently, a random survival forest model is trained using a feature-perturbation-guided hierarchical sampling method, and a second risk score sequence is generated through an innovative survival probability estimation formula.

[0146] The two risk score sequences are weighted and fused at each time point (the weight of the first predicted risk score is set to 0.6, and the weight of the second predicted risk score is set to 0.4) to form a joint risk score sequence. The system further employs a sliding window fitting mechanism to perform linear trend analysis, extracts the slope of the growth trend of the joint risk score, and marks the inflection point of rapid rise as an early warning signal.

[0147] During a 60-day clinical deployment, the model monitored 325 critically ill patients, of whom 49 were ultimately diagnosed with septic shock. The system successfully identified 41 high-risk cases of sudden septic shock, with an average early warning time of 6.2 hours, an improvement of approximately 3 hours compared to traditional rule-based systems based on SOFA scores. The following are some of the system's performance data:

[0148] Table 1. Evaluation data of the septic shock identification model (sample size: 325 cases)

[0149]

| Indicator | The system | Conventional SOFA early warning system |
| --- | --- | --- |
| Accuracy | 92.1% | 79.8% |
| Early warning time (average) | 6.2 hours | 3.1 hours |
| Sensitivity (recall rate) | 83.7% | 71.4% |
| Specificity | 94.8% | 85.2% |
| AUC value | 0.914 | 0.791 |

[0150] As shown in Table 1, the evaluation data of the septic shock identification model demonstrates that this machine learning-based early identification method for septic shock exhibits significant advantages over the traditional SOFA scoring system in key performance indicators. Specifically, the model achieves an accuracy of 92.1%, a 12.3 percentage point improvement over the traditional system's 79.8%, indicating greater reliability in overall accuracy. In terms of sensitivity (recall), the method reaches 83.7%, higher than the traditional method's 71.4%, demonstrating a stronger ability to identify genuine septic shock cases and effectively reducing the risk of missed diagnoses. Regarding specificity, the model achieves 94.8%, a 9.6 percentage point improvement over the traditional method, helping to reduce false positives for non-septic shock patients and improve diagnostic efficiency. Furthermore, this method has a significant advantage in early warning time, with an average warning time of 6.2 hours, 3.1 hours longer than the traditional method, providing a more sufficient response window for clinical intervention. Most importantly, in terms of AUC value, this method achieves 0.914, which is much higher than the traditional method's 0.791, indicating that it has better distinguishing ability at various risk thresholds.
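The differences quoted above can be checked directly from the values in Table 1, and the reported sensitivity can be recomputed from the case counts given in [0147] (41 identified out of 49 diagnosed). The sketch below does only this arithmetic; the variable names are illustrative.

```python
# (this system, conventional SOFA system), in percent, taken from Table 1
reported = {
    "accuracy": (92.1, 79.8),
    "sensitivity": (83.7, 71.4),
    "specificity": (94.8, 85.2),
}

# Percentage-point gains of this system over the conventional system
gains = {name: round(ours - sofa, 1) for name, (ours, sofa) in reported.items()}

# Sensitivity recomputed from the counts in [0147]
sensitivity_from_counts = round(100 * 41 / 49, 1)

# Gain in average early warning time, in hours
warning_gain_hours = round(6.2 - 3.1, 1)

print(gains, sensitivity_from_counts, warning_gain_hours)
```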

[0151] In summary, this embodiment demonstrates the deployment process, data input method, modeling and training path, and early identification mechanism of the method of the present invention in a real critical care clinical scenario, reflecting a comprehensive improvement over existing identification technologies in terms of accuracy, early warning timeliness, and generalization ability.

[0152] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.

Claims

1. A machine learning-based method for early identification of septic shock, characterized in that it comprises: acquiring real-time physiological parameter data and electronic medical record data of a target patient, and generating an original multidimensional feature sequence arranged in chronological order; performing missing value completion processing, numerical normalization processing, and fixed-length sliding window segmentation processing on the original multidimensional feature sequence to obtain continuous time-period feature subsequences; inputting each time-period feature subsequence into the Mondrian Forest model for online training to obtain the corresponding first predicted risk score sequence; calculating the rate of change of the first predicted risk score sequence, identifying the score mutation interval, and extracting the corresponding time-period feature subsequences as a candidate modeling window; labeling the survival status and survival time of each time-period feature subsequence within the candidate modeling window, and constructing training samples with censoring information; inputting the training samples into the random survival forest model for local training to generate the trained random survival forest model, then inputting the time-period feature subsequences in the candidate modeling window into the trained random survival forest model, and generating a second predicted risk score sequence based on the survival probability estimates; performing risk score fusion processing on the first and second predicted risk score sequences within the candidate modeling window to generate a joint risk score sequence; and fitting the trend change based on the joint risk score sequence and identifying the inflection point of rapid score increase to output an early warning signal for septic shock.

2. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that, The generation of the original multidimensional feature sequence arranged in chronological order includes: Acquire real-time physiological parameter data and electronic medical record structured data of the target patient. The real-time physiological parameter data includes continuously monitored indicators recorded at fixed sampling periods, and the electronic medical record structured data includes clinical diagnostic information, test results and treatment records with time tags. Time-stamping is performed on real-time physiological parameter data and electronic medical record structured data. Based on a unified time benchmark, the two data sources are time-aligned to construct data synchronization units for corresponding time points. For each data synchronization unit, field filtering and structure standardization are performed to remove null fields and entries that do not meet the unified format requirements, and the data is converted into a fixed-length feature structure. All eligible data synchronization units are arranged in chronological order to generate an original multidimensional feature sequence with a continuous time structure.

3. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that the acquisition of the continuous time-period feature subsequences includes: for each feature field in the original multidimensional feature sequence arranged in chronological order, identifying missing values, recording the time point where each missing value is located, and locating the adjacent valid data points before and after it; after identifying missing values, performing missing value completion processing on the feature fields with missing values, wherein interpolation is performed based on the located valid data points before and after each missing value to generate the original multidimensional feature sequence after missing value completion; performing numerical normalization on the original multidimensional feature sequence after missing value completion by determining the numerical range of each feature field in the sequence and normalizing all fields according to a unified normalization rule to generate the normalized original multidimensional feature sequence; setting a fixed-length sliding time window and a sliding step size parameter, taking the normalized original multidimensional feature sequence as input, dividing it into windows in chronological order, and extracting the set of continuous feature fields covered by each time window; and organizing the results of the window segmentation so that the set of feature fields corresponding to each time window forms a time-period feature subsequence.

4. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that the first predicted risk score sequence is obtained by: inputting the time-period feature subsequences into the Mondrian Forest model in chronological order, the Mondrian Forest model including multiple decision trees that support dynamic structural expansion; for each decision tree, receiving a time-period feature subsequence and, starting from the root node of the decision tree, judging layer by layer whether the current time-period feature subsequence meets the feature partitioning boundary set by the current node; when the time-period feature subsequence is not fully covered by the current path structure, performing an incremental partitioning operation at the current node according to the partitioning mechanism of the Mondrian Forest model, expanding a new child node, and sending the time-period feature subsequence into the newly added child node structure; when the time-period feature subsequence satisfies a partitioning path condition, traversing the path downward to a leaf node of the decision tree; in the leaf node of the decision tree, updating the statistical information related to the current time-period feature subsequence, the statistical information including the total number of samples, the cumulative survival time, and the event status label; based on the statistical information in the leaf node of the decision tree, calculating the survival probability estimate of the current time-period feature subsequence in the decision tree; computing a weighted average of the survival probability estimates calculated by all decision trees for the current time-period feature subsequence to generate the corresponding first predicted risk score; and processing all time-period feature subsequences sequentially in chronological order, generating multiple first predicted risk scores using the Mondrian Forest model, and arranging all first predicted risk scores in chronological order to form the first predicted risk score sequence.

5. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that the generation of the candidate modeling window includes: receiving the first predicted risk score sequence, and constructing a score change rate sequence between adjacent score points in the first predicted risk score sequence in chronological order, the score change rate being the difference between any two adjacent first predicted risk scores divided by the score time interval; based on the score change rate sequence, calculating the mean of the score change rates within a sliding time window to obtain the local mean score change rate at each scoring time point; for each scoring time point in the score change rate sequence, comparing the score change rate with the corresponding local mean score change rate, and when the score change rate is greater than the local mean score change rate multiplied by a preset abnormality threshold multiple, recording the corresponding scoring time point as a mutation start point; centered on the mutation start point, extending a fixed-length time range forward and backward respectively, and extracting the time-period feature subsequences within the covered time range to generate an initial mutation interval segment; and determining whether the score change rate at each scoring time point in the initial mutation interval segment is continuously greater than the global mean of the score change rate sequence, and if the continuity condition is met, confirming the set of time-period feature subsequences corresponding to the initial mutation interval segment as a candidate modeling window.

6. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that the generation of the second predicted risk score sequence includes: extracting each time-period feature subsequence within the candidate modeling window; retrieving septic shock diagnosis records with time tags from the structured data of electronic medical records, and performing survival status labeling based on the time points corresponding to the time-period feature subsequences, wherein if there is a septic shock diagnosis record before the current time point, the current time point is marked as an endpoint event, and otherwise it is marked as a censoring event; based on the time interval between the first collection time of physiological parameter data of the target patient and the current time point, calculating the survival time corresponding to each time-period feature subsequence; combining the time-period feature subsequences, survival status labels, and survival times into training sample data; inputting the training sample data into the random survival forest model to generate the trained random survival forest model; sequentially inputting all time-period feature subsequences in the candidate modeling window into the trained random survival forest model, and calculating the survival probability estimate corresponding to each time-period feature subsequence in each survival decision tree; and weighting and integrating the survival probability estimates obtained for each time-period feature subsequence across all survival decision trees to generate a second predicted risk score, and arranging all the second predicted risk scores in chronological order to form the second predicted risk score sequence.

7. The method for early identification of septic shock based on machine learning according to claim 6, characterized in that, The random survival forest model includes: The training sample data is subjected to feature perturbation-guided hierarchical sampling processing, which includes: dividing the training sample data into multiple risk level strata based on indicators reflecting the severity of clinical conditions in the structured data of electronic medical records; introducing feature perturbation into the training sample data in each risk level stratum before performing Bootstrap sampling to generate multiple risk-sensitive sampling subsets. Train a survival decision tree for each risk-sensitive sampling subset, and repeat the above training process to construct a random survival forest model containing multiple survival decision trees; At each leaf node of the survival decision tree, the survival probability of the feature subsequence of the input time period is estimated, and the survival probability estimate is generated. Each time period feature subsequence is input into all survival decision trees, and the survival probability estimate calculated by each survival decision tree is obtained. All survival probability estimates are then weighted and integrated to generate the corresponding second predicted risk score.

8. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that, The generation of the joint risk scoring sequence includes: For each scoring time point in the first and second predicted risk scoring sequences, a one-to-one time alignment process is performed to construct a mapping relationship between the feature subsequence of each time period and the two scores. For each time period feature subsequence, a linear fusion operation is performed between the first predicted risk score and the second predicted risk score. The two risk scores are weighted and combined using a fixed fusion coefficient to generate a joint risk score for the current time period feature subsequence. The joint risk scores corresponding to the feature subsequences of all time periods are arranged in chronological order to construct a joint risk score sequence.

9. The method for early identification of septic shock based on machine learning according to claim 1, characterized in that the generation of the early warning signal for septic shock includes: receiving the joint risk score sequence, arranging the joint risk score values corresponding to the feature subsequences of each time period in the joint risk score sequence in chronological order, and constructing the time series structure of the joint risk score; on the time series structure of the joint risk score, setting the sliding fitting window parameters, performing local fitting processing according to the preset sliding step size, and performing first-order linear fitting on the joint risk score value sequence in each sliding fitting window to obtain the fitted slope corresponding to each time window; arranging the fitted slopes in time sequence to construct a fitted slope sequence; in the fitted slope sequence, for each fitted slope point, determining whether the increase from the fitted slope at the previous time point to the fitted slope at the current time point is greater than the overall mean of the fitted slope sequence multiplied by a preset multiple threshold, and if the condition is met, marking the corresponding scoring time point as a risk mutation inflection point; and outputting the scoring time points marked as risk mutation inflection points as early warning signals for septic shock.