Model for predicting postoperative discharge readiness in patients with rectal cancer
By constructing the GA_XGboost model and using genetic algorithms to select hyperparameters and feature engineering to process data, the shortcomings in assessing postoperative discharge readiness of colorectal cancer patients were addressed, enabling accurate prediction of discharge time and reducing waste of medical resources and hospitalization costs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG CANCER HOSPITAL
- Filing Date
- 2022-05-19
- Publication Date
- 2026-06-12
AI Technical Summary
The lack of effective methods in the current technology for assessing the postoperative discharge readiness of colorectal cancer patients leads to a waste of medical resources and an increase in patients' hospitalization costs.
A model based on GA_XGboost was constructed, and a genetic algorithm was used to select hyperparameters. Combined with training set data processed by feature engineering, the model predicted the postoperative discharge time of rectal cancer patients. The model was accurately evaluated by inputting relevant vital signs data of patients before and after surgery.
It improves the accuracy of discharge time prediction, reduces the waste of medical resources, provides a reference for doctors and patients, and helps to rationally arrange postoperative treatment.
Figure CN122201749A_ABST
Description
[0001] This application is a divisional application of Chinese invention patent application No. 202210542329.5, filed on 2022.05.19 and entitled "Model for predicting postoperative discharge readiness of rectal cancer patients".
Technical Field
[0002] This invention relates to a model for predicting postoperative discharge readiness in patients with rectal cancer.
Background Technology
[0003] Colorectal cancer is a common malignant tumor of the digestive system. It is among the top three cancers by incidence in both men and women worldwide, ranking third in incidence and second in mortality, and it seriously affects human health. With rising living standards and changes in lifestyle, especially dietary structure, the incidence of colorectal cancer in China is increasing; it now ranks second in incidence and fourth in mortality among malignant tumors in urban areas. In recent years, with continuous improvements in medical technology, the overall 5-year survival rate of colorectal cancer patients in China has also improved significantly, reaching 57.6% for colon cancer and 56.9% for rectal cancer.
[0004] First proposed by Fenwick in 1979, patient discharge readiness refers to a comprehensive assessment of a patient's physical, psychological, and social health status conducted by healthcare professionals before discharge. This assessment analyzes and judges the patient's ability to recover and reintegrate into society after leaving the hospital. Therefore, discharge readiness assessment results can help determine whether a patient meets discharge criteria. Accurate assessment and prediction of when a patient will be discharged helps hospitals better manage medical resources and understand their peak capacity. It also has significant implications for saving medical resources and reducing patient hospitalization costs. However, current research on postoperative discharge readiness for colorectal cancer patients is scarce.
[0005] The foregoing background information is intended to help those skilled in the art understand prior art that is similar to the present invention, and to facilitate the understanding of the concept and technical solution of the present invention. It should be clearly stated that, in the absence of clear evidence that the above content was disclosed before the filing date of this patent application, the foregoing background information should not be used to evaluate the novelty of the technical solution of this application.
Summary of the Invention
[0006] To address at least one of the technical problems mentioned in the background section, the present invention aims to provide a method for predicting the postoperative discharge time of rectal cancer patients. A model based on GA_XGboost is constructed and trained, and the relevant pre- and post-operative vital signs data of the rectal cancer patient are then input into the model as a validation set to predict the appropriate discharge time. GA_XGboost significantly improves prediction accuracy as measured by both MSE and MAE, giving doctors and patients a reference and an expectation. This not only benefits postoperative arrangements and treatment but also reduces unnecessary waste of medical resources.
[0007] To achieve the above objectives, the present invention provides the following technical solution.
[0008] A model for predicting postoperative discharge readiness in rectal cancer patients, including a method for predicting the discharge time of rectal cancer patients, comprises: obtaining the relevant pre- and post-operative vital signs data of the rectal cancer patient to be predicted; and inputting these data into the pre-constructed GA_XGboost-based postoperative discharge days prediction model to obtain the predicted number of days to discharge. The GA_XGboost-based model uses a genetic algorithm (GA) to select the hyperparameters for XGboost regression.
[0009] The postoperative discharge days prediction model based on GA_XGboost was trained using a training set consisting of pre- and post-operative vital signs data of historical rectal cancer patients.
[0010] The method for predicting postoperative discharge time for rectal cancer patients specifically includes: Step 1: Collect pre- and post-operative data of historical rectal cancer patients as a feature set; Step 2: Clean the feature set; Step 3: Perform feature filtering on the cleaned feature set; Step 4: Establish a prediction model for postoperative discharge days, and input the feature set obtained in Step 3 into the prediction model for training; Step 5: Use the trained prediction model to predict the validation set data to obtain the number of days until discharge.
[0011] In some implementations, the pre- and post-operative data in step one include: age, PS score, TNM stage, degree of differentiation, height, weight, BMI, fasting blood glucose on admission, albumin, prealbumin, total protein, GPT (alanine aminotransferase), GOT (aspartate aminotransferase), L-γ-glutamyl transferase, total bilirubin, direct bilirubin, indirect bilirubin, creatinine, white blood cells, hemoglobin, lymphocyte count, and other essential preoperative physical examination indicators, as well as the duration of surgery and the amount of blood loss.
[0012] In some implementations, step two cleans the feature set as follows: cases with severely missing indicators are removed, data containing outliers are removed using the 3δ criterion named in the claims (the three-sigma rule), missing feature values are filled with the feature mean, and discrete indicators are quantified.
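The cleaning of a single numeric feature can be illustrated with a minimal sketch. The function name `clean_feature`, the use of `None` for missing entries, and the choice to compute the fill value from non-outlier entries are assumptions made for illustration; the source does not specify the order of these operations.

```python
from statistics import mean, pstdev

def clean_feature(values, k=3.0):
    """Flag three-sigma outliers and mean-fill missing entries in one column.

    values: list of floats, with None marking a missing entry.
    Returns (filled_values, outlier_indices); the caller drops the outlier rows.
    """
    observed = [v for v in values if v is not None]
    mu, sigma = mean(observed), pstdev(observed)
    # A value is an outlier when it lies more than k standard deviations from the mean.
    outliers = {i for i, v in enumerate(values)
                if v is not None and abs(v - mu) > k * sigma}
    # Assumption: the fill value is the mean of the observed, non-outlier entries.
    kept = [v for i, v in enumerate(values) if v is not None and i not in outliers]
    fill = mean(kept)
    filled = [fill if v is None else v for v in values]
    return filled, sorted(outliers)

# Hypothetical column: 19 ordinary values, one extreme value, one missing entry.
column = [5.0] * 19 + [100.0, None]
filled, outlier_rows = clean_feature(column)
```

In a full pipeline the rows listed in `outlier_rows` would be removed from every feature column, not only from the one being inspected.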
[0013] In some implementations, step three performs feature selection as follows. Redundant and irrelevant features are identified using three metrics: the XGBoost importance score, the Random Forest importance score, and linear correlation. Specifically, features are ranked by importance using XGBoost and Random Forest, and the five features with the smallest sum of the two importance scores are eliminated. The original data has 27 dimensions and inevitably contains redundant features and features that do not help the model's predictions. Traditional linear dimensionality reduction methods such as PCA are ineffective for tree models, so this importance-based elimination is used instead, and it significantly improves the accuracy of the model's predictions.
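The elimination rule can be sketched as follows, assuming the two importance scores have already been computed for each feature. The function name `drop_least_important` and the example scores are hypothetical and are not taken from Table 1.

```python
def drop_least_important(names, xgb_scores, rf_scores, n_drop=5):
    """Remove the n_drop features whose summed importance is smallest."""
    totals = {name: x + r for name, x, r in zip(names, xgb_scores, rf_scores)}
    # Rank features from least to most important by the summed score.
    ranked = sorted(names, key=lambda name: totals[name])
    dropped = set(ranked[:n_drop])
    # Preserve the original feature order for the features that remain.
    return [name for name in names if name not in dropped]

# Hypothetical scores for seven features; two are dropped here instead of five.
names = ["a", "b", "c", "d", "e", "f", "g"]
xgb = [0.30, 0.05, 0.20, 0.01, 0.10, 0.24, 0.10]
rf = [0.25, 0.02, 0.30, 0.03, 0.10, 0.20, 0.10]
kept = drop_least_important(names, xgb, rf, n_drop=2)
```

With the patent's setting of `n_drop=5`, a 27-dimensional feature set would be reduced to 22 features.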
[0014] In some implementation methods, in step four, the postoperative discharge days prediction model is based on the GA_Xgboost model.
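[0015] The prediction model includes the XGboost regression algorithm. Its objective function mainly consists of a loss function and a regularization function:
[0016]
$$Obj^{(s)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(s)}\right) + \Omega(f_s), \qquad \Omega(f_s) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
[0017] where $Obj^{(s)}$ is the objective function for the s-th iteration; $l$ is the loss function of the model; $\hat{y}_i^{(s)}$ is the predicted value of sample i in the s-th round of training; $y_i$ is the true value of sample i; $x_i$ is the i-th input sample; $f_s$ is the sub-model trained in the s-th round; $\Omega(f_s)$ is the regularization function for the model in the s-th iteration; n is the number of samples; γ and λ are the regularization coefficients; T is the number of leaf nodes in the model.
[0018] The loss function is:
$$Obj^{(s)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(s-1)} + f_s(x_i)\right) + \Omega(f_s) + constant$$
[0019] where constant is a constant. Expanding it using Taylor's formula:
$$Obj^{(s)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(s-1)}\right) + g_i f_s(x_i) + \frac{1}{2} h_i f_s^2(x_i) \right] + \Omega(f_s) + constant$$
[0020] where $l\left(y_i, \hat{y}_i^{(s-1)}\right)$ is the training loss from the previous s-1 rounds, which does not affect the training loss value in this round and can be considered a constant; removing the constants transforms it into:
$$Obj^{(s)} \approx \sum_{i=1}^{n} \left[ g_i f_s(x_i) + \frac{1}{2} h_i f_s^2(x_i) \right] + \Omega(f_s)$$
[0021] Let $w_j$ be the output value of the j-th leaf node and $I_j$ the set of samples that fall into the j-th leaf node. Mapping each sample to its leaf node, the objective is rewritten as follows:
$$Obj^{(s)} \approx \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$$
[0022] where $g_i$ is the first derivative of the loss function with respect to $\hat{y}_i^{(s-1)}$ and $h_i$ is its second derivative. Letting $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the above formula can be expressed as:
$$Obj^{(s)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$
[0023] Differentiating the above equation with respect to $w_j$ yields the optimal value of leaf node j and the optimal value of the objective function:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
[0024] where $Obj^{*}$ can be used to evaluate the quality of a tree model; the smaller the value, the better the tree model.
[0025] The XGboost algorithm follows the greedy algorithm and assumes that the tree structure is a binary tree, so its final node-splitting objective function is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
[0026] where G and H are obtained by summing g and h over all samples at that node: $G_L = \sum_{i \in I_L} g_i$; $H_L = \sum_{i \in I_L} h_i$; $G_R = \sum_{i \in I_R} g_i$; $H_R = \sum_{i \in I_R} h_i$; $I_L$ is the sample set of the left leaf node; $I_R$ is the sample set of the right leaf node.
[0027] Therefore, the training process of XGboost is as follows: 1. A new initialized binary tree model is added in each round of training; 2. Update the gradient statistics before starting training:
$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(s-1)}\right)}{\partial\, \hat{y}_i^{(s-1)}}, \qquad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(s-1)}\right)}{\partial \left(\hat{y}_i^{(s-1)}\right)^2}$$
[0028] 3. Train the complete tree $f_s(x)$ of this round using the greedy algorithm and the gradient statistics; 3.1 Select the optimal split point:
$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
[0029] 3.2 Obtain the weight values of the leaf nodes:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
[0030] 4. Weight the newly obtained tree model of this round and add it to the previous model:
$$\hat{y}_i^{(s)} = \hat{y}_i^{(s-1)} + \eta f_s(x_i)$$
where $\eta$ is the weight applied to the new tree (the learning rate).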
[0031] After training, the validation set data, which has undergone feature engineering, can be input into the trained model to obtain the predicted value.
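The leaf weight and split gain used in the training procedure above can be checked numerically with a short sketch. It assumes a squared-error loss of the form ½(y − ŷ)², chosen here only because its derivatives are simple; the function names are hypothetical and this is not the XGBoost library's implementation.

```python
def squared_error_grads(y_true, y_pred):
    """First and second derivatives of 0.5 * (y - y_hat)**2 with respect to y_hat."""
    g = [p - t for t, p in zip(y_true, y_pred)]
    h = [1.0] * len(y_true)
    return g, h

def leaf_weight(G, H, lam):
    """Optimal leaf output w* = -G / (H + lambda)."""
    return -G / (H + lam)

def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Reduction of the objective obtained by splitting one node into two leaves."""
    def score(G, H):
        return G * G / (H + lam)
    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma

# Hypothetical data: four patients, current prediction 6.5 days for everyone.
y_true = [3.0, 5.0, 8.0, 10.0]
y_pred = [6.5, 6.5, 6.5, 6.5]
g, h = squared_error_grads(y_true, y_pred)

# Candidate split: first two samples go left, last two go right.
G_L, H_L = sum(g[:2]), sum(h[:2])
G_R, H_R = sum(g[2:]), sum(h[2:])
gain = split_gain(G_L, H_L, G_R, H_R, lam=1.0, gamma=0.0)
w_left = leaf_weight(G_L, H_L, lam=1.0)
```

A positive `gain` means the split lowers the regularized objective; the left leaf's negative weight pulls the prediction for the two short-stay samples downward.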
[0032] The prediction model also includes the GA_Xgboost algorithm.
[0033] The hyperparameters of the XGboost model are selected using a genetic algorithm (GA) as follows:
1. Randomly generate N combinations in the hyperparameter space as the initial population;
2. Calculate the fitness of each individual from the XGboost loss function obtained through cross-validation;
3. Retain the parameter combinations with high fitness;
4. Perform crossover operations on the retained individuals to generate new parameter combinations, and then apply random mutation to the generated combinations;
5. Eliminate the individuals with low fitness;
6. Repeat steps 2 to 5 until the set termination condition is met;
7. Select the individual with the highest fitness among all parameter combinations as the hyperparameters of the model.
[0034] In the past, grid search was often considered for hyperparameter tuning when training XGBoost models. However, the cost of finding the optimal model with grid search grows exponentially with the number of hyperparameters, resulting in extremely long training times and making it almost infeasible to tune several hyperparameters at once. This invention applies a genetic algorithm (GA) to select hyperparameters for the XGBoost model, which can significantly reduce training time and also greatly improve model accuracy, reaching the minimum value of the objective function.
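The seven steps can be sketched in a few lines of Python. This is an illustrative skeleton only: the function `genetic_search`, its parameters, the candidate values, and the toy fitness function are assumptions. In the setting described here, `fitness` would instead return a score derived from the cross-validated XGboost loss (for example its negative, so that higher is better).

```python
import random

def genetic_search(space, fitness, pop_size=20, generations=15,
                   keep_frac=0.5, mutation_rate=0.2, seed=0):
    """Search a discrete hyperparameter space; higher fitness is better."""
    rng = random.Random(seed)
    names = list(space)

    # Step 1: random initial population of hyperparameter combinations.
    population = [{n: rng.choice(space[n]) for n in names} for _ in range(pop_size)]
    best = max(population, key=fitness)

    for _ in range(generations):  # Step 6: repeat for a fixed number of generations.
        # Steps 2, 3 and 5: score individuals, keep the fittest, discard the rest.
        ranked = sorted(population, key=fitness, reverse=True)
        survivors = ranked[:max(2, int(keep_frac * pop_size))]

        # Step 4: crossover between two survivors, then random mutation.
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            child = {n: rng.choice((a[n], b[n])) for n in names}
            for n in names:
                if rng.random() < mutation_rate:
                    child[n] = rng.choice(space[n])
            children.append(child)

        population = survivors + children
        best = max(population + [best], key=fitness)

    # Step 7: return the fittest combination seen.
    return best

# Hypothetical search space and toy fitness that peaks at depth 4, rate 0.1.
space = {"max_depth": [3, 4, 5, 6], "learning_rate": [0.05, 0.1, 0.3]}

def toy_fitness(ind):
    return -((ind["max_depth"] - 4) ** 2) - (ind["learning_rate"] - 0.1) ** 2

best = genetic_search(space, toy_fitness)
```

Because the survivors are carried over unchanged into the next generation, the best fitness in the population never decreases from one generation to the next.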
[0035] The selected dataset is used as the training set and input into GA_XGboost for training.
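[0036] The loss function of the model is selected, and the optimal hyperparameters of the XGboost model, shown in Table 2, are solved for by the genetic algorithm (GA). After each training round, the optimal split point of each node in the tree is obtained:
$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
[0037] and the weight values of the leaf nodes are obtained:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
[0038] where G and H are obtained by summing g and h over all samples at that node: $G_L = \sum_{i \in I_L} g_i$; $H_L = \sum_{i \in I_L} h_i$; $G_R = \sum_{i \in I_R} g_i$; $H_R = \sum_{i \in I_R} h_i$; $G_j$ and $H_j$ are the corresponding sums over the samples in leaf node j; $I_L$ is the sample set of the left leaf node; $I_R$ is the sample set of the right leaf node.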
[0039] In some implementations, in step five, the validation set data consists of preoperative physical examination indicators, surgical duration, and blood loss for patients with rectal cancer to be predicted.
[0040] In some implementations, in step five, the validation set data corresponds to the feature set data that has undergone feature filtering as described above.
[0041] The prediction results are obtained by inputting the feature-engineered validation set data into the trained model. GA_XGboost was found to significantly improve prediction accuracy as measured by both MSE (Mean Squared Error) and MAE (Mean Absolute Error), which basically meets the practical need to give patients and doctors an effective prediction.
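The two evaluation metrics named above can be computed as in the following sketch. The discharge-day values are made up for illustration and are not taken from the validation results.

```python
def mse(y_true, y_pred):
    """Mean squared error between actual and predicted discharge days."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted discharge days."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted days to discharge for four patients.
actual = [7, 9, 12, 8]
predicted = [8, 9, 10, 9]
print(mse(actual, predicted))  # 1.5
print(mae(actual, predicted))  # 1.0
```

MSE penalizes large misses more heavily than MAE, so reporting both shows whether errors are spread evenly or concentrated in a few patients.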
[0042] A system for predicting postoperative discharge time in rectal cancer patients includes: The acquisition module is configured to acquire pre- and post-operative data of historical rectal cancer patients as a feature set and pre- and post-operative data of rectal cancer patients to be predicted as a validation set. The prediction module is configured to: input feature set data into a pre-built postoperative discharge days prediction model for training, and use the trained prediction model to predict the validation set data to obtain the discharge days.
[0043] An electronic device includes a memory, a processor, and computer instructions stored in the memory and running on the processor, wherein the computer instructions, when executed by the processor, perform the aforementioned method.
[0044] A machine-readable storage medium for storing computer instructions that, when executed by a processor, perform the aforementioned method.
[0045] Based on common knowledge in the field, the above-mentioned preferred conditions can be combined to obtain specific implementation methods.
[0046] The raw materials or reagents involved in this invention are all commercially available products, and the operations involved are all routine operations in the field unless otherwise specified.
[0047] The beneficial effects of this invention are as follows. This invention selects the ensemble learning algorithm XGboost, which can model nonlinear partitions and has a good overfitting control mechanism on small training sets. A GA_XGboost-based model for predicting postoperative discharge days is obtained by selecting hyperparameters for the XGboost model using a genetic algorithm (GA). The model is then trained on a training set that has undergone feature engineering. The pre- and post-operative vital signs data of rectal cancer patients are used as a validation set to predict the appropriate postoperative discharge time for patients who have undergone the relevant preoperative physical examinations. Verification shows that GA_XGboost significantly improves prediction accuracy as measured by MSE and MAE, basically meeting the practical need to give doctors and patients an effective estimate, a reference, and an expectation. This not only benefits postoperative arrangements and treatment but also reduces unnecessary waste of medical resources.
[0048] The present invention adopts the above-mentioned technical solution to achieve the above objectives; it makes up for the shortcomings of the prior art, is reasonably designed, and is easy to operate.
Attached Figure Description
[0049] To make the above and / or other objects, features, advantages and examples of the present invention more apparent and understandable, the accompanying drawings are described below:
Figure 1 is a flowchart of the prediction algorithm of the present invention;
Figure 2 shows the effect on the results of changing only the number of iterations after the appropriate hyperparameter values for GA_XGboost have been selected.
Detailed Implementation
[0050] Those skilled in the art can refer to the content of this document and appropriately replace and / or modify the process parameters to achieve the desired results. However, it should be particularly noted that all similar replacements and / or modifications are obvious to those skilled in the art and are considered to be included in this invention. The products and preparation methods described in this invention have been described through preferred examples, and those skilled in the art can obviously modify or appropriately change and combine the products and preparation methods described herein without departing from the content, spirit, and scope of this invention to realize and apply the technology of this invention.
[0051] Unless otherwise defined, the technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which this invention pertains. This invention uses the methods and materials described herein; however, other suitable methods and materials known in the art may also be used. The materials, methods, and examples described herein are illustrative only and are not intended to be limiting. All publications, patent applications, patent cases, provisional applications, database entries, and other references mentioned herein are incorporated herein by reference in their entirety. In case of conflict, the definitions included in this specification shall prevail.
[0052] Unless otherwise specified, the materials, methods, and examples described herein are exemplary and not limiting. While similar or equivalent methods and materials can be used to implement or test the invention, suitable methods and materials are described herein.
[0053] The present invention will now be described in detail.
[0054] Example 1: A method for predicting postoperative discharge time in rectal cancer patients includes the following steps.
[0055] The following pre- and post-operative physical examination data of historical rectal cancer patients at the hospital were collected as the existing patient feature set: age, PS score, TNM stage, degree of differentiation, height, weight, BMI, fasting blood glucose on admission, albumin, prealbumin, total protein, GPT (alanine aminotransferase), GOT (aspartate aminotransferase), L-γ-glutamyl transferase, total bilirubin, direct bilirubin, indirect bilirubin, creatinine, white blood cells, hemoglobin, total number of lymph nodes, number of positive lymph nodes, lymphocyte count, and other necessary preoperative physical examination indicators, as well as the duration of surgery and blood loss.
[0056] The existing patient feature set was cleaned: cases with severely missing indicators were removed, data containing outliers were removed using the 3δ criterion named in the claims (the three-sigma rule), missing feature values were filled with the feature mean, and discrete indicators were quantified.
[0057] The importance calculation and correlation test of the aforementioned cleaned 27-dimensional feature data were performed using XGboost and Random Forest. The results are shown in Table 1. The calculated values were sorted, and the five features with the smallest sum of importance calculation scores from XGboost and Random Forest were removed. The remaining data were used as the existing patient feature set.
[0058] Table 1 - Feature Selection Results
[0059] Example 2: Based on the aforementioned embodiments, a postoperative discharge days prediction model based on GA_Xgboost is constructed.
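[0060] The objective function of the XGboost algorithm mainly consists of a loss function and a regularization function:
[0061]
$$Obj^{(s)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(s)}\right) + \Omega(f_s), \qquad \Omega(f_s) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
[0062] where $Obj^{(s)}$ is the objective function for the s-th iteration; $l$ is the loss function of the model; $\hat{y}_i^{(s)}$ is the predicted value of sample i in the s-th round of training; $y_i$ is the true value of sample i; $x_i$ is the i-th input sample; $f_s$ is the sub-model trained in the s-th round; $\Omega(f_s)$ is the regularization function for the model in the s-th iteration; n is the number of samples; γ and λ are the regularization coefficients; T is the number of leaf nodes in the model.
[0063] The loss function is:
$$Obj^{(s)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(s-1)} + f_s(x_i)\right) + \Omega(f_s) + constant$$
[0064] where constant is a constant. Expanding it using Taylor's formula:
$$Obj^{(s)} \approx \sum_{i=1}^{n} \left[ l\left(y_i, \hat{y}_i^{(s-1)}\right) + g_i f_s(x_i) + \frac{1}{2} h_i f_s^2(x_i) \right] + \Omega(f_s) + constant$$
[0065] where $l\left(y_i, \hat{y}_i^{(s-1)}\right)$ is the training loss from the previous s-1 rounds, which does not affect the training loss value in this round and can therefore be considered a constant. Since constants do not affect the training results, after removing all constants the final loss function is transformed into:
$$Obj^{(s)} \approx \sum_{i=1}^{n} \left[ g_i f_s(x_i) + \frac{1}{2} h_i f_s^2(x_i) \right] + \Omega(f_s)$$
[0066] Let $w_j$ be the output value of the j-th leaf node and $I_j$ the set of samples that fall into the j-th leaf node. Mapping each sample to its leaf node, the objective is rewritten as follows:
$$Obj^{(s)} \approx \sum_{j=1}^{T} \left[ \left(\sum_{i \in I_j} g_i\right) w_j + \frac{1}{2}\left(\sum_{i \in I_j} h_i + \lambda\right) w_j^2 \right] + \gamma T$$
[0067] where $g_i$ is the first derivative of the loss function with respect to $\hat{y}_i^{(s-1)}$ and $h_i$ is its second derivative. Letting $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$, the above formula can be expressed as:
$$Obj^{(s)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}\left(H_j + \lambda\right) w_j^2 \right] + \gamma T$$
[0068] Differentiating the above equation with respect to $w_j$ yields the optimal value of leaf node j and the optimal value of the objective function:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T$$
[0069] where $Obj^{*}$ can be used to evaluate the quality of a tree model; the smaller the value, the better the tree model.
[0070] The XGboost algorithm follows the greedy algorithm and assumes that the tree structure is a binary tree, so its final node-splitting objective function is:
$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
[0071] where G and H are obtained by summing g and h over all samples at that node: $G_L = \sum_{i \in I_L} g_i$; $H_L = \sum_{i \in I_L} h_i$; $G_R = \sum_{i \in I_R} g_i$; $H_R = \sum_{i \in I_R} h_i$; $I_L$ is the sample set of the left leaf node; $I_R$ is the sample set of the right leaf node.
[0072] Therefore, the training process of XGboost is as follows: 1. A new initialized binary tree model is added in each round of training; 2. Update the gradient statistics before starting training:
$$g_i = \frac{\partial\, l\left(y_i, \hat{y}_i^{(s-1)}\right)}{\partial\, \hat{y}_i^{(s-1)}}, \qquad h_i = \frac{\partial^2 l\left(y_i, \hat{y}_i^{(s-1)}\right)}{\partial \left(\hat{y}_i^{(s-1)}\right)^2}$$
[0073] 3. Train the complete tree $f_s(x)$ of this round using the greedy algorithm and the gradient statistics; 3.1 Select the optimal split point:
$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
[0074] 3.2 Obtain the weight values of the leaf nodes:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
[0075] 4. Weight the newly obtained tree model of this round and add it to the previous model:
$$\hat{y}_i^{(s)} = \hat{y}_i^{(s-1)} + \eta f_s(x_i)$$
where $\eta$ is the weight applied to the new tree (the learning rate).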
[0076] Once trained, the predicted values can be obtained by inputting the feature-engineered data into the trained model.
[0077] Example 3: Based on the aforementioned embodiments, a postoperative discharge time prediction model based on GA_XGboost is constructed, and the patient's discharge time is predicted according to this model. The flowchart of the prediction algorithm is shown in Figure 1.
[0078] When training the XGboost model, the five hyperparameters listed in Table 2 are the main ones considered. In the past, grid search was often used for hyperparameter tuning, but the cost of finding the optimal model with grid search grows exponentially with the number of hyperparameters, so training takes a particularly long time and tuning several hyperparameters at once is almost infeasible.
[0079] Table 2 - Key Hyperparameters of XGboost
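The exponential growth of grid search can be illustrated with simple arithmetic. The figures below are assumptions for illustration: ten candidate values for each of five hyperparameters and 5-fold cross-validation; neither number comes from Table 2.

```python
from math import prod

def grid_size(candidates_per_param):
    """Number of hyperparameter combinations an exhaustive grid must evaluate."""
    return prod(candidates_per_param)

# Hypothetical grid: 10 candidate values for each of 5 hyperparameters.
combinations = grid_size([10] * 5)
# With 5-fold cross-validation, each combination requires 5 model fits.
total_fits = combinations * 5
print(combinations, total_fits)  # 100000 500000
```

Adding a sixth hyperparameter with ten candidates would multiply both counts by ten, whereas a population-based search evaluates only a fixed number of combinations per generation.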
[0080] The genetic algorithm (GA), proposed by J. Holland in the last century, is a random search method. Its main advantage is that it evaluates the objective directly, without requiring the objective to be differentiable or continuous. It adaptively searches for the global optimum, reducing computational complexity. Using a genetic algorithm (GA) to select hyperparameters for the XGBoost model can greatly reduce the training time and also significantly improve the model accuracy, reaching the minimum value of the objective function. The steps are as follows:
1. Randomly generate N combinations in the hyperparameter space as the initial population;
2. Calculate the fitness of each individual from the XGboost loss function obtained through cross-validation;
3. Retain the parameter combinations with high fitness;
4. Perform crossover operations on the retained individuals to generate new parameter combinations, and then apply random mutation to the generated combinations;
5. Eliminate the individuals with low fitness;
6. Repeat steps 2 to 5 until the set termination condition is met;
7. Select the individual with the highest fitness among all parameter combinations as the hyperparameters of the model.
[0081] The selected dataset was used as the training set to train GA_XGboost.
[0082] The loss function of the model is selected, and the optimal hyperparameters of the XGboost model, shown in Table 2, are solved for by the genetic algorithm (GA). With the hyperparameters fixed, the effect of changing only the number of iterations on the results of GA_XGboost is shown in Figure 2: both the MSE and MAE values show a downward trend, indicating that the prediction accuracy of the GA_XGboost-based postoperative discharge days prediction model in this application is high.
[0083] After each round of training, the optimal split point for each node of the tree is obtained:
$$Gain = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$
[0084] and the weight values of the leaf nodes are obtained:
$$w_j^{*} = -\frac{G_j}{H_j + \lambda}$$
with G, H, λ, and γ as defined in Example 2.
[0085] Finally, the validation set was input into the trained model, and the prediction results are shown in Table 3.
[0086] Table 3 - Comparison of Prediction Results of Various Regression Models
[0087] As shown in Table 3, both the XGboost model and the GA_XGboost model provided in this application can predict the postoperative discharge readiness of rectal cancer patients. Moreover, it was found that the prediction accuracy of GA_XGboost in the two indicators of MSE and MAE is improved, which basically meets the needs of providing patients and doctors with an effective prediction in real work, which is conducive to the postoperative arrangement and treatment of patients, and can reduce unnecessary hospitalization costs and waste of medical resources.
[0088] Example 4: Based on the foregoing embodiments, this embodiment provides an electronic device, including at least one memory, at least one processor, and at least one computer instruction stored in the memory and running on the processor. When the electronic device is running, the processor executes the computer instruction to cause the electronic device to perform the method described in the foregoing embodiments.
[0089] Example 5: Based on the foregoing embodiments, this embodiment provides a machine-readable storage medium for storing computer instructions, which, when executed by a processor, complete the method described in the foregoing embodiments.
[0090] The conventional techniques described in the above embodiments are existing technologies known to those skilled in the art, and therefore will not be described in detail here.
[0091] The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art to which this invention pertains may make various modifications or additions to the described specific embodiments or use similar methods to substitute them, without departing from the spirit of the invention or exceeding the scope defined by the appended claims.
[0092] Although the present invention has been described in detail and specific embodiments have been cited, it will be apparent to those skilled in the art that various changes or modifications can be made without departing from the spirit and scope of the invention.
[0093] While the foregoing detailed descriptions have shown, described, and pointed out novel features applicable to various embodiments, it should be understood that various omissions, substitutions, and changes may be made to the form and details of the described apparatus or methods without departing from the spirit of this disclosure. Furthermore, the various features and methods described above may be used independently of each other or may be combined in various ways. All possible combinations and sub-combinations are intended to fall within the scope of this disclosure. Many of the foregoing embodiments include similar components, and therefore, these similar components are interchangeable in different embodiments. Although the invention has been disclosed in the context of certain embodiments and examples, those skilled in the art will understand that the invention extends beyond the specifically disclosed embodiments to other alternative embodiments and / or applications, as well as their obvious modifications and equivalents. Therefore, the invention is not intended to be limited to the specific disclosure of the preferred embodiments herein.
[0094] All matters not covered in this invention are common knowledge.
Claims
1. A model for predicting postoperative discharge readiness in rectal cancer patients, characterized in that: The model is a machine learning model that has been trained. The training data for the model comes from a feature set obtained by processing pre- and post-operative physical signs data of historical rectal cancer patients as follows: Data cleaning includes: removing cases with severely missing indicators, removing data with abnormal values using the 3δ criterion, filling in missing features with the mean of the feature for a small number of data, and quantifying discrete data indicators. Feature selection includes: calculating the XGBoost feature importance score and the random forest feature importance score for the cleaned 27-dimensional vital signs data, and combining the results of linear correlation analysis to remove the 5 features with the smallest sum of the XGBoost feature importance score and the random forest feature importance score, thus obtaining the optimized feature set.
2. The model according to claim 1, characterized in that: The model's prediction performance on the validation set satisfies the following conditions: mean squared error (MSE) ≤ 5.213 and mean absolute error (MAE) ≤ 1.883.
3. The model according to claim 1 or 2, characterized in that: The vital signs data used to train the model include at least the following 27 items: age, PS score, TNM stage, differentiation degree, height, weight, BMI, fasting blood glucose on admission, albumin, prealbumin, total protein, alanine aminotransferase, aspartate aminotransferase, gamma-glutamyl transferase, total bilirubin, direct bilirubin, indirect bilirubin, creatinine, white blood cell count, hemoglobin concentration, lymphocyte count, duration of operation, and amount of surgical bleeding.
4. The model according to claim 1 or 2, characterized in that: The prediction model is constructed based on the XGBoost regression algorithm optimized by a genetic algorithm; The genetic algorithm is used to search and optimize the hyperparameter space of the XGBoost model. The hyperparameters include at least: the maximum depth of the tree, the number of iterations, the learning rate, the weight of the L2 regularization term, and the subsampling ratio.
5. The model according to claim 1 or 2, characterized in that: During model training, the Mean Absolute Error (MAE) is used as the loss function. The formula for calculating MAE is as follows: $$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$ where $n$ is the sample size, $y_i$ is the actual number of days to discharge for the i-th sample, and $\hat{y}_i$ is the number of days to discharge predicted by the model for the i-th sample.
6. A method for predicting the number of days of hospital stay after surgery for rectal cancer patients, characterized in that, include: Obtain preoperative physical examination and surgical indicators of the patients to be predicted; The preoperative physical examination data shall include at least the following: age, PS score, TNM stage, differentiation degree, height, weight, BMI, fasting blood glucose on admission, albumin, prealbumin, total protein, alanine aminotransferase, aspartate aminotransferase, gamma-glutamyl transferase, total bilirubin, direct bilirubin, indirect bilirubin, creatinine, white blood cell count, hemoglobin concentration, and lymphocyte count; the surgical data shall include at least the following: operation duration and surgical blood loss. The acquired data is input into the model as described in any one of claims 1-5; Output the predicted number of days to discharge calculated by the prediction model.
7. A data processing method for constructing the model according to any one of claims 1-5, characterized in that, include: We collected the actual number of days after surgery and the corresponding multidimensional vital signs data of historical rectal cancer patients to form an initial dataset. The initial dataset is cleaned, including: removing cases where the missing rate of key vital signs exceeds a preset threshold, identifying and removing outliers in each vital sign dimension based on the 3δ criterion, filling in a small number of missing values for non-key vital signs using the overall mean of the indicator, and encoding all categorical data into numerical data. For the cleaned data, the XGBoost algorithm and the random forest algorithm were used to evaluate the feature importance and calculate the linear correlation between each feature. Based on the importance scores of XGBoost, Random Forest, and linear correlation, all features are weighted and ranked. According to the ranking results, a predetermined number of features with the lowest ranking are removed to obtain an optimized feature subset. The optimized feature subset and its corresponding discharge days label are used to train the prediction model based on XGBoost regression.
8. The data processing method according to claim 7, characterized in that, The predetermined number is 5.
9. A system for predicting postoperative discharge time in patients with rectal cancer, characterized in that, include: The data acquisition and processing module is used to acquire historical and predictive vital sign data of patients, and execute the data processing method as described in claim 7 or 8 to output processed feature data. The model storage and retrieval module is used to store the prediction model as described in any one of claims 1-5, and to retrieve the corresponding prediction model in response to a prediction request; The prediction application module is used to receive the patient characteristic data to be predicted from the data acquisition and processing module, and call the prediction model in the model storage and call module to output the predicted number of days to discharge. The system according to claim 6 is characterized in that, during the training process, the model training module uses a genetic algorithm to search and optimize the hyperparameter space of the XGBoost regression model, wherein the hyperparameters include at least: the maximum depth of the tree, the number of iterations, the learning rate, the weight of the L2 regularization term, and the subsampling ratio.
10. The system according to claim 9, characterized in that, The system also includes a model training module, which receives a subset of historical patient optimized features and corresponding discharge day labels from the data acquisition and processing module, trains the model by optimizing XGBoost hyperparameters using a genetic algorithm, and stores the trained prediction model in the model storage and retrieval module.