A method for analyzing risk factors for hypertension
By employing a hierarchical mRMR algorithm and global optimization search, the problems of excessive manual processing and insufficient accuracy in hypertension risk factor analysis are solved, achieving efficient and accurate screening of hypertension risk factors.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- UNIV OF ELECTRONICS SCI & TECH OF CHINA
- Filing Date
- 2022-08-31
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for analyzing hypertension risk factors suffer from problems such as high workload of manual processing, insufficient accuracy of analysis results, and neglect of the joint effects between attributes, leading to local optima.
The attributes are hierarchically classified using a hierarchical approach, and the mRMR algorithm is used to filter the candidate feature set. The global optimal solution is searched by combining the perceptual operator and the mutation operator, and the risk factors for hypertension are obtained through iterative optimization.
It improves the accuracy of hypertension risk factor analysis, avoids manual processing workload, finds the global optimal solution, and avoids feature loss and local optimum stagnation.
Figure CN115394452B_ABST
Description
Technical Field
[0001] This invention belongs to the field of computer science, and specifically relates to a data analysis technique.
Background Technology
[0002] Hypertension involves numerous risk factors, including genetic, environmental, and biological risk factors. Analyzing and identifying a set of key risk factors for hypertension from these complex factors is of great significance for the prevention, diagnosis, and assessment of hypertension.
[0003] In the medical field, statistical methods are mostly used to analyze disease risk factors, while in the computer science field, feature selection methods are more commonly used for risk factor analysis. Hsu proposed an improved mRMR algorithm, which first ranks attributes based on the rules of maximum relevance and minimum redundancy, and then recursively eliminates attributes with little impact on the learner's prediction accuracy based on the contribution of individual attributes to the learner's prediction accuracy, thereby better identifying the risk factors with the greatest impact on cardiovascular disease.
[0004] Chinese patent CN109686442B, "Method and System for Determining Risk Factors of Gastroesophageal Reflux Disease Based on Machine Learning," constructs a user information set containing risk factors for gastroesophageal reflux disease, processes it to obtain a quantified data matrix, and then standardizes it. Principal component analysis is then used for dimensionality reduction. The processed data is clustered to obtain a hierarchical clustering dendrogram. Furthermore, based on the number of clusters determined by the hierarchical clustering dendrogram, the data in the processed dataset is further divided into multiple clusters. Finally, the correlation index between elements in each cluster is calculated, and the element with the highest correlation index is identified as a risk factor for gastroesophageal reflux disease.
[0005] Traditional statistical methods require high-quality data, involve a large amount of manual processing, and the analysis results may not meet the accuracy requirements of downstream prediction tasks. They are also not very practical for high-dimensional data. Hsu's method ignores the joint effects between attributes and is prone to getting trapped in local optima. Chinese patent CN109686442B, "Method and System for Determining Risk Factors of Gastroesophageal Reflux Disease Based on Machine Learning," mainly uses hierarchical clustering and then selects the element with the highest correlation index from the clusters. This method does not consider the differences in the correlation between each cluster and the disease and also ignores the joint effects between attributes.
Summary of the Invention
[0006] To address the aforementioned technical problems, this invention proposes a method for analyzing hypertension risk factors, which avoids a large amount of manual processing and improves the accuracy of the analysis results.
[0007] The technical solution adopted in this invention is: a method for analyzing risk factors of hypertension, comprising:
[0008] S1. Extract hypertension-related detection indicators and related demographic indicators from the electronic health record database, and preprocess the extracted data;
[0009] S2. Stratify according to hypertension-related detection indicators and demographic indicators to obtain the feature subset of each stratum;
[0010] S3. Filter the indicators in each stratum using the mRMR algorithm to obtain the candidate feature subset of each stratum;
[0011] S4. Merge the candidate feature subsets of all strata to obtain the candidate feature set;
[0012] S5. Search for the optimal set of risk factors from the candidate feature set to obtain the risk factors for hypertension.
[0013] The beneficial effects of the present invention are as follows: Compared with the prior art, the present invention first divides the attributes into layers, determines the number of candidate features within each layer based on the mutual information of each layer, and then uses the layered mRMR algorithm to screen out the candidate set of important risk factors for the onset of hypertension, avoiding the loss of important features and features with joint effects.
[0014] The present invention further performs secondary screening of features in the candidate set with the goal of searching for the global optimal solution. The perceptual operator enables the search agent to perceive the quality of the current solution, and the random mutation operator enables the search agent to avoid local optima stagnation and obtain the global optimal solution.
Description of the Drawings
[0015] Figure 1 is a flowchart of the solution of the present invention;
[0016] Figure 2 is a flowchart of the iterative search process.
Detailed Description
[0017] To facilitate understanding of the technical content of this invention by those skilled in the art, the following description, in conjunction with the accompanying drawings, further illustrates the invention.
[0018] This invention screens risk factors for hypertension in two stages. First, using a hierarchical approach, attributes are stratified based on domain knowledge, and the number of candidate features taken from each stratum is determined according to the mutual information of that stratum, resulting in a candidate feature set. Then, the candidate feature set is screened a second time to search for the global optimum, yielding the final set of risk factors.
[0019] As shown in Figure 1, the method of the present invention includes the following steps:
[0020] I. Data Acquisition and Processing
[0021] First, hypertension-related detection indicators and related demographic indicators were extracted from the electronic health record database of the cardiovascular specialist hospital. For numerical data, outliers and missing values were filled in using the average value. For non-numerical data, based on the mode principle in statistics, the most frequently occurring value of the indicator was used to fill in outliers and missing values.
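The filling rule described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the record layout (a list of dicts), the function name `impute`, and the `is_invalid` hook are hypothetical, and the source does not say how outliers are detected, so the caller supplies that test.

```python
from statistics import mean, mode

def impute(records, numeric_keys, categorical_keys, is_invalid=lambda v: v is None):
    """Fill invalid entries with the column mean (numeric) or mode (non-numeric).

    `is_invalid` is a hypothetical hook: it should return True for missing
    values and for whatever the caller considers an outlier.
    """
    filled = [dict(r) for r in records]  # leave the input untouched
    fillers = [(k, mean) for k in numeric_keys] + [(k, mode) for k in categorical_keys]
    for key, summarize in fillers:
        valid = [r[key] for r in records if not is_invalid(r.get(key))]
        fill_value = summarize(valid)
        for r in filled:
            if is_invalid(r.get(key)):
                r[key] = fill_value
    return filled

# Illustrative records only; the field names are made up.
records = [
    {"sbp": 120.0, "smoker": "no"},
    {"sbp": None, "smoker": "no"},
    {"sbp": 140.0, "smoker": None},
]
cleaned = impute(records, ["sbp"], ["smoker"])
```

Here the missing `sbp` value is replaced by the mean of the valid readings and the missing `smoker` value by the most frequent category.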
[0022] II. Indicator Stratification
[0023] Using domain knowledge, the hypertension-related detection indicators are stratified. For example, cardiovascular disease data often includes blood glucose indicators (fasting blood glucose, insulin), routine blood indicators (white blood cells, platelets), liver and kidney function indicators (total protein, albumin, globulin), and demographic indicators (gender, age, height, weight, BMI, smoking history). Each such group forms one stratum, yielding the stratified feature subsets.
[0024] III. Screening Candidate Feature Sets
[0025] 1. The mRMR algorithm is used to select a candidate feature subset within each stratum.
[0026] First, calculate the arithmetic mean $D_j$ of the mutual information between each individual feature and the class label c (hypertensive or not hypertensive) in each stratum:
[0027] $$D_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} I(x_i; c)$$
[0028] Where j is the stratum index, i is the index of an individual feature within the stratum, $S_j$ is the feature subset formed by the features in the j-th stratum, $|S_j|$ is the number of features in $S_j$, and $I(x_i; c)$ is the mutual information between feature $x_i$ and c. Mutual information is calculated as follows (written here for discrete-valued variables):
[0029] $$I(x_i; c) = \sum_{x_i} \sum_{c} p(x_i, c) \log \frac{p(x_i, c)}{p(x_i)\, p(c)}$$
[0030] Where $p(x_i, c)$ is the joint distribution of $x_i$ and c, and $p(x_i)$ and $p(c)$ are the marginal distributions of $x_i$ and c, respectively.
[0031] Then, calculate the arithmetic mean $R_j$ of the pairwise mutual information among the individual features in each stratum:
[0032] $$R_j = \frac{1}{|S_j|^2} \sum_{x_i, x_k \in S_j} I(x_i; x_k)$$
[0033] Calculate the difference between the two arithmetic means above:
[0034] $$\theta_j = D_j - R_j$$
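The per-stratum quantities can be illustrated with a small sketch. It assumes discrete-valued features (continuous laboratory values would need to be binned first, which the source does not describe) and the usual mRMR normalization of the redundancy term by $|S_j|^2$; the function names are hypothetical.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Discrete mutual information I(X; Y) in nats, estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    # p(x,y) * log(p(x,y) / (p(x) p(y))), written with raw counts
    return sum((nxy / n) * log(nxy * n / (px[x] * py[y]))
               for (x, y), nxy in joint.items())

def stratum_theta(features, labels):
    """Return (D_j, R_j, theta_j) for one stratum.

    `features` maps a feature name to its list of discrete values.
    """
    cols = list(features.values())
    size = len(cols)
    relevance = sum(mutual_information(col, labels) for col in cols) / size
    redundancy = sum(mutual_information(a, b) for a in cols for b in cols) / size ** 2
    return relevance, redundancy, relevance - redundancy

# Toy stratum: "a" matches the label exactly, "b" is independent of it.
labels = [0, 0, 1, 1]
stratum = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]}
d_j, r_j, theta_j = stratum_theta(stratum, labels)
```

A stratum whose features are informative about the label but not about one another ends up with a larger difference, and so would be given more candidate slots in the next step.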
[0035] Normalize $\theta_j$ and calculate the number $m_j$ of candidate features selected in each stratum:
[0036]
[0037] Where m is the total number of candidate features to be selected, and J is the number of feature subsets (strata) obtained from the stratification. Once the number $m_j$ of candidate features for each stratum has been determined, the mRMR algorithm is applied within each stratum: following the principle of maximum relevance and minimum redundancy, it calculates the difference $ф_j$ between the arithmetic mean of the feature-class mutual information and the arithmetic mean of the feature-feature mutual information. Using an incremental search, the $m_j$ features with the smallest $ф_j$ values are selected as the candidate feature subset for that stratum.
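The allocation and the incremental search can be sketched as below, under stated assumptions. The normalization formula is not reproduced in this text, so `allocate_counts` assumes a simple proportional share of m (and therefore positive $\theta_j$ values); that choice is an assumption, not the patent's formula. `mrmr_select` uses the common greedy mRMR criterion (relevance minus mean redundancy with the features already chosen, larger is better); the sign convention behind $ф_j$ in the source may differ.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Discrete mutual information I(X; Y) in nats, estimated from paired samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((nxy / n) * log(nxy * n / (px[x] * py[y]))
               for (x, y), nxy in joint.items())

def allocate_counts(thetas, m):
    """ASSUMED rule: m_j proportional to theta_j. Rounding may make the sum differ from m."""
    if min(thetas) <= 0:
        raise ValueError("this sketch only handles positive theta values")
    total = sum(thetas)
    return [round(m * theta / total) for theta in thetas]

def mrmr_select(features, labels, m_j):
    """Greedy incremental mRMR inside one stratum; returns the chosen feature names."""
    remaining = dict(features)
    selected = []
    while remaining and len(selected) < m_j:
        def score(name):
            relevance = mutual_information(remaining[name], labels)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(remaining[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy stratum: "a_dup" duplicates "a", so it adds redundancy but no new information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
stratum = {
    "a":     [0, 0, 0, 1, 1, 1, 1, 1],
    "a_dup": [0, 0, 0, 1, 1, 1, 1, 1],
    "b":     [0, 1, 0, 0, 1, 1, 0, 1],
}
chosen = mrmr_select(stratum, labels, 2)
```

In the toy stratum the duplicate feature is skipped in favor of the weaker but non-redundant one, which is the behavior the maximum-relevance, minimum-redundancy principle is meant to produce.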
[0038] 2. Merge the candidate feature subsets after each layer of screening to obtain the candidate feature set S.
[0039] IV. Search for the optimal set of risk factors from the candidate feature set S to obtain the risk factors for hypertension.
[0040] This step uses an iterative search to obtain the hypertension risk factors. As shown in Figure 2, the specific steps are as follows:
[0041] 1. Randomly generate n search agents $X_1, X_2, \ldots, X_n$ and initialize them. Each search agent $X_i$ is represented as a binary vector that encodes one possible combination of features (that is, one solution). Its dimension d equals the number of features in the candidate feature set S. A 1 at a position indicates that the corresponding feature is selected, and a 0 indicates that it is not.
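The binary encoding can be illustrated with a short sketch. The function names are hypothetical, and rejecting the all-zero vector (an empty feature subset) is an assumption added here, not something the source states.

```python
import random

def init_agents(n, d, rng=None):
    """Generate n random binary vectors of length d, one per search agent."""
    rng = rng or random.Random()
    agents = []
    while len(agents) < n:
        candidate = [rng.randint(0, 1) for _ in range(d)]
        if any(candidate):  # assumption: skip agents that select no feature at all
            agents.append(candidate)
    return agents

def selected_features(agent, names):
    """Decode an agent into the names of the features it selects."""
    return [name for bit, name in zip(agent, names) if bit]

agents = init_agents(n=5, d=4, rng=random.Random(0))
```

For instance, `selected_features([1, 0, 1], ["age", "BMI", "insulin"])` decodes to the first and third feature.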
[0042] 2. Initialize the number of iterations t.
[0043] 3. Ten-fold cross-validation is applied to the data subset s formed by the feature combination corresponding to search agent X. The data is divided into 10 partitions of approximately equal size; in each round, nine partitions are used as the training set and the remaining partition as the test set, rotating so that each partition serves once as the test set. A KNN classifier predicts the class label (whether or not the sample is a patient with hypertension) of each test sample, and the percentage of incorrect predictions is the classification error rate for that round. The error rates of the 10 rounds are averaged to obtain the average classification error rate γ(X).
[0044] The fitness function `fitness` is designed to calculate the score of the search agent. The calculation formula is as follows:
[0045] $$\mathrm{fitness} = \alpha\,\gamma(X) + \beta\,\frac{|s|}{|S|}$$
[0046] Where γ(X) is the average classification error rate of the classifier using the feature combination represented by search agent X, |s| is the number of features selected by search agent X, and |S| is the number of features in the candidate feature set S. α and β are weight parameters corresponding to the importance of classification accuracy and of the length of the selected feature subset, with α∈[0,1] and β=1-α. The benefit of this fitness function is that it balances minimizing the number of selected features against maximizing classification accuracy.
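The evaluation of one search agent can be sketched in plain Python as follows. Several choices here are assumptions rather than details from the source: the weighted-sum form α·γ(X) + β·|s|/|S| (under which a lower score is better, so the ranking convention in the text may differ), the values k = 5 and α = 0.99, interleaved fold assignment, Euclidean distance, and the worst-case score for an empty subset.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    order = sorted(range(len(train_X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def average_error(X, y, folds=10, k=5):
    """Mean classification error over `folds` rounds; needs at least `folds` samples."""
    n = len(X)
    rates = []
    for f in range(folds):
        test_idx = list(range(f, n, folds))  # interleaved folds of near-equal size
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        wrong = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test_idx)
        rates.append(wrong / len(test_idx))
    return sum(rates) / folds

def fitness(agent, X, y, alpha=0.99, folds=10, k=5):
    """ASSUMED form: alpha * error rate + (1 - alpha) * fraction of features kept."""
    cols = [p for p, bit in enumerate(agent) if bit]
    if not cols:
        return 1.0  # assumption: an empty subset gets the worst possible score
    subset = [[row[p] for p in cols] for row in X]
    gamma = average_error(subset, y, folds, k)
    return alpha * gamma + (1 - alpha) * len(cols) / len(agent)

# Toy data: column 0 separates the classes, column 1 is unrelated to the label.
X = [[10.0 * (i % 2), float((i * 7) % 5)] for i in range(20)]
y = [i % 2 for i in range(20)]
score = fitness([1, 0], X, y)
```

With the toy data, the agent that keeps only the informative column is classified without error, so its score comes entirely from the feature-count term.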
[0047] The fitness function values of the feature combinations corresponding to all search agents are calculated by the above procedure. The solutions of the three search agents with the highest fitness function values are designated the first, second, and third best solutions, respectively, and the solutions of the remaining search agents serve as candidate solutions.
[0048] 4. Update the algorithm parameters and the candidate solutions of the search agents. The candidate solutions are updated according to the three best solutions; the value for the next iteration is expressed as:
[0049]
[0050] Where t is the current iteration number, and the three component vectors are those updated according to the first, second, and third best solutions, respectively. They are calculated as follows:
[0051]
[0052]
[0053]
[0054] Here, the three movement vectors point in the directions of the three best solutions, respectively, and the coefficient vectors are calculated as follows:
[0055]
[0056]
[0057] Here, the random vectors take values in [0,1], so that different random vectors yield different coefficient vectors. The remaining parameter decreases linearly from 2 to 0 over the course of the iterations:
[0058]
[0059] The parameter `iterations` sets the maximum number of iterations.
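The update described in this step resembles the grey wolf optimizer, so the sketch below uses the canonical grey wolf update as a stand-in. That is an assumption: the source's own equations are not reproduced in this text, the perceptual operator is deliberately left out, and the names `control_parameter` and `gwo_step` are hypothetical.

```python
import random

def control_parameter(t, iterations):
    """Parameter that decreases linearly from 2 (t = 0) to 0 (t = iterations)."""
    return 2.0 - 2.0 * t / iterations

def gwo_step(x, leaders, t, iterations, rng):
    """One canonical grey-wolf-style update of a continuous position x.

    `leaders` holds the three best solutions. For each leader a coefficient pair
    is drawn from random numbers in [0, 1]; the new position is the mean of the
    three leader-directed component vectors.
    """
    a = control_parameter(t, iterations)
    new_x = []
    for p in range(len(x)):
        components = []
        for leader in leaders:
            coeff_a = 2.0 * a * rng.random() - a       # lies in [-a, a]
            coeff_c = 2.0 * rng.random()               # lies in [0, 2]
            distance = abs(coeff_c * leader[p] - x[p])  # movement magnitude
            components.append(leader[p] - coeff_a * distance)
        new_x.append(sum(components) / len(components))
    return new_x

rng = random.Random(1)
leaders = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
final_position = gwo_step([0.0, 0.0], leaders, t=10, iterations=10, rng=rng)
```

At the last iteration the decreasing parameter reaches 0, so the first coefficient vanishes and the agent lands on the average of the three best solutions; earlier in the run the larger coefficient range allows moves away from them.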
[0060] The perceptual operator senses the quality of the solution at the current position. If that quality is high, the search and update are performed in the vicinity of the current position; otherwise, the search proceeds in the directions of the three best solutions. It is modeled using the following formula:
[0061]
[0062] Where $D_m$ is the arithmetic mean of the mutual information between the features and the class label in the feature subset corresponding to the m-th search agent, and $R_m$ is the arithmetic mean of the mutual information among the features in that subset, calculated as follows:
[0063] $$D_m = \frac{1}{|s_m|} \sum_{x_n \in s_m} I(x_n; c)$$
[0064] $$R_m = \frac{1}{|s_m|^2} \sum_{x_n, x_k \in s_m} I(x_n; x_k)$$
[0065] Where $s_m$ is the feature set selected by the m-th search agent, and $I(x_n; c)$ is the mutual information between feature $x_n$ and c.
[0066] When the coefficient vector takes values within [-1, 1], each search agent updates its position to a point between its current solution and the best solution. During the search, the coefficient vectors are updated at every iteration, which lets the agents diverge from one another in the search for the optimal solution. When the coefficient vector takes a random value less than -1 or greater than 1, the search agent moves to a position far from the current best solution.
[0067] 5. The continuous search space is mapped to the binary space through a transformation function, as shown in the following formula:
[0068]
[0069] Here, the sigmoid function is applied to the continuous values (features) of the search agent; p = 1, ..., d indexes the dimension, and $x_p$ denotes the value of the agent's p-th dimension in the continuous search space. The binary value is 0 or 1, depending on a comparison between a random number and the sigmoid output.
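The mapping to binary space can be sketched as follows, assuming the plain logistic sigmoid and the rule "set the bit to 1 when the sigmoid output is at least the random number". Scaled or shifted sigmoid variants are also common in binary metaheuristics, and the source formula is not reproduced here, so treat both choices as assumptions.

```python
import math
import random

def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def binarize(x, rng):
    """Map a continuous position to a binary vector, one random draw per dimension."""
    return [1 if sigmoid(v) >= rng.random() else 0 for v in x]

rng = random.Random(0)
bits = binarize([50.0, -50.0, 0.3], rng)
```

Strongly positive coordinates almost always become 1 and strongly negative ones almost always become 0, while coordinates near zero are decided close to a coin flip.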
[0070] 6. To prevent getting trapped in local optima, mutation and crossover operators are added to the model, with the mutation rate r as follows:
[0071]
[0072] The mutation rate r decreases linearly from 0.9 to 0 with the number of iterations (t).
[0073] The solution vector is obtained through mutation.
[0074]
[0075] A crossover operation is then performed on the solution obtained from mutation, as follows:
[0076]
[0077] The result of this formula is the solution after updating by the crossover operator. Crossover(x, y) is a simple crossover operator that obtains an intermediate solution between solutions x and y by switching between the two input solutions with equal probability, dimension by dimension, as shown in the following formula:
[0078]
[0079] Where d represents the d-th dimension of the solution, CR is the predefined crossover rate, and rand is a random number.
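The mutation and crossover operators of this step can be sketched as below. The formulas themselves are missing from this text, so the details are assumptions: the rate follows 0.9·(1 - t/iterations), mutation flips each bit independently with probability r, and crossover takes a dimension from x when a random number falls below CR and from y otherwise. With CR = 0.5 the two inputs are used with equal probability.

```python
import random

def mutation_rate(t, iterations, r_max=0.9):
    """Rate that decreases linearly from r_max (t = 0) to 0 (t = iterations)."""
    return r_max * (1.0 - t / iterations)

def mutate(x, r, rng):
    """ASSUMED bit-flip mutation: flip each bit independently with probability r."""
    return [1 - bit if rng.random() < r else bit for bit in x]

def crossover(x, y, cr, rng):
    """Per-dimension switch: take x's bit when rand < cr, otherwise y's bit."""
    return [a if rng.random() < cr else b for a, b in zip(x, y)]

rng = random.Random(0)
current = [1, 0, 1, 1]
best = [0, 0, 1, 0]
r = mutation_rate(t=20, iterations=100)
child = crossover(mutate(current, r, rng), best, cr=0.5, rng=rng)
```

Early iterations mutate aggressively, which helps agents leave a local optimum, and the rate falls to zero by the final iteration so the search can settle.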
[0080] 7. Iterate through steps 3, 4, 5, and 6 until the number of iterations t equals the maximum number of iterations, then stop iterating. The mapped features constitute the final set of hypertension risk factors.
[0081] Those skilled in the art will recognize that the embodiments described herein are intended to help the reader understand the principles of the invention, and should be understood that the scope of protection of the invention is not limited to such specific statements and embodiments. Various modifications and variations can be made to the invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the invention should be included within the scope of the claims of the invention.
Claims
1. A method for analyzing risk factors of hypertension, characterized in that it comprises:
S1. Extract hypertension-related detection indicators and related demographic indicators from the electronic health record database, and preprocess the extracted data;
S2. Stratify according to hypertension-related detection indicators and demographic indicators to obtain the feature subset of each stratum;
S3. Filter the indicators in each stratum using the mRMR algorithm to obtain the candidate feature subset of each stratum; step S3 is specifically as follows:
S31. Calculate the arithmetic mean $D_j$ of the mutual information between each individual feature and the class label c in each stratum: $D_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} I(x_i; c)$; where j is the stratum index, i is the index of an individual feature within the stratum, $I(x_i; c)$ is the mutual information between feature $x_i$ and c, and $S_j$ is the feature subset formed by the features in the j-th stratum;
S32. Calculate the arithmetic mean $R_j$ of the mutual information among the individual features in each stratum: $R_j = \frac{1}{|S_j|^2} \sum_{x_i, x_k \in S_j} I(x_i; x_k)$; where $I(x_i; x_k)$ is the mutual information between features $x_i$ and $x_k$;
S33. Calculate the difference between $D_j$ and $R_j$: $\theta_j = D_j - R_j$;
S34. Normalize $\theta_j$ and calculate from it the number $m_j$ of candidate features selected in each stratum, where m is the total number of candidate features to be selected and J is the number of feature subsets obtained from the stratification;
S35. Once the number $m_j$ of candidate features for each stratum has been determined, apply the mRMR algorithm within each stratum: following the principle of maximum relevance and minimum redundancy, calculate the difference $ф_j$ between the arithmetic mean of the feature-class mutual information and the arithmetic mean of the feature-feature mutual information; using an incremental search, select the $m_j$ features with the smallest $ф_j$ values as the candidate feature subset for that stratum;
S4. Merge the candidate feature subsets of all strata to obtain the candidate feature set;
S5. Search for the optimal set of risk factors from the candidate feature set to obtain the risk factors for hypertension.
2. The method for analyzing hypertension risk factors according to claim 1, characterized in that the hypertension-related detection indicators include: blood glucose indicators, routine blood indicators, and liver and kidney function indicators.
3. The method for analyzing hypertension risk factors according to claim 2, characterized in that step S5 is as follows:
S51. Randomly generate n search agents and initialize them; each search agent $X_i$ is represented as a binary vector encoding one possible combination of features, and its dimension d equals the number of features in the candidate feature set; in each search agent, a 1 at a position indicates that the corresponding feature is selected, and a 0 indicates that it is not;
S52. Initialize the number of iterations t;
S53. Use ten-fold cross-validation to divide the data subset s formed by the feature combination corresponding to the search agent into several partitions, and calculate the average classification error rate over these partitions; calculate the fitness function value of the search agent from the average classification error rate;
S54. Designate the solutions of the three search agents with the highest fitness function values, in descending order, as the first, second, and third best solutions; the solutions of the remaining search agents are candidate solutions;
S55. Update the candidate solutions according to the update expression, in which t is the current iteration number and the three component vectors are those updated according to the first, second, and third best solutions, respectively;
S56. Add mutation and crossover operators;
S57. When the maximum number of iterations is reached, stop iterating; the mapped features constitute the final set of hypertension risk factors.
4. The method for analyzing hypertension risk factors according to claim 3, characterized in that, in step S53, the average classification error rate of the partitions is calculated as follows: a KNN classifier predicts the class label of each sample in each partition, and the classification error rate of the partition is calculated from the class labels of all samples in that partition; the classification error rates of the partitions are averaged to obtain the average classification error rate; that is, each search agent corresponds to one average classification error rate.
5. The method for analyzing hypertension risk factors according to claim 4, characterized in that the fitness function expression is: $\mathrm{fitness} = \alpha\,\gamma(X) + \beta\,\frac{|s|}{|S|}$; where γ(X) is the average classification error rate of the classifier using the feature combination represented by search agent X, |s| is the number of features selected by search agent X, and |S| is the number of features in the candidate feature set S; α and β are weight parameters corresponding to the importance of classification accuracy and of the length of the selected feature subset, α∈[0,1], β=1-α.
6. The method for analyzing hypertension risk factors according to claim 5, characterized in that the updated component vectors are calculated from the perceptual operator together with the movement vectors in the directions of the three best solutions and the corresponding coefficient vectors; the coefficient vectors are calculated from random vectors in [0,1], with different random vectors yielding different coefficient vectors; and iterations denotes the maximum number of iterations.