A Genetic Programming Mutation Probability Optimization Method Based on Maximum Mutual Information Coefficient

By measuring the correlation between features and the target using the maximum mutual information coefficient, the mutation selection process of genetic programming is optimized, solving the randomness problem in feature selection of genetic programming algorithms and improving the training efficiency and generalization ability of the model.

CN116402128B · Active · Publication Date: 2026-06-30 · SOUTH CHINA UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SOUTH CHINA UNIV OF TECH
Filing Date
2023-04-06
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing genetic programming algorithms are greatly affected by random initialization during feature selection, leading to uncertain search directions, easy local convergence, and impacting the model's generalization ability and training efficiency.

Method used

The correlation between features and the target is measured by the maximum mutual information coefficient. The probability distribution of genetic programming mutations that select new features is determined and fixed during evolution, independent of the influence of the number of features in the population.

Benefits of technology

It improves the search robustness of genetic programming, enhances the training efficiency and generalization performance of the model, and reduces the impact of random initialization on algorithm performance.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a genetic programming mutation probability optimization method based on the maximum mutual information coefficient. In traditional genetic programming, feature selection tends to shift from initial completely random selection to selective selection with bias. This invention constrains the search direction of genetic programming, thereby improving search efficiency. The method includes: Step S1, using the maximum mutual information coefficient to measure the correlation between each feature and the target in the dataset, and merging the correlations of each feature and the target into a correlation vector; Step S2, determining the probability distribution of genetic programming when selecting new features through mutation based on the correlation vector; Step S3, performing genetic programming evolution, during which the probability distribution remains fixed and is unaffected by the number of features selected within the population. This invention reduces the impact of random initialization on the overall performance of genetic programming, enhances the robustness of genetic programming on fundamental problems, improves training efficiency and accuracy, and enhances model generalization performance.

Description

Technical Field

[0001] This invention relates to two major fields: regression analysis and intelligent computing. It mainly relates to a genetic programming mutation probability optimization method based on the maximum mutual information coefficient.

[0002] Background Introduction

[0003] Genetic programming is a method for automatically generating programs, inspired by Darwin's theory of natural selection in evolution. Genetic programming and its variants have been successfully applied to various real-world scenarios, including character recognition, financial forecasting, digital filters, electronic circuit design, image processing, biological sequencing, and robotics. Genetic programming typically models practical problems as traditional regression or classification problems based on various encoding methods; therefore, optimizing the performance of genetic programming on fundamental problems helps enhance its applicability to practical applications.

[0004] Like other hyperheuristic algorithms, genetic programming can find an acceptable solution to NP-hard large-scale complex problems in a finite amount of time. The solution approach of genetic programming can be divided into a stochastic initialization process, a non-directional search process, and a greedy selection process. Therefore, improving the performance of genetic programming can be achieved by improving the initialization environment, enhancing search capabilities, specifying more reasonable search directions, and designing different greedy strategies.

[0005] In some newer genetic programming variants, such as SL-GEP, the features of new individuals are still selected with a roulette wheel whose selection probability for each feature is determined by how often that feature occurs in the current population. Under this mechanism, feature selection drifts from completely random selection at initialization toward biased selection. The distribution of features in the current population influences the probability of selecting subsequent new features, and the newly selected features in turn reinforce that distribution. Furthermore, different initial populations yield different feature proportions, so the search performance of genetic programming is constrained by the random initialization process. The feature selection bias is largely determined by random initialization, which ultimately leads to unpredictable local convergence. In this setting, superior individuals may select low-quality features during evolution and thereby give those features a weight disproportionate to their quality. The effect of such incorrect selections by a few superior individuals spreads through the entire population and alters the search direction. High-quality features are not guaranteed a higher selection probability; under the roulette wheel, their selection probability gradually falls to a stable level as low-quality features squeeze them out. Genetic programming algorithms therefore tend to converge locally under this selection mechanism, which leads directly to limited expressive power, weak generalization of the output model, and unstable algorithm performance on practical problems.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a genetic programming mutation probability optimization method based on the maximum mutual information coefficient. This method constrains the search direction of genetic programming and improves search efficiency.

[0007] To achieve the above objectives, this invention provides a genetic programming mutation probability optimization method based on the maximum mutual information coefficient, comprising the following steps:

[0008] Step S1: Use the maximum mutual information coefficient to measure the correlation between each feature and the target in the dataset, and combine the correlation between each feature and the target into a correlation vector.

[0009] Step S2: Determine the probability distribution of genetic programming when selecting new features through mutation based on the correlation vector;

[0010] Step S3: Perform genetic programming evolution. During the genetic programming evolution process, the probability distribution is always fixed and is not affected by the number of each feature selected in the population.

[0011] Compared with the prior art, the present invention can achieve at least the following beneficial effects:

[0012] This invention takes the correlation between the features and the target as its basis and fixes the probability of selecting new features during both initialization and evolution. This ensures that genetic programming always searches in a more reasonable direction, reduces the impact of random initialization on the overall performance of the algorithm, enhances the robustness of genetic programming on basic problems, improves training efficiency and accuracy, and enhances the generalization performance of the model.

Description of the Drawings

[0013] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0014] Figure 1 is an overall flowchart of the genetic programming mutation probability optimization method based on the maximum mutual information coefficient provided by the present invention.

[0015] Figure 2 shows the convergence of the two methods on F1 during training (panel a) and testing (panel b) in the embodiment of the present invention.

[0016] Figure 3 shows the convergence of the two methods on F2 during training (panel a) and testing (panel b) in the embodiment of the present invention.

[0017] Figure 4 shows the convergence of the two methods on F3 during training (panel a) and testing (panel b) in the embodiment of the present invention.

[0018] Figure 5 shows the convergence of the two methods on F4 during training (panel a) and testing (panel b) in the embodiment of the present invention.

Detailed Implementation

[0019] The technical solution of this embodiment of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiment is one embodiment of the present invention, and not all embodiments thereof. Based on this embodiment of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0020] Consider an m×(n+1) dataset, where m is the number of samples and n is the number of features. Each sample has n features and one target. The purpose of regression analysis is to establish a mapping between the n-dimensional features and the target. Typical regression analysis requires experienced personnel to pre-define a parameterized model that may reflect this mapping, substitute the dataset into the model, and fit the relevant parameters to obtain a complete regression model. The purpose of this invention, like that of regression analysis, is to build a more accurate regression model: one that captures the patterns of unseen data and, for samples without target values, gives predictions that more closely match the actual pattern. Building on the genetic programming algorithm, this invention designs a mutation probability optimization method based on the maximum mutual information coefficient, so that the regression model built by genetic programming has stronger predictive ability.

[0021] Referring to Figure 1, the present invention provides a mutation probability optimization method based on the maximum mutual information coefficient. The overall implementation can be roughly divided into two processes, initialization and evolution, and includes the following steps:

[0022] Step S1: During the initialization process, the correlation between each feature and the target in the dataset is first calculated using the maximum mutual information coefficient, and then the correlation vector is obtained by combining them.

[0023] For variables X and Y, the mutual information between them is defined as follows:

[0024] $$ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \qquad (1) $$

[0025] Where p(x, y) is the joint probability distribution function of the two variables, and p(x) and p(y) are the marginal probability distribution functions of X and Y. In the discrete case, the two-dimensional scatter data of X and Y are placed in an a×b grid, where a and b are the numbers of horizontal and vertical lines under a given grid division. Different values of a and b correspond to different grid spacings and different ways of dividing the scattered points, and the upper limit of a and b is determined dynamically by the number of scattered points. The values of a and b are determined by dynamic programming. For each division, the mutual information of X and Y is calculated according to equation (1) and normalized so that values from different grids can be compared. The maximum normalized value over all divisions is taken as the maximum mutual information coefficient of X and Y. Its magnitude represents the degree of correlation between each terminal and the target: the closer it is to 1, the stronger the correlation; the closer it is to 0, the weaker the correlation.
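
The grid search described in paragraph [0025] can be illustrated with a simplified sketch. It is not the dynamic-programming procedure of the patent: it only tries equal-frequency grids, so it gives a rough lower-bound approximation of the maximum mutual information coefficient. The bound a·b ≤ m^0.6 and the normalization by log(min(a, b)) are assumptions borrowed from common MIC practice, and the names `mic_sketch` and `mutual_information` are hypothetical.

```python
import math
import random
from collections import Counter

def _equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * k // len(values)
    return bins

def mutual_information(bx, by):
    """Equation (1) evaluated on two lists of bin labels (natural log)."""
    m = len(bx)
    joint = Counter(zip(bx, by))
    px, py = Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) p(y))) with every p estimated as count / m
        mi += (c / m) * math.log(c * m / (px[i] * py[j]))
    return mi

def mic_sketch(x, y, exponent=0.6):
    """Best normalized mutual information over a x b equal-frequency grids."""
    m = len(x)
    max_cells = max(4, int(m ** exponent))  # assumed upper bound on a * b
    best = 0.0
    for a in range(2, max_cells // 2 + 1):
        bx = _equal_frequency_bins(x, a)
        for b in range(2, max_cells // a + 1):
            by = _equal_frequency_bins(y, b)
            # Assumed normalization: divide by log(min(a, b)).
            best = max(best, mutual_information(bx, by) / math.log(min(a, b)))
    return best

# Small demonstration on synthetic data: a perfectly dependent pair and an independent pair.
rng = random.Random(0)
x = [rng.random() for _ in range(200)]
noise = [rng.random() for _ in range(200)]
print(mic_sketch(x, x), mic_sketch(x, noise))
```

Applying `mic_sketch` to each feature column and the target column would yield one entry of the correlation vector per feature.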

[0026] Step S2: Determine the probability distribution of genetic programming when selecting new features through mutation based on the correlation vector, where each value in the correlation vector is the absolute value of the correlation between each feature and the target.

[0027] When determining the probability distribution, the probability p_i of the i-th feature is:

[0028] $$ p_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \qquad (2) $$

[0029] Where c_i and c_j represent the absolute values of the relevance of the i-th and j-th terminals, respectively, and n represents the size of the feature set. Equation (2) shows that features with stronger relevance to the target have a higher probability of being selected.
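
As a minimal sketch of equation (2), the function below turns a correlation vector into selection probabilities. The function name and the uniform fallback for an all-zero vector are assumptions made for illustration; the patent does not describe that edge case.

```python
def selection_probabilities(correlations):
    """Equation (2): p_i = |c_i| / sum_j |c_j|."""
    magnitudes = [abs(c) for c in correlations]
    total = sum(magnitudes)
    if total == 0:
        # Assumption: use a uniform distribution when no feature correlates with the target.
        return [1.0 / len(magnitudes)] * len(magnitudes)
    return [c / total for c in magnitudes]

# Example: the first feature is selected about six times as often as the third.
print(selection_probabilities([0.6, 0.3, 0.1]))
```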

[0030] The genetic programming algorithm takes "survival of the fittest" as its core idea and generates explicit expression solutions within a given number of generations or amount of time. During training, the algorithm first initializes the population by randomly generating a specified number of individuals. Under string encoding, each individual is a string of a specified length that can be recursively assembled into a binary tree, and the mathematical expression is decoded by traversing that tree. Each position of the string is an encoding bit. For every encoding bit other than those restricted to features, a random number is generated first: if it is less than Rnd_Int, a symbol is randomly selected for that bit; otherwise, a feature is randomly selected. Rnd_Int is a hyperparameter obtained through tuning in preliminary experiments. When a symbol is selected, every symbol in the symbol set has the same probability of being chosen. When a feature is selected, the probability of each feature being chosen is determined by equation (2). In some embodiments of the present invention, the probability distribution vector terminal_probability is determined as follows. The value of the first dimension is:

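[0031] $$ \mathrm{terminal\_probability}[1] = \frac{c_1}{\sum_{j=1}^{n} c_j} \qquad (3) $$

[0032] The values for the remaining dimensions, i = 2 to n, are:

[0033] $$ \mathrm{terminal\_probability}[i] = \mathrm{terminal\_probability}[i-1] + \frac{c_i}{\sum_{j=1}^{n} c_j} \qquad (4) $$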

[0034] At this point, the probability distribution vector terminal_probability is designed as a roulette wheel. A random number between 0 and 1 is generated, and the selected feature is determined by judging which dimension the random number belongs to.
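
The roulette wheel of paragraph [0034] can be sketched as a cumulative probability vector searched with a binary search. This is an illustrative sketch, not the patent's implementation: the helper names `build_wheel` and `spin` are hypothetical, indices are 0-based, and the last slot is pinned to 1.0 as an assumed guard against floating-point drift.

```python
import bisect
import itertools
import random

def build_wheel(probabilities):
    """Cumulative sums of per-feature probabilities, as in terminal_probability."""
    wheel = list(itertools.accumulate(probabilities))
    wheel[-1] = 1.0  # assumed guard: force the last boundary to exactly 1
    return wheel

def spin(wheel, rng=random):
    """Return the 0-based index of the slot that a uniform draw in [0, 1) falls in."""
    return bisect.bisect_right(wheel, rng.random())

wheel = build_wheel([0.5, 0.3, 0.2])
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(10000):
    counts[spin(wheel, rng)] += 1
print(counts)  # features with larger probability are drawn more often
```

Because the wheel is built once from the correlation vector, it can be reused unchanged in every generation, which is the fixed-distribution property of step S3.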

[0035] After the initialization process, the evolution process begins. In some embodiments of the present invention, the evolution process has two termination conditions: the fitness of the current best individual BestInd is less than σ, or the number of generations Gen has reached the maximum number of generations. Evolution terminates when either condition is met. Here, σ and the maximum number of generations are preset values. When more accurate results are required, a smaller σ or a larger maximum number of generations can be used.

[0036] In each iteration, the symbol selection probability is updated first. The update begins by counting the symbols in the current population and normalizing the count to obtain FQ with the following formula:

[0037] $$ FQ = \frac{N_{\mathrm{sym}}}{POPSIZE \times H} \qquad (5) $$

[0038] Where N_sym is the number of symbols counted in the current population, POPSIZE is the population size, and H is the fixed length of each individual. Equation (5) thus defines FQ as the proportion of symbols in the population relative to the total number of symbols and features. Next, the number of occurrences of each symbol type is counted to form the vector function_freq. The symbol probability vector function_probability for the current iteration is then determined as follows. The probability of the first symbol is:

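[0039] $$ \mathrm{function\_probability}[1] = \frac{\mathrm{function\_freq}[1]}{\sum_{j=1}^{fn} \mathrm{function\_freq}[j]} \qquad (6) $$

[0040] The probabilities of the 2nd to fn-th symbols are:

[0041] $$ \mathrm{function\_probability}[i] = \mathrm{function\_probability}[i-1] + \frac{\mathrm{function\_freq}[i]}{\sum_{j=1}^{fn} \mathrm{function\_freq}[j]} \qquad (7) $$

[0042] Where function_freq[1] represents the number of times the first symbol is used in the current population, function_freq[i] and function_freq[j] represent the number of times the i-th and j-th symbols are used in the current population, respectively, and fn is the number of symbols in the symbol set.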
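
The per-iteration symbol statistics can be sketched as follows, assuming each individual is a list of string tokens and that function_probability is a cumulative (roulette-wheel) vector like terminal_probability. The function name, the token representation, and the uniform fallback for a population without symbols are assumptions for illustration.

```python
import itertools
from collections import Counter

def update_symbol_statistics(population, symbols):
    """Return (FQ, function_freq, function_probability) for one iteration."""
    popsize = len(population)
    h = len(population[0])  # every individual has the same fixed length H
    counts = Counter(tok for ind in population for tok in ind)
    function_freq = [counts[s] for s in symbols]
    total = sum(function_freq)
    fq = total / (popsize * h)  # share of all sites that hold a symbol
    if total == 0:
        # Assumption: fall back to uniform shares if the population holds no symbols.
        shares = [1.0 / len(symbols)] * len(symbols)
    else:
        shares = [f / total for f in function_freq]
    # Cumulative sums form the roulette wheel over symbols.
    function_probability = list(itertools.accumulate(shares))
    return fq, function_freq, function_probability

population = [["+", "x0", "x1"], ["sin", "+", "x0"]]
print(update_symbol_statistics(population, ["+", "-", "sin"]))
```

Unlike terminal_probability, these values depend on the current population and are recomputed in every iteration.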

[0043] In one iteration, every individual in the current population is traversed. For each individual, the mutation probability CR and the mutation site k are determined randomly, and then a random number is generated. If this random number is less than CR, mutation occurs; otherwise, no mutation occurs and the individual passes directly into the next generation. If mutation occurs, the probability distribution to use is chosen according to the position of site k. If site k is in the head of the chromosome, the site may switch between symbol and feature: a random number is generated and compared with FQ; if it is less than FQ, a new symbol is selected to replace the original content of the site, and otherwise a new feature is selected. If site k is in the tail of the chromosome, the original content of site k is checked: if it was a feature, a new feature is selected according to terminal_probability; if it was a symbol, a new symbol is selected according to function_probability. This yields a new mutated individual. The fitness of the mutated individual is evaluated and compared with that of the original individual, and the better of the two enters the next generation. The fitness of that better individual is also compared with that of the current best individual, BestInd, and BestInd is updated if the better individual is superior. In some embodiments of the present invention, fitness is evaluated with the following formula:

[0044] $$ \mathrm{RMSE}_{train} = \sqrt{\frac{1}{m_{train}} \sum_{i=1}^{m_{train}} \left( f(x_i) - y_i \right)^2} \qquad (8) $$

[0045] Where m_train represents the number of training samples, x_i represents the feature vector of the i-th sample, f represents the individual being evaluated, f(x_i) represents the predicted value of the individual on the i-th sample, and y_i represents the target value of the i-th sample.
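
The per-individual mutation step of paragraph [0043] can be sketched as below. It assumes individuals are lists of string tokens, that the first `head_len` sites form the head, and that head sites draw new symbols from function_probability; the patent does not state the last point explicitly. All function and parameter names are hypothetical.

```python
import bisect
import random

def pick(wheel, items, rng):
    """Roulette-wheel draw: wheel is a cumulative probability vector over items."""
    slot = bisect.bisect_right(wheel, rng.random())
    return items[min(slot, len(items) - 1)]  # clamp guards against rounding

def mutate(individual, head_len, symbols, features,
           function_wheel, terminal_wheel, fq, rng):
    """Return a mutated copy of the individual, or the original if no mutation fires."""
    cr = rng.random()                   # CR is drawn randomly for each individual
    k = rng.randrange(len(individual))  # mutation site
    if rng.random() >= cr:
        return individual
    if k < head_len:
        # Head site: FQ decides whether the new content is a symbol or a feature.
        use_symbol = rng.random() < fq
    else:
        # Tail site: keep the kind (symbol or feature) of the original content.
        use_symbol = individual[k] in symbols
    child = list(individual)
    if use_symbol:
        child[k] = pick(function_wheel, symbols, rng)
    else:
        # terminal_wheel is the fixed, MIC-derived wheel; it never changes during evolution.
        child[k] = pick(terminal_wheel, features, rng)
    return child

def greedy_select(parent, child, fitness):
    """Keep whichever of the two has the lower (better) fitness value."""
    return child if fitness(child) < fitness(parent) else parent
```

In a full run, `fitness` would decode the token list into an expression and return its training RMSE; that decoding step is outside the scope of this sketch.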

[0046] After several complete iterations, if the current best individual BestInd of the population meets the accuracy requirement or reaches the maximum generation limit, then the current BestInd is extracted as the final regression model given by the algorithm and validated on the test set. The method is similar to the fitness evaluation during training, and the following formula is given.

[0047] $$ \mathrm{RMSE}_{test} = \sqrt{\frac{1}{m_{test}} \sum_{i=1}^{m_{test}} \left( f(x_i) - y_i \right)^2} \qquad (9) $$

[0048] Where m_test represents the number of test samples, f represents the individual being evaluated, f(x_i) represents the predicted value of the individual on the i-th sample, and y_i represents the target value of the i-th sample.
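
The training and test evaluations apply the same root-mean-square error to different sample sets. A minimal sketch, assuming the decoded individual is available as a Python callable `model` (a hypothetical representation):

```python
import math

def rmse(model, samples, targets):
    """Root-mean-square error of model predictions against the target values."""
    m = len(samples)
    squared_error = sum((model(x) - y) ** 2 for x, y in zip(samples, targets))
    return math.sqrt(squared_error / m)

# Example with a hand-written stand-in for a decoded individual.
model = lambda x: x[0] + x[1]
print(rmse(model, [(1, 2), (3, 4)], [4, 8]))
```

Calling `rmse` on the training samples gives the fitness used for greedy selection, and calling it on the held-out samples gives the test value.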

[0049] The purpose of this invention is that, compared with traditional genetic programming algorithms, the mutation probability optimization method based on the maximum mutual information coefficient proposed in this invention can obtain a smaller RMSE value during training and testing. This means that this invention can effectively improve the training efficiency and generalization performance of genetic programming.

[0050] The effects of the present invention will be further demonstrated below with reference to experiments.

[0051] First, four target expressions are given:

[0052]

[0053]

[0054]

[0055]

[0056] Equations (10)-(13) are designated F1, F2, F3, and F4, respectively, where x0, x1, x2, x3, and x4 are the features used by each target expression. When constructing the dataset, each feature is a random value between 0 and 1. F1 and F4 use three relevant features, F3 uses four relevant features, and F2 uses five relevant features. y is the result obtained by substituting the generated x0, x1, x2, x3, and x4 into the above expressions. 200 samples are generated for each dataset, that is, 200 random values are generated for each x on each F. 140 samples are used to train the model, and 60 samples are used to test its generalization.
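
The dataset construction described above can be sketched as follows. The target expressions F1 to F4 are not reproduced in this text, so `placeholder_target` is a purely hypothetical stand-in and not one of the patent's expressions; the function names and the seed parameter are likewise assumptions.

```python
import math
import random

def make_dataset(target, n_features=5, n_samples=200, n_train=140, seed=0):
    """Draw features uniformly from [0, 1) and split into training and test samples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        x = [rng.random() for _ in range(n_features)]
        rows.append((x, target(x)))
    return rows[:n_train], rows[n_train:]

def placeholder_target(x):
    # HYPOTHETICAL expression using three of the five features; not F1-F4 from the patent.
    return math.sin(x[0]) + x[1] * x[2]

train, test = make_dataset(placeholder_target)
print(len(train), len(test))
```

Features that do not appear in the target expression act as irrelevant inputs, which is the situation in which a correlation-weighted selection probability is expected to help.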

[0057] Two experimental schemes were then designed. Scheme 1 uses the currently popular genetic programming variant algorithm SL-GEP, which has a stronger ability to solve complex symbolic regression problems than the standard genetic programming algorithm. The purpose of choosing it is to indirectly prove that the present invention still has a good auxiliary effect on the well-performing genetic programming algorithm. Scheme 2 introduces the mutation probability optimization method designed in this invention into SL-GEP. Both schemes are run 20 times with 20 different initialization scenarios. For comparison, each initialization scenario is consistent in both schemes. During each run, the training RMSE and test RMSE of BestInd are recorded every 25 generations. Finally, the average training fitness and average test fitness over 20 runs are calculated. The relevant parameter settings are shown in Table 1.

[0058] Table 1 Parameter Settings

[0059]
Parameter    Description                                     Value
POPSIZE      Population size                                 20
MAXGEN       Maximum number of generations                   5000
H            Chromosome length                               55
σ            Accuracy                                        0.0001
Rnd_Int      Symbol-selection threshold at initialization    0.3333
Symbol set                                                   +, -, ×, ÷, sin, cos, exp, log

[0060] The average training fitness and average test fitness of the final population for the two schemes are obtained for each dataset as follows.

[0061] Table 2. Experimental Results of Average Training Fitness

[0062]

[0063] Table 3. Average Test Fitness Experiment Results

[0064]

[0065] Referring to the results in Tables 2 and 3: in general, the best individual in the final population is also the final output of the algorithm, and in real-world scenarios it is used as the final fitted model. This statistic therefore measures the fitting accuracy and generalization of the algorithm, and represents the expressive performance of the method on regression problems and its ability to solve practical problems. The experimental results show that, in a statistical sense, the optimization method provided by this invention achieves better fitting and generalization than the original method, and the improvement in generalization performance over the original method is significant. On F2 and F3, the method of this invention improves on the original algorithm by an order of magnitude. On F4, it drives both training and testing accuracy below the target requirement, demonstrating its advantage on complex symbolic regression problems. The training efficiency and corresponding testing behavior of the two methods during evolution are shown in Figures 2-5. The proposed method clearly improves on the original method across all four problems. Because both methods use a greedy selection operator, their training processes steadily pursue lower RMSE values. The test RMSE does not necessarily follow the same pattern: an algorithm may overfit, in which case the test RMSE gradually increases. The original method overfits noticeably on problems F1 and F2, while the proposed method effectively mitigates this issue. Furthermore, on problem F4, the proposed method finds the target solution much earlier than the original method, saving computational resources. Taken together, the experimental results demonstrate superior fitting performance and generalization on complex symbolic regression problems with multivariate dependencies. The designed probability determination method and the idea of fixing the probabilities effectively enhance the search performance of the original method, giving it greater potential for finding the target solution.

[0066] The above embodiments are preferred embodiments of the present invention, but the embodiments of the present invention are not limited to the above embodiments. Any changes, modifications, substitutions, combinations, or simplifications made without departing from the spirit and principle of the present invention shall be considered equivalent substitutions and shall be included within the protection scope of the present invention.

Claims

1. A genetic programming mutation probability optimization method based on a maximum mutual information coefficient, characterized in that the method is applied to image processing and includes the following steps: Step S1: Use the maximum mutual information coefficient to measure the correlation between each feature and the target in the dataset, and combine the correlations between each feature and the target into a correlation vector; Step S2: Determine the probability distribution of genetic programming when selecting new features through mutation based on the correlation vector; Step S3: Perform genetic programming evolution, wherein during the evolution process the probability distribution is always fixed and is not affected by the number of times each feature is selected in the population; wherein, in step S2, when determining the probability distribution, the probability p_i of the i-th terminal is: $$ p_i = \frac{c_i}{\sum_{j=1}^{n} c_j} $$ wherein c_i and c_j denote the absolute values of the relevance of the i-th and j-th terminals, respectively, and n denotes the size of the feature set; in step S2, the probability distribution is constructed as a probability distribution vector terminal_probability in a roulette wheel manner, and the probability distribution vector is determined as follows, where the value of the first dimension is: $$ \mathrm{terminal\_probability}[1] = \frac{c_1}{\sum_{j=1}^{n} c_j} $$ and the value of each remaining dimension i, from 2 to n, is: $$ \mathrm{terminal\_probability}[i] = \mathrm{terminal\_probability}[i-1] + \frac{c_i}{\sum_{j=1}^{n} c_j} $$ the probability distribution vector terminal_probability is thereby designed as a roulette wheel, a random number between 0 and 1 is generated, and the selected feature is determined by judging which dimension the random number belongs to.

2. The genetic programming mutation probability optimization method based on maximum mutual information coefficient according to claim 1, characterized in that, for a given m×(n+1) dataset, the dataset contains m samples, and each sample has n features and one target, wherein the m samples are divided into two parts: one part is used for training and the other part is used for testing. A genetic programming algorithm is executed on the training samples to obtain a mathematical expression that fits the training data. During the training process, the test samples are not visible to the genetic programming algorithm. The mathematical expression is then tested on the test data. During the testing process, the target values of the test samples are not visible to the expression. By inputting the corresponding dimension values of a test sample into the specified features of the mathematical expression, the predicted value of the sample can be output. The difference between the predicted values and the target values of the test samples is compared to obtain a quantitative evaluation of the algorithm's performance.

3. The genetic programming mutation probability optimization method based on maximum mutual information coefficient according to claim 1, characterized in that, for variables X and Y, the mutual information between them is defined as follows: $$ I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} $$ where p(x) and p(y) are the marginal probability distribution functions of X and Y, and p(x, y) is the joint probability distribution function of the two; in the discrete case, the two-dimensional scatter data of variables X and Y are depicted in an a×b grid, where a and b represent the number of horizontal and vertical lines under a certain grid division method, respectively; the mutual information values under different a and b are calculated according to the mutual information formula, normalized, and compared, and the largest value is the maximum mutual information value.

4. The genetic programming mutation probability optimization method based on maximum mutual information coefficient according to claim 3, characterized in that, in the method for calculating the maximum mutual information coefficient, the values of a and b are determined by dynamic programming, and different a and b correspond to different grid spacings and different methods of dividing the scattered points.

5. The genetic programming mutation probability optimization method based on maximum mutual information coefficient according to claim 1, characterized in that, In step S2, the probability distribution of genetic programming in selecting a new terminal based on the correlation vector is determined, where each value in the correlation vector is the absolute value of the correlation between each terminal and the target.

6. A genetic programming mutation probability optimization method based on maximum mutual information coefficient according to any one of claims 1-5, characterized in that, Genetic programming algorithms take "survival of the fittest" as their core idea and generate explicit expression solutions within a certain number of generations or time. During the training process, the genetic programming algorithm first initializes the population. Each individual in the population obtains possible solutions through a specified encoding method, and the individual can decode them into a mathematical expression with practical meaning. Next, several generations of evolution are performed on the initial population. In each generation, each individual has a certain probability of performing genetic operator operations to obtain a newly generated temporary population. The fitness of individuals in the temporary population is compared with that of their corresponding parent individuals. The better individuals are retained in the new generation population. When the evolution reaches a specified number of generations, the individual with the highest fitness in the last generation population is selected as the output expression.

7. The genetic programming mutation probability optimization method based on maximum mutual information coefficient according to claim 6, characterized in that, When an individual executes the crossover operator with a certain probability, it will exchange a part of its structure with other individuals in the population to generate a new individual. When an individual executes the mutation operator with a certain probability, it will replace the individual's current features. The replacement feature candidates come from all features in the dataset, and the probability of each feature being selected depends on the probability distribution determined in step S2.

8. The genetic programming mutation probability optimization method based on maximum mutual information coefficient according to claim 6, characterized in that comparing the fitness of new individuals with that of their parents involves fitness evaluation, and the evaluation function is: $$ \mathrm{RMSE}_{train} = \sqrt{\frac{1}{m_{train}} \sum_{i=1}^{m_{train}} \left( f(x_i) - y_i \right)^2} $$ where m_train represents the number of training samples, f represents the individual being evaluated, f(x_i) represents the predicted value of the individual on the i-th sample, and y_i represents the target value of the i-th sample.