Method, system, device and storage medium for estimating regional crop yield
By using a random forest model combined with a rotational correction algorithm in crop yield estimation, the problems of two-tailed error and overfitting in existing technologies are solved, achieving accurate and unbiased estimation of crop yield and improving the model's correction effect and applicability.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: SHENZHEN INST OF ADVANCED TECH
- Filing Date: 2023-07-07
- Publication Date: 2026-06-23
AI Technical Summary
Existing machine learning models suffer from insufficient two-tailed error correction and overfitting in crop yield estimation, leading to overestimation in low-value areas and underestimation in high-value areas. This makes it impossible to accurately identify low-yield and high-yield regions, affecting agricultural production guidance and food security policy formulation.
A random forest model is used for pre-training simulation, and a rotation correction algorithm is used to correct the model error. The model results are rotated and corrected by a linear regression function so that they are corrected along a 1:1 line, which improves the correction effect in the low/high value region and avoids overfitting.
It achieves unbiased estimation of crop yield, improves the model's correction effect and generalization ability in low/high value regions, and enhances the model's applicability and accuracy.
Figure CN116843078B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of agricultural remote sensing monitoring technology, and in particular to a method, system, equipment and storage medium for estimating regional crop yield.
Background Technology
[0002] Timely and accurate estimation of regional crop yields is crucial for guiding agricultural production and ensuring food security. With the continuous development of satellite remote sensing and agricultural information technology, the scale of agricultural data has exploded. Physical or statistical models coupling multi-source remote sensing and ground survey data have become important methods for estimating regional crop yields. Compared to physical process models describing crop growth and development, data-driven machine learning models have a concise and efficient model structure and strong nonlinear fitting advantages, and have been widely used in farmland yield simulation research in recent years. Commonly used machine learning models include Random Forest (RF), Artificial Neural Network (ANN), Support Vector Regressor (SVR), and eXtreme Gradient Boosting (XGB).
[0003] Machine learning models make no assumptions about data distribution or relationships between variables; they learn the complex correlations between crop yield and multiple predictor variables directly from training data and generally exhibit high simulation accuracy. However, unbiased estimation in this context only means that the sum of the simulation errors approaches zero. Across the simulated yield distribution, machine learning models often exhibit "overestimation in low-value areas" and "underestimation in high-value areas," defined here as "two-tailed error." Existing research usually focuses on the overall unbiasedness of the model and ignores this two-tailed error. For regional yield estimation, however, accurately simulating both tails of the yield distribution is crucial. For example, accurately identifying low-yield areas is essential for guiding farmers' production management and improving the planting environment, thereby increasing crop yield; accurately estimating high-yield areas helps set reasonable regional high-yield targets and determine crop production potential, and thus provides a reliable scientific basis for formulating food security policies.
[0004] Currently, the main methods for correcting two-tailed errors are regression and machine learning residual models. Regression is relatively simple to apply, but it only provides a linear fit between simulated and observed values. When the simulated values are widely scattered and the fit is poor, the difference between the corrected and observed values remains significant: overestimates remain overestimates and underestimates remain underestimates, so the correction effect is less than ideal. Training another machine learning model to learn the simulation residuals of the original model often leads to overfitting. That is, during training, the two-tailed error and the inherent noise in the training data are fully learned, so the residual model fits the training data too tightly and fails to generalize to independent validation datasets, resulting in less than ideal simulated yield values after correction.
Summary of the Invention
[0005] Therefore, it is necessary to provide a regional crop yield estimation method, system, equipment, and storage medium that has relatively complete correction and strong generalization ability, addressing the technical deficiencies of existing technologies such as insufficient correction and overfitting.
[0006] To solve the above problems, this application adopts the following technical solution:
[0007] One of the objectives of this application is to provide a method for estimating regional crop yields, comprising the following steps:
[0008] A random forest model is used to pre-train and simulate crop yields within the region;
[0009] The error in the prediction results of the random forest model is corrected using a rotation correction algorithm;
[0010] A corrected random forest model is used to predict regional crop yields.
[0011] In some embodiments, the step of pre-training and simulating crop yields within a region using a random forest model specifically includes the following steps:
[0012] Collect crop growth data within the region;
[0013] Construct a feature variable space based on the growth data;
[0014] A random forest model is established based on the feature variable space, and the random forest model is pre-trained and validated.
[0015] In some embodiments, in the step of collecting crop growth data within a region, the growth data includes yield data, time-series variable data, and static parameter data. The time-series variable data includes meteorological variables, remote-sensed vegetation variables, soil variables, and physiological process variables. The meteorological variables include air temperature, precipitation, and solar radiation. The remote-sensed vegetation variables include enhanced vegetation index and normalized difference vegetation index. The soil variables include soil temperature and soil moisture content. The physiological process variables include latent heat flux, sensible heat flux, canopy transpiration, and potential evaporation. The static parameter data includes growth environment parameters, including elevation, latitude, and soil physicochemical properties.
[0016] In some embodiments, the step of constructing the feature variable space based on the growth data specifically includes the following steps:
[0017] A dynamic correlation analysis is performed on the time-series variable data and the yield data with a step size of 16 days to determine the most correlated variable and the most correlated time interval.
[0018] A static correlation analysis is performed on the growth environment parameters and the crop yield data for each year, using the year as the unit.
[0019] The feature variable space is constructed by selecting variables with a Pearson correlation coefficient greater than 0.2, where the correlation coefficient is calculated using the following formula:
[0020] $r = \dfrac{\sum_{i=1}^{n} (v_i - \bar{v})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (v_i - \bar{v})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2}}$
[0021] In the formula: n is the total number of counties where crops are grown; $v_i$ is the mean of the feature variable for county i; $x_i$ is the statistical crop yield of county i; and $\bar{v}$ and $\bar{x}$ are the means of the feature variable and of the statistical yield over all counties, respectively.
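The screening rule in [0019] to [0021] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `pearson_r` and `select_features`, the sample values, and the choice to treat a constant series as uncorrelated are all assumptions made for the example. The threshold is applied to the signed coefficient, as the text states "greater than 0.2".

```python
import math

def pearson_r(v, x):
    """Pearson correlation between feature values v and yields x (one pair per county)."""
    n = len(v)
    if n != len(x) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    v_bar = sum(v) / n
    x_bar = sum(x) / n
    cov = sum((vi - v_bar) * (xi - x_bar) for vi, xi in zip(v, x))
    var_v = sum((vi - v_bar) ** 2 for vi in v)
    var_x = sum((xi - x_bar) ** 2 for xi in x)
    if var_v == 0 or var_x == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / math.sqrt(var_v * var_x)

def select_features(features, yields, threshold=0.2):
    """Keep the names of features whose r with yield exceeds the threshold."""
    return [name for name, values in features.items()
            if pearson_r(values, yields) > threshold]

# Hypothetical county-level values, for illustration only.
yields = [5.0, 6.0, 7.0, 8.0]
features = {
    "evi": [0.30, 0.35, 0.42, 0.50],            # rises with yield
    "elevation": [300.0, 120.0, 310.0, 125.0],  # negatively related here
}
selected = select_features(features, yields)
```

With these made-up numbers only the vegetation index passes the threshold; whether negatively correlated variables should also be retained (for example by thresholding the absolute value) is not stated in the text.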
[0022] In some embodiments, the steps of building a random forest model based on the feature variable space and pre-training and validating the random forest model specifically include the following steps:
[0023] Construct a random forest model based on the feature variable space;
[0024] The random forest model is trained and validated using leave-one-out cross-validation.
[0025] In some embodiments, the step of correcting the significant two-tailed error in the random forest model using a rotation correction algorithm specifically includes the following steps:
[0026] In the model training step, the training results of the random forest model are linearly fitted to obtain a linear regression function (y = ax + b), which quantifies the two-tailed bias present in the model during training. Each simulation result $P_t(x_{0,t}, y_{0,t})$ is then rotated about the intersection point $C(x_c, y_c)$ of the regression line (y = ax + b) and the 1:1 line (y = x) onto the 1:1 line, giving the corrected model training result $P'_t(x_{r,t}, y_{r,t})$. Here, x and y denote the sets of county-level statistical yields and random forest simulated yields, respectively; a and b are the fitted regression coefficients; $P_t$ is the scatter point corresponding to a given county in the training dataset, with $x_{0,t}$ and $y_{0,t}$ being that county's statistical and simulated yields, respectively; $x_c$ and $y_c$ are the x and y coordinates of the intersection point C in the scatter plot, i.e., its statistical and simulated yields, respectively; and $x_{r,t}$ and $y_{r,t}$ are the scatter coordinates of the corrected point $P'_t$, where $y_{r,t}$ is the corrected yield.
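The training-step correction described in [0026] can be sketched as below. This is a hedged illustration under stated assumptions: the helper names (`fit_line`, `rotation_params`, `rotate_point`) and the sample yields are hypothetical, the line is fitted by ordinary least squares, and the rotation angle is taken as α = π/4 − arctan(a), i.e. the counterclockwise angle from the regression line to the 1:1 line, which the text does not spell out.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = a*x + b; returns (a, b)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, y_bar - a * x_bar

def rotation_params(a, b):
    """Intersection coordinate c (x_c = y_c = c) and rotation angle alpha."""
    if a == 1:
        raise ValueError("regression line is parallel to the 1:1 line")
    c = b / (1 - a)
    alpha = math.pi / 4 - math.atan(a)  # assumed form of the angle
    return c, alpha

def rotate_point(x0, y0, c, alpha):
    """Rotate (x0, y0) counterclockwise by alpha about C = (c, c)."""
    dx, dy = x0 - c, y0 - c
    xr = dx * math.cos(alpha) - dy * math.sin(alpha) + c
    yr = dx * math.sin(alpha) + dy * math.cos(alpha) + c
    return xr, yr

# Hypothetical training data: statistical yields and random forest simulations.
stat = [2.0, 4.0, 6.0, 8.0, 10.0]
sim = [3.4, 4.1, 6.2, 6.9, 8.4]   # overestimates low yields, underestimates high ones

a, b = fit_line(stat, sim)          # quantifies the two-tailed bias
c, alpha = rotation_params(a, b)
corrected = [rotate_point(x0, y0, c, alpha) for x0, y0 in zip(stat, sim)]
```

Because a rotation is rigid, the scatter of the points around the regression line is carried over unchanged around the 1:1 line; only the systematic tilt is removed.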
[0027] In some embodiments, during the model prediction step, the simulation results of the random forest model are rotated about point C onto the 1:1 line using the linear regression function obtained during training, giving the corrected yield prediction result $y_{r,v}$. Specifically, this includes the following steps:
[0028] Assume that the point $P_v$ lies on the fitted regression line (y = ax + b), calculate the x-coordinate $x_{0,v}$ of $P_v$, and rotate $P_v$ about the point $C(x_c, y_c)$ to obtain the corrected yield prediction result $y_{r,v}$, where:
[0029] $x_{0,v} = (y_{0,v} - b) / a$, $y_c = x_c = b / (1 - a)$
[0030] $y_{r,v} = (x_{0,v} - x_c) \cdot \sin\alpha + (y_{0,v} - y_c) \cdot \cos\alpha + y_c$;
[0031] Where: $P_v$ is the scatter point corresponding to a given county; its y-coordinate $y_{0,v}$, the crop yield predicted by the random forest model, is known, while its x-coordinate $x_{0,v}$, the county's statistical yield, is unknown in the prediction step; $x_c$ and $y_c$ are the x and y coordinates of the intersection of the regression line and the 1:1 line; α is the angle between the regression line and the 1:1 line; and $y_{r,v}$ is the corrected yield prediction result.
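The prediction-step formulas in [0029] and [0030] can be sketched as a single function. This is an illustrative sketch only: the function name `correct_prediction` and the coefficients a = 0.64, b = 1.96 are hypothetical, and the angle is assumed to be α = π/4 − arctan(a), which is one reading of "the angle between the regression line and the 1:1 line".

```python
import math

def correct_prediction(y0_v, a, b):
    """Rotation-corrected yield for one raw random forest prediction y0_v."""
    if a == 0 or a == 1:
        raise ValueError("slope must differ from 0 and from 1")
    x0_v = (y0_v - b) / a               # assumed position on the regression line
    x_c = y_c = b / (1 - a)             # intersection with the 1:1 line
    alpha = math.pi / 4 - math.atan(a)  # assumption: angle between the two lines
    return (x0_v - x_c) * math.sin(alpha) + (y0_v - y_c) * math.cos(alpha) + y_c

# Hypothetical regression coefficients from the training step.
a, b = 0.64, 1.96
low = correct_prediction(3.0, a, b)    # a low raw prediction is pulled down
high = correct_prediction(9.0, a, b)   # a high raw prediction is pushed up
```

For a slope between 0 and 1, the corrected value falls between the raw prediction and the inverse-regression value $x_{0,v}$, since the rotation preserves the point's distance from C rather than projecting it onto the x-axis.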
[0032] The second objective of this application is to provide a regional crop yield estimation system, including:
[0033] The pre-training module uses a random forest model to simulate crop yields within the region.
[0034] The correction module uses a rotation correction algorithm to correct the errors in the prediction results of the random forest model;
[0035] The prediction module uses a calibrated random forest model to predict regional crop yields.
[0036] In some embodiments, the pre-training module includes:
[0037] The data acquisition unit is used to collect crop growth data within the region;
[0038] Feature construction unit, used to construct feature variable space based on the growth data;
[0039] The training unit is used to construct a random forest model based on crop growth data within the region and to train the random forest model.
[0040] A third objective of this application is to provide an apparatus comprising a processor and a memory coupled to the processor, wherein...
[0041] The memory stores program instructions for implementing the regional crop yield estimation method described above;
[0042] The processor is used to execute the program instructions stored in the memory to estimate the regional crop yield.
[0043] The fourth objective of this application is to provide a storage medium storing processor-executable program instructions for executing the regional crop yield estimation method.
[0044] The present application adopts the above technical solution, and its beneficial effects are as follows:
[0045] The regional crop yield estimation method, system, equipment, and storage medium provided in this application employ a rotation correction algorithm to calibrate the error of a random forest yield estimation model. The calibrated random forest model then estimates and predicts the yield of crops within the region. Compared with existing regression correction algorithms and machine learning correction algorithms, the rotation correction algorithm proposed in this application further rotates and corrects the linearly fitted regression model to improve its correction effect in low / high value areas. Compared with superimposing a machine learning error model, it has stronger generalization ability, avoids overfitting during training caused by using overly complex machine learning models, and has better correction effect and wider applicability.
Attached Figure Description
[0046] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments of this application or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0047] Figure 1 is a flowchart of the steps in the regional crop yield estimation method provided in Embodiment 1 of the present invention;
[0048] Figure 2 is a schematic diagram illustrating the principle of the regional crop yield estimation method provided in Embodiment 1 of the present invention;
[0049] Figure 3 is a schematic diagram of the rotation correction steps provided in Embodiment 1 of the present invention;
[0050] Figure 4 is a schematic diagram showing the model training and prediction results before and after correction provided in Embodiment 1 of the present invention;
[0051] Figure 5 is a schematic diagram of the regional crop yield estimation system provided in Embodiment 2 of the present invention;
[0052] Figure 6 is a schematic diagram of the device structure provided in Embodiment 3 of the present invention;
[0053] Figure 7 is a schematic diagram of the storage medium structure provided in Embodiment 4 of the present invention.
Detailed Implementation
[0054] The embodiments of this application are described in detail below. Examples of these embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and intended to explain this application, and should not be construed as limiting this application.
[0055] In the description of this application, it should be understood that the terms "upper", "lower", "horizontal", "inner", "outer", etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, and are only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application.
[0056] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0057] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments.
[0058] Example 1
[0059] Please refer to Figure 1 and Figure 2, which are a flowchart and a schematic diagram of the regional crop yield estimation method provided in this embodiment. The method includes the following steps S110 to S130; the implementation of each step is described in detail below.
[0060] Step S110: Use a random forest model to pre-train and simulate crop yields within the region.
[0061] In this embodiment, the step of using a random forest model to pre-train and simulate crop yields within a region specifically includes the following steps S111 to S113.
[0062] Step S111: Collect crop growth data within the region.
[0063] Specifically, the growth data includes yield data, time-series variable data, and static parameter data. The time-series variable data includes meteorological variables, remote-sensed vegetation variables, soil variables, and physiological process variables. The meteorological variables include air temperature, precipitation, and solar radiation. The remote-sensed vegetation variables include enhanced vegetation index and normalized difference vegetation index. The soil variables include soil temperature and soil moisture content. The physiological process variables include latent heat flux, sensible heat flux, canopy transpiration, and potential evaporation. The static parameter data includes growth environment parameters, including elevation, latitude, and soil physicochemical properties.
[0064] In this embodiment, statistical rice yield data at the county level in the three northeastern provinces (Heilongjiang, Jilin, and Liaoning) from 2006 to 2016 were collected. At the same time, time-series variable data closely related to rice yield formation were downloaded, including: ① meteorological variables (temperature, precipitation, solar radiation), ② remote sensing vegetation variables (enhanced vegetation index EVI, normalized difference vegetation index NDVI), ③ soil variables (soil temperature, soil moisture content), ④ physiological process variables (latent heat flux, sensible heat flux, canopy transpiration and potential evaporation, etc.); and static parameter data, including ⑤ growth environment parameters (elevation, latitude, soil physicochemical properties).
[0065] Step S112: Construct a feature variable space based on the growth data.
[0066] In this embodiment, the step of constructing the feature variable space based on the growth data specifically includes: performing dynamic correlation analysis on the time-series variable data and the yield data with a step size of 16 days to determine the most relevant variable and the most relevant time interval; performing static correlation analysis on the growth environment parameters and crop yield data of each year with a unit of year; and selecting variables with a Pearson correlation coefficient greater than 0.2 to construct the feature variable space.
[0067] The formula for calculating the correlation coefficient is as follows:
[0068] $r = \dfrac{\sum_{i=1}^{n} (v_i - \bar{v})(x_i - \bar{x})}{\sqrt{\sum_{i=1}^{n} (v_i - \bar{v})^2 \, \sum_{i=1}^{n} (x_i - \bar{x})^2}}$
[0069] In the formula: n is the total number of counties where crops are grown; $v_i$ is the mean of the feature variable for county i; $x_i$ is the statistical crop yield of county i; and $\bar{v}$ and $\bar{x}$ are the means of the feature variable and of the statistical yield over all counties, respectively.
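One possible reading of the 16-day dynamic correlation analysis in Step S112 is sketched below. The details are assumptions rather than statements from the embodiment: each county is taken to have a daily series for one variable, windows are non-overlapping and 16 days long, each window is summarized by its mean, and the "most relevant time interval" is the window with the largest correlation against county yields. The names `best_window` and `soil_moisture` and all numbers are hypothetical.

```python
import math

def pearson_r(v, x):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(v)
    v_bar, x_bar = sum(v) / n, sum(x) / n
    cov = sum((vi - v_bar) * (xi - x_bar) for vi, xi in zip(v, x))
    den = math.sqrt(sum((vi - v_bar) ** 2 for vi in v) *
                    sum((xi - x_bar) ** 2 for xi in x))
    return cov / den if den > 0 else 0.0

def best_window(series_by_county, yields, step=16):
    """Return (start_day, r) for the step-day window that correlates best with yield."""
    n_days = min(len(s) for s in series_by_county)
    best = None
    for start in range(0, n_days - step + 1, step):
        # Mean of the variable over this window, one value per county.
        means = [sum(s[start:start + step]) / step for s in series_by_county]
        r = pearson_r(means, yields)
        if best is None or r > best[1]:
            best = (start, r)
    return best

# Hypothetical data: three counties, 32 daily values of one variable each.
yields = [4.0, 6.0, 8.0]
soil_moisture = [
    [1.0] * 16 + [0.4] * 16,
    [0.5] * 16 + [0.6] * 16,
    [0.9] * 16 + [0.8] * 16,
]
start, r = best_window(soil_moisture, yields)
```

Running the same scan for every time-series variable and keeping the windows whose coefficient exceeds 0.2 would then give the time-series part of the feature variable space.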
[0070] Step S113: Establish a random forest model based on the feature variable space, and pre-train and validate the random forest model.
[0071] It is understandable that existing machine learning yield estimation models often ignore the two-tailed error caused by insufficient representativeness of the sample distribution, resulting in significant two-tailed errors such as overestimation of low yields and underestimation of high yields.
[0072] In this embodiment, a random forest model is used, and leave-one-out cross-validation is used to pre-train and validate the random forest model. That is, when making a yield prediction for a certain year, the county-level statistical yield data for that year is only used for model validation, while the data for other years are used for model training.
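The year-wise hold-out scheme described in [0072] can be sketched as a small split generator. The function name `leave_one_year_out` and the sample years are hypothetical; the model fitting itself is left out so that the example stays dependency-free.

```python
def leave_one_year_out(years):
    """Yield (held_out_year, train_idx, valid_idx) for each distinct year."""
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        valid_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, valid_idx

# Hypothetical sample index: one entry per county-year record.
years = [2006, 2006, 2007, 2007, 2008]
splits = list(leave_one_year_out(years))
```

Each split trains on all records from the other years and validates on the held-out year only. In a scikit-learn workflow the same grouping is available through `LeaveOneGroupOut` with the year as the group label, although the embodiment does not name a specific library.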
[0073] Step S120: Use the rotation correction algorithm to correct the errors in the prediction results of the random forest model.
[0074] It is understandable that existing machine learning yield estimation models often ignore the two-tailed error caused by insufficient representativeness of the sample distribution, resulting in significant two-tailed errors such as overestimation of low yield and underestimation of high yield. In this embodiment, a two-tailed error rotation correction algorithm is proposed to correct model bias and improve model accuracy.
[0075] Please refer to Figure 3, which is a schematic diagram of the rotation correction steps provided in this embodiment:
[0076] In the model training step, the training results of the random forest model are linearly fitted to obtain a linear regression function (y = ax + b), which quantifies the two-tailed bias present in the model during training. Each simulation result $P_t(x_{0,t}, y_{0,t})$ is then rotated about the intersection point $C(x_c, y_c)$ of the regression line (y = ax + b) and the 1:1 line (y = x) onto the 1:1 line, giving the corrected model training result $P'_t(x_{r,t}, y_{r,t})$. Here, x and y denote the sets of county-level statistical yields and random forest simulated yields, respectively; a and b are the fitted regression coefficients; $P_t$ is the scatter point corresponding to a given county in the training dataset, with $x_{0,t}$ and $y_{0,t}$ being that county's statistical and simulated yields, respectively; $x_c$ and $y_c$ are the x and y coordinates of the intersection point C in the scatter plot, i.e., its statistical and simulated yields, respectively; and $x_{r,t}$ and $y_{r,t}$ are the scatter coordinates of the corrected point $P'_t$, where $y_{r,t}$ is the corrected yield; see Figure 3(I), Model Training and Calibration;
[0077] In the model prediction step, the simulation results of the random forest model are rotated about point C onto the 1:1 line using the linear regression function obtained during training, giving the corrected yield prediction result $y_{r,v}$; see Figure 3(II), Model Prediction and Correction.
[0078] It should be noted that the observed yield is unknown in the prediction step, i.e., the x-coordinate $x_{0,v}$ of point $P_v$ is unknown, so $P_v$ cannot be rotated directly. It is first assumed that $P_v$ lies on the regression line (y = ax + b) fitted in the calibration step; the x-coordinate $x_{0,v}$ of $P_v$ is calculated (Equation 1), and the point is then rotated about $C(x_c, y_c)$ (the coordinates of C are calculated using Equation 2, and the rotation is given by Equations 3-4) to obtain the corrected yield prediction result $y_{r,v}$.
[0079] $x_{0,v} = (y_{0,v} - b) / a$ (1)
[0080] $y_c = x_c = b / (1 - a)$ (2)
[0081]
[0082] $y_{r,v} = (x_{0,v} - x_c) \cdot \sin\alpha + (y_{0,v} - y_c) \cdot \cos\alpha + y_c$ (4)
[0083] Where: for the scatter point $P_v$ corresponding to a given county in the validation dataset, the y-coordinate $y_{0,v}$, the crop yield predicted by the random forest model, is known, while the x-coordinate $x_{0,v}$, the county's statistical yield, is unknown in the prediction step; $x_c$ and $y_c$ are the x and y coordinates of the intersection of the regression line and the 1:1 line; and $y_{r,v}$ is the corrected yield prediction result.
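As a worked check of the prediction-step rotation, the derivation below shows that a point assumed to lie on the regression line is carried onto the 1:1 line. It rests on two assumptions that are not stated explicitly in the text: the angle is taken as $\alpha = \pi/4 - \arctan a$, and the x-coordinate is rotated with the standard companion formula $x_{r,v} = (x_{0,v} - x_c)\cos\alpha - (y_{0,v} - y_c)\sin\alpha + x_c$.

```latex
\begin{aligned}
y_{0,v} - y_c &= a\,(x_{0,v} - x_c), \qquad \theta = \arctan a, \quad \alpha = \tfrac{\pi}{4} - \theta \\
y_{r,v} - y_c &= (x_{0,v} - x_c)\,(\sin\alpha + \tan\theta\,\cos\alpha)
             = (x_{0,v} - x_c)\,\frac{\sin(\alpha + \theta)}{\cos\theta}
             = (x_{0,v} - x_c)\,\sqrt{\tfrac{1 + a^{2}}{2}} \\
x_{r,v} - x_c &= (x_{0,v} - x_c)\,(\cos\alpha - \tan\theta\,\sin\alpha)
             = (x_{0,v} - x_c)\,\frac{\cos(\alpha + \theta)}{\cos\theta}
             = (x_{0,v} - x_c)\,\sqrt{\tfrac{1 + a^{2}}{2}}
\end{aligned}
```

The first line holds because both $P_v$ and C satisfy y = ax + b. Since $x_c = y_c$, the last two lines give $x_{r,v} = y_{r,v}$, so the corrected point lies on the 1:1 line. Under these assumptions the corrected yield also has the closed form $y_{r,v} = y_c + (x_{0,v} - x_c)\sqrt{(1 + a^{2})/2}$.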
[0084] Step S130: Use a calibrated random forest model to predict regional crop yields.
[0085] After the above corrections, the model training and prediction results before and after corrections are compared to verify the model accuracy and regional applicability.
[0086] Please refer to Figure 4, which is a schematic diagram showing the model training and prediction results before and after correction provided in this embodiment.
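One simple way to compare results before and after correction is to measure the mean error separately in the low-yield and high-yield tails. The metric below is an assumption chosen for illustration, since the embodiment does not specify which statistics are compared; the function name `tail_bias`, the tail fraction, and the sample values are hypothetical.

```python
def tail_bias(observed, simulated, frac=0.25):
    """Mean (simulated - observed) in the lowest and highest `frac` of observed yields."""
    pairs = sorted(zip(observed, simulated))   # sort records by observed yield
    k = max(1, int(len(pairs) * frac))          # number of records in each tail
    low = sum(s - o for o, s in pairs[:k]) / k
    high = sum(s - o for o, s in pairs[-k:]) / k
    return low, high

# Hypothetical statistical yields and uncorrected simulations.
obs = [2.0, 4.0, 6.0, 8.0, 10.0]
sim = [3.4, 4.1, 6.2, 6.9, 8.4]
low, high = tail_bias(obs, sim, frac=0.2)
```

A positive value in the low tail together with a negative value in the high tail is the two-tailed error pattern described above; applying the same function to corrected yields would indicate how much of that pattern the rotation removes.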
[0087] It can be understood that, in the model training step, a regression is fitted between the statistical crop yields of the training set and the model simulated values to obtain a linear regression function that quantifies the model bias; the linear regression function is then rotated onto the 1:1 line to correct the two-tailed error of the model. The corrected model is then applied in the prediction step to estimate crop yield, achieving an unbiased estimate of the crop yield.
[0088] The regional crop yield estimation method provided in Embodiment 1 of this application, compared with existing regression correction algorithms and machine learning correction algorithms, performs further rotation correction on the linearly fitted regression model to improve its correction effect on low / high value areas; and compared with superimposing a machine learning error model, it has stronger generalization ability, avoids the overfitting phenomenon caused by using overly complex machine learning models during training, and has better correction effect and wider applicability.
[0089] Example 2
[0090] Please refer to Figure 5, which shows the structure of the regional crop yield estimation system provided in Embodiment 2 of this application. The system includes a pre-training module 10, a correction module 20, and a prediction module 30. The implementation of each module is described in detail below.
[0091] The pre-training module 10 uses a random forest model to pre-train and simulate crop yields in the region.
[0092] In this embodiment, the pre-training module 10 includes:
[0093] Data acquisition unit 11 is used to collect crop growth data within the area;
[0094] Feature construction unit 12 is used to construct a feature variable space based on the growth data;
[0095] Training unit 13 is used to construct a random forest model based on crop growth data in the region, and to train and validate the random forest model.
[0096] The correction module 20 uses a rotation correction algorithm to correct the errors in the prediction results of the random forest model, and the prediction module 30 uses the corrected random forest model to predict regional crop yields.
[0097] For the detailed implementation of the regional crop yield estimation system provided in this embodiment, refer to Embodiment 1; it is not repeated here.
[0098] The regional crop yield estimation system provided in Embodiment 2 of this application is coupled with an error correction module based on rotation correction. Compared with existing regression correction algorithms and machine learning correction algorithms, the rotation correction algorithm proposed in this application performs further rotation correction on the linear fitting error regression model to improve its correction effect on low / high value area errors.
[0099] Example 3
[0100] Please refer to Figure 6, which is a schematic diagram of the device structure according to an embodiment of this application. The device 50 includes a processor 51 and a memory 52 coupled to the processor 51.
[0101] The memory 52 stores program instructions for implementing the regional crop yield estimation method described above.
[0102] The processor 51 is used to execute the program instructions stored in the memory 52 to estimate the regional crop yield.
[0103] The processor 51 can also be referred to as a CPU (Central Processing Unit). The processor 51 may be an integrated circuit chip with signal processing capabilities. The processor 51 can also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor can be a microprocessor or any conventional processor.
[0104] Example 4
[0105] Please refer to Figure 7, which is a schematic diagram of the structure of the storage medium according to an embodiment of this application. The storage medium of this embodiment stores a program file 61 capable of implementing all the above methods. This program file 61 can be stored in the storage medium in the form of a software product, including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods of various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks, or devices such as computers, servers, mobile phones, and tablets.
[0106] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0107] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0108] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The system embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection of units or modules may be electrical or other forms.
[0109] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0110] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0111] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0112] It is understood that the technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0113] The above are merely preferred embodiments of this application, and only specifically describe the technical principles of this application. These descriptions are only for explaining the principles of this application and should not be construed as limiting the scope of protection of this application in any way. Based on this explanation, any modifications, equivalent substitutions, and improvements made within the spirit and principles of this application, as well as other specific embodiments of this application that can be conceived by those skilled in the art without creative effort, should be included within the scope of protection of this application.
Claims
1. A method for estimating regional crop yield, characterized in that it includes the following steps: pre-training and simulating crop yields within the region using a random forest model; correcting the error in the prediction results of the random forest model using a rotation correction algorithm; and predicting regional crop yields using the corrected random forest model.
The pre-training simulation of regional crop yields using the random forest model specifically includes the following steps: collecting crop growth data within the region; constructing a feature variable space based on the growth data; and establishing a random forest model based on the feature variable space, and pre-training and validating the random forest model.
Correcting the error in the prediction results of the random forest model using the rotation correction algorithm specifically includes the following steps: using the measured crop yields in the training dataset, i.e., the county-level statistical yields, linearly fitting the training results of the random forest model to obtain a linear regression function, which quantifies the two-tailed bias present in the model during training. The linear regression function is [formula] and the simulation results are [formula], where: [symbols] denote the set of county-level statistical yields and the set of yields simulated by the random forest model, respectively; [symbols] are the fitted regression coefficients; [symbol] denotes the scatter point corresponding to a given county in the training dataset; and [symbols] denote the statistical and simulated yields of that county, respectively. Taking the intersection [symbol] of the linear regression function and the 1:1 line as the center, the scatter points are rotated and corrected onto the 1:1 line to obtain the corrected model training result [formula], where: the 1:1 line is [formula]; [symbols] denote the x and y coordinates of the intersection point, the x and y axes of the scatter plot representing the statistical and simulated yields, respectively; and [symbol] denotes the scatter coordinates of the point after correction, where [symbol] is the corrected yield.
In the model prediction step, the linear regression function obtained during training is used to apply the same rotation about the intersection point to the simulation results of the random forest model, correcting them onto the 1:1 line and yielding the corrected yield prediction results. Specifically, this includes the following steps: given the known simulated yield [symbol] of a county, assume that the point [symbol] corresponding to that county in the scatter plot lies on the fitted linear regression function, obtain the x-coordinate [symbol] of the point, and rotate the point about the intersection point [symbol] to obtain the corrected yield prediction result [symbol], where: [formula], [formula], [formula], [formula]. Here, the ordinate [symbol] of the validation scatter point corresponding to a given county is the crop yield predicted by the random forest model, which is known, while the abscissa [symbol] is the statistical yield of that county; [symbols] are the x and y coordinates of the point where the regression line intersects the 1:1 line; [symbol] is the angle between the regression line and the 1:1 line; and [symbol] is the corrected yield prediction result.
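The rotation correction recited in claim 1 can be illustrated with a minimal Python sketch. The claim's formulas are not reproduced in this text, so the geometry below is an assumption rather than the patented formulas: the regression line is taken as y = a*x + b (x the statistical yield, y the simulated yield), the rotation center is its intersection with y = x, and the rotation angle is pi/4 - arctan(a). All function names are hypothetical.

```python
import math


def fit_line(x, y):
    """Least-squares fit of y = a * x + b; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def rotation_params(a, b):
    """Return (x0, y0, theta): the intersection of y = a*x + b with y = x,
    and the angle that rotates the regression line onto the 1:1 line.
    The angle convention is an assumption of this sketch."""
    if math.isclose(a, 1.0):
        raise ValueError("regression line is parallel to the 1:1 line")
    x0 = b / (1.0 - a)
    theta = math.pi / 4 - math.atan(a)
    return x0, x0, theta


def rotate(x, y, x0, y0, theta):
    """Rotate point (x, y) counterclockwise by theta about (x0, y0)."""
    dx, dy = x - x0, y - y0
    c, s = math.cos(theta), math.sin(theta)
    return x0 + dx * c - dy * s, y0 + dx * s + dy * c


def correct_prediction(y_sim, a, b):
    """Correct one simulated yield when the statistical yield is unknown.
    Requires a != 0 and a != 1."""
    x0, y0, theta = rotation_params(a, b)
    # Assume the county's point lies on the fitted regression line.
    x_on_line = (y_sim - b) / a
    _, y_corr = rotate(x_on_line, y_sim, x0, y0, theta)
    return y_corr


# Illustrative (made-up) training data: statistical vs. simulated yields.
stat = [2.0, 4.0, 6.0, 8.0]
sim = [3.5, 4.5, 5.5, 6.5]
a, b = fit_line(stat, sim)
print([round(correct_prediction(v, a, b), 3) for v in sim])
```

Under these assumptions, a fitted slope below 1 means the correction pushes low simulated yields lower and high simulated yields higher, which is the direction needed to counter the two-tailed bias described in the claim.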
2. The regional crop yield estimation method as described in claim 1, characterized in that, in the step of collecting crop growth data within the region, the growth data includes yield data, time-series variable data, and static parameter data. The time-series variable data includes meteorological variables, remote-sensing vegetation variables, soil variables, and physiological process variables. The meteorological variables include air temperature, precipitation, and solar radiation. The remote-sensing vegetation variables include the enhanced vegetation index and the normalized difference vegetation index. The soil variables include soil temperature and soil moisture content. The physiological process variables include latent heat flux, sensible heat flux, canopy transpiration, and potential evaporation. The static parameter data includes growth environment parameters, which include elevation, latitude, and soil physicochemical properties.
3. The regional crop yield estimation method as described in claim 2, characterized in that the step of constructing the feature variable space based on the growth data specifically includes the following steps: performing a dynamic correlation analysis between the time-series variable data and the yield data with a step size of 16 days to determine the most correlated variables and the most correlated time intervals; performing a static correlation analysis between the growth environment parameters and the crop yield data, year by year; and constructing the feature variable space by selecting variables with a Pearson correlation coefficient greater than 0.2, where the correlation coefficient is calculated using the following formula:
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$$
In the formula: $n$ is the total number of counties that grow the crop; $x_i$ is the mean of the feature variable for county $i$; $y_i$ is the statistical crop yield of county $i$; and $\bar{x}$ and $\bar{y}$ are the means of the feature variable and of the statistical yield over all counties, respectively.
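The correlation screening in claim 3 can be sketched as follows. This is an illustration only: it uses the standard Pearson coefficient, assumes each time-series variable has already been averaged into 16-day windows per county, and applies the 0.2 threshold to the signed coefficient as the claim literally states. The function names and sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Standard Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def best_window(window_means, yields):
    """window_means[i][k] is a variable's mean for county i in 16-day window k.
    Returns (index of the most correlated window, its coefficient)."""
    n_windows = len(window_means[0])
    scores = [
        pearson([row[k] for row in window_means], yields)
        for k in range(n_windows)
    ]
    k_best = max(range(n_windows), key=lambda k: scores[k])
    return k_best, scores[k_best]


def select_features(features, yields, threshold=0.2):
    """Keep the names of features whose coefficient exceeds the threshold.
    features maps a feature name to its per-county values."""
    return [
        name for name, values in features.items()
        if pearson(values, yields) > threshold
    ]


# Illustrative (made-up) values for three counties.
yields = [2.0, 4.0, 6.0]
print(select_features({"ndvi": [1.0, 2.0, 3.0], "elev": [3.0, 2.0, 1.0]}, yields))
```

A variant that thresholds the absolute value of the coefficient would also retain strongly negative predictors; the text does not say which reading is intended.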
4. The regional crop yield estimation method as described in claim 1, characterized in that the step of establishing a random forest model based on the feature variable space and pre-training and validating the random forest model specifically includes the following steps: constructing a random forest model based on the feature variable space; and pre-training and validating the random forest model using leave-one-out cross-validation.
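The leave-one-out validation in claim 4 can be sketched generically. To keep the example dependency-free, the random forest is swapped for a trivial mean predictor passed in through hypothetical `fit` and `predict` callables. In practice a random forest regressor, for example scikit-learn's `RandomForestRegressor`, could be plugged in the same way; that choice is an assumption and is not stated in the source.

```python
import math


def leave_one_out(X, y, fit, predict):
    """Hold out each sample once, train on the rest, and predict the held-out one.
    fit(X_train, y_train) returns a model; predict(model, X_test) returns a list."""
    preds = []
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        preds.append(predict(model, [X[i]])[0])
    return preds


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length lists."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


# Stand-in model (NOT a random forest): always predicts the training mean.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def predict_mean(model, X_test):
    return [model for _ in X_test]


# Illustrative (made-up) data: one feature per county.
X = [[0.1], [0.2], [0.3]]
y = [1.0, 2.0, 3.0]
held_out = leave_one_out(X, y, fit_mean, predict_mean)
print(held_out, rmse(y, held_out))
```

With a scikit-learn estimator, the callables could be `lambda X, y: RandomForestRegressor().fit(X, y)` and `lambda m, X: list(m.predict(X))`.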
5. A regional crop yield estimation system that implements the regional crop yield estimation method of claim 1, characterized in that it includes: a pre-training module, used to pre-train and simulate crop yields within the region using a random forest model; a correction module, used to correct the errors in the prediction results of the random forest model using a rotation correction algorithm; and a prediction module, used to predict regional crop yields using the corrected random forest model. The pre-training module includes: a data acquisition unit, used to collect crop growth data within the region; a feature construction unit, used to construct a feature variable space based on the growth data; and a training unit, used to construct a random forest model based on the crop growth data within the region and to train the random forest model.
6. A device, characterized in that the device includes a processor and a memory coupled to the processor, wherein the memory stores program instructions for implementing the regional crop yield estimation method according to any one of claims 1-5, and the processor is used to execute the program instructions stored in the memory to estimate the regional crop yield.
7. A storage medium, characterized in that the storage medium stores processor-executable program instructions for performing the regional crop yield estimation method according to any one of claims 1 to 5.