Tobacco yield rapid modeling and online self-calibration method based on cloud-edge collaborative AutoML
By compressing data at the edge nodes of tobacco-growing areas and performing automated modeling in the cloud, combined with Mondrian partitioning and an online self-calibration method based on temperature scaling, the method addresses data transmission latency and insufficient model accuracy and achieves efficient tobacco yield prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGDONG TOBACCO RES INST
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
AI Technical Summary
In tobacco-growing areas, existing technologies struggle to perform efficient data compression and automated modeling at edge nodes, resulting in high data transmission latency, high bandwidth pressure, and insufficient model prediction accuracy and generalization ability.
A compressed feature matrix is generated on the edge side by using sliding window statistics extraction and Gaussian random projection compression. Model selection and hyperparameter optimization are performed in the cloud by combining Bayesian optimization. Online self-calibration is performed by using Mondrian partitioning and temperature scaling to generate an edge inference model and perform self-calibrated prediction.
It effectively reduces data transmission volume, decreases computational overhead, improves the model's prediction accuracy and generalization ability, avoids reconstruction errors and catastrophic forgetting risks, and achieves coverage guarantee for the self-calibrated prediction interval.
Figure CN122241179A_ABST
Description
Technical Field
[0001] This invention belongs to the field of cloud technology and specifically relates to a method for rapid modeling and online self-calibration of tobacco yield using cloud-edge collaborative AutoML.
Background Technology
[0002] Tobacco is an important economic crop in China, and its yield is significantly influenced by multiple factors, including meteorological conditions, soil conditions, and cultivation management, exhibiting substantial spatiotemporal heterogeneity. Accurate prediction of tobacco yield per acre is of great practical significance for tobacco procurement planning, warehousing and logistics scheduling, and optimization of planting structure. In recent years, with the widespread application of IoT technology in agriculture, multi-source sensing devices such as meteorological sensors, soil sensors, and crop growth sensors have been widely deployed in tobacco-growing areas, providing a rich data foundation for data-driven yield prediction modeling.
[0003] In yield prediction modeling, existing research and engineering practices mainly employ two technical approaches. The first is based on empirical statistical models, such as multiple linear regression and partial least squares regression. These methods directly model the linear relationship between features and yield, but struggle to capture the complex nonlinear interactions between meteorological, soil, and crop growth indicators. Prediction accuracy significantly decreases in years with extreme weather or the introduction of new varieties. The second approach is based on machine learning, including random forests, gradient boosting decision trees, support vector regression, and deep neural networks. These methods possess stronger nonlinear fitting capabilities and, with sufficient training data, typically achieve better predictive results than statistical models. However, different machine learning algorithms exhibit significant performance variations under different data characteristics and sample sizes. Model selection and hyperparameter tuning heavily rely on the experience of modelers, lacking a systematic automatic optimization mechanism. Automated machine learning technology offers a new approach to solving model selection and hyperparameter optimization problems by automatically searching for the optimal model family and hyperparameter combination within a predefined model search space, reducing reliance on human experience. However, most current automated machine learning frameworks are designed based on the assumption that training data is centrally stored on a single computing node. In scenarios such as tobacco planting, where data sources are highly dispersed, edge node computing resources are limited, and network bandwidth is restricted, directly aggregating all raw sensor data to the cloud for centralized modeling will result in significant transmission latency and bandwidth pressure.
[0004] In cloud-edge collaborative computing, edge computing technology effectively reduces data transmission volume and response latency by offloading some computing tasks to edge nodes closer to the data source. Existing cloud-edge collaborative solutions mainly focus on model deployment during the inference phase, i.e., after model training is completed in the cloud, the model is distributed to edge nodes for execution. However, current research and practice remain insufficient on how to perform efficient data compression at the edge during the training phase and how to conduct automated modeling in the cloud directly on the compressed data. In particular, when edge nodes report data to the cloud, a common practice is to simply downsample or average the original data. Although this reduces the amount of data, it may also lose feature information that is of great value for yield prediction, such as the peak signals of extreme weather events and the short-term fluctuation characteristics of soil conditions.
Summary of the Invention
[0005] The main objective of this invention is to provide a rapid modeling and online self-calibration method for tobacco yield using cloud-edge collaborative AutoML. While reducing transmission bandwidth, the invention achieves automatic cloud-based modeling, cross-seasonal online calibration, and edge-side interval prediction output with coverage guarantees.
[0006] To address the aforementioned technical problems, this invention provides a method for rapid modeling and online self-calibration of tobacco yield using cloud-edge collaborative AutoML, comprising the following steps:
Step 1, edge-side data compression and cloud-based AutoML modeling: Multiple edge nodes collect multi-source time-series acquisition sequences, apply sliding window statistics extraction and random projection compression to obtain compressed feature matrices, and report them to the cloud server. The cloud server merges them into a global compressed training feature table and, through Bayesian optimization, performs model selection and hyperparameter optimization in the compressed feature space to train a cloud-based yield prediction base model.
Step 2, chained online self-calibration with Mondrian-partitioned conformal prediction and temperature scaling: Each edge node generates a new-season compressed feature matrix for the newly harvested plots in the same way as in Step 1 and reports it. The cloud server merges these into a new-season global compressed feature table and divides it into a calibration sample set and a test sample set. Mondrian partitioning is performed based on the tobacco variety label and the planting plot label. Within each Mondrian partition, the residual scaling estimation operation, the temperature scaling operation, and the conformal prediction operation are performed in sequence to generate a partition calibration parameter table.
Step 3, model distillation and edge calibration inference deployment: The cloud server uses the cloud-based yield prediction base model as the teacher model to distill the edge inference model. The edge inference model and the partition calibration parameter table are then distributed to each edge node. Each edge node uses the edge inference model to obtain the predicted yield per acre and calculates the calibration radius from the partition calibration parameter table. The self-calibrated prediction interval is formed by extending the calibration radius above and below the predicted yield per acre.
[0007] Furthermore, the multi-source time-series acquisition sequence is composed of meteorological sensor data, soil sensor data, and crop growth sensor data. The sliding window statistics are extracted as follows: the window length is set to one tobacco growth period observation cycle, the window sliding step size is set to half of the tobacco growth period observation cycle, and four statistics, namely mean, variance, maximum and minimum values, are calculated for each channel in each window. The statistics of all channels in the same window are concatenated end to end to form a window feature vector, and all window feature vectors are stacked row by row to form a local feature matrix.
[0008] Furthermore, random projection compression uses a Gaussian random measurement matrix pre-stored in the edge nodes to perform linear projection on each row of the local feature matrix, compressing each row from the original dimension to one-third of the original dimension. After all rows are projected, they are stacked to obtain the compressed feature matrix. Because Gaussian random projection satisfies the approximate distance-preserving property of the Johnson-Lindenstrauss lemma, the cloud server performs model training directly in the compressed feature space using the global compressed training feature table.
[0009] Furthermore, Bayesian optimization is performed within a predefined AutoML model search space, which includes three candidate model families: gradient boosting decision trees, temporal convolutional networks, and lightweight multilayer perceptrons. Each candidate model family has its own predefined range of hyperparameter values. The Bayesian optimization is based on a tree-structured Parzen estimator. In each iteration, one candidate configuration is sampled from the AutoML model search space. Each candidate configuration includes a candidate model family identifier and a corresponding set of hyperparameter values.
[0010] Furthermore, a 5-fold cross-validation evaluation is performed using the globally compressed training feature table and the corresponding actual yield per acre labeled value. The mean of the 5-fold root mean square error is used as the evaluation score of the candidate configuration. After a preset number of iterations, the candidate configuration with the smallest evaluation score is selected as the optimal configuration. The cloud-based yield prediction base model is then fully trained on the globally compressed training feature table according to the candidate model family and hyperparameter values specified by the optimal configuration.
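The evaluation score described above can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function names `kfold_indices` and `cv_rmse`, the `fit` callback interface, and the use of contiguous, unshuffled folds are all assumptions, since the text does not say how the folds are drawn.

```python
import math


def kfold_indices(n, k=5):
    """Split range(n) into k contiguous folds whose sizes differ by at most 1."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cv_rmse(features, targets, fit, k=5):
    """Mean of the per-fold RMSE values, used as the candidate's evaluation score.

    `fit(train_x, train_y)` is a hypothetical callback that trains one candidate
    configuration and returns a function mapping a feature vector to a prediction.
    """
    fold_scores = []
    for fold in kfold_indices(len(targets), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(targets) if i not in held_out]
        predict = fit(train_x, train_y)
        squared_errors = [(predict(features[i]) - targets[i]) ** 2 for i in fold]
        fold_scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(fold_scores) / k
```

The configuration with the smallest `cv_rmse` value would then be retrained on the full table, as the paragraph describes.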
[0011] Furthermore, the calibration sample set contains 70% of all samples in the new season's global compressed feature table, and the test sample set contains the remaining 30%. The Mondrian partitioning is performed as follows: the concatenated string of the tobacco variety label and the planting plot label attached to each sample is used as the Mondrian partitioning key, and samples with the same Mondrian partitioning key are grouped into the same Mondrian partition. When the number of calibration samples in a Mondrian partition is less than 30, the samples in the current Mondrian partition are reassigned to the variety-level Mondrian partition using only the tobacco variety label as the Mondrian partitioning key.
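The partitioning rule with its small-sample fallback can be sketched as follows. The dictionary field names `variety` and `plot`, the `|` separator in the key, and the function name are assumptions made for illustration; only the threshold of 30 samples and the fallback to a variety-level key come from the text.

```python
from collections import defaultdict

MIN_CALIBRATION_SAMPLES = 30  # threshold stated in the text


def mondrian_partition(samples, min_samples=MIN_CALIBRATION_SAMPLES):
    """Group calibration samples by a 'variety|plot' key.

    Partitions with fewer than `min_samples` members are reassigned to a
    variety-level partition keyed by the variety label alone.
    """
    fine = defaultdict(list)
    for sample in samples:
        fine[f"{sample['variety']}|{sample['plot']}"].append(sample)

    partitions = defaultdict(list)
    for key, members in fine.items():
        if len(members) >= min_samples:
            partitions[key].extend(members)
        else:
            # Fallback: merge into the variety-level partition.
            partitions[members[0]["variety"]].extend(members)
    return dict(partitions)
```

For example, 35 samples of a hypothetical variety `V1` on plot `P01` keep their own partition `V1|P01`, while 10 samples on `P02` and 5 on `P03` are pooled into the variety-level partition `V1`.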
[0012] Furthermore, the residual scaling estimation operation is performed as follows: the compressed feature vector of each calibration sample in the Mondrian calibration subset is input into the cloud-based yield prediction base model to obtain the predicted yield per acre value, the difference between the predicted yield per acre value and the actual yield per acre value is calculated as the prediction residual, the sample standard deviation of the prediction residual of all calibration samples in the Mondrian calibration subset is calculated, and the sample standard deviation of the prediction residual is determined as the partition residual scale.
[0013] Furthermore, the temperature scaling operation is performed as follows: the initial search range of the temperature scaling factor is set to 0.01 to 100; for each calibration sample in the Mondrian calibration subset, a Gaussian probability density function is constructed with 0 as the center and the product of the temperature scaling factor and the partition residual scale as the standard deviation; the prediction residual of the calibration sample is substituted into the Gaussian probability density function to obtain the probability density value, and the natural logarithm is taken to obtain the log-likelihood value; the log-likelihood values of all calibration samples in the Mondrian calibration subset are summed to obtain the total log-likelihood value; the golden section search method is used to perform 50 rounds of search iterations within the initial search range, and the temperature scaling factor that maximizes the total log-likelihood value is determined as the optimal temperature scaling factor; the partition residual scale is multiplied by the optimal temperature scaling factor to obtain the partition scale after temperature calibration.
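A minimal sketch of this temperature search is given below, assuming the residuals of one Mondrian calibration subset are available as a list of floats. The function names are hypothetical; the search range of 0.01 to 100, the 50 iterations, the zero-centered Gaussian density, and the sample standard deviation as the partition residual scale follow the text.

```python
import math
import statistics

GOLDEN = (math.sqrt(5) - 1) / 2  # golden ratio conjugate, about 0.618


def total_log_likelihood(residuals, scale, temperature):
    """Sum of log N(r; 0, (temperature * scale)^2) over all residuals."""
    sigma = temperature * scale
    return sum(
        -0.5 * math.log(2 * math.pi) - math.log(sigma) - r * r / (2 * sigma * sigma)
        for r in residuals
    )


def fit_temperature(residuals, lo=0.01, hi=100.0, iterations=50):
    """Return (partition residual scale, optimal temperature, calibrated scale)."""
    scale = statistics.stdev(residuals)  # sample standard deviation of residuals

    def objective(t):
        return total_log_likelihood(residuals, scale, t)

    a, b = lo, hi
    c = b - GOLDEN * (b - a)
    d = a + GOLDEN * (b - a)
    fc, fd = objective(c), objective(d)
    for _ in range(iterations):
        if fc > fd:  # maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - GOLDEN * (b - a)
            fc = objective(c)
        else:        # maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + GOLDEN * (b - a)
            fd = objective(d)
    t_opt = (a + b) / 2
    return scale, t_opt, scale * t_opt
```

Golden-section search is appropriate here because the total log-likelihood has a single maximum in the temperature. For a zero-centered Gaussian that maximum also has a closed form (the root mean square of the residuals divided by the partition residual scale), which offers a convenient sanity check on the search result.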
[0014] Furthermore, the conformal prediction operation is performed as follows: the absolute value of the prediction residual is calculated for each calibration sample in the Mondrian calibration subset and divided by the temperature-calibrated partition scale to obtain the normalized inconsistency score; all normalized inconsistency scores are sorted from smallest to largest; the quantile rank is calculated by adding 1 to the total number of normalized inconsistency scores, multiplying by the 95% confidence level, and rounding up to an integer; the normalized inconsistency score at that rank is taken as the conformal threshold; for each test sample in the Mondrian test subset, the temperature-calibrated partition scale is multiplied by the conformal threshold to obtain the calibration radius, and the predicted yield per acre is extended above and below by the calibration radius to form the self-calibrated prediction interval.
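The quantile rule can be sketched as below. The function names are hypothetical. Clamping the rank to the number of scores is an assumption added for very small subsets, where the rounded-up rank would exceed the number of available scores; the text does not address that case.

```python
import math


def conformal_threshold(abs_residuals, calibrated_scale, confidence=0.95):
    """Normalized inconsistency score at rank ceil((n + 1) * confidence)."""
    scores = sorted(r / calibrated_scale for r in abs_residuals)
    rank = math.ceil((len(scores) + 1) * confidence)
    rank = min(rank, len(scores))  # assumption: clamp when the rank exceeds n
    return scores[rank - 1]        # ranks are 1-based


def prediction_interval(prediction, calibrated_scale, threshold):
    """Extend the point prediction by the calibration radius on both sides."""
    radius = calibrated_scale * threshold
    return prediction - radius, prediction + radius
```

With 39 calibration scores, the rank is ceil(40 × 0.95) = 38, so the 38th smallest score becomes the conformal threshold.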
[0015] Furthermore, the model distillation process is as follows: A lightweight multilayer perceptron with fewer hidden layers and fewer nodes per layer than the cloud-based yield prediction base model is constructed as a student model. The predicted yield per acre obtained by inputting each sample from the global compressed training feature table into the cloud-based yield prediction base model is used as the soft target value. The compressed feature vector is used as the input to the student model, and the soft target value is used as the supervision signal. The mean squared error loss function is used to train the student model to obtain the edge inference model. Each edge node generates a compressed feature vector to be predicted from the newly acquired multi-source time-series acquisition sequence in the same way as in Step 1. The compressed feature vector to be predicted is input into the edge inference model to obtain the predicted yield per acre. The corresponding record is found in the partition calibration parameter table according to the tobacco variety label and the planting plot label. The partition residual scale is multiplied by the optimal temperature scaling factor and then by the conformal threshold to obtain the calibration radius.
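The edge-side radius calculation amounts to a table lookup followed by a product of three stored parameters. The sketch below assumes a table keyed by `variety|plot` strings with a variety-level fallback, and hypothetical field names `residual_scale`, `temperature`, and `threshold`; the text specifies only the three factors, not the table layout or any fallback at lookup time.

```python
def calibration_radius(table, variety, plot):
    """Look up the partition record and return scale * temperature * threshold."""
    record = table.get(f"{variety}|{plot}")
    if record is None:
        # Assumption: fall back to the variety-level record when no
        # plot-level partition was calibrated.
        record = table.get(variety)
    if record is None:
        raise KeyError(f"no calibration record for {variety!r} / {plot!r}")
    return record["residual_scale"] * record["temperature"] * record["threshold"]


def self_calibrated_interval(predicted_yield, table, variety, plot):
    """Interval formed by extending the radius above and below the prediction."""
    radius = calibration_radius(table, variety, plot)
    return predicted_yield - radius, predicted_yield + radius
```

Because only a dictionary lookup and two multiplications are involved, the calibration step adds negligible cost to edge inference.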
[0016] The cloud-edge collaborative AutoML method for rapid modeling and online self-calibration of tobacco yield in this invention has the following beneficial effects: This invention employs a combination of sliding window statistical extraction and Gaussian random projection compression at the edge side. This transforms the original multi-source time-series acquisition sequences into a low-dimensional compressed feature matrix before uploading to the cloud server. Compared to directly transmitting raw sensor data, this significantly reduces the amount of data transmitted uplink, effectively alleviating the data aggregation bottleneck caused by the dispersed nature of tobacco planting areas and limited network bandwidth. Simultaneously, the sliding window statistical extraction uses four statistics—mean, variance, maximum, and minimum—to characterize the distribution characteristics of each channel within the observation period. This compresses the data volume while retaining environmental factor level information, fluctuation information, and extreme condition information that are crucial for yield prediction.
[0017] This invention utilizes the approximate distance-preserving property of Gaussian random projection, enabling cloud servers to perform automated model selection and hyperparameter optimization directly within the compressed feature space without needing to perform a reconstruction operation on the compressed feature matrix. This avoids the computational overhead and reconstruction error accumulation problems caused by sparse reconstruction in traditional compressed sensing schemes. The Bayesian optimization method based on the tree-structured Parzen estimator can adaptively guide the search direction based on feedback information from evaluated candidate configurations, efficiently locating the optimal model family and hyperparameter combination within a limited number of iterations, reducing the dependence of yield prediction modeling on human experience.
[0018] The proposed chain calibration mechanism, which combines Mondrian-partitioned conformal prediction with temperature scaling, effectively solves the problem of insufficient cross-seasonal generalization ability of existing yield prediction models. Mondrian partitioning divides samples into subgroups with similar conditions based on tobacco variety labels and planting plot labels. Calibration is performed independently within each partition, allowing calibration parameters to adapt to the differentiated error distribution characteristics under different variety and plot combinations. This avoids the problem of excessively wide or narrow prediction intervals for some subgroups caused by global uniform calibration. Temperature scaling finely adjusts the scale parameters of the error distribution through a data-driven approach, making the normalized inconsistency score closer to the standardized distribution and providing a more accurate normalization benchmark for subsequent conformal prediction. The conformal prediction operation constructs prediction intervals based on the empirical quantiles of the calibration samples, without assuming that the prediction residuals follow a specific parametric distribution family. It can provide self-calibrated prediction intervals with coverage guarantees under limited sample conditions. The entire chain calibration process only requires post-processing of the prediction output of the cloud-based yield prediction base model, without retraining the model itself or fine-tuning its parameters. This results in extremely low computational cost and no risk of catastrophic forgetting.
Attached Figure Description
[0019] Figure 1 is a schematic diagram illustrating the verification of the approximate distance-preserving property of Gaussian random projection in an embodiment of the present invention. Figure 2 is the convergence curve of Bayesian optimization based on the tree-structured Parzen estimator in the AutoML model search space provided in an embodiment of the present invention. Figure 3 is a schematic diagram comparing the fitting of the prediction residual distribution before and after temperature scaling, provided in an embodiment of the present invention. Figure 4 is a schematic diagram comparing the Mondrian conformal prediction interval and the actual yield per acre provided in an embodiment of the present invention.
Detailed Implementation
[0020] The method of the present invention will be further described in detail below with reference to the accompanying drawings and embodiments.
[0021] A rapid modeling and online self-calibration method for tobacco yield using cloud-edge collaborative AutoML includes the following steps:
Step 1, edge-side data compression and cloud-based AutoML modeling: Multiple edge nodes collect multi-source time-series acquisition sequences, apply sliding window statistics extraction and random projection compression to obtain compressed feature matrices, and report them to the cloud server. The cloud server merges them into a global compressed training feature table and, through Bayesian optimization, performs model selection and hyperparameter optimization in the compressed feature space to train a cloud-based yield prediction base model.
Step 2, chained online self-calibration with Mondrian-partitioned conformal prediction and temperature scaling: Each edge node generates a new-season compressed feature matrix for the newly harvested plots in the same way as in Step 1 and reports it. The cloud server merges these into a new-season global compressed feature table and divides it into a calibration sample set and a test sample set. Mondrian partitioning is performed based on the tobacco variety label and the planting plot label. Within each Mondrian partition, the residual scaling estimation operation, the temperature scaling operation, and the conformal prediction operation are performed in sequence to generate a partition calibration parameter table.
Step 3, model distillation and edge calibration inference deployment: The cloud server uses the cloud-based yield prediction base model as the teacher model to distill the edge inference model. The edge inference model and the partition calibration parameter table are then distributed to each edge node. Each edge node uses the edge inference model to obtain the predicted yield per acre and calculates the calibration radius from the partition calibration parameter table. The self-calibrated prediction interval is formed by extending the calibration radius above and below the predicted yield per acre.
[0022] In actual tobacco cultivation, planting areas are often located in mountainous and hilly regions, with scattered plots and limited network bandwidth. Uploading all raw sensor data collected by edge nodes directly to the cloud server would not only result in high transmission latency but also easily cause network congestion during peak harvest periods. To address this issue, this invention performs feature extraction and dimensionality compression on the edge side, reporting only the compressed low-dimensional feature matrix to the cloud. The cloud server then performs AutoML modeling within the compressed feature space, thereby significantly reducing communication overhead while ensuring modeling accuracy.
[0023] In a specific implementation scenario, 12 edge nodes are deployed in the tobacco planting area, with each edge node covering 8 to 15 planting plots. Each edge node is equipped with three types of data acquisition devices: meteorological sensors, soil sensors, and crop growth sensors. The meteorological sensors collect data from four channels—air temperature, air humidity, rainfall, and light intensity—at 15-minute intervals; the soil sensors collect data from three channels—soil moisture content, soil temperature, and soil electrical conductivity—at 15-minute intervals; and the crop growth sensors collect data from two channels—normalized vegetation index and leaf area index—at 1-hour intervals. Each edge node aligns the data collected by the three types of sensors within the same time period along a unified time axis and then splices them together to form a multi-source time-series acquisition sequence containing nine channels. In an optional implementation, the number of channels can be increased or decreased depending on the types of sensors deployed, such as adding a carbon dioxide concentration sensor or a wind speed sensor, and the number of channels in the multi-source time-series acquisition sequence will change accordingly.
[0024] Each edge node performs sliding window segmentation on its respective multi-source time-series acquisition sequence. The window length is set to one tobacco growth period observation cycle, which is 15 days in this embodiment, meaning each window covers 15 days of continuously acquired data. The window sliding step size is set to half of the tobacco growth period observation cycle, i.e., 7.5 days, rounded down to 7 days. The purpose of using a semi-overlapping sliding window is to retain approximately 50% data overlap between adjacent windows, ensuring that growth trend changes at the window boundaries are not lost due to segmentation, while also generating a sufficient number of training samples within the limited growth period. In optional implementations, the window length can be set to 7 days, 10 days, or 20 days, with the sliding step size adjusted accordingly to half of the window length.
[0025] Within each window, the edge nodes calculate four statistics for each of the nine channels: mean, variance, maximum, and minimum. These four statistics are chosen because the mean reflects the average level of environmental factors or growth indicators within the window period, the variance captures the degree of fluctuation, and the maximum and minimum define the range of extreme conditions. Together, the four statistics characterize the distribution of the original time series at low dimensionality. For a multi-source time-series acquisition sequence containing nine channels, each window generates 9 × 4 = 36 statistics, where 9 is the number of channels and 4 is the number of statistics calculated for each channel. These 36 statistics are concatenated end-to-end in channel order to form a window feature vector with dimension 36. In an optional implementation, the median or skewness can be added to the above 4 statistics, and the dimension of the window feature vector will increase accordingly.
[0026] Suppose that sliding-window segmentation of one complete growth period (approximately 120 days) yields 16 windows at an edge node. The edge node then generates 16 window feature vectors, which are stacked row by row to form a local feature matrix with 16 rows and 36 columns.
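The window segmentation and statistics extraction can be sketched as follows. For simplicity the sketch indexes the series by time step and lets the caller express the window length and step in time steps; the function name, the row-per-time-step layout, and the use of population variance are assumptions, since the text does not state which variance estimator is used.

```python
import statistics


def sliding_window_features(series, window, step):
    """Build the local feature matrix from a multi-channel time series.

    `series` is a list of time steps, each a list with one value per channel.
    Each window yields [mean, variance, max, min] per channel, concatenated
    in channel order.
    """
    features = []
    for start in range(0, len(series) - window + 1, step):
        chunk = series[start:start + window]
        row = []
        for channel in zip(*chunk):  # iterate channel by channel
            row += [
                statistics.fmean(channel),
                statistics.pvariance(channel),  # assumption: population variance
                max(channel),
                min(channel),
            ]
        features.append(row)
    return features
```

With one value per day, a 120-day series, a 15-day window, and a 7-day step, the window start days are 0, 7, ..., 105, which gives the 16 windows and the 16 × 36 local feature matrix described above for nine channels.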
[0027] To reduce uplink bandwidth consumption, each edge node uses a Gaussian random measurement matrix pre-stored in its local storage to compress the dimension of its local feature matrix. The size of the Gaussian random measurement matrix is $m \times d$, where $d$ is the original dimension of each row of the local feature matrix and $m$ is the target dimension after compression. In this embodiment, $d = 36$, and the target dimension is set to one-third of the original dimension, i.e., $m = 12$. Each element of the Gaussian random measurement matrix is independently sampled from a Gaussian distribution with mean 0 and variance $1/m$. The variance is taken as $1/m$ so that the projected vector maintains the same Euclidean norm scale as the original vector in expectation. This Gaussian random measurement matrix is generated uniformly by the cloud server during system initialization and distributed to all edge nodes. One identical copy is also stored on the cloud server to ensure that compression and subsequent processing use the same projection basis.
[0028] The compression process is performed row by row: each row of the local feature matrix is treated as a 36-dimensional row vector. The transpose of this row vector is left-multiplied by the Gaussian random measurement matrix to obtain a 12-dimensional column vector, which is then transposed back to a row vector, completing the linear projection of the row from 36 dimensions to 12 dimensions. This process is repeated for all 16 rows of the local feature matrix. The compressed row vectors are then stacked row by row to obtain a compressed feature matrix with 16 rows and 12 columns. Each edge node reports its compressed feature matrix to the cloud server via the network.
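The link between the entry variance and norm preservation can be checked with a short derivation. The symbols are introduced here for illustration: $\Phi \in \mathbb{R}^{m \times d}$ is the measurement matrix with independent zero-mean entries of variance $\sigma^2$, and $x \in \mathbb{R}^{d}$ is one row of the local feature matrix.

```latex
\mathbb{E}\,\lVert \Phi x \rVert_2^2
  = \sum_{i=1}^{m} \mathbb{E}\Big[\Big(\sum_{j=1}^{d} \Phi_{ij}\, x_j\Big)^{2}\Big]
  = \sum_{i=1}^{m} \sum_{j=1}^{d} \sigma^{2} x_j^{2}
  = m\,\sigma^{2}\,\lVert x \rVert_2^2
```

The cross terms vanish because the entries are independent with mean 0. The expected squared norm therefore equals $\lVert x \rVert_2^2$ exactly when $\sigma^2 = 1/m$, where $m$ is the target dimension after compression.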
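The row-by-row projection can be sketched with the standard library alone. The function names and the use of a shared seed to reproduce an identical matrix are illustrative assumptions, and the entry variance of 1/m is assumed because it is the choice that preserves the squared Euclidean norm in expectation.

```python
import math
import random


def make_measurement_matrix(original_dim=36, target_dim=12, seed=0):
    """Gaussian random measurement matrix of shape target_dim x original_dim."""
    rng = random.Random(seed)  # hypothetical shared seed for identical copies
    std = math.sqrt(1.0 / target_dim)  # assumed entry variance 1/m
    return [
        [rng.gauss(0.0, std) for _ in range(original_dim)]
        for _ in range(target_dim)
    ]


def compress(local_matrix, phi):
    """Project every row of the local feature matrix onto the rows of phi."""
    return [
        [sum(p * x for p, x in zip(phi_row, row)) for phi_row in phi]
        for row in local_matrix
    ]
```

Applied to a 16 × 36 local feature matrix, `compress` returns a 16 × 12 compressed feature matrix, so the number of values uploaded per node drops from 576 to 192.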
[0029] Referring to Figure 1, each scatter point corresponds to a pair of samples randomly selected from the global compressed training feature table. The horizontal axis represents the Euclidean distance of the sample pair in the original 36-dimensional window feature vector space, and the vertical axis represents the Euclidean distance of the same pair after projection onto the 12-dimensional compressed feature space by the Gaussian random measurement matrix. A solid diagonal line drawn from the origin is the ideal distance-preserving line: if the Euclidean distance between two samples were completely unchanged by projection, the scatter point would fall exactly on this line. An auxiliary line is drawn above and below the ideal distance-preserving line, corresponding to the upper and lower bounds of distance distortion, respectively. The upper bound line indicates that the magnification of the distance after projection relative to the distance before projection does not exceed a certain proportion, and the lower bound line indicates that the reduction of the distance does not exceed the same proportion. As Figure 1 shows, the vast majority of scatter points cluster closely around the ideal distance-preserving line, and almost all fall within the band enclosed by the upper and lower distortion bounds, with only a very few points slightly outside this band. This distribution intuitively verifies that the Gaussian random projection used in this invention keeps the change in the Euclidean distance between any two samples before and after projection within a small range when compressing the window feature vector from 36 dimensions to 12 dimensions. According to the Johnson-Lindenstrauss lemma, when the compressed target dimension $m$ satisfies $m \geq C \ln n / \varepsilon^{2}$, where $C$ is a constant related to the probability guarantee, $n$ is the sample size, and $\varepsilon$ is the acceptable distance distortion ratio, Gaussian random projection guarantees the aforementioned approximate distance-preserving property with high probability. In this embodiment, $m = 12$, and the corresponding distortion ratio $\varepsilon$ is within an acceptable range for engineering purposes. This property allows the cloud server to perform AutoML model training directly in the 12-dimensional compressed feature space without using sparse reconstruction algorithms such as orthogonal matching pursuit to restore the compressed features to the original 36-dimensional space. This saves the computational overhead of reconstruction and avoids the cumulative errors that reconstruction may introduce. The tight clustering of scatter points along the ideal distance-preserving line in Figure 1 indicates that the inter-sample distance relationships on which the cloud-based yield prediction base model trained in the compressed feature space depends are highly consistent with those in the original feature space, providing a reliable feature input basis for the subsequent Bayesian optimization search for the optimal model family and hyperparameter combination.
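A check of this kind can be reproduced numerically. The sketch below uses synthetic uniform random vectors as a stand-in for real window feature vectors, and the function names, sample counts, seed, and assumed entry variance of 1/m are all illustrative choices rather than values from the embodiment.

```python
import math
import random


def distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distortion_ratios(n_samples=50, d=36, m=12, n_pairs=200, seed=0):
    """Projected distance divided by original distance for random sample pairs."""
    rng = random.Random(seed)
    # Synthetic stand-in data; real window feature vectors would be used in practice.
    samples = [[rng.uniform(0.0, 1.0) for _ in range(d)] for _ in range(n_samples)]
    std = math.sqrt(1.0 / m)  # assumed entry variance 1/m
    phi = [[rng.gauss(0.0, std) for _ in range(d)] for _ in range(m)]
    projected = [
        [sum(p * x for p, x in zip(phi_row, sample)) for phi_row in phi]
        for sample in samples
    ]
    ratios = []
    for _ in range(n_pairs):
        i, j = rng.sample(range(n_samples), 2)  # two distinct sample indices
        ratios.append(distance(projected[i], projected[j]) / distance(samples[i], samples[j]))
    return ratios
```

A ratio of 1 corresponds to a point on the ideal distance-preserving line, and the spread of the ratios around 1 corresponds to the width of the distortion band in the figure.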
[0030] Compared with directly reporting the local feature matrix, reporting the compressed feature matrix reduces the number of columns from 36 to 12, so the amount of data transmitted falls to one-third of the original. More importantly, according to the Johnson-Lindenstrauss lemma, when the target dimension $k$ of the Gaussian random projection satisfies $k \ge C \ln n / \varepsilon^{2}$, where $C$ is a constant related to the probability guarantee, $n$ is the sample size, and $\varepsilon$ is the allowed distance distortion ratio, the relative change in the Euclidean distance between any two samples before and after projection does not exceed $\varepsilon$ with high probability. This means that the relative distance relationships between samples are approximately preserved in the compressed feature space. Machine learning algorithms based on distance or similarity (such as the splitting criterion in gradient boosting decision trees and feature transformation in multilayer perceptrons) can still achieve learning results on the compressed features close to those in the original feature space. Therefore, the cloud server does not need to perform a restoration operation on the compressed feature matrix and can train the model directly in the compressed feature space.
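The compression and the distance-preservation check described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation of this embodiment: the entry variance of 1/12, the fixed random seed, the synthetic feature vectors, and all function names are assumptions introduced for the example.

```python
import math
import random

def gaussian_projection_matrix(d_in, d_out, rng):
    """Return a d_out x d_in matrix with i.i.d. N(0, 1/d_out) entries (assumed scaling)."""
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]

def project(matrix, x):
    """Compress one feature vector: y = matrix @ x."""
    return [sum(w * v for w, v in zip(row, x)) for row in matrix]

def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

rng = random.Random(0)  # assumption: one fixed matrix is reused for every sample
phi = gaussian_projection_matrix(36, 12, rng)

# Synthetic 36-dimensional vectors standing in for real window feature vectors.
samples = [[rng.gauss(0.0, 1.0) for _ in range(36)] for _ in range(50)]
compressed = [project(phi, x) for x in samples]

# Ratio of projected distance to original distance for every sample pair.
ratios = [
    euclidean(compressed[i], compressed[j]) / euclidean(samples[i], samples[j])
    for i in range(len(samples))
    for j in range(i + 1, len(samples))
]
mean_ratio = sum(ratios) / len(ratios)
```

Plotting the numerator against the denominator of each ratio would give a scatter of the kind described for Figure 1; a ratio of 1 corresponds to a point on the ideal distance-preserving line.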
[0031] After receiving the compressed feature matrices reported by all 12 edge nodes, the cloud server merges the 12 compressed feature matrices row by row. Assuming that the 12 edge nodes generate 16, 14, 18, 15, 17, 16, 13, 19, 15, 16, 14, and 17 rows respectively, the merged global compressed training feature table has a total of 190 rows and 12 columns. Each row is the compressed feature vector of one sample and is associated with the actual yield per acre, tobacco variety label, and planting plot label of the corresponding plot within the corresponding window period.
[0032] After the global compressed training feature table is constructed, the cloud server performs AutoML model selection and hyperparameter optimization. The preset AutoML model search space includes three candidate model families: gradient boosting decision trees, temporal convolutional networks, and lightweight multilayer perceptrons. The considerations for selecting these three model families are as follows: gradient boosting decision trees usually have a strong fitting ability on structured tabular data and are not sensitive to feature scale, making them suitable as baseline models; temporal convolutional networks can capture local temporal patterns in the window feature vector sequence through one-dimensional convolutional kernels; and lightweight multilayer perceptrons have a simple structure and fast inference speed, which is beneficial for subsequent distillation to the edge deployment. Each candidate model family has its own preset range of hyperparameter values. For example, the hyperparameters of gradient boosting decision trees include the maximum tree depth (range 3 to 10), learning rate (range 0.01 to 0.3), and number of iterations (range 100 to 1000); the hyperparameters of temporal convolutional networks include kernel size (range 2 to 5), number of channels (range 16 to 128), and number of network layers (range 2 to 6); the hyperparameters of lightweight multilayer perceptrons include the number of hidden layers (range 1 to 3), number of nodes per layer (range 32 to 256), and dropout ratio (range 0 to 0.5).
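One possible way to encode the search space above as a data structure is sketched below. The ranges are those stated in this embodiment; the dictionary layout, the key names, and the uniform sampler are hypothetical. The sampler only illustrates the shape of a candidate configuration and is not the tree-structured Parzen estimator used for the actual search.

```python
import random

# Ranges taken from the embodiment; the layout and key names are hypothetical.
SEARCH_SPACE = {
    "gbdt": {
        "max_depth": ("int", 3, 10),
        "learning_rate": ("float", 0.01, 0.3),
        "n_iterations": ("int", 100, 1000),
    },
    "tcn": {
        "kernel_size": ("int", 2, 5),
        "channels": ("int", 16, 128),
        "layers": ("int", 2, 6),
    },
    "mlp": {
        "hidden_layers": ("int", 1, 3),
        "nodes_per_layer": ("int", 32, 256),
        "dropout": ("float", 0.0, 0.5),
    },
}

def sample_configuration(rng):
    """Draw one candidate: a model family plus values for its hyperparameters."""
    family = rng.choice(sorted(SEARCH_SPACE))
    params = {}
    for name, (kind, low, high) in SEARCH_SPACE[family].items():
        params[name] = rng.randint(low, high) if kind == "int" else rng.uniform(low, high)
    return {"family": family, "params": params}

def in_bounds(config):
    """Check that every sampled value lies inside its preset range."""
    space = SEARCH_SPACE[config["family"]]
    return all(space[n][1] <= v <= space[n][2] for n, v in config["params"].items())

example = sample_configuration(random.Random(0))
```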
[0033] The cloud server employs a Bayesian optimization method based on a tree-structured Parzen estimator for the search. Unlike random or grid search, in each iteration the tree-structured Parzen estimator divides the candidate configurations evaluated so far into two groups according to their evaluation scores, one with better performance and one with worse performance. It then fits a probability density to each group and selects for evaluation a new candidate configuration that maximizes the ratio of the density of the better group to that of the worse group. This concentrates the search on regions with better evaluation scores (lower root mean square error), improving search efficiency.
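The density-ratio idea can be illustrated with a deliberately simplified sketch for a single continuous hyperparameter. The fixed kernel bandwidth, the 25% split between the better and worse groups, the candidate count, and the toy objective are all assumptions made for the example; a full tree-structured Parzen estimator also handles the choice of model family and conditional hyperparameters, which this sketch omits.

```python
import math
import random

def parzen_density(x, centers, bandwidth):
    """Mean of Gaussian kernels centred on previously evaluated values."""
    norm = bandwidth * math.sqrt(2.0 * math.pi)
    total = sum(math.exp(-0.5 * ((x - c) / bandwidth) ** 2) for c in centers)
    return total / (len(centers) * norm)

def tpe_suggest(history, low, high, rng, gamma=0.25, n_candidates=24):
    """Suggest the next value from (value, score) pairs; lower score is better."""
    ranked = sorted(history, key=lambda pair: pair[1])
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [value for value, _ in ranked[:n_good]]
    bad = [value for value, _ in ranked[n_good:]] or good
    bandwidth = (high - low) / 10.0  # fixed kernel width, an assumption
    # Draw candidates around the better group, clipped to the search range.
    candidates = [
        min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
        for _ in range(n_candidates)
    ]
    # Keep the candidate with the largest better/worse density ratio.
    return max(
        candidates,
        key=lambda x: parzen_density(x, good, bandwidth)
        / (parzen_density(x, bad, bandwidth) + 1e-12),
    )

def toy_objective(learning_rate):
    """Hypothetical stand-in for the 5-fold RMSE of a candidate configuration."""
    return (learning_rate - 0.1) ** 2

rng = random.Random(0)
low, high = 0.01, 0.3
# Ten random warm-up evaluations, then thirty guided suggestions.
history = [(x, toy_objective(x)) for x in (rng.uniform(low, high) for _ in range(10))]
for _ in range(30):
    x = tpe_suggest(history, low, high, rng)
    history.append((x, toy_objective(x)))
best_value, best_score = min(history, key=lambda pair: pair[1])
```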
[0034] Referring to Figure 2, the horizontal axis represents the iteration number of the Bayesian optimization, from iteration 1 to iteration 80. The vertical axis represents the root mean square error (RMSE) of each candidate configuration evaluated using 5-fold cross-validation on the global compressed training feature table, expressed in kilograms per acre. The scatter points are distinguished by three markers according to the candidate model family: one for gradient boosting decision trees, one for temporal convolutional networks, and one for lightweight multilayer perceptrons. The horizontal coordinate of each scatter point is the iteration in which the candidate configuration was evaluated, and the vertical coordinate is the mean 5-fold RMSE of that candidate configuration. A monotonically non-increasing solid line is also overlaid, representing the running minimum of the evaluation score over all candidate configurations evaluated from iteration 1 to the current iteration, i.e., the current optimal evaluation score. The following characteristics can be observed in Figure 2: in the early iterations, the scatter points are relatively dispersed and the evaluation scores are generally high. This is because Bayesian optimization has not yet accumulated sufficient evaluation feedback in the initial stage, so the sampling is highly exploratory and the candidate configurations are distributed relatively uniformly over the model search space. As the iterations progress, the tree-structured Parzen estimator gradually builds a probabilistic model from the score distribution of the evaluated candidate configurations, guiding the sampling towards regions with better evaluation scores. 
The scatter points gradually concentrate towards the low-value region of the vertical axis, and the envelope of the current optimal evaluation score first decreases rapidly and then levels off. In this embodiment, Bayesian optimization completed most of the decrease in evaluation score within the first 30 iterations, after which it entered a fine-tuning stage in which the improvement in evaluation score gradually diminished. An arrow in the figure marks the iteration of the candidate configuration with the lowest evaluation score in the entire search and its root mean square error value; this candidate configuration is the finally determined optimal configuration. The distribution of the differently marked scatter points shows that candidate configurations of the gradient boosting decision tree family appear more frequently in regions with lower evaluation scores, indicating that, for the compressed feature space and sample size of this embodiment, the gradient boosting decision tree has a certain advantage in fitting window statistical features. Under different data distributions and sample sizes, the model family of the optimal configuration may change, which is precisely the value of the AutoML automatic search mechanism.
[0035] In each iteration, the Bayesian optimization method samples one candidate configuration from the AutoML model search space. Each candidate configuration consists of a candidate model family identifier (i.e., one of gradient boosting decision trees, temporal convolutional networks, or lightweight multilayer perceptrons) and a set of hyperparameter values for the corresponding model family. The cloud server instantiates the model according to the candidate configuration and performs 5-fold cross-validation using the global compressed training feature table and the corresponding actual yield per acre label values: the global compressed training feature table is randomly divided into 5 equal subsets; each time, 4 subsets are used as the training set and the remaining subset is used as the validation set; and the root mean square error between the predicted yield per acre and the actual yield per acre label value on the validation set is calculated. After this is repeated 5 times, the arithmetic mean of the 5 root mean square errors is taken as the evaluation score of the current candidate configuration. In this embodiment, the Bayesian optimization performs 80 iterations. After all 80 iterations are completed, the candidate configuration with the smallest evaluation score is selected as the optimal configuration. In an optional implementation, the number of iterations can be set to any integer between 50 and 200.
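The evaluation score described above can be sketched as follows. The function names, the `fit` callable interface, the placeholder mean-predicting model, and the synthetic data are assumptions for illustration; in practice `fit` would train the model instantiated from the candidate configuration.

```python
import math
import random

def make_folds(n_samples, k=5, seed=0):
    """Shuffle the row indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validated_rmse(features, targets, fit, k=5, seed=0):
    """Arithmetic mean of the k per-fold RMSE values for one candidate."""
    fold_rmse = []
    for held_out in make_folds(len(features), k, seed):
        held = set(held_out)
        train = [i for i in range(len(features)) if i not in held]
        # fit returns a function mapping one feature vector to a prediction.
        predict = fit([features[i] for i in train], [targets[i] for i in train])
        squared = [(predict(features[i]) - targets[i]) ** 2 for i in held_out]
        fold_rmse.append(math.sqrt(sum(squared) / len(squared)))
    return sum(fold_rmse) / k

def fit_mean_model(train_x, train_y):
    """Hypothetical placeholder model that always predicts the training mean."""
    mean = sum(train_y) / len(train_y)
    return lambda x: mean

# Synthetic stand-in for the 190-row, 12-column global compressed training table.
features = [[float(i)] * 12 for i in range(190)]
targets = [150.0 + (i % 7) for i in range(190)]
score = cross_validated_rmse(features, targets, fit_mean_model)
```

With 190 rows and 5 folds, each validation subset contains 38 samples and each training set contains 152.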
[0036] After determining the optimal configuration, the cloud server uses the candidate model family and hyperparameter values specified by the optimal configuration to train the entire model using all 190 samples of the globally compressed training feature table and their corresponding actual yield per acre labels, without performing cross-validation splitting. After training, the cloud yield prediction base model is obtained.
[0037] Once the new tobacco planting cycle begins, a chain-like online self-calibration process can be initiated when some plots have completed harvesting and obtained actual yield values per acre. In actual production, different plots have different tobacco varieties and planting altitudes, resulting in variations in harvesting times. Typically, a significant number of plots have already completed harvesting and recorded yields within the first 60% to 70% of the harvest season. This invention utilizes data from these harvested plots to calibrate the cloud-based yield prediction base model, enabling the model to adapt to changes in climate and soil conditions in the new season without requiring retraining.
[0038] For plots where actual yield per acre has been obtained, each edge node processes the multi-source time-series acquisition sequences using the same sliding window segmentation, statistical calculation, and compressed projection method as in step 1, generating a new season's compressed feature matrix and reporting it to the cloud server. The cloud server merges the new season's compressed feature matrices reported by all edge nodes row by row to obtain a new season's global compressed feature table. Assume there are 95 harvested plot samples in this season.
[0039] The cloud server randomly divides all samples in the new season's global compressed feature table into a calibration sample set and a test sample set. The calibration sample set contains 70% of all samples, i.e., 66 samples, and the test sample set contains the remaining 30%, i.e., 29 samples. The 70% share is chosen because the stability of the conformal prediction interval width and coverage depends directly on the number of calibration samples: too few calibration samples lead to unstable conformal threshold estimation, while too many leave insufficient test samples to verify the calibration effect. In an optional implementation, the proportion of the calibration sample set can be set between 60% and 80%.
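A minimal sketch of this split is given below. Since 70% of 95 is 66.5, the sketch assumes the calibration count is rounded down, which reproduces the 66 and 29 stated above; the function name and the seed are hypothetical.

```python
import math
import random

def split_calibration_test(samples, calibration_fraction=0.7, seed=0):
    """Randomly split samples; the calibration count is rounded down (assumption)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_calibration = math.floor(calibration_fraction * len(shuffled))
    return shuffled[:n_calibration], shuffled[n_calibration:]

calibration_set, test_set = split_calibration_test(range(95))
```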
[0040] Next, Mondrian partitioning is performed. The core idea of Mondrian partitioning is that different tobacco varieties have different genetic yield potentials, and the soil fertility and microclimate conditions of different planting plots also vary. Therefore, the prediction error distribution of the cloud-based yield prediction base model may differ significantly across different combinations of varieties and plots. If uniform calibration parameters were applied to all samples, the prediction intervals for some combinations would be too wide while those for others would be too narrow. Mondrian partitioning groups samples with the same variety and plot conditions together and calibrates each group independently, so that the prediction intervals achieve the target coverage within each group.
[0041] Specifically, the cloud server reads the tobacco variety label and planting plot label attached to each sample, and uses the concatenated string of the tobacco variety label and planting plot label as the Mondrian partition key. For example, a sample with the variety "Yunyan 87" and the plot number "A03" has the Mondrian partition key "Yunyan 87_A03". Samples with the same Mondrian partition key are grouped into the same Mondrian partition.
[0042] In real-world data, some combinations of varieties and plots may contain only a very small number of samples. When the number of calibration samples within a Mondrian partition is less than 30, the calibration samples of that partition are insufficient to support reliable residual scaling and conformal threshold calculation. In this case, the samples within the current Mondrian partition are reassigned to a variety-level Mondrian partition that uses only the tobacco variety label as the Mondrian partition key, ignoring plot differences and aggregating by variety alone. The threshold of 30 samples is based on the fact that, under a Gaussian distribution, the relative standard error of the sample standard deviation calculated from 30 independent samples is approximately 13%, which is within an acceptable range for engineering purposes; if the number of samples is reduced further, the estimation error of the sample standard deviation increases significantly. In an optional implementation, this threshold can be adjusted to between 20 and 50 depending on the actual amount of data.
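The partition key construction and the variety-level fallback can be sketched as follows. The sample dictionaries, the plot numbers other than "A03", the second variety name, and the function names are hypothetical. The helper `relative_std_error` uses the standard large-sample approximation $1/\sqrt{2(n-1)}$ for Gaussian data, which gives about 13% at $n = 30$.

```python
import math
from collections import defaultdict

def mondrian_partition(samples, min_size=30):
    """Group by variety and plot; fall back to variety only for small groups."""
    by_full_key = defaultdict(list)
    for sample in samples:
        by_full_key[f"{sample['variety']}_{sample['plot']}"].append(sample)
    partitions = defaultdict(list)
    for key, members in by_full_key.items():
        if len(members) >= min_size:
            partitions[key].extend(members)
        else:
            # Too few samples: aggregate under the variety-level key instead.
            for sample in members:
                partitions[sample["variety"]].append(sample)
    return dict(partitions)

def relative_std_error(n):
    """Approximate relative standard error of a sample standard deviation."""
    return 1.0 / math.sqrt(2.0 * (n - 1))

# Hypothetical calibration samples; only the labels matter for partitioning.
samples = (
    [{"variety": "Yunyan 87", "plot": "A03"}] * 40
    + [{"variety": "Yunyan 87", "plot": "A04"}] * 10
    + [{"variety": "Yunyan 87", "plot": "A05"}] * 8
    + [{"variety": "K326", "plot": "B01"}] * 5
)
sizes = {key: len(members) for key, members in mondrian_partition(samples).items()}
```

In this toy data the variety-level partitions still hold fewer than 30 samples; how such a case is handled is not covered by the sketch.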
[0043] After the above partitioning operation, the calibration sample set is divided into several Mondrian calibration subsets, and the test sample set is divided into several Mondrian test subsets according to the same final effective Mondrian partition key. Each Mondrian calibration subset and the Mondrian test subset with the same Mondrian partition key constitute a Mondrian partition pair. Assume that a total of 5 Mondrian partition pairs are obtained after partitioning.
[0044] Within each Mondrian partition, the cloud server executes three operations in sequence: residual scaling, temperature scaling, and conformal prediction. The order of these three operations is fixed and cannot be changed, because the temperature scaling operation depends on the output of the residual scaling operation, and the conformal prediction operation depends on the output of the temperature scaling operation; together, they form a cascaded calibration pipeline.
[0045] Taking one Mondrian partition pair as an example, assume that the Mondrian calibration subset of this Mondrian partition pair contains 40 calibration samples.
[0046] The residual scaling operation is performed as follows: The compressed feature vector of each calibration sample in the Mondrian calibration subset is input into the cloud-based yield prediction base model to obtain the predicted yield per acre. The difference between the predicted yield per acre and the actual yield per acre for each calibration sample is calculated, and this difference is used as the prediction residual. For example, if the predicted yield per acre for a calibration sample is 152.3 kg and the actual yield per acre is 148.7 kg, the prediction residual is 3.6 kg. After collecting the prediction residuals of all 40 calibration samples in the Mondrian calibration subset, the sample standard deviation of these 40 prediction residuals is calculated. The sample standard deviation is calculated as follows: first, the arithmetic mean of the 40 prediction residuals is calculated; then, the square of the difference between each prediction residual and the arithmetic mean is calculated; the sum of the 40 squared values is divided by 39 (i.e., sample size minus 1); finally, the square root of the quotient is taken. The obtained sample standard deviation is determined as the partition residual scale for the current Mondrian partition pair. The partitioned residual scale reflects the overall prediction error level of the cloud-based yield prediction base model on the current variety and plot combination, providing an initial scale estimate of the error distribution for subsequent temperature scaling operations.
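The residual scaling calculation above can be written out step by step as in the following sketch. The function name and the three-sample input are hypothetical; only the first pair (152.3 kg predicted, 148.7 kg actual) comes from this embodiment.

```python
import math
import statistics

def partition_residual_scale(predicted, actual):
    """Return the prediction residuals and their sample standard deviation."""
    residuals = [p - a for p, a in zip(predicted, actual)]
    mean = sum(residuals) / len(residuals)
    squared_deviations = sum((r - mean) ** 2 for r in residuals)
    # Divide by the sample size minus 1, then take the square root.
    scale = math.sqrt(squared_deviations / (len(residuals) - 1))
    return residuals, scale

predicted = [152.3, 160.0, 145.0]  # first value from the text, the rest invented
actual = [148.7, 162.0, 144.0]
residuals, scale = partition_residual_scale(predicted, actual)
```

The same value is returned by `statistics.stdev(residuals)` from the Python standard library, which also divides by the sample size minus 1.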
[0047] Referring to Figure 3, the figure consists of two subplots. The left subplot shows the residual distribution fit before the temperature scaling operation, and the right subplot shows the fit after the temperature scaling operation. The horizontal axis of both subplots represents the prediction residual, expressed in kilograms per acre, where the prediction residual is defined as the predicted yield per acre output by the cloud-based yield prediction base model minus the actual yield per acre label value. The vertical axis represents the probability density. The left subplot presents the empirical distribution of the prediction residuals of all 40 calibration samples in one Mondrian calibration subset as a histogram, in which the height of each bar reflects the probability density of samples falling within that residual interval. A smooth curve is overlaid on the histogram; this curve is a Gaussian probability density function centered at 0 with the partition residual scale as its standard deviation. The value of the partition residual scale is given in the label box of the left subplot. As seen in the left subplot, the Gaussian fitted curve deviates somewhat from the histogram: the peak of the curve is slightly higher than the actual density in the central region of the histogram, while the curve is slightly lower than the histogram in the tail regions on both sides. This indicates that the Gaussian distribution constructed directly with the sample standard deviation as its standard deviation does not perfectly match the shape of the actual residual distribution, whose tails are slightly heavier than those of a Gaussian distribution. In the right subplot, the histogram is identical to that of the left subplot, because the temperature scaling operation does not change the prediction residuals themselves, only the scale parameter of the fitted distribution. 
Two curves are superimposed in the right subplot: one is the temperature-scaled Gaussian fitted curve, centered at 0, with the temperature-calibrated partition scale as the standard deviation (the temperature-calibrated partition scale equals the partition residual scale multiplied by the optimal temperature scaling factor); the other is the Gaussian fitted curve before scaling, drawn as a dashed line for comparison. The values of the optimal temperature scaling factor and the temperature-calibrated partition scale are given in the label boxes of the right subplot. In this embodiment, the optimal temperature scaling factor is greater than 1, meaning that the temperature scaling operation appropriately amplifies the standard deviation of the Gaussian distribution, resulting in a lower peak and a higher tail of the fitted curve, thus better covering the thicker tail region of the actual residual distribution. As can be observed from the right-hand subplot, the fit between the scaled Gaussian fitted curve and the histogram is significantly improved compared to before scaling, especially in the tail region where the deviation between the fitted curve and the histogram is significantly reduced. The practical significance of this improvement is that in the subsequent conformal prediction operation, the normalized inconsistency score is obtained by dividing the absolute value of the predicted residual by the temperature-calibrated partition scale. After temperature scaling makes the fitted distribution more closely match the actual distribution, the distribution of the normalized inconsistency score is closer to the standardized form, making the estimation of the conformal threshold more accurate. The final constructed self-calibrated prediction interval achieves a better balance between width and coverage.
[0048] The temperature scaling operation further refines the scale parameter of the error distribution on the basis of the residual scaling operation. The motivation is that the partition residual scale is calculated directly from the sample standard deviation, which implicitly assumes that the prediction residuals follow a Gaussian distribution with a mean of 0. However, the actual residual distribution may deviate slightly from this assumption, with tails that are heavier or lighter than those of a Gaussian distribution. The temperature scaling factor, as a multiplicative adjustment coefficient, can stretch or compress the scale of the residual distribution, allowing the adjusted Gaussian distribution to better fit the shape of the actual residual distribution and thus providing a more accurate normalization benchmark for the subsequent conformal prediction.
[0049] The specific execution process of the temperature scaling operation is as follows: one temperature scaling factor to be optimized is set for the current Mondrian partition, with an initial search range of 0.01 to 100. The lower bound of the initial search range, 0.01, is close to 0 but positive, which prevents the probability density function from degenerating when the standard deviation is 0; the upper bound, 100, is large enough to cover the scale shifts that may occur in practice.
[0050] For each calibration sample in the Mondrian calibration subset, a Gaussian probability density function is constructed centered at 0, with a standard deviation equal to the product of the temperature scaling factor and the partition residual scale. The density is centered at 0 because the expected value of the prediction residuals should be 0 under an unbiased model; temperature scaling adjusts only the width of the distribution without changing its central position. The prediction residual of the calibration sample is substituted into this Gaussian probability density function to obtain the probability density value at that point, and the natural logarithm of the probability density value is taken as the log-likelihood value of that calibration sample. The log-likelihood values of all 40 calibration samples in the Mondrian calibration subset are summed to obtain the total log-likelihood value under the current temperature scaling factor. The total log-likelihood value measures the overall goodness of fit of the constructed Gaussian distribution to the 40 observed prediction residuals under a given temperature scaling factor.
[0051] The golden section search method is used to find the temperature scaling factor that maximizes the total log-likelihood value within the initial search interval. The golden section search method is an interval search method suitable for unimodal functions. In each iteration, two interior points are placed in the current search interval according to the golden ratio (approximately 0.618), the total log-likelihood value at each interior point is calculated, and the sub-interval containing the interior point with the larger total log-likelihood value is retained while the remainder is discarded, so that each iteration reduces the search interval to 0.618 times its previous width. After 50 iterations, the width of the search interval is reduced to $0.618^{50}$ times the initial interval width, approximately $3.5 \times 10^{-11}$ of the initial width, which is an extremely high accuracy. The temperature scaling factor corresponding to the midpoint of the final interval is determined as the optimal temperature scaling factor for the current Mondrian partition pair. In an optional implementation, the golden section search method can be replaced by a ternary search method or the Brent method, and the number of search iterations can be set between 30 and 100.
[0052] Multiplying the partition residual scale by the optimal temperature scaling factor yields the temperature-calibrated partition scale. The temperature-calibrated partition scale is the standard deviation of the error distribution after data-driven adjustment, and it more accurately describes the actual prediction error distribution within the current Mondrian partition compared to the original partition residual scale.
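The temperature scaling steps of the preceding paragraphs can be sketched as follows. The ten residual values and all function names are invented for illustration. Under the definitions above, the maximizer of the total log-likelihood also has a closed form, namely the root mean square of the residuals divided by the partition residual scale, which the sketch computes as a cross-check on the search result.

```python
import math
import statistics

INV_PHI = (math.sqrt(5.0) - 1.0) / 2.0  # golden ratio factor, about 0.618

def total_log_likelihood(t, residuals, scale):
    """Sum of log N(r; 0, (t * scale)^2) over all calibration residuals."""
    sigma = t * scale
    return sum(
        -0.5 * math.log(2.0 * math.pi * sigma ** 2) - r ** 2 / (2.0 * sigma ** 2)
        for r in residuals
    )

def golden_section_max(f, low, high, iterations=50):
    """Maximize a unimodal function on [low, high]; return the final midpoint."""
    c = high - INV_PHI * (high - low)
    d = low + INV_PHI * (high - low)
    fc, fd = f(c), f(d)
    for _ in range(iterations):
        if fc > fd:  # maximum lies in [low, d]; the old c becomes the new d
            high, d, fd = d, c, fc
            c = high - INV_PHI * (high - low)
            fc = f(c)
        else:  # maximum lies in [c, high]; the old d becomes the new c
            low, c, fc = c, d, fd
            d = low + INV_PHI * (high - low)
            fd = f(d)
    return (low + high) / 2.0

# Hypothetical prediction residuals (kg) for one Mondrian calibration subset.
residuals = [3.6, -2.0, 1.0, -5.5, 7.2, 0.4, -1.1, 2.9, -8.3, 4.0]
scale = statistics.stdev(residuals)  # partition residual scale
t_opt = golden_section_max(
    lambda t: total_log_likelihood(t, residuals, scale), 0.01, 100.0
)
closed_form = math.sqrt(sum(r * r for r in residuals) / len(residuals)) / scale
calibrated_scale = t_opt * scale  # temperature-calibrated partition scale
```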
[0053] Conformal prediction constructs a prediction interval with guaranteed coverage based on the output of temperature scaling. The core advantage of conformal prediction is that it does not require assuming that the prediction residuals strictly follow a Gaussian distribution or any specific parametric family of distributions; it only needs the empirical quantiles of the calibration samples to construct a prediction interval that satisfies a finite-sample coverage guarantee. For the $(n+1)$-th exchangeable sample, the coverage probability of the prediction interval calculated on the calibration samples is no less than $1-\alpha$, where $n$ is the calibration sample size and $\alpha$ is the significance level. This invention combines conformal prediction with the aforementioned temperature scaling: first, temperature scaling is used to bring the normalized inconsistency scores as close as possible to a standardized distribution, and then conformal quantiles are used to provide distribution-independent coverage as a fallback.
[0054] The specific execution process of conformal prediction is as follows: For each calibration sample in the Mondrian calibration subset, the absolute value of the prediction residual is calculated. The absolute value of the prediction residual is then divided by the temperature-calibrated partition scale to obtain the normalized inconsistency score. The purpose of dividing by the temperature-calibrated partition scale is to unify the residuals with different error scales in different Mondrian partitions to a comparable dimensionless scale, so that the subsequent quantile thresholds have a consistent interpretation across partitions.
[0055] The normalized inconsistency scores of all 40 calibration samples in the Mondrian calibration subset are sorted from smallest to largest. The confidence level is set to 95%, i.e., the prediction interval is expected to cover 95% of future samples. The quantile index is then calculated: adding 1 to the number of normalized inconsistency scores, 40, gives 41; multiplying by 0.95 gives 38.95; and rounding up gives 39. Here 40 is the number of calibration samples and 0.95 is the confidence level. The normalized inconsistency score at the 39th position of the sorted sequence (39th from smallest to largest) is determined as the conformal threshold for the current Mondrian partition pair. The theoretical basis for adding 1 is the finite-sample correction of conformal prediction: for $n$ exchangeable calibration samples, taking the $\lceil (n+1)(1-\alpha) \rceil$-th smallest inconsistency score as the threshold, where $\lceil \cdot \rceil$ denotes rounding up, $n$ is the calibration sample size, and $\alpha$ is the significance level, ensures that the marginal coverage probability of a new sample is not less than $1-\alpha$. In an optional implementation, the confidence level can be set to 90% or 99% depending on production management needs.
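The quantile index calculation can be sketched as below. The function names and the evenly spaced scores are hypothetical, and raising an error when the index exceeds the number of calibration samples is a design choice of the sketch, not something stated in this embodiment.

```python
import math

def conformal_rank(n_calibration, confidence=0.95):
    """1-based position of the threshold: ceil((n + 1) * confidence)."""
    return math.ceil((n_calibration + 1) * confidence)

def conformal_threshold(scores, confidence=0.95):
    """Pick the score at the conformal rank from the ascending sorted scores."""
    rank = conformal_rank(len(scores), confidence)
    if rank > len(scores):
        # Assumption: refuse rather than return an unbounded threshold.
        raise ValueError("too few calibration samples for this confidence level")
    return sorted(scores)[rank - 1]

# Hypothetical normalized inconsistency scores for 40 calibration samples.
scores = [i / 10 for i in range(40, 0, -1)]
threshold = conformal_threshold(scores)
```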
[0056] For each test sample in the Mondrian test subset, its compressed feature vector is input into the cloud-based yield prediction base model to obtain the predicted yield per acre. The calibration radius is obtained by multiplying the temperature-calibrated partition scale by the conformal threshold. The calibration radius is the half-width of the interval determined by conformal prediction under the temperature-calibrated error distribution. The calibration radius is extended both above and below the predicted yield per acre to form the self-calibrated prediction interval. For example, if the predicted yield per acre for a test sample is 155.0 kg, the temperature-calibrated partition scale is 8.2 kg, and the conformal threshold is 1.73, then the calibration radius is 8.2 × 1.73 ≈ 14.19 kg, and the self-calibrated prediction interval is 140.81 kg to 169.19 kg.
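The interval construction in the worked example above amounts to the following short sketch; the function name is hypothetical and the numbers are those of the example.

```python
def self_calibrated_interval(prediction, calibrated_scale, threshold):
    """Extend the calibration radius above and below the predicted yield."""
    radius = calibrated_scale * threshold
    return prediction - radius, prediction + radius

lower, upper = self_calibrated_interval(155.0, 8.2, 1.73)
```

Rounded to two decimal places, `lower` and `upper` reproduce the bounds of 140.81 kg and 169.19 kg.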
[0057] After all Mondrian partition pairs have completed the above three operations, the cloud server summarizes and stores the Mondrian partition key, partition residual scale, optimal temperature scaling factor, and conformal threshold of each Mondrian partition pair into a partition calibration parameter table. Each record in the partition calibration parameter table contains one Mondrian partition key and three corresponding values: partition residual scale, optimal temperature scaling factor, and conformal threshold. In this embodiment, the five Mondrian partition pairs correspond to five records in the partition calibration parameter table. The data volume of the partition calibration parameter table is extremely small, typically only a few hundred bytes, and can be quickly distributed to edge nodes under extremely low bandwidth conditions.
[0058] Referring to Figure 4, the horizontal axis represents the test sample number, and the vertical axis represents the tobacco yield per mu (unit: kilograms per mu). The figure contains test samples from five Mondrian partitions, separated by vertical dashed lines, with each partition number labeled below the horizontal axis. Each test sample corresponds to one vertical error bar and two marker points. The upper and lower ends of the error bar represent the upper and lower bounds of the self-calibrated prediction interval, respectively: the upper bound equals the predicted yield per mu plus the calibration radius, and the lower bound equals the predicted yield per mu minus the calibration radius. The square marker, located at the center of the error bar, represents the predicted yield per mu output by the edge inference model; the hollow circular marker represents the actual yield per mu label value of the corresponding sample. When the actual yield per mu label value falls between the upper and lower bounds of the self-calibrated prediction interval, the sample is successfully covered by the prediction interval. Several characteristics can be observed in Figure 4. Within each Mondrian partition, the length of the error bars (i.e., the width of the self-calibrated prediction interval) is constant, because all test samples within the same Mondrian partition share the same partition residual scale, optimal temperature scaling factor, and conformal threshold, and therefore the same calibration radius. Between different Mondrian partitions, the length of the error bars varies, reflecting the different prediction error levels of the cloud-based yield prediction base model under different combinations of tobacco varieties and planting plots. 
For example, Mondrian partitions with larger partition residual scales correspond to longer error bars, indicating higher prediction uncertainty in that partition and requiring a wider prediction interval to achieve the target coverage. Conversely, Mondrian partitions with smaller partition residual scales correspond to shorter error bars, indicating more stable predictions in that partition, allowing for the same coverage guarantee with a narrower interval. This differentiated interval width is the core advantage of the Mondrian partitioning mechanism: without partitioning and using globally uniform calibration parameters, combinations of varieties and plots with smaller prediction errors will obtain unnecessarily wide intervals, while combinations of varieties and plots with larger prediction errors may obtain narrow intervals with insufficient coverage. Figure 4 The top-left corner annotation box displays the overall coverage value for all test samples. This coverage is close to the preset 95% confidence level, verifying that the self-calibration prediction interval constructed by Mondrian partition conformal prediction and temperature scaling chain calibration has effective coverage guarantee. The actual yield per acre values of the vast majority of test samples fall within the corresponding self-calibration prediction interval. Only a very small number of samples have actual values outside the interval. These uncovered samples statistically meet the expected uncovered proportion of 5%, indicating that no systematic bias occurred in the calibration process.
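The overall coverage value shown in the annotation box is the fraction of test samples whose actual yield falls inside its self-calibrated prediction interval. A minimal sketch follows, with a hypothetical function name and invented intervals and yields.

```python
def empirical_coverage(intervals, actual_values):
    """Fraction of samples whose actual value lies within its interval."""
    covered = sum(
        1
        for (lower, upper), value in zip(intervals, actual_values)
        if lower <= value <= upper
    )
    return covered / len(actual_values)

# Invented example: the second actual value falls outside its interval.
intervals = [(140.0, 170.0), (100.0, 120.0), (150.0, 160.0)]
actual_values = [155.0, 125.0, 150.0]
coverage = empirical_coverage(intervals, actual_values)
```

With only 29 test samples the achievable coverage values are coarse: one uncovered sample corresponds to about 96.6% coverage and two to about 93.1%.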
[0059] It should be noted that as the new harvest continues and more plots obtain actual yield values per acre, the cloud server can periodically re-execute the above-mentioned chain-like online self-calibration process, incorporate newly harvested samples into the calibration sample set, update the partition calibration parameter table and redistribute it to each edge node, thereby achieving rolling refresh of calibration parameters.
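By way of non-limiting illustration, the chained calibration that is re-executed for one Mondrian partition can be sketched in Python as follows. The sketch assumes that the prediction residuals of the calibration samples are already available, takes the quantile number as (n + 1) multiplied by the confidence level and rounded up (the usual split-conformal convention, assumed here), and clamps that number to n for very small partitions (also an assumption). All function names and the residual values are hypothetical.

```python
import math
import statistics


def total_log_likelihood(residuals, sigma):
    """Sum of zero-centered Gaussian log-densities with standard deviation sigma."""
    n = len(residuals)
    squared_sum = sum(r * r for r in residuals)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - squared_sum / (2 * sigma**2)


def golden_section_max(f, lo, hi, iterations=50):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    for _ in range(iterations):
        if fc > fd:  # the maximum lies in [a, d]
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:  # the maximum lies in [c, b]
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    return (a + b) / 2


def calibrate_partition(residuals, confidence=0.95):
    """Return (partition residual scale, optimal temperature factor, conformal threshold)."""
    scale = statistics.stdev(residuals)  # sample standard deviation
    temperature = golden_section_max(
        lambda t: total_log_likelihood(residuals, t * scale), 0.01, 100.0
    )
    calibrated_scale = temperature * scale
    scores = sorted(abs(r) / calibrated_scale for r in residuals)
    # Assumed quantile number: ceil((n + 1) * confidence), clamped to n.
    rank = min(math.ceil((len(scores) + 1) * confidence), len(scores))
    return scale, temperature, scores[rank - 1]


# Hypothetical prediction residuals (kg per mu) of 30 calibration samples.
residuals = [((7 * i) % 23) - 11.0 for i in range(30)]
scale, temperature, threshold = calibrate_partition(residuals)
print(scale, temperature, threshold)
```

Running `calibrate_partition` once per partition and collecting the three returned values under the partition key yields one record of the partition calibration parameter table per partition.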
[0060] The optimal configuration of the cloud-based yield prediction base model, determined through AutoML search, may be a gradient boosting decision tree ensemble containing hundreds of decision trees, or a temporal convolutional network with multiple convolutional and fully connected layers. Cloud servers have ample computational resources to run such models, but deploying them directly to edge nodes faces two challenges: first, edge nodes are typically equipped with only low-power processors and limited memory, making it difficult to handle the inference operations of large-scale models; second, the inference process of some model families (such as gradient boosting decision trees) involves numerous conditional branch judgments, resulting in significantly lower execution efficiency on edge embedded hardware compared to well-structured matrix operations. Therefore, this invention employs knowledge distillation to transfer the predictive capability of the cloud-based yield prediction base model to a compact, lightweight multilayer perceptron, which runs efficiently on edge nodes while preserving as much of the cloud model's prediction accuracy as possible.
[0061] The basic idea behind knowledge distillation is to avoid requiring student models to directly learn the input-output mapping from the original labeled data. Instead, student models mimic the teacher model's predicted output for each input sample. The teacher model's predicted output contains complex nonlinear relationships extracted from the globally compressed training feature table during training, including interaction patterns between different features and strategies for handling extreme samples. This information is richer than a single actual yield per acre label. By using the teacher model's predicted output as a learning target, student models can inherit the teacher model's understanding of the feature space with a smaller model size.
[0062] The specific distillation training process is as follows: The cloud server uses the cloud yield prediction base model as the teacher model and constructs a lightweight multilayer perceptron as the student model. The structural design of the student model needs to strike a balance between model capacity and edge inference efficiency. In this embodiment, the student model is set as a multilayer perceptron with two hidden layers. The first hidden layer contains 64 nodes, the second hidden layer contains 32 nodes, and the output layer contains one node for outputting the predicted yield per acre value. The activation function of each hidden layer is the ReLU function. In contrast, if the teacher model is a gradient boosting decision tree, it typically contains 300 to 800 decision trees with a depth of 5 to 8; if the teacher model is a temporal convolutional network, it typically contains 4 to 6 convolutional layers with 64 to 128 channels. Regardless of the model family the teacher model belongs to, the number of hidden layers and the number of nodes per layer in the student model are less than the equivalent parameter size of the teacher model, ensuring that the inference computation of the student model is suitable for the hardware conditions of the edge nodes. In an optional implementation, the number of hidden layers in the student model can be set to 1 to 3, and the number of nodes in each layer can be set to 16 to 128. The specific values can be adjusted according to the processor computing power and memory capacity of the edge nodes.
[0063] In the data preparation phase of distillation training, the cloud server inputs each sample from the globally compressed training feature table into the teacher model, obtaining the predicted yield per acre value output by the teacher model for each sample. This predicted yield per acre value is used as the soft target value for the corresponding sample. The soft target value is called a "soft" target to distinguish it from the "hard" label of the actual yield per acre: the actual yield per acre is the objectively measured real yield, while the soft target value is a prediction given by the teacher model based on the feature mapping relationship learned internally, reflecting the teacher model's understanding of the data distribution. In this embodiment, the globally compressed training feature table contains 190 samples, thus generating 190 soft target values.
[0064] During the distillation training phase, the compressed feature vector of each sample in the globally compressed training feature table is used as the input to the student model, and the soft target value of the corresponding sample is used as the supervision signal. The mean squared error loss function is used to train the student model. The mean squared error loss function calculates the square of the difference between the output value of the student model and the soft target value, and the arithmetic mean of the squared differences of all training samples is taken as the loss value. The training process uses the Adam optimizer, with an initial learning rate set to 0.001, a batch size set to 32, and 200 training epochs. In each training epoch, all 190 samples are randomly shuffled and grouped according to the batch size before being input into the student model. The loss value is calculated, and the connection parameters of each layer in the student model are updated through backpropagation. In an optional implementation, the optimizer can be replaced with the SGD optimizer (with a momentum term of 0.9), the learning rate can be gradually decayed from 0.01 to 0.0001 using a cosine annealing scheduling strategy, and the number of training epochs can be set between 100 and 500.
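By way of non-limiting illustration, the loss value and the per-epoch batching described above can be sketched in Python as follows. The sketch covers only the mean squared error against the soft target values and the shuffled grouping of the 190 samples into batches of 32; the parameter update by the Adam optimizer is omitted. The function names `distillation_mse` and `shuffled_batches` and the random seed are hypothetical.

```python
import random


def distillation_mse(student_outputs, soft_targets):
    """Arithmetic mean of squared differences between student outputs and soft targets."""
    squared = [(s - t) ** 2 for s, t in zip(student_outputs, soft_targets)]
    return sum(squared) / len(squared)


def shuffled_batches(num_samples, batch_size, rng):
    """Randomly shuffle sample indices and group them by batch size."""
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, num_samples, batch_size)]


# One epoch over the 190 samples of this embodiment with a batch size of 32.
batches = shuffled_batches(190, 32, random.Random(0))
print([len(batch) for batch in batches])
```

With 190 samples and a batch size of 32, each epoch consists of five full batches and one final batch of 30 samples.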
[0065] During training, the cloud server records the loss value at the end of each training epoch. When, for 20 consecutive training epochs, the loss value decreases by no more than 0.1% of the previous epoch's loss value, the student model is considered to have converged sufficiently, and training is terminated early. In this embodiment, the student model triggers the early termination condition at epoch 163. The student model obtained after training is the edge inference model.
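By way of non-limiting illustration, the early termination condition can be expressed in Python as follows. The function name `early_stop_epoch` and the loss history are hypothetical; the thresholds of 0.1% and 20 consecutive epochs follow the embodiment.

```python
def early_stop_epoch(losses, rel_tol=0.001, patience=20):
    """Return the 1-based epoch at which training stops early, or None.

    An epoch counts as stalled when its loss has decreased by no more than
    rel_tol times the previous epoch's loss; training stops after `patience`
    consecutive stalled epochs.
    """
    stalled = 0
    for i in range(1, len(losses)):
        improvement = losses[i - 1] - losses[i]
        if improvement <= rel_tol * losses[i - 1]:
            stalled += 1
        else:
            stalled = 0
        if stalled >= patience:
            return i + 1
    return None


# Hypothetical loss history: ten quickly improving epochs, then a plateau.
example_losses = [100.0 * 0.9**k for k in range(10)] + [38.0] * 25
stop_epoch = early_stop_epoch(example_losses)
print(stop_epoch)
```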
[0066] To verify the distillation effect, the cloud server input all samples from the globally compressed training feature table into the teacher model and the edge inference model respectively, and calculated the root mean square error between their outputs. In a typical experimental scenario, the root mean square error of the outputs of the teacher model and the edge inference model was approximately 2.1 kg, while the root mean square error of the teacher model itself in 5-fold cross-validation was approximately 9.8 kg. The additional error introduced by distillation is relatively small compared to the prediction error of the teacher model itself, indicating that the edge inference model has inherited the prediction ability of the teacher model quite well.
[0067] After distillation training is complete, the cloud server serializes the edge inference model into a model file that can be loaded and run on edge nodes. In this embodiment, the student model is a multilayer perceptron with two hidden layers, an input dimension of 12 (consistent with the dimension of the compressed feature vector), 64 nodes in the first hidden layer, 32 nodes in the second hidden layer, and an output dimension of 1. The total number of parameters in the model file is [number missing], where the terms represent, in order, the number of connection parameters in the first hidden layer, the number of bias parameters in the first hidden layer, the number of connection parameters in the second hidden layer, the number of bias parameters in the second hidden layer, the number of connection parameters in the output layer, and the number of bias parameters in the output layer. The model file stores these parameters as a set of single-precision floating-point numbers of 4 bytes each, so the model file size is approximately [size missing], which is less than 12 kilobytes. Each record in the partition calibration parameter table contains one Mondrian partition key (string) and three floating-point values (partition residual scale, optimal temperature scaling factor, and conformal threshold). In this embodiment, there are a total of 5 records, with a data size not exceeding 1 kilobyte. The combined data size of the model file and the partition calibration parameter table is therefore less than 13 kilobytes, which can be distributed within seconds in a low-bandwidth network environment.
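By way of non-limiting illustration, the parameter count and storage size of a fully connected network with the layer sizes of this embodiment can be checked with the short Python sketch below. It assumes one connection parameter per input-output pair and one bias parameter per node, with 4 bytes per parameter; the resulting figures are derived from the stated layer sizes and are not quoted from the source text. The function name `mlp_parameter_count` is hypothetical.

```python
def mlp_parameter_count(layer_sizes):
    """Connection parameters plus bias parameters of a fully connected network."""
    return sum(
        n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


parameter_count = mlp_parameter_count([12, 64, 32, 1])
size_bytes = parameter_count * 4  # single-precision floating point, 4 bytes each
print(parameter_count, size_bytes, size_bytes / 1024)
```

Under these assumptions the count is 2945 parameters and 11780 bytes, which agrees with the stated bound of less than 12 kilobytes.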
[0068] The cloud server distributes the edge inference model and the partition calibration parameter table to each edge node. After receiving the data, each edge node loads the edge inference model into its local runtime environment and stores the partition calibration parameter table in its local storage for later retrieval during inference.
[0069] The complete process of each edge node performing inference locally is as follows. When an edge node needs to predict the tobacco yield of a plot of land to be predicted, it first processes the newly collected multi-source time-series acquisition sequence of the plot in the same way as in step 1: it performs sliding window segmentation (the window length is one observation cycle of the tobacco growth period, and the sliding step size is half of the observation cycle of the tobacco growth period), calculates four statistics (mean, variance, maximum, and minimum) for each channel in each window, and concatenates them into a window feature vector. The window feature vectors are stacked row by row to form a local feature matrix. Linear projection compression is performed on each row of the local feature matrix using the Gaussian random measurement matrix pre-stored in the edge node to obtain the compressed feature vector to be predicted. It is worth noting that the Gaussian random measurement matrix used by the edge node in the inference stage is the same copy as the Gaussian random measurement matrix used in the training stage in step 1, ensuring that the feature space of training and inference is consistent.
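By way of non-limiting illustration, the edge-side feature preparation can be sketched in Python as follows. The sketch assumes population variance for the variance statistic, a measurement matrix with independent N(0, 1/d_out) entries regenerated from a shared seed, and a small synthetic input of 20 time steps and 3 channels. These choices, the function names, and the seed value are hypothetical and are not specified by the embodiment.

```python
import random
import statistics


def window_features(series, window, step):
    """Slide a window over `series` (rows = time steps, columns = channels) and
    return one feature row per window: mean, variance, max, min for each channel."""
    rows = []
    for start in range(0, len(series) - window + 1, step):
        chunk = series[start:start + window]
        features = []
        for channel in zip(*chunk):
            features += [
                statistics.fmean(channel),
                statistics.pvariance(channel),  # assumption: population variance
                max(channel),
                min(channel),
            ]
        rows.append(features)
    return rows


def gaussian_measurement_matrix(d_in, d_out, seed):
    """Seeded d_in x d_out Gaussian matrix, so training and inference share one copy."""
    rng = random.Random(seed)
    std = d_out ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(d_out)] for _ in range(d_in)]


def project(rows, matrix):
    """Linear projection of each feature row by the measurement matrix."""
    return [
        [sum(x * m for x, m in zip(row, column)) for column in zip(*matrix)]
        for row in rows
    ]


# Hypothetical input: 20 time steps, 3 channels; window of 8 steps, sliding step of 4.
series = [[float(t), 2.0 * t, 1.0] for t in range(20)]
local_matrix = window_features(series, window=8, step=4)        # 4 rows x 12 features
phi = gaussian_measurement_matrix(d_in=12, d_out=4, seed=2026)  # one-third of 12
compressed = project(local_matrix, phi)                         # 4 rows x 4 features
print(len(compressed), len(compressed[0]))
```

Regenerating the matrix from the same seed is one possible way to keep the training and inference feature spaces identical; storing the matrix itself on the edge node, as described above, achieves the same effect.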
[0070] The compressed feature vector to be predicted is input into the edge inference model. After forward propagation, the edge inference model outputs a single scalar value, which is the predicted yield per acre. Since the edge inference model is a lightweight multilayer perceptron with only two hidden layers, the forward propagation involves only two hidden-layer matrix-vector multiplications, two ReLU activation operations, and one final dot product for the single output node. On a typical edge processor (such as an ARM Cortex-A72 with a clock frequency of 1.5 GHz), the inference time is less than 1 millisecond, meeting the requirements for real-time inference.
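By way of non-limiting illustration, the forward propagation of a two-hidden-layer perceptron can be sketched in plain Python as follows. The random weights stand in for a trained model file and are purely hypothetical, as are the function names; only the layer sizes 12, 64, 32, and 1 follow the embodiment.

```python
import random


def dense(vector, weights, biases):
    """Affine layer: one weight row and one bias per output node."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, biases)
    ]


def relu(vector):
    return [max(0.0, x) for x in vector]


def predict_yield(features, layers):
    """Forward pass; `layers` holds three (weights, biases) pairs:
    first hidden layer, second hidden layer, output layer."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    hidden1 = relu(dense(features, w1, b1))
    hidden2 = relu(dense(hidden1, w2, b2))
    return dense(hidden2, w3, b3)[0]


def random_layers(sizes, seed):
    """Hypothetical stand-in for a trained model file: small random weights."""
    rng = random.Random(seed)
    return [
        (
            [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)],
            [0.0] * n_out,
        )
        for n_in, n_out in zip(sizes, sizes[1:])
    ]


layers = random_layers([12, 64, 32, 1], seed=0)
prediction = predict_yield([0.5] * 12, layers)  # a single scalar value
print(prediction)
```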
[0071] After obtaining the predicted yield per acre, the edge node determines the corresponding Mondrian partition key based on the tobacco variety label and planting plot label of the sample to be predicted. The determination method is the same as in step 2: the concatenated string of the tobacco variety label and planting plot label is used as the Mondrian partition key. The edge node searches for the record that matches the current Mondrian partition key in the partition calibration parameter table and reads the corresponding three values: partition residual scale, optimal temperature scaling factor, and conformal threshold.
[0072] The calibration radius is calculated by multiplying the partition residual scale by the optimal temperature scaling factor and then by the conformal threshold. This can be expressed as $r = \sigma \cdot T^{*} \cdot q$, where $r$ is the calibration radius, $\sigma$ is the partition residual scale, $T^{*}$ is the optimal temperature scaling factor, and $q$ is the conformal threshold. The logic for multiplying the three factors is as follows: the partition residual scale provides the basic scale of the prediction error within the current Mondrian partition; the optimal temperature scaling factor fine-tunes this basic scale in a data-driven manner so that it better reflects the actual error distribution, and the product of the two is the temperature-calibrated partition scale; on the basis of the temperature-calibrated partition scale, the conformal threshold further determines the interval multiple that meets the target coverage rate.
[0073] Using the predicted yield per mu (a Chinese unit of area, approximately 0.165 acres) as the center, a self-calibrated prediction interval is formed by extending the calibration radius both above and below it. The upper bound of the self-calibrated prediction interval is the predicted yield per mu plus the calibration radius, and the lower bound is the predicted yield per mu minus the calibration radius. A specific inference example is as follows. Suppose the tobacco variety of a plot to be predicted is "Yunyan 87" and the plot number is "A03". The predicted yield per mu output by the edge inference model is 158.5 kg. The edge node looks up the table using the Mondrian partition key "Yunyan 87_A03" and reads a partition residual scale of 8.2 kg, an optimal temperature scaling factor of 1.05, and a conformal threshold of 1.73. The calibration radius is then the product of these three values, and the self-calibrated prediction interval is 143.61 kg to 173.39 kg. As the predicted tobacco yield, the edge node finally outputs a predicted yield of 158.5 kg per mu and a self-calibrated prediction interval of 143.61 kg to 173.39 kg per mu.
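By way of non-limiting illustration, the arithmetic of this example can be reproduced with the Python sketch below; the function name `self_calibrated_interval` is hypothetical. The unrounded product of the three values is about 14.895 kg, which gives bounds of about 143.60 kg and 173.40 kg; the bounds stated above differ from these by 0.01 kg, which would be consistent with a radius truncated to 14.89 kg.

```python
def self_calibrated_interval(predicted, residual_scale, temperature, threshold):
    """Return (lower bound, upper bound, calibration radius)."""
    radius = residual_scale * temperature * threshold
    return predicted - radius, predicted + radius, radius


lower, upper, radius = self_calibrated_interval(158.5, 8.2, 1.05, 1.73)
print(f"radius = {radius:.4f} kg, interval = [{lower:.4f}, {upper:.4f}] kg per mu")
```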
[0074] If an edge node finds no record in the partition calibration parameter table that matches the current Mondrian partition key (e.g., a planting plot that was not present in the calibration data is added in the new season), the edge node falls back to a Mondrian partition key consisting only of the tobacco variety label and searches the partition calibration parameter table again. If still no matching record is found (e.g., a new variety not present during the training phase is introduced in the new season), the maximum partition residual scale, the maximum optimal temperature scaling factor, and the maximum conformal threshold over all records in the partition calibration parameter table are taken as the partition residual scale, optimal temperature scaling factor, and conformal threshold of the current sample, respectively. A wider self-calibrated prediction interval is thus output as a conservative estimate, with the aim of keeping the coverage from falling below the target confidence level.
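By way of non-limiting illustration, the lookup with its two fallback levels can be sketched in Python as follows. The table layout (a dictionary from partition key to a tuple of three values), the function name `lookup_calibration`, and every record other than "Yunyan 87_A03" are hypothetical.

```python
def lookup_calibration(table, variety, plot):
    """Return (residual scale, temperature factor, conformal threshold) for a sample."""
    full_key = f"{variety}_{plot}"
    if full_key in table:   # exact variety and plot partition
        return table[full_key]
    if variety in table:    # variety-level partition
        return table[variety]
    # Conservative fallback: the maximum of each parameter over all records.
    scales, temperatures, thresholds = zip(*table.values())
    return max(scales), max(temperatures), max(thresholds)


# Hypothetical partition calibration parameter table.
table = {
    "Yunyan 87_A03": (8.2, 1.05, 1.73),
    "Yunyan 87": (9.0, 1.10, 1.80),
    "VarietyB_B01": (12.4, 0.95, 1.95),
}

print(lookup_calibration(table, "Yunyan 87", "A03"))  # exact match
print(lookup_calibration(table, "Yunyan 87", "C09"))  # variety-level record
print(lookup_calibration(table, "VarietyC", "D01"))   # conservative maximum
```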
[0075] The application scenarios of self-calibrated prediction intervals in actual production include: tobacco purchasing departments can formulate purchasing plans and storage capacity allocation schemes in advance based on the upper and lower boundaries of the prediction interval; planting managers can judge the uncertainty of the current plot prediction based on the width of the prediction interval, and strengthen field observation and manual verification for plots with excessively wide prediction intervals. When the chain-like online self-calibration process in step 2 updates the partition calibration parameter table as the harvest progresses, the cloud server redistributes the updated partition calibration parameter table to each edge node. The edge nodes replace the old version in their local storage with the new partition calibration parameter table, and subsequent inference automatically adopts the updated calibration parameters without requiring any modification to the edge inference model itself.
[0076] While specific embodiments of the present invention have been described above, those skilled in the art should understand that these specific embodiments are merely illustrative. Those skilled in the art can omit, substitute, and modify the details of the above methods and systems in various ways without departing from the principles and essence of the present invention. For example, combining the above method steps to perform substantially the same function and achieve substantially the same result according to substantially the same method falls within the scope of the present invention. Therefore, the scope of the present invention is defined only by the appended claims.
Claims
1. A method for rapid tobacco yield modeling and online self-calibration based on cloud-edge collaborative AutoML, characterized in that it includes the following steps:
Step 1, edge-side data compression and cloud-based AutoML modeling: multiple edge nodes collect multi-source time-series acquisition sequences, process them through sliding window statistics extraction and random projection compression to obtain compressed feature matrices, and report the compressed feature matrices to the cloud server. The cloud server merges the data to obtain a global compressed training feature table. Through Bayesian optimization, model selection and hyperparameter optimization are performed in the compressed feature space to train a cloud-based yield prediction base model.
Step 2, chained online self-calibration by Mondrian partition conformal prediction and temperature scaling: each edge node generates a new season compressed feature matrix for the newly harvested plots in the same way as in Step 1 and reports it. The cloud server merges the reported matrices into a new season global compressed feature table and divides it into a calibration sample set and a test sample set. Mondrian partitioning is performed based on the tobacco variety label and the planting plot label. Within each Mondrian partition, a residual scaling operation, a temperature scaling operation, and a conformal prediction operation are performed in sequence to generate a partition calibration parameter table.
Step 3, model distillation and edge calibration inference deployment: the cloud server uses the cloud-based yield prediction base model as the teacher model to distill the edge inference model. The edge inference model and the partition calibration parameter table are then distributed to each edge node. Each edge node uses the edge inference model to obtain the predicted yield per acre value and calculates the calibration radius based on the partition calibration parameter table. The self-calibrated prediction interval is formed by extending the calibration radius above and below the predicted yield per acre value.
2. The method of claim 1, wherein, The multi-source time-series acquisition sequence is composed of meteorological sensor data, soil sensor data, and crop growth sensor data. The sliding window statistics are extracted as follows: the window length is set to one tobacco growth period observation cycle, the window sliding step size is set to half of the tobacco growth period observation cycle, and four statistics, namely mean, variance, maximum and minimum values, are calculated for each channel in each window. The statistics of all channels in the same window are concatenated end to end to form a window feature vector, and all window feature vectors are stacked row by row to form a local feature matrix.
3. The method of claim 1, wherein, Random projection compression uses a Gaussian random measurement matrix pre-stored in the edge nodes. Linear projection is performed on each row of the local feature matrix to compress the dimension of each row from the original dimension to one-third of the original dimension. After all rows are projected, they are stacked to obtain the compressed feature matrix. The cloud server utilizes the approximate distance-preserving property of Gaussian random projection satisfying the Johnson-Lindenstrauss lemma to directly perform model training in the compressed feature space using a globally compressed training feature table.
4. The method of claim 1, wherein, Bayesian optimization is performed within a predefined AutoML model search space, which includes three candidate model families: gradient boosting decision trees, temporal convolutional networks, and lightweight multilayer perceptrons. Each candidate model family has its own predefined range of hyperparameter values. Bayesian optimization adopts a Bayesian optimization method based on a tree-structured Parzen estimator. In each iteration, one set of candidate configurations is sampled from the AutoML model search space. Each set of candidate configurations includes a candidate model family identifier and a corresponding set of hyperparameter values.
5. The method of claim 4, wherein, The global compressed training feature table and the corresponding actual yield per acre are used to perform 5-fold cross-validation evaluation. The mean of the 5-fold root mean square error is used as the evaluation score of the candidate configuration. After a preset number of iterations, the candidate configuration with the smallest evaluation score is selected as the optimal configuration. The cloud yield prediction base model is then fully trained on the global compressed training feature table according to the candidate model family and hyperparameter values specified by the optimal configuration.
6. The method of claim 1, wherein, The calibration sample set contains 70% of all samples in the new season's global compressed feature table, and the test sample set contains the remaining 30%. The Mondrian partitioning is performed as follows: the concatenated string of the tobacco variety label and the planting plot label attached to each sample is used as the Mondrian partitioning key, and samples with the same Mondrian partitioning key are grouped into the same Mondrian partition. When the number of calibration samples in a Mondrian partition is less than 30, the samples in the current Mondrian partition are reassigned to the variety-level Mondrian partition using only the tobacco variety label as the Mondrian partitioning key.
7. The method of claim 1, wherein, The execution process of residual scaling estimation is as follows: input the compressed feature vector of each calibration sample in the Mondrian calibration subset into the cloud-based yield prediction base model to obtain the predicted yield per acre value, calculate the difference between the predicted yield per acre value and the actual yield per acre value as the prediction residual, calculate the sample standard deviation of the prediction residuals of all calibration samples in the Mondrian calibration subset, and determine the sample standard deviation of the prediction residuals as the partition residual scale.
8. The method of claim 1, wherein, The temperature scaling operation is performed as follows: the initial search range of the temperature scaling factor is set to 0.01 to 100; for each calibration sample in the Mondrian calibration subset, a Gaussian probability density function is constructed with 0 as the center and the product of the temperature scaling factor and the partition residual scale as the standard deviation; the prediction residual of the calibration sample is substituted into the Gaussian probability density function to obtain the probability density value and the natural logarithm is taken to obtain the log-likelihood value; the log-likelihood values of all calibration samples in the Mondrian calibration subset are summed to obtain the total log-likelihood value. The golden section search method is used to perform 50 rounds of search iterations within the initial search interval. The temperature scaling factor that maximizes the total log-likelihood value is determined as the optimal temperature scaling factor. The partition residual scale is multiplied by the optimal temperature scaling factor to obtain the temperature-calibrated partition scale.
9. The method of claim 1, wherein the conformal prediction operation is performed as follows: calculate the absolute value of the prediction residual for each calibration sample in the Mondrian calibration subset; divide the absolute value of the prediction residual by the temperature-calibrated partition scale to obtain the normalized inconsistency score; sort all normalized inconsistency scores from smallest to largest; calculate the quantile number by adding 1 to the total number of normalized inconsistency scores, multiplying the sum by the 95% confidence level, and rounding up to an integer; take the normalized inconsistency score at the position given by the quantile number as the conformal threshold; for each test sample in the Mondrian test subset, multiply the temperature-calibrated partition scale by the conformal threshold to obtain the calibration radius; and extend the calibration radius above and below the predicted yield per acre to form a self-calibrated prediction interval.
10. The method of claim 1, wherein the model distillation process is as follows: a lightweight multilayer perceptron with fewer hidden layers and fewer nodes per layer than the cloud-based yield prediction base model is constructed as a student model. The predicted yield per acre value obtained by inputting each sample from the global compressed training feature table into the cloud-based yield prediction base model is used as the soft target value. The compressed feature vector is used as the input of the student model, and the soft target value is used as the supervision signal. The mean squared error loss function is used to train the student model to obtain the edge inference model. Each edge node generates a compressed feature vector to be predicted from the newly acquired multi-source time-series acquisition sequence in the same way as in Step 1. The compressed feature vector to be predicted is input into the edge inference model to obtain the predicted yield per acre value. The corresponding record is found in the partition calibration parameter table according to the tobacco variety label and the planting plot label. The partition residual scale is multiplied by the optimal temperature scaling factor and then by the conformal threshold to obtain the calibration radius.