Training method and estimation method of crop key yield parameter integrated estimation network based on feature alignment
By using a feature-aligned integrated estimation network for key crop yield parameters, the problems of high cost and low accuracy in crop yield estimation are solved, achieving low-cost, high-efficiency, and high-precision yield estimation, and improving feature utilization and network generalization ability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HARBIN ENG UNIV
- Filing Date
- 2026-03-19
- Publication Date
- 2026-06-23
AI Technical Summary
Existing methods for estimating crop yields are costly and have limited accuracy. Furthermore, the isolation between the height estimation and segmentation tasks leads to insufficient utilization of features.
An integrated estimation network for key crop yield parameters based on feature alignment is adopted. A height estimation network and a target segmentation network share the same encoder-decoder backbone architecture; feature analysis and alignment layer selection identify the layer at which their features are aligned, and the two networks are then jointly trained so that height estimation and segmentation are learned collaboratively.
It achieves low-cost, high-precision crop yield estimation, reduces data acquisition costs and processing complexity, improves the accuracy of height maps and segmentation maps, and enhances the discriminative and generalization capabilities of feature representations.
Figure CN122265833A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the interdisciplinary field of agricultural remote sensing and artificial intelligence, and specifically relates to a training method and an estimation method for a network that estimates key crop yield parameters.
Background Technology
[0002] Precision agriculture is a core direction of modern agricultural development, and the rapid and accurate estimation of crop yields is of paramount importance for food security, agricultural insurance, agricultural trade, and government decision-making. Traditional yield estimation methods mainly rely on manual field surveys, statistical reports, or experience-based models. These methods are time-consuming, labor-intensive, and highly subjective, and they are difficult to scale to wide-area coverage.
[0003] With the development of remote sensing technology, yield estimation using satellite or UAV remote sensing images has become a research hotspot. Existing technologies mainly fall into two categories: one is statistical models based on vegetation indices (such as NDVI), which estimate yield by establishing an empirical relationship between vegetation indices and yield. However, this method struggles to capture the three-dimensional structural information of crops, resulting in limited accuracy. The other category is methods based on deep learning, such as directly using convolutional neural networks to regress yield values, or first segmenting crop areas and then combining other features for estimation. However, these methods have significant shortcomings: direct regression methods have poor interpretability and ignore crop height, a key biomass indicator; while existing segmentation and feature-combined methods often rely on height information from active sensors such as LiDAR, which is costly, or depend on multi-view image reconstruction, leading to complex data processing.
[0004] Significant progress has been made in monocular height estimation, particularly in building height estimation based on single satellite images. The AMFHENet single-image height estimation network can effectively extract height information from a single image. However, directly applying this technique to agriculture presents challenges: crops and buildings differ greatly in texture and structure, and agricultural yield estimation requires both accurate crop region segmentation maps and high-precision height maps, along with their effective fusion. Existing methods typically treat height estimation and segmentation as two independent tasks, neglecting their inherent connection at the feature level. This leads to suboptimal utilization of feature representations, impacting the accuracy of the final yield estimation.
[0005] Therefore, there is an urgent need for a technology that can estimate crop yields at low cost, high efficiency, and high accuracy, fully utilize information from a single remote sensing image, and improve model performance through collaborative learning between tasks.
Summary of the Invention
[0006] This invention aims to address the problems of high cost, limited accuracy, and insufficient utilization of features caused by the isolation of the height estimation and segmentation tasks in existing crop yield estimation methods.
[0007] A training method for an integrated estimation network of key crop yield parameters based on feature alignment includes:
[0008] S1. For a single remote sensing image, train a height estimation network and a target segmentation network independently. The height estimation network is used to regress a height map from the input image. The target segmentation network is used to perform pixel-level classification of crop regions in the input image. The two networks share the same encoder-decoder backbone architecture.
[0009] S2. Feature analysis and alignment layer selection, including:
[0010] S21. Feature extraction: Using validation set images, a single remote sensing image is input into the two pre-trained networks, and the output feature maps of the L corresponding layers in the two networks are extracted, denoted as $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ ($l = 1, \dots, L$), where L is the total number of encoder and decoder layers in the two identically structured networks.
[0011] S22. For each layer l in the two networks, perform the following processing:
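[0012] (a) Perform global average pooling on the feature maps $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ to generate semantic feature vectors; (b) group the semantic feature vectors of all samples in the validation set by segmentation label to obtain the "crop" category; (c) calculate the intra-layer semantic aggregation degree of this layer for the "crop" category, where the semantic aggregation degrees corresponding to $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ are denoted $\mathrm{IntraSim}_{\mathrm{height}}^{l}$ and $\mathrm{IntraSim}_{\mathrm{seg}}^{l}$;
[0013] S23. Calculate the semantic similarity between the two networks at each corresponding layer l: $\mathrm{Similarity}^{l} = \mathrm{IntraSim}_{\mathrm{height}}^{l} \cdot \mathrm{IntraSim}_{\mathrm{seg}}^{l}$, where $\mathrm{IntraSim}_{\mathrm{height}}^{l}$ and $\mathrm{IntraSim}_{\mathrm{seg}}^{l}$ are the intra-layer semantic aggregation degrees of the two encoder-decoder networks at layer l; the layer with the largest $\mathrm{Similarity}^{l}$ is chosen as the alignment layer and denoted $l^{*}$.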
[0014] S3. Jointly train the two networks; the total loss function is $L_{\mathrm{total}} = L_{\mathrm{height}} + L_{\mathrm{seg}} + \alpha L_{\mathrm{align}}$, where $\alpha$ is a balancing hyperparameter; $L_{\mathrm{height}}$ is the loss of the height estimation network; $L_{\mathrm{seg}}$ is the loss of the target segmentation network; and $L_{\mathrm{align}}$ is the feature alignment loss, calculated from the semantic feature vector of the height estimation network at the alignment layer $l^{*}$ and the semantic feature vector of the target segmentation network at the same layer.
[0015] Furthermore, the semantic feature vector generated in step (a) is $v_i^{l} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} F_i^{l}(h, w)$, where i denotes the i-th sample in the batch; $F_i^{l}$ denotes the feature map of the i-th crop sample generated at layer l, taken from $F_{\mathrm{height}}^{l}$ or $F_{\mathrm{seg}}^{l}$; and H and W are the height and width of the corresponding feature map.
[0016] Further, in step (c), the intra-layer semantic aggregation degree $\mathrm{IntraSim}^{l}$ for the "crops" category is calculated as follows:
[0017] $$\mathrm{IntraSim}^{l} = \frac{1}{N_c}\sum_{i=1}^{N_c}\cos\left(v_{c,i}^{l},\ \mu_c^{l}\right)$$
[0018] where $N_c$ is the number of samples in the "crops" category; $v_{c,i}^{l}$ is the semantic feature vector of the i-th crop sample of the c-th crop category at layer l; and $\mu_c^{l}$ is the center vector of the crop category.
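[0019] Furthermore, the feature alignment loss $L_{\mathrm{align}}$ is as follows:
[0020] $$L_{\mathrm{align}} = \frac{1}{B}\sum_{i=1}^{B}\left(1 - \frac{v_{\mathrm{height},i}^{l^{*}} \cdot v_{\mathrm{seg},i}^{l^{*}}}{\lVert v_{\mathrm{height},i}^{l^{*}}\rVert\,\lVert v_{\mathrm{seg},i}^{l^{*}}\rVert}\right)$$
[0021] where B is the batch size; $v_{\mathrm{height},i}^{l^{*}}$ is the semantic feature vector of the alignment layer of the height estimation network; and $v_{\mathrm{seg},i}^{l^{*}}$ is the semantic feature vector of the alignment layer of the target segmentation network.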
[0022] Furthermore, the Kullback-Leibler divergence loss is used as the feature alignment loss $L_{\mathrm{align}}$.
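[0023] Furthermore, the height estimation network loss $L_{\mathrm{height}}$ uses the scale-invariant logarithmic loss:
[0024] $$L_{\mathrm{height}} = \frac{1}{N}\sum_{i=1}^{N} d_i^{2} - \frac{\lambda}{N^{2}}\left(\sum_{i=1}^{N} d_i\right)^{2}$$
[0025] where $d_i = \log \hat{y}_i - \log y_i$; $\hat{y}_i$ is the predicted height value; $y_i$ is the actual height value; N is the total number of valid pixels; and $\lambda$ is the weight parameter.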
[0026] Furthermore, the loss of the target segmentation network $L_{\mathrm{seg}}$ is the binary cross-entropy loss or the Dice loss.
[0027] Furthermore, the height estimation network and the target segmentation network adopt an encoder-decoder based on the AMFHENet network. The encoder adopts the network structure corresponding to the DINO v2 encoder, and the decoder includes N AMHF multi-scale feature fusion modules.
[0028] Furthermore, the decoder includes four AMHF multi-scale feature fusion modules.
[0029] A feature-aligned integrated estimation method for key crop yield parameters is proposed. This method involves training a height estimation network and a target segmentation network based on a training method for a feature-aligned integrated estimation network for key crop yield parameters. The image to be tested is then input into both networks to obtain a height map and a segmentation map. The segmentation map is used to mask the height map, extracting the height values of the crop region. Finally, the pixel area and average height features of the crop region are combined and input into a yield model to estimate the final yield.
[0030] The present invention has the following effects:
[0031] Low cost and high efficiency: Only a single optical remote sensing image is required, eliminating the need for expensive or complex data sources such as lidar and multi-view images, which significantly reduces data acquisition costs and processing complexity.
[0032] High-precision estimation: By using feature alignment technology, the powerful semantic discrimination ability of pixel-level target segmentation network is transferred to the height estimation task, making the obtained height map more accurate in crop areas, thereby improving the accuracy of yield estimation based on height and area.
[0033] Task collaboration and feature enhancement: Breaking away from the traditional model in which height estimation and segmentation are isolated tasks, this approach achieves synergy between the tasks through joint fine-tuning. The feature alignment mechanism enables the network to learn more discriminative shared feature representations.
[0034] Strong generalization and flexibility: The method is insensitive to the specific network architecture (such as the number of layers N in AMFHENet), can adaptively select the optimal alignment layer, adapt to different crop types, growth stages and imaging conditions, and has good generalization ability.
Attached Figure Description
[0035] Figure 1 is a flowchart of the overall process of the integrated estimation method for key crop yield parameters based on feature alignment.
[0036] Figure 2 is a schematic diagram of the AMFHENet network structure, which can be used for height estimation and target segmentation.
[0037] Figure 3 is a schematic diagram illustrating an example of crop yield estimation from a single image.
Detailed Implementation
[0038] The present invention discloses an integrated estimation method for key crop yield parameters based on feature alignment. The core of this method lies in feeding a single acquired remote sensing image into a pre-trained height estimation network and a pixel-level target segmentation network for parallel processing, obtaining initial crop height estimation and segmentation maps. Then, it innovatively introduces feature visualization and alignment techniques to analyze the semantic similarity of features in the intermediate layers of the two networks, selects the optimal alignment layer, and jointly fine-tunes the two networks using a feature alignment loss function. This guides the learning process of the height estimation network with the rich semantic features of the segmentation network. Finally, the optimized height map and segmentation map are fused to accurately estimate crop yield. The following detailed description, in conjunction with specific implementation methods, further illustrates this approach.
[0039] Specific implementation method one: This implementation method is described below with reference to Figure 1 and Figure 2.
[0040] This embodiment presents a training and estimation method for an integrated estimation network of key crop yield parameters based on feature alignment, comprising the following steps:
[0041] S1. Network Pre-training Stage: The height estimation network and the pixel-level object segmentation network are trained independently. The two networks share the same encoder-decoder backbone architecture but have different task output layers. This implementation uses an encoder-decoder based on the AMFHENet network. The encoder adopts the network structure corresponding to the DINO v2 encoder, and the decoder includes N AMHF multi-scale feature fusion modules.
[0042] For a single remote sensing image, two networks are trained separately using the training set.
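[0043] The height estimation network is used to regress a height map from an input image. Its loss function $L_{\mathrm{height}}$ is preferably the scale-invariant logarithmic loss:
[0044] $$L_{\mathrm{height}} = \frac{1}{N}\sum_{i=1}^{N} d_i^{2} - \frac{\lambda}{N^{2}}\left(\sum_{i=1}^{N} d_i\right)^{2}$$
[0045] where $d_i = \log \hat{y}_i - \log y_i$; $\hat{y}_i$ is the predicted height value; $y_i$ is the actual height value; N is the total number of valid pixels; and $\lambda$ is the weight parameter.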
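The scale-invariant logarithmic loss described above can be sketched in plain Python as follows. This is a minimal illustration and not the patent's implementation: the function name `silog_loss`, the default `lam=0.5`, and the rule that a pixel is valid when its true height exceeds `eps` are all assumptions made for the example. Some public implementations additionally take the square root of this quantity.

```python
import math

def silog_loss(pred, target, lam=0.5, eps=1e-6):
    """Scale-invariant log loss over valid pixels (assumed: target > eps)."""
    # d_i = log(pred_i) - log(target_i), computed only for valid pixels
    d = [math.log(max(p, eps)) - math.log(t)
         for p, t in zip(pred, target) if t > eps]
    n = len(d)
    if n == 0:
        return 0.0  # no valid pixels, nothing to penalise
    mean_sq = sum(x * x for x in d) / n   # (1/N) * sum d_i^2
    mean = sum(d) / n                     # (1/N) * sum d_i
    return mean_sq - lam * mean * mean

# Flattened toy height maps (values are illustrative only).
target = [1.0, 2.0, 4.0]
doubled = [2.0, 4.0, 8.0]   # every predicted height is twice the truth
print(silog_loss(target, target))            # perfect prediction
print(silog_loss(doubled, target, lam=1.0))  # global scale error, lam = 1
print(silog_loss(doubled, target, lam=0.5))  # global scale error, lam = 0.5
```

With `lam = 1` a prediction that is wrong only by a constant factor is not penalised at all, while smaller values of `lam` keep part of the penalty for global scale errors.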
[0046] The pixel-level object segmentation network is used to perform pixel-level classification of crop regions in input images. Its loss function $L_{\mathrm{seg}}$ is preferably the binary cross-entropy loss or the Dice loss.
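For reference, the two segmentation losses named above can be written in a few lines. The sketch below operates on flattened lists of per-pixel crop probabilities and 0/1 labels; the function names, the smoothing constant `smooth`, and the clipping constant `eps` are illustrative assumptions, not values given in the source.

```python
import math

def dice_loss(prob, label, smooth=1.0):
    """1 - Dice coefficient between predicted probabilities and 0/1 labels."""
    intersection = sum(p * y for p, y in zip(prob, label))
    denominator = sum(prob) + sum(label)
    # The smoothing term keeps the ratio defined when both masks are empty.
    return 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

def bce_loss(prob, label, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(prob, label):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(prob)

label = [1, 1, 0, 0]
print(dice_loss([1.0, 1.0, 0.0, 0.0], label))  # perfect overlap
print(dice_loss([0.0, 0.0, 1.0, 1.0], label))  # no overlap
print(bce_loss([0.5, 0.5, 0.5, 0.5], label))   # maximally uncertain prediction
```

Dice loss is driven by region overlap and is therefore less sensitive to the ratio of crop to background pixels, whereas binary cross-entropy scores every pixel independently.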
[0047] S2, Feature Analysis and Alignment Layer Selection Stage: This stage aims to find the optimal layer for feature alignment between the two networks.
[0048] S21. Feature extraction: Using validation set images, a single remote sensing image is input into the two pre-trained networks, and the output feature maps of the L corresponding layers in the two networks are extracted, denoted as $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ ($l = 1, \dots, L$), where L is the number of encoder and decoder layers in the two identically structured networks.
[0049] S22. Calculate the intra-layer semantic aggregation degree: for each candidate layer l, that is, each pair of corresponding layers in the two networks that could serve as the alignment layer, perform the following operations:
[0050] (a) Perform global average pooling on the feature map to generate semantic feature vectors:
[0051] $$v_i^{l} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} F_i^{l}(h, w)$$
[0052] where i denotes the i-th sample in the batch; $F_i^{l}$ denotes the feature map of the i-th crop sample generated at layer l, taken from $F_{\mathrm{height}}^{l}$ or $F_{\mathrm{seg}}^{l}$; and H and W are the height and width of the corresponding feature map. Note that, to simplify the notation for samples and the subsequent formulas, $F_i^{l}$ is used here to stand for either $F_{\mathrm{height},i}^{l}$ or $F_{\mathrm{seg},i}^{l}$;
[0053] (b) Based on the segmentation labels, group the semantic feature vectors of all samples in the validation set to obtain the "crops" category.
[0054] (c) Calculate the intra-layer semantic aggregation degree of this layer for the "crops" category:
[0055] $$\mathrm{IntraSim}^{l} = \frac{1}{N_c}\sum_{i=1}^{N_c}\cos\left(v_{c,i}^{l},\ \mu_c^{l}\right)$$
[0056] where $N_c$ is the number of samples in the "crops" category; $v_{c,i}^{l}$ is the semantic feature vector of the i-th crop sample of the c-th crop category at layer l; and $\mu_c^{l}$ is the center vector of the crop category.
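Steps (a) to (c) can be sketched for one layer of one network as follows. This is an illustrative sketch under stated assumptions: feature maps are nested lists shaped channels x H x W, each sample carries a single hypothetical boolean flag `is_crop` derived from its segmentation label (the source does not say how pixel labels are reduced to a sample label), and the class center is taken to be the mean of the crop vectors.

```python
import math

def global_average_pool(feature_map):
    """(a) Average each channel over its H x W grid, giving one value per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_map]

def cosine(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, eps)

def intra_sim(feature_maps, is_crop):
    """(b) keep crop samples, (c) mean cosine similarity to the class center."""
    vectors = [global_average_pool(f)
               for f, keep in zip(feature_maps, is_crop) if keep]
    if not vectors:
        raise ValueError("no crop samples in this batch")
    n, dim = len(vectors), len(vectors[0])
    center = [sum(v[k] for v in vectors) / n for k in range(dim)]  # assumed: mean
    return sum(cosine(v, center) for v in vectors) / n

# Three toy samples, each with 2 channels of size 1 x 2 (values are illustrative).
a = [[[1.0, 3.0]], [[0.0, 0.0]]]   # pools to [2.0, 0.0]
b = [[[4.0, 4.0]], [[0.0, 0.0]]]   # pools to [4.0, 0.0]
c = [[[0.0, 0.0]], [[5.0, 5.0]]]   # pools to [0.0, 5.0]
print(intra_sim([a, b, c], [True, True, False]))  # a and b point the same way
print(intra_sim([a, b, c], [True, True, True]))   # c pulls the score down
```

A value close to 1 means the crop samples form a tight cluster at that layer, which is the property the method uses to rank candidate layers.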
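[0057] S23. Selecting the alignment layer: Calculate the semantic similarity between the two networks at each corresponding layer l:
[0058] $$\mathrm{Similarity}^{l} = \mathrm{IntraSim}_{\mathrm{height}}^{l} \cdot \mathrm{IntraSim}_{\mathrm{seg}}^{l}$$
[0059] where $\mathrm{IntraSim}_{\mathrm{height}}^{l}$ and $\mathrm{IntraSim}_{\mathrm{seg}}^{l}$ are the intra-layer semantic aggregation degrees of the two encoder-decoder networks at layer l.
[0060] The layer with the largest $\mathrm{Similarity}^{l}$ is chosen as the alignment layer, denoted $l^{*}$.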
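The selection rule in S23 amounts to an argmax over per-layer products. A minimal sketch follows, assuming the per-layer aggregation degrees have already been computed; the function name and the numeric values are made up for illustration and are not results from the source.

```python
def select_alignment_layer(intra_height, intra_seg):
    """Return (l_star, similarities) with layers numbered from 1 to L."""
    if len(intra_height) != len(intra_seg):
        raise ValueError("both networks must expose the same number of layers")
    # Similarity^l = IntraSim_height^l * IntraSim_seg^l
    sims = [h * s for h, s in zip(intra_height, intra_seg)]
    best_index = max(range(len(sims)), key=lambda i: sims[i])
    return best_index + 1, sims

# Hypothetical aggregation degrees for L = 8 layers (4 encoder + 4 decoder).
intra_height = [0.60, 0.72, 0.91, 0.85, 0.80, 0.78, 0.70, 0.65]
intra_seg = [0.55, 0.70, 0.93, 0.88, 0.82, 0.75, 0.72, 0.60]
l_star, sims = select_alignment_layer(intra_height, intra_seg)
print(l_star)
```

Because the score is a product, a layer is only selected when the crop features are tightly clustered in both networks, not just in one of them.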
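[0061] S3. Joint network fine-tuning stage: The feature alignment loss is introduced into the total loss function to jointly train the two networks. The total loss function is:
[0062] $$L_{\mathrm{total}} = L_{\mathrm{height}} + L_{\mathrm{seg}} + \alpha L_{\mathrm{align}}$$
[0063] where $\alpha$ is a balancing hyperparameter, $L_{\mathrm{height}}$ is the loss of the height estimation network, and $L_{\mathrm{seg}}$ is the loss of the pixel-level target segmentation network; both task losses are calculated in the same way as in S1.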
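[0064] The feature alignment loss $L_{\mathrm{align}}$ is preferably the cosine distance loss:
[0065] $$L_{\mathrm{align}} = \frac{1}{B}\sum_{i=1}^{B}\left(1 - \frac{v_{\mathrm{height},i}^{l^{*}} \cdot v_{\mathrm{seg},i}^{l^{*}}}{\lVert v_{\mathrm{height},i}^{l^{*}}\rVert\,\lVert v_{\mathrm{seg},i}^{l^{*}}\rVert}\right)$$
[0066] where B is the batch size; $v_{\mathrm{height},i}^{l^{*}}$ is the semantic feature vector of the alignment layer of the height estimation network; and $v_{\mathrm{seg},i}^{l^{*}}$ is the semantic feature vector of the alignment layer of the target segmentation network.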
[0067] This loss guides the feature vectors of the height estimation network to converge toward those of the segmentation network.
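A cosine distance loss of this kind can be sketched as the batch mean of one minus the cosine similarity between paired vectors. The function name and the `eps` guard against zero-length vectors are assumptions for the example; in a real training loop the same computation would be done with a differentiable tensor library.

```python
import math

def cosine_alignment_loss(v_height, v_seg, eps=1e-12):
    """Mean over the batch of 1 - cos(v_height_i, v_seg_i)."""
    total = 0.0
    for u, v in zip(v_height, v_seg):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        total += 1.0 - dot / max(norm, eps)
    return total / len(v_height)

# One batch of B = 2 pooled feature vectors per network (illustrative values).
seg = [[3.0, 4.0], [1.0, 0.0]]
print(cosine_alignment_loss(seg, seg))                          # same directions
print(cosine_alignment_loss([[4.0, -3.0], [0.0, 2.0]], seg))    # orthogonal
print(cosine_alignment_loss([[-3.0, -4.0], [-1.0, 0.0]], seg))  # opposite
```

The loss depends only on direction, so the two networks remain free to use different feature magnitudes at the alignment layer.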
[0068] As an alternative, the Kullback-Leibler divergence loss can also be used as $L_{\mathrm{align}}$.
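The source does not say how the pooled feature vectors are turned into probability distributions for the Kullback-Leibler variant. The sketch below makes two explicit assumptions: each vector is normalised with a softmax over its channels, and the segmentation network provides the target distribution, so the quantity computed is KL(p_seg || p_height) averaged over the batch. All names are hypothetical.

```python
import math

def softmax(v):
    m = max(v)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

def kl_alignment_loss(v_height, v_seg, eps=1e-12):
    """Batch mean of KL(softmax(v_seg_i) || softmax(v_height_i))."""
    total = 0.0
    for h, s in zip(v_height, v_seg):
        q = softmax(h)               # distribution from the height network
        p = softmax(s)               # assumed target: segmentation network
        total += sum(pi * math.log(pi / max(qi, eps))
                     for pi, qi in zip(p, q) if pi > 0.0)
    return total / len(v_height)

seg = [[0.0, 0.0]]                   # softmax gives [0.5, 0.5]
print(kl_alignment_loss(seg, seg))                        # identical vectors
print(kl_alignment_loss([[0.0, math.log(3.0)]], seg))     # q is [0.25, 0.75]
```

Unlike the cosine distance, this variant is asymmetric, so the choice of which network supplies the target distribution changes the gradient the height estimation network receives.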
[0069] S4. Yield Estimation Stage: The image to be tested is input into the fine-tuned network to obtain an optimized height map and segmentation map. The height map is masked using the segmentation map to extract the height values of the crop region. Combining features such as the pixel area and average height of the crop region, the final yield is estimated by inputting it into a predefined yield model (such as a linear regression or machine learning model).
[0070] Therefore, the present invention has the following characteristics:
[0071] Low cost and high efficiency: Only a single optical remote sensing image is required, eliminating the need for expensive or complex data sources such as lidar and multi-view images, which significantly reduces data acquisition costs and processing complexity.
[0072] High-precision estimation: By using feature alignment technology, the powerful semantic discrimination ability of pixel-level target segmentation network is transferred to the height estimation task, making the obtained height map more accurate in crop areas, thereby improving the accuracy of yield estimation based on height and area.
[0073] Task collaboration and feature enhancement: Breaking away from the traditional model in which height estimation and segmentation are isolated tasks, this approach achieves synergy between the tasks through joint fine-tuning. The feature alignment mechanism enables the network to learn more discriminative shared feature representations.
[0074] Strong generalization and flexibility: The method is insensitive to the specific network architecture (such as the number of layers N in AMFHENet), can adaptively select the optimal alignment layer, adapt to different crop types, growth stages and imaging conditions, and has good generalization ability.
[0075] Example: The integrated estimation method for key crop yield parameters based on feature alignment described in this example has the following specific steps:
[0076] Step 1: Data Preparation and Parameter Setting
[0077] Collect satellite or UAV remote sensing images covering the crop growth cycle and prepare corresponding true height maps (obtainable via LiDAR or photogrammetry) and pixel-level crop segmentation label maps. Set network training parameters, such as image cropping size (image_size), batch size (B), maximum height (max_height), and balancing hyperparameters α and λ.
[0078] Step 2, Network Pre-training (S1):
[0079] Build and train two independent networks:
[0080] Height estimation network: Its backbone architecture adopts AMFHENet. The encoder uses DINO v2 and outputs 4 layers of features (features 1 to 4). The decoder consists of 4 AMHF multi-scale feature fusion modules. The network output is a normalized height map, which is multiplied by max_height to obtain the final height map. SiLogLoss is used as the loss function $L_{\mathrm{height}}$ for training.
[0081] Pixel-level object segmentation network: It shares the same AMFHENet backbone architecture as the height estimation network. Only at the end of the network is the number of output channels adjusted to 1 (for binary classification), and a Sigmoid activation function is used to output the probability that each pixel is a crop. The Dice loss function is used as $L_{\mathrm{seg}}$ for training. Both networks are trained using the Adam optimizer until performance converges on the validation set.
[0082] Step 3: Feature Analysis and Alignment Layer Selection (S2):
[0083] Feature Extraction (S21): Randomly sample a batch of images from the validation set and input them into two pre-trained networks. Set hooks in the code to capture the 4-layer features (F_height^1 to F_height^4 and F_seg^1 to F_seg^4) output from the encoders of the two networks, as well as the fused features (F_height^5 to F_height^8 and F_seg^5 to F_seg^8) output from the four AMHF modules of the decoder. A total of L=8 pairs of feature maps are generated.
[0084] Calculate semantic aggregation degree and select alignment layer (S22, S23):
[0085] For each pair of feature maps (e.g., F_height^3 and F_seg^3), their intra-layer semantic aggregation degrees IntraSim_height^3 and IntraSim_seg^3 for the "crops" category are calculated respectively. The calculation process is as described in the invention description: first, GAP is performed on the feature map of each sample to obtain vectors, and then the mean cosine similarity between all vectors of that class and the class center vector is calculated.
[0086] Calculate the semantic similarity of this layer: Similarity^3 = IntraSim_height^3 * IntraSim_seg^3.
[0087] Iterate through all 8 pairs of layers and calculate their respective Similarity^l. Assuming that the calculation finds that the Similarity^3 of the 3rd layer (the deep features of the encoder) is the largest, then select l* = 3 as the core alignment layer.
[0088] Step 4: Network Joint Fine-tuning (S3):
[0089] Construct a joint training framework: connect the backbones (encoder and decoder) of two networks, but retain their respective task-specific output layers.
[0090] Define the total loss function: $L_{\mathrm{total}} = L_{\mathrm{height}} + L_{\mathrm{seg}} + \alpha L_{\mathrm{align}}$, where $L_{\mathrm{align}}$ is the cosine distance loss, which measures the inconsistency between the feature vectors output by the two networks at the 3rd layer ($l^{*} = 3$).
[0091] Training: Fine-tune the entire joint network end-to-end using a small learning rate. During training, the total loss drives the network to align the feature representations of the height estimation network with those of the segmentation network while maintaining the performance of each task.
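How the three terms might be combined during fine-tuning can be sketched as a weighted sum. The exact form of the total loss is not recoverable from the text, so the additive form with a single weight `alpha` on the alignment term is an assumption here, as are the function name and the numeric values.

```python
def total_loss(l_height, l_seg, l_align, alpha=0.1):
    """Assumed form: task losses plus alpha times the alignment loss."""
    return l_height + l_seg + alpha * l_align

# Illustrative per-batch loss values, not measurements from the source.
l_height, l_seg, l_align = 0.20, 0.30, 0.50
print(total_loss(l_height, l_seg, l_align, alpha=0.0))  # alignment switched off
print(total_loss(l_height, l_seg, l_align, alpha=0.1))  # alignment weighted in
```

Setting `alpha` to zero recovers two independently supervised networks, which makes it a convenient baseline when checking whether alignment helps.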
[0092] Step 5, Yield Estimation (S4):
[0093] Inference: The remote sensing image to be estimated is input into the jointly fine-tuned network, and the optimized height map and segmentation probability map are obtained simultaneously.
[0094] Post-processing: Threshold the segmentation probability map (e.g., to 0.5) to obtain a binarized crop mask. Apply this mask to the height map, retaining only the height values of the crop regions.
[0095] Feature extraction and yield calculation: Calculate the total area (Area) and average height (AvgHeight) of pixels within the mask. Use [Area, AvgHeight] as features and input them into a pre-trained simple regression model (e.g., a linear regression model trained using historical data: Yield = w1 * Area + w2 * AvgHeight + b) to calculate the estimated crop yield for the image. Single-image crop yield estimation is shown in Figure 3.
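The post-processing and yield calculation can be sketched end to end in a few lines. This is an illustration only: the function name, the hypothetical `pixel_area` parameter (ground area covered by one pixel), and the coefficients `w1`, `w2`, `b` are assumptions; in practice the coefficients would come from a regression fitted on historical data.

```python
def estimate_yield(height_map, prob_map, w1, w2, b,
                   threshold=0.5, pixel_area=1.0):
    """Mask the height map with the thresholded segmentation, then apply
    Yield = w1 * Area + w2 * AvgHeight + b."""
    crop_pixels = 0
    height_sum = 0.0
    for h_row, p_row in zip(height_map, prob_map):
        for h, p in zip(h_row, p_row):
            if p > threshold:          # pixel classified as crop
                crop_pixels += 1
                height_sum += h
    area = crop_pixels * pixel_area
    avg_height = height_sum / crop_pixels if crop_pixels else 0.0
    return w1 * area + w2 * avg_height + b, area, avg_height

# 2 x 2 toy maps; three pixels exceed the 0.5 threshold.
height_map = [[1.0, 1.5], [0.25, 0.5]]
prob_map = [[0.9, 0.8], [0.1, 0.6]]
yield_est, area, avg_height = estimate_yield(
    height_map, prob_map, w1=2.0, w2=10.0, b=1.0)
print(area, avg_height, yield_est)
```

Returning the area and average height alongside the yield keeps the intermediate features available, so a different regression model can be substituted without repeating the masking step.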
[0096] The above examples of the present invention are merely illustrative of the computational model and process of the present invention, and are not intended to limit the implementation of the present invention. Those skilled in the art will recognize that other variations or modifications can be made based on the above description. It is impossible to exhaustively list all possible implementations here. Any obvious variations or modifications derived from the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A training method for an integrated estimation network of key crop yield parameters based on feature alignment, characterized in that it includes:
S1. For a single remote sensing image, train the height estimation network and the target segmentation network independently; the height estimation network is used to regress a height map from the input image; the target segmentation network is used to perform pixel-level classification of crop regions in the input image; the two networks share the same encoder-decoder backbone architecture.
S2. Feature analysis and alignment layer selection, including:
S21. Feature extraction: Using validation set images, a single remote sensing image is input into the two pre-trained networks, and the output feature maps of the L corresponding layers in the two networks are extracted, denoted as $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ ($l = 1, \dots, L$), where L is the total number of encoder and decoder layers in the two identically structured networks.
S22. For each layer l in the two networks, perform the following processing: (a) perform global average pooling on the feature maps $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ to generate semantic feature vectors; (b) group the semantic feature vectors of all samples in the validation set by segmentation label to obtain the "crop" category; (c) calculate the intra-layer semantic aggregation degree of this layer for the "crop" category, where the semantic aggregation degrees corresponding to $F_{\mathrm{height}}^{l}$ and $F_{\mathrm{seg}}^{l}$ are denoted $\mathrm{IntraSim}_{\mathrm{height}}^{l}$ and $\mathrm{IntraSim}_{\mathrm{seg}}^{l}$;
S23. Calculate the semantic similarity between the two networks at each corresponding layer l: $\mathrm{Similarity}^{l} = \mathrm{IntraSim}_{\mathrm{height}}^{l} \cdot \mathrm{IntraSim}_{\mathrm{seg}}^{l}$, where $\mathrm{IntraSim}_{\mathrm{height}}^{l}$ and $\mathrm{IntraSim}_{\mathrm{seg}}^{l}$ are the intra-layer semantic aggregation degrees of the two encoder-decoder networks at layer l; the layer with the largest $\mathrm{Similarity}^{l}$ is chosen as the alignment layer and denoted $l^{*}$.
S3. Jointly train the two networks; the total loss function is $L_{\mathrm{total}} = L_{\mathrm{height}} + L_{\mathrm{seg}} + \alpha L_{\mathrm{align}}$, where $\alpha$ is a balancing hyperparameter; $L_{\mathrm{height}}$ is the loss of the height estimation network; $L_{\mathrm{seg}}$ is the loss of the target segmentation network; and $L_{\mathrm{align}}$ is the feature alignment loss, calculated from the semantic feature vector of the height estimation network at the alignment layer $l^{*}$ and the semantic feature vector of the target segmentation network at the same layer.
2. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to claim 1, characterized in that the semantic feature vector generated in step (a) is $v_i^{l} = \frac{1}{HW}\sum_{h=1}^{H}\sum_{w=1}^{W} F_i^{l}(h, w)$, where i denotes the i-th sample in the batch; $F_i^{l}$ denotes the feature map of the i-th crop sample generated at layer l, taken from $F_{\mathrm{height}}^{l}$ or $F_{\mathrm{seg}}^{l}$; and H and W are the height and width of the corresponding feature map.
3. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to claim 2, characterized in that, in step (c), the intra-layer semantic aggregation degree for the "crops" category is calculated as follows: $\mathrm{IntraSim}^{l} = \frac{1}{N_c}\sum_{i=1}^{N_c}\cos\left(v_{c,i}^{l},\ \mu_c^{l}\right)$, where $N_c$ is the number of samples in the "crops" category; $v_{c,i}^{l}$ is the semantic feature vector of the i-th crop sample of the c-th crop category at layer l; and $\mu_c^{l}$ is the center vector of the crop category.
4. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to claim 3, characterized in that the feature alignment loss $L_{\mathrm{align}}$ is as follows: $L_{\mathrm{align}} = \frac{1}{B}\sum_{i=1}^{B}\left(1 - \frac{v_{\mathrm{height},i}^{l^{*}} \cdot v_{\mathrm{seg},i}^{l^{*}}}{\lVert v_{\mathrm{height},i}^{l^{*}}\rVert\,\lVert v_{\mathrm{seg},i}^{l^{*}}\rVert}\right)$, where B is the batch size; $v_{\mathrm{height},i}^{l^{*}}$ is the semantic feature vector of the alignment layer of the height estimation network; and $v_{\mathrm{seg},i}^{l^{*}}$ is the semantic feature vector of the alignment layer of the target segmentation network.
5. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to claim 3, characterized in that the Kullback-Leibler divergence loss is used as the feature alignment loss $L_{\mathrm{align}}$.
6. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to any one of claims 4 to 5, characterized in that the height estimation network loss $L_{\mathrm{height}}$ uses the scale-invariant logarithmic loss: $L_{\mathrm{height}} = \frac{1}{N}\sum_{i=1}^{N} d_i^{2} - \frac{\lambda}{N^{2}}\left(\sum_{i=1}^{N} d_i\right)^{2}$, where $d_i = \log \hat{y}_i - \log y_i$; $\hat{y}_i$ is the predicted height value; $y_i$ is the actual height value; N is the total number of valid pixels; and $\lambda$ is the weight parameter.
7. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to claim 6, characterized in that the loss of the target segmentation network $L_{\mathrm{seg}}$ is the binary cross-entropy loss or the Dice loss.
8. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to any one of claims 4 to 5, characterized in that the height estimation network and the target segmentation network adopt an encoder-decoder based on the AMFHENet network; the encoder adopts the network structure corresponding to the DINO v2 encoder, and the decoder includes N AMHF multi-scale feature fusion modules.
9. The training method for an integrated estimation network of key crop yield parameters based on feature alignment according to claim 8, characterized in that, The decoder includes four AMHF multi-scale feature fusion modules.
10. An integrated estimation method for key crop yield parameters based on feature alignment, characterized in that, The training method of the integrated estimation network for key crop yield parameters based on feature alignment according to any one of claims 1 to 9 obtains a height estimation network and a target segmentation network; then the image to be tested is input into the two networks to obtain a height map and a segmentation map; the height map is masked using the segmentation map to extract the height value of the crop region; the pixel area and average height features of the crop region are combined and input into the yield model to estimate the final yield.