Crop type extraction method, apparatus, device, and medium

CN122244694A · Pending · Publication Date: 2026-06-19 · AEROSPACE INFORMATION RES INST CAS +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: AEROSPACE INFORMATION RES INST CAS
Filing Date: 2026-05-22
Publication Date: 2026-06-19

Application Information

Patent Timeline

22 May 2026

Application

19 Jun 2026

Publication

CN122244694A

IPC: G06V20/10; G06V10/62; G06V10/77; G06V10/774; G06V10/80; G06V10/82; G06N3/0455; G06N3/0464; G06N3/084; G06N3/0985

AI Tagging

Application Domain

Character and pattern recognition; Biological models

Abstract

This application relates to the field of artificial intelligence technology and provides a method, apparatus, device, and medium for crop type extraction. The method includes: constructing a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images; and inputting the spatiotemporal data cube into a crop type extraction model to obtain the crop type extraction result output by the model. The crop type extraction model is trained by: performing temporal perturbation enhancement on spatiotemporal cube samples; performing multi-stage feature extraction on the enhanced samples while, in each stage, performing adaptive weight calibration on the features from the spectral, spatial, and temporal dimensions; decoding the encoded features stage by stage; and training based on the prediction results obtained from the decoded features and the pre-labeled results. Through temporal perturbation enhancement, the method enables the model to learn more robust time-series patterns; through the attention mechanism across the three dimensions, it suppresses background noise and focuses on key bands and phenological windows, significantly improving the model's classification accuracy and robustness.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus, device, and medium for crop type extraction.

Background Technology

[0002] Remote sensing technology has advantages such as wide coverage, short revisit cycles, and low cost, making it a core technology for rapid surveys of crop types over large areas. With the development of deep learning technology, schemes that construct temporal features based on multi-temporal remote sensing images and achieve automatic crop identification through classification models have been widely used, but there are still significant technical shortcomings in practical applications.

[0003] The core bottlenecks of existing solutions are concentrated on two levels. First, existing models mainly rely on convolutional layers to implicitly learn fixed time-series patterns, without taking into account the phenological drift that may exist for the same crop in the same region. The mapping between time and spectrum learned by the model becomes invalid under time misalignment, resulting in a significant drop in model performance. Second, existing feature extraction mechanisms usually treat all spectral bands and all observation phases equally. A large number of invalid bands, features from non-critical growth periods, and background noise interfere with the recognition process, limiting the classification accuracy and robustness of the model in complex and variable scenarios.

Summary of the Invention

[0004] This application provides a method, apparatus, device, and medium for crop type extraction, which addresses the limited classification accuracy and robustness of existing models in complex and variable scenarios.

[0005] Firstly, this application provides a method for crop type extraction, including:
constructing a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images;
inputting the spatiotemporal data cube into the crop type extraction model to obtain the crop type extraction result output by the crop type extraction model;
wherein the crop type extraction model is trained in the following way:
performing temporal perturbation enhancement on spatiotemporal cube samples to obtain temporally enhanced samples;
performing multi-stage feature extraction on the temporally enhanced samples, and performing adaptive weight calibration on the features from the spectral dimension, spatial dimension, and temporal dimension in each stage, to obtain the encoded features;
decoding the encoded features in stages to obtain the decoded features;
training the model based on the crop type prediction results obtained from the decoded features and the crop type ground truth labels corresponding to the spatiotemporal cube samples, to obtain the crop type extraction model.

[0006] In one embodiment, the step of performing temporal perturbation enhancement on the spatiotemporal cube sample to obtain temporally enhanced samples includes: perturbing each time phase of the spatiotemporal cube sample to obtain the corresponding perturbed time phases; time-encoding each of the perturbed time phases to obtain the corresponding time features; and embedding each of the time features into the spatiotemporal cube sample to obtain the temporally enhanced sample.

[0007] In one embodiment, the step of perturbing each time phase of the spatiotemporal cube sample to obtain the corresponding perturbed time phases includes: summing each time phase of the spatiotemporal cube sample with a random offset to obtain the corresponding perturbed time phase.

[0008] In one embodiment, the step of perturbing each time phase of the spatiotemporal cube sample to obtain the corresponding perturbed time phases further includes: randomly selecting several time phases to be perturbed from the time phases of the spatiotemporal cube sample; and summing each selected time phase with a random jitter amount to obtain the corresponding perturbed time phase.

[0009] In one embodiment, the adaptive weight calibration of features from the spectral, spatial, and temporal dimensions at each stage includes, for each stage: performing channel attention calculation on the input features of the stage along the spatial and temporal dimensions to obtain three-dimensional channel attention; weighting the input features of the stage with the three-dimensional channel attention to obtain a channel-weighted result; performing spatiotemporal attention calculation on the channel-weighted result along the spectral dimension to obtain three-dimensional spatiotemporal attention; and weighting the channel-weighted result with the three-dimensional spatiotemporal attention to obtain a spatiotemporal weighted result.

[0010] In one embodiment, performing channel attention calculation on the input features of the stage along the spatial and temporal dimensions to obtain three-dimensional channel attention includes: The input features of the aforementioned stage are subjected to global max pooling in both the time and spatial dimensions to obtain the first channel descriptor; the input features of the aforementioned stage are subjected to global average pooling in both the time and spatial dimensions to obtain the second channel descriptor. The first channel descriptor is subjected to nonlinear transformation and feature extraction to obtain the first channel feature; the second channel descriptor is subjected to nonlinear transformation and feature extraction to obtain the second channel feature. The first channel features and the second channel features are fused together and then processed by an activation function to generate a three-dimensional channel attention.

[0011] In one embodiment, performing spatiotemporal attention calculation along the spectral dimension on the channel weighting result to obtain three-dimensional spatiotemporal attention includes: The channel-weighted result is subjected to global max pooling along the channel dimension to obtain the first spatiotemporal descriptor; the channel-weighted result is subjected to global average pooling along the channel dimension to obtain the second spatiotemporal descriptor. The first spatiotemporal descriptor and the second spatiotemporal descriptor are fused to obtain a spatiotemporal fused descriptor; Spatiotemporal features are extracted from the spatiotemporal fusion descriptor and a three-dimensional spatiotemporal attention is generated through an activation function.

[0012] Secondly, this application also provides a crop type extraction device, comprising:
a construction module, used to construct a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images;
a crop type extraction module, used to input the spatiotemporal data cube into the crop type extraction model and obtain the crop type extraction result output by the crop type extraction model;
wherein the crop type extraction model is trained in the following way:
performing temporal perturbation enhancement on spatiotemporal cube samples to obtain temporally enhanced samples;
performing multi-stage feature extraction on the temporally enhanced samples, and performing adaptive weight calibration on the features from the spectral dimension, spatial dimension, and temporal dimension in each stage, to obtain the encoded features;
decoding the encoded features in stages to obtain the decoded features;
training the model based on the crop type prediction results obtained from the decoded features and the crop type ground truth labels corresponding to the spatiotemporal cube samples, to obtain the crop type extraction model.

[0013] Thirdly, this application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the crop type extraction methods described above.

[0014] Fourthly, this application also provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of any of the crop type extraction methods described above.

[0015] The crop type extraction method, apparatus, device, and medium provided in this application simulate phenological drift by introducing temporal perturbation enhancement during the model training stage, enabling the model to learn more robust time-series patterns and effectively alleviating the performance degradation caused by time misalignment. At the same time, attention mechanisms across three dimensions (spectral, spatial, and temporal) are introduced in multi-stage feature extraction, which can suppress background noise and focus on key bands and key phenological windows, thereby significantly improving the classification accuracy and robustness of the model in complex and variable scenarios.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is the first flowchart of the crop type extraction method provided in this application.

[0018] Figure 2 is the second flowchart of the crop type extraction method provided in this application.

[0019] Figure 3 is a schematic diagram of the network structure of the crop type extraction model provided in this application.

[0020] Figure 4 is a schematic diagram of the structure of the convolutional attention module provided in this application.

[0021] Figure 5 is a schematic diagram of the channel attention module provided in this application.

[0022] Figure 6 is a schematic diagram of the spatiotemporal attention module provided in this application.

[0023] Figure 7 is a box plot comparing the performance evaluation indicators of the various models provided in this application.

[0024] Figure 8 is a diagram comparing the classification results of the various models provided in this application.

[0025] Figure 9 is a schematic diagram of the crop type extraction device provided in this application.

[0026] Figure 10 is a schematic diagram of the structure of the electronic device provided in this application.

Detailed Implementation

[0027] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0028] The terms "first," "second," etc., used in this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein.

[0029] The crop type extraction method, apparatus, device, and medium provided in this application are described below with reference to Figures 1-10.

[0030] The crop type extraction method provided in this application embodiment can be implemented based on a crop type extraction device. Therefore, this application embodiment uses a crop type extraction device as the execution subject to describe the crop type extraction method.

[0031] Refer to Figure 1 and Figure 2. Figure 1 is the first flowchart of the crop type extraction method provided in this application, and Figure 2 is the second flowchart of the crop type extraction method provided in this application.

[0032] As shown in Figure 1, the crop type extraction method includes the following steps:
Step 101: Construct a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images;
Step 102: Input the spatiotemporal data cube into the crop type extraction model to obtain the crop type extraction result output by the crop type extraction model;
wherein the crop type extraction model is trained in the following way:
performing temporal perturbation enhancement on spatiotemporal cube samples to obtain temporally enhanced samples;
performing multi-stage feature extraction on the temporally enhanced samples, and performing adaptive weight calibration on the features from the spectral dimension, spatial dimension, and temporal dimension in each stage, to obtain the encoded features;
decoding the encoded features in stages to obtain the decoded features;
training the model based on the crop type prediction results obtained from the decoded features and the crop type ground truth labels corresponding to the spatiotemporal cube samples, to obtain the crop type extraction model.

[0033] Specifically, compared to traditional manual field surveys and mapping of crop types, remote sensing imagery can quickly cover a large area of planting without the need for door-to-door visits and surveys. This not only significantly reduces the manpower and material resources required for field surveys but also shortens the information collection cycle, making it more efficient in scenarios such as agricultural statistics and yield forecasting.

[0034] Multi-temporal, multi-band remote sensing imagery can capture the unique phenological changes of different crops throughout their entire growth cycle. For example, spring wheat is sown in spring and harvested in summer, while winter wheat is sown in autumn and harvested the following summer. Different crops have different growth cycle nodes, and their spectral characteristics will show differentiated changes in remote sensing imagery at different temporal phases as the crops grow. At the same time, multi-band information can also capture differences in vegetation structure, leaf chlorophyll content, and surface moisture content of different crops, helping to distinguish crop types with similar spectral characteristics but different phenological patterns. This solves the problem of crop mixing that easily occurs in single-temporal remote sensing imagery, ultimately resulting in more accurate crop type extraction results.

[0035] Multi-temporal, multi-band remote sensing images can completely record the spectral variation patterns of different crops throughout their entire growth cycle. By organizing these time-series multispectral information into a unified-dimensional spatiotemporal data cube, the spatial location information of the images, the spectral information of different bands, and the phenological characteristics that change over time can be preserved simultaneously. This provides a complete and continuous data foundation for subsequent models to extract crop types, helping models to accurately capture the unique growth patterns of different crops and fundamentally improve the accuracy of classification and extraction.

[0036] In the actual inference process, multi-temporal, multi-band remote sensing images covering the study area are acquired. These images span T time phases (e.g., T=12) covering the complete crop growth cycle, with each phase containing C key bands (e.g., C=10). The acquired images can be preprocessed, for example through spatial registration and cloud/shadow masking, and invalid values can be filled with 0 to ensure accurate feature extraction in the subsequent process.

[0037] After preprocessing, multi-band spectral reflectance data from the same spatial location but different temporal phases are stacked in a dimensional order of "spatial row-spatial column-band-temporal phase" to obtain a tensor of dimension H×W×C×T, where H represents the number of vertical spatial rows, W represents the number of horizontal spatial columns, C represents the number of bands, and T represents the number of temporal phases. Then, the spectral reflectance values of the four-dimensional spatiotemporal tensor are normalized, compressing them to the range of 0 to 1 to eliminate the numerical fluctuations caused by dimensional differences between different bands. Through data stacking and normalization, a spatiotemporal data cube that meets the model input requirements is obtained.
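As an illustration only, the stacking and normalization described above could be sketched in Python with NumPy as follows. The function name `build_cube`, the per-band min-max scaling, and the synthetic input are assumptions made for this sketch; the application does not specify the exact normalization scheme.

```python
import numpy as np

def build_cube(images, eps=1e-8):
    """Stack T images of shape (H, W, C) into an (H, W, C, T) cube scaled to [0, 1]."""
    cube = np.stack(images, axis=-1).astype(np.float32)   # (H, W, C, T)
    cube = np.nan_to_num(cube, nan=0.0)                    # invalid values are filled with 0
    # Assumed scheme: min-max scaling computed separately for each band.
    lo = cube.min(axis=(0, 1, 3), keepdims=True)
    hi = cube.max(axis=(0, 1, 3), keepdims=True)
    return (cube - lo) / (hi - lo + eps)

# Synthetic example: T=12 time phases, C=10 bands, 64x64 pixels.
rng = np.random.default_rng(0)
images = [rng.uniform(0, 10000, size=(64, 64, 10)) for _ in range(12)]
cube = build_cube(images)
print(cube.shape)  # (64, 64, 10, 12)
```

Scaling each band separately is one way to remove the magnitude differences between bands mentioned above; a global or percentile-based scaling would fit the description equally well.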

[0038] Each spatial unit in the spatiotemporal data cube corresponds to a pixel on the remote sensing image. Each pixel contains complete spectral information of all bands at that location at all observation times. It can simultaneously carry the spatial distribution characteristics of crop growth, the spectral reflectance characteristics of different bands, and the phenological characteristics that change with the growing season.

[0039] It's important to note that the size of the spatiotemporal data cube is typically influenced by the spatial extent of the study region. Larger study regions generally result in larger values for H and W. In practical applications, to optimize the computational efficiency of subsequent deep learning models, large-scale spatiotemporal data cubes are often divided into multiple uniformly sized (e.g., 512 pixels) tiles using a sliding window strategy, with each tile having a fixed spatial window size. This process can be selectively performed based on the size of the spatiotemporal data cube.
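A minimal sketch of such a sliding window split is given below, assuming an (H, W, C, T) cube held as a NumPy array. The helper names `tile_origins` and `split_into_tiles`, the stride value, and the choice to place the last tile flush with the image border are assumptions of this sketch rather than details given in this application.

```python
import numpy as np

def tile_origins(size, tile, stride):
    """Offsets along one axis such that the tiles cover the range [0, size)."""
    if size <= tile:
        return [0]
    origins = list(range(0, size - tile + 1, stride))
    if origins[-1] != size - tile:
        origins.append(size - tile)   # last tile ends exactly at the border
    return origins

def split_into_tiles(cube, tile=512, stride=256):
    """Yield (row, col, patch) for an (H, W, C, T) cube."""
    H, W = cube.shape[:2]
    for r in tile_origins(H, tile, stride):
        for c in tile_origins(W, tile, stride):
            yield r, c, cube[r:r + tile, c:c + tile]

# Small synthetic example so that the sketch runs quickly.
cube = np.zeros((100, 70, 2, 3), dtype=np.float32)
tiles = list(split_into_tiles(cube, tile=32, stride=16))
print(len(tiles))  # 24
```

A stride smaller than the tile size produces overlapping tiles, which is what makes averaging of overlapping predictions possible at stitching time.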

[0040] A spatiotemporal data cube of the appropriate model input size is input into a pre-trained crop type extraction model to extract deep features that combine spatial correlation, spectral differences, and temporal phenological patterns. Crop type prediction is performed based on these deep features, and the final output covers the crop types within the study area.

[0041] Figure 3 is a schematic diagram of the network structure of the crop type extraction model provided in this application. The crop type extraction model is an end-to-end spatiotemporal deep learning network model (TABS-Net) that adopts an encoder-decoder architecture.

[0042] During the model training phase, firstly, multi-temporal, multi-band remote sensing image samples (such as Sentinel-2 images) covering the target area and corresponding crop type ground truth labels are acquired.

[0043] The multi-band remote sensing image samples from each temporal phase are preprocessed, including spatial registration and cloud/shadow masking, with invalid values filled with 0. The samples span T temporal phases covering the complete crop growth cycle, with each phase containing C key bands. After preprocessing, the multi-band remote sensing image samples from each temporal phase are stacked into a tensor of dimension H×W×C×T, and the spectral reflectance data is normalized to obtain spatiotemporal cube samples. If necessary, a sliding window strategy is used to crop the large-size spatiotemporal cube samples and the corresponding crop type ground truth label maps into fixed-size tiles. The crop type ground truth label maps are binarized (if there is only one crop type) or multi-class encoded (if there are multiple crop types), forming training sample pairs with the spatiotemporal cube samples.

[0044] Phenological drift refers to the phenomenon where the calendar dates corresponding to the same growth stages (e.g., heading and maturity) of the same crop in the same region become systematically misaligned due to interannual differences in sowing date and in climatic factors such as temperature and precipitation. When such a shift occurs, the "time-spectrum" mapping learned by the model becomes invalid because of the time misalignment, resulting in a significant decrease in model performance.

[0045] Existing technologies implicitly assume strict timeline alignment, lack adaptability to phenological drift, and fail to introduce time augmentation strategies and explicit time encoding to correct for nonlinear growth cycle shifts caused by climate differences, resulting in a significant decrease in accuracy. To address this issue, this application innovatively introduces a time perturbation augmentation layer to augment spatiotemporal cube samples by randomly perturbing their timelines to generate new enhanced temporal feature samples. This simulates phenological drift that may occur in real-world scenarios, allowing the model to learn the feature change patterns under phenological drift in advance and enhancing its adaptability to phenological drift. This augmentation method does not require additional labeled data; it expands the diversity of training samples simply by transforming the time dimension of existing samples, significantly reducing the model's dependence on large-scale, accurately labeled samples, while effectively improving the model's generalization ability across different interannual datasets.

[0046] After obtaining temporally enhanced samples, an encoder with multiple downsampling stages is used to complete feature extraction. This process progressively extracts high-dimensional semantic features and compresses spatial resolution. The features extracted in the last stage are used as the output encoded features. Taking the encoder in the model architecture of Figure 3 as an example, it includes four stages, and each stage contains a composite feature extraction module (represented in Figure 3 by the dark blue cuboid on the left), a convolutional attention module (represented in Figure 3 as an orange cuboid), a regularization processing module (represented in Figure 3 by a yellow cuboid), and a time compression module (represented in Figure 3 as a green cuboid). In each stage, the features pass sequentially through the composite feature extraction module, convolutional attention module, regularization module, and time compression module, and the stage finally outputs the features it has extracted.

[0047] The composite feature extraction module extracts spatiotemporal features through 3D convolutional layers (3D Conv), and accelerates convergence and introduces nonlinearity by combining batch normalization (BN) and activation functions (such as ReLU). The convolutional attention module (Convolutional Block Attention Module, CBAM) calculates channel attention and spatiotemporal attention based on the input features, achieving adaptive weight calibration of features from the spectral, spatial, and temporal dimensions. This allows the model to learn to focus on the most important feature channels and spatiotemporal positions, suppressing irrelevant background noise and preventing the model from being biased by a large number of irrelevant features during training, thus making feature extraction more targeted. The regularization module (DropBlock3D) employs a regularization method specifically designed for 3D convolution, preventing overfitting by randomly discarding contiguous 3D block regions. The time compression module gradually compresses or fuses temporal information as the network deepens, thereby reducing the size of the temporal dimension while retaining key feature information from different temporal phases, which reduces the subsequent computational load. After the above processing, the features obtained in each stage are downsampled by max pooling (Max-Pooling). The max-pooling operation halves the spatial resolution of the feature map, while the number of channels in the feature map (the numbers in Figure 3, such as 32, 64, 128) gradually increases, so that increasingly abstract high-level features are extracted.
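The two-step weighting performed by the convolutional attention module (channel attention from max-pooled and average-pooled descriptors, followed by spatiotemporal attention over the channel-weighted result) can be sketched in plain NumPy as follows. This is an illustrative sketch under stated assumptions, not the network of Figure 3: the (C, T, H, W) layout, the shared two-layer transform, fusion by addition, the sigmoid activation, the 3×3×3 kernel, and all function and parameter names are assumptions introduced here.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, T, H, W). Returns channel weights of shape (C, 1, 1, 1)."""
    d_max = x.max(axis=(1, 2, 3))    # first channel descriptor (global max pooling)
    d_avg = x.mean(axis=(1, 2, 3))   # second channel descriptor (global average pooling)

    def transform(d):
        # Assumed nonlinear transformation: shared two-layer perceptron with ReLU.
        return w2 @ np.maximum(w1 @ d, 0.0)

    # Assumed fusion: element-wise sum, followed by a sigmoid activation.
    return sigmoid(transform(d_max) + transform(d_avg)).reshape(-1, 1, 1, 1)

def spatiotemporal_attention(x, kernel):
    """x: (C, T, H, W); kernel: (2, k, k, k), k odd. Returns weights of shape (1, T, H, W)."""
    desc = np.stack([x.max(axis=0), x.mean(axis=0)])        # fused descriptor, (2, T, H, W)
    p = kernel.shape[1] // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p), (p, p)))  # zero padding keeps T, H, W
    windows = sliding_window_view(padded, kernel.shape[1:], axis=(1, 2, 3))
    response = np.einsum("cthwijk,cijk->thw", windows, kernel)  # 3D convolution
    return sigmoid(response)[None]

def attention_block(x, w1, w2, kernel):
    x = x * channel_attention(x, w1, w2)             # channel-weighted result
    return x * spatiotemporal_attention(x, kernel)   # spatiotemporal weighted result

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 6, 6))             # C=8, T=4, H=W=6
out = attention_block(features,
                      w1=rng.normal(size=(2, 8)),
                      w2=rng.normal(size=(8, 2)),
                      kernel=rng.normal(size=(2, 3, 3, 3)))
print(out.shape)  # (8, 4, 6, 6)
```

Because both attention maps lie in (0, 1), the block can only attenuate features, which matches the described role of suppressing background noise while leaving informative channels and spatiotemporal positions comparatively strong.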

[0048] When the stage with the most abstract features and the lowest spatial resolution is reached, a bottleneck layer is introduced. In Figure 3, the bottleneck layer is represented as the bottom cuboid of the U-shaped architecture. It adopts a parallel structure of MaxPool (max pooling) and AvgPool (average pooling): the features are processed by max pooling and average pooling separately and then fused. This captures the most salient features (max pooling) and global background features (average pooling) at the same time, providing rich global information for subsequent feature reconstruction.

[0049] After obtaining the encoded features, a decoder comprising multiple upsampling stages is used to reconstruct the features, progressively restoring the abstract features to their original spatial resolution. The features reconstructed in the final stage are then used as the output decoded features. Within each stage, a skip connection module (represented in Figure 3 as a gray cuboid) concatenates the shallow detail features of the corresponding encoder stage with the deep semantic features obtained by upsampling in the decoder along the channel dimension (Concat). This effectively mitigates the loss of spatial detail during upsampling, ensuring that the generated crop plot boundaries are clear and complete. After skip connection fusion, a composite feature reconstruction module (represented in Figure 3 by the light blue cuboid on the right) applies 3D convolutional layers, batch normalization (BN), and an activation function to fuse information from different depths, and gradually reduces the number of channels (e.g., 160→32 in Figure 3). Then, the features output by the composite feature reconstruction module are upsampled: their spatial size is enlarged by 3D transposed convolution (i.e., deconvolution). The upsampled features are used as the input for the next stage of feature reconstruction.

[0050] After obtaining the decoded features, a convolutional layer maps the decoded features to the crop type probability distribution of the corresponding pixel, and the number of feature channels maps to the number of categories, thus obtaining the crop type prediction result for each spatial pixel. The output is a 2D classification map with the same spatial resolution as the input, where each pixel is assigned a category label (which can be represented by different colors), thereby completing the final semantic classification task.

[0051] Furthermore, model training is performed based on the crop type prediction results and the ground truth crop type labels. A loss function is used to calculate the model's prediction loss. The loss function compares the predicted category of each pixel with the true category, and the average of the losses of all pixels is accumulated to obtain the overall loss value of the current batch of training samples. Subsequently, the weight parameters of each convolutional kernel in the network are updated through backpropagation and stochastic gradient descent to continuously reduce the deviation between the prediction results and the true labels. After iterative training, model training stops when the model converges or reaches the preset training termination condition, finally yielding a usable crop type extraction model.

[0052] To address the learning problems of imbalanced positive and negative samples and difficult-to-distinguish samples (such as small plots and boundaries) in remote sensing imagery, a weighted mixture loss function is used during model training:

$$L_{total} = \lambda_1 L_{overlap} + \lambda_2 L_{hard}$$

where $\lambda_1$ and $\lambda_2$ denote weights; $L_{overlap}$ focuses on optimizing the overall overlap of the plots to ensure continuity within the plots; $L_{hard}$ focuses on identifying difficult-to-separate samples to improve boundary segmentation accuracy; and $L_{total}$ is the total loss function.
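The two loss terms are not named in the text above. Purely as an assumption for illustration, the sketch below takes the overlap-oriented term to be a Dice loss and the hard-sample term to be a focal loss, two common choices with these properties. The function names, the equal default weights, and the focusing parameter `gamma` are likewise hypothetical.

```python
import numpy as np

def dice_loss(prob, onehot, eps=1e-6):
    """prob, onehot: (N, K) per-pixel class probabilities and one-hot labels."""
    inter = (prob * onehot).sum(axis=0)
    denom = prob.sum(axis=0) + onehot.sum(axis=0)
    return float(1.0 - ((2.0 * inter + eps) / (denom + eps)).mean())

def focal_loss(prob, onehot, gamma=2.0, eps=1e-6):
    """Down-weights well-classified pixels so that hard pixels dominate the loss."""
    p_true = np.clip((prob * onehot).sum(axis=1), eps, 1.0)
    return float((-((1.0 - p_true) ** gamma) * np.log(p_true)).mean())

def mixed_loss(prob, onehot, w_overlap=0.5, w_hard=0.5):
    """Weighted sum of the two terms (the weights are assumed values)."""
    return w_overlap * dice_loss(prob, onehot) + w_hard * focal_loss(prob, onehot)

# Four pixels, two classes.
labels = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=np.float64)
good = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.3, 0.7]])
bad = 1.0 - good
print(mixed_loss(good, labels) < mixed_loss(bad, labels))  # True
```

In a training loop the same weighted sum would be computed on the network output with an automatic-differentiation framework; NumPy is used here only to keep the arithmetic visible.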

[0053] The crop type extraction model obtained through the above training process can accurately extract crop types within the study area. If the spatiotemporal data cube is divided into multiple fixed-size tiles before being input into the model for prediction, the model makes predictions for each tile individually to obtain the crop type extraction result for each tile. These per-tile results must then be smoothly stitched together to obtain the crop type extraction result for the entire study area.

[0054] Two blank images of the same size as the original image are created: a probability accumulation image (to store the accumulated predicted probabilities of each patch) and a count image (to record the number of times each pixel is covered by a patch). Then, the predicted probability output of each input patch is read piece by piece, and the predicted probability for each pixel is added to the corresponding position in the probability accumulation image. Simultaneously, the count is incremented by 1 at the corresponding pixel position in the count image. After all patches have been predicted, the value at each position in the probability accumulation image is divided by the value at the corresponding position in the count image to obtain the final average predicted probability for each pixel. The category with the highest average predicted probability is then selected as the final crop type for that pixel. This effectively avoids obvious segmentation gaps at patch stitching points, ensuring the spatial continuity of the extraction results.
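The accumulate-and-count stitching described above can be sketched as follows. The function name `stitch` and the `(row, col, prob)` patch format are assumptions of this sketch.

```python
import numpy as np

def stitch(patches, height, width, num_classes):
    """patches: iterable of (row, col, prob), with prob of shape (h, w, num_classes)."""
    prob_sum = np.zeros((height, width, num_classes))   # probability accumulation image
    count = np.zeros((height, width, 1))                # count image
    for r, c, prob in patches:
        h, w = prob.shape[:2]
        prob_sum[r:r + h, c:c + w] += prob
        count[r:r + h, c:c + w] += 1
    mean_prob = prob_sum / np.maximum(count, 1)         # guard for pixels no patch covers
    return mean_prob.argmax(axis=-1)                    # class with highest average probability

# Two 4x4 patches overlapping in columns 2-3 of a 4x6 image, two classes.
left = np.tile([0.9, 0.1], (4, 4, 1))    # favours class 0
right = np.tile([0.2, 0.8], (4, 4, 1))   # favours class 1
labels = stitch([(0, 0, left), (0, 2, right)], height=4, width=6, num_classes=2)
print(labels[0])  # [0 0 0 0 1 1]
```

In the overlap, the averaged probabilities are 0.55 for class 0 and 0.45 for class 1, so the overlapping columns take class 0 rather than switching abruptly at a patch edge.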

[0055] The crop type extraction method provided in this application simulates phenological drift by introducing temporal perturbation enhancement during the model training stage, enabling the model to learn more robust time-series patterns and effectively alleviating the performance degradation caused by time misalignment. At the same time, it introduces attention mechanisms over three dimensions (spectral, spatial, and temporal) in multi-stage feature extraction, which suppress background noise and focus on key bands and key phenological windows, thereby significantly improving the classification accuracy and robustness of the model in complex and variable scenarios.

[0056] In one embodiment, the step of performing temporal perturbation enhancement on the spatiotemporal cube sample to obtain temporally enhanced samples includes: perturbing each time phase of the spatiotemporal cube sample to obtain the corresponding perturbed time phases; time-encoding each perturbed time phase to obtain the corresponding temporal features; and embedding each temporal feature into the spatiotemporal cube sample to obtain the temporally enhanced sample.

[0057] Specifically, to address temporal misalignment in crop growth cycles, explicit temporal information is introduced. The Day of Year (DOY) is the ordinal position, within its year, of the acquisition date of a satellite image, and ranges from 1 to 365 or 366. It is the most intuitive feature for describing phenological information.

[0058] To further enhance the model's ability to generalize to irregular time sampling, a Temporal Jitter Augmentation Layer (TJA Layer) is used during the training phase to add small-range random perturbations to the DOY value of each time phase. This prevents the model from overfitting to phenological features on fixed dates and improves the model's robustness to time variations.

[0059] Then, the DOY values of the perturbed time phases are positionally encoded by the Day of Year Encoder (DOY Encoder), which injects information such as "which day of the year" into the data. This helps the model understand the specific point in time of each image frame, which is very important for classification based on crop growth cycles.

[0060] Position encoding can be achieved by normalizing the DOY value, calculated as follows: $$\tau_i = \frac{\mathrm{DOY}'_i}{365}$$ where $\tau_i$ denotes the temporal feature of the $i$-th perturbed time phase after DOY normalization; $\mathrm{DOY}'_i$ denotes the DOY value of the $i$-th perturbed time phase; and the denominator 365 can be replaced by 366 (for example, in a leap year), depending on the actual situation.

[0061] The temporal feature of each perturbed time phase is treated as an independent band and directly concatenated onto the channel dimension of the original input tensor to form the expanded input tensor (i.e., the temporally enhanced sample), where H is the number of spatial rows, W is the number of spatial columns, C+1 is the number of bands, and T is the number of time phases. This enables the model to perceive the relative seasonal position of each time phase.
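The DOY normalization and band concatenation can be sketched as follows. The axis order (T, C, H, W) is an assumption made for this example, since the text does not fix the layout of the tensor, and the function name `append_doy_band` is hypothetical.

```python
import numpy as np

def append_doy_band(cube, doy, days_in_year=365):
    """Append the normalized DOY of each time phase as an extra constant band.

    cube: array shaped (T, C, H, W) (assumed layout).
    doy:  length-T sequence of (perturbed) DOY values.
    Returns an array shaped (T, C + 1, H, W).
    """
    t, _, h, w = cube.shape
    feat = np.asarray(doy, dtype=cube.dtype) / days_in_year   # one scalar per phase
    band = np.broadcast_to(feat[:, None, None, None], (t, 1, h, w))
    return np.concatenate([cube, band], axis=1)

cube = np.zeros((3, 4, 2, 2), dtype=np.float32)   # T=3 phases, C=4 bands, 2x2 pixels
out = append_doy_band(cube, [73, 146, 219])
print(out.shape, out[:, -1, 0, 0])
```

Every pixel of a given time phase receives the same value in the added band, so the original C bands pass through unchanged.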

[0062] The embodiments of this application employ a time alignment enhancement mechanism, which adds the day of the year to the image channels as an explicit positional encoding. By simulating phenological drift and irregular image time intervals, the model is forced to learn relative phenological characteristics, thereby significantly reducing performance fluctuations and improving boundary consistency.

[0063] In one embodiment, the step of perturbing each phase of the spatiotemporal cube sample to obtain corresponding perturbed phases includes: The corresponding perturbation phases are obtained by summing each phase of the spatiotemporal cube sample with the random offset.

[0064] Specifically, this embodiment adopts an overall translational jitter strategy.

[0065] First, a preset offset interval is defined, for example ±5~10 days, and an offset is randomly sampled within this interval. This random offset is directly added to the original DOY values of all time phases, so that a global time shift is applied across all time phases. Compared to sampling an offset individually for each time phase, this global shift only requires generating a single random offset, which simplifies the calculation. It also preserves the original time intervals between different time phases to the greatest extent possible, maintaining the relative interval pattern of the time series and better reflecting how phenological drift behaves in real-world scenarios. Phenological drift typically shifts the time nodes as a whole rather than disrupting the intervals between them; this perturbation method therefore more closely resembles the time errors that may occur in practical applications, allowing the model to learn more robust phenological features.

[0066] The overall translational jitter is calculated as follows: $$\mathrm{DOY}'_i = \mathrm{DOY}_i + \Delta$$ where $\mathrm{DOY}'_i$ denotes the DOY value of the $i$-th perturbed time phase; $\mathrm{DOY}_i$ denotes the DOY value of the $i$-th time phase; and $\Delta$ denotes the random offset.
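The overall translational jitter can be sketched in a few lines. The sampling range `max_shift` and the use of a uniform integer offset are assumptions for illustration; the text only specifies that a single offset is drawn from a preset interval and added to every time phase. How values that leave the 1 to 365 range are handled is not specified, and the sketch leaves them as-is.

```python
import random

def global_shift(doys, max_shift=10, rng=random):
    """Add one shared random offset (in days) to every time phase."""
    offset = rng.randint(-max_shift, max_shift)   # a single draw for the whole series
    return [d + offset for d in doys], offset

rng = random.Random(0)
doys = [120, 150, 180, 210]
shifted, offset = global_shift(doys, max_shift=10, rng=rng)
print(offset, shifted)
```

Because one offset is shared, the intervals between consecutive time phases are identical before and after the shift.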

[0067] This application embodiment simulates the overall phenological shift problem that easily occurs in real-world scenarios by perturbing the overall time translation. This allows the model to learn more generalized phenological features and avoids the bias caused by the model overfitting the fixed phenological time nodes in the training data.

[0068] In one embodiment, the step of perturbing each phase of the spatiotemporal cube sample to obtain corresponding perturbed phases further includes: Several time phases to be disturbed are randomly selected from each time phase of the spatiotemporal cube sample. The disturbance time phases are summed with the random jitter amount to obtain the corresponding disturbance time phases.

[0069] Specifically, this embodiment employs a local timing perturbation strategy.

[0070] Unlike overall translational jitter, which addresses scenarios where the entire time series shifts, local disturbances target scenarios where acquisition errors occur in a single or a few time phases during the time series acquisition process. In actual remote sensing satellite image acquisition, due to the influence of clouds and fog, it often happens that the acquisition time of some time phases deviates from the planned time, while other time phases are still acquired according to the original planned time. This local time error does not change the overall phenological nodes of the entire time series, but it will cause time deviations at some nodes.

[0071] Therefore, the local temporal perturbation strategy in this embodiment does not add the same random offset to all time phases. Instead, it first randomly selects several time phases to be perturbed from all time phases of the time series according to a preset probability. The number of selected time phases can be adjusted dynamically according to the length of the time series. After the selection is completed, a small random jitter amount within a preset range, such as ±2~5 days, is generated for each selected time phase. The DOY value of each selected time phase is added to its corresponding random jitter amount to obtain the perturbed time phase after local temporal perturbation; the DOY values of unselected time phases remain unchanged.

[0072] The local temporal perturbation is calculated as follows for each selected time phase: $$\mathrm{DOY}'_i = \mathrm{DOY}_i + \delta_i$$ where $\mathrm{DOY}'_i$ denotes the DOY value of the $i$-th perturbed time phase; $\mathrm{DOY}_i$ denotes the DOY value of the $i$-th time phase; and $\delta_i$ denotes the corresponding random jitter amount.
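The local temporal perturbation can be sketched as follows. The selection probability `select_prob`, the jitter range `max_jitter`, and the uniform integer sampling are hypothetical choices for illustration; the text only requires that a subset of time phases is selected by a preset probability and that each selected phase receives its own small jitter.

```python
import random

def local_jitter(doys, select_prob=0.3, max_jitter=5, rng=random):
    """Jitter a random subset of time phases; unselected phases keep their DOY."""
    out = []
    for d in doys:
        if rng.random() < select_prob:                           # phase is selected
            out.append(d + rng.randint(-max_jitter, max_jitter))  # independent jitter
        else:
            out.append(d)
    return out

doys = [120, 150, 180, 210]
jittered = local_jitter(doys, select_prob=0.5, max_jitter=5, rng=random.Random(1))
print(jittered)
```

Unlike the global shift, each selected phase draws its own jitter, so the intervals between time phases are no longer preserved exactly.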

[0073] This embodiment of the application uses local temporal perturbation to simulate the situation in which time errors occur in individual time phases in real acquisition scenarios. This enables the model to learn more generalized phenological features and avoids the bias caused by the model overfitting the fixed phenological time nodes in the training data.

[0074] In one embodiment, the adaptive weight calibration of features from the spectral, spatial, and temporal dimensions at each stage includes, for each stage: calculating channel attention along the spatial and temporal dimensions for the input features of the stage to obtain three-dimensional channel attention; weighting the input features with the three-dimensional channel attention to obtain a channel-weighted result; calculating spatiotemporal attention along the spectral dimension for the channel-weighted result to obtain three-dimensional spatiotemporal attention; and weighting the channel-weighted result with the three-dimensional spatiotemporal attention to obtain a spatiotemporal weighted result.

[0075] Specifically, the convolutional attention module includes a channel attention module and a spatiotemporal attention module. The channel attention module focuses on spectral dimension feature selection, while the spatiotemporal attention module focuses on feature weight calibration based on spatial location and temporal sequence. The two work together to achieve adaptive weight adjustment in three dimensions.

[0076] Refer to Figure 4, which is a schematic diagram of the structure of the convolutional attention module provided in this application.

[0077] First, channel attention is calculated along the spatial and temporal dimensions of the input features at the current stage to obtain three-dimensional channel attention. 3D channel attention preserves the original three-dimensional structure of the input features, with each channel corresponding to a uniform attention weight, reflecting the contribution of different spectral bands to the current crop type extraction task. Feature spectra that more clearly distinguish different crops are assigned higher weights, while noisy spectra that are less helpful for crop classification are assigned lower weights, achieving the goal of filtering effective features from the spectral dimension. The calculated three-dimensional channel attention is then multiplied channel-by-channel with the original input features to obtain the channel-weighted result after channel dimension calibration.

[0078] Then, spatiotemporal attention is calculated along the spectral dimension using the channel-weighted results to obtain three-dimensional spatiotemporal attention. This three-dimensional spatiotemporal attention retains the independent weights of each channel in the spectral dimension and outputs a unified attention map for both spatial location and temporal series dimensions. This map reflects the contribution of different spatial locations and temporal nodes within the same spectral band to the current crop type extraction task. Higher attention weights are assigned to regions and time periods with clearer crop outlines and more typical phenological characteristics, while lower weights are assigned to invalid sampling points obscured by clouds or fog, or those with non-critical phenological features. This further completes feature calibration from both spatial and temporal dimensions. The calculated three-dimensional spatiotemporal attention is multiplied positionally and temporally with the channel-weighted results to obtain the spatiotemporal weighted result that completes the calibration of the channel, spatial, and temporal dimensions.

[0079] This application embodiment performs adaptive weight calibration on features from the spectral, spatial, and temporal dimensions. This not only preserves the spectral information that is effective for classification tasks, but also enhances the feature contribution of key spatial regions and key phenological periods, providing a more accurate feature foundation for subsequent crop type feature extraction.
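The two-step calibration of this embodiment can be written compactly as follows. The notation is an assumption introduced here for illustration: $F$ is the input feature of a stage with $C$ channels over a $T \times H \times W$ volume, $M_c$ is the three-dimensional channel attention, $M_{st}$ is the three-dimensional spatiotemporal attention, and $\otimes$ is element-wise multiplication with broadcasting.

```latex
F'  = M_c(F) \otimes F,        \qquad M_c(F) \in \mathbb{R}^{C \times 1 \times 1 \times 1}
F'' = M_{st}(F') \otimes F',   \qquad M_{st}(F') \in \mathbb{R}^{1 \times T \times H \times W}
```

Here $F'$ is the channel-weighted result and $F''$ is the spatiotemporal weighted result; the order matters, because the spatiotemporal attention is computed from features that have already been calibrated along the spectral dimension.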

[0080] In one embodiment, performing channel attention calculation on the input features of the stage along the spatial and temporal dimensions to obtain three-dimensional channel attention includes: The input features of the aforementioned stage are subjected to global max pooling in both the time and spatial dimensions to obtain the first channel descriptor; the input features of the aforementioned stage are subjected to global average pooling in both the time and spatial dimensions to obtain the second channel descriptor. The first channel descriptor is subjected to nonlinear transformation and feature extraction to obtain the first channel feature; the second channel descriptor is subjected to nonlinear transformation and feature extraction to obtain the second channel feature. The first channel features and the second channel features are fused together and then processed by an activation function to generate a three-dimensional channel attention.

[0081] Specifically, refer to Figure 5, which is a schematic diagram of the channel attention module provided in this application.

[0082] Global max pooling is performed on the input features along the time and spatial dimensions to obtain the first channel descriptor. Global max pooling can retain the spectral information with the highest contrast and strongest response in the input features, capture the differences in spectral features in different spatial regions and different time nodes, and avoid the typical spectral features being masked by averaging.

[0083] Simultaneously, global average pooling is performed on the input features along both the temporal and spatial dimensions to obtain the second channel descriptor. Global average pooling can aggregate statistical information across the entire spatiotemporal range, fully preserving the overall spectral distribution of all spatial locations and temporal nodes, thus overcoming the deficiency of global max pooling, which only retains the strongest response and loses overall information.

[0084] Furthermore, the first and second channel descriptors are subjected to nonlinear transformations and feature extraction operations through a shared multilayer perceptron (MLP) to extract more abstract channel dependencies, resulting in corresponding first and second channel features. Then, the first and second channel features are added together, fusing the strongest response feature and global statistical features. Finally, the output is mapped to the 0-1 range using a sigmoid activation function, generating three-dimensional channel attention weights that reflect the importance of spectral channels. These three-dimensional channel attention weights are used to automatically enhance key bands that contribute significantly to crop differentiation, such as the red edge and shortwave infrared.
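The channel attention computation described in [0082] to [0084] can be sketched with NumPy. The layout (C, T, H, W), the two-layer ReLU perceptron without bias terms, the hidden width, and the random weights are all assumptions for illustration; a trained model would learn `w1` and `w2`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: features shaped (C, T, H, W). Returns weights shaped (C, 1, 1, 1)."""
    max_desc = x.max(axis=(1, 2, 3))    # first channel descriptor (global max pooling)
    avg_desc = x.mean(axis=(1, 2, 3))   # second channel descriptor (global average pooling)

    def mlp(v):                         # shared perceptron applied to both descriptors
        return w2 @ np.maximum(w1 @ v, 0.0)

    weights = sigmoid(mlp(max_desc) + mlp(avg_desc))   # fuse by addition, map to 0-1
    return weights[:, None, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 5, 5))           # C=8 channels, T=4 phases, 5x5 pixels
w1 = 0.1 * rng.normal(size=(2, 8))          # hidden width 2 is a hypothetical choice
w2 = 0.1 * rng.normal(size=(8, 2))
att = channel_attention(x, w1, w2)
weighted = att * x                          # channel-weighted result
print(att.shape, weighted.shape)
```

Each channel receives a single scalar weight that is broadcast over all time phases and spatial positions.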

[0085] This application embodiment retains the spectral features of the strongest response in the input features and the statistical spectral features of the global range, and then fuses them to generate a three-dimensional channel attention. This can accurately capture the typical spectral responses of key phenological stages and key spatial regions during crop growth, while also taking into account the overall spectral distribution pattern of the entire spatiotemporal range. This allows for better focus on key bands that are effective for classification tasks, ultimately improving the accuracy of crop type extraction.

[0086] In one embodiment, performing spatiotemporal attention calculation along the spectral dimension on the channel weighting result to obtain three-dimensional spatiotemporal attention includes: The channel-weighted result is subjected to global max pooling along the channel dimension to obtain the first spatiotemporal descriptor; the channel-weighted result is subjected to global average pooling along the channel dimension to obtain the second spatiotemporal descriptor. The first spatiotemporal descriptor and the second spatiotemporal descriptor are fused to obtain a spatiotemporal fused descriptor; Spatiotemporal features are extracted from the spatiotemporal fusion descriptor and a three-dimensional spatiotemporal attention is generated through an activation function.

[0087] Specifically, refer to Figure 6, which is a schematic diagram of the spatiotemporal attention module provided in this application.

[0088] Global max pooling is performed on the channel-weighted results along the channel dimension to obtain the first spatiotemporal descriptor. Global max pooling can extract the spatiotemporal feature information with the highest response intensity among all features, and can accurately anchor the most identifiable spatial location of different crops within a specific phenological period.

[0089] Simultaneously, global average pooling is performed on the channel-weighted results along the channel dimension to obtain the second spatiotemporal descriptor. Global average pooling can retain the overall response statistics of spectral features at all spatial locations and phenological stages, without missing spatiotemporal features with weaker response intensity but still helpful for overall classification.

[0090] Furthermore, the first and second spatiotemporal descriptors are added together, fusing the strongest response features and global statistical features to obtain a spatiotemporal fusion descriptor. This preserves the feature differences between different spatiotemporal locations without losing overall growth information. Then, a large-kernel (e.g., 7×7×7) 3D convolutional layer is used to extract spatiotemporal features from the spatiotemporal fusion descriptor. Finally, a sigmoid activation function is used to map the output to the 0-1 range, generating a three-dimensional spatiotemporal attention that reflects the importance of each spatiotemporal location. The three-dimensional spatiotemporal attention is used to automatically focus on key phenological stages of crop growth (such as the heading stage) and spatial texture within the plot, while suppressing cloud shadows and background noise.
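The spatiotemporal attention computation described in [0088] to [0090] can be sketched with NumPy. The layout (C, T, H, W), zero padding, the absence of a bias term, and the random 7×7×7 kernel are assumptions for illustration; `conv3d_same` is a hypothetical helper written with explicit loops so that the example has no dependency beyond NumPy.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3d_same(vol, kernel):
    """Zero-padded, same-size 3D cross-correlation of a (T, H, W) volume (odd kernel)."""
    kt, kh, kw = kernel.shape
    pad = np.pad(vol, ((kt // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2))
    t, h, w = vol.shape
    out = np.zeros((t, h, w))
    for i in range(kt):                 # accumulate one shifted copy per kernel tap
        for j in range(kh):
            for k in range(kw):
                out += kernel[i, j, k] * pad[i:i + t, j:j + h, k:k + w]
    return out

def spatiotemporal_attention(x, kernel):
    """x: channel-weighted features (C, T, H, W). Returns weights shaped (1, T, H, W)."""
    max_desc = x.max(axis=0)            # first spatiotemporal descriptor
    avg_desc = x.mean(axis=0)           # second spatiotemporal descriptor
    fused = max_desc + avg_desc         # fusion by addition, as described above
    return sigmoid(conv3d_same(fused, kernel))[None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 6, 6))
kernel = 0.01 * rng.normal(size=(7, 7, 7))   # large kernel, random stand-in weights
att = spatiotemporal_attention(x, kernel)
weighted = att * x                           # spatiotemporal weighted result
print(att.shape, weighted.shape)
```

The single attention volume is shared by all channels, so every band at a given position and time phase is scaled by the same weight.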

[0091] This application embodiment retains the spatiotemporal features with the strongest response and the global statistical spatiotemporal features from the input features respectively, and then fuses them to generate a three-dimensional spatiotemporal attention. This can not only accurately capture the differences in typical growth characteristics of different crops at key phenological stages and key spatial locations, but also take into account the overall growth patterns of the entire temporal range and spatial region. This allows for better focus on phenological information and spatial texture information that are effective for classification tasks, and ultimately improves the accuracy of crop type extraction.

[0092] In summary, this application discloses a crop classification method based on a hybrid 3D convolution and attention mechanism. It constructs a model with a 3D convolutional neural network as its backbone to directly extract spatiotemporal coupling features from multi-temporal and multispectral images. A three-dimensional convolutional attention module is embedded in the network, extending two-dimensional convolutional attention to three-dimensional volumetric data. A cascaded saliency labeling of "spectral channel → spatiotemporal position" is implemented for T×H×W spatiotemporal volumetric features, adaptively enhancing key temporal phases, bands, and spatial textures to improve the separability between crops. Simultaneously, a time alignment enhancement mechanism is employed, using the date within the year as an explicit positional code added to the image channel. During the training phase, temporal jitter involving overall translation and local perturbations is applied. By simulating phenological drift and irregular image time intervals, the model is forced to learn relative phenological features, thereby significantly reducing performance fluctuations and improving boundary consistency.

[0093] To compare the technology of this application (TABS-Net) with the closest existing technologies (3D-CNN, 3D-2D-CNN), the publicly available Cropland Data Layer (CDL) dataset was used for training and evaluation, and the following significant technical advancements and beneficial effects were achieved.

[0094] 1. Significantly improved robustness and stability of classification. Refer to Figure 7, which is a box plot comparing the performance evaluation metrics of the models. Existing technologies (3D-CNN, 3D-2D-CNN) lack explicit modeling of phenological drift, resulting in large performance fluctuations. Comparison results: as shown in Figure 7, the model of this application (TABS-Net) has a more compact box distribution and a narrower whisker range for overall accuracy (OA), Kappa coefficient, macro-average F1 score, and mean intersection over union (mIoU). Beneficial effect: this demonstrates that the DOY positional encoding and temporal jitter enhancement strategy of this application effectively aligns phenological information, giving the model strong stability against cross-sample differences in temporal distribution, so that it no longer relies on strict temporal synchronization between training and test data.

[0095] 2. Effective resolution of fragmentation within plots and blurred boundaries, significantly improving spatial consistency. Refer to Figure 8, which is a comparative diagram of the classification results of the models, where orange represents corn and green represents soybeans. Existing technologies lack an adaptive feature selection mechanism and are easily affected by noise from cloud shadows, soil background, and other sources, often producing "salt and pepper noise" and jagged boundaries in the classification results. Comparison results: as shown in Figure 8, the comparison models (especially 3D-2D-CNN) exhibit severe plot fragmentation and class confusion in regions 1 and 2. In contrast, the classification map generated by this application shows highly continuous and smooth plots, and accurately preserves the boundaries of subtle linear features such as field ridges and roads in region 3. Beneficial effect: this demonstrates that the 3D convolutional attention module embedded in this application successfully suppresses redundant noise in the spatiotemporal dimensions and adaptively focuses on key spatial textures, significantly improving the mean intersection over union (mIoU, about 4-8 percentage points higher than the comparison techniques) and achieving high-precision, complete extraction at the plot level.

[0096] 3. Overcoming the recognition bias caused by class imbalance. Comparison results: in terms of macro-average F1 score, the median score of this application exceeds 93%, which is significantly better than the comparison models. Beneficial effect: this demonstrates that this application can extract the key band features (such as red edge and shortwave infrared) of different crops, including crops with few samples, in a balanced manner, effectively preventing the model from overfitting to the majority crop categories.
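For reference, the four metrics compared above (OA, Kappa, macro-average F1, and mIoU) can all be derived from a confusion matrix using their standard definitions. The sketch below is illustrative only: the function name and the example matrix are hypothetical, the numbers are not results from this application, and the sketch assumes every class occurs at least once so that no denominator is zero.

```python
import numpy as np

def metrics_from_confusion(cm):
    """cm[i, j] = number of pixels with true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp            # predicted as the class but actually another
    fn = cm.sum(axis=1) - tp            # belonging to the class but predicted as another
    oa = tp.sum() / total
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2   # chance agreement
    kappa = (oa - pe) / (1.0 - pe)
    f1 = 2 * tp / (2 * tp + fp + fn)    # per-class F1
    iou = tp / (tp + fp + fn)           # per-class intersection over union
    return {"OA": oa, "Kappa": kappa, "macro_F1": f1.mean(), "mIoU": iou.mean()}

m = metrics_from_confusion([[40, 10], [5, 45]])   # made-up two-class example
print(m)
```

Macro-average F1 and mIoU average the per-class scores with equal weight, which is why they are sensitive to how well minority classes are recognized.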

[0097] Figure 9 is a schematic diagram of the crop type extraction device provided in this application.

[0098] As shown in Figure 9, the crop type extraction device includes: a module 910, used to construct a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images; and a crop type extraction module 920, used to input the spatiotemporal data cube into the crop type extraction model and obtain the crop type extraction result output by the crop type extraction model. The crop type extraction model is trained as follows: temporal perturbation enhancement is applied to spatiotemporal cube samples to obtain temporally enhanced samples; multi-stage feature extraction is performed on the temporally enhanced samples, and adaptive weight calibration is performed on the features from the spectral, spatial, and temporal dimensions in each stage to obtain encoded features; the encoded features are decoded stage by stage to obtain decoded features; and the model is trained based on the crop type prediction results obtained from the decoded features and the crop type ground truth labels corresponding to the spatiotemporal cube samples, yielding the crop type extraction model.

[0099] The crop type extraction device provided in this application simulates phenological drift by introducing temporal perturbation enhancement during the model training stage, enabling the model to learn more robust time-series patterns and effectively alleviating the performance degradation caused by time misalignment. At the same time, it introduces attention mechanisms over three dimensions (spectral, spatial, and temporal) in multi-stage feature extraction, which suppress background noise and focus on key bands and key phenological windows, thereby significantly improving the classification accuracy and robustness of the model in complex and variable scenarios.

[0100] In one embodiment, the step of performing temporal perturbation enhancement on the spatiotemporal cube sample to obtain temporally enhanced samples includes: perturbing each time phase of the spatiotemporal cube sample to obtain the corresponding perturbed time phases; time-encoding each perturbed time phase to obtain the corresponding temporal features; and embedding each temporal feature into the spatiotemporal cube sample to obtain the temporally enhanced sample.

[0101] In one embodiment, the step of perturbing each phase of the spatiotemporal cube sample to obtain the corresponding perturbed phases includes: The corresponding perturbation phases are obtained by summing each phase of the spatiotemporal cube sample with the random offset.

[0102] In one embodiment, the step of perturbing each phase of the spatiotemporal cube sample to obtain the corresponding perturbed phase further includes: Several time phases to be disturbed are randomly selected from each time phase of the spatiotemporal cube sample. The disturbance time phases are summed with the random jitter amount to obtain the corresponding disturbance time phases.

[0103] In one embodiment, the adaptive weight calibration of features from the spectral, spatial, and temporal dimensions at each stage includes, for each stage: calculating channel attention along the spatial and temporal dimensions for the input features of the stage to obtain three-dimensional channel attention; weighting the input features with the three-dimensional channel attention to obtain a channel-weighted result; calculating spatiotemporal attention along the spectral dimension for the channel-weighted result to obtain three-dimensional spatiotemporal attention; and weighting the channel-weighted result with the three-dimensional spatiotemporal attention to obtain a spatiotemporal weighted result.

[0104] In one embodiment, performing channel attention calculation on the input features of the stage along the spatial and temporal dimensions to obtain three-dimensional channel attention includes: The input features of the aforementioned stage are subjected to global max pooling in both the time and spatial dimensions to obtain the first channel descriptor; the input features of the aforementioned stage are subjected to global average pooling in both the time and spatial dimensions to obtain the second channel descriptor. The first channel descriptor is subjected to nonlinear transformation and feature extraction to obtain the first channel feature; the second channel descriptor is subjected to nonlinear transformation and feature extraction to obtain the second channel feature. The first channel features and the second channel features are fused together and then processed by an activation function to generate a three-dimensional channel attention.

[0105] In one embodiment, performing spatiotemporal attention calculation along the spectral dimension on the channel weighting result to obtain three-dimensional spatiotemporal attention includes: The channel-weighted result is subjected to global max pooling along the channel dimension to obtain the first spatiotemporal descriptor; the channel-weighted result is subjected to global average pooling along the channel dimension to obtain the second spatiotemporal descriptor. The first spatiotemporal descriptor and the second spatiotemporal descriptor are fused to obtain a spatiotemporal fused descriptor; Spatiotemporal features are extracted from the spatiotemporal fusion descriptor and a three-dimensional spatiotemporal attention is generated through an activation function.

[0106] It should be noted that the crop type extraction device provided in this application can execute the crop type extraction method described in any of the above embodiments during specific operation, which will not be elaborated in this embodiment.

[0107] Figure 10 is a schematic diagram of the structure of the electronic device provided in this application. As shown in Figure 10, the electronic device may include: a processor 1010, a communications interface 1020, a memory 1030, and a communications bus 1040, wherein the processor 1010, the communications interface 1020, and the memory 1030 communicate with each other through the communications bus 1040. The processor 1010 can call logical instructions in the memory 1030 to execute a crop type extraction method, which includes: constructing a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images; inputting the spatiotemporal data cube into a crop type extraction model to obtain crop type extraction results output by the crop type extraction model; wherein the crop type extraction model is trained by: performing temporal perturbation enhancement on the spatiotemporal cube samples to obtain temporally enhanced samples; performing multi-stage feature extraction on the temporally enhanced samples, and adaptively calibrating the features from the spectral, spatial, and temporal dimensions in each stage to obtain encoded features; performing stage-by-stage decoding processing on the encoded features to obtain decoded features; and training the model based on the crop type prediction results obtained through the decoded features and the ground truth labels of the crop types corresponding to the spatiotemporal cube samples to obtain the crop type extraction model.

[0108] Furthermore, the logical instructions in the aforementioned memory 1030 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0109] On the other hand, this application also provides a computer program product, which includes a computer program stored on a non-transitory computer-readable storage medium. The computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the crop type extraction method provided in the above embodiments. The method includes: constructing a spatiotemporal data cube based on multi-temporal multi-band remote sensing images; inputting the spatiotemporal data cube into a crop type extraction model to obtain a crop type extraction result output by the crop type extraction model; wherein the crop type extraction model is trained by: performing temporal perturbation enhancement on the spatiotemporal cube samples to obtain temporally enhanced samples; performing multi-stage feature extraction on the temporally enhanced samples, and adaptively calibrating the features from the spectral dimension, spatial dimension and temporal dimension in each stage to obtain encoded features; performing stage-by-stage decoding processing on the encoded features to obtain decoded features; and training the model based on the crop type prediction result obtained through the decoded features and the crop type ground truth label corresponding to the spatiotemporal cube samples to obtain a crop type extraction model.

[0110] In another aspect, this application also provides a non-transitory computer-readable storage medium storing a computer program thereon. When executed by a processor, the computer program implements the crop type extraction method provided in the above embodiments. The method includes: constructing a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images; inputting the spatiotemporal data cube into a crop type extraction model to obtain a crop type extraction result output by the crop type extraction model; wherein the crop type extraction model is trained by: performing temporal perturbation enhancement on spatiotemporal cube samples to obtain temporally enhanced samples; performing multi-stage feature extraction on the temporally enhanced samples, and performing adaptive weight calibration on the features from the spectral, spatial, and temporal dimensions in each stage to obtain encoded features; performing stage-by-stage decoding on the encoded features to obtain decoded features; and training the model based on the crop type prediction result obtained from the decoded features and the crop type ground truth label corresponding to the spatiotemporal cube samples, to obtain the crop type extraction model.

[0111] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0112] From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solutions in essence, or the part of them that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments or in some parts of the embodiments.

[0113] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A crop type extraction method, characterized in that the crop type extraction method comprises: constructing a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images; and inputting the spatiotemporal data cube into a crop type extraction model to obtain a crop type extraction result output by the crop type extraction model; wherein the crop type extraction model is trained by: performing temporal perturbation enhancement on spatiotemporal cube samples to obtain temporally enhanced samples; performing multi-stage feature extraction on the temporally enhanced samples, and performing adaptive weight calibration on the features from the spectral dimension, the spatial dimension, and the temporal dimension in each stage to obtain encoded features; performing stage-by-stage decoding on the encoded features to obtain decoded features; and training the model based on the crop type prediction results obtained from the decoded features and the crop type ground truth labels corresponding to the spatiotemporal cube samples, to obtain the crop type extraction model.
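
By way of non-limiting illustration, the cube-construction step of claim 1 can be sketched as stacking the per-date multi-band images into one array. The (T, C, H, W) axis order, the function name `build_cube`, and the toy sizes are assumptions made for this sketch; the claim does not prescribe a layout.

```python
import numpy as np

def build_cube(images):
    """Stack per-date images of shape (C, H, W) into a cube of shape (T, C, H, W)."""
    shapes = {img.shape for img in images}
    if len(shapes) != 1:
        raise ValueError(f"all dates must share one (C, H, W) shape, got {shapes}")
    return np.stack(images, axis=0).astype(np.float32)

# Toy input: 6 acquisition dates, 4 bands, 8x8 pixels (hypothetical sizes).
rng = np.random.default_rng(0)
images = [rng.random((4, 8, 8)) for _ in range(6)]
cube = build_cube(images)
```

In practice the images would first need to be co-registered and resampled to a common grid, which this sketch assumes has already been done.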

2. The crop type extraction method according to claim 1, characterized in that performing temporal perturbation enhancement on the spatiotemporal cube samples to obtain temporally enhanced samples comprises: perturbing each time phase of the spatiotemporal cube sample to obtain a corresponding perturbed time phase; time-encoding each perturbed time phase to obtain a corresponding time feature; and embedding each time feature into the spatiotemporal cube sample to obtain the temporally enhanced sample.
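
By way of non-limiting illustration, the time-encoding and embedding steps of claim 2 might look as follows. The sinusoidal encoding of the day of year, the embedding by appending constant channels, and the names `encode_phase` and `embed_time` are all assumptions of this sketch; the claim does not fix the encoding or the embedding operation.

```python
import numpy as np

def encode_phase(doy, dim=4, period=365.0):
    """Sinusoidal code of one acquisition date (day of year); dim must be even."""
    k = np.arange(1, dim // 2 + 1)
    angle = 2.0 * np.pi * k * doy / period
    return np.concatenate([np.sin(angle), np.cos(angle)])

def embed_time(cube, doys, dim=4):
    """Append each date's time code to a (T, C, H, W) cube as `dim` constant channels."""
    t, c, h, w = cube.shape
    codes = np.stack([encode_phase(d, dim) for d in doys])            # (T, dim)
    planes = np.broadcast_to(codes[:, :, None, None], (t, dim, h, w))
    return np.concatenate([cube, planes], axis=1)                     # (T, C + dim, H, W)

cube = np.zeros((3, 2, 4, 4), dtype=np.float32)
perturbed_doys = [98.0, 131.5, 160.0]   # hypothetical perturbed acquisition dates
enhanced = embed_time(cube, perturbed_doys)
```

Because the codes are computed from the perturbed dates rather than the nominal ones, the model sees slightly different time features for the same imagery in each training pass.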

3. The crop type extraction method according to claim 2, characterized in that perturbing each time phase of the spatiotemporal cube sample to obtain a corresponding perturbed time phase comprises: summing each time phase of the spatiotemporal cube sample with a random offset to obtain the corresponding perturbed time phase.
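
By way of non-limiting illustration, the offset-based perturbation of claim 3 can be sketched as below. Drawing an independent uniform offset per time phase, expressing time phases as days of year, and the name `perturb_all` are assumptions; the claim would equally cover a single offset shared by all phases or a different distribution.

```python
import numpy as np

def perturb_all(doys, max_shift=3.0, rng=None):
    """Add an independent random offset in [-max_shift, max_shift] to every time phase."""
    rng = np.random.default_rng() if rng is None else rng
    doys = np.asarray(doys, dtype=float)
    return doys + rng.uniform(-max_shift, max_shift, size=doys.shape)

doys = [100, 130, 160, 190]   # hypothetical acquisition days of year
shifted = perturb_all(doys, max_shift=3.0, rng=np.random.default_rng(0))
```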

4. The crop type extraction method according to claim 2, characterized in that perturbing each time phase of the spatiotemporal cube sample to obtain a corresponding perturbed time phase further comprises: randomly selecting several time phases to be perturbed from the time phases of the spatiotemporal cube sample; and summing each time phase to be perturbed with a random jitter amount to obtain the corresponding perturbed time phase.
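
By way of non-limiting illustration, the selective jitter of claim 4 can be sketched as below. The number of selected phases, the uniform jitter range, and the name `perturb_some` are assumptions of this sketch.

```python
import numpy as np

def perturb_some(doys, n_select=2, max_jitter=3.0, rng=None):
    """Jitter only a random subset of time phases; the rest are left unchanged."""
    rng = np.random.default_rng() if rng is None else rng
    out = np.asarray(doys, dtype=float).copy()
    idx = rng.choice(out.size, size=n_select, replace=False)   # phases to be perturbed
    out[idx] += rng.uniform(-max_jitter, max_jitter, size=n_select)
    return out, np.sort(idx)

base = np.array([100.0, 130.0, 160.0, 190.0])   # hypothetical days of year
jittered, picked = perturb_some(base, n_select=2, rng=np.random.default_rng(0))
```

Returning the selected indices alongside the perturbed phases makes it easy to check which dates were altered.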

5. The crop type extraction method according to any one of claims 1 to 4, characterized in that performing adaptive weight calibration on the features from the spectral dimension, the spatial dimension, and the temporal dimension in each stage comprises, for each stage: performing channel attention calculation on the input features of the stage along the spatial and temporal dimensions to obtain three-dimensional channel attention; weighting the input features with the three-dimensional channel attention to obtain a channel-weighted result; performing spatiotemporal attention calculation on the channel-weighted result along the spectral dimension to obtain three-dimensional spatiotemporal attention; and weighting the channel-weighted result with the three-dimensional spatiotemporal attention to obtain a spatiotemporally weighted result.

6. The crop type extraction method according to claim 5, characterized in that performing channel attention calculation on the input features of the stage along the spatial and temporal dimensions to obtain three-dimensional channel attention comprises: performing global max pooling on the input features of the stage over the temporal and spatial dimensions to obtain a first channel descriptor, and performing global average pooling on the input features of the stage over the temporal and spatial dimensions to obtain a second channel descriptor; performing nonlinear transformation and feature extraction on the first channel descriptor to obtain a first channel feature, and performing nonlinear transformation and feature extraction on the second channel descriptor to obtain a second channel feature; and fusing the first channel feature and the second channel feature and processing the fused result with an activation function to generate the three-dimensional channel attention.
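
By way of non-limiting illustration, the channel attention of claim 6 can be sketched in NumPy as below. The (C, T, H, W) feature layout, the two-layer perceptron with shared weights `w1` and `w2`, fusion by addition, and the sigmoid activation are assumptions of this sketch; the claim only requires nonlinear transformation, fusion, and an activation function.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, T, H, W). Returns attention weights of shape (C, 1, 1, 1)."""
    max_desc = x.max(axis=(1, 2, 3))     # first channel descriptor, shape (C,)
    avg_desc = x.mean(axis=(1, 2, 3))    # second channel descriptor, shape (C,)

    def mlp(d):
        # Nonlinear transformation: linear -> ReLU -> linear (assumed form).
        return w2 @ np.maximum(w1 @ d, 0.0)

    fused = mlp(max_desc) + mlp(avg_desc)          # fusion by addition (assumed)
    return sigmoid(fused)[:, None, None, None]

rng = np.random.default_rng(0)
C, T, H, W, hidden = 8, 5, 6, 6, 4                 # hypothetical sizes
x = rng.standard_normal((C, T, H, W))
w1 = rng.standard_normal((hidden, C)) * 0.1
w2 = rng.standard_normal((C, hidden)) * 0.1
att = channel_attention(x, w1, w2)
weighted = att * x                                 # channel-weighted result
```

Each channel receives one scalar weight between 0 and 1, which is broadcast over all time phases and pixels when the input features are weighted.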

7. The crop type extraction method according to claim 5, characterized in that performing spatiotemporal attention calculation on the channel-weighted result along the spectral dimension to obtain three-dimensional spatiotemporal attention comprises: performing global max pooling on the channel-weighted result along the channel dimension to obtain a first spatiotemporal descriptor, and performing global average pooling on the channel-weighted result along the channel dimension to obtain a second spatiotemporal descriptor; fusing the first spatiotemporal descriptor and the second spatiotemporal descriptor to obtain a spatiotemporal fusion descriptor; and extracting spatiotemporal features from the spatiotemporal fusion descriptor and generating the three-dimensional spatiotemporal attention through an activation function.
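
By way of non-limiting illustration, the spatiotemporal attention of claim 7 can be sketched in NumPy as below. Fusing the two descriptors by stacking them, extracting features with a single zero-padded 3×3×3 convolution, and using a sigmoid activation are assumptions of this sketch, as are the helper names.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv3d_same(desc, kernel):
    """desc: (2, T, H, W); kernel: (2, k, k, k) with odd k. Zero-padded, one output map."""
    k = kernel.shape[1]
    p = k // 2
    padded = np.pad(desc, ((0, 0), (p, p), (p, p), (p, p)))
    _, t, h, w = desc.shape
    out = np.zeros((t, h, w))
    for dt in range(k):
        for dh in range(k):
            for dw in range(k):
                window = padded[:, dt:dt + t, dh:dh + h, dw:dw + w]   # (2, T, H, W)
                out += np.tensordot(kernel[:, dt, dh, dw], window, axes=1)
    return out

def spatiotemporal_attention(y, kernel):
    """y: channel-weighted features (C, T, H, W). Returns weights of shape (1, T, H, W)."""
    max_desc = y.max(axis=0)                  # first spatiotemporal descriptor
    avg_desc = y.mean(axis=0)                 # second spatiotemporal descriptor
    fused = np.stack([max_desc, avg_desc])    # fusion by stacking (assumed)
    return sigmoid(conv3d_same(fused, kernel))[None]

rng = np.random.default_rng(0)
y = rng.standard_normal((8, 5, 6, 6))         # hypothetical channel-weighted result
kernel = rng.standard_normal((2, 3, 3, 3)) * 0.1
st_att = spatiotemporal_attention(y, kernel)
st_weighted = st_att * y                      # spatiotemporally weighted result
```

The explicit loops keep the convolution easy to verify; a deep learning framework's 3D convolution layer would normally replace `conv3d_same`.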

8. A crop type extraction device, characterized in that the crop type extraction device comprises: a construction module, configured to construct a spatiotemporal data cube based on multi-temporal, multi-band remote sensing images; and a crop type extraction module, configured to input the spatiotemporal data cube into a crop type extraction model and obtain a crop type extraction result output by the crop type extraction model; wherein the crop type extraction model is trained by: performing temporal perturbation enhancement on spatiotemporal cube samples to obtain temporally enhanced samples; performing multi-stage feature extraction on the temporally enhanced samples, and performing adaptive weight calibration on the features from the spectral dimension, the spatial dimension, and the temporal dimension in each stage to obtain encoded features; performing stage-by-stage decoding on the encoded features to obtain decoded features; and training the model based on the crop type prediction results obtained from the decoded features and the crop type ground truth labels corresponding to the spatiotemporal cube samples, to obtain the crop type extraction model.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the crop type extraction method according to any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the crop type extraction method according to any one of claims 1 to 7.