A rice detection method and system based on environmental data and deep learning
By generating environmental feature maps and extracting environmental context vectors, the problem of insensitive multimodal data fusion in existing technologies is solved, enabling early and accurate detection of rice growth status.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SICHUAN AGRI UNIV
- Filing Date: 2026-05-20
- Publication Date: 2026-06-19
AI Technical Summary
Existing technologies employ static and balanced strategies for multimodal data fusion, resulting in insensitivity to key areas at critical moments, delayed response, and low accuracy.
By acquiring spatial data and point-distributed time-series data, spatial interpolation is performed to generate an environmental feature map. The first neural network extracts the spatial feature map, and the second neural network extracts the environmental context vector. The attention generation unit is then used to combine and weight the features to generate a fused spatiotemporal feature map.
It achieves dynamic and unbalanced feature fusion, which improves the detection sensitivity of early anomalies and key areas and reduces detection lag.
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and more specifically, to a method and system for detecting rice based on environmental data and deep learning.
Background Technology
[0002] In many fields such as precision agriculture, environmental monitoring, autonomous driving, and weather forecasting, systems need to process data from different sources and modalities. For example, in precision agriculture, a system simultaneously acquires high-resolution drone or satellite aerial images (i.e., spatial data) and time-series data collected by sensors deployed at specific ground locations, such as weather stations and soil moisture meters (i.e., point-distributed time-series data). How to effectively fuse these two spatiotemporally heterogeneous types of data to achieve early and accurate detection of crop growth status, pests and diseases, or environmental stresses (such as drought and salinization) is a key technical problem in this field.
[0003] Currently, a common multimodal data fusion technique is feature-level fusion. This approach typically employs parallel neural network branches; for example, a convolutional neural network (CNN) is used to extract spatial feature maps from images, while a recurrent neural network (RNN) is used to extract temporal feature vectors from sensor time-series data. Then, in the intermediate layers of the network, these two feature vectors are simply concatenated or added to form the fused features, which are then fed into downstream networks for analysis.
[0004] However, these existing solutions have a significant technical flaw: the fusion strategy is "static" and "balanced." Whether features are concatenated or added, the fusion weights assigned to spatial features from images and temporal features from sensors are fixed when processing data; in other words, the fusion logic is "blind." The model cannot dynamically and unevenly adjust the analysis priority or attention weight of "key regions" or "key features" in the spatial feature map (e.g., newly appearing "leaf curling" textures or areas with a "sudden increase in soil salinity") based on "critical moment" information presented in the temporal data (e.g., sensor data showing "continuous high temperature and drought"). This "blind" fusion leaves the model insensitive to early, subtle anomalous signals, leading to detection lag and low accuracy.
[0005] In view of this, a rice detection method and system based on environmental data and deep learning is proposed.
Summary of the Invention
[0006] The purpose of this invention is to provide a rice detection method and system based on environmental data and deep learning, so as to solve the technical problem that the use of static and balanced strategies in the fusion of multimodal features in the prior art leads to insensitivity and lag in the detection of key areas at critical moments.
[0007] To solve the above technical problems, this invention provides a rice detection method based on environmental data and deep learning, comprising the following steps:
S1. Acquire spatial data and point-distributed time-series data;
S2. Perform spatial interpolation on the point-distributed time-series data to generate an environmental feature map;
S3. Process the spatial data using a first neural network to extract a spatial feature map;
S4. Process the environmental feature map using a second neural network to extract an environmental context vector;
S5. Combine the environmental context vector with the spatial feature map; input the combined features into an attention generation unit to compute a spatial attention map; and use the spatial attention map to weight the spatial feature map to obtain a fused spatiotemporal feature map;
S6. Generate a detection result based on the fused spatiotemporal feature map.
[0008] As a further improvement to this technical solution, the step of spatial interpolation of the point-distributed time series data includes: The point-distributed time series data are converted into a spatially continuous environmental feature map using Kriging interpolation, inverse distance weighting, or spline interpolation.
[0009] As a further improvement to this technical solution, the steps for the second neural network to process the environmental feature map include: Obtain a sequence of environmental feature maps within a time window; The environmental feature map sequence is input into a three-dimensional convolutional network or a convolutional long short-term memory network to extract the environmental context vector that encodes the temporal dynamics.
[0010] As a further improvement to this technical solution, the feature combination in step S5 includes: broadcasting or tiling the environmental context vector along the spatial dimensions so that its spatial dimensions match those of the spatial feature map; and concatenating the broadcast environmental context vector with the spatial feature map along the channel dimension.
[0011] As a further improvement to this technical solution, the attention generation unit in step S5 includes: at least one 1×1 convolutional layer followed by a Sigmoid or Softmax activation function.
[0012] As a further improvement to this technical solution, the step of processing the spatial data using a first neural network includes: Extract the spatial feature maps at multiple scales; Step S5 is performed at at least one of the multiple scales to generate the fused spatiotemporal feature map at the corresponding scale.
[0013] As a further improvement to this technical solution, the first neural network is part of a U-shaped network architecture with an encoder and a decoder; step S5 is performed at multiple downsampling levels of the encoder; the decoder receives the fused spatiotemporal feature maps generated at the multiple scales through skip connections to generate the detection result.
[0014] As a further improvement to this technical solution, the method further includes: Before performing the weighting step in step S5, the reliability of the time series data or the environmental feature map is evaluated; Based on the reliability, the weighting strength of the spatial attention map on the spatial feature map is dynamically adjusted.
[0015] As a further improvement to this technical solution, the method further includes: using the spatial feature map or its derived features to perform reverse weighting on the environmental context vector or the environmental feature map to achieve bidirectional attention fusion.
[0016] A rice detection system based on environmental data and deep learning, used to implement the above rice detection method based on environmental data and deep learning, comprising:
a data acquisition module, used to acquire spatial data and point-distributed time-series data;
an interpolation module, used to perform spatial interpolation on the point-distributed time-series data to generate an environmental feature map;
a first feature extraction module, used to process the spatial data using a first neural network to extract a spatial feature map;
a second feature extraction module, used to process the environmental feature map using a second neural network to extract the environmental context vector;
a spatiotemporal attention fusion module, used to combine the environmental context vector with the spatial feature map, input the combined features into the attention generation unit to compute a spatial attention map, and use the spatial attention map to weight the spatial feature map to obtain the fused spatiotemporal feature map;
a result generation module, used to generate detection results based on the fused spatiotemporal feature map.
[0017] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. In this rice detection method and system based on environmental data and deep learning, an environmental context vector is extracted through a second neural network, and a spatial attention map is dynamically generated using this vector. The spatial feature map is then weighted using this attention map. This abandons the "static and balanced" fusion method of traditional technology and realizes a "dynamic and unbalanced" feature fusion "guided" by the environmental context.
[0018] 2. This rice detection method and system based on environmental data and deep learning solves the technical problem of difficulty in aligning point-like time-series data with surface-like spatial data and difficulty in heterogeneous fusion by generating environmental feature maps through spatial interpolation of point-like time-series data.
[0019] 3. In this rice detection method and system based on environmental data and deep learning, because the fusion process is dynamic and focused, the model can actively and preferentially analyze the "key areas" in spatial features (such as the area where the "leaf curling" feature is located) based on "critical moments" (such as "high temperature" information encoded by environmental context vectors), thereby significantly improving the model's sensitivity and accuracy in detecting early anomalies or key areas and reducing detection lag.
Attached Figure Description
[0020] Figure 1 is a flowchart of the overall method of the present invention; Figure 2 is an overall system block diagram of the present invention.
Detailed Implementation
[0021] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0022] Example 1
In many fields such as precision agriculture, environmental monitoring, autonomous driving, and weather forecasting, systems need to process data from different sources and modalities. How to effectively integrate this data from different sources and modalities to achieve early and accurate detection of crop growth, pests and diseases, or environmental stresses (such as drought and salinization) is a key technical problem in this field.
[0023] Currently, a common multimodal data fusion technique is feature-level fusion: This approach typically employs parallel neural network branches. For example, a convolutional neural network (CNN) is used to extract spatial feature maps of the image, while a recurrent neural network (RNN) is used to extract temporal feature vectors from the sensor's time-series data. Then, in the intermediate layers of the network, these two feature vectors are simply concatenated or added to form a fused feature, which is then fed into the downstream network for analysis.
[0024] However, these existing solutions have a significant technical flaw: the fusion strategy is "static" and "balanced." Whether features are concatenated or added, the fusion weights assigned to spatial features from images and temporal features from sensors are fixed when processing data; in other words, the fusion logic is "blind." The model cannot dynamically and unevenly adjust the analysis priority or attention weight of "key regions" or "key features" in the spatial feature map (e.g., newly appearing "leaf curling" texture or areas with a "sudden increase in soil salinity") based on "critical moment" information presented in the temporal data (e.g., sensor data showing "continuous high temperature and drought"). This "blind" fusion leaves the model insensitive to early, subtle anomalous signals, leading to detection lag and low accuracy. In view of this, and referring to Figure 1, one objective of this invention is to provide a rice detection method based on environmental data and deep learning, which includes:
S1. Acquire spatial data and point-distributed time-series data. Rice growth is influenced by both spatial morphology (such as leaf texture and canopy structure) and environmental factors (such as temperature, humidity, and soil nutrients), so both types of data are needed to comprehensively reflect its growth status. However, the two types of data come from different sources and are morphologically heterogeneous (spatial data is "area-like," while time-series data is "point-like"). Effective acquisition and preprocessing of these data is therefore fundamental to subsequent data fusion.
For spatial data, aerial images taken by UAVs or satellite remote sensing images (such as Sentinel-2 imagery) are used, with a resolution of 10–30 meters (adjusted according to the required detection accuracy; for early detection of pests and diseases, the higher resolution of 10 meters is selected). The spectral bands include visible light (red, green, and blue) and near-infrared (used for vegetation cover retrieval). The acquisition interval is 3–7 days (matching the growth rate of rice; shortened to 3 days for key stages such as tillering and grain filling). Data is downloaded automatically through data interfaces (such as UAV SDKs and satellite data platform APIs). Preprocessing includes radiometric correction (removing atmospheric scattering effects) and geometric correction (aligning with ground coordinates), and the result is saved as image data in PNG or TIFF format.
For point-distributed time-series data, data is collected using a sensor array deployed in the paddy fields. Sensor types include: air temperature and humidity sensors (sampled once per hour), soil moisture sensors (monitoring soil moisture in the 0–20 cm layer, sampled once every 2 hours), light intensity sensors (once per hour), and soil pH sensors (once per day). The sensor density is one sensor per 500 square meters (ensuring uniform coverage), with denser placement at the edge of each plot (one sensor per 300 square meters). Data is transmitted wirelessly to the server via LoRa. Preprocessing includes outlier removal (based on the 3σ rule, removing data outside the mean ± 3 standard deviations) and missing-value completion (using linear interpolation; data with consecutive gaps exceeding 24 hours are marked as low-reliability). The final data is saved as a CSV file containing timestamps, sensor IDs, and monitored values.
Through the above steps and technical means, the spatial data obtained can reflect the spatial morphological characteristics of rice, and the temporal data can capture the dynamic changes of environmental factors, providing high-quality input for subsequent fusion.
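The sensor preprocessing described above (3σ outlier removal, linear interpolation of gaps, and flagging of gaps longer than 24 hours) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the use of `None` for missing readings, the population standard deviation, the nearest-value fill at the ends of a series, and the assumption of hourly samples (so 24 samples = 24 hours) are all assumptions made for the example.

```python
from statistics import mean, pstdev


def remove_outliers(values):
    """Replace readings outside mean ± 3 standard deviations with None."""
    present = [v for v in values if v is not None]
    mu, sigma = mean(present), pstdev(present)
    return [v if v is not None and abs(v - mu) <= 3 * sigma else None
            for v in values]


def fill_missing(values, max_gap=24):
    """Linearly interpolate gaps; flag gaps longer than max_gap samples.

    Returns (filled_values, low_reliability_flags). Gaps touching either
    end of the series are filled with the nearest known value (assumption).
    """
    filled = list(values)
    low_reliability = [False] * len(values)
    i = 0
    while i < len(values):
        if values[i] is not None:
            i += 1
            continue
        start = i
        while i < len(values) and values[i] is None:
            i += 1
        end = i  # index of the first known value after the gap
        left = values[start - 1] if start > 0 else None
        right = values[end] if end < len(values) else None
        is_long = (end - start) > max_gap
        for j in range(start, end):
            if left is not None and right is not None:
                t = (j - start + 1) / (end - start + 1)
                filled[j] = left + t * (right - left)
            else:
                filled[j] = left if left is not None else right
            low_reliability[j] = is_long
    return filled, low_reliability
```

For example, `fill_missing([1.0, None, None, 4.0])` fills the two missing hourly readings with values on the straight line between 1.0 and 4.0 and flags nothing, while a gap of 25 consecutive hourly readings is still filled but every filled sample is flagged as low-reliability.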
[0025] S2. Spatial interpolation is performed on the point-distributed time-series data to generate an environmental feature map. This step includes using Kriging interpolation, inverse distance weighting, or spline interpolation to convert the point-distributed time-series data into a spatially continuous environmental feature map. Point-based time-series data only have values at discrete locations, while spatial data is a continuous surface, so the spatial dimensions of the two do not match. Fusing them directly would misalign the information and make it impossible to relate environmental factors to the spatial distribution of rice. Therefore, a spatial interpolation algorithm is used to convert the point-based time-series data into a continuous environmental feature map with the same resolution as the spatial data. Specifically, this includes: Kriging interpolation (preferred method, suitable for scenarios with strong spatial correlation): First, based on the spatial coordinates of the sensors (latitude and longitude converted to planar coordinates), the variation of the data at different separation distances is calculated and fitted with a spherical model (parameters: sill = 1.2 times the data variance, range = 1.5 times the average sensor spacing; for example, if the sensor spacing is 50 meters, the range is set to 75 meters) to obtain the variogram function.
Next, for each pixel in the spatial data, its three nearest sensors are found (to ensure coverage), their weights are calculated from the variogram function, and the weighted sensor values are summed to obtain the environmental value of the pixel, completing the interpolation. Finally, edge pixels that fall outside the sensors' coverage area are filled in using inverse distance weighting (the weight is inversely proportional to the square of the distance). Alternatively, inverse distance weighting can be used on its own (as a backup, suitable for densely distributed sensors): for each pixel, the five nearest sensors are selected, the weights are the inverse squared distances normalized by their sum, and the weighted average is calculated. Spline interpolation can also be used (for smoothly varying parameters such as soil pH): the sensor data is fitted with a cubic spline function to generate a continuous surface, ensuring that the interpolated results agree with the measured values at the sensor locations. Through the above steps and technical means, the "point-like" time-series data is transformed into "area-like" environmental feature maps, which solves the spatial dimension alignment problem and allows environmental information to be directly associated with the spatial characteristics of rice.
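As an illustration of the inverse distance weighting variant described above (five nearest sensors, weights proportional to the inverse squared distance, normalized by their sum), a minimal sketch might look like this. The function names, the `((x, y), value)` sensor representation, and the choice of pixel centers as sample points are assumptions made for the example; Kriging and spline interpolation are not shown.

```python
import math


def idw_value(pixel_xy, sensors, k=5, power=2):
    """Interpolate one pixel from its k nearest sensors.

    sensors: list of ((x, y), value) tuples in planar coordinates.
    """
    nearest = sorted(sensors, key=lambda s: math.dist(pixel_xy, s[0]))[:k]
    weighted_sum = 0.0
    weight_total = 0.0
    for xy, value in nearest:
        d = math.dist(pixel_xy, xy)
        if d == 0.0:
            return value  # reproduce the measurement at a sensor location
        w = 1.0 / d ** power
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


def idw_grid(sensors, width, height, cell_size, k=5):
    """Build a height x width environmental feature map (one parameter)."""
    return [
        [idw_value(((col + 0.5) * cell_size, (row + 0.5) * cell_size),
                   sensors, k)
         for col in range(width)]
        for row in range(height)
    ]
```

A pixel midway between two sensors reading 10.0 and 20.0 receives equal weights and therefore the value 15.0. Setting `cell_size` to the ground resolution of the imagery yields a map aligned pixel for pixel with the spatial data.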
[0026] S3. The spatial data is processed using a first neural network to extract spatial feature maps; the step of processing the spatial data using the first neural network includes: extracting spatial feature maps at multiple scales; step S5 is performed at at least one of the multiple scales to generate the fused spatiotemporal feature map at the corresponding scale; the first neural network is part of a U-shaped network architecture with an encoder and a decoder; step S5 is performed at multiple downsampling levels of the encoder; the decoder receives the fused spatiotemporal feature maps generated at the multiple scales through skip connections to generate the detection result; Considering that spatial data (such as remote sensing images) contains a large amount of redundant information (such as background soil and weeds), it is necessary to extract key spatial features related to rice (such as leaf texture and canopy density). Furthermore, features at different scales (global distribution and local details) are meaningful for detection (e.g., global features reflect overall growth, while local features reflect early pests and diseases). Therefore, the first neural network adopts a U-shaped network with an encoder and decoder structure (such as U-Net or DeepLabv3+). The focus is on extracting multi-scale spatial features through the encoder. 
The specific technical solution and operational details are as follows:
Encoder structure: It contains 4 downsampling levels, each consisting of 2 convolutional layers (3×3 kernels, stride 1), a batch normalization layer, a ReLU activation function, and 1 max pooling layer (2×2, stride 2);
Level 1 (Small Scale): Takes the raw spatial image as input (e.g., 3 visible-light channels + 1 near-infrared channel, 4 channels in total) and outputs a 64-channel feature map (1/2 of the input resolution), capturing detailed features such as leaf texture and local color changes;
Level 2: Outputs a 128-channel feature map (1/4 resolution), capturing the morphological features of plant clusters;
Level 3: Outputs a 256-channel feature map (1/8 resolution), capturing the regional features of rice distribution within the plot;
Level 4 (Large Scale): Outputs a 512-channel feature map (1/16 resolution), capturing the global growth trend of the entire paddy field.
[0027] Multi-scale feature extraction: The output of each downsampling level is used as a spatial feature map (4 scales in total) for subsequent fusion at different granularities (small scale focuses on local anomalies, large scale focuses on global trends).
[0028] Through the above steps and technical means, multi-scale feature extraction is achieved, comprehensively capturing the spatial information of rice from local details to global distribution, providing a spatial feature foundation for subsequent dynamic fusion.
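The feature map sizes at the four encoder levels follow directly from the 2×2, stride-2 pooling at each level. The short sketch below works through that arithmetic. It assumes the 3×3 convolutions are padded so that they preserve height and width (the text does not state the padding), and the function name is hypothetical.

```python
def encoder_shapes(height, width, channels=(64, 128, 256, 512)):
    """Return the (H, W, C) shape of the feature map output by each level.

    Assumes size-preserving 3x3 convolutions, so only the 2x2 max pooling
    with stride 2 changes the spatial resolution.
    """
    shapes = []
    h, w = height, width
    for c in channels:
        h, w = h // 2, w // 2  # 2x2 max pooling, stride 2
        shapes.append((h, w, c))
    return shapes
```

For a 256×256 input tile, this gives 128×128×64, 64×64×128, 32×32×256, and 16×16×512, matching the 1/2, 1/4, 1/8, and 1/16 resolutions listed for Levels 1 to 4.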
[0029] S4. The environmental feature map is processed by a second neural network to extract the environmental context vector. This step includes: obtaining a sequence of environmental feature maps within a time window; and inputting the sequence into a three-dimensional convolutional network or a convolutional long short-term memory network to extract an environmental context vector that encodes the temporal dynamics. Because environmental feature maps form a sequence that changes over time (e.g., one map per day), their temporal dynamics (e.g., continuous high temperatures, sudden drops in humidity) must be captured. This dynamic information is key to guiding spatial attention (e.g., during high temperatures, attention should focus on areas where leaves are curled). A second neural network is therefore used to encode the temporal dynamics of the environmental feature maps. The specific technical solution and operational details are as follows:
Time window selection: Based on the response cycle of rice to environmental changes (such as leaf curling appearing after 3 consecutive days of high temperature), a 7-day environmental feature map sequence is selected as input (the window slides with a step of 1 day to ensure coverage of key temporal changes).
Network structure: A convolutional long short-term memory network (ConvLSTM) or a 3D convolutional network (3DCNN) is used.
ConvLSTM (preferred choice, excels at capturing long-term dependencies): The input is 7×H×W×C (7 days; H and W are the height and width of the feature map; C is the number of environmental parameters, e.g., C=4 for four parameters such as temperature and humidity). It contains two hidden layers: the first has 64 3×3 convolutional kernels (time step 1) and the second has 32 3×3 convolutional kernels.
Through gating mechanisms (input gate, forget gate, output gate), it selectively preserves temporal information (such as forgetting short-term random fluctuations and preserving continuous high temperature trends). The final output is a 1×1×D environmental context vector (D=32), which encodes the environmental dynamics over 7 days (such as "continuous high temperature and drought" and "sudden drop in humidity").
[0030] 3DCNN (suitable for short-term, highly variable scenarios): The input is the same as above. Three 3D convolutional layers (3×3×3 kernels, i.e., a temporal extent of 3) with strides of (1,2,2) progressively compress the spatiotemporal dimensions, and the environmental context vector is then generated by global average pooling, emphasizing abrupt spatiotemporal changes in environmental characteristics (such as the surge in humidity after a rainstorm). Through the above steps and technical means, the environmental feature map sequence is encoded into an environmental context vector containing temporal dynamics, providing a basis for the subsequent generation of "targeted" spatial attention.
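Two pieces of the step above lend themselves to a small sketch: slicing the daily environmental feature maps into 7-day windows with a 1-day step, and collapsing a feature volume into a context vector by global average pooling. The ConvLSTM and 3D convolution layers themselves are not reproduced here. The function names and the nested-list layout `[T][H][W][D]` are assumptions made for the example.

```python
def sliding_windows(daily_maps, window=7, step=1):
    """Split a chronological list of daily maps into overlapping windows."""
    return [daily_maps[i:i + window]
            for i in range(0, len(daily_maps) - window + 1, step)]


def global_average_pool(volume):
    """Average a [T][H][W][D] feature volume down to a length-D vector."""
    cells = [cell for frame in volume for row in frame for cell in row]
    depth = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells)
            for k in range(depth)]
```

With 10 days of maps, `sliding_windows` yields 4 windows (days 1 to 7 through days 4 to 10), so each new day produces one new input sequence for the second neural network.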
[0031] S5. Combine the environmental context vector with the spatial feature map; input the combined features into the attention generation unit to compute a spatial attention map; use the spatial attention map to weight the spatial feature map to obtain a fused spatiotemporal feature map. Traditional fusion methods simply concatenate or add features and cannot dynamically adjust the focus on spatial features based on the environment (e.g., prioritizing analysis of potentially arid areas during high temperatures), so subtle early anomalies are neglected. Step S5 therefore guides spatial attention through the environmental context vector to achieve dynamic fusion, specifically including the following steps:
S5.1.1 Feature combination: Broadcast or tile the environmental context vector along the spatial dimensions so that its spatial dimensions match those of the spatial feature map, then concatenate the broadcast environmental context vector with the spatial feature map along the channel dimension. Specifically, the context vector is 1×1×D and must be aligned spatially with the spatial feature map (H×W×C): the context vector is broadcast (copied) spatially to H×W×D so that its spatial dimensions match those of the spatial feature map. Concatenating the broadcast environmental context vector and the spatial feature map along the channel dimension yields a combined feature of H×W×(C+D) (e.g., 128 channels of spatial features + 32 channels of environmental context, for a total of 160 channels). Through step S5.1.1, the initial dimensional alignment of environmental information and spatial features is achieved, providing input for attention map generation.
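The broadcast-and-concatenate operation of step S5.1.1 can be illustrated with plain nested lists. This is a sketch for clarity only; the function name and the `[H][W][C]` list layout are assumptions, and a real implementation would use a tensor library.

```python
def broadcast_and_concat(spatial_map, context):
    """Append the length-D context vector to every pixel's C channels.

    spatial_map: nested lists of shape [H][W][C].
    context: list of length D (the 1x1xD environmental context vector).
    Returns nested lists of shape [H][W][C + D].
    """
    return [[list(pixel) + list(context) for pixel in row]
            for row in spatial_map]
```

Every pixel receives an identical copy of the context vector, so a 128-channel spatial feature map and a 32-dimensional context vector give 160 channels per pixel, as in the example above.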
[0032] S5.1.2 Spatial attention map generation: The attention generation unit includes at least one 1×1 convolutional layer followed by a Sigmoid or Softmax activation function. Specifically, in this embodiment the unit consists of a 1×1 convolutional layer (reducing to 1 channel) and a sigmoid activation function. The combined features are input into this unit: the 1×1 convolutional layer compresses the 160 channels into 1 channel (its weights are learned through training to highlight features where environmental and spatial information are related), and the sigmoid outputs an H×W×1 spatial attention map (values range from 0 to 1; the higher the value, the more attention the region requires). Example: If the environmental context vector encodes "continuous high temperature", the attention map assigns high values (e.g., 0.8-1.0) to areas where the leaves may curl (texture anomalies in the spatial feature map) and low values (e.g., 0.2-0.3) to normal areas. Through step S5.1.2, the regions of interest in the spatial features are initially located based on the initial environmental context vector.
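A 1×1 convolution that reduces to a single channel is simply a per-pixel weighted sum over the channels, so the attention generation unit can be sketched without a deep learning framework. The function name, the explicit `weights` and `bias` arguments (which would be learned during training), and the nested-list layout are assumptions made for the example.

```python
import math


def attention_map(combined, weights, bias=0.0):
    """Apply a 1x1 convolution (to 1 channel) followed by a sigmoid.

    combined: nested lists of shape [H][W][C + D].
    weights: list of length C + D, one weight per input channel.
    Returns an [H][W] map of attention values in (0, 1).
    """
    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    return [[sigmoid(sum(w * x for w, x in zip(weights, pixel)) + bias)
             for pixel in row]
            for row in combined]
```

A pixel whose weighted channel sum is 0 receives an attention value of exactly 0.5; larger positive sums push the value toward 1 and negative sums toward 0.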
[0033] S5.1.5 Weighted fusion: The spatial feature map is weighted pixel by pixel using the spatial attention map (the value of each channel at a pixel × the attention value at the corresponding position) to obtain the preliminarily fused spatiotemporal feature map of size H×W×C (spatial features are preserved while the signal of key areas is enhanced). Through step S5.1.5, feature fusion is completed, key-area features are enhanced, and misleading guidance is avoided.
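The pixel-by-pixel weighting can be written as one nested comprehension in which each pixel's channels are all scaled by the single attention value at that position. The function name and list layout below are assumptions made for the example.

```python
def apply_attention(spatial_map, attention):
    """Scale every channel of each pixel by that pixel's attention value.

    spatial_map: nested lists of shape [H][W][C].
    attention: nested lists of shape [H][W] with values in [0, 1].
    Returns the fused feature map, shape [H][W][C].
    """
    return [[[a * v for v in pixel]
             for pixel, a in zip(row, att_row)]
            for row, att_row in zip(spatial_map, attention)]
```

The output has the same H×W×C shape as the input: a pixel with attention 1.0 passes through unchanged, while one with attention 0.5 has all its channel values halved.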
[0034] Considering that in actual deployments, sensors may provide abnormal data (e.g., humidity is always 0) or missing data due to damage, power outages, or communication failures, blind use could lead to weighting errors. Therefore, before performing step S5.1.5, weighted fusion, the reliability of the time series data or the environmental feature map is evaluated; based on the reliability, the weighting strength of the spatial attention map on the spatial feature map is dynamically adjusted; specifically, between steps S5.1.2 and S5.1.5, the following steps are also set: S5.1.3, Assess the reliability of environmental data: Calculate the standard deviation (reflecting stability) and missing rate (reflecting completeness) of sensor data within the time window; if the standard deviation > mean × 20% (excessive fluctuation) or the missing rate > 10% (incomplete data), it is marked as low reliability; specifically: First, extract the "raw sensor data within the time window" (such as 7 days of temperature, humidity, and soil moisture data) used to generate the environmental context vector in step S4. Then, calculate the standard deviation. For each environmental parameter (such as temperature), calculate the mean (μ) and standard deviation (σ) of all data within the time window, and determine if σ > μ × 20% (excessive fluctuation). Next, calculate the missing rate. Count the amount of missing data for a parameter within the time window, divide it by the total amount of data to obtain the missing rate, and determine if the missing rate > 10% (incomplete data). Finally, if either the standard deviation > mean × 20% (excessive fluctuation) or the missing rate > 10% (incomplete data) is met, mark the environmental data as low reliability; otherwise, mark it as high reliability. 
Through step S5.1.3, the stability (standard deviation) and completeness (missing rate) of the environmental data are quantified, providing a basis for weight adjustment and preventing unreliable data from misleading the attention.
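The reliability check of step S5.1.3 reduces to two threshold tests per environmental parameter. The sketch below follows the stated thresholds (σ > 20% of the mean, or missing rate > 10%). The function name, the use of `None` for missing readings, the population standard deviation, and taking the absolute value of the mean (so that parameters with negative means behave sensibly) are assumptions not specified in the text.

```python
from statistics import mean, pstdev


def assess_reliability(readings, fluctuation_limit=0.20, missing_limit=0.10):
    """Classify one parameter's readings in the time window as 'high' or 'low'.

    readings: list of floats, with None marking a missing sample.
    """
    present = [r for r in readings if r is not None]
    if not present:
        return "low"  # nothing was recorded in the window
    missing_rate = 1 - len(present) / len(readings)
    mu, sigma = mean(present), pstdev(present)
    too_volatile = sigma > abs(mu) * fluctuation_limit
    too_sparse = missing_rate > missing_limit
    return "low" if too_volatile or too_sparse else "high"
```

For instance, temperatures of 30, 31, 29, and 30 °C are rated "high", whereas the same window with half its readings missing, or one alternating between 10 and 30 °C, is rated "low".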
[0035] S5.1.4 Dynamically Adjust Weights: For low reliability, multiply the attention map value by 0.5 (reduce the weight of environment guidance to avoid misguidance); for high reliability, maintain the original weights; specifically: If marked as low reliability: multiply all pixel values in the initial spatial attention map generated in step S5.1.2 by 0.5 (e.g., original attention value 0.8, adjusted to 0.4; original 0.3, adjusted to 0.15). If marked as high reliability: keep the pixel values of the initial spatial attention map unchanged; Through the above steps S5.1.4, the guiding strength of unreliable environmental data on attention is reduced, thus reducing false focusing; when reliability is high, the original guiding weight is retained to ensure effective focus on key areas.
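Step S5.1.4 amounts to a uniform rescaling of the attention map. A minimal sketch, with a hypothetical function name and the reliability label passed in as the string "low" or "high", is:

```python
def adjust_attention(attention, reliability, low_factor=0.5):
    """Halve the attention map when the environmental data is unreliable.

    attention: nested lists of shape [H][W].
    reliability: "low" or "high", as produced by the reliability check.
    """
    factor = low_factor if reliability == "low" else 1.0
    return [[a * factor for a in row] for row in attention]
```

This reproduces the worked numbers in the text: with low reliability, attention values of 0.8 and 0.3 become 0.4 and 0.15, and with high reliability they are left unchanged.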
[0036] Next, in step S5.1.5, the spatial feature map is weighted pixel by pixel using the adjusted initial spatial attention map. In this way, based on the environment guidance adjusted for reliability, the first feature fusion is completed, the key region features are initially strengthened and erroneous guidance is avoided.
[0037] Steps S5.1.1 to S5.1.5 provide only one-way guidance, in which environmental data guides the visual representation. In practice, the visual representation can also guide the interpretation of environmental data. This embodiment 1 therefore adds a reverse-weighting branch, in which vision guides the environment, on top of the environment-guided fusion. That is, the spatial feature map or its derived features are used to reverse-weight the environmental context vector or the environmental feature map to achieve bidirectional attention fusion. Specifically: S5.2.1 Reverse-weight the context vector using the spatial feature map: global average pooling is performed on the "preliminary fused spatiotemporal feature map" obtained in step S5.1.5, compressing the H×W×C feature map into a 1×1×C spatial global vector (for each channel, the average over all spatial positions is taken). This extracts the global information of the spatial features (reflecting the overall spatial distribution trend) as a feedback signal for evaluating how well they match the environmental information.
[0038] S5.2.2 Perform a dot product operation between the spatial global vector (1×1×C) and the original environmental context vector (1×1×D) output in step S4 to calculate the similarity between the two vectors. The result is a weight coefficient between 0 and 1; the higher the value, the better the spatial features match the environmental information. This quantifies the global correlation between spatial features and environmental information and provides a quantitative basis for optimizing the direction of environmental guidance.
[0039] S5.2.3 Scale the original environmental context vector using the weight coefficient obtained in step S5.2.2: optimized environmental context vector = original environmental context vector × weight coefficient (e.g., if the weight coefficient is 0.6 and one dimension of the original vector has the value 1.2, the optimized value is 0.72). This weakens mismatched environmental information (when the weight is low) and retains matching environmental information (when the weight is high), thereby correcting the accuracy of environmental guidance.
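Steps S5.2.2 and S5.2.3 can be sketched together as shown below. The description does not state how the dot product is normalized into the range 0 to 1, so this sketch makes two explicit assumptions: the two vectors have equal length (C = D, for example after a projection), and the dot product is normalized to a cosine similarity with negative values clamped to 0. The function names `matching_weight` and `scale_context` are hypothetical.

```python
import math


def matching_weight(spatial_vec, context_vec):
    """Return a 0-1 matching weight (assumed form: clamped cosine similarity)."""
    if len(spatial_vec) != len(context_vec):
        raise ValueError("a dot product needs vectors of equal length (C == D)")
    dot = sum(s * e for s, e in zip(spatial_vec, context_vec))
    norm = (math.sqrt(sum(s * s for s in spatial_vec))
            * math.sqrt(sum(e * e for e in context_vec)))
    if norm == 0.0:
        return 0.0  # an all-zero vector carries no matching information
    return min(1.0, max(0.0, dot / norm))


def scale_context(context_vec, weight):
    """Step S5.2.3: multiply every dimension of the context vector by the weight."""
    return [value * weight for value in context_vec]


weight = matching_weight([3.0, 4.0], [4.0, 3.0])   # about 0.96
optimized = scale_context([1.2, -0.5], 0.6)        # first entry is about 0.72
```

Other normalizations (for example a sigmoid applied to the raw dot product) would also yield a coefficient between 0 and 1; the choice here is only for illustration.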
[0040] Then, iterative fusion is performed to generate the final optimized fused feature map. The specific steps are as follows:
S5.3.1 Repeat the operation of step S5.1.1 on the "optimized environmental context vector" obtained in step S5.2.3, namely spatial broadcasting (to H×W×D) and concatenation with the spatial feature map along the channel dimension (giving H×W×(C+D)), to obtain the optimized combined features. Step S5.3.1 thus realigns the dimensions based on the corrected environmental guidance information, providing the input for generating an accurate attention map.
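The spatial broadcasting and channel concatenation used in steps S5.1.1 and S5.3.1 can be illustrated as follows. This is a sketch only: the name `broadcast_and_concat` is hypothetical, and nested lists stand in for the H×W×C feature tensor and the length-D context vector.

```python
def broadcast_and_concat(feature_map, context_vec):
    """Tile a length-D context vector over H x W and append it to each pixel's C channels."""
    return [[pixel + list(context_vec) for pixel in row] for row in feature_map]


features = [[[1.0, 2.0], [3.0, 4.0]]]   # H=1, W=2, C=2
context = [0.72, 0.1]                   # D=2
print(broadcast_and_concat(features, context))
# [[[1.0, 2.0, 0.72, 0.1], [3.0, 4.0, 0.72, 0.1]]]
```

Each pixel ends up with C+D channels, the last D of which are identical at every position.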
[0041] S5.3.2 Input the optimized combined features into the attention generation unit (same structure as in step S5.1.2) to generate the optimized spatial attention map (H×W×1). Through step S5.3.2, based on environmental guidance with a higher matching degree, the areas of the spatial features that truly need attention are accurately located (for example, under a "high humidity environment", attention focuses only on the "early-stage rice blast lesion area" rather than the whole paddy field).
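One possible form of the attention generation unit is a single convolutional layer followed by a Sigmoid activation. The sketch below assumes a 1×1 kernel with one output channel, so each pixel's attention value depends only on that pixel's C+D combined channels; the kernel size, the function name `spatial_attention`, and the example weights are assumptions, and in practice the weights would be learned.

```python
import math


def _sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def spatial_attention(combined, weights, bias=0.0):
    """Apply an assumed 1x1 convolution (one output channel) and a sigmoid.

    combined: H x W x (C+D) nested list; weights: C+D kernel weights.
    Returns an H x W nested list of attention values in (0, 1).
    """
    return [
        [_sigmoid(sum(w * v for w, v in zip(weights, pixel)) + bias) for pixel in row]
        for row in combined
    ]


combined = [[[1.0, 2.0, 0.72], [0.0, 0.0, 0.72]]]
attention = spatial_attention(combined, weights=[0.5, 0.5, -1.0])
```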
[0042] S5.3.3 Directly reuse the "low / high reliability" flag from step S5.1.3 to adjust the optimized attention map generated in step S5.3.2: for low reliability, multiply all pixel values by 0.5; for high reliability, keep the pixel values unchanged. Because the reliability of the environmental data is unchanged (it is still based on sensor data within the same time window), reusing the evaluation result avoids redundant calculation while ensuring that the weights of the second fusion still avoid unreliable guidance.
[0043] S5.3.4 Use the "adjusted optimized attention map" obtained in step S5.3.3 to reweight the spatial feature map (in the same way as in step S5.1.5) to obtain the final fused spatiotemporal feature map (H×W×C). Step S5.3 combines "reliability adjustment" and "two-way feedback optimization" to strengthen the signals in key areas of the spatial features that match the environmental information and to suppress irrelevant interference and misguidance.
[0044] Finally, multi-scale fusion and the downstream processes are performed, with the following specific steps:
S5.4.1 For the spatial feature maps at other scales output by the first neural network (such as 1/2, 1/8, and 1/16 resolution), repeat steps S5.1 to S5.3 above to generate the "final fused spatiotemporal feature map" at the corresponding scale. For example, the small-scale (1/2 resolution) fused map is used to capture early anomalies at the leaf level, and the large-scale (1/16 resolution) map is used to capture overall stress at the plot level. Step S5.4.1 thus provides the dual assurance of "reliability and two-way optimization" for multi-scale features, taking into account both local details (small scale, such as leaf lesions) and global trends (large scale, such as overall drought in the field).
[0045] S5.4.2 The "final fused spatiotemporal feature maps" at all scales are passed into the decoder of the U-shaped network through the skip connections (step S3) for subsequent generation of detection results (step S6). This provides "interference-free and high-focus" fused features for high-precision detection, so that early subtle anomalies (such as small areas of rice blast spots) are not missed and false detections are avoided.
[0046] Through steps S5.1 to S5.4 above, dynamic and unbalanced fusion guided by environmental context is achieved, key regional features are enhanced, and early anomalies (such as slight leaf curling under high temperature) can be accurately captured, solving the "blindness" of traditional static fusion.
[0047] S6. Generate detection results based on the fused spatiotemporal feature map. Because the fused features must be transformed into specific detection results (such as rice growth level, and pest and disease type and location), multi-scale features need to be combined to achieve accurate localization and classification. Therefore, the decoder of the U-shaped network is used to process the multi-scale fused features. The specific operation steps and technical means are as follows:
Decoder structure: the decoder contains 4 upsampling levels. Each level doubles the resolution of the feature map through transposed convolution (2×2, stride 2) and concatenates the result with the fused feature map of the corresponding encoder scale through a skip connection.
Output layer: the last upsampling level outputs a feature map with the same resolution as the original spatial data. This map is then processed by a 1×1 convolutional layer (number of channels = number of detection categories, e.g., the 3 categories "normal", "drought stress", and "rice blast") and a Softmax activation function to generate pixel-level classification results (each pixel corresponds to one detection category).
Post-processing: morphological filtering is applied to the classification results (noise regions with an area of less than 5 pixels are removed), and the detection result map (with annotated abnormal areas and types) and statistical reports (such as the percentage of abnormal area and the main stress types) are output.
Through the above steps and technical means, combined with the multi-scale fused features, accurate detection of rice growth status is achieved, and early abnormalities (such as early lesions of rice blast) can be detected.
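The post-processing described above can be approximated by the following sketch. It is not the morphological filter of the embodiment itself: as a stand-in, it removes 4-connected regions smaller than 5 pixels by connected-component area filtering. The class encoding (0 = "normal", treated as background), the 4-connectivity, and the function names `remove_small_regions` and `abnormal_percentage` are assumptions made for illustration.

```python
from collections import deque


def remove_small_regions(label_map, min_area=5, background=0):
    """Relabel 4-connected abnormal regions smaller than min_area as background."""
    height, width = len(label_map), len(label_map[0])
    cleaned = [row[:] for row in label_map]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if seen[y][x] or label_map[y][x] == background:
                continue
            # Breadth-first search collects one connected region of a single class.
            label = label_map[y][x]
            region, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                region.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and label_map[ny][nx] == label):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) < min_area:
                for ry, rx in region:
                    cleaned[ry][rx] = background
    return cleaned


def abnormal_percentage(label_map, background=0):
    """Percentage of pixels whose class is not the background ("normal") class."""
    total = sum(len(row) for row in label_map)
    abnormal = sum(1 for row in label_map for value in row if value != background)
    return 100.0 * abnormal / total


labels = [[0, 0, 0, 0, 0],
          [0, 2, 0, 1, 1],
          [0, 0, 0, 1, 1],
          [0, 0, 0, 1, 1]]   # hypothetical classes: 1 = drought stress, 2 = rice blast
cleaned = remove_small_regions(labels)
print(abnormal_percentage(cleaned))
# 30.0
```

In the example, the isolated single-pixel region of class 2 is removed as noise, while the 6-pixel region of class 1 is retained and accounts for 30% of the map.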
[0048] Embodiment 2
To implement the method of Embodiment 1, and as shown in Figure 2, Embodiment 2 provides a rice detection system based on environmental data and deep learning. The system includes:
The data acquisition module, which is used to acquire spatial data and point-distributed time-series data; specifically:
Hardware interfaces: connect to drones (such as the DJI M300RTK), satellite data platforms (such as Google Earth Engine), and sensor networks (LoRa gateways).
Software functions: automatically acquire spatial and time-series data at scheduled times (e.g., 9 AM daily), perform preprocessing (correction, noise reduction), store the data in a distributed database (e.g., PostgreSQL + PostGIS), and support queries by time / spatial range.
[0049] The interpolation module is used to perform spatial interpolation on the point-distributed time-series data to generate an environmental feature map; specifically: the module integrates algorithm libraries for Kriging interpolation, inverse distance weighting, and spline interpolation (such as Python's GDAL library). It receives the preprocessed point-distributed time-series data, automatically selects an interpolation algorithm based on the data type (e.g., Kriging for temperature, splines for pH), generates an environmental feature map with the same resolution as the spatial data, and saves it in TIFF format.
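Of the three interpolation options, inverse distance weighting is the simplest to illustrate. The sketch below is a plain-Python version rather than a call into any particular library; the function name `idw_interpolate`, the use of grid-cell coordinates for station positions, and the exponent `power=2.0` are assumptions, and at least one station reading is assumed to be present.

```python
def idw_interpolate(stations, width, height, power=2.0):
    """Inverse distance weighting of (x, y, value) readings onto a height x width grid."""
    grid = []
    for row in range(height):
        grid_row = []
        for col in range(width):
            weighted_sum = weight_total = 0.0
            exact = None
            for x, y, value in stations:
                dist_sq = (col - x) ** 2 + (row - y) ** 2
                if dist_sq == 0:
                    exact = value  # the grid cell coincides with a station
                    break
                weight = 1.0 / dist_sq ** (power / 2.0)
                weighted_sum += weight * value
                weight_total += weight
            grid_row.append(exact if exact is not None else weighted_sum / weight_total)
        grid.append(grid_row)
    return grid


# Two hypothetical temperature stations at the ends of a 3 x 1 grid.
stations = [(0, 0, 20.0), (2, 0, 30.0)]
print(idw_interpolate(stations, width=3, height=1))
# [[20.0, 25.0, 30.0]]
```

Cells that coincide with a station take the measured value directly; all other cells receive a distance-weighted average, so the resulting map is spatially continuous.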
[0050] The first feature extraction module is used to process the spatial data using a first neural network to extract a spatial feature map; specifically: the encoder of the U-shaped network is deployed (based on the PyTorch framework) and loaded with pre-trained weights (pre-trained on a public rice remote sensing dataset). The module takes the spatial data as input and outputs spatial feature maps at four scales (saved in tensor format for use by subsequent modules).
[0051] The second feature extraction module is used to process the environmental feature map using a second neural network to extract the environmental context vector; specifically: a ConvLSTM or 3D CNN network is deployed (based on the TensorFlow framework). The module takes the environmental feature map sequence (7-day window) as input and outputs the environmental context vector (saved as a 1×1×D tensor).
[0052] The spatiotemporal attention fusion module is used to combine the environmental context vector with the spatial feature map; input the combined features into the attention generation unit to calculate and generate a spatial attention map; and use the spatial attention map to weight the spatial feature map to obtain the fused spatiotemporal feature map; specifically: the module implements feature broadcasting, channel concatenation, attention map generation (convolutional layer + sigmoid), and weighted fusion logic. It also integrates a reliability assessment module (which calculates the standard deviation and missing rate) and the bidirectional attention logic, and outputs the multi-scale fused spatiotemporal feature maps.
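The reliability assessment integrated in this module could, for instance, be sketched as below. This paragraph specifies only that the standard deviation and missing rate are calculated; the decision rule (either statistic exceeding a threshold yields "low"), both threshold values, the use of `None` for a missing reading, and the function name `assess_reliability` are hypothetical placeholders.

```python
import statistics


def assess_reliability(readings, max_std=5.0, max_missing_rate=0.2):
    """Return "low" or "high" for one sensor time window.

    readings: list of floats, with None marking a missing sample.
    max_std and max_missing_rate are hypothetical thresholds, not source values.
    """
    if not readings:
        return "low"  # no data at all is treated as unreliable
    present = [r for r in readings if r is not None]
    missing_rate = 1.0 - len(present) / len(readings)
    if not present:
        return "low"
    std = statistics.pstdev(present)  # population standard deviation of the window
    if std > max_std or missing_rate > max_missing_rate:
        return "low"
    return "high"


print(assess_reliability([25.0, 25.5, 26.0, 25.2]))
# high
print(assess_reliability([25.0, 25.5, None, None, 26.0]))
# low
```

The returned flag is the kind of value that the 0.5 scaling of the attention map in the fusion steps can consume.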
[0053] The result generation module is used to generate detection results based on the fused spatiotemporal feature map; specifically: the decoder of the U-shaped network is deployed to receive the multi-scale fused features and output pixel-level classification results. The module integrates post-processing components (morphological filtering, result statistics) to generate a visual detection report (PDF format) and provides an API (for use by agricultural management systems).
[0054] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely preferred examples and are not intended to limit the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of the present invention is defined by the appended claims and their equivalents.
Claims
1. A rice detection method based on environmental data and deep learning, characterized in that it includes the following steps: S1. Acquire spatial data and point-distributed time series data; S2. Perform spatial interpolation on the point-distributed time series data to generate an environmental feature map; S3. Process the spatial data using a first neural network to extract spatial feature maps; S4. Process the environmental feature map using a second neural network to extract the environmental context vector; S5. Combine the environmental context vector with the spatial feature map; input the combined features into the attention generation unit to calculate and generate a spatial attention map; use the spatial attention map to weight the spatial feature map to obtain a fused spatiotemporal feature map; S6. Generate detection results based on the fused spatiotemporal feature map.
2. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the step of performing spatial interpolation on the point-distributed time series data includes: converting the point-distributed time series data into a spatially continuous environmental feature map using Kriging interpolation, inverse distance weighting, or spline interpolation.
3. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the step of processing the environmental feature map using the second neural network includes: obtaining a sequence of environmental feature maps within a time window; and inputting the environmental feature map sequence into a three-dimensional convolutional network or a convolutional long short-term memory network to extract the environmental context vector that encodes the temporal dynamics.
4. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the feature combination in step S5 includes: broadcasting or tiling the environmental context vector in the spatial dimension so that its spatial dimension is consistent with that of the spatial feature map; and concatenating the broadcast environmental context vector with the spatial feature map along the channel dimension.
5. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the attention generation unit in step S5 includes: at least one convolutional layer followed by a Sigmoid or Softmax activation function.
6. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the step of processing the spatial data using the first neural network includes: extracting the spatial feature maps at multiple scales; and step S5 is performed at at least one of the multiple scales to generate the fused spatiotemporal feature map at the corresponding scale.
7. The rice detection method based on environmental data and deep learning according to claim 6, characterized in that the first neural network is part of a U-shaped network architecture with an encoder and a decoder; step S5 is performed on multiple downsampling levels of the encoder; and the decoder receives the fused spatiotemporal feature maps generated at the multiple scales via skip connections to generate the detection results.
8. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the method further includes: before performing the weighting step in step S5, evaluating the reliability of the time series data or the environmental feature map; and, based on the reliability, dynamically adjusting the weighting strength of the spatial attention map on the spatial feature map.
9. The rice detection method based on environmental data and deep learning according to claim 1, characterized in that the method further includes: using the spatial feature map or its derived features to perform reverse weighting on the environmental context vector or the environmental feature map to achieve bidirectional attention fusion.
10. A rice detection system based on environmental data and deep learning, wherein the rice detection system is used to implement the rice detection method based on environmental data and deep learning as described in any one of claims 1 to 9, characterized in that it includes: a data acquisition module, used to acquire spatial data and point-distributed time series data; an interpolation module, used to perform spatial interpolation on the point-distributed time series data to generate an environmental feature map; a first feature extraction module, used to process the spatial data using a first neural network to extract a spatial feature map; a second feature extraction module, used to process the environmental feature map using a second neural network to extract the environmental context vector; a spatiotemporal attention fusion module, used to combine the environmental context vector with the spatial feature map, input the combined features into the attention generation unit to calculate and generate a spatial attention map, and use the spatial attention map to weight the spatial feature map to obtain the fused spatiotemporal feature map; and a result generation module, used to generate detection results based on the fused spatiotemporal feature map.