A plant nitrogen nutrition monitoring method based on multi-modal data

CN122200549A · Pending · Publication Date: 2026-06-12 · SUZHOU ACAD OF AGRI SCI (JIANGSU TAIHU REGIONAL AGRI SCI INST)

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SUZHOU ACAD OF AGRI SCI (JIANGSU TAIHU REGIONAL AGRI SCI INST)
Filing Date
2026-03-19
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing methods for monitoring plant nitrogen nutrition based on hyperspectral images suffer from problems such as limited feature extraction, insufficient model robustness, and inadequate information utilization, making it difficult to achieve high-precision and stable monitoring in real field environments.

Method used

A multimodal deep learning model is constructed, which integrates hyperspectral images and spatiotemporal metadata. Through adaptive size adjustment and global average pooling layer design, combined with multidimensional data augmentation and feature fusion, the model is optimized to improve robustness and accuracy.

Benefits of technology

The method raises the accuracy of plant nitrogen nutrition monitoring from 33% to over 94%, improves the model's robustness and adaptability to complex field scenarios, and makes the results more interpretable.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to a plant nitrogen nutrition monitoring method based on multimodal data. The method includes: collecting hyperspectral images, spatiotemporal metadata, and label data of plants, and constructing a multimodal dataset after key band screening; performing adaptive resizing and multidimensional data augmentation on the hyperspectral images, while feature-encoding and concatenating the spatiotemporal metadata; constructing a multimodal deep learning model and training and optimizing it end-to-end on the multimodal dataset; and finally, using the trained model to extract features from the hyperspectral images of the plant to be tested, map its spatiotemporal metadata, fuse the two, and output the nitrogen nutrition status classification result. By fusing multi-source information, the present application addresses the narrow feature extraction and poor model robustness of the prior art, improves monitoring accuracy from 33% to more than 94%, and offers high adaptability, strong interpretability, and easy scalability.

Description

Technical Field

[0001] This invention belongs to the field of smart agriculture and crop phenotyping technology, and particularly relates to a method for monitoring plant nitrogen nutrition based on multimodal data.

Background Technology

[0002] Plant nitrogen nutrition status is a key factor affecting crop yield and quality. Traditional nitrogen nutrition monitoring methods (such as chemical analysis), while highly accurate, have limitations such as destructive sampling, time-consuming and labor-intensive processes, and the inability to monitor in real time. In recent years, non-destructive testing technologies based on hyperspectral imaging have developed rapidly. This technology indirectly reflects the internal physiological and biochemical information of plants by capturing the reflectance spectral characteristics of plant leaves to different wavelengths of light, thereby achieving the assessment of nutritional status.

[0003] However, existing monitoring methods based on hyperspectral images still face significant drawbacks:

[0004] First, feature extraction is overly simplistic. Existing technologies largely rely solely on the spectral or spatial texture information of the hyperspectral images themselves, failing to adequately consider key spatiotemporal factors influencing spectral responses (such as leaf position information corresponding to leaf physiological age and the number of days of nutrient deficiency corresponding to the duration of nutrient stress). Under the same nutrient conditions, the spectral characteristics of new leaves differ from those of old leaves, and the spectral responses in the early and late stages of stress also differ. Classification based solely on image data makes it difficult for models to learn these deep spatiotemporal patterns, resulting in limited generalization ability and interpretability.

[0005] Secondly, the model lacks robustness. In real field environments, the collected leaf images often have problems such as varying sizes, diverse postures, and background interference. Existing models mostly assume that the input image size is fixed, are highly dependent on preprocessing (such as cropping and scaling), and have poor adaptability to natural changes in the image (such as leaf size and position), affecting the stability and accuracy of actual deployment.

[0006] Third, information is not fully utilized; much of the rich metadata recorded during the collection process (such as processing type, sampling time, leaf position, etc.) is discarded or used only as sample labels, failing to be effectively integrated with image features, resulting in a waste of prior knowledge.

[0007] Therefore, there is an urgent need to develop a robust intelligent monitoring method for plant nitrogen nutrition that can integrate multi-source heterogeneous information to overcome the above-mentioned shortcomings.

Summary of the Invention

[0008] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method for monitoring plant nitrogen nutrition based on multimodal data.

[0009] This method for monitoring plant nitrogen nutrition based on multimodal data includes the following steps:

[0010] Step S1: Collect hyperspectral images, spatiotemporal metadata, and label data of the plants; perform key band filtering on the hyperspectral images; and construct a multimodal dataset.

[0011] Step S2: Adaptively resize and perform multi-dimensional data augmentation on the hyperspectral images in the multimodal dataset; perform feature encoding and concatenation of spatiotemporal metadata to obtain numerical feature vectors;

[0012] Step S3: Construct a multimodal deep learning model;

[0013] Step S4: Train the multimodal deep learning model end-to-end using a multimodal dataset, and optimize the multimodal deep learning model.

[0014] Step S5: Obtain hyperspectral images and spatiotemporal metadata of the leaves of the plant to be tested, and use the multimodal deep learning model trained and optimized in step S4 to monitor the nutrition of the plant to be tested. The model extracts features from the hyperspectral images, maps the feature-encoded and concatenated spatiotemporal metadata, and fuses the extracted image features with the mapped metadata to output the nutrition monitoring classification result. The nitrogen nutrition status of the plant to be tested is then determined according to the maximum-probability principle and output.

[0015] As a preferred option, step S1 specifically involves:

[0016] Hyperspectral images, spatiotemporal metadata, and label data of the plants are collected. The spatiotemporal metadata includes the leaf collection time, leaf position information, number of days of nutrient deficiency, and nutrient deficiency stage of the plant. The label data consists of nutrient status labels, which include at least normal, nitrogen-deficient, and potassium-deficient labels. A successive projections algorithm combined with principal component analysis (SPA-PCA) is used to screen the hyperspectral images for key bands that characterize the nitrogen nutrient status of the plants. The key bands of the hyperspectral images are then paired with the structured spatiotemporal metadata. A multimodal dataset including the key bands of the hyperspectral images, the spatiotemporal metadata, and the label data is constructed and divided into training and validation sets according to a set ratio.
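
The band-screening idea can be illustrated with a minimal NumPy sketch of the successive projections algorithm on its own. The source does not say how SPA and PCA are combined, so the PCA stage is left out here. The function name `spa_select`, the choice of starting band, and the input layout (samples × bands) are assumptions made for illustration only.

```python
import numpy as np

def spa_select(X, n_select, start=0):
    """Pick n_select band indices from X (samples x bands) by successive projections.

    Hypothetical helper: at each step the remaining bands are projected onto the
    subspace orthogonal to all bands chosen so far, and the band with the largest
    residual norm (the least collinear one) is selected next.
    """
    R = (X - X.mean(axis=0)).astype(float)   # centre each band
    selected = [start]
    for _ in range(n_select - 1):
        v = R[:, selected[-1]]
        denom = v @ v
        if denom > 0:
            # Remove from every column its component along the last selected band.
            R = R - np.outer(v, (v @ R) / denom)
        norms = np.linalg.norm(R, axis=0)
        norms[selected] = -1.0               # never reselect a band
        selected.append(int(np.argmax(norms)))
    return selected
```

With a matrix of mean leaf spectra, `spa_select(spectra, n_select=10)` would return ten band indices with low mutual collinearity. How the result is refined with PCA is not specified in the source.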

[0017] As a preferred option, step S2 specifically involves:

[0018] For hyperspectral images in the training and validation sets that are smaller than the target size, reflection padding is performed. For hyperspectral images in the training and validation sets that are larger than the target size, bilinear interpolation scaling is performed. Then, spatial domain data augmentation and spectral domain data augmentation are performed on the hyperspectral images after reflection padding or bilinear interpolation scaling.

[0019] The spatiotemporal metadata is feature-encoded and concatenated to obtain numerical feature vectors that the constructed model can process.
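
The size-adaptation rule can be sketched in NumPy as follows. This is an illustration under assumptions, not the applicant's implementation: the function names are hypothetical, images are assumed to be stored as (bands, height, width) arrays, and the handling of images that are too small in one dimension and too large in the other (pad first, then scale) is a guess, since the source does not cover that case.

```python
import numpy as np

def bilinear_resize(img, size):
    """Resize a (bands, H, W) array to (bands, size, size) by bilinear interpolation."""
    _, h, w = img.shape
    ys = np.linspace(0, h - 1, size)          # sample positions in the source grid
    xs = np.linspace(0, w - 1, size)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]             # vertical interpolation weights
    wx = (xs - x0)[None, None, :]             # horizontal interpolation weights
    top = img[:, y0][:, :, x0] * (1 - wx) + img[:, y0][:, :, x1] * wx
    bot = img[:, y1][:, :, x0] * (1 - wx) + img[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def adapt_size(img, target=64):
    """Reflection-pad short dimensions, then scale down anything still too large."""
    _, h, w = img.shape
    pad_h, pad_w = max(target - h, 0), max(target - w, 0)
    if pad_h or pad_w:
        img = np.pad(
            img,
            ((0, 0), (pad_h // 2, pad_h - pad_h // 2), (pad_w // 2, pad_w - pad_w // 2)),
            mode="reflect",
        )
    if img.shape[1] != target or img.shape[2] != target:
        img = bilinear_resize(img, target)
    return img
```

Padding mirrors existing leaf pixels instead of stretching them, which is presumably why the source prefers it for small images.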

[0020] As a preferred option:

[0021] Spatial domain data augmentation includes random flipping, random rotation, and random cropping of the hyperspectral images after reflection padding or bilinear interpolation scaling, to increase spatial invariance.

[0022] Spectral domain data augmentation includes randomly adding noise to, and randomly dropping bands from, the hyperspectral images after reflection padding or bilinear interpolation scaling. This simulates sensor noise and missing band information, improving the robustness of the multimodal deep learning model to spectral perturbations.
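
A minimal NumPy sketch of the two augmentation families follows. The source gives no parameters, so the noise level, band-drop probability, crop size, and the restriction of rotations to multiples of 90 degrees are all assumptions, as is the function name `augment`.

```python
import numpy as np

def augment(img, rng, noise_std=0.01, drop_prob=0.1, crop=56):
    """Apply spatial, then spectral, augmentation to a (bands, H, W) array.

    All default values are illustrative assumptions. `crop` must not exceed
    the image height or width.
    """
    # Spatial domain: random flips and a random rotation by k * 90 degrees.
    if rng.random() < 0.5:
        img = img[:, :, ::-1]
    if rng.random() < 0.5:
        img = img[:, ::-1, :]
    img = np.rot90(img, k=int(rng.integers(0, 4)), axes=(1, 2))
    # Spatial domain: random square crop.
    _, h, w = img.shape
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    img = img[:, top:top + crop, left:left + crop]
    # Spectral domain: additive Gaussian noise, then zero out random bands.
    img = img + rng.normal(0.0, noise_std, size=img.shape)
    keep = rng.random(img.shape[0]) >= drop_prob
    return img * keep[:, None, None]
```

Usage would look like `augment(image, np.random.default_rng(0))`. A cropped output is smaller than the 64×64 input, which a network ending in global average pooling can still accept.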

[0023] As a preferred option, the multimodal deep learning model is a 2D-CNN fusion network model; the multimodal deep learning model includes an image feature extraction branch, a spatiotemporal metadata processing branch, and a feature fusion classification module;

[0024] The image feature extraction branch is used to extract the spatial-spectral features of the hyperspectral image; the spatiotemporal metadata processing branch is used to map the feature-encoded and concatenated spatiotemporal metadata to a spatiotemporal metadata feature vector; the feature fusion classification module is used to fuse the output of the image feature extraction branch with the output of the spatiotemporal metadata processing branch.

[0025] As a preferred option:

[0026] The image feature extraction branch is a two-dimensional convolutional neural network (2D Convolutional Neural Network), which includes multiple convolutional layers, batch normalization layers, and activation functions. The end of the image feature extraction branch also includes a Global Average Pooling (GAP) layer, which outputs a fixed-length image feature vector. The use of GAP allows the 2D Convolutional Neural Network to accept inputs of arbitrary spatial dimensions, making it the core of the robustness of multimodal deep learning models. The spatiotemporal metadata processing branch is a multilayer perceptron. The input to the spatiotemporal metadata processing branch is the numerical feature vector obtained by feature encoding and concatenation of the spatiotemporal metadata from step S2. The feature fusion classification module is used to fuse the outputs of the image feature extraction branch and the spatiotemporal metadata processing branch through concatenation operations along the channel dimension. The fused features are then passed through a classifier containing fully connected layers and Dropout layers to output the probability distribution of nitrogen nutrient status (normal, nitrogen-deficient, and potassium-deficient).
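
Why global average pooling yields a fixed-length vector for any input size can be seen in a few lines of NumPy. The feature-map shapes below are arbitrary examples, not values from the source.

```python
import numpy as np

def global_average_pool(feature_map):
    """Average a (channels, H, W) feature map over H and W, giving a (channels,) vector."""
    return feature_map.mean(axis=(1, 2))

rng = np.random.default_rng(0)
small = rng.random((256, 48, 48))     # feature map from a small leaf image
large = rng.random((256, 80, 120))    # feature map from a larger, non-square image

# Both outputs have length 256, so the same classifier can follow either one.
print(global_average_pool(small).shape, global_average_pool(large).shape)
```

The output length depends only on the number of channels of the last convolutional layer, never on the spatial size of the input.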

[0027] Preferably, in step S4, during the training of the multimodal deep learning model, the performance metrics such as confusion matrix, accuracy, and F1 score of the multimodal deep learning model are compared simultaneously with and without input spatiotemporal metadata, in order to verify the effectiveness of spatiotemporal metadata fusion.
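
The metrics named above can be computed as in the following sketch. The label lists are invented for illustration, the class order (0 = normal, 1 = nitrogen-deficient, 2 = potassium-deficient) is an assumption, and macro-averaging of the F1 score is also an assumption because the source does not say which average is used.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    """Rows index the true class, columns the predicted class."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy_and_macro_f1(cm):
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # per predicted class
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # per true class
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return tp.sum() / cm.sum(), f1.mean()

# Hypothetical predictions with and without metadata on the same six samples.
y_true = [0, 0, 1, 1, 2, 2]
with_meta = [0, 0, 1, 2, 2, 2]
without_meta = [0, 0, 0, 0, 0, 0]
for name, pred in [("with metadata", with_meta), ("without metadata", without_meta)]:
    acc, f1 = accuracy_and_macro_f1(confusion_matrix(y_true, pred))
    print(f"{name}: accuracy={acc:.3f}, macro-F1={f1:.3f}")
```

Reporting both runs side by side each epoch is what lets the gain from metadata fusion be tracked over training.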

[0028] As a preferred option, the specific method for optimizing the multimodal deep learning model in step S4 is as follows: using the cross-entropy loss function to calculate the prediction error of the multimodal deep learning model, using the Adam optimizer to adjust the model parameters of the multimodal deep learning model, monitoring the model performance indicators and iteratively optimizing.

[0029] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method described above.

[0030] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described above.

[0031] The beneficial effects of this invention are as follows: By constructing a multimodal deep learning model and integrating hyperspectral image features with spatiotemporal metadata, this invention improves the accuracy of plant nitrogen nutrition monitoring from 33% to over 94%, addressing the overly narrow feature extraction and poor model robustness of existing technologies. The adaptive size adjustment and global average pooling layer design enhance the model's adaptability to complex field scenarios and reduce its dependence on preprocessing. At the same time, the introduction of spatiotemporal metadata improves the interpretability of the results, and the method combines high accuracy with strong robustness. Furthermore, the method is general and easy to extend: the image branch can be replaced with other CNN architectures, and the metadata branch can readily integrate more types of agricultural information.

Attached Figure Description

[0032] Figure 1 is a flowchart of the plant nitrogen nutrition monitoring method based on multimodal data according to the present invention;

[0033] Figure 2 is a schematic diagram of the structure of the multimodal deep learning model of the present invention.

Detailed Implementation

[0034] The present invention will be further described below with reference to embodiments. The description of the embodiments below is only for the purpose of helping to understand the present invention. It should be noted that those skilled in the art can make several modifications to the present invention without departing from the principle of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

[0035] Example 1

[0036] As shown in Figure 1 and Figure 2, a method for monitoring plant nitrogen nutrition based on multimodal data includes the following steps:

[0037] Step S1: Collect hyperspectral images of tomato plants under different nitrogen and potassium nutrient treatments (normal CK, nitrogen-deficient Nd, potassium-deficient Kd) at different stress days (day 59, day 69, day 73, day 77) and different leaf positions (upper leaf, middle leaf, lower leaf). Use a successive projections algorithm combined with principal component analysis (SPA-PCA) to screen key bands from the full spectrum of the hyperspectral images, and select the 10 key feature bands most relevant to the nitrogen response. Pair the key bands of the hyperspectral images with structured spatiotemporal metadata to generate corresponding TIFF format image files. Generate a structured metadata table containing information such as file name, category label, number of days of nutrient deficiency, leaf position code, and nutrient deficiency stage. Construct a multimodal dataset including the key bands of the hyperspectral images, the spatiotemporal metadata, and the label data, and divide it into training and validation sets according to a preset ratio of 7:3.
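
The metadata table and the 7:3 split can be sketched with the standard library. The record field names and file names are hypothetical, and splitting class by class (stratification) is an assumption, since the source only states the ratio.

```python
import random
from collections import defaultdict

def stratified_split(records, train_ratio=0.7, seed=0):
    """Split metadata records into train/validation lists, one class at a time."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        items = by_label[label][:]
        rng.shuffle(items)
        cut = round(len(items) * train_ratio)
        train.extend(items[:cut])
        val.extend(items[cut:])
    return train, val

# Hypothetical metadata rows: 10 samples for each of the three treatments.
records = [
    {"file": f"{label}_{i:03d}.tif", "label": label,
     "deficiency_days": 59, "leaf_position": "upper"}
    for label in ("CK", "Nd", "Kd") for i in range(10)
]
train, val = stratified_split(records)
print(len(train), len(val))
```

Splitting per class keeps the three treatments equally represented in the validation set.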

[0038] Step S2: Design a robust data preprocessing pipeline. To handle varying input image sizes, perform reflection padding on hyperspectral images in the training and validation sets that are smaller than the target size, and perform bilinear interpolation scaling on those that are larger than the target size. This uniformly adjusts the hyperspectral images to 64×64 pixels while avoiding distortion. Then, perform spatial domain data augmentation and spectral domain data augmentation on the hyperspectral images after reflection padding or bilinear interpolation scaling. Perform feature encoding and concatenation on the spatiotemporal metadata: encode textual or categorical metadata (leaf positions: upper, middle, lower) into numerical vectors, and concatenate them with continuous metadata (the standardized number of nutrient-deficiency days) to form fixed-dimensional numerical feature vectors that the constructed model can process.

[0039] Spatial domain data augmentation includes random flipping, random rotation, and random cropping of the hyperspectral images after reflection padding or bilinear interpolation scaling, to increase spatial invariance.

[0040] Spectral domain data augmentation includes randomly adding noise to, and randomly dropping bands from, the hyperspectral images after reflection padding or bilinear interpolation scaling. This simulates sensor noise and missing band information, improving the robustness of the multimodal deep learning model to spectral perturbations.
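
One possible encoding of the metadata is sketched below. The three-column layout (standardized deficiency days, leaf position code, deficiency stage code) is an assumption chosen to match the metadata table fields; the source does not spell out the exact encoding, and the integer codes and function name are hypothetical.

```python
import numpy as np

LEAF_CODE = {"upper": 0, "middle": 1, "lower": 2}   # assumed ordinal coding

def encode_metadata(days, leaf_positions, stages, days_mean=None, days_std=None):
    """Return an (n_samples, 3) array: [z-scored days, leaf code, stage code].

    Pass the training-set mean and standard deviation at inference time so
    that new samples are standardized the same way as the training data.
    """
    days = np.asarray(days, dtype=float)
    mean = days.mean() if days_mean is None else days_mean
    std = days.std() if days_std is None else days_std
    z = (days - mean) / (std if std > 0 else 1.0)
    leaf = np.array([LEAF_CODE[p] for p in leaf_positions], dtype=float)
    stage = np.asarray(stages, dtype=float)
    return np.stack([z, leaf, stage], axis=1)

features = encode_metadata(
    days=[59, 69, 73, 77],
    leaf_positions=["upper", "middle", "lower", "upper"],
    stages=[0, 1, 2, 3],              # hypothetical stage codes
)
print(features.shape)
```

A one-hot encoding of leaf position would be an equally plausible reading of the text and would change the input dimension of the metadata branch.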

[0041] Step S3: Construct a multimodal deep learning model; the multimodal deep learning model includes an image feature extraction branch, a spatiotemporal metadata processing branch, and a feature fusion classification module; the image feature extraction branch is used to extract the spatial-spectral features of the hyperspectral image; the spatiotemporal metadata processing branch is used to map the feature-encoded and concatenated spatiotemporal metadata to obtain the spatiotemporal metadata feature vector; the feature fusion classification module is used to fuse the outputs of the image feature extraction branch and the spatiotemporal metadata processing branch;

[0042] The image feature extraction branch is a two-dimensional convolutional neural network (2D-CNN) that includes multiple convolutional layers, batch normalization layers, and activation functions, and ends with a global average pooling (GAP) layer. The input to the branch is a 10×64×64 tensor, that is, a stack of the 10 key-band images at 64×64 pixels. The GAP layer outputs a fixed-length image feature vector and enables the network to accept inputs of arbitrary spatial size, which is the core of the strong robustness of the multimodal deep learning model.

[0043] The spatiotemporal metadata processing branch is a multilayer perceptron. The input of the spatiotemporal metadata processing branch is the numerical feature vector obtained by feature encoding and concatenation of the spatiotemporal metadata in step S2. Through fully connected layers and nonlinear activation functions, the spatiotemporal metadata is mapped to a latent space that matches the image features.

[0044] The feature fusion classification module is used to fuse the output of the image feature extraction branch and the output of the spatiotemporal metadata processing branch by concatenating them in the channel dimension. The concatenated fused features are then passed through a classifier containing a fully connected layer and a Dropout layer to output the probability distribution of nitrogen nutrition status (normal, nitrogen-deficient, and potassium-deficient).

[0045] Step S4: Perform end-to-end training of the multimodal deep learning model using a multimodal dataset: Employ a comparative training mode with / without metadata input. In each training epoch, the multimodal deep learning model simultaneously calculates the accuracy with metadata input and the baseline accuracy without metadata input. By continuously monitoring the difference between the two, the effectiveness of metadata fusion is verified. During training, metrics such as confusion matrix, precision, recall, and F1 score are calculated periodically, and training loss curves, accuracy curves, and performance trend charts for each category are visualized for model diagnosis and tuning to verify the effectiveness of spatiotemporal metadata fusion. Furthermore, the multimodal deep learning model is optimized: the cross-entropy loss function is used to calculate the prediction error of the multimodal deep learning model, the Adam optimizer is used to adjust the model parameters of the multimodal deep learning model, and model performance metrics are monitored and iteratively optimized.

[0046] Step S5: Obtain hyperspectral images of the leaves of the plant under test (the 10 bands that characterize plant nitrogen nutrition status after SPA-PCA screening) and spatiotemporal metadata (number of days at collection, leaf position). Use the multimodal deep learning model trained and optimized in step S4 to monitor the nutrition of the plant under test. The model extracts features from the hyperspectral images, maps the feature-encoded and concatenated spatiotemporal metadata, and fuses the extracted image features with the mapped metadata to output the nutrition monitoring classification result, namely the probability distribution over nitrogen nutrition status (normal, nitrogen deficiency, and potassium deficiency). Following the maximum-probability principle, the category with the highest probability is taken as the final nutrition status result.
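
The maximum-probability decision can be written in a few lines of plain Python. The class order and the example scores are assumptions for illustration.

```python
import math

CLASSES = ["normal", "nitrogen-deficient", "potassium-deficient"]  # assumed order

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decide(logits):
    """Return the most probable class name together with all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs

label, probs = decide([0.2, 2.5, -1.0])   # hypothetical model outputs
print(label, [round(p, 3) for p in probs])
```

Returning the full distribution alongside the label lets a downstream user see how confident the classification was.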

[0047] Example 2

[0048] Based on Example 1, a method for identifying nitrogen / potassium deficiency in tomato plants based on multimodal data includes the following steps:

[0049] Data: The provided hyperspectral image dataset of tomato leaves was used, containing 323 training samples (CK: 109, Kd: 103, Nd: 111) and 66 validation samples (22 per class). Each image is a TIFF file containing 10 key bands.

[0050] Model: The multimodal deep learning model structure described in step S3 of Example 1 is adopted, with the image feature extraction branch being a 3-layer CNN, the spatiotemporal metadata processing branch being a 2-layer MLP, and the feature fusion classification module being a 2-layer fully connected layer.

[0051] In the image feature extraction branch (3-layer CNN):

[0052] Convolutional layer 1 has 10 input channels, 64 output channels, a 3×3 kernel, and a padding of 1, and is followed by a batch normalization layer and a ReLU activation function.

[0053] Convolutional layer 2 has 64 input channels, 128 output channels, a 3×3 kernel, and a padding of 1, and is followed by a batch normalization layer and a ReLU activation function.

[0054] Convolutional layer 3 has 128 input channels, 256 output channels, a 3×3 kernel, and a padding of 1, and is followed by a batch normalization layer and a ReLU activation function. The output of convolutional layer 3 is fed into a global average pooling layer, which outputs a 256-dimensional feature vector.

[0055] In the spatiotemporal metadata processing branch (2-layer MLP):

[0056] Input a 3-dimensional feature vector; set up the following layers in sequence: a fully connected layer 1 with an input dimension of 3, an output dimension of 16 and a ReLU activation function; a Dropout layer with a dropout rate of 0.3; and a fully connected layer 2 with an input dimension of 16, an output dimension of 32 and a ReLU activation function, outputting a 32-dimensional feature vector.

[0057] The processing procedure of the feature fusion classification module is as follows: the 256-dimensional feature vector output from the image feature extraction branch is concatenated with the 32-dimensional feature vector output from the spatiotemporal metadata processing branch to obtain a 288-dimensional fused feature. The fused feature is then fed into a fully connected layer 3 with an input dimension of 288, an output dimension of 64, followed by a ReLU activation function and a Dropout layer with a dropout rate of 0.4. The output of the fully connected layer 3 is fed into a fully connected layer 4 with an input dimension of 64, an output dimension of 3, and followed by a Softmax activation function.
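
The layer dimensions listed for Example 2 can be assembled into a PyTorch sketch as follows. This is one reading of the description, not the applicant's code: the class name `FusionNet` is hypothetical, no pooling is placed between the convolutional layers because the source mentions none, and the final Softmax is left out of `forward` because PyTorch's `nn.CrossEntropyLoss` expects raw logits.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    """3x3 convolution with padding 1, then batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class FusionNet(nn.Module):
    def __init__(self, bands=10, meta_dim=3, n_classes=3):
        super().__init__()
        # Image branch: 3 conv layers, then global average pooling -> 256 features.
        self.image_branch = nn.Sequential(
            conv_block(bands, 64), conv_block(64, 128), conv_block(128, 256),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Metadata branch: 2-layer MLP -> 32 features.
        self.meta_branch = nn.Sequential(
            nn.Linear(meta_dim, 16), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(16, 32), nn.ReLU(),
        )
        # Fusion classifier: 288 -> 64 -> 3.
        self.classifier = nn.Sequential(
            nn.Linear(256 + 32, 64), nn.ReLU(), nn.Dropout(0.4),
            nn.Linear(64, n_classes),
        )

    def forward(self, image, meta):
        fused = torch.cat([self.image_branch(image), self.meta_branch(meta)], dim=1)
        return self.classifier(fused)   # logits; softmax gives probabilities

model = FusionNet().eval()
with torch.no_grad():
    logits = model(torch.randn(2, 10, 64, 64), torch.randn(2, 3))
    probs = torch.softmax(logits, dim=1)
print(probs.shape)
```

Because the image branch ends in adaptive average pooling, the same model also accepts inputs that are not 64×64.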

[0058] Training: Follow step S4 as described in Example 1, train on the CPU for 100 epochs, with a batch size of 16 and an initial learning rate of 0.001.
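
The training procedure with a per-epoch comparison can be sketched in PyTorch as below. Several points are assumptions: the source does not say how the no-metadata baseline is computed, so this sketch evaluates the same model with the metadata vector zeroed; `TinyFusion` is a deliberately small stand-in for the real network so that the example runs quickly; and the random tensors replace the real dataset.

```python
import torch
import torch.nn as nn

class TinyFusion(nn.Module):
    """Small two-branch stand-in model (hypothetical, for illustration only)."""
    def __init__(self, bands=10, meta_dim=3, n_classes=3):
        super().__init__()
        self.image = nn.Sequential(nn.Conv2d(bands, 8, 3, padding=1), nn.ReLU(),
                                   nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.meta = nn.Sequential(nn.Linear(meta_dim, 8), nn.ReLU())
        self.head = nn.Linear(16, n_classes)

    def forward(self, image, meta):
        return self.head(torch.cat([self.image(image), self.meta(meta)], dim=1))

def accuracy(model, images, metas, labels):
    model.eval()
    with torch.no_grad():
        return (model(images, metas).argmax(dim=1) == labels).float().mean().item()

def train(model, train_set, val_set, epochs=100, batch_size=16, lr=1e-3):
    images, metas, labels = train_set
    v_images, v_metas, v_labels = val_set
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        model.train()
        perm = torch.randperm(len(labels))
        for i in range(0, len(labels), batch_size):
            idx = perm[i:i + batch_size]
            optimizer.zero_grad()
            loss = criterion(model(images[idx], metas[idx]), labels[idx])
            loss.backward()
            optimizer.step()
        with_meta = accuracy(model, v_images, v_metas, v_labels)
        # Assumed baseline: same model, metadata replaced by zeros.
        without_meta = accuracy(model, v_images, torch.zeros_like(v_metas), v_labels)
        history.append((with_meta, without_meta))
    return history

torch.manual_seed(0)
data = (torch.randn(32, 10, 16, 16), torch.randn(32, 3), torch.randint(0, 3, (32,)))
history = train(TinyFusion(), data, data, epochs=2)
print(history)
```

The defaults mirror the reported settings (100 epochs, batch size 16, learning rate 0.001); the demonstration call uses 2 epochs on random data only to show the mechanics.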

[0059] Results: The training process showed that as training progressed, the model that incorporated spatiotemporal metadata achieved a significantly and consistently higher validation accuracy than the baseline model that did not use spatiotemporal metadata.

[0060] Best performance: At the 86th epoch, the multimodal deep learning model incorporating spatiotemporal metadata achieved a peak accuracy of 93.94% on the validation set, while the baseline model's accuracy was only 33.33%, an absolute improvement of over 60 percentage points. This strongly demonstrates that introducing the number of nutrient-deficiency days and leaf position information significantly enhances the discriminative ability of the multimodal deep learning model.

[0061] Sustained effectiveness: Throughout the later stages of training (epochs 30-100), the accuracy gain brought by metadata generally remained between +40 and +70 percentage points, indicating that the fusion strategy was not effective by chance but provided stable and significant gains.
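
The reported percentages can be checked against the 66-sample validation set with a little arithmetic. The correct-sample counts below are inferred from the percentages and are not stated in the source.

```python
n_val = 66                 # validation samples reported in the source (22 per class)
fused_correct = 62         # inferred: 62 / 66 = 93.94%
baseline_correct = 22      # inferred: 22 / 66 = 33.33%

fused_acc = 100 * fused_correct / n_val
baseline_acc = 100 * baseline_correct / n_val

print(f"{fused_acc:.2f}% vs {baseline_acc:.2f}%")                              # 93.94% vs 33.33%
print(f"absolute gain: {fused_acc - baseline_acc:.2f} percentage points")       # 60.61
print(f"relative gain: {100 * (fused_acc / baseline_acc - 1):.1f}%")            # 181.8%
```

A baseline of exactly 22 correct out of 66 matches what a classifier would score on this balanced three-class set by assigning every sample to one class, which is chance level.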

[0062] Robustness verification: The final training accuracy of the model reached 91.25%, and the validation accuracy reached 92.42%. No serious overfitting was found, indicating that adaptive data augmentation and multimodal structure effectively improved the model's generalization ability.

Claims

1. A method for monitoring plant nitrogen nutrition based on multimodal data, characterized in that it includes the following steps:
Step S1: Collect hyperspectral images, spatiotemporal metadata, and label data of the plant; perform key band screening on the hyperspectral images; and construct a multimodal dataset.
Step S2: Perform adaptive resizing and multi-dimensional data augmentation on the hyperspectral images in the multimodal dataset; feature-encode and concatenate the spatiotemporal metadata.
Step S3: Construct a multimodal deep learning model.
Step S4: Train the multimodal deep learning model using the multimodal dataset, and optimize the multimodal deep learning model.
Step S5: Obtain the hyperspectral image and spatiotemporal metadata of the plant to be tested, use the multimodal deep learning model trained and optimized in step S4 to monitor the nutrition of the plant to be tested, determine the nitrogen nutrition status of the plant to be tested, and output the results: the multimodal deep learning model extracts features from the hyperspectral image, maps the feature-encoded and concatenated spatiotemporal metadata, and fuses the extracted image features with the mapped spatiotemporal metadata to output the nutrition monitoring classification results.

2. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 1, characterized in that step S1 specifically involves: Hyperspectral images, spatiotemporal metadata, and label data of the plants are collected. The spatiotemporal metadata includes leaf acquisition time, leaf position information, number of days of nutrient deficiency, and nutrient deficiency stage. The label data consists of nutrient status labels, including at least normal, nitrogen-deficient, and potassium-deficient labels. A successive projections algorithm combined with principal component analysis is used to screen the hyperspectral images for key bands characterizing the nitrogen nutrient status of the plants. These key bands are then paired with the spatiotemporal metadata. A multimodal dataset is constructed, comprising the key bands of the hyperspectral images, the spatiotemporal metadata, and the label data. This multimodal dataset is then divided into training and validation sets according to a predetermined ratio.

3. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 2, characterized in that step S2 specifically involves: Hyperspectral images in the training and validation sets that are smaller than the target size are reflection-padded, and hyperspectral images in the training and validation sets that are larger than the target size are scaled using bilinear interpolation. Then, spatial domain data augmentation and spectral domain data augmentation are performed on the hyperspectral images after reflection padding or bilinear interpolation scaling. The spatiotemporal metadata is feature-encoded and concatenated.

4. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 3, characterized in that: The spatial domain data augmentation includes randomly flipping, randomly rotating, and randomly cropping the hyperspectral image after reflection padding or bilinear interpolation scaling. The spectral domain data augmentation includes randomly adding noise to, and randomly discarding bands from, the hyperspectral image after reflection padding or bilinear interpolation scaling.

5. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 1, characterized in that: The multimodal deep learning model is a 2D-CNN fusion network model; the multimodal deep learning model includes an image feature extraction branch, a spatiotemporal metadata processing branch, and a feature fusion classification module; the image feature extraction branch is used to extract the spatial-spectral features of the hyperspectral image; the spatiotemporal metadata processing branch is used to map the feature-encoded and concatenated spatiotemporal metadata to a spatiotemporal metadata feature vector; the feature fusion classification module is used to fuse the output of the image feature extraction branch with the output of the spatiotemporal metadata processing branch.

6. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 5, characterized in that: The image feature extraction branch is a two-dimensional convolutional neural network, which includes multiple convolutional layers, batch normalization layers, and activation functions. The end of the image feature extraction branch also includes a global average pooling layer, which is used to output a fixed-length image feature vector. The spatiotemporal metadata processing branch is a multilayer perceptron, and its input is the numerical feature vector obtained by feature-encoding and concatenating the spatiotemporal metadata in step S2. The feature fusion classification module is used to fuse the output of the image feature extraction branch and the output of the spatiotemporal metadata processing branch through a concatenation operation, and then to pass the concatenated fused features through a classifier that outputs the probability distribution of nitrogen nutrient status.

7. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 1, characterized in that, in step S4, during the training of the multimodal deep learning model, the confusion matrix, accuracy, and F1 score of the multimodal deep learning model are compared with and without spatiotemporal metadata as input.

8. The plant nitrogen nutrition monitoring method based on multimodal data according to claim 7, characterized in that the specific method for optimizing the multimodal deep learning model in step S4 is as follows: the cross-entropy loss function is used to calculate the prediction error of the multimodal deep learning model, and the Adam optimizer is used to adjust the model parameters of the multimodal deep learning model.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the method as described in any one of claims 1-8.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1-8.