An improved multi-modal transformer-based red tide anomaly detection method and system

The improved multimodal Transformer solves the problems of insufficient intermodal fusion capability and insufficient accuracy in anomaly area positioning in red tide detection, and realizes accurate detection and real-time early warning of red tide anomalies, which is suitable for marine edge equipment with limited resources.

CN120913074B · Active · Publication Date: 2026-06-19 · SHANDONG MARINE RESOURCE AND ENVIRONMENT RESEARCH INSTITUTE (SHANDONG MARINE ENVIRONMENTAL MONITORING CENTER SHANDONG AQUATIC PRODUCTS QUALITY INSPECTION CENTER)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SHANDONG MARINE RESOURCE AND ENVIRONMENT RESEARCH INSTITUTE (SHANDONG MARINE ENVIRONMENTAL MONITORING CENTER SHANDONG AQUATIC PRODUCTS QUALITY INSPECTION CENTER)
Filing Date: 2025-08-01
Publication Date: 2026-06-19

Application Information

IPC: G06V20/10; G06V10/30; G06V10/46; G06V10/764; G06V10/80; G06V10/82; G06V10/86; G06N3/0455; G06N3/0464; G06N3/096

AI Tagging

Application Domain

Character and pattern recognition; Biological models

AI Technical Summary

Technical Problem

Existing red tide detection methods have limited intermodal fusion capabilities, making it difficult to achieve the collaborative use of semantic information such as remote sensing images and red tide text descriptions. The accuracy of anomaly area localization is insufficient. Furthermore, traditional multimodal fusion methods have redundant model parameters and high computational costs, making them difficult to deploy in resource-constrained marine edge devices and failing to meet the dual requirements of real-time performance and accuracy.

Method used

An improved multimodal Transformer is adopted to acquire remote sensing images and text data, perform data preprocessing, visual localization and text selection, utilize the multimodal capsule mechanism for cross-modal feature learning, introduce a semantic path-guided attention mechanism for feature alignment, and predict marine red tide anomalies through multimodal knowledge distillation. It is also deployed in a lightweight manner using a teacher-student model architecture.

Benefits of technology

It achieves accurate perception of complex red tide scenarios from multiple angles, improves the semantic consistency and complementarity between image spatial structure and text semantic tags, has stable multimodal representation capabilities, takes into account prediction performance and edge deployment requirements, and realizes dynamic monitoring and real-time early warning of red tide anomalies.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the field of red tide anomaly detection technology, and in particular to a red tide anomaly detection method and system based on an improved multimodal Transformer. The method includes acquiring remote sensing images and text data; performing data preprocessing on the acquired remote sensing images and text data respectively; performing visual localization and text selection based on the preprocessed data; performing cross-modal feature learning based on a hierarchical Transformer using a multimodal capsule mechanism; optimizing image-semantic feature alignment based on a semantic path-guided attention mechanism; and performing multimodal knowledge distillation on the optimized features. This invention combines image and text preprocessing, visual localization, and keyword extraction modules to achieve accurate multi-angle perception of complex red tide scenes, overcoming the bottleneck of traditional methods that struggle to accurately identify red tide areas under conditions of limited data and information dimensions.

Description

Technical Field

[0001] This invention relates to the field of red tide anomaly detection technology, and in particular to a red tide anomaly detection method and system based on an improved multimodal Transformer.

Background Technology

[0002] Red tides not only impact fishery resources and aquaculture output, but also cause losses to coastal tourism, exacerbate eutrophication, and even lead to the degradation of nearshore water ecosystems. Red tide events are characterized by complex spatiotemporal evolution, multiple intertwined causes, and short monitoring and response cycles. Therefore, there is an urgent need to construct an intelligent detection and early warning technology system for red tide anomalies to support the early perception, accurate identification, and dynamic tracking of abnormal marine phenomena.

[0003] Existing red tide detection methods can be broadly categorized into three types: First, physicochemical monitoring methods relying on fixed locations or manual sampling, primarily collecting parameters such as nutrients, chlorophyll, water temperature, and pH in the water body. These methods assess the risk of red tide occurrence by setting thresholds or establishing empirical models. However, their spatial coverage is limited, making it difficult to support large-scale dynamic monitoring. Second, visual recognition methods based on remote sensing images utilize medium- to high-resolution satellite or UAV images to observe sea surface color changes and algae density distribution. These methods offer wide-area observation and frequent updates, but are susceptible to weather conditions and lack accuracy in identifying weak signals or red tides with blurred boundaries. Third, machine learning-based model methods employ a large amount of historical image data or monitoring records to train deep neural networks for discrimination or prediction. While possessing a high level of automation, their effectiveness is prone to degradation under conditions of insufficient samples or inconsistent data modalities. Furthermore, they struggle to effectively integrate complex visual information from remote sensing images with semantic factors from monitoring text.

[0004] In addition, although current red tide detection methods have initially introduced image recognition and machine learning technologies, the following prominent problems still exist: (1) Limited intermodal fusion capability. Existing methods mostly rely on single images or numerical factors, making it difficult to achieve the coordinated use of semantic information such as remote sensing images and red tide text descriptions, resulting in fragmented multi-source information and difficulty in fully restoring the evolution characteristics of red tides; (2) Insufficient accuracy in locating abnormal areas. Traditional segmentation methods based on image thresholds or regional averages cannot accurately extract the boundaries of weak signal red tides, especially in complex backgrounds such as cloud cover and uneven illumination, where the robustness of identification is poor; (3) Traditional multimodal fusion methods have redundant model parameters and high computational costs, making them difficult to deploy in resource-constrained marine edge devices. Furthermore, they neglect the maintenance of cross-modal semantic transfer capabilities, which can easily lead to feature loss and expression deviation, making it difficult to meet the dual requirements of real-time performance and accuracy in red tide anomaly detection scenarios.

[0005] Therefore, in view of the limitations of the above methods, such as insufficient cross-modal modeling ability, limited utilization of semantic information, and fuzzy localization of abnormal areas, it is urgent to propose a red tide anomaly detection method and system based on an improved multimodal Transformer.

Summary of the Invention

[0006] To address the aforementioned problems, this invention provides a red tide anomaly detection method and system based on an improved multimodal Transformer.

[0007] Firstly, the present invention provides a red tide anomaly detection method based on an improved multimodal Transformer, employing the following technical solution:

[0008] A red tide anomaly detection method based on an improved multimodal Transformer includes:

[0009] Acquire remote sensing images and text data;

[0010] Data preprocessing is performed on the acquired remote sensing images and text data respectively;

[0011] Visual localization and text selection based on preprocessed data;

[0012] Cross-modal feature learning using a hierarchical Transformer based on a multimodal capsule mechanism;

[0013] Image-semantic feature alignment optimization based on semantic path-guided attention mechanism;

[0014] Multimodal knowledge distillation is performed on the optimized features;

[0015] Predicting marine red tide anomalies using a student model derived from knowledge distillation that incorporates multimodal information.

[0016] Further, the data preprocessing of the acquired remote sensing images and text data includes: for the remote sensing images, removing high-frequency noise components using a discrete wavelet transform (DWT), extracting key points and their descriptors using the scale-invariant feature transform (SIFT), and constructing an image feature set; for the red tide text data, eliminating noise in unstructured text through text cleaning, extracting semantic relationships between identified entities, and obtaining structured triples using a structured relation classifier based on a multi-head attention mechanism; and performing two further processing steps, namely timestamp unification and standardization, and image feature completion, where image feature completion applies pixel-level interpolation based on neighboring pixels in the local image space to estimate the content of missing regions, expressed as:

[0017] $$\hat{I}(x,y) = \frac{1}{Z}\sum_{(i,j)\in\Omega} w_{ij}\, I(x+i,\, y+j),$$

[0018] where $\hat{I}(x,y)$ is the estimated pixel value at position (x, y) in the image, i.e., the interpolated fill value; $I(x+i, y+j)$ is the value of the known neighboring pixel at position (x+i, y+j); Ω is the interpolation window; $w_{ij}$ is the interpolation weight coefficient; and Z is the weight normalization factor.

[0019] Furthermore, the visual localization and text selection based on the preprocessed data includes: constructing a cross-modal attention map from the remote sensing images, in which deep visual features are extracted by the image encoder ViT, global text semantics are extracted by the language encoder BERT, and a cross-modal attention mechanism builds an attention map A, where each value $A_{i,j}$ represents the degree of correlation between position (i,j) in the image and the semantics of the red tide text. A U-Net decoding network then restores the attention map A to a spatial mask of the original image size. After normalization by the Sigmoid function, a preliminary anomaly probability mask is obtained:

[0020] $$M_0 = \sigma\left(\mathrm{UNet}(A)\right),$$

[0021] where σ is the Sigmoid activation function, and each output value $M_0(i,j)$ is the probability that pixel (i,j) in the image belongs to an anomalous region. Finally, a mask optimizer based on the SAM framework is introduced to generate a fine boundary mask $M_{\text{final}}$ from the original image detail information, the language cue vectors, and the initial mask.

[0022] Furthermore, the cross-modal feature learning performed by the hierarchical Transformer based on the multimodal capsule mechanism includes: using a hierarchical Transformer architecture that integrates a modality-specific/shared structure with a multimodal capsule mechanism to extract modality-difference features and shared high-order semantics in stages, achieving hierarchical understanding and dynamic aggregation across modalities. Specifically, for image data, a local attention map is constructed based on the mask $M_{\text{final}}$ to obtain image modality embeddings; for text data, a pre-trained text encoder generates word vector sequences; and a gating mechanism divides the input features into modality-specific features and shared features, forming the modality-specific/shared structure. At the same time, a multimodal capsule mechanism based on dynamic routing is introduced. An initial set of low-order capsule vectors $U = \{u_1, \dots, u_n\}$ is obtained by linear mapping; a trainable projection matrix $W_{ij}$ gives the prediction vector $\hat{u}_{j|i} = W_{ij} u_i$ of lower-order capsule i for higher-order capsule j; the routing coefficients $c_{ij}$ dynamically weight and aggregate these predictions into the higher-order capsule vector; and the squash function compresses the vector magnitude to generate a probabilistic semantic entity representation $v_j$:

[0023] $$s_j = \sum_{i} c_{ij}\, \hat{u}_{j|i}, \qquad v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\cdot\frac{s_j}{\|s_j\|},$$

[0024] where $s_j$ is the aggregated vector before compression. The final output concatenates the shared semantic fusion features, the modality-specific features, and the higher-order capsule outputs $V = \{v_1, \dots, v_k\}$ to form a multi-level semantic representation.
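The weighted aggregation and squash step described above can be sketched in plain Python. This is a minimal illustration only, assuming the standard capsule-network squash form (squared norm over one plus squared norm, applied to the unit vector). The function names `squash` and `aggregate` are hypothetical, and the routing coefficients are passed in directly rather than learned by iterative routing.

```python
import math


def squash(s):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0 for _ in s]
    norm = math.sqrt(norm_sq)
    scale = norm_sq / (1.0 + norm_sq) / norm
    return [scale * x for x in s]


def aggregate(predictions, coeffs):
    """Weighted sum of prediction vectors (one per lower-order capsule), then squash.

    predictions: list of equal-length vectors for one higher-order capsule j.
    coeffs: routing coefficients c_ij, one per prediction (assumed given).
    """
    dim = len(predictions[0])
    s = [sum(c * p[k] for c, p in zip(coeffs, predictions)) for k in range(dim)]
    return squash(s)


def length(v):
    """Euclidean length, read here as the capsule's activation probability."""
    return math.sqrt(sum(x * x for x in v))
```

Under this form, a long input vector maps to a length close to 1 and a short one to a length close to 0, which is what lets the output length be read as a probability.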

[0025] Furthermore, the image-semantic feature alignment optimization based on the semantic path-guided attention mechanism includes: taking the word embedding sequence $h_T = \{e_1, \dots, e_k\}$ obtained by encoding the input text, and using event extraction and related methods to identify spatial relationships and causal expressions in the text, thereby constructing a semantic path set P:

[0026] $$P = \{p_1, p_2, \dots, p_m\},$$

[0027] where $e_i$ is the embedding vector of the i-th word, and each path $p_i$ is a series of entities connected by semantic relations $r_j^{(i)}$. A path representation vector is constructed by weighted aggregation of the node embeddings in each path:

[0028] ,

[0029] where each node embedding is weighted by the attention weight of the j-th node in the path, and PE(j) is the positional encoding term. This yields the path embedding set, in which the k-th element is the vector representation of path $p_k$.

[0030] Furthermore, the image-semantic feature alignment optimization based on the semantic path-guided attention mechanism also includes introducing the path embeddings into the fused semantic space output by the previous module for attention adjustment, so that the semantic paths serve as reasoning clues that guide the model to focus on logical key points. The fused features output by multimodal hierarchical modeling and capsule aggregation, each representing one visual/text fusion unit, are matched against all semantic paths to obtain the guiding attention weights:

[0031] ,

[0032] where $W_n$ is a learnable projection matrix and $\beta_{i,j}$ indicates the degree of semantic influence of path j on fused feature i. A guided fusion representation $Z_{\text{fused}}$ is then constructed from this attention distribution.

[0033] Furthermore, the multimodal knowledge distillation of the optimized features includes: inputting the feature vector $Z_{\text{fused}}$, which fuses image features and semantic features, together with the detailed text description $T_{\text{LLM}}$ generated by the prompt word-driven module, into the teacher model. The teacher model is built end to end on a large-scale Transformer structure. Its encoder consists of stacked multi-head self-attention and feedforward networks, and the output of its l-th layer is:

[0034] $$H^{(l)} = \mathrm{FFN}\left(\mathrm{MHAtt}\left(H^{(l-1)}\right)\right), \qquad H^{(0)} = [Z_{\text{fused}};\, T_{\text{LLM}}],$$

[0035] where FFN is the feedforward network, MHAtt is the multi-head self-attention mechanism, $H^{(0)}$ is the initial feature, $Z_{\text{fused}}$ is the feature vector fused from image features and semantic features, and $T_{\text{LLM}}$ is the detailed text description generated by the prompt word-driven module. In the final layer, the teacher model outputs a high-order semantic vector $h_{\text{teacher}}$.

[0036] Furthermore, the multimodal knowledge distillation of the optimized features also includes a multi-term distillation loss. During distillation, the student model is simplified to a small number of attention layers and a small feedforward network; it takes the same features as input and outputs predicted features $h_{\text{student}}$. The distillation loss comprises soft target distillation, feature alignment distillation, and inter-layer attention alignment, expressed as:

[0037] $$L_{\text{soft}} = \mathrm{KL}\left(\sigma\left(h_{\text{teacher}}/\tau\right)\,\big\|\,\sigma\left(h_{\text{student}}/\tau\right)\right), \quad L_{\text{feat}} = \left\|h_{\text{teacher}} - h_{\text{student}}\right\|_2^2, \quad L_{\text{att}} = \sum_{l}\left\|A_{\text{teacher}}^{(l)} - A_{\text{student}}^{(l)}\right\|_F^2,$$

[0038] where $h_{\text{teacher}}$ is the predicted feature generated by the teacher model, $h_{\text{student}}$ is the predicted feature generated by the student model, KL(⋅‖⋅) is the KL divergence, σ is softmax, τ is the temperature coefficient, and $A^{(l)}$ is the attention weight matrix of the l-th layer. The total loss is:

[0039] $$L_{\text{total}} = \lambda_1 L_{\text{soft}} + \lambda_2 L_{\text{feat}} + \lambda_3 L_{\text{att}},$$

[0040] where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are the weighting coefficients.
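The three-part distillation loss described above (soft target distillation, feature alignment, and inter-layer attention alignment) can be illustrated with a small pure-Python sketch. It is a sketch under stated assumptions, not the patent's implementation: mean squared error is assumed for the feature and attention terms, attention matrices are passed as flattened lists per layer, and the default temperature and weights are arbitrary placeholders. All function names are hypothetical.

```python
import math


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; larger tau gives a softer distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(h_teacher, h_student, att_teacher, att_student,
                      tau=2.0, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of soft-target, feature-alignment and attention-alignment terms.

    att_teacher / att_student: one flattened attention matrix per aligned layer.
    tau and weights are illustrative defaults, not values from the source.
    """
    soft = kl_divergence(softmax(h_teacher, tau), softmax(h_student, tau))
    feat = mse(h_teacher, h_student)
    att = sum(mse(a_t, a_s) for a_t, a_s in zip(att_teacher, att_student))
    w_soft, w_feat, w_att = weights
    return w_soft * soft + w_feat * feat + w_att * att
```

When the student reproduces the teacher exactly, every term vanishes, so the loss is zero; any mismatch in outputs or attention raises it.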

[0041] Furthermore, predicting marine red tide anomalies with the student model derived from knowledge distillation of fused multimodal information includes anomaly scoring on the fused feature sequence output by the student model. The feature $z_s^{(t)}$ output by the student model at time t is modeled without supervision using an autoencoder structure, whose reconstruction output is denoted $\hat{z}_s^{(t)}$. The reconstruction error serves as the basis for anomaly scoring, and the anomaly score is $A^{(t)} = \left\|z_s^{(t)} - \hat{z}_s^{(t)}\right\|_2^2$, where $A^{(t)}$ is the anomaly score at the current time step, $z_s^{(t)}$ is the fused feature output by the student model, and $\hat{z}_s^{(t)}$ is the feature vector reconstructed by the autoencoder. The initial anomaly text description $T_{\text{init}}$ generated by the prompt word-driven module is then input into the large model to generate an updated anomaly text description. The large language model $G_\theta(\cdot)$ performs semantic understanding of the input and anomaly language generation, ultimately producing an enhanced red tide anomaly warning text:

$$T_{\text{final}} = G_\theta\left(T_{\text{init}},\, z_s^{(t)},\, \text{Prompt}\right),$$

[0042] where $T_{\text{final}}$ is the context-aware, feature-driven anomaly description text, $T_{\text{init}}$ is the initial anomaly text description, $z_s^{(t)}$ is the fused feature output by the student model, and Prompt is the prompt text.
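Reconstruction-error scoring as described above can be sketched as follows. This is an illustrative example that assumes the squared Euclidean distance as the error measure and a fixed alert threshold; neither the threshold nor the function names `anomaly_score` and `flag_anomalies` come from the source, and the autoencoder itself is replaced by precomputed reconstructions.

```python
def anomaly_score(z, z_hat):
    """Squared Euclidean distance between a feature vector and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(z, z_hat))


def flag_anomalies(features, reconstructions, threshold):
    """Return the time indices whose reconstruction error exceeds the threshold."""
    flagged = []
    for t, (z, z_hat) in enumerate(zip(features, reconstructions)):
        if anomaly_score(z, z_hat) > threshold:
            flagged.append(t)
    return flagged


# Hypothetical feature sequence: the autoencoder reconstructs step 0 well and step 1 poorly.
features = [[1.0, 2.0], [4.0, 0.0]]
reconstructions = [[1.0, 2.5], [1.0, 0.0]]
alerts = flag_anomalies(features, reconstructions, threshold=1.0)
```

The intuition is that an autoencoder trained on normal conditions reconstructs normal features closely, so a large error marks a time step that departs from what it has learned.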

[0043] Secondly, a red tide anomaly detection system based on an improved multimodal Transformer includes:

[0044] The data acquisition module is configured to acquire remote sensing image and text data;

[0045] The preprocessing module is configured to perform data preprocessing based on the acquired remote sensing images and text data respectively;

[0046] The selection module is configured to perform visual positioning and text selection based on preprocessed data;

[0047] The alignment module is configured to perform cross-modal feature learning and feature alignment based on image-text feature encoding;

[0048] The optimization module is configured to perform image-semantic feature alignment optimization based on a prompt word-driven generation mechanism;

[0049] The distillation module is configured to perform multimodal knowledge distillation on the optimized features;

[0050] The prediction module is configured to predict marine red tide anomalies based on a student model derived from knowledge distillation that incorporates multimodal information.

[0051] Thirdly, the present invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the red tide anomaly detection method based on an improved multimodal Transformer.

[0052] Fourthly, the present invention provides a terminal device, including a processor and a computer-readable storage medium, wherein the processor is used to execute instructions, and the computer-readable storage medium stores a plurality of instructions adapted to be loaded by the processor to execute the red tide anomaly detection method based on an improved multimodal Transformer.

[0053] In summary, the present invention has the following beneficial technical effects:

[0054] Compared with existing technologies, the multimodal large-scale red tide anomaly detection method proposed in this invention, which integrates semantic and visual features, has the following significant advantages: First, the system integrates multi-source heterogeneous data such as remote sensing images and red tide historical records, and combines image and text preprocessing, visual positioning, keyword extraction and other modules to achieve multi-angle accurate perception of complex red tide scenes, breaking through the bottleneck of traditional methods that are difficult to accurately identify red tide areas under the condition of single data and limited information dimensions.

[0055] Secondly, a dual-channel image-text feature encoding structure is constructed, and a shared attention mechanism and cross-modal feature alignment strategy are introduced, which significantly improves the semantic consistency and complementarity between image spatial structure and text semantic labels, enabling the system to still have stable multimodal representation capabilities under weak supervision.

[0056] Third, the system enhances the model's understanding of red tide semantic concepts and its ability to express language through a prompt word-driven generation module. Furthermore, by combining the teacher-student model architecture, it completes lightweight distillation of knowledge from large models, effectively balancing prediction performance and edge deployment requirements.

[0057] Finally, based on the fusion features, an anomaly detection and trend prediction model is constructed, which, together with the visualization early warning output module, can not only realize dynamic monitoring and real-time early warning of red tide anomalies, but also has good generalization ability and spatiotemporal adaptability, providing key support for intelligent monitoring and emergency response in complex marine environments.

Attached Figure Description

[0058] Figure 1 is a schematic diagram of a red tide anomaly detection method based on an improved multimodal Transformer according to Embodiment 1 of the present invention.

[0059] Figure 2 is a comparison diagram of model accuracy (ACC) and F1 score in Embodiment 1 of the present invention.

[0060] Figure 3 shows the variation in accuracy for different Dropout ratios in Embodiment 1 of the present invention.

Detailed Implementation

[0061] The present invention will be further described in detail below with reference to the accompanying drawings.

[0062] Example 1

[0063] Referring to Figure 1, the red tide anomaly detection method based on an improved multimodal Transformer of this embodiment includes:

[0064] Acquire remote sensing images and text data;

[0065] Data preprocessing is performed on the acquired remote sensing images and text data respectively;

[0066] Visual localization and text selection based on preprocessed data;

[0067] Cross-modal feature learning and feature alignment based on image-text feature encoding;

[0068] Image-semantic feature alignment optimization based on prompt word-driven generation mechanism;

[0069] Multimodal knowledge distillation is performed on the optimized features;

[0070] Predicting marine red tide anomalies using a student model derived from knowledge distillation that incorporates multimodal information.

[0071] Specifically:

[0072] S1. Image and text data preprocessing module.

[0073] In the anomaly detection of marine red tides, remote sensing images and related textual records constitute a dual source of information input. Remote sensing images can provide large-scale, multi-temporal visualization information, while textual data such as monitoring logs, forecast records, and expert analysis reports contain rich semantic and empirical knowledge. Due to modal, structural, and quality differences between image and textual data, standardization and structuring must be performed using scientific preprocessing methods to ensure the accuracy and consistency of subsequent model training and feature fusion. This module mainly includes the following three steps:

[0074] 1) Remote sensing image preprocessing: Remote sensing images, as the core information source for red tide detection, offer high resolution and wide spatiotemporal coverage. However, they also suffer from noise and distortion introduced by factors such as atmospheric disturbance, cloud cover, and sensor drift. A discrete wavelet transform (DWT)-based method is used to remove high-frequency noise components. The wavelet transform provides local analysis in both the time (spatial) and frequency domains, which makes it suitable for detail extraction and noise removal in remote sensing images. Let the input image be I(x,y). Performing a two-dimensional DWT on it yields four sub-bands:

[0075] $$\mathrm{DWT}(I) = \{LL,\, LH,\, HL,\, HH\},$$

[0076] where LL is the low-frequency approximation sub-band, preserving the main structural information of the image, and LH, HL, and HH are the high-frequency detail sub-bands, containing edge and noise information. To remove noise, a threshold λ is set and soft thresholding is applied to the high-frequency coefficients:

[0077] $$\hat{w} = \mathrm{sign}(w)\cdot\max\left(|w| - \lambda,\, 0\right),$$

[0078] where $w$ is a high-frequency sub-band coefficient, $\hat{w}$ is the thresholded coefficient, and sign(⋅) is the sign function. The thresholded coefficients are then used to reconstruct the image by the inverse wavelet transform (IDWT), giving the denoised image $I_{\text{denoise}}$. Remote sensing images often exhibit geometric shifts due to factors such as changes in sensor viewpoint and terrain variations. To achieve image alignment, the scale-invariant feature transform (SIFT) is used to extract key points and their descriptors, constructing an image feature set:

[0079] $$F = \{(x_i,\, y_i,\, d_i)\}_{i=1}^{N},$$

[0080] where $(x_i, y_i)$ is the location of the i-th key point and $d_i$ is its descriptor vector. Because remote sensing images are often acquired at different times, by different sensors, or from different angles, spatial offsets may exist between them. Registration aligns two images so that their pixels correspond one to one. After feature point sets are extracted from two images $I_1$ and $I_2$, the descriptors are matched by Euclidean distance to obtain an initial set of corresponding points. To eliminate mismatches and outliers, the random sample consensus (RANSAC) algorithm is used to estimate the perspective transformation matrix H, with the optimization objective:

[0081] $$H^{*} = \arg\min_{H} \sum_{i} \left\| p_i' - H p_i \right\|^2,$$

[0082] where $p_i$ is a matched point in image $I_1$, $p_i'$ is the corresponding point in $I_2$, and H is the transformation matrix. Applying H yields the final registered image $I_{\text{aligned}}$. Different remote sensing images exhibit significant variations in pixel value dynamic range due to differences in imaging sensors and lighting conditions. To improve the model's adaptability, min-max normalization is used to map pixel values to the [0,1] interval:

[0083] $$I_{\text{norm}} = \frac{I_{\text{aligned}} - I_{\min}}{I_{\max} - I_{\min}},$$

[0084] where $I_{\text{aligned}}$ is the pixel value of the registered image, and $I_{\min}$ and $I_{\max}$ are the minimum and maximum pixel values of the image. Normalization improves the convergence speed and stability of the feature extraction stage and keeps different batches of images consistent in the model input space.
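Two of the preprocessing operations described in this step, soft thresholding of high-frequency wavelet coefficients and min-max normalization, are simple enough to sketch in plain Python. The sketch below operates on already-computed coefficients and on a small nested-list image; the wavelet decomposition, SIFT matching, and RANSAC registration are not reproduced. The function names and the handling of a constant image (mapped to all zeros) are assumptions made for illustration.

```python
def soft_threshold(w, lam):
    """Shrink coefficient w toward zero by lam; coefficients within lam become 0."""
    magnitude = abs(w) - lam
    if magnitude <= 0:
        return 0.0
    return magnitude if w > 0 else -magnitude


def denoise_subband(coeffs, lam):
    """Apply soft thresholding to every coefficient of a high-frequency sub-band."""
    return [[soft_threshold(w, lam) for w in row] for row in coeffs]


def min_max_normalize(image):
    """Map pixel values linearly onto [0, 1] using the image's own min and max."""
    flat = [p for row in image for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # Assumption: a constant image carries no contrast, so return zeros.
        return [[0.0 for _ in row] for row in image]
    return [[(p - lo) / (hi - lo) for p in row] for row in image]
```

Soft thresholding is preferred over simply zeroing small coefficients because it shrinks the surviving coefficients as well, which avoids abrupt jumps in the reconstructed image.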

[0085] 2) Red Tide Text Data Cleaning and Structuring: Red tide-related text data mainly includes marine monitoring reports, research paper abstracts, and news reports. Because these texts come from diverse sources, use inconsistent formats, and contain substantial noise (such as redundant expressions, inconsistent terminology, and disordered spatiotemporal elements), they require cleaning and structuring to improve the efficiency of subsequent model processing and semantic understanding. Text cleaning eliminates noise in unstructured text, including redundant punctuation, formatting symbols, and redundant descriptions, and standardizes terminology. Named entity recognition then identifies key red tide-related entities in the text, such as Time, Location, Parameter, Value, and Event type. A BERT-based sequence labeling model is used, trained on BIO-encoded annotations. Let the sentence be:

[0086] $$S = \{w_1, w_2, \dots, w_n\},$$

[0087] where $w_i$ is the i-th word. BERT encoding gives the corresponding context representation $h_i$, and a CRF layer then decodes the optimal label path:

[0088] ,

[0089] Here the decoding relies on a transition probability function between labels, and y is a label sequence with tags such as B-LOC (beginning of a location), I-LOC (inside a location), and O (other). The goal of relation extraction is to extract the semantic relationships between the identified entities. A structured relation classifier based on a multi-head attention mechanism is used, and the model represents the two entities as:

[0090] ,

[0091] where the representation of the first entity is the concatenation of its context vectors. The entity representations are then input into the relation classifier:

[0092] ,

[0093] The final result is a structured triple. To facilitate subsequent multimodal alignment with image data, the text is structured into a unified JSON format.
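The path from BIO tags to a structured JSON record can be illustrated with a short sketch. The example sentence, the tag names `PARAM` and `LOC`, the relation label `observed_at`, and the JSON field names are all hypothetical; the source does not specify the schema, and the tags here are supplied by hand rather than predicted by a BERT-CRF model.

```python
import json


def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into entity spans of the form {type, text}."""
    entities, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append(current)
            current = {"type": tag[2:], "text": token}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"] += " " + token
        else:
            # "O" tag or an I- tag with no matching open entity closes the span.
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities


# Hypothetical example sentence and hand-written tags.
tokens = ["Chlorophyll", "rose", "near", "Jiaozhou", "Bay"]
tags = ["B-PARAM", "O", "O", "B-LOC", "I-LOC"]
entities = bio_to_entities(tokens, tags)

# Hypothetical record layout holding the entities and one (head, relation, tail) triple.
record = {
    "entities": entities,
    "triple": {"head": "Chlorophyll", "relation": "observed_at", "tail": "Jiaozhou Bay"},
}
print(json.dumps(record, indent=2))
```

Storing each record in one flat JSON layout is what makes the later pairing with image data straightforward, since every record exposes the same keys.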

[0094] 3) Time Alignment and Missing Data Completion: To fuse remote sensing images and text information effectively in multimodal red tide detection, the time alignment problem between different data sources must be solved first; completion strategies then improve detection stability and robustness when samples are missing. This involves two processing steps: timestamp unification and standardization, and image feature completion. Timestamp unification and standardization: Let the remote sensing image sequence be:

[0095] $$\mathcal{I} = \{(I_i,\, t_i^{\text{img}})\}_{i=1}^{N},$$

[0096] and the text data sequence be:

[0097] $$\mathcal{T} = \{(T_j,\, t_j^{\text{txt}})\}_{j=1}^{M},$$

[0098] where $t^{\text{img}}$ and $t^{\text{txt}}$ are the timestamps of the image and the text. A time window Δt is constructed, and image and text data whose time difference is less than this window threshold are paired:

[0099] $$\left|t_i^{\text{img}} - t_j^{\text{txt}}\right| < \Delta t,$$
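The time-window pairing rule can be sketched directly with Python's standard `datetime` types. This is a minimal illustration: the identifiers, timestamps, and six-hour window are made up, the function name `pair_by_time` is hypothetical, and every image is compared with every text record, which is fine for small collections but not tuned for large ones.

```python
from datetime import datetime, timedelta


def pair_by_time(images, texts, window):
    """Pair each image with every text record whose timestamp is within the window.

    images, texts: lists of (identifier, datetime) tuples.
    window: timedelta giving the maximum allowed time difference (exclusive).
    """
    pairs = []
    for img_id, t_img in images:
        for txt_id, t_txt in texts:
            if abs(t_img - t_txt) < window:
                pairs.append((img_id, txt_id))
    return pairs


# Hypothetical acquisition and report times.
images = [("img1", datetime(2025, 8, 1, 10, 0)), ("img2", datetime(2025, 8, 3, 10, 0))]
texts = [("report1", datetime(2025, 8, 1, 12, 0))]
pairs = pair_by_time(images, texts, timedelta(hours=6))
```

A wider window yields more image-text pairs but risks matching a report to a scene captured under different conditions, so the window length is a trade-off rather than a fixed constant.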

[0100] Image feature completion: When parts of a remote sensing image are missing (e.g., due to cloud cover), pixel-level interpolation based on neighboring pixels in the local image space is used to estimate the content of the missing area:

[0101] $$\hat{I}(x,y) = \frac{1}{Z}\sum_{(i,j)\in\Omega} w_{ij}\, I(x+i,\, y+j),$$

[0102] where $\hat{I}(x,y)$ is the estimated pixel value at position (x, y) in the image, i.e., the interpolated fill value; $I(x+i, y+j)$ is the value of the known neighboring pixel at position (x+i, y+j); Ω is the interpolation window; $w_{ij}$ is the interpolation weight coefficient; and Z is the weight normalization factor.
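The neighborhood interpolation for missing pixels can be sketched as a weighted average over known neighbors in a square window. The weighting scheme is not given in the text, so inverse squared distance is assumed here purely for illustration, with the normalization factor taken as the sum of the weights used. The function name, the `known` mask, and the window radius are hypothetical.

```python
def interpolate_pixel(image, known, x, y, radius=1):
    """Estimate the value at (x, y) from known neighbors inside a square window.

    image: 2D list of pixel values indexed as image[x][y].
    known: 2D list of booleans, True where the pixel value is valid.
    Returns None when the window contains no known neighbor.
    """
    total, z = 0.0, 0.0
    for i in range(-radius, radius + 1):
        for j in range(-radius, radius + 1):
            if i == 0 and j == 0:
                continue
            xi, yj = x + i, y + j
            inside = 0 <= xi < len(image) and 0 <= yj < len(image[0])
            if inside and known[xi][yj]:
                w = 1.0 / (i * i + j * j)  # assumed weight: inverse squared distance
                total += w * image[xi][yj]
                z += w
    return total / z if z > 0 else None
```

Dividing by the accumulated weight keeps the estimate within the range of the neighboring values, even when some neighbors are themselves missing.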

[0103] S2. Visual positioning and text selection module.

[0104] In the task of detecting red tide anomalies in the ocean, remote sensing images contain rich spectral information of the ocean surface, which can capture precursor signals of red tides such as seawater discoloration, plankton aggregation, and water turbidity. However, due to the wide coverage of images, the varying scales and blurred boundaries of target areas, and the fact that red tides often exhibit regional and sparse distribution, directly using the entire image for anomaly identification is easily interfered with by irrelevant backgrounds (such as coastlines, fishing boats, and clouds). Therefore, it is urgent to design a precise and interpretable visual localization mechanism for anomaly areas, focusing attention on spatial areas with potential red tide characteristics, thereby improving detection efficiency and accuracy. This module proposes a visual localization and mask prediction method based on a multimodal pre-trained model. A mask prediction head is constructed to achieve pixel-level segmentation of suspected red tide areas in remote sensing images, and a keyword selector is introduced to filter semantic tags that are significantly related to red tides in text sequences, enhancing the model's semantic perception and interpretability.

[0105] 1) Construct a cross-modal attention map to discover potential anomaly regions. The deep visual features of the remote sensing image I are extracted using an image encoder (ViT), which outputs an embedding tensor:

[0106] $$F = \mathrm{ViT}(I) \in \mathbb{R}^{h \times w \times d}$$

[0107] where h and w are the spatial dimensions of the downsampled feature map, and d is the number of feature channels. The monitoring text sequence $T = \{w_1, w_2, \dots, w_n\}$ is input into the language encoder (BERT) to extract the global text semantic embedding:

[0108] $$f_T = \mathrm{BERT}(T)$$

[0109] To establish the association between image spatial regions and text semantics, a cross-modal attention mechanism is used to construct an attention map A, where each value $A_{i,j}$ represents the degree of semantic relevance between position (i,j) in the image and the red tide text, calculated as follows:

[0110] $$A_{i,j} = \mathrm{softmax}\left( (W_q f_T)^{\top} (W_k F_{i,j}) \right)$$

[0111] where $W_q$ and $W_k$ are learnable linear mapping matrices, softmax ensures that the attention weights sum to 1, and $F_{i,j}$ is the visual feature at position (i,j) in the image. The attention map A can be viewed as a low-resolution "anomaly heatmap".
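The attention map construction can be sketched in NumPy as follows. This is an illustrative sketch, not the patented implementation: treating the text embedding as the query and the visual features as keys, the projection size, and the random inputs are all assumptions.

```python
import numpy as np

def cross_modal_attention(F, f_t, W_q, W_k):
    """F: (h, w, d) visual features; f_t: (d,) text embedding.
    Returns an (h, w) attention map whose entries sum to 1."""
    h, w, d = F.shape
    q = W_q @ f_t                       # projected text query, shape (d_k,)
    k = F.reshape(h * w, d) @ W_k.T     # projected visual keys, shape (h*w, d_k)
    scores = k @ q                      # one relevance score per spatial position
    scores = scores - scores.max()      # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()   # softmax over all positions
    return weights.reshape(h, w)

rng = np.random.default_rng(0)
F = rng.normal(size=(4, 5, 8))    # hypothetical 4x5 feature map with 8 channels
f_t = rng.normal(size=8)          # hypothetical text embedding
W_q = rng.normal(size=(6, 8))
W_k = rng.normal(size=(6, 8))
A = cross_modal_attention(F, f_t, W_q, W_k)
```

Because the softmax runs over all h×w positions, the result behaves like a normalized heatmap of text relevance across the image.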

[0112] 2) Coarse-grained mask generation for locating suspected regions. The resolution of the attention map A is usually low, so it cannot be used directly for fine-grained region annotation. A lightweight U-Net decoding network is therefore designed to restore the spatial mask to the original image size. This decoder gradually restores the spatial resolution through multiple upsampling (deconvolution) layers and outputs the mask logits:

[0113] $$\hat{M} = \mathrm{UNetDecoder}(A)$$

[0114] Normalization with the Sigmoid function then yields a preliminary anomaly probability mask:

[0115] $$M_0 = \sigma(\hat{M})$$

[0116] Here, σ denotes the Sigmoid activation function, and each output value $M_0(i,j)$ represents the probability that pixel (i,j) in the image belongs to an anomalous region.

[0117] 3) Refine the boundary mask by combining semantics and image details. The initial mask $M_0$ is often blurry and imprecise; therefore, a mask optimizer based on the SAM framework is introduced to generate a high-quality, fine-grained boundary mask from the original image detail information, the language cue vectors, and the initial mask. The input image is encoded using a Vision Transformer, which outputs the image feature tensor $E_I$. The prompt encoder receives the prompt information provided by the user and converts it into a prompt vector $E_P$. The image features $E_I$, the prompt embedding $E_P$, and the coarse mask $M_0$ are input together for fusion modeling, and a high-resolution mask is output:

[0118] $$M_{final} = \mathrm{SAM}(E_I, E_P, M_0)$$

[0119] where $E_I$ denotes the image features, $E_P$ denotes the feature vector of the prompt, and $M_0$ denotes the coarse mask.

[0120] 4) Keyword selector to identify significant semantic tags related to red tides in the text. Monitoring text often contains semantic signals such as "seawater turning red," "algal bloom," and "high density of phytoplankton," which are important semantic bases for judging the occurrence of red tides. Therefore, a lightweight keyword selector is designed to identify target nouns on a token-by-token basis in the input text. Let the hidden state sequence output by the encoder for the text sequence be:

[0121] $$H = \{h_1, h_2, \dots, h_n\}$$

[0122] where $h_i$ denotes the text encoding feature of the i-th token. A linear classifier with a sigmoid activation function outputs the probability that each token is a target keyword:

[0123] $$p_i = \sigma(W_s h_i + b_s)$$

[0124] where $W_s$ denotes the weight and $b_s$ denotes the bias. Given a threshold τ, the final set of selected keywords is:

[0125] $$K_{select} = \{ w_i \mid p_i > \tau \}$$
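The token-level selector can be sketched as below. The hidden states, weights, and the threshold of 0.5 are hypothetical values chosen so that the result is easy to verify by hand; in practice the hidden states would come from the text encoder.

```python
import numpy as np

def select_keywords(tokens, H, W_s, b_s, tau=0.5):
    """Return tokens whose sigmoid score exceeds tau, plus all per-token probabilities."""
    p = 1.0 / (1.0 + np.exp(-(H @ W_s + b_s)))  # p_i = sigmoid(W_s h_i + b_s)
    selected = [tok for tok, p_i in zip(tokens, p) if p_i > tau]
    return selected, p

tokens = ["seawater", "turning", "red", "today"]
# Hypothetical 2-dimensional hidden states, one row per token
H = np.array([[2.0, 0.0], [-1.0, 0.0], [3.0, 0.0], [-2.0, 0.0]])
W_s = np.array([1.0, 0.0])
b_s = 0.0
selected, probs = select_keywords(tokens, H, W_s, b_s)
```

With these toy values the logits are 2, -1, 3, and -2, so only "seawater" and "red" pass the threshold.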

[0126] S3. Hierarchical multimodal feature modeling module.

[0127] In the red tide anomaly detection task, there are significant differences and potential complementarities between the spatial texture information of the image modality and the event semantics of the text modality. To accurately model the deep correlation between the two and improve the structured representation of anomaly patterns, this module designs a hierarchical Transformer architecture that integrates modality-specific-shared structures and multimodal capsule mechanisms. It extracts modality-discrepancy features and shared high-order semantics in stages, achieving hierarchical understanding and dynamic aggregation across modalities.

[0128] 1) Construct a local attention map based on the mask $M_{final}$ to obtain image modality embeddings. Let the original remote sensing image be I and the high-resolution mask be $M_{final}$. A local attention map is constructed using a salient region enhancement strategy:

[0129] $$I_{att} = M_{final} \odot I + \lambda\,(1 - M_{final}) \odot \mathrm{Blur}(I)$$

[0130] where ⊙ denotes pixel-wise multiplication, Blur(⋅) denotes the Gaussian blur operation, and λ∈[0,1] controls the degree of background degradation. This processing makes the model focus on suspected anomalous regions and weakens perturbations from non-target regions. The processed image is input into the CLIP-ViT visual encoder with frozen parameters to extract high-order visual embeddings:

[0131] $$F_I = \mathrm{CLIP\text{-}ViT}(I_{att})$$

[0132] The embeddings are then projected onto the cross-modal shared space:

[0133] $$Z_I = W_I F_I + b_I$$

[0134] where $W_I$ denotes the weight matrix and $b_I$ denotes the bias vector.
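A minimal NumPy sketch of the salient region enhancement step is given below. The blending form (mask keeps the original pixels, the background is replaced by a blurred copy scaled by λ) is an assumed reading of the description, and the separable Gaussian blur is a simple stand-in for whatever blur operator is actually used.

```python
import numpy as np

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian blur with edge padding (output has the input's shape)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    k = k / k.sum()                     # normalized 1-D kernel
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def emphasize(img, mask, lam=0.5):
    """Keep masked pixels; degrade the background (assumed blending form)."""
    return mask * img + lam * (1.0 - mask) * gaussian_blur(img)

img = np.full((6, 6), 2.0)   # hypothetical single-band image
mask = np.zeros((6, 6))
mask[2:4, 2:4] = 1.0         # hypothetical suspected red tide region
out = emphasize(img, mask)
```

In this sketch a smaller λ darkens the background more strongly, while λ = 1 only blurs it.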

[0135] 2) Let the original text input be $X_T = \{w_1, \dots, w_n\}$ and the high-confidence semantic labels extracted by the keyword selector be $K_{select} = \{k_1, \dots, k_m\}$. Word vector sequences are generated using a pre-trained text encoder (CLIP-TextEncoder):

[0136] $$E_T = \mathrm{CLIP\text{-}TextEncoder}(X_T)$$

[0137] For the keyword set $K_{select}$, a focused representation is constructed in the word vector space using an attention mechanism:

[0138] $$F_T = \sum_{j=1}^{m} \alpha_j\, e_{k_j}$$

[0139] where $e_{k_j}$ is the word embedding of keyword $k_j$ and $\alpha_j$ is the semantic importance weight assigned to each keyword. A linear projection is then used to unify the dimensions:

[0140] $$Z_T = W_T F_T + b_T$$

[0141] where $W_T$ denotes the weight matrix and $b_T$ denotes the bias vector.

[0142] 3) Modality-specific-shared structure construction. A gating mechanism is introduced to divide the input features into modality-specific features and shared features. Taking the image modality as an example, the gating process is as follows:

[0143] $$g_I = \sigma(W_g Z_I + b_g), \quad Z_I^{spec} = g_I \odot Z_I, \quad Z_I^{share} = (1 - g_I) \odot Z_I$$

[0144] where $g_I$ is the gate computed by the gating network for the image feature $Z_I$, $W_g$ and $b_g$ are learnable weights and biases, σ denotes the Sigmoid function, and ⊙ denotes element-wise multiplication. $Z_I^{spec}$ and $Z_I^{share}$ are the specific and shared features of the image modality, respectively. Similarly, the specific and shared features of the text modality, $Z_T^{spec}$ and $Z_T^{share}$, are obtained. $Z_I^{share}$ and $Z_T^{share}$ are fed into a collaborative Transformer structure for shared semantic modeling:

[0145] $$Z_{share} = \mathrm{CoTransformer}(Z_I^{share}, Z_T^{share})$$
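The gated split into modality-specific and shared parts can be sketched as follows. Assigning the gate g to the specific branch and 1 − g to the shared branch is an assumption made for illustration; the source only says that a sigmoid gate divides the features.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def split_specific_shared(Z, W_g, b_g):
    """Split a feature vector into specific and shared parts using a sigmoid gate."""
    g = sigmoid(W_g @ Z + b_g)       # gate values in (0, 1), one per feature dimension
    return g * Z, (1.0 - g) * Z      # assumed: g -> specific, 1 - g -> shared

rng = np.random.default_rng(0)
Z_I = rng.normal(size=8)             # hypothetical projected image feature
W_g = rng.normal(size=(8, 8))
b_g = np.zeros(8)
Z_spec, Z_share = split_specific_shared(Z_I, W_g, b_g)
```

With this complementary gating the two parts always sum back to the input, so no information is discarded by the split itself.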

[0146] 4) Multimodal capsule mechanism for dynamic aggregation of higher-order semantic structures. To further model the higher-order semantic structures and spatial aggregation patterns between the text and image modalities, a multimodal capsule mechanism based on dynamic routing is introduced. A linear mapping is applied to obtain the initial low-order capsule vectors $U = \{u_1, \dots, u_n\}$. Based on the trainable projection matrix $W_{ij}$, the prediction vector $\hat{u}_{j|i}$ of lower-order capsule i for higher-order capsule j is obtained:

[0147] $$\hat{u}_{j|i} = W_{ij}\, u_i$$

[0148] The prediction vectors are dynamically weighted and aggregated into higher-order capsule vectors through the routing coefficients $c_{ij}$, and the vector magnitude is compressed using the squash function to generate a probabilistic semantic entity representation $v_j$:

[0149] $$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}, \quad v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$

[0150] where $s_j$ is the aggregated vector before squashing. The squash function preserves the direction and only compresses the magnitude into the interval [0, 1), so that the vector length can be interpreted as the "probability of feature entity existence". The final output concatenates the shared semantic fusion feature $Z_{share}$, the modality-specific features $Z_I^{spec}$ and $Z_T^{spec}$, and the higher-order capsule output $V = \{v_1, \dots, v_k\}$ to form a multi-level semantic representation:

[0151] $$Z_{multi} = \mathrm{Concat}\left(Z_{share},\, Z_I^{spec},\, Z_T^{spec},\, V\right)$$
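The capsule aggregation can be sketched with NumPy as below. The routing coefficient update shown is the standard routing-by-agreement rule from the capsule network literature, which is assumed here because the source does not state how the coefficients are updated; the capsule counts and dimensions are hypothetical.

```python
import numpy as np

def squash(s, eps=1e-9):
    """Keep the direction of s and compress its length into [0, 1)."""
    n2 = np.sum(s**2, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, iters=3):
    """u_hat: (n_low, n_high, d) prediction vectors. Returns (n_high, d) capsules."""
    n_low, n_high, _ = u_hat.shape
    b = np.zeros((n_low, n_high))                      # routing logits
    for _ in range(iters):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)           # c_ij: softmax over high-level capsules
        s = np.einsum("ij,ijd->jd", c, u_hat)          # s_j = sum_i c_ij * u_hat_{j|i}
        v = squash(s)
        b = b + np.einsum("ijd,jd->ij", u_hat, v)      # agreement update (assumed rule)
    return v

rng = np.random.default_rng(0)
U = rng.normal(size=(5, 4))                # 5 low-order capsules of dimension 4
W = rng.normal(size=(5, 3, 6, 4))          # W_ij maps dimension 4 to dimension 6
u_hat = np.einsum("ijkd,id->ijk", W, U)    # u_hat_{j|i} = W_ij u_i
V = dynamic_routing(u_hat)
```

Each row of `V` is one higher-order capsule; its length stays below 1 and can be read as an existence probability.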

[0152] S4. Semantic path-guided attention module.

[0153] To enhance the understanding and detection accuracy of complex anomaly patterns, a semantic path-guided attention mechanism is proposed. This mechanism aims to mine spatial orientation information and causal chain structures in red tide description texts, construct structured semantic path embeddings, and guide the model to achieve anomaly detection logic modeling that is more in line with human cognition in a multimodal semantic fusion space. This provides a more interpretable and task-oriented feature foundation for subsequent knowledge compression and lightweight deployment.

[0154] 1) From the word embedding sequence $h_T = \{e_1, \dots, e_k\}$ obtained after the input text is encoded, spatial relationships and causal expressions within the text are identified using event extraction, dependency parsing, or knowledge template methods, and a semantic path set P is constructed:

[0155] $$P = \{p_1, p_2, \dots, p_K\}, \quad p_i = \left( e_{i_1} \xrightarrow{\, r_1^{(i)} \,} e_{i_2} \xrightarrow{\, r_2^{(i)} \,} \cdots \right)$$

[0156] where $e_i$ denotes the embedding vector of the i-th word, and each path $p_i$ is a series of entities or events connected by semantic relations $r_j^{(i)}$ (e.g., "temperature rise → eutrophication of water bodies → red tide outbreak"), embodying the reasoning chain of red tide occurrence.

[0157] 2) To embed these paths into the model's representation space, a path representation vector is constructed by weighted aggregation of the node embeddings in each path:

[0158] $$\mathbf{p}_k = \sum_{j} a_j \left( e_j + \mathrm{PE}(j) \right)$$

[0159] where $a_j$ is the attention weight of the j-th node in the path (reflecting its semantic importance in the path), and PE(j) is the positional encoding term used to preserve order information. The final result is a set of path embeddings:

[0160] $$\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_K\}$$

[0161] where $\mathbf{p}_k$ is the vector representation of the k-th path.

[0162] 3) The path embeddings are introduced into the fused semantic space output by the previous module for attention adjustment, so that the semantic paths act as "reasoning clues" that guide the model to focus on the logical key points. The fusion features output by multimodal hierarchical modeling and capsule aggregation are:

[0163] $$Z_{multi} = \{z_1, z_2, \dots, z_N\}$$

[0164] where each $z_i$ denotes the feature of one visual/text fusion unit. The degree of matching between each feature and all semantic paths is calculated to obtain the guided attention weights:

[0165] $$\beta_{i,j} = \mathrm{softmax}_j\left( z_i^{\top} W_n\, \mathbf{p}_j \right)$$

[0166] where $W_n$ is a learnable projection matrix and $\beta_{i,j}$ indicates the degree of semantic influence of path $\mathbf{p}_j$ on fusion feature $z_i$. Based on this attention distribution, the guided fusion representation $Z_{fused}$ is constructed:

[0167] $$Z_{fused} = \left\{ z_i + \sum_{j} \beta_{i,j}\, \mathbf{p}_j \right\}_{i=1}^{N}$$
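The path-guided attention step can be sketched as follows. The bilinear matching score and the residual form of the fused output are assumptions made for illustration, as are the array sizes; the source states only that matching degrees between fusion features and path embeddings are turned into attention weights that guide the fused representation.

```python
import numpy as np

def path_guided_fusion(Z, P, W_n):
    """Z: (N, d) fusion units; P: (K, d) path embeddings; W_n: (d, d) projection.
    Returns attention weights beta (N, K) and the guided features (N, d)."""
    scores = Z @ W_n @ P.T                                # matching score per unit/path pair
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    beta = np.exp(scores)
    beta = beta / beta.sum(axis=1, keepdims=True)         # softmax over paths
    return beta, Z + beta @ P                             # assumed residual combination

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 8))    # 4 hypothetical fusion units
P = rng.normal(size=(3, 8))    # 3 hypothetical semantic path embeddings
W_n = rng.normal(size=(8, 8))
beta, Z_fused = path_guided_fusion(Z, P, W_n)
```

Each row of `beta` shows how strongly the candidate reasoning paths influence one fusion unit, which is what makes the weights usable for interpretation.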

[0168] S5. Multimodal knowledge distillation and lightweight deployment module.

[0169] To enable rapid deployment and efficient operation of the red tide anomaly detection model on devices, this module is based on image-semantic joint representation. It aims to significantly compress the parameter scale and reduce computational complexity while maintaining the model's capabilities in image anomaly perception and semantic understanding, thereby achieving the model's usability and real-time performance in real-world environments.

[0170] 1) Teacher model design and training. The feature vector $Z_{fused}$, which fuses image features and semantic features, is input into the teacher model. The teacher model employs a large-scale Transformer architecture to perform end-to-end modeling of the input. The encoder part consists of stacked multi-head self-attention and feedforward networks, and its output at layer l can be expressed as:

[0171] $$H^{(l)} = \mathrm{FFN}\left( \mathrm{MHAtt}\left( H^{(l-1)} \right) \right), \quad H^{(0)} = Z_{fused}$$

[0172] where FFN denotes the feedforward network, MHAtt denotes the multi-head self-attention mechanism, $H^{(0)}$ denotes the initial feature, and $Z_{fused}$ denotes the feature vector resulting from the fusion of image and semantic features. In the final layer, the teacher model outputs a high-order semantic vector $h_{teacher}$.

[0173] 2) Student model distillation and lightweight deployment. To transfer the teacher model's high-dimensional capabilities to a lightweight model, a multi-term distillation loss is designed. During distillation, the student model structure is simplified to a small number of attention layers and a small feedforward network; it takes the same features as input and outputs the predicted features $h_{student}$. The distillation loss comprises three parts: soft target distillation, feature alignment distillation, and inter-layer attention alignment, specifically:

[0174] $$L_{soft} = \mathrm{KL}\left( \sigma(h_{teacher}/\tau) \,\|\, \sigma(h_{student}/\tau) \right), \quad L_{feat} = \left\| h_{teacher} - h_{student} \right\|_2^2, \quad L_{att} = \sum_{l} \left\| A_{teacher}^{(l)} - A_{student}^{(l)} \right\|_F^2$$

[0175] where $h_{teacher}$ denotes the predicted features generated by the teacher model, $h_{student}$ denotes the predicted features generated by the student model, KL denotes the KL divergence, σ denotes softmax, τ is the temperature coefficient, and $A^{(l)}$ is the attention weight matrix of the l-th layer. The total loss is:

[0176] $$L_{total} = \lambda_1 L_{soft} + \lambda_2 L_{feat} + \lambda_3 L_{att}$$

[0177] where $\lambda_1$, $\lambda_2$, and $\lambda_3$ denote the weighting coefficients.
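A minimal NumPy sketch of a three-part distillation loss follows. The squared-error forms of the feature and attention terms, the one-to-one matching of teacher and student attention layers, and the default coefficients are assumptions; the source names only the three components, the KL divergence, the softmax, and the temperature.

```python
import numpy as np

def softmax(x, tau=1.0):
    z = x / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def distill_loss(h_t, h_s, att_t, att_s, tau=2.0, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of soft-target, feature-alignment and attention-alignment terms."""
    p, q = softmax(h_t, tau), softmax(h_s, tau)
    l_soft = np.sum(p * (np.log(p) - np.log(q)))          # KL(teacher || student)
    l_feat = np.sum((h_t - h_s) ** 2)                     # assumed squared L2 distance
    l_att = sum(np.sum((a_t - a_s) ** 2)                  # assumed squared Frobenius distance
                for a_t, a_s in zip(att_t, att_s))        # over matched layers
    l1, l2, l3 = lambdas
    return float(l1 * l_soft + l2 * l_feat + l3 * l_att)

rng = np.random.default_rng(0)
h_teacher = rng.normal(size=5)
h_student = rng.normal(size=5)
att_teacher = [rng.random((3, 3)) for _ in range(2)]   # hypothetical attention maps
att_student = [rng.random((3, 3)) for _ in range(2)]
loss = distill_loss(h_teacher, h_student, att_teacher, att_student)
zero = distill_loss(h_teacher, h_teacher, att_teacher, att_teacher)
```

When the student reproduces the teacher exactly, all three terms vanish, which gives a quick sanity check for an implementation.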

[0178] S6. Red Tide Anomaly Detection and Intelligent Early Warning Module

[0179] This module aims to achieve accurate identification of marine red tide anomalies, temporal risk evolution modeling, and multi-dimensional early warning output based on the output of a student model that integrates multimodal information. Given the complex marine ecological environment and the multi-source heterogeneity of red tide triggering factors, single-modal information is easily affected by local observation errors or anomalous disturbances, making it difficult to support stable and interpretable early warning judgments. Therefore, the system relies on the semantic fusion feature vector sequence generated by the aforementioned student model to construct an end-to-end anomaly detection and intelligent early warning mechanism.

[0180] 1) Anomaly score generation and judgment mechanism. To accurately identify whether a red tide anomaly exists at the current moment, the module first performs anomaly scoring based on the fused feature sequence output by the student model. Let $z_s^{(t)}$ be the feature output by the student model at time t. Unsupervised modeling is performed using an autoencoder structure. The autoencoder consists of an encoder function ϕ(⋅) and a decoder function ψ(⋅), and its goal is to learn a low-dimensional compression-reconstruction mapping on the normal data distribution. The reconstructed output is defined as:

[0181] $$\hat{z}_s^{(t)} = \psi\left( \phi\left( z_s^{(t)} \right) \right)$$

[0182] Using the reconstruction error as the basis for anomaly scoring, the anomaly score is defined as:

[0183] $$A^{(t)} = \left\| z_s^{(t)} - \hat{z}_s^{(t)} \right\|_2^2$$

[0184] where $A^{(t)}$ is the anomaly score at the current moment and $\hat{z}_s^{(t)}$ is the feature vector reconstructed by the autoencoder. When $A^{(t)}$ exceeds a preset threshold δ, that is, $A^{(t)} > \delta$, it is determined that there is a potential red tide anomaly at the current moment.
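Reconstruction-error scoring can be illustrated with the sketch below. A linear autoencoder fitted by SVD (equivalent to PCA) is used as a simple stand-in for the learned encoder and decoder, and the synthetic data and threshold value are hypothetical; none of these choices come from the source.

```python
import numpy as np

class LinearAutoencoder:
    """Stand-in autoencoder: projects onto the top-k principal directions."""
    def fit(self, X, k):
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[:k]
        return self

    def encode(self, z):                  # plays the role of phi
        return (z - self.mean) @ self.components.T

    def decode(self, c):                  # plays the role of psi
        return c @ self.components + self.mean

    def score(self, z):                   # squared reconstruction error
        return float(np.sum((z - self.decode(self.encode(z))) ** 2))

rng = np.random.default_rng(0)
# Hypothetical "normal" features that vary only in the first two dimensions
normal = rng.normal(size=(200, 2)) @ np.eye(6)[:2]
ae = LinearAutoencoder().fit(normal, k=2)

threshold = 1.0                           # hypothetical preset threshold
anomalous = np.array([0.0, 0.0, 3.0, 0.0, 0.0, 0.0])
normal_score = ae.score(normal[0])
anomaly_score = ae.score(anomalous)
is_anomaly = anomaly_score > threshold
```

Features that stay on the learned normal subspace reconstruct almost perfectly, while the off-subspace sample yields a score near 9 and is flagged.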

[0185] 2) Anomaly type interpretation and multi-dimensional early warning output. In addition to anomaly detection, the system analyzes the semantic structure and temporal evolution characteristics of the anomaly. Based on the fusion feature $z_s^{(t)}$ output by the student model, a feature-semantic mapping mechanism is established by combining key factor tags extracted from red tide monitoring texts (such as "rising water temperature," "nutrient enrichment," and "frontal stability"). An attention-weighted approach is used to calculate the contribution of each semantic factor to the anomaly score, enabling interpretive modeling of the potential causes of abnormal events. The candidate semantic factor set is defined as $\{l_1, l_2, \dots, l_K\}$. The attention score of the fused features in the factor embedding space is:

[0186] $$\alpha_k^{(t)} = \mathrm{softmax}_k\left( e_{l_k}^{\top} W_a\, z_s^{(t)} \right)$$

[0187] where $e_{l_k}$ denotes the embedding vector of semantic factor $l_k$ and $W_a$ denotes the learnable weight parameters. The attention score $\alpha_k^{(t)}$ can be used to explain the risk contribution of each semantic factor to the current anomaly, thereby assisting in the analysis of the anomaly's causes, evolution path, and possible scope of impact.

[0188] 3) Structured early warning information release and visualization. The system organizes the anomaly score $A^{(t)}$, the anomaly type interpretation weights $\{\alpha_k^{(t)}\}$, the corresponding spatiotemporal location, the risk level, and the historical evolution trend into a unified structured early warning output. The output includes the anomaly detection result (whether an anomaly exists and its intensity), explanations of the key triggers (high-weight semantic factors and their contributions), and the risk level classification (dynamically graded according to scoring thresholds). The system supports visual output, allowing the information to be displayed as charts, time-series curves, and similar forms.
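The factor attribution and structured warning output can be sketched together as follows. The bilinear score form, the two grading thresholds, the record field names, and the sample time and coordinates are all hypothetical and chosen only to show the shape of such an output.

```python
import numpy as np

def factor_attention(z, factor_emb, W_a):
    """Softmax attention of the fused feature z over semantic factor embeddings."""
    scores = factor_emb @ (W_a @ z)
    scores = scores - scores.max()
    e = np.exp(scores)
    return e / e.sum()

def risk_level(score, thresholds=(1.0, 5.0)):
    """Grade the anomaly score against two hypothetical thresholds."""
    low, high = thresholds
    return "high" if score > high else "medium" if score > low else "normal"

def build_warning(score, alphas, factors, when, where):
    """Assemble one structured warning record with factors ranked by contribution."""
    ranked = sorted(zip(factors, alphas), key=lambda fa: fa[1], reverse=True)
    return {
        "time": when,
        "location": where,
        "anomaly_score": score,
        "is_anomaly": risk_level(score) != "normal",
        "risk_level": risk_level(score),
        "factors": [(name, float(a)) for name, a in ranked],
    }

factors = ["rising water temperature", "nutrient enrichment", "frontal stability"]
factor_emb = np.array([[2.0, 0.0], [0.0, 0.0], [-1.0, 0.0]])  # hypothetical embeddings
W_a = np.eye(2)
z = np.array([1.0, 0.0])                                       # hypothetical fused feature
alphas = factor_attention(z, factor_emb, W_a)
warning = build_warning(6.0, alphas, factors, "2025-08-01T10:00", (36.0, 120.5))
```

A record of this kind can be serialized directly for charts or time-series displays, with the ranked factor list serving as the explanation of key triggers.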

[0189] Experimental verification:

[0190] To verify the effectiveness of the proposed red tide anomaly detection method and system based on an improved multimodal Transformer, a real-world experimental platform covering multimodal information was constructed for typical nearshore red tide-prone sea areas in China. Data sources include medium- and high-resolution remote sensing images, historical red tide observation records, measured factors from monitoring buoys (temperature, salinity, pH, chlorophyll concentration, etc.), and manually annotated anomaly reports, forming multiple sets of samples that cover all evolutionary stages of red tides (initial appearance, expansion, outbreak, and recession). To make the experiment more representative of challenging real-world scenarios, complex situations such as cloud cover in remote sensing images, redundancy in observation record text, and missing monitoring indicators were deliberately simulated during data construction, so as to test the system's robustness and generalization ability under uncertain conditions.

[0191] The comparative experiments cover mainstream multimodal anomaly detection and marine intelligent sensing models: the cross-modal matching Transformer (CMMT), ConvLSTM with a spatiotemporal modeling structure, ASTGCN combining graph neural networks and attention mechanisms, BLIP-2 based on an image-text alignment structure, the multimodal fusion representation model CLIP, and the method proposed here. Training and testing were conducted under a unified data partition (training set : validation set : test set = 6:2:2) with a consistent optimizer and learning rate policy to ensure a fair comparison. The performance evaluation metrics were accuracy, F1 score, spatial localization error of anomaly areas, early warning lead time, false alarm rate, and inference latency. Each method was evaluated along three dimensions: detection accuracy, response time, and deployment efficiency. The experimental results show that the proposed method performs best on all six evaluation metrics, supporting its advantages in multimodal data processing and red tide detection tasks.

[0192] Table 1. Comparison of the different methods on the six evaluation metrics.

[0193]
Model name                        ACC     F1      Spatial positioning error   Early warning lead time   False alarm rate   Inference delay
CMMT                              86.7%   84.2%   5.3 km                      1.8 days                  11.5%              480 s
ConvLSTM                          82.4%   79.6%   6.8 km                      1.2 days                  15.8%              210 s
ASTGCN                            84.1%   81.3%   6.1 km                      1.4 days                  13.2%              320 s
BLIP-2                            88.3%   85.7%   4.9 km                      2.0 days                  10.3%              920 s
CLIP                              85.6%   83.1%   5.6 km                      1.7 days                  12.1%              690 s
Method of the present invention   91.4%   89.2%   4.3 km                      2.4 days                  8.4%               180 s

[0194] As can be seen from Table 1 and Figure 2, mainstream methods such as CMMT, ConvLSTM, ASTGCN, BLIP-2, and CLIP all achieve a certain level of performance in red tide anomaly detection, but they still have significant shortcomings in key performance indicators. ConvLSTM has strong memory capabilities in temporal modeling, but its detection accuracy is low due to the lack of spatial dependency structure modeling: its accuracy and F1 score are 82.4% and 79.6%, respectively, the lowest among all models, and its spatial positioning error reaches 6.8 km. ASTGCN models temporal and spatial correlations through a graph attention mechanism, and its early warning lead time (1.4 days) is better than that of ConvLSTM (1.2 days). However, because of its static graph structure, it has difficulty adapting to the complex and ever-changing red tide propagation paths, and its false alarm rate remains at 13.2%. BLIP-2 and CLIP introduce a joint image-text learning mechanism that enhances the model's ability to understand remote sensing images and text information, achieving accuracies of 88.3% and 85.6%, respectively. They also control the spatial positioning error reasonably well (4.9 km and 5.6 km, respectively). However, both suffer from high inference latency (> 600 s), making it difficult to meet real-time early warning requirements. CMMT employs a cross-modal matching mechanism and outperforms ConvLSTM, ASTGCN, and CLIP in F1 score (84.2%) and early warning lead time (1.8 days). However, it still has limitations such as a high false alarm rate (11.5%) and unstable fusion features.

[0195] In contrast, the method proposed in this invention significantly improves accuracy (91.4%), F1 score (89.2%), spatial positioning error (4.3 km), and early warning lead time (2.4 days) by introducing visual localization and mask prediction, multimodal dual-channel coding feature mechanism, and distillation optimization deployment strategy. Moreover, the inference latency is controlled within 180s, demonstrating excellent detection accuracy, response speed, and deployment feasibility. This verifies the strong robustness and practical value of this method in complex multimodal environments.

[0196] To verify the model's robustness under input perturbation, the system was tested with different input dropout rates to simulate real-world data loss or sensor malfunction. The results are shown in Figure 3, which plots the accuracy (ACC) of each model as the dropout ratio gradually increases from 0% to 40%. The method of this invention (solid red line) maintains high accuracy under all levels of perturbation: its ACC decreases only slightly, from 91.4% to 86%, with the overall fluctuation kept within 6 percentage points, demonstrating significant robustness. In contrast, CLIP and BLIP-2 are more sensitive to perturbations, with their accuracy decreasing to 78% and 80%, respectively, while ConvLSTM and ASTGCN degrade more severely under high dropout conditions, with accuracy dropping to around 73%. These results indicate that the proposed method has strong fault tolerance and can effectively resist performance degradation caused by input anomalies, making it suitable for early warning tasks under uncertainty in complex marine monitoring scenarios.

[0197] A computer-readable storage medium stores a plurality of instructions, the instructions being adapted to be loaded by a processor of a terminal device to execute the red tide anomaly detection method based on an improved multimodal Transformer.

[0198] A terminal device includes a processor and a computer-readable storage medium, the processor being configured to implement the instructions, and the computer-readable storage medium being configured to store a plurality of instructions adapted to be loaded by the processor to execute the aforementioned red tide anomaly detection method based on an improved multimodal Transformer.

[0199] The above are all preferred embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Therefore, all equivalent changes made in accordance with the structure, shape and principle of the present invention should be covered within the scope of protection of the present invention.

Claims

1. A red tide anomaly detection method based on an improved multimodal Transformer, characterized in that it comprises:
acquiring remote sensing images and text data;
performing data preprocessing on the acquired remote sensing images and text data respectively;
performing visual localization and text selection based on the preprocessed data;
performing cross-modal feature learning using a hierarchical Transformer based on a multimodal capsule mechanism;
performing image-semantic feature alignment optimization based on a semantic path-guided attention mechanism;
performing multimodal knowledge distillation on the optimized features; and
predicting marine red tide anomalies using a student model that is obtained by knowledge distillation and fuses multimodal information;
wherein the cross-modal feature learning using the hierarchical Transformer based on the multimodal capsule mechanism comprises: using a hierarchical Transformer architecture that integrates a modality-specific-shared structure and the multimodal capsule mechanism to extract modality-difference features and shared high-order semantics in stages, achieving hierarchical understanding and dynamic aggregation across modalities; specifically, for image data, constructing a local attention map based on the mask $M_{final}$ to obtain image modality embeddings; for text data, generating word vector sequences using a pre-trained text encoder; introducing a gating mechanism to divide the input features into modality-specific features and shared features so as to construct the modality-specific-shared structure; and introducing a multimodal capsule mechanism based on dynamic routing, in which initial low-order capsule vectors $U = \{u_1, \dots, u_n\}$ are obtained by linear mapping, and the prediction vector $\hat{u}_{j|i}$ of lower-order capsule i for higher-order capsule j is obtained based on the trainable projection matrix $W_{ij}$: $$\hat{u}_{j|i} = W_{ij}\, u_i$$ the prediction vectors are dynamically weighted and aggregated into higher-order capsule vectors through the routing coefficients $c_{ij}$, and the vector magnitude is compressed using the squash function to generate a probabilistic semantic entity representation $v_j$: $$s_j = \sum_i c_{ij}\, \hat{u}_{j|i}, \quad v_j = \mathrm{squash}(s_j) = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \cdot \frac{s_j}{\|s_j\|}$$ where $s_j$ is the aggregated vector before squashing; the final output concatenates the shared semantic fusion feature $Z_{share}$, the modality-specific features $Z_I^{spec}$ and $Z_T^{spec}$, and the higher-order capsule output $V = \{v_1, \dots, v_k\}$ to form a multi-level semantic representation: $$Z_{multi} = \mathrm{Concat}\left(Z_{share},\, Z_I^{spec},\, Z_T^{spec},\, V\right)$$
wherein the image-semantic feature alignment optimization based on the semantic path-guided attention mechanism comprises: from the word embedding sequence $h_T = \{e_1, \dots, e_k\}$ obtained after the input text is encoded, identifying spatial relationships and causal expressions within the text using event extraction methods, and constructing a semantic path set P: $$P = \{p_1, p_2, \dots, p_K\}$$ where $e_i$ denotes the embedding vector of the i-th word, and each path $p_i$ is a series of entities connected by semantic relations $r_j^{(i)}$; constructing a path representation vector by weighted aggregation of the node embeddings in each path: $$\mathbf{p}_k = \sum_{j} a_j \left( e_j + \mathrm{PE}(j) \right)$$ where $a_j$ denotes the attention weight of the j-th node in the path and PE(j) is the positional encoding term, ultimately obtaining the path embedding set: $$\mathcal{P} = \{\mathbf{p}_1, \mathbf{p}_2, \dots, \mathbf{p}_K\}$$ where $\mathbf{p}_k$ is the vector representation of the k-th path;
wherein the image-semantic feature alignment optimization based on the semantic path-guided attention mechanism further comprises: introducing the path embeddings into the fused semantic space output by the previous module for attention adjustment, so that the semantic paths act as reasoning clues that guide the model to focus on logical key points; the fused features output by multimodal hierarchical modeling and capsule aggregation are $Z_{multi} = \{z_1, z_2, \dots, z_N\}$, where each $z_i$ denotes the feature of one visual/text fusion unit; calculating the degree of matching between each feature and all semantic paths to obtain the guided attention weights: $$\beta_{i,j} = \mathrm{softmax}_j\left( z_i^{\top} W_n\, \mathbf{p}_j \right)$$ where $W_n$ is a learnable projection matrix and $\beta_{i,j}$ indicates the degree of semantic influence of path $\mathbf{p}_j$ on fusion feature $z_i$; and constructing the guided fusion representation $Z_{fused}$ based on the attention distribution: $$Z_{fused} = \left\{ z_i + \sum_{j} \beta_{i,j}\, \mathbf{p}_j \right\}_{i=1}^{N}$$
wherein performing multimodal knowledge distillation on the optimized features comprises: inputting the feature vector $Z_{fused}$, which fuses image features and semantic features, together with the detailed text description $T_{LLM}$ generated by the prompt word-driven module, into the teacher model, which performs end-to-end modeling of the input based on a large-scale Transformer structure; the encoder part consists of stacked multi-head self-attention and feedforward networks, and its output at the l-th layer is expressed as: $$H^{(l)} = \mathrm{FFN}\left( \mathrm{MHAtt}\left( H^{(l-1)} \right) \right)$$ where FFN denotes the feedforward network, MHAtt denotes the multi-head self-attention mechanism, $H^{(0)}$ denotes the initial feature, $Z_{fused}$ denotes the feature vector fused from image features and semantic features, and $T_{LLM}$ denotes the detailed text description generated by the prompt word-driven module; in the final layer, the teacher model outputs a high-order semantic vector $h_{teacher}$;
wherein performing multimodal knowledge distillation on the optimized features further comprises: during distillation, based on a multi-term distillation loss, simplifying the student model structure to a small number of attention layers and a small feedforward network, with the same features as input and the predicted features $h_{student}$ as output; the distillation loss includes soft target distillation, feature alignment distillation, and inter-layer attention alignment, and is expressed as: $$L_{soft} = \mathrm{KL}\left( \sigma(h_{teacher}/\tau) \,\|\, \sigma(h_{student}/\tau) \right), \quad L_{feat} = \left\| h_{teacher} - h_{student} \right\|_2^2, \quad L_{att} = \sum_{l} \left\| A_{teacher}^{(l)} - A_{student}^{(l)} \right\|_F^2$$ where $h_{teacher}$ denotes the predicted features generated by the teacher model, $h_{student}$ denotes the predicted features generated by the student model, KL denotes the KL divergence, σ denotes softmax, τ is the temperature coefficient, and $A^{(l)}$ is the attention weight matrix of the l-th layer; the total loss is: $$L_{total} = \lambda_1 L_{soft} + \lambda_2 L_{feat} + \lambda_3 L_{att}$$ where $\lambda_1$, $\lambda_2$, and $\lambda_3$ denote the weighting coefficients;
wherein predicting marine red tide anomalies using the student model that is obtained by knowledge distillation and fuses multimodal information comprises: performing anomaly scoring based on the fused feature sequence output by the student model; for the feature $z_s^{(t)}$ output by the student model at time t, performing unsupervised modeling using an autoencoder structure, the reconstruction output of the autoencoder being defined as: $$\hat{z}_s^{(t)} = \psi\left( \phi\left( z_s^{(t)} \right) \right)$$ using the reconstruction error as the basis for anomaly scoring, the anomaly score being expressed as: $$A^{(t)} = \left\| z_s^{(t)} - \hat{z}_s^{(t)} \right\|_2^2$$ where $A^{(t)}$ is the anomaly score at the current moment, $z_s^{(t)}$ is the fused feature output by the student model, and $\hat{z}_s^{(t)}$ is the feature vector reconstructed by the autoencoder; and inputting the initial anomaly text description $T_{init}$ generated by the prompt word-driven module into the large model to generate an updated anomaly text description, the large language model $G_\theta(\cdot)$ performing semantic understanding and anomaly language generation on the input and ultimately producing an enhanced red tide anomaly warning text: $$T_{warn} = G_\theta\left( T_{init},\, z_s^{(t)},\, \mathrm{Prompt} \right)$$ where $T_{warn}$ denotes the context-aware, feature-driven anomaly description text, $T_{init}$ denotes the initial anomaly text description, $z_s^{(t)}$ denotes the fused feature output by the student model, and Prompt denotes the prompt text.

2. The method according to claim 1, wherein performing data preprocessing on the acquired remote sensing images and text data respectively comprises: removing high-frequency noise components from the remote sensing images using the discrete wavelet transform (DWT), extracting key points and their descriptors using the scale-invariant feature transform (SIFT), and constructing an image feature set; for the red tide text data, removing noise information in the unstructured text through text cleaning, extracting semantic relationships from the identified entities, and obtaining structured triples using a structured relation classifier based on a multi-head attention mechanism; and performing timestamp unification and standardization and image feature completion, wherein pixel-level interpolation is performed based on neighboring pixels in the local image space to estimate the content of the missing region, as follows: $$\hat{I}(x,y) = \frac{1}{Z} \sum_{(i,j) \in \Omega} w_{ij}\, I(x+i,\, y+j)$$ where $\hat{I}(x,y)$ is the estimated pixel value at position (x, y) in the image, i.e., the interpolated and filled-in pixel; $I(x+i, y+j)$ is the value of the known neighboring pixel at position (x+i, y+j); $\Omega$ is the interpolation window range; $w_{ij}$ is the interpolation weight coefficient; and Z is the weight normalization factor.

3. The red tide anomaly detection method based on an improved multimodal Transformer according to claim 2, characterized in that the visual localization and text selection based on the preprocessed data comprises: constructing a cross-modal attention map based on the remote sensing images, in which deep visual features are extracted using the image encoder ViT, global text semantics are extracted using the language encoder BERT, and an attention map A is constructed using a cross-modal attention mechanism, where each value $A_{i,j}$ of the attention map A represents the correlation between position (i,j) in the image and the semantics of the red tide text; restoring the attention map A to a spatial mask of the original image size using the U-Net decoding network, and obtaining a preliminary anomaly probability mask $M_0$ after normalization by the Sigmoid function, where σ denotes the Sigmoid activation function and each output value $M_0(i,j)$ represents the probability that pixel (i,j) in the image belongs to an anomalous region; and introducing a mask optimizer based on the SAM framework to generate a fine boundary mask using the original image detail information, the language cue vectors, and the initial mask.

4. A red tide anomaly detection system based on an improved multimodal Transformer, which executes the red tide anomaly detection method based on an improved multimodal Transformer as described in claim 1, characterized in that it comprises: a data acquisition module configured to acquire remote sensing image and text data; a preprocessing module configured to perform data preprocessing on the acquired remote sensing images and text data respectively; a selection module configured to perform visual localization and text selection based on the preprocessed data; an alignment module configured to perform cross-modal feature learning and feature alignment based on image-text feature encoding; an optimization module configured to perform image-semantic feature alignment optimization based on a prompt word-driven generation mechanism; a distillation module configured to perform multimodal knowledge distillation on the optimized features; and a prediction module configured to predict marine red tide anomalies based on a student model that is obtained by knowledge distillation and fuses multimodal information.