A Batch Multimodal Data Alignment Method and System Based on CLIP Model
By using a batch multimodal data alignment method based on the CLIP model, the problems of low efficiency and poor scalability in multimodal data processing in big data communication are solved. This enables efficient identification and monitoring of harmful information, improves resource utilization and system flexibility, and supports real-time monitoring and blocking of the propagation links of harmful industries.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NANJING UNIV OF SCI & TECH
- Filing Date
- 2025-07-02
- Publication Date
- 2026-06-30
Figure CN120744583B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of information processing technology, specifically relating to a batch multimodal data alignment method and system based on the CLIP model.
Background Technology
[0002] In the field of artificial intelligence, understanding and aligning multimodal data (such as images and text) has always been an important research direction. With the rapid development of internet and communication technologies, the scale of multimodal data has exploded, including user behavior logs in communication networks, text and image content on social media, massive amounts of product images and descriptions on e-commerce platforms, and audio and video data transmitted in network traffic. How to efficiently process and analyze this data is crucial for network optimization by telecommunications operators, content security governance, intelligent customer service response, and for cybersecurity agencies to conduct real-time monitoring and intelligent prevention of malicious industry behavior.
[0003] Traditional multimodal data processing methods typically rely on models designed for specific tasks, such as image classification models, text classification models, and image caption generation models. These models often require large amounts of manually labeled data for training, which is costly and difficult to scale to new data modalities or task types. Furthermore, for batch processing of large-scale data, traditional methods suffer from bottlenecks in computational efficiency and resource utilization, especially in communication big data scenarios, where the conflict between the concurrent processing needs of massive real-time data and the limitations of traditional architectures becomes increasingly prominent.
[0004] In recent years, contrastive learning has become an important research direction in the field of multimodal representation learning. This method effectively maps data from different modalities to a shared semantic space by learning the correlations between them, thereby achieving cross-modal understanding and alignment. The Contrastive Language-Image Pre-trained Model (CLIP) is a representative achievement in this field. CLIP achieves powerful cross-modal semantic representation capabilities and exhibits significant zero-shot transfer potential through contrastive learning on large-scale image-text pair datasets. This allows CLIP to be directly applied to various downstream tasks without task-specific fine-tuning, greatly improving the model's flexibility and generalization ability. In communication big data scenarios, CLIP's zero-shot capability is particularly suitable for quickly identifying new types of malicious language, malicious link templates, and other characteristics of malicious industry variations.
[0005] Currently, some solutions exist for multimodal data processing using CLIP, but they still have shortcomings in feature alignment and efficient processing of large-scale batch data. For example:
[0006] CLIP-based feature extraction (either line-by-line or small-batch): In communication network log analysis, processing user behavior data line by line can lead to excessive computational latency, making it difficult to meet real-time requirements.
[0007] Task-specific batch processing based on fine-tuning CLIP: For the task of classifying malicious industries, data must be labeled and the model adjusted frequently, which makes it difficult to keep up with the rapid iteration of malicious industries.
[0008] Simple parallelized CLIP feature extraction: When processing image-text mixed data in communication traffic, the lack of a dynamic resource scheduling mechanism can easily lead to GPU / CPU load imbalance and affect system throughput.
[0009] Currently, traditional methods face the following core challenges when processing communication big data and analyzing malicious industries:
[0010] Efficiency bottleneck: Massive multimodal data (such as billions of user messages and millions of malicious images per day) is processed inefficiently, making it difficult to meet the needs of real-time monitoring.
[0011] Scalability limitations: The malicious-feature library, which relies on manual annotation, is updated slowly and cannot quickly adapt to new malicious methods.
[0012] Insufficient cross-modal association: Existing methods struggle to establish cross-modal association networks among malicious text, malicious app icons, and malicious links, which weakens risk tracing capabilities.
[0013] Resource waste: Traditional batch processing does not fully utilize the distributed computing resources of telecommunications operators (such as edge node computing power), resulting in high computing costs.
Summary of the Invention
[0014] To address the problems mentioned in the background, this invention proposes a batch multimodal data alignment method and system based on the CLIP model. Aimed at communication big data and malicious industry analysis, it uses the image encoder and text encoder of the CLIP model to jointly learn representations of mixed-modal data in communication networks, such as malicious icons, malicious language, and malicious recorded speech. This constructs a cross-modal semantic space covering the implicit features of communication protocols and the behavior of the malicious industry ecosystem, providing end-to-end awareness capabilities for communication network attack and defense drills and for operations against malicious industries.
[0015] Technical Solution: To solve the above-mentioned technical problems, the present invention adopts the following technical solution:
[0016] A batch multimodal data alignment method based on the CLIP model is proposed. By batch processing multimodal data in communication networks and combining the cross-modal alignment capability of the CLIP model, it identifies malicious text, malicious images, and cross-modal risk links, supporting communication operators in real-time monitoring and intelligent blocking of malicious industry propagation links. The method includes the following steps:
[0017] S1: Receive batch multimodal data;
[0018] S2: Data preprocessing;
[0019] S3: Feature alignment is achieved through batch feature extraction based on the CLIP model;
[0020] S4: Batch classification based on Prompt template;
[0021] S5: Result generation and output, and visualization processing.
[0022] As a preferred option, in S1, images and text information on the Internet are collected and analyzed in batches, and harmful information is identified and blocked through dynamic keyword database and image feature matching.
[0023] As a preferred option, in S2, the collected multimodal data is standardized to filter out normal communication noise, retain the key characteristics of malicious industries, and provide high-quality input for subsequent feature alignment.
[0024] As a preferred option, in S3, the specific process of batch feature extraction based on the CLIP model to achieve feature alignment is as follows:
[0025] S31: Use the CLIP model image encoder and text encoder to batch extract features from preprocessed images and text;
[0026] S32: Dynamically schedule and allocate computing resources for the extracted image and text features;
[0027] S33: Optimize the feature vectors through batch contrastive learning and store them.
[0028] Preferably, in S31, a pre-trained CLIP model image encoder is used to extract features from the batch image data to obtain the batch image feature matrix $F_I \in \mathbb{R}^{N \times D_I}$;
[0029] where $F_I$ denotes the batch image feature matrix, $N$ denotes the batch size, $D_I$ denotes the dimension of the image features, and $\mathbb{R}^{N \times D_I}$ denotes the dimensions of the image feature matrix;
[0030] The pre-trained CLIP model text encoder is used to extract features from the batch text data to obtain the batch text feature matrix $F_T \in \mathbb{R}^{N \times D_T}$;
[0031] where $F_T$ denotes the batch text feature matrix, $N$ denotes the batch size, $D_T$ denotes the dimension of the text features, and $\mathbb{R}^{N \times D_T}$ denotes the dimensions of the text feature matrix.
[0032] Preferably, in S33, a contrastive loss function is constructed for batch data, and the contrastive loss function $L$ is expressed as:
[0033] $$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(I_i, T_j)/\tau\right) + \sum_{j=1}^{N}\exp\left(\mathrm{sim}(T_i, I_j)/\tau\right)},$$
[0034] where $I_j$ and $T_j$ denote the feature vectors of the $j$-th image and the $j$-th text, respectively; $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity function; $\tau$ denotes a temperature hyperparameter; $N$ denotes the number of image-text pairs in a batch; $\exp(\cdot)$ denotes the exponential function; $i$ denotes the index of the image-text pair in the batch; $\sum_{j=1}^{N}\exp\left(\mathrm{sim}(I_i, T_j)/\tau\right)$ denotes the sum of the exponentials of the similarities between the feature vector $I_i$ of the $i$-th image and the feature vectors $T_j$ of all texts within the batch; $\sum_{j=1}^{N}\exp\left(\mathrm{sim}(T_i, I_j)/\tau\right)$ denotes the sum of the exponentials of the similarities between the feature vector $T_i$ of the $i$-th text and the feature vectors $I_j$ of all images within the batch; and $\exp\left(\mathrm{sim}(I_i, T_i)/\tau\right)$ denotes the exponential of the similarity between the feature vector $I_i$ of the $i$-th image and the corresponding text feature vector $T_i$.
[0035] As a preferred option, in S4, the specific process of batch classification based on the Prompt template is as follows:
[0036] S41: Predefine a batch Prompt template library;
[0037] S42: Dynamically schedule and allocate computing resources for the extracted image and text features;
[0038] S43: Calculate the similarity between image features and Prompt features to generate classification results.
[0039] As a preferred option, in S41, the specific contents of the predefined batch Prompt template library are as follows:
[0040] For batch image data that needs to be classified, image features are first extracted using CLIP's image encoder; then, based on a predefined batch Prompt template library, Prompt features are extracted using CLIP's text encoder, and corresponding text features are generated in batches according to predefined category labels.
[0041] Preferably, in S43, the similarity between the image features and the Prompt features is calculated, and the specific content of the classification result is as follows:
[0042] Suppose an image is to be classified into $k$ possible categories, and a corresponding Prompt text feature vector $P_1, P_2, \dots, P_k$ is generated for each category;
[0043] Traverse and calculate: for the Prompt feature vector $P_c$ of each category, calculate the cosine similarity between the image feature vector $I$ and the current Prompt feature vector $P_c$ to obtain the similarity score $s_c$;
[0044] Compare scores: compare the $k$ similarity scores $s_1, s_2, \dots, s_k$;
[0045] Determine category: find the index $c^{*}$ corresponding to the highest of these $k$ similarity scores;
[0046] Output: the image is predicted to belong to the $c^{*}$-th category;
[0047] This process is performed on each image in the batch, thus achieving batch classification.
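The classification procedure described above can be illustrated with a minimal sketch in plain Python. It assumes the image and Prompt feature vectors have already been produced by the encoders; the function names and the toy three-dimensional vectors are hypothetical and serve only to show the traverse, compare, and select steps.

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def classify_batch(image_features, prompt_features):
    """For each image, return (index of the best-matching Prompt, its score)."""
    results = []
    for image in image_features:
        # Traverse: one similarity score per category Prompt.
        scores = [cosine_similarity(image, prompt) for prompt in prompt_features]
        # Compare and select: index of the highest score.
        best = max(range(len(scores)), key=scores.__getitem__)
        results.append((best, scores[best]))
    return results

# Toy feature vectors, made up for illustration (not real CLIP outputs).
prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
images = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]
predictions = classify_batch(images, prompts)
```

In a real deployment the inner loop would normally be replaced by a single matrix product between the normalized image and Prompt feature matrices, which is what makes batch classification efficient on a GPU.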
[0048] A batch multimodal data alignment system based on the CLIP model implements the batch multimodal data alignment method based on the CLIP model described above, including a batch data input and preprocessing module, a batch feature extraction module, a batch contrastive learning optimization module, a dynamic task scheduling and concurrent processing module, a batch classification module based on the Prompt template, a strategy matching and analysis module, and a result generation and output module.
[0049] Batch data input and preprocessing module: This module receives batch input image and text data and filters out noise from normal communication traffic through dynamic denoising and normalization.
[0050] Batch feature extraction module: Combining the cross-modal alignment capability of the CLIP model, it performs joint feature representation on image-text pairs in communication big data;
[0051] Batch contrastive learning optimization module: Used to calculate and optimize the similarity between image features and text features;
[0052] Dynamic task scheduling and concurrent processing module: used to decompose batch-aligned tasks into multiple subtasks that can be executed in parallel; dynamically allocate subtasks to available computing resources according to the real-time status of system resources and predefined scheduling strategies, so as to realize real-time monitoring and management of tasks.
[0053] The batch classification module based on the Prompt template is used to transform multi-class classification problems into image-text template matching problems, thereby achieving batch classification.
[0054] Strategy matching and analysis module: Contains a built-in knowledge base of malicious industry characteristics and, combined with Prompt template matching, automatically identifies malicious text, malicious images, and abnormal behavior patterns, generating risk scores and associated network topologies;
[0055] Result generation and output module: Used to output the results of batch data alignment.
[0056] Beneficial effects: Compared with the prior art, the present invention has the following advantages:
[0057] (1) The batch multimodal data alignment method based on the CLIP model proposed in this invention can effectively address the low efficiency, poor scalability, weak generalization ability, and insufficient resource utilization of the prior art when processing large-scale harmful information.
[0058] (2) The present invention significantly improves the efficiency of identifying harmful information: By batch processing and using the GPU-accelerated CLIP model for feature extraction and similarity calculation, it can quickly locate potential harmful information in large-scale multimodal data. Compared with traditional methods of processing one item or small batches, the processing speed is greatly improved, effectively addressing the characteristics of fast information dissemination and large quantity in this field.
[0059] (3) This invention enhances the ability to identify novel and variant malicious methods: With CLIP's cross-modal semantic understanding and zero-shot transfer capabilities, together with flexible Prompt Engineering techniques, it can quickly adapt to constantly evolving malicious methods on the network. There is no need for time-consuming data annotation and model retraining for each new variant. By updating the Prompt template or keyword library, the accuracy and recall of identifying novel malicious information can be improved, compensating for the insufficient generalization ability of traditional fixed models.
[0060] (4) This invention improves the ability to analyze cross-modal harmful information. This invention can perform feature alignment on data of multiple modalities such as images and text, thereby better understanding and associating semantic relationships between different modalities. For example, it can identify images containing harmful links, images related to specific harmful information statements, etc., improving the ability to identify and analyze complex and highly disguised harmful information, breaking the limitations of traditional methods in single-modal analysis.
[0061] (5) The present invention can optimize the utilization of computing resources and reduce operating costs: Through dynamic task scheduling and concurrent processing mechanism, computing resources can be intelligently allocated according to system load and task priority, making full use of hardware acceleration capabilities such as GPU, avoiding resource idleness and waste, and reducing the operating costs of large-scale data processing.
[0062] (6) This invention enhances the scalability and flexibility of the system: the CLIP-based pre-trained model and Prompt Engineering techniques allow the system to be extended more easily to new types of harmful information and new data modalities without large-scale model reconstruction. Dynamic task scheduling also enables the system to respond flexibly to growth in data volume and changes in business needs.
[0063] (7) The present invention can improve the accuracy and timeliness of early warning: Through efficient batch processing and intelligent risk assessment mechanism, it can identify and warn of potential adverse information more quickly and accurately, providing users and platforms with more timely security protection and reducing losses.
[0064] (8) This invention uses the image encoder and text encoder of the CLIP model to jointly represent mixed-modal data in the communication network, such as malicious icons, malicious language, and malicious recorded speech, and constructs a cross-modal semantic space covering the implicit features of communication protocols and the behavior of the malicious industry ecosystem. Combined with the spatiotemporal distribution of communication traffic (for example, SMS bombing frequently occurs during peak business hours, and malicious emails surge at night), it constructs a resource-aware dynamic task scheduling strategy to achieve millisecond-level elastic allocation of GPU cluster computing resources. It pre-configures a specialized malicious-industry classification Prompt library containing entries such as "features of malicious APP launch pages" and "templates for impersonating bank transfers", and calculates the image-text-speech feature similarity matrix in combination with the contrastive learning optimization module. This not only achieves zero-shot classification and recognition of new variant attack modes, but also automatically mines cross-modal risk propagation links such as "malicious SMS - malicious website - malicious APP download", providing full-link threat awareness for communication network attack and defense exercises and for operations against malicious industries.
Attached Figure Description
[0065] Figure 1 is a flowchart of a batch multimodal data alignment method based on the CLIP model;
[0066] Figure 2 is a flowchart of the backend implementation of the system of the present invention;
[0067] Figure 3 is a front-end implementation logic diagram of the system of the present invention;
[0068] Figure 4 is the network data flow logic diagram of the present invention.
Detailed Implementation
[0069] The present invention will be further illustrated below with reference to specific embodiments. These embodiments are implemented based on the technical solutions of the present invention, and it should be understood that these embodiments are only used to illustrate the present invention and are not intended to limit the scope of the present invention.
[0070] Example 1
[0071] The batch multimodal data alignment method based on the CLIP model provided in this embodiment is mainly used for monitoring and combating malicious website behavior. Its core technical concept includes:
[0072] Batch processing: Organize large-scale image and text data into batches for processing, making full use of the parallel computing capabilities of modern computing devices, reducing redundant calculations, and improving overall processing efficiency.
[0073] Batch optimization of contrastive learning: Design and adopt contrastive loss function optimization strategy for batch data to efficiently calculate and optimize the similarity between image and text features during training or feature extraction, thereby improving feature discriminability and alignment.
[0074] Dynamic task scheduling and resource management: Combining CLIP's zero-shot inference capabilities, computing tasks are dynamically allocated and scheduled based on task priority, data complexity, and real-time status of system resources (such as GPU utilization and memory usage), achieving more efficient resource utilization and lower latency.
[0075] Prompt Engineering for Batch Classification: Using Prompt Engineering with CLIP, standardized text templates are generated in batches, transforming multi-class classification problems into image-text template matching problems. This achieves efficient batch classification with good scalability.
[0076] In this embodiment, the definition and explanation of the custom terms are as follows:
[0077] CLIP: Contrastive Language-Image Pre-training. A neural network model trained on large-scale image-text pair datasets through contrastive learning, capable of learning the semantic relationships between images and text.
[0078] ViT: Vision Transformer, an image encoder based on the Transformer architecture, used as an image feature extractor by the CLIP model.
[0079] BERT: Bidirectional Encoder Representations from Transformers, a text encoder based on the Transformer architecture that can serve as the text feature extractor in CLIP-style models.
[0080] InfoNCE: Information Noise-Contrastive Estimation, a commonly used contrastive loss function for learning to distinguish between positive and negative samples.
[0081] GPU: Graphics Processing Unit, an electronic circuit specifically designed for parallel computing, often used to accelerate deep learning tasks.
[0082] Prompt Engineering: A technique that guides a pre-trained language model (such as CLIP's text encoder) to perform a specific task by designing specific text prompts.
[0083] Black and gray industries: industries that use the internet and other platforms to carry out illegal or irregular activities.
[0084] Anti-fraud: a shorthand for opposing and combating illicit activities.
[0085] API: Application Programming Interface, a set of definitions and protocols that allow different software systems to communicate and exchange data.
[0086] UI: User Interface, the interface through which users interact with software systems.
[0087] NLP: Natural Language Processing, a branch of computer science and artificial intelligence that studies how to enable computers to understand and process human language.
[0088] CNN: Convolutional Neural Network, a deep learning model commonly used for image processing and computer vision tasks.
[0089] Transformer: A neural network architecture based on self-attention mechanism, which has achieved great success in the fields of natural language processing and computer vision. The text encoder and part of the image encoder in the CLIP model adopt the Transformer architecture.
[0090] RoBERTa: A Robustly Optimized BERT Pretraining Approach.
[0091] ELECTRA: Efficiently Learning an Encoder that Classifies Token Replacements Accurately. A pre-trained language model that efficiently learns an encoder.
[0092] ALIGN: A Large-scale ImaGe and Noisy-text embedding, a multimodal pre-trained model that learns aligned image and text representations through contrastive learning on large-scale noisy image-text pairs.
[0093] FILIP: Fine-grained Interactive Language-Image Pre-training, is a fine-grained interactive language-image pre-training model.
[0094] CUDA: Compute Unified Device Architecture, a parallel computing platform and programming model introduced by NVIDIA for general-purpose computing on its GPUs.
[0095] cuDNN: NVIDIA CUDA Deep Neural Network library, a GPU-accelerated library provided by NVIDIA for deep neural networks.
[0096] Kubernetes: An open-source container orchestration system for automating the deployment, scaling, and management of containerized applications.
[0097] Apache Mesos: An open-source cluster management platform for resource sharing and task scheduling.
[0098] Milvus: An open-source vector database specifically designed for storing, indexing, and searching large-scale feature vectors.
[0099] Faiss: Facebook AI Similarity Search, an open-source library from Facebook AI for efficient similarity search.
[0100] Flask: A lightweight Python web framework.
[0101] FastAPI: A modern, high-performance Python web framework for building APIs.
[0102] React: A JavaScript library for building user interfaces.
[0103] Vue.js: A progressive JavaScript framework for building user interfaces.
[0104] This invention is applicable to user behavior analysis, abnormal traffic detection, and network content security governance in the context of big data communication. By batch processing multimodal data such as images, text, and voice in communication networks and combining the cross-modal alignment capability of the CLIP model, it can efficiently identify malicious text, malicious images, and cross-modal risk links, supporting communication operators in real-time monitoring and intelligent blocking of malicious industry propagation links.
[0105] As shown in Figures 1-4, the specific implementation steps are as follows:
[0106] S1: Receive batch multimodal data;
[0107] It can analyze images, texts, videos, text messages, chat logs, social media posts, web page content, and other text and image information on the Internet in batches to quickly identify and issue warnings about information involving harmful content.
[0108] By leveraging multimodal feature alignment, it is possible to identify images masquerading as official documents, text containing malicious links, and inappropriate content combining images and text; key information can be quickly extracted, and malicious clues can be discovered. For example, by analyzing image and text information on online social media, the identities and activity patterns of the parties involved can be correlated.
[0109] In communication big data scenarios, this method can be extended to multi-source data collection such as mobile app stores and social media platforms. Through dynamic keyword databases and image feature matching, it can accurately intercept malicious text messages, links, images, and other content spread by harmful industries.
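As a rough illustration of interception through a dynamic keyword database, the following sketch keeps keywords in memory, lets them be added or removed at run time, and screens a batch of messages. The class name, method names, and sample keywords are hypothetical; the text does not specify how the keyword database is stored or matched, so simple case-insensitive substring matching is assumed here.

```python
import re

class DynamicKeywordDatabase:
    """In-memory keyword store that can be updated while the system runs."""

    def __init__(self, keywords=()):
        self._keywords = {kw.lower() for kw in keywords}
        self._pattern = None
        self._rebuild()

    def _rebuild(self):
        # Longest keywords first so longer phrases win over their prefixes.
        ordered = sorted(self._keywords, key=len, reverse=True)
        self._pattern = (
            re.compile("|".join(re.escape(kw) for kw in ordered)) if ordered else None
        )

    def add(self, keyword):
        self._keywords.add(keyword.lower())
        self._rebuild()

    def remove(self, keyword):
        self._keywords.discard(keyword.lower())
        self._rebuild()

    def match(self, text):
        """Return the sorted list of distinct keywords found in the text."""
        if self._pattern is None:
            return []
        return sorted(set(self._pattern.findall(text.lower())))

def screen_batch(messages, database):
    """Split a batch into flagged (message, hits) pairs and passed messages."""
    flagged, passed = [], []
    for message in messages:
        hits = database.match(message)
        if hits:
            flagged.append((message, hits))
        else:
            passed.append(message)
    return flagged, passed

# Sample keywords and messages are invented for illustration.
db = DynamicKeywordDatabase(["free loan", "verify your account"])
batch = ["Get a FREE LOAN today", "Lunch at noon?", "Click to claim prize"]
flagged, passed = screen_batch(batch, db)
db.add("claim prize")  # the database is updated without restarting anything
flagged_after, _ = screen_batch(batch, db)
```

Keyword matching of this kind only covers the text side; the image feature matching mentioned in the same step would rely on the CLIP features described in S3.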
[0110] S2: Data preprocessing;
[0111] Large-scale image and text data are processed in batches. Specifically, the collected data undergoes preprocessing operations such as cleaning, formatting, noise reduction, image size normalization, and text tokenization to meet the input requirements of the CLIP model. The parallel computing capabilities of modern computing devices are fully utilized to reduce redundant calculations and improve overall processing efficiency.
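The text side of this preprocessing step might look like the sketch below: Unicode normalization, removal of control characters, whitespace collapsing, tokenization, and grouping into fixed-size batches. The whitespace tokenizer is only a stand-in for CLIP's actual tokenizer, and the 77-token limit is an assumption borrowed from the context length of the original CLIP text encoder. Image resizing is omitted because it needs an imaging library.

```python
import re
import unicodedata

MAX_TOKENS = 77  # assumed limit, matching the original CLIP context length

def clean_text(text):
    """Normalize Unicode, drop control/format characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"\s", " ", text)  # tabs and newlines become spaces
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("C"))
    return re.sub(r" +", " ", text).strip()

def tokenize(text, max_tokens=MAX_TOKENS):
    """Whitespace tokenizer used as a stand-in for CLIP's real tokenizer."""
    return clean_text(text).lower().split()[:max_tokens]

def make_batches(items, batch_size):
    """Group items into consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]

def preprocess(texts, batch_size):
    """Clean and tokenize texts, drop empty records, and return batches."""
    tokenized = [tokenize(text) for text in texts]
    non_empty = [tokens for tokens in tokenized if tokens]
    return make_batches(non_empty, batch_size)

# Raw samples invented for illustration: full-width letter, zero-width space,
# control characters, and an embedded newline.
raw = ["  \uff37in  a\tprize\u200b now!! ", "\x00\x01", "hello\nworld", "a b c"]
batches = preprocess(raw, batch_size=2)
```

Dropping records that are empty after cleaning is one simple form of the noise reduction mentioned above; a production system would likely apply further domain-specific filters.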
[0112] Features are extracted from various modalities, and cross-modal alignment and correlation analysis are performed. For example, images of specific locations are aligned with relevant text information to extract key time, location, and people information.
[0113] For multimodal data (such as images, text, and speech) in communication networks, this method filters out normal communication noise through denoising and standardization while retaining the key features of malicious industries (such as malicious language templates and malicious APP icon features), providing high-quality input for subsequent feature alignment.
[0114] S3: Feature alignment is achieved through batch feature extraction based on the CLIP model;
[0115] S31: Using the CLIP model image encoder and text encoder, preprocessed image and text features are extracted in batches to obtain high-dimensional feature vectors, specifically:
[0116] A pre-trained CLIP model image encoder (e.g., ViT) is used to extract features from a batch of image data to obtain a batch image feature matrix $F_I \in \mathbb{R}^{N \times D_I}$, where $F_I$ denotes the batch image feature matrix, $N$ is the batch size, $D_I$ is the dimension of the image features, and $\mathbb{R}^{N \times D_I}$ indicates the dimensions of the image feature matrix.
[0117] A pre-trained CLIP model text encoder (e.g., Transformer) is used to extract features from a batch of text data to obtain a batch text feature matrix $F_T \in \mathbb{R}^{N \times D_T}$,
[0118] where $F_T$ denotes the batch text feature matrix, $N$ is the batch size, $D_T$ is the dimension of the text features, and $\mathbb{R}^{N \times D_T}$ indicates the dimensions of the text feature matrix.
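To make the shape of the batch feature matrices concrete, the sketch below runs a batched extraction loop and stacks the results into an N x D matrix with unit-length rows. The `toy_encode_batch` function is a hypothetical stand-in that derives pseudo-features from a hash; it is not a CLIP encoder, and the feature dimension of 8 is chosen only for illustration.

```python
import hashlib
import math

FEATURE_DIM = 8  # toy dimension for illustration; real encoders use far more

def toy_encode_batch(batch):
    """Hypothetical stand-in for a CLIP encoder: one pseudo-feature row per item."""
    rows = []
    for item in batch:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        rows.append([byte / 255.0 - 0.5 for byte in digest[:FEATURE_DIM]])
    return rows

def l2_normalize(row):
    """Scale a feature row to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

def extract_features(items, encode_batch, batch_size):
    """Encode items batch by batch and return an N x D matrix as a list of rows."""
    matrix = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        matrix.extend(l2_normalize(row) for row in encode_batch(batch))
    return matrix

texts = ["sample message one", "sample message two", "sample message three"]
text_features = extract_features(texts, toy_encode_batch, batch_size=2)
```

Swapping `toy_encode_batch` for a real image or text encoder would leave `extract_features` unchanged, since it only relies on the encoder returning one row per input.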
[0119] S32: Dynamically schedule and allocate computing resources for the extracted image and text features;
[0120] The received batch alignment task is decomposed into multiple subtasks that can be executed in parallel; the division of subtasks can be based on data volume, data complexity or other custom strategies.
[0121] Based on the real-time status of system resources (e.g., by monitoring GPU utilization, memory usage, CPU load, etc.) and predefined scheduling policies, subtasks in the task queue are dynamically allocated to available computing resources (e.g., different GPU cores or CPU threads).
[0122] Maintain a task queue to manage pending subtasks.
[0123] Scheduling strategies can include task priority-based scheduling, data complexity-based resource allocation (e.g., allocating more computing resources to more complex images or longer texts), and dynamically adjusting the number of subtasks to be processed concurrently based on resource utilization.
[0124] Enable real-time monitoring and management of tasks, such as monitoring the execution status of subtasks, handling errors, and performing necessary retries to ensure the stability and reliability of the overall task.
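The scheduling ideas in S32 (subtask decomposition, a priority queue, load-aware concurrency, and retries) can be sketched with the Python standard library as follows. The policy in `choose_worker_count`, the utilization value passed in, and the flaky test function are all hypothetical; a real system would read GPU and CPU metrics from its monitoring stack and dispatch to actual devices rather than threads.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def split_into_subtasks(items, chunk_size):
    """Decompose a batch task into subtasks by data volume."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def choose_worker_count(utilization, max_workers):
    """Hypothetical policy: fewer workers as reported utilization (0..1) rises."""
    free_share = max(0.0, 1.0 - utilization)
    return max(1, int(round(max_workers * free_share)))

def run_with_retries(func, payload, max_retries):
    """Run func(payload), retrying on any exception up to max_retries times."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return func(payload)
        except Exception as error:  # broad on purpose in this sketch
            last_error = error
    raise last_error

def schedule(subtasks, priorities, func, utilization, max_workers=4, max_retries=2):
    """Submit subtasks in priority order (lower number first) to a worker pool."""
    queue = [(priority, index) for index, priority in enumerate(priorities)]
    heapq.heapify(queue)
    workers = choose_worker_count(utilization, max_workers)
    futures = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while queue:
            _, index = heapq.heappop(queue)
            futures[index] = pool.submit(
                run_with_retries, func, subtasks[index], max_retries
            )
        results = {index: future.result() for index, future in futures.items()}
    return [results[index] for index in range(len(subtasks))]

attempts = {}

def flaky_length(chunk):
    """Test function: fails once for single-item chunks, then succeeds."""
    key = tuple(chunk)
    attempts[key] = attempts.get(key, 0) + 1
    if attempts[key] == 1 and len(chunk) == 1:
        raise RuntimeError("transient failure")
    return len(chunk)

subtasks = split_into_subtasks(list(range(7)), chunk_size=3)
lengths = schedule(subtasks, priorities=[2, 0, 1], func=flaky_length, utilization=0.5)
```

Here priority only controls submission order; a fuller scheduler would also re-evaluate the worker count while tasks are running, as the text describes.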
[0125] S33: Optimize the feature vectors through batch contrastive learning and store them;
[0126] Construct a contrastive loss function for batch data, such as a batch-optimized version of the InfoNCE loss. For the $N$ image-text pairs $(I_1, T_1), \dots, (I_N, T_N)$ within a batch, the goal is to maximize the similarity between features of positive sample pairs (i.e., matching images and text) while minimizing the similarity between features of negative sample pairs (i.e., non-matching images and text).
[0127] The contrastive loss function $L$ can be expressed as:
[0128] $$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(I_i, T_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(I_i, T_j)/\tau\right) + \sum_{j=1}^{N}\exp\left(\mathrm{sim}(T_i, I_j)/\tau\right)},$$
[0129] where $I_j$ and $T_j$ are the feature vectors of the $j$-th image and the $j$-th text, respectively; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; and the index $j$ indicates that the different samples in the batch are iterated over. $I_i$ denotes the feature vector of the $i$-th image, extracted by the image encoder of the CLIP model, and $T_i$ denotes the feature vector of the $i$-th text, extracted by the text encoder of the CLIP model. $\tau$ is a temperature hyperparameter used to adjust the steepness of the similarity distribution: a smaller $\tau$ makes the model focus more on distinguishing between positive and negative samples, potentially leading to overfitting, while a larger $\tau$ makes the distribution smoother, potentially leading to insufficient model learning. $L$ is the value of the contrastive loss function; the smaller this value, the better the alignment between the image and text features learned by the model. $N$ is the number of image-text pairs in a batch, that is, the batch size; batch processing is key to improving computational efficiency. $\exp(\cdot)$ is the exponential function, so that $\exp(x)$ is equivalent to $e^{x}$, where $e$ is Euler's number, approximately 2.71828. $i$ is the index of the image-text pair in the batch, and the summation symbol $\sum_{i=1}^{N}$ indicates that the calculation is performed over all image-text pairs within a batch. $\sum_{j=1}^{N}\exp\left(\mathrm{sim}(I_i, T_j)/\tau\right)$ is the first term in the denominator; it is the sum of the exponentials of the similarities between the feature vector $I_i$ of the $i$-th image and the feature vectors $T_j$ of all texts within the batch. $\sum_{j=1}^{N}\exp\left(\mathrm{sim}(T_i, I_j)/\tau\right)$ is the second term in the denominator; it is the sum of the exponentials of the similarities between the feature vector $T_i$ of the $i$-th text and the feature vectors $I_j$ of all images within the batch. $\exp\left(\mathrm{sim}(I_i, T_i)/\tau\right)$ is the numerator; it is the exponential of the similarity between the feature vector $I_i$ of the $i$-th image and the corresponding text feature vector $T_i$.
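A direct, unoptimized reading of the loss described above can be sketched in plain Python. It follows the verbal definition: the numerator is the exponential of the similarity of the matching pair, and the denominator is the sum of the image-to-text term and the text-to-image term. Averaging over the batch and the default temperature of 0.07 are assumptions made for this sketch, and the two-dimensional toy features are invented.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def batch_contrastive_loss(image_features, text_features, tau=0.07):
    """Contrastive loss with image-to-text and text-to-image sums in the denominator.

    The mean over the batch and tau=0.07 are assumptions of this sketch.
    """
    n = len(image_features)
    total = 0.0
    for i in range(n):
        numerator = math.exp(
            cosine_similarity(image_features[i], text_features[i]) / tau
        )
        image_to_text = sum(
            math.exp(cosine_similarity(image_features[i], text_features[j]) / tau)
            for j in range(n)
        )
        text_to_image = sum(
            math.exp(cosine_similarity(text_features[i], image_features[j]) / tau)
            for j in range(n)
        )
        total += -math.log(numerator / (image_to_text + text_to_image))
    return total / n

# Toy batch: two orthogonal image vectors with matching or swapped texts.
images = [[1.0, 0.0], [0.0, 1.0]]
matching_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = batch_contrastive_loss(images, matching_texts)
loss_misaligned = batch_contrastive_loss(images, swapped_texts)
```

Because the matching pair contributes to both sums in the denominator, this single-fraction form is bounded below by log 2 rather than 0, which the aligned toy batch approaches. A GPU implementation would compute the full similarity matrix in one matrix product instead of nested loops.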
[0130] During large-scale data training, the computation of the contrastive loss function is optimized through efficient matrix operations and parallel computing (e.g., GPU acceleration). If fine-tuning is selected, the parameters of the CLIP model are updated through an optimization algorithm such as gradient descent.
[0131] The extracted feature vectors are then stored in a high-performance vector database and indexed to enable rapid similarity search and matching.
[0132] In harmful-industry analysis scenarios, dynamic task scheduling optimizes the allocation of GPU / CPU resources, improves the efficiency of large-scale feature extraction in the context of big data communication, and provides high-precision feature representation for cross-modal risk link analysis (such as the linkage between harmful calls and harmful websites).
[0133] S4: Batch classification based on Prompt template;
[0134] S41: Predefine an extensible batch Prompt template library; templates can take a standardized form, such as "This is an image of {category}" or "Describes a scenario about {topic}", where "{category}" and "{topic}" are placeholders that can be dynamically replaced with actual category labels or topic words according to the specific classification task.
[0135] For batch image data that needs to be classified, the image features are first extracted using CLIP's image encoder.
[0136] Then, CLIP's text encoder is used to extract Prompt features, and the corresponding text features are generated in batches based on predefined category labels. For example, if images need to be classified into $k$ categories, each of the $k$ category labels is substituted into the Prompt template to obtain $k$ text prompts. Features are then extracted from these text prompts using CLIP's text encoder to obtain a $k \times d$ text feature matrix, where $d$ is the feature dimension.
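The placeholder substitution in S41 can be sketched as follows. The two template strings come from the description above; the dictionary keys, the helper name `build_prompts`, and the example category labels are hypothetical and chosen only for illustration.

```python
# Template library keyed by a hypothetical template name.
TEMPLATES = {
    "image": "This is an image of {category}",
    "scene": "Describes a scenario about {topic}",
}

def build_prompts(template, field, labels):
    """Fill the named placeholder of a template with each label in turn."""
    return [template.format(**{field: label}) for label in labels]

# Illustrative category labels; a real task would supply its own.
prompts = build_prompts(TEMPLATES["image"], "category", ["a cat", "a dog", "a car"])
```

Each resulting string would then be passed to the text encoder, giving one text feature vector per category.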
[0137] S42: Dynamically schedule and allocate computing resources for the extracted image and text features;
[0138] The received batch alignment task is decomposed into multiple subtasks that can be executed in parallel; the division of subtasks can be based on data volume, data complexity or other custom strategies.
[0139] Based on the real-time status of system resources (e.g., by monitoring GPU utilization, memory usage, CPU load, etc.) and predefined scheduling policies, subtasks in the task queue are dynamically allocated to available computing resources (e.g., different GPU cores or CPU threads).
[0140] Maintain a task queue to manage pending subtasks.
[0141] Scheduling strategies can include task priority-based scheduling, data complexity-based resource allocation (e.g., allocating more computing resources to more complex images or longer texts), and dynamically adjusting the number of subtasks to be processed concurrently based on resource utilization.
[0142] Enable real-time monitoring and management of tasks, such as monitoring the execution status of subtasks, handling errors, and performing necessary retries to ensure the stability and reliability of the overall task.
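The decomposition, concurrent execution, and retry behaviour described in S42 can be sketched with the Python standard library. This is a simplified illustration, not the scheduling engine of this application: the function names, the fixed subtask size, the retry count, and the utilization-based concurrency policy are all assumptions, and a thread pool stands in for real GPU / CPU resources.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_subtasks(items, subtask_size):
    """Decompose a batch into subtasks by data volume."""
    return [items[i:i + subtask_size] for i in range(0, len(items), subtask_size)]

def choose_concurrency(utilization, max_workers):
    """Hypothetical policy: fewer concurrent subtasks as utilization (0..1) rises."""
    free = max(0.0, 1.0 - utilization)
    return max(1, int(max_workers * free))

def run_with_retry(worker, subtask, max_retries=2):
    """Run one subtask, retrying on error up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return worker(subtask)
        except Exception:
            if attempt == max_retries:
                raise

def schedule(items, worker, subtask_size, max_workers):
    """Execute all subtasks concurrently and merge results in input order."""
    subtasks = split_into_subtasks(items, subtask_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda s: run_with_retry(worker, s), subtasks)
        return [r for chunk in results for r in chunk]

# Stand-in worker: a real one would run the image or text encoder on the chunk.
lengths = schedule(
    ["a", "bb", "ccc", "dddd", "eeeee"],
    lambda chunk: [len(x) for x in chunk],
    subtask_size=2,
    max_workers=choose_concurrency(0.5, 4),
)
```

`pool.map` returns results in submission order, so the merged output lines up with the input batch even when subtasks finish out of order.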
[0143] S43: Calculate the similarity between image features and Prompt features to generate classification results;
[0144] Calculate the cosine similarity between the feature vector of each image and these $k$ text feature vectors.
[0145] The specific formula and process for calculating the cosine similarity between the image feature vector and the text (Prompt) feature vector are as follows:
[0146] The formula for calculating cosine similarity:
[0147] Suppose there are two feature vectors, vector $A$ (e.g., an image feature vector) and vector $B$ (e.g., the text feature vector of a certain Prompt). Their cosine similarity $\mathrm{sim}(A, B)$ is calculated as:
[0148] $$\mathrm{sim}(A, B) = \frac{A \cdot B}{\|A\| \, \|B\|}$$
[0149] where $A \cdot B$ denotes the dot product of vectors $A$ and $B$. If $A = (a_1, a_2, \dots, a_n)$ and $B = (b_1, b_2, \dots, b_n)$, then $A \cdot B = \sum_{i=1}^{n} a_i b_i$, where $n$ is the dimension of the vectors. $\|A\|$ and $\|B\|$ denote the L2 norms of vectors $A$ and $B$, respectively. The L2 norm, also known as the Euclidean norm or the magnitude of a vector, is calculated as follows:
[0150] $$\|A\| = \sqrt{\sum_{i=1}^{n} a_i^2}$$
[0151] $$\|B\| = \sqrt{\sum_{i=1}^{n} b_i^2}$$
[0152] The formula measures the similarity between two vectors in terms of direction, with the result ranging from -1 to 1. A value closer to 1 indicates greater similarity in direction; a value closer to -1 indicates opposite directions; and a value closer to 0 indicates orthogonal (unrelated) directions. In CLIP's feature space, feature vectors are typically L2-normalized, so that $\|A\| = \|B\| = 1$ and the formula simplifies to the dot product: $\mathrm{sim}(A, B) = A \cdot B$.
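As a small worked example of the simplification mentioned above, the following sketch computes the cosine similarity directly and then as a plain dot product of L2-normalized vectors. The helper names and the two toy vectors are illustrative assumptions.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_norm(a):
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    return dot(a, b) / (l2_norm(a) * l2_norm(b))

def normalize(a):
    """Scale a vector to unit L2 norm."""
    norm = l2_norm(a)
    return [x / norm for x in a]

a = [3.0, 4.0]
b = [4.0, 3.0]
cos_ab = cosine_similarity(a, b)                   # 24 / (5 * 5)
dot_normalized = dot(normalize(a), normalize(b))   # same value via unit vectors
```

Both routes give 0.96 for these vectors, up to floating-point rounding, which is why pre-normalized features allow the similarity of a whole batch to be computed with a single matrix multiplication.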
[0153] The specific calculation process (in the S43 batch classification step):
[0154] Suppose an image (whose feature vector is $v$) is to be classified into $k$ possible categories, and a corresponding Prompt text feature vector has been generated for each category: $t_1, t_2, \dots, t_k$.
[0155] Traversal calculation: for the Prompt feature vector $t_j$ of each category ($j$ from 1 to $k$):
[0156] Using the cosine similarity formula above, calculate the similarity score $s_j$ between the image feature vector $v$ and the current Prompt feature vector $t_j$, i.e., $s_j$ is the cosine similarity of $v$ and $t_j$.
[0157] Compare the scores: the similarity scores are $s_1, s_2, \dots, s_k$.
[0158] Determine the category: find the index $j^{*}$ of the largest score among these $k$ scores:
[0159] $$j^{*} = \arg\max_{j \in \{1, \dots, k\}} s_j$$
[0160] Output: the image is predicted to belong to the $j^{*}$-th category, that is, the category associated with the Prompt feature vector $t_{j^{*}}$.
[0161] This process is performed on each image in the batch, thus achieving batch classification.
[0162] The category label corresponding to the text feature with the highest similarity to the image feature is selected as the predicted category for the image. The entire process is performed in batches, which significantly improves the efficiency of multi-class classification.
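The S43 procedure can be sketched for a whole batch as follows. The sketch assumes the image and Prompt features are already L2-normalized, so each score reduces to a dot product; the function name, toy feature vectors, and labels are hypothetical.

```python
def classify_batch(image_features, prompt_features, labels):
    """Assign each image the label of its most similar Prompt feature vector.

    Assumes all vectors are L2-normalized, so dot product equals cosine similarity.
    """
    predictions = []
    for v in image_features:
        # One similarity score per category Prompt.
        scores = [sum(x * y for x, y in zip(v, t)) for t in prompt_features]
        # Index of the largest score (first one wins on ties).
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append((labels[best], scores[best]))
    return predictions

image_features = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
prompt_features = [[1.0, 0.0], [0.0, 1.0]]
predictions = classify_batch(image_features, prompt_features, ["cat", "dog"])
```

With a numerical library, the inner loops collapse into one product of the image feature matrix and the transposed Prompt feature matrix followed by a row-wise argmax, which is where the batch efficiency gain comes from.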
[0163] S5: Result generation and output, and visualization processing;
[0164] Based on the feature vector of S3 and the classification result of S4, the results are generated and output, and then post-processed (visualization, storage, etc.).
[0165] Depending on the specific application scenario, the output will show the results of batch data alignment. For example, for image retrieval tasks, the output will show the text description most relevant to the query image; for classification tasks, the output will show the predicted category label of the image; and for anomaly detection tasks, the output will show the data identified as anomalies and their anomaly scores.
[0166] The output results can be post-processed, such as visualized or stored.
[0167] This application also proposes a CLIP-based batch multimodal data alignment system for implementing a CLIP-based batch multimodal data alignment method, including a batch data input and preprocessing module, a batch feature extraction module, a batch contrastive learning optimization module, a dynamic task scheduling and concurrent processing module, a batch classification module based on a Prompt template, and a result generation and output module.
[0168] Batch data input and preprocessing module: This module receives batch input image data and related text data; it performs necessary preprocessing on the input image and text data, such as image size normalization and text tokenization, to meet the input requirements of the CLIP model.
[0169] It supports real-time access to massive amounts of multimodal data such as images, text, and voice in communication networks. Through dynamic noise reduction and standardization processing, it filters noise from normal communication traffic and accurately extracts characteristics of harmful industries such as inappropriate language and inappropriate APP icons.
[0170] Image data can include image files or image data streams in various formats; text data can include descriptions, labels, titles, etc., related to the image.
[0171] Batch feature extraction module: used to extract image and text features; combined with the cross-modal alignment capability of CLIP model, it performs joint feature representation on image-text pairs in communication big data, providing a high-precision feature foundation for cross-modal risk link analysis.
[0172] A pre-trained CLIP model image encoder is used to extract features from a batch of image data to obtain a batch image feature matrix.
[0173] A pre-trained CLIP model text encoder is used to extract features from batch text data to obtain a batch text feature matrix.
[0174] Batch contrastive learning optimization module: used to calculate and optimize the similarity between image and text features, improve the discriminativeness and alignment of features; construct and adopt a contrastive loss function optimization strategy for batch data, and calculate similarity during the training or feature extraction stage.
[0175] Dynamic task scheduling and concurrent processing module: This module decomposes received batch alignment tasks into multiple subtasks that can be executed in parallel. The division of subtasks can be based on data volume, data complexity, or other custom strategies.
[0176] To address the massive data processing demands during peak communication periods, intelligent scheduling of GPU / CPU resources enables batch feature extraction and contrastive learning optimization with sub-second response times, supporting real-time blocking of the propagation links of harmful industries.
[0177] Maintain a task queue to manage pending subtasks.
[0178] Based on the real-time status of system resources (e.g., by monitoring GPU utilization, memory usage, CPU load, etc.) and predefined scheduling policies, subtasks in the task queue are dynamically allocated to available computing resources (e.g., different GPU cores or CPU threads).
[0179] Scheduling strategies can include task priority-based scheduling, data complexity-based resource allocation (e.g., allocating more computing resources to more complex images or longer texts), and dynamically adjusting the number of subtasks to be processed concurrently based on resource utilization.
[0180] Enable real-time monitoring and management of tasks, such as monitoring the execution status of subtasks, handling errors, and performing necessary retries to ensure the stability and reliability of the overall task.
[0181] A batch classification module based on Prompt templates transforms the identification of multi-category malicious information into efficient batch image-text similarity calculation (transforming the classification problem into a matching problem between images and text templates). This eliminates the need for fine-tuning for specific categories, improving the efficiency and scalability of classification / recognition. It achieves efficient batch classification as an application of feature alignment.
[0182] Strategy Matching and Analysis Module: It has a built-in knowledge base of harmful-industry characteristics and combines Prompt template matching technology to automatically identify harmful text, harmful images, and abnormal behavior patterns, generate risk scores and associated network topologies, and provide a decision-making basis for security protection.
[0183] The strategy matching and analysis module includes:
[0184] Harmful Feature Library: Maintains a knowledge base containing text keywords, visual pattern feature vectors, behavioral pattern rules, etc., related to harmful information.
[0185] Similarity calculation: Perform efficient batch similarity calculation (e.g., using cosine similarity) between the feature vectors of the data to be analyzed and the feature vectors in the feature library.
[0186] Rule engine: Combines predefined rules (such as keyword combinations, specific behavioral sequences) to perform risk assessment on matching results.
[0187] Prompt template application: For classifying or identifying specific types of malicious information, Prompt Engineering is used to generate relevant text template features, which are matched against the image features of the data to be analyzed.
[0188] Risk assessment and early warning: Based on the matching results and the assessment of the rule engine, the data is classified into risk levels and corresponding early warning information is generated.
[0189] Early warning and analysis: Based on the matching results and set thresholds, the system automatically generates early warnings for high-risk content and provides visual analysis reports that show the distribution, spread trends and related networks of harmful information, assisting security personnel in conducting in-depth analysis and tracing.
[0190] Rule management: Security personnel can flexibly add, edit, and update Prompt templates, keyword libraries, and visual feature libraries related to harmful information through the backend interface, and adjust risk assessment rules and thresholds.
[0191] Interception and Alerts: After the system identifies high-risk and harmful information, it can take interception measures (such as blocking SMS sending, marking risky links) and send security alerts to users.
[0192] Case Analysis and Model Updates: Experts can analyze past cases of misconduct through the platform, extract new characteristics of such misconduct, and update the system's feature library and recognition model. Prompt Engineering allows for the rapid creation of recognition rules for novel malicious language and tactics.
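How the similarity calculation, rule engine, and risk grading above might fit together can be sketched as follows. This is only an illustration under stated assumptions: the thresholds 0.8 and 0.5, the combination rule, the "normal" label, and the function names are hypothetical, while the labels "high risk" and "suspected risk" follow the wording used for risk level display in this description.

```python
def keyword_hits(text, keywords):
    """Return the library keywords found in the text (case-insensitive)."""
    lowered = text.lower()
    return [k for k in keywords if k.lower() in lowered]

def risk_level(similarity, hits, high=0.8, suspected=0.5):
    """Grade one item from its best feature-library similarity and rule hits.

    Thresholds and the combination rule are assumed values for illustration.
    """
    if similarity >= high or (similarity >= suspected and hits):
        return "high risk"
    if similarity >= suspected or hits:
        return "suspected risk"
    return "normal"

# Hypothetical keyword library and message.
hits = keyword_hits("Click this LINK now", ["link", "prize"])
level = risk_level(0.6, hits)
```

In a deployed system the thresholds would be adjustable by security personnel through rule management rather than fixed in code.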
[0193] Result generation and output module: Used to output the results of batch data alignment.
[0194] Depending on the specific application scenario, the output will show the results of batch data alignment. For example, for image retrieval tasks, the output will show the text description most relevant to the query image; for classification tasks, the output will show the predicted category label of the image; and for anomaly detection tasks, the output will show the data identified as anomalies and their anomaly scores.
[0195] The analysis results, risk levels, and early warning information are displayed through a user interface or API. User feedback is received to optimize the feature library, rules, and models.
[0196] The output results can be post-processed, meaning the analysis results can be visualized through charts, relationship networks, and other forms.
[0197] Output risk scores, early warning signals, and visual reports.
[0198] UI / UX interaction:
[0199] Warning prompt: When the system detects potentially harmful information (such as text messages or chat content containing high-risk images), a prominent warning prompt will pop up on the user interface, such as "Suspected harmful information detected, please be careful!".
[0200] Risk level display: For suspicious information, the system can display different risk level labels (such as "high risk" or "suspected risk") according to its risk level.
[0201] Information details display: Users can view the detailed content of suspicious information, including text, images, etc. The system can highlight high-risk sections.
[0202] Reporting and Feedback: Users can report confirmed inappropriate information with one click and send the report information back to the backend system for model iteration and optimization.
[0203] Security knowledge dissemination: The system can regularly push relevant security knowledge and case studies to users to raise their awareness of security precautions.
[0204] The hardware and software environment of this application are as follows:
[0205] Table 1 Hardware Environment
[0206]
[0207] Table 2 Software Environment
[0208]
[0209] Example 2
[0210] Based on Example 1, this example provides other methods for feature extraction:
[0211] Use other multimodal pre-trained models: In addition to CLIP, other pre-trained models with cross-modal alignment capabilities can be considered, such as ALIGN and FILIP. Although these models may differ from CLIP in training methods and model structure, they can also extract joint feature representations of images and text.
[0212] Combining unimodal models: Consider using separate image encoders (such as ResNet, EfficientNet) and text encoders (such as RoBERTa, ELECTRA), and then learning cross-modal alignment through additional fusion layers (such as Transformer Encoder, attention mechanisms). While this may require more training data and tuning, it may yield better performance in certain domains.
[0213] Other methods of dynamic task scheduling:
[0214] Rule-based static resource allocation: This approach doesn't rely entirely on dynamic scheduling. Instead, it assigns different types of tasks to fixed computing resource pools based on predefined rules and task types. For example, feature extraction tasks can be assigned to GPU cluster A, and prompt matching tasks can be assigned to GPU cluster B. While less flexible, it's a viable alternative in scenarios where resources are relatively fixed and task types are clearly defined.
[0215] A priority-based simple queue: This approach maintains a simple task queue and schedules tasks based on their priority, allocating resources to higher-priority tasks first. While not as intelligent as dynamic scheduling, it simplifies implementation in scenarios where task priorities are clearly defined.
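A priority-based simple queue of this kind can be sketched with the standard `heapq` module. The class name and the convention that a larger number means higher priority are assumptions; tasks with equal priority are served in arrival order.

```python
import heapq
import itertools

class PriorityTaskQueue:
    """Minimal priority queue: highest priority first, FIFO among equals."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # Tie-breaker preserving arrival order.

    def push(self, priority, task):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), task))

    def pop(self):
        """Remove and return the highest-priority pending task."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

queue = PriorityTaskQueue()
queue.push(1, "log cleanup")
queue.push(5, "feature extraction")
queue.push(5, "prompt matching")
order = [queue.pop() for _ in range(len(queue))]
```

The task names are placeholders; in the alternative described above, each entry would be a subtask waiting for a computing resource.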
[0216] Other methods for batch Prompt template matching:
[0217] Similarity search using vector databases: Features of text descriptions (rather than Prompt templates) for all categories can be pre-stored in a vector database. Then, a Top-K similarity search is performed using the features of the image to be classified, and the category corresponding to the most similar text description is used as the prediction result. This method does not require explicitly building a Prompt template, but may result in slight differences in the accuracy of semantic matching.
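The Top-K search can be illustrated with a brute-force sketch over an in-memory dictionary standing in for the vector database. The function name, the toy index, and the assumption that all vectors are L2-normalized (so the dot product equals cosine similarity) are illustrative; a dedicated vector database would use an approximate index instead of scanning every entry.

```python
import heapq

def top_k(query, index, k):
    """Return the k (score, label) pairs with the highest dot-product score."""
    scored = (
        (sum(q * x for q, x in zip(query, vec)), label)
        for label, vec in index.items()
    )
    return heapq.nlargest(k, scored)

# Hypothetical pre-stored text description features, keyed by category label.
index = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "car": [-1.0, 0.0]}
matches = top_k([0.6, 0.8], index, k=2)
predicted = matches[0][1]  # Category of the most similar description.
```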
[0218] Multi-label classification models: If harmful information has multiple labels or attributes, a multi-label classification model can be trained (with image and text features as input) to directly predict multiple relevant labels. This requires a multi-labeled dataset, but it can output classification results more directly.
[0219] Other ways to store feature vectors:
[0220] Traditional relational databases: Feature vectors can be stored in traditional relational databases and queried using indexes. However, for similarity searches of high-dimensional vectors, the performance may not be as good as dedicated vector databases.
[0221] In this embodiment, the present application can also perform batch analysis on the following data:
[0222] Content copyright protection: Batch identification of unauthorized images and text content on the internet, such as comparing whether product images on e-commerce platforms infringe on the brand's copyright.
[0223] Misinformation detection: Batch analysis of news reports, social media posts, and other text and image information to identify and flag potential misinformation or rumors. For example, comparing image content with text descriptions or matching it against known misinformation databases.
[0224] Intelligent marketing content generation and evaluation: Generate product-related advertising copy and image materials in batches, and evaluate their appeal or relevance to the target audience.
[0225] Educational Resource Management and Retrieval: Batch management and retrieval of text and image materials on the educational platform, such as quickly finding relevant learning resources based on students' questions (text or images).
[0226] Industrial quality inspection: Batch analysis of product defect images and alignment with standard product description text to determine the type and severity of defects.
[0227] Art Appreciation and Retrieval: Batch analysis of images and descriptions of artworks to identify style, verify authenticity, and retrieve similar works.
[0228] Multilingual content alignment and translation assistance: Batch align text and image content in different languages to assist machine translation and localization.
[0229] This invention utilizes batch processing and contrastive learning optimization strategies for batch data. By combining the spatiotemporal distribution characteristics of communication network traffic (such as 5G base station signaling storms and abnormal connection patterns of IoT devices) with the patterns of malicious industry behavior (such as the temporal patterns of malicious emails and the propagation links of malicious software), a resource-aware dynamic task scheduling mechanism is designed to achieve millisecond-level elastic computing resource allocation for GPU clusters. This significantly shortens the processing cycle of large-scale cross-modal data (including base station logs, network traffic packets, and malicious text messages), thereby increasing system throughput, reducing response latency, and effectively solving the bottleneck of large-scale data processing in communication network attack and defense drills and malicious industry crackdown scenarios.
[0230] Based on CLIP's zero-shot transfer capability and Prompt Engineering technology, and by using a pre-built communication security-specific Prompt template library (such as "5G core network topology anomaly") and a harmful-industry analysis template library, cross-modal risk characteristics can be quickly identified without data annotation or model fine-tuning for new attack variants. This shortens the response time for new attack detection and solves the problem of insufficient scalability of traditional methods for unknown threats.
[0231] By combining the CLIP model's powerful cross-modal semantic representation capabilities with batch feature optimization strategies, and by constructing a joint semantic space of implicit features of communication protocols and behavior patterns of the harmful-industry ecosystem, the model can maintain high recognition accuracy when dealing with zero-day vulnerability attacks (such as unknown signaling vulnerabilities in 5G slice networks) or new types of forged documents (such as dynamic face documents generated based on GANs), effectively solving the problem of insufficient generalization ability in complex network environments.
[0232] Through the intelligent resource scheduling engine and concurrent processing module, the system can monitor the load on communication links (such as traffic surges during DDoS attacks) and the flood of bad data in real time, dynamically adjust the priority of computing tasks, keep the GPU utilization at a high level for a long time, and significantly reduce the computing cost of large-scale cross-modal data analysis.
[0233] The batch classification strategy based on the Prompt template transforms the cross-modal classification of communication network attack patterns (such as the "SMS bombing-malicious website-malicious APP download" attack chain) and characteristics of malicious industry behaviors into efficient batch similarity calculation. This avoids independent feature matching for each attack sample, thereby improving the classification efficiency of datasets with millions of entries. It is particularly advantageous when dealing with composite attack data containing images, text, and voice.
[0234] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A method for batch multimodal data alignment based on a CLIP model, characterized in that: by batch processing multimodal data in communication networks and combining the cross-modal alignment capability of the CLIP model, malicious text, malicious images, and cross-modal risk links are identified, supporting telecommunications operators in real-time monitoring and intelligent blocking of malicious industry propagation links. Specifically, the method includes the following steps: S1: Receive batch multimodal data; S2: Data preprocessing; S3: Feature alignment is achieved through batch feature extraction based on the CLIP model; S4: Batch classification based on Prompt template; S41: Predefine a batch Prompt template library; predefining the batch Prompt template library includes the following: for batch image data that needs to be classified, first, the image encoder of CLIP is used to extract image features; then, based on the predefined batch Prompt template library, the text encoder of CLIP is used to extract Prompt features, and the corresponding text features are generated in batches according to the predefined category labels; S42: Dynamically schedule and allocate computing resources for the extracted image and text features; S43: Calculate the similarity between image features and Prompt features to generate classification results; suppose an image is to be classified into $k$ possible categories and a corresponding Prompt text feature vector is generated for each category: $t_1, t_2, \dots, t_k$; Traversal calculation: for the Prompt text feature vector $t_j$ of each category, based on the cosine similarity between the feature vector of each image and the text feature vector, the similarity score $s_j$ between the image feature vector $v$ and the current Prompt text feature vector $t_j$ is calculated; Compare the scores: the similarity scores are $s_1, s_2, \dots, s_k$; Determine the category: find the index $j^{*}$ corresponding to the highest similarity score among these $k$ scores, $j^{*} = \arg\max_{j \in \{1, \dots, k\}} s_j$; Output: the image is predicted to belong to the $j^{*}$-th category; this process is performed on each image in the batch, thus achieving batch classification; S5: Result generation and output, and visualization processing.
2. The batch multimodal data alignment method based on the CLIP model according to claim 1, characterized in that: In S1, images and text information from the Internet are collected and analyzed in batches.
3. The batch multimodal data alignment method based on the CLIP model according to claim 1, characterized in that: In S2, the collected multimodal data is standardized to effectively filter out normal communication noise, retain key features of harmful industries, and provide high-quality input for subsequent feature alignment.
4. The batch multimodal data alignment method based on the CLIP model according to claim 1, characterized in that: In S3, the specific process of achieving feature alignment through batch feature extraction based on the CLIP model is as follows: S31: Use the CLIP model image encoder and text encoder to batch extract features from preprocessed images and text; S32: Dynamically schedule and allocate computing resources for the extracted image and text features; S33: Perform batch contrastive learning optimization and store the feature vectors.
5. The batch multimodal data alignment method based on the CLIP model according to claim 4, characterized in that: In S31, a pre-trained CLIP model image encoder is used to extract features from a batch of image data, resulting in a batch image feature matrix $F_I \in \mathbb{R}^{N \times d_I}$; where $F_I$ represents the batch image feature matrix, $N$ represents the batch size, $d_I$ represents the dimension of the image features, and $N \times d_I$ represents the dimensions of the image feature matrix; the text encoder of the pre-trained CLIP model is used to extract features from a batch of text data to obtain a batch text feature matrix $F_T \in \mathbb{R}^{N \times d_T}$; where $F_T$ represents the batch text feature matrix, $N$ represents the batch size, $d_T$ represents the dimension of the text features, and $N \times d_T$ represents the dimensions of the text feature matrix.
6. The batch multimodal data alignment method based on the CLIP model according to claim 4, characterized in that: In S33, a contrastive loss function is constructed for batch data. The contrastive loss function $L$ is expressed as: $$L = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\mathrm{sim}(I_i, T_i)/\tau)}{\sum_{j=1}^{N}\exp(\mathrm{sim}(I_i, T_j)/\tau) + \sum_{j=1}^{N}\exp(\mathrm{sim}(I_j, T_i)/\tau)}$$ where $I_j$ and $T_j$ represent the feature vectors of the $j$-th image and the $j$-th text, respectively; $\mathrm{sim}(\cdot,\cdot)$ represents the cosine similarity function; $\tau$ represents a temperature hyperparameter; $N$ represents the number of image-text pairs in a batch; $\exp(\cdot)$ represents the exponential function; $i$ indicates the index of the image-text pair in the batch; $\sum_{j=1}^{N}\exp(\mathrm{sim}(I_i, T_j)/\tau)$ indicates the sum of the exponential functions of the similarities between the feature vector $I_i$ of the $i$-th image and the feature vectors $T_j$ of all texts within the batch; $\sum_{j=1}^{N}\exp(\mathrm{sim}(I_j, T_i)/\tau)$ indicates the sum of the exponential functions of the similarities between the feature vector $T_i$ of the $i$-th text and the feature vectors $I_j$ of all images within the batch; $\exp(\mathrm{sim}(I_i, T_i)/\tau)$ indicates the exponential function of the similarity between the feature vector $I_i$ of the $i$-th image and the corresponding text feature vector $T_i$.
7. A batch multimodal data alignment system based on the CLIP model, implementing the batch multimodal data alignment method based on the CLIP model as described in any one of claims 1 to 6, characterized in that: It includes a batch data input and preprocessing module, a batch feature extraction module, a batch contrastive learning optimization module, a dynamic task scheduling and concurrent processing module, a batch classification module based on the Prompt template, a strategy matching and analysis module, and a result generation and output module. Batch data input and preprocessing module: This module receives batch input image and text data and filters out noise from normal communication traffic through dynamic denoising and normalization; Batch feature extraction module: Combining the cross-modal alignment capability of the CLIP model, it performs joint feature representation on image-text pairs in communication big data; Batch contrastive learning optimization module: Used to calculate and optimize the similarity between image features and text features; Dynamic task scheduling and concurrent processing module: Used to decompose batch alignment tasks into multiple subtasks that can be executed in parallel; based on the real-time status of system resources and predefined scheduling strategies, subtasks are dynamically allocated to available computing resources to achieve real-time monitoring and management of tasks; Batch classification module based on the Prompt template: Used to transform multi-class classification problems into image-text template matching problems, thereby achieving batch classification;
Strategy matching and analysis module: With a built-in knowledge base of harmful-industry characteristics, combined with Prompt template matching technology, it automatically identifies harmful text, harmful images, and abnormal behavior patterns, and generates risk scores and associated network topologies; Result generation and output module: Used to output the results of batch data alignment.