Multi-modal identification method and apparatus for hazardous liquid, and computer program product

By generating text data and training a multimodal model, combining visual and text vectors to identify hazardous liquids, this approach addresses the problem of existing technologies failing to consider the influence of container material and liquid composition, thereby improving recognition accuracy and providing intuitive recognition results.

WO2026137768A1 · PCT designated stage · Publication Date: 2026-07-02 · NUCTECH JIANGSU CO LTD +2

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: NUCTECH JIANGSU CO LTD
Filing Date: 2025-06-27
Publication Date: 2026-07-02

AI Technical Summary

Technical Problem

Existing methods for identifying hazardous liquids fail to adequately consider the impact of container material and liquid composition on CT scan images, resulting in high rates of false positives and false negatives.

Method used

By generating text data describing the liquid composition and container material, visual vectors and text vectors are obtained using an image encoder and a text encoder, respectively. A multimodal model is trained, and recognition is performed by combining the visual vectors and text vectors, taking into account the influence of container material and liquid composition.

Benefits of technology

It improves the accuracy of hazardous liquid identification, and the generated identification results include descriptive text, intuitively displaying properties such as liquid composition and container material, reducing false detections and missed detections.

✦ Generated by Eureka AI based on patent content.

Abstract

The present application relates to a multi-modal identification method and apparatus for a hazardous liquid, and a computer program product. The multi-modal identification method for a hazardous liquid comprises: for image data of CT imaging, generating text data describing attributes of a hazardous liquid, the attributes comprising at least a liquid composition and a container material; using an image encoder to encode the image data to obtain a visual vector; using a text encoder to encode each attribute in the text data to obtain a text vector, and storing the text vector in a text vector library; and, for image data for training, using the visual vector and the text vectors to train a multi-modal model for identifying a hazardous liquid. The multi-modal identification method and apparatus for a hazardous liquid of the present application can take into account the influences of container material, liquid composition, etc., in a CT scan image, thereby improving the accuracy of identifying a hazardous liquid.

Description

Multimodal identification methods, devices, and computer programs for hazardous liquids

[0001] Cross-references to related applications

[0002] This disclosure claims priority to Chinese Patent Application No. 202411918511.1, filed on December 24, 2024, entitled "Multimodal Identification Method, Apparatus and Computer Program Product for Hazardous Liquids", the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This application relates to the field of CT image processing, and in particular to a method, apparatus, and computer program product for the multimodal identification of hazardous liquids.

Background Technology

[0004] With rapid societal development, public safety is receiving increasing attention, placing higher demands on the accuracy and efficiency of security detection of hazardous liquids in luggage. In recent years, the use of intelligent security inspection algorithms to classify and locate CT images has enabled relatively accurate and rapid identification of hazardous liquids under current conditions. However, most existing image detection methods are based on learning from projection images of CT data at different angles, primarily relying on container shape and color information for identification, failing to fully learn the physical feature values of CT scan images.

[0005] To address these limitations, Patent Document 1 proposes an explosive identification system for achieving rapid and accurate identification of hazardous liquids.

[0006] Patent Document 1: CN115452870A

Summary of the Invention

[0007] However, the inventors of this application have found through research that Patent Document 1 still has the following technical problem: it determines whether a liquid is hazardous using only the three-dimensional shape information of the container and a hazardous liquid database, and fails to consider the influence of different container materials and different liquid components on CT scan images, which can easily lead to false detections and missed detections of hazardous liquids in luggage.

[0008] This application provides a multimodal identification method, apparatus, and computer program product for hazardous liquids that takes into account the influence of container material, liquid composition, etc. in CT scan images, thereby improving the accuracy of hazardous liquid identification.

[0009] One aspect of this application provides a multimodal identification method for hazardous liquids. The method includes: generating text data describing the properties of the hazardous liquid from CT imaging image data, wherein the properties include at least liquid composition and container material; encoding the image data using an image encoder to obtain visual vectors; encoding each property in the text data using a text encoder to obtain text vectors, and storing the text vectors in a text vector library; and training a multimodal model for identifying hazardous liquids using the visual vectors and text vectors from the image data for training. By using both visual vectors and text vectors in the training of the multimodal model, and ensuring that the text vectors include at least liquid composition and container material properties, hazardous liquid identification considering properties such as liquid composition and container material can be achieved, thereby improving the accuracy of hazardous liquid identification.

[0010] In some embodiments of the aforementioned multimodal recognition method, the method further includes: for the image data to be tested, calculating the similarity between visual vectors and text vectors based on the trained multimodal model, and generating a hazardous liquid recognition result containing descriptive text based on the similarity judgment. Because recognition relies on the correlation between visual vectors and text vectors, it takes into account the liquid composition, container material, and other attributes of the hazardous liquid reflected in the text vectors, which yields more accurate recognition results. Furthermore, since the recognition result contains descriptive text, and this descriptive text contains keywords for the attributes matched to each container in the image data, the liquid composition, container material, and other attributes of the hazardous liquid can be intuitively determined.

[0011] In some embodiments of the aforementioned multimodal recognition method, calculating the similarity between visual vectors and text vectors based on the trained multimodal model and generating a hazardous liquid recognition result containing descriptive text based on the similarity judgment includes: using the multimodal model to calculate the similarity between the visual vector and the text vectors of each attribute in the text vector library; performing a similarity judgment between the visual vector and the text vectors; and outputting a recognition result containing descriptive text based on the similarity judgment result. By utilizing the correlation between visual vectors and text vectors for recognition, the accuracy of hazardous liquid recognition results can be further improved.

[0012] In some embodiments of the aforementioned multimodal recognition method, the similarity judgment between visual vectors and text vectors includes: setting a preset similarity threshold; and if the similarity is greater than the preset similarity threshold, then the text vector of that attribute is recorded as a valid match, and the descriptive text contains keywords corresponding to the validly matched text vector. By setting thresholds and judging the validly matched text vectors for each attribute, the descriptive text contains keywords corresponding to the validly matched text vectors, thus generating accurate and rich recognition results.
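The threshold-based judgment described in [0012] can be illustrated with a minimal sketch. It assumes cosine similarity, a threshold of 0.5, and a rule that keeps only the best-scoring keyword per attribute; none of these choices is specified in this application. The names `cosine`, `match_attributes`, `describe`, and `TEXT_VECTOR_LIBRARY`, as well as the three-dimensional vectors, are hypothetical and chosen only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy text vector library grouped by attribute; the values are made up.
TEXT_VECTOR_LIBRARY = {
    "liquid composition": {"gasoline": [1.0, 0.0, 0.0], "water": [0.0, 1.0, 0.0]},
    "container material": {"plastic": [0.9, 0.1, 0.4], "glass": [0.0, 0.2, 1.0]},
}

def match_attributes(visual_vector, library, threshold=0.5):
    """Per attribute, record the best keyword as a valid match if it exceeds the threshold."""
    matches = {}
    for attribute, keywords in library.items():
        best_kw, best_sim = max(
            ((kw, cosine(visual_vector, vec)) for kw, vec in keywords.items()),
            key=lambda item: item[1],
        )
        if best_sim > threshold:
            matches[attribute] = best_kw
    return matches

def describe(matches):
    """Build the descriptive text from the validly matched keywords."""
    return ", ".join(f"{attr}: {kw}" for attr, kw in matches.items()) or "no valid match"

visual = [0.8, 0.1, 0.3]  # hypothetical visual vector of one container region
result = match_attributes(visual, TEXT_VECTOR_LIBRARY)
print(describe(result))  # liquid composition: gasoline, container material: plastic
```

An attribute whose best similarity stays at or below the threshold is simply left out of the descriptive text, so an unfamiliar container material does not force a wrong keyword into the result.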

[0013] In some embodiments of the multimodal recognition method described above, using an image encoder to encode image data to obtain visual vectors includes: acquiring container regions in each image of the image data, and generating a visual vector for each container region. By generating a visual vector for each container region, it is possible to handle situations where the same image contains multiple containers.

[0014] In some embodiments of the multimodal recognition method described above, using a text encoder to encode each attribute in the text data to obtain a text vector includes: segmenting the text data into words to obtain keywords for different attributes; and using a text encoder to encode the keywords, generating a text vector corresponding to each keyword. This allows for the acquisition of an accurate text vector corresponding to each attribute.

[0015] In some embodiments of the multimodal recognition method described above, keywords are grouped according to different attributes in the text vector library, and text vectors belonging to the same attribute are stored as a set. This facilitates the management and retrieval of text vectors.

[0016] In some embodiments of the aforementioned multimodal recognition method, training a multimodal model for recognizing hazardous liquids using visual and text vectors includes: designing a multimodal fusion alignment network that maps visual and text vectors to a common multidimensional feature space; constructing an image-text contrastive loss function to optimize the cross-modal alignment of visual and text vectors, where the dimension of the loss function is related to the number of attributes; and training the multimodal fusion alignment network using the loss function to obtain the multimodal model. Since attributes are considered during the construction of the loss function and its dimension is related to the number of attributes, a multimodal model that considers multiple attributes such as the liquid composition and container material of hazardous liquids can be obtained, resulting in higher recognition accuracy using this multimodal model.

[0017] In some embodiments of the aforementioned multimodal recognition method, training a multimodal model for recognizing hazardous liquids using visual vectors and text vectors includes: independently learning the association between the text vector for each attribute and the visual vector. By independently learning the association between the text vector for each attribute and the visual vector, mutual interference between different attribute vectors is avoided, thereby improving the recognition accuracy of the multimodal model.

[0018] In some embodiments of the multimodal recognition method described above, the attributes also include at least one of liquid volume and container wall thickness. By considering liquid volume and container wall thickness, the accuracy of hazardous liquid identification can be further improved.

[0019] Another aspect of this application provides a multimodal recognition device for hazardous liquids. The device includes: a data generation unit that generates text data describing the properties of a hazardous liquid from CT imaging image data, the properties including at least liquid composition and container material; a visual vector encoding unit that encodes the image data using an image encoder to obtain visual vectors; a text vector encoding unit that encodes each property in the text data using a text encoder to obtain text vectors and stores the text vectors in a text vector library; and a training unit that trains a multimodal model for recognizing hazardous liquids using the visual vectors and text vectors from image data for training. By using both visual vectors and text vectors in the training of the multimodal model, and ensuring that the text vectors include at least liquid composition and container material properties, hazardous liquid recognition considering properties such as liquid composition and container material can be achieved, thereby improving the accuracy of hazardous liquid recognition.

[0020] In some embodiments of the aforementioned multimodal recognition device, a recognition unit is further included, which, for the image data to be tested, calculates the similarity between visual vectors and text vectors based on a trained multimodal model and generates a hazardous liquid recognition result containing descriptive text based on the similarity judgment. Because recognition relies on the correlation between visual vectors and text vectors, it takes into account the liquid composition, container material, and other attributes of the hazardous liquid reflected in the text vectors, which yields more accurate recognition results. Furthermore, since the recognition result contains descriptive text, and this descriptive text contains keywords for the attributes matched to each container in the image data, the liquid composition, container material, and other attributes of the hazardous liquid can be intuitively determined.

[0021] In some embodiments of the aforementioned multimodal recognition device, the recognition unit includes: a similarity calculation module, which uses a multimodal model to calculate the similarity between visual vectors and text vectors of various attributes in a text vector library; a similarity judgment module, which performs similarity judgment on visual vectors and text vectors; and a recognition result output module, which outputs a recognition result containing descriptive text based on the similarity judgment result. By utilizing the correlation between visual vectors and text vectors for recognition, the accuracy of hazardous liquid recognition results can be further improved.

[0022] In some embodiments of the aforementioned multimodal recognition device, the similarity judgment module includes: a threshold setting unit for setting a preset similarity threshold; and a valid match determination unit for recording the text vector of an attribute as a valid match if the similarity is greater than the preset similarity threshold, so that the descriptive text contains the keywords corresponding to the validly matched text vectors. By setting a threshold and determining the validly matched text vectors for each attribute, the descriptive text contains keywords corresponding to the validly matched text vectors, thus generating accurate and rich recognition results.

[0023] In some embodiments of the multimodal recognition device described above, the visual vector encoding unit acquires container regions in each image of the image data and generates a visual vector for each container region. By generating a visual vector for each container region, it is possible to handle situations where the same image contains multiple containers.

[0024] In some embodiments of the aforementioned multimodal recognition device, the text vector encoding unit includes: a keyword acquisition module, which segments the text data to obtain keywords with different attributes; and a text vector generation module, which encodes the keywords using a text encoder to generate a text vector corresponding to each keyword. Thus, an accurate text vector corresponding to each attribute can be obtained.

[0025] In some embodiments of the aforementioned multimodal recognition device, the text vector encoding unit further includes a text vector library storage module. This module groups keywords according to different attributes within the text vector library and stores text vectors belonging to the same attribute as a set. This facilitates the management and retrieval of text vectors.

[0026] In some embodiments of the aforementioned multimodal recognition device, the training unit includes: a multimodal fusion alignment network design module, which designs a multimodal fusion alignment network to map visual vectors and text vectors to a common multidimensional feature space; a loss function construction module, which constructs an image-text contrastive loss function to optimize the cross-modal alignment of visual vectors and text vectors, wherein the dimension of the loss function is related to the number of attributes; and a model training module, which trains the multimodal fusion alignment network using the loss function to obtain a multimodal model. Because attributes are considered during the construction of the loss function and the dimension of the loss function is related to the number of attributes, a multimodal model that can consider multiple attributes such as the liquid composition and container material of hazardous liquids can be obtained, thereby increasing the recognition accuracy using this multimodal model.

[0027] In some embodiments of the aforementioned multimodal recognition device, the training unit independently learns the association between the text vector and the visual vector for each attribute. By independently learning the association between the text vector and the visual vector for each attribute, mutual interference between different attribute vectors is avoided, thereby improving the recognition accuracy of the multimodal model.

[0028] In some embodiments of the aforementioned multimodal recognition device, the attributes also include at least one of the liquid volume and the container wall thickness. By taking into account the liquid volume and the container wall thickness, the accuracy of identifying hazardous liquids can be further improved.

[0029] Another aspect of this application provides a computer program product including a computer program that causes a computer to perform the steps in any embodiment of the multimodal recognition method described above.

Attached Figure Description

[0030] Figure 1 is a flowchart schematically illustrating an embodiment of a multimodal identification method for hazardous liquids.

[0031] Figure 2 is a schematic diagram showing how image data is encoded to obtain visual vectors.

[0032] Figure 3 is a flowchart schematically illustrating the text data encoding process.

[0033] Figure 4 is a schematic diagram illustrating the text data encoding process.

[0034] Figure 5 is a flowchart illustrating an example of the process of training a multimodal model to identify hazardous liquids.

[0035] Figure 6 is a flowchart illustrating another embodiment of the multimodal identification method for hazardous liquids of this application.

[0036] Figure 7 is a flowchart illustrating an example of a process for identifying hazardous liquids using a trained multimodal model.

[0037] Figure 8 is a schematic diagram illustrating another embodiment of the multimodal identification method for hazardous liquids.

[0038] Figure 9 is a structural schematic diagram illustrating an embodiment of a multimodal identification device for hazardous liquids.

[0039] Figure 10 is a structural schematic diagram illustrating another embodiment of a multimodal identification device for hazardous liquids.

[0040] Figure 11 is a schematic diagram showing an example of the text vector encoding unit of a multimodal recognition device.

[0041] Figure 12 is a schematic diagram showing an example of the training unit of a multimodal recognition device.

[0042] Figure 13 is a schematic diagram showing an example of the recognition unit of a multimodal recognition device.

[0043] Figure 14 shows a schematic diagram of the structure of an electronic device according to this application.

[0044] Symbol Explanation: 10: Data Generation Unit; 20: Visual Vector Encoding Unit; 30: Text Vector Encoding Unit; 40: Training Unit; 50: Recognition Unit; 31: Keyword Acquisition Module; 32: Text Vector Generation Module; 33: Text Vector Library Storage Module; 41: Multimodal Fusion Alignment Network Design Module; 42: Loss Function Construction Module; 43: Model Training Module; 51: Similarity Calculation Module; 52: Similarity Judgment Module; 53: Recognition Result Output Module; 801: Processor; 802: Memory; 803: Communication Interface; 810: Bus.

Detailed Implementation

[0045] Exemplary embodiments or examples of this application will now be described in more detail with reference to the accompanying drawings. While exemplary embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments or examples set forth herein. Rather, these embodiments or examples are provided to enable a clearer understanding of this application.

[0046] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that terms used in this way can be interchanged where appropriate so that the embodiments or examples of this application described herein can be implemented in a sequence other than that illustrated or described. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover a non-exclusive inclusion: for example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to those explicitly listed, but may include other steps or units not explicitly listed. Identical or similar reference numerals throughout the document denote constituent elements having the same or similar functions.

[0047] In existing technologies, the impact of different container materials and liquid compositions on CT scan images is not taken into account, which can easily lead to false positives and false negatives in the detection of dangerous liquids in luggage, thus affecting the accuracy of dangerous liquid identification.

[0048] This application provides a multimodal identification method, apparatus, and computer program product for hazardous liquids that takes into account the influence of container material, liquid composition, etc. in CT scan images, thereby improving the accuracy of hazardous liquid identification.

[0049] One embodiment of this application provides a multimodal identification method for hazardous liquids.

[0050] Figure 1 is a flowchart schematically illustrating an embodiment of a multimodal identification method for hazardous liquids.

[0051] The following, with reference to Figure 1, details the process of the multimodal identification method for hazardous liquids in this application.

[0052] In step S10, textual data describing the properties of the hazardous liquid is generated from the CT imaging data. These properties include at least the liquid composition and the container material.

[0053] The attributes include at least the liquid composition and container material because the inventors of this application have found through research and experiments that these two attributes have the most significant impact on image recognition of CT scan images. The liquid composition can also be expressed as atomic number, density, etc.

[0054] Specifically, regarding container material, the liquid contained in each type of container will exhibit different characteristics in CT image data, and there is a unique relationship between this material and the characteristics of the CT image data. For example, gasoline in a glass container and gasoline in a plastic container will have different characteristic values in CT scan image data. Similarly, regarding liquid composition, each liquid composition will exhibit different characteristics in CT image data, and there is a unique relationship between this liquid composition and the characteristics of the CT image data.

[0055] This application utilizes this principle to design a multimodal recognition method and device that takes into account the liquid composition and container material.

[0056] The inventors of this application discovered through further research that other attributes can also affect image recognition results. Therefore, as attributes, in addition to the liquid composition and container material mentioned above, at least one of the liquid volume and container wall thickness can also be included.

[0057] Different attributes can also be referred to as different dimensions.

[0058] For example, when the attributes include liquid composition, container material, and liquid volume, the text data could be "300ml plastic bottle of gasoline".

[0059] In some embodiments, three-dimensional image data of CT scans related to liquid composition and container material can be acquired through uniform sampling. Then, the three-dimensional CT image data is cropped to obtain individual image data. For each image data, text data describing the hazardous liquid properties in the image is generated. For example, multimodal annotation can be used to annotate the aforementioned multiple properties, and text data is generated based on the annotated text.

[0060] In some embodiments, each image can also be paired with text data to form an image-text pair dataset.
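The generation of text data from annotated attributes and the pairing of each image with its text can be sketched as follows. This is only an illustration under stated assumptions: the `Annotation` record, its field names, the image identifiers, and the "bottle of" phrasing template are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    """Hypothetical per-image annotation; the field names are assumptions."""
    image_id: str
    liquid_composition: str
    container_material: str
    liquid_volume: str = ""  # optional attribute

def to_text(a: Annotation) -> str:
    """Compose a description such as '300ml plastic bottle of gasoline'."""
    parts = [a.liquid_volume, a.container_material, "bottle of", a.liquid_composition]
    return " ".join(p for p in parts if p)

annotations = [
    Annotation("scan_001", "gasoline", "plastic", "300ml"),
    Annotation("scan_002", "sulfuric acid", "glass"),
]

# Image-text pair dataset: one (image identifier, text data) tuple per image.
pairs = [(a.image_id, to_text(a)) for a in annotations]
for image_id, text in pairs:
    print(image_id, "->", text)
```

Optional attributes that were not annotated are skipped rather than filled with a placeholder, so the text only states what the annotation actually records.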

[0061] In step S20, an image encoder is used to encode the image data to obtain a visual vector.

[0062] In some embodiments, an image encoder can be designed to extract core visual structures from CT images. The image encoder may include a three-dimensional visual detection network to achieve accurate recognition of different container geometric features. Alternatively, existing image encoders may be used; this application is not limited in this respect.

[0063] In some embodiments, as shown in Figure 2, a three-dimensional vision detection network can be used to process CT image data, obtain container regions in each image, and generate a visual vector for each container region. By generating a visual vector for each container region, it is possible to handle situations where the same image contains multiple containers.
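The idea of producing one visual vector per container region can be sketched with a deliberately simple stand-in. The example assumes a two-dimensional grid of attenuation-like values, bounding boxes that are already known, and a stub encoder that summarises each region as [mean, min, max]. In the application the regions come from a three-dimensional vision detection network and the encoder is a learned model, so `crop`, `encode_region`, and `visual_vectors` are hypothetical placeholders.

```python
def crop(image, box):
    """Return the sub-grid inside box = (row0, row1, col0, col1), end-exclusive."""
    r0, r1, c0, c1 = box
    return [row[c0:c1] for row in image[r0:r1]]

def encode_region(region):
    """Stand-in encoder: summarise a region as [mean, min, max] of its values."""
    values = [v for row in region for v in row]
    return [sum(values) / len(values), min(values), max(values)]

def visual_vectors(image, boxes):
    """Generate one visual vector per container region in the same image."""
    return [encode_region(crop(image, box)) for box in boxes]

# Toy image containing two containers with different value ranges.
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 2, 2, 0, 5, 5],
    [0, 2, 4, 0, 5, 7],
    [0, 0, 0, 0, 0, 0],
]
boxes = [(1, 3, 1, 3), (1, 3, 4, 6)]  # hypothetical detector output

vectors = visual_vectors(image, boxes)
print(vectors)  # [[2.5, 2, 4], [5.5, 5, 7]]
```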

[0064] Optionally, each container region can be transformed into a high-dimensional visual vector. High-dimensional visual vectors can carry richer information, which is beneficial for learning the correlation between visual vectors and text vectors.

[0065] In step S30, a text encoder is used to encode each attribute in the text data to obtain a text vector, and the text vector is stored in a text vector library.

[0066] In some embodiments, a text encoder can be designed to understand and encode text information. The text encoder can employ a natural language model with a text Transformer architecture. Alternatively, existing text encoders can be used, and other natural language models can also be employed in these text encoders; this application is not limited in this respect.

[0067] Figure 3 is a flowchart schematically illustrating the text data encoding process.

[0068] As shown in Figure 3, step S30 can specifically include steps S31, S32, and S33.

[0069] In step S31, the text data is segmented to obtain keywords with different attributes.

[0070] In some embodiments, text data in an image-text pair dataset can be segmented to obtain keywords in different dimensions.

[0071] For example, “500ml plastic bottle gasoline” can be segmented into keywords [gasoline, plastic, 500ml, ...].

[0072] For example, different attributes (dimensions) and the keywords corresponding to each attribute can include: [gasoline, plastic, 500ml, ...], [sulfuric acid, glass, 300ml, ...], [water, plastic, 1L, ...].

[0073] In this case, the following keywords can be extracted:

[0074] The keywords for the liquid composition dimension are: [gasoline], [sulfuric acid], [water];

[0075] The keywords for the container material dimension are: [plastic], [glass], [plastic];

[0076] The keywords for the liquid volume dimension are: [500ml], [300ml], [1L].

[0077] By segmenting the text data, keywords with different attributes such as liquid composition, container material, and liquid volume were obtained.
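The segmentation of text data into per-attribute keywords can be sketched as below. This is a minimal illustration that assumes a small fixed vocabulary for liquid composition and container material and a regular expression for volumes; this application does not specify the segmentation technique, and `LIQUIDS`, `MATERIALS`, `VOLUME_PATTERN`, and `segment` are hypothetical names.

```python
import re

# Hypothetical vocabularies; a real system would use a much larger lexicon.
LIQUIDS = ["sulfuric acid", "gasoline", "water"]
MATERIALS = ["plastic", "glass", "ceramic"]
VOLUME_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s?(?:ml|l)\b", re.IGNORECASE)

def segment(text):
    """Split a description into one keyword per attribute (None if absent)."""
    lowered = text.lower()
    volume = VOLUME_PATTERN.search(text)
    return {
        "liquid composition": next((w for w in LIQUIDS if w in lowered), None),
        "container material": next((w for w in MATERIALS if w in lowered), None),
        "liquid volume": volume.group(0) if volume else None,
    }

texts = [
    "500ml plastic bottle gasoline",
    "300ml glass bottle sulfuric acid",
    "1L plastic bottle water",
]
keywords = [segment(t) for t in texts]

# Collect the keywords of one dimension across all descriptions.
compositions = [k["liquid composition"] for k in keywords]
print(compositions)  # ['gasoline', 'sulfuric acid', 'water']
```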

[0078] In step S32, the keywords are encoded using a text encoder, generating a text vector corresponding to each keyword.

[0079] As an application example of the above steps S31 and S32, as shown in Figure 4, the phrase "300ml of plastic-packaged gasoline" can be segmented to obtain keywords such as "gasoline", "plastic", and "300ml". The above keywords such as "gasoline", "plastic", and "300ml" can be encoded using a text encoder, and each keyword generates a corresponding text vector.

[0080] By segmenting text data to obtain keywords with different attributes, and encoding the keywords to generate a corresponding text vector, we can obtain an accurate text vector corresponding to each attribute.

[0081] In step S33, the text vector is stored in the text vector library.

[0082] In a text vector library, keywords can be grouped according to different attributes, and text vectors belonging to the same attribute can be stored as a collection.

[0083] For example, keywords representing liquid components, such as "gasoline," "sulfuric acid," and "water," can be grouped together and stored as a set.

[0084] As an example, a collection of text vectors describing the liquid composition: Embed_{liquid composition} [gasoline, kerosene, sulfuric acid, hydrochloric acid, water, beverages, ...]

[0085] For example, keywords representing container materials such as "glass," "plastic," and "ceramic" can be grouped together and stored as a collection.

[0086] As an example, a collection of text vectors describing the container material: Embed_{container material} [glass, plastic, ceramic, metal, rubber, wood, ...]

[0087] In addition to the liquid composition and container material mentioned above, keywords representing liquid volume can also be grouped together and stored as a set.

[0088] As an example, a collection of text vectors describing the liquid volume: Embed_{liquid volume} [100ml, 200ml, 500ml, 1L, 2L, 3L, ...]

[0089] The text vector library can be pre-built or built during the training of the multimodal model; this application does not impose any limitations on this. Furthermore, the text vector library can be updated during the training of the multimodal model and during the recognition of the image data to be tested, so that through continuous learning it comes to cover text for more attributes.

[0090] Grouping keywords according to different attributes and storing text vectors belonging to the same attribute as a set facilitates the management and retrieval of text vectors.
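A text vector library that groups keywords by attribute and stores each group as a set can be sketched as follows. The class `TextVectorLibrary`, its methods, and `toy_encoder` are hypothetical; in particular, `toy_encoder` is a placeholder built from character statistics and is not a learned text encoder.

```python
from collections import defaultdict

class TextVectorLibrary:
    """Hypothetical store that keeps text vectors grouped by attribute."""

    def __init__(self, encoder):
        self._encoder = encoder
        self._groups = defaultdict(dict)  # attribute -> {keyword: vector}

    def add(self, attribute, keyword):
        """Encode and store a keyword once; repeated keywords are not duplicated."""
        if keyword not in self._groups[attribute]:
            self._groups[attribute][keyword] = self._encoder(keyword)

    def vectors(self, attribute):
        """Return a copy of the {keyword: vector} set stored for one attribute."""
        return dict(self._groups.get(attribute, {}))

def toy_encoder(keyword):
    # Placeholder only: character statistics, not a learned embedding.
    return [float(len(keyword)), float(sum(map(ord, keyword)) % 97)]

library = TextVectorLibrary(toy_encoder)
for attribute, keyword in [
    ("liquid composition", "gasoline"),
    ("liquid composition", "sulfuric acid"),
    ("container material", "plastic"),
    ("container material", "glass"),
    ("container material", "plastic"),  # duplicate, stored once
    ("liquid volume", "500ml"),
]:
    library.add(attribute, keyword)

print(sorted(library.vectors("container material")))  # ['glass', 'plastic']
```

Because `add` is incremental, the same structure can be filled in advance or extended while training or recognition is running, which matches the update behaviour described in [0089].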

[0091] Returning to Figure 1, in step S40, for the image data used for training, a multimodal model for recognizing hazardous liquids is trained using the visual vectors and text vectors.

[0092] Figure 5 is a flowchart illustrating an example of the process of training a multimodal model to identify hazardous liquids.

[0093] As shown in Figure 5, the specific process of training a multimodal model may include:

[0094] In step S41, a multimodal fusion alignment network is designed to map visual vectors and text vectors to a common multidimensional feature space.

[0095] In some embodiments, by designing a multimodal fusion alignment network, the visual vectors generated by the image encoder and the text vectors generated by the text encoder can be mapped to a common multidimensional feature space, thereby enabling the matching and correspondence between visual vectors and text vectors.
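
The mapping to a common feature space can be sketched as follows, assuming one linear projection per modality followed by L2 normalization. This form is an assumption made for illustration; the application does not specify the structure of the multimodal fusion alignment network, and the matrices `W_IMAGE` and `W_TEXT` and all dimensions are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(matrix: Matrix, vector: Vector) -> Vector:
    """Linear map: one output component per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def l2_normalize(vector: Vector) -> Vector:
    """Scale to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def to_common_space(matrix: Matrix, vector: Vector) -> Vector:
    return l2_normalize(project(matrix, vector))


# Hypothetical sizes: 3-d visual vector, 4-d text vector, 2-d common space.
W_IMAGE: Matrix = [[0.5, 0.0, 0.5], [0.0, 1.0, 0.0]]
W_TEXT: Matrix = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.5, 0.0]]

visual_common = to_common_space(W_IMAGE, [1.0, 2.0, 3.0])
text_common = to_common_space(W_TEXT, [2.0, 1.0, 3.0, 0.0])

# Once both vectors live in the same space, a dot product compares them.
similarity = sum(a * b for a, b in zip(visual_common, text_common))
```

In a trained network the projection weights would be learned rather than fixed as they are here.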

[0096] In step S42, an image-text contrast loss function is constructed to optimize the cross-modal alignment of visual and text vectors.

[0097] By constructing an image-text contrast loss function, we can further optimize the cross-modal alignment of visual and text vectors, and effectively achieve image-text pairing learning.

[0098] The dimension of the loss function is related to the number of attributes. That is, with three attributes, a three-dimensional loss function can be constructed, and the weight parameters of the three dimensions are determined through continuous learning.

[0099] By considering the number of attributes in the construction of the loss function, the relationship between each attribute in the text vector and the visual vector can be reflected. This allows for a comprehensive consideration of each attribute to obtain the recognition result, thereby improving the accuracy of hazardous liquid recognition.
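
One possible way to give the loss one dimension per attribute is sketched below. It assumes a softmax cross-entropy (contrastive) term computed separately over each attribute's set of text vectors, combined by a weighted sum. The function names, the temperature value, and the fixed weights are assumptions for illustration; in the application the weights are determined through learning, and the exact form of the loss is not specified.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def attribute_loss(visual: Vector, candidates: List[Vector],
                   target: int, temperature: float = 0.07) -> float:
    """Softmax cross-entropy over the text vectors of one attribute."""
    logits = [dot(visual, c) / temperature for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[target]


def total_loss(visual: Vector, candidates: Dict[str, List[Vector]],
               targets: Dict[str, int], weights: Dict[str, float]) -> float:
    """Weighted sum with one term (dimension) per attribute."""
    return sum(weights[a] * attribute_loss(visual, candidates[a], targets[a])
               for a in candidates)


# Hypothetical toy data: three attributes, two candidate keywords each.
attributes = ["liquid composition", "container material", "liquid volume"]
candidates = {a: [[1.0, 0.0], [0.0, 1.0]] for a in attributes}
weights = {a: 1.0 / 3.0 for a in attributes}  # fixed here, learned in practice
visual = [1.0, 0.0]

loss_correct = total_loss(visual, candidates, {a: 0 for a in attributes}, weights)
loss_wrong = total_loss(visual, candidates, {a: 1 for a in attributes}, weights)
```

With this toy data the loss is small when every attribute's target is the text vector aligned with the visual vector, and large when the targets are the misaligned ones.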

[0100] In step S43, the multimodal fusion alignment network is trained using a loss function to obtain a multimodal model.

[0101] In this application, the multimodal model is trained by designing a multimodal fusion alignment network, constructing an image-text contrast loss function, and training the multimodal fusion alignment network using the loss function, thereby obtaining a model that matches visual vectors with text vectors.

[0102] In addition, because the text vectors cover multiple attributes of the hazardous liquid, such as liquid composition and container material, and these attributes are also reflected in the construction of the loss function, the resulting multimodal model takes these attributes into account and therefore achieves higher recognition accuracy.

[0103] In addition, in some embodiments, during the training of the multimodal model, the text vector for each attribute is independently associated with the visual vector.

[0104] Specifically, the text attribute of each dimension is associated with image features through learning, and the models that learn the associations for different dimensions are independent of one another. When identifying a hazardous liquid, its characteristics are described along different dimensions that do not interfere with one another, so a more accurate judgment can be made by considering the dimensions together.
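
The independence between attributes can be illustrated with the following sketch, which assumes that each attribute has its own projection head applied to the visual vector. The per-attribute head is a hypothetical design choice introduced here to show the idea; the application does not state how independence is implemented.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def apply_head(head: Matrix, visual: Vector) -> Vector:
    """Project the visual vector with one attribute's own head."""
    return [sum(w * x for w, x in zip(row, visual)) for row in head]


def attribute_scores(visual: Vector, heads: Dict[str, Matrix],
                     library: Dict[str, Dict[str, Vector]]
                     ) -> Dict[str, Dict[str, float]]:
    """Score every keyword, using only the head of its own attribute."""
    scores: Dict[str, Dict[str, float]] = {}
    for attribute, head in heads.items():
        projected = apply_head(head, visual)
        scores[attribute] = {
            keyword: sum(p * t for p, t in zip(projected, vector))
            for keyword, vector in library[attribute].items()
        }
    return scores


# Hypothetical toy data.
library = {
    "liquid composition": {"gasoline": [1.0, 0.0], "water": [0.0, 1.0]},
    "container material": {"glass": [1.0, 0.0], "plastic": [0.0, 1.0]},
}
heads = {
    "liquid composition": [[1.0, 0.0], [0.0, 1.0]],
    "container material": [[1.0, 0.0], [0.0, 1.0]],
}
visual = [0.8, 0.2]
before = attribute_scores(visual, heads, library)

# Changing one attribute's head leaves the other attribute's scores untouched.
heads["container material"] = [[0.0, 1.0], [1.0, 0.0]]
after = attribute_scores(visual, heads, library)
```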

[0105] Once trained, the multimodal model can also be used to identify hazardous liquids.

[0106] In this embodiment, both visual vectors and text vectors are used to train the multimodal model, and the attributes of the text vectors include at least the liquid composition and the container material. Hazardous liquids can therefore be identified while taking these attributes into account, which improves the accuracy of hazardous liquid identification.

[0107] Figure 6 is a flowchart illustrating another embodiment of the multimodal identification method for hazardous liquids according to this application. In Figure 6, in addition to steps S10 to S40 included in Figure 1, step S50 is also included, which involves identifying hazardous liquids using a trained multimodal model.

[0108] In step S50, for the image data to be tested, the similarity between the visual vector and the text vector is calculated based on the trained multimodal model, and a hazardous liquid identification result containing descriptive text is generated based on the similarity judgment.

[0109] Figure 7 is a flowchart illustrating an example of a process for identifying hazardous liquids using a trained multimodal model.

[0110] As shown in Figure 7, the specific process of identifying hazardous liquids using the trained multimodal model can include steps S51 to S53.

[0111] In step S51, the similarity between the visual vector and the text vectors of each attribute in the text vector library is calculated.

[0112] In some embodiments, as shown in Figure 7, visual vectors and text vectors can be input into a trained multimodal model, and the multimodal model can be used to calculate the similarity between the visual vectors and the text vectors of each attribute in the text vector library.

[0113] In step S52, the similarity between visual vectors and text vectors is determined.

[0114] In some embodiments, similarity can be determined for the visual vector of each container region of each image and the text vector of each keyword.

[0115] In some embodiments, the specific process for similarity determination may include:

[0116] Set a preset similarity threshold;

[0117] When the similarity calculated in step S51 is greater than the preset similarity threshold, the corresponding keyword, i.e., the corresponding text vector, is recorded as a valid match for the visual vector.

[0118] In each set of text vectors for the same attribute, at most one valid match is retained. That is, each attribute has at most one validly matched text vector.

[0119] If there is no valid match in the text vector set of a certain attribute, the keyword of the attribute corresponding to that set can be omitted when generating the description text later.

[0120] In step S53, based on the similarity judgment result, the recognition result containing the descriptive text is output.

[0121] In some embodiments, the description text is generated by combining, across the attribute sets, the keywords that correspond to the validly matched text vectors.

[0122] The identification results of hazardous liquids usually also include the spatial location of the hazardous liquid. Therefore, it can be understood that the above identification results include the spatial location of the hazardous liquid and descriptive text related to the hazardous liquid.

[0123] By setting thresholds and determining the valid matching text vectors for each attribute, the descriptive text contains keywords corresponding to the valid matching text vectors, thus generating accurate and rich recognition results.
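
Steps S51 to S53 can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and, where several keywords of one attribute exceed the threshold, retains the one with the highest similarity. Both choices are assumptions, since the application states only that at most one valid match is retained per attribute. The function names, the example vectors, and the threshold value are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def match_attributes(visual: Vector, library: Dict[str, Dict[str, Vector]],
                     threshold: float) -> Dict[str, str]:
    """S51 and S52: keep at most one valid match per attribute."""
    matches: Dict[str, str] = {}
    for attribute, keyword_vectors in library.items():
        best_keyword, best_score = None, threshold
        for keyword, vector in keyword_vectors.items():
            score = cosine(visual, vector)
            if score > best_score:  # must exceed the preset threshold
                best_keyword, best_score = keyword, score
        if best_keyword is not None:
            matches[attribute] = best_keyword
    return matches


def describe(matches: Dict[str, str]) -> str:
    """S53: attributes without a valid match are simply omitted."""
    return ", ".join(f"{attribute}: {keyword}"
                     for attribute, keyword in matches.items())


# Hypothetical text vector library and visual vector of one container region.
library = {
    "liquid composition": {"gasoline": [1.0, 0.0, 0.0], "water": [0.0, 1.0, 0.0]},
    "container material": {"glass": [0.9, 0.1, 0.0], "plastic": [0.0, 0.0, 1.0]},
    "liquid volume": {"500ml": [0.0, 0.0, 1.0]},
}
visual = [1.0, 0.05, 0.0]

matches = match_attributes(visual, library, threshold=0.8)
description = describe(matches)
```

In this toy run no liquid volume keyword exceeds the threshold, so the description text mentions only the liquid composition and the container material.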

[0124] Figure 8 shows a schematic diagram of another embodiment of the multimodal identification method for hazardous liquids.

[0125] Figure 8 illustrates the entire process: text data is generated; text vectors and visual vectors are obtained by text encoding of the text data and image encoding of the CT image data, respectively; and the text vectors and visual vectors are input into the trained multimodal recognition model to obtain the hazardous liquid recognition result.

[0126] In this application, because recognition is based on the correlation between visual vectors and text vectors, hazardous liquids can be identified while taking into account the liquid composition, container material, and other attributes conveyed by the text vectors, which yields more accurate recognition results. Furthermore, since the recognition results include descriptive text containing keywords that match the attributes of each container in the image data, the liquid composition, container material, and other attributes of the hazardous liquid can be intuitively determined.

[0127] In addition, the identification results mentioned above can also be sent to baggage inspection systems for subsequent comprehensive inspection.

[0128] Another embodiment of this application provides a multimodal identification device for hazardous liquids.

[0129] Figure 9 is a structural schematic diagram illustrating an embodiment of a multimodal identification device for hazardous liquids.

[0130] As shown in Figure 9, the multimodal identification device for hazardous liquids may include: a data generation unit 10, a visual vector encoding unit 20, a text vector encoding unit 30, and a training unit 40.

[0131] The data generation unit 10 generates text data describing the properties of a hazardous liquid from CT imaging image data, the properties including at least the liquid composition and container material.

[0132] The visual vector encoding unit 20 uses an image encoder to encode the image data to obtain a visual vector.

[0133] The visual vector encoding unit 20 can acquire container regions in each image of the image data and generate visual vectors for each container region.

[0134] The text vector encoding unit 30 uses a text encoder to encode each attribute in the text data to obtain a text vector, and stores the text vector in a text vector library.

[0135] The training unit 40 trains a multimodal model for recognizing hazardous liquids using the visual vectors and text vectors for the image data used in the training process.

[0136] Figure 10 is a schematic diagram illustrating another embodiment of a multimodal identification device for hazardous liquids. In addition to the data generation unit 10, visual vector encoding unit 20, text vector encoding unit 30, and training unit 40 shown in Figure 9, the multimodal identification device for hazardous liquids shown in Figure 10 also includes a recognition unit 50.

[0137] For the image data to be tested, the recognition unit 50 calculates the similarity between the visual vector and the text vector based on the trained multimodal model, and generates a hazardous liquid recognition result containing descriptive text based on the similarity judgment.

[0138] The data generation unit 10, visual vector encoding unit 20, text vector encoding unit 30, training unit 40, and recognition unit 50 described above perform the processing in steps S10 to S50 of the above-described multimodal recognition method for hazardous liquids. Therefore, for details and technical effects, please refer to the above-described implementation method for multimodal recognition.

[0139] Figure 11 is a schematic diagram showing an example of the text vector encoding unit 30.

[0140] As shown in Figure 11, the text vector encoding unit 30 may include: a keyword acquisition module 31, a text vector generation module 32, and a text vector library storage module 33.

[0141] The keyword acquisition module 31 performs word segmentation on the text data to obtain keywords with different attributes.

[0142] The text vector generation module 32 uses the text encoder to encode the keywords, generating a text vector corresponding to each keyword.

[0143] The text vector library storage module 33 groups keywords according to different attributes in the text vector library and stores text vectors belonging to the same attribute as a set.
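
The cooperation of modules 31 to 33 can be sketched as a small pipeline. The sketch assumes, purely for illustration, that the text data takes the form of "attribute: keyword" pairs separated by semicolons, and uses a caller-supplied `encoder` function in place of the text encoder. The text format and all function names are hypothetical and are not taken from the application.

```python
from typing import Callable, Dict, List

Vector = List[float]
Library = Dict[str, Dict[str, Vector]]


def acquire_keywords(text: str) -> Dict[str, str]:
    """Module 31 stand-in: split the text into attribute/keyword pairs."""
    pairs: Dict[str, str] = {}
    for part in text.split(";"):
        if ":" in part:
            attribute, keyword = part.split(":", 1)
            pairs[attribute.strip()] = keyword.strip()
    return pairs


def encode_and_store(text: str, encoder: Callable[[str], Vector],
                     library: Library) -> Library:
    """Modules 32 and 33 stand-in: encode each keyword, store by attribute."""
    for attribute, keyword in acquire_keywords(text).items():
        library.setdefault(attribute, {})[keyword] = encoder(keyword)
    return library


def length_encoder(keyword: str) -> Vector:
    """Placeholder encoder for the example; not a real text encoder."""
    return [float(len(keyword))]


library: Library = {}
encode_and_store("liquid composition: gasoline; container material: glass",
                 length_encoder, library)
encode_and_store("liquid composition: water; container material: glass",
                 length_encoder, library)
```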

[0144] The keyword acquisition module 31, text vector generation module 32, and text vector library storage module 33 described above perform the processing in steps S31 to S33 of the above-mentioned multimodal identification method for hazardous liquids. Therefore, for details and technical effects, please refer to the above-described implementation of the multimodal identification method.

[0145] Figure 12 is a structural schematic diagram showing an example of the training unit 40.

[0146] As shown in Figure 12, the training unit 40 may include: a multimodal fusion alignment network design module 41, a loss function construction module 42, and a model training module 43.

[0147] The multimodal fusion alignment network design module 41 designs a multimodal fusion alignment network that maps the visual vector and the text vector to a common multidimensional feature space.

[0148] The loss function construction module 42 constructs an image-text contrast loss function to optimize the cross-modal alignment of the visual vector and the text vector. The dimension of the loss function is related to the number of attributes.

[0149] The model training module 43 trains the multimodal fusion alignment network using a loss function to obtain a multimodal model.

[0150] The training unit 40 also independently learns the association between the text vector of each attribute and the visual vector.

[0151] The multimodal fusion alignment network design module 41, loss function construction module 42, and model training module 43 described above perform the processing in steps S41 to S43 of the multimodal identification method for hazardous liquids described above. Therefore, for details and technical effects, please refer to the above-described implementation of the multimodal identification method.

[0152] Figure 13 is a structural schematic diagram showing an example of the recognition unit 50.

[0153] As shown in Figure 13, the recognition unit 50 may further include: a similarity calculation module 51, a similarity judgment module 52, and a recognition result output module 53.

[0154] The similarity calculation module 51 calculates the similarity between the visual vector and the text vectors of each attribute in the text vector library.

[0155] The similarity judgment module 52 performs a similarity judgment between the visual vector and the text vector.

[0156] The recognition result output module 53 outputs the recognition result containing descriptive text based on the similarity judgment result.

[0157] The similarity judgment module 52 may further include:

[0158] The threshold setting unit allows setting a preset similarity threshold; and

[0159] The effective matching determination unit records the text vector of the attribute as a valid match if the similarity is greater than a preset similarity threshold.

[0160] The descriptive text contains keywords that correspond to the validly matched text vectors.

[0161] The similarity calculation module 51, similarity judgment module 52, and recognition result output module 53 described above perform the processing in steps S51 to S53 of the above-described multimodal recognition method for hazardous liquids. Therefore, for details and technical effects, please refer to the above-described implementation of the multimodal recognition method.

[0162] One embodiment of this application may also provide an electronic device.

[0163] Figure 14 shows a schematic diagram of the structure of an electronic device according to this application. As shown in Figure 14, the electronic device may include a processor 801 and a memory 802 storing computer programs or instructions.

[0164] Specifically, the processor 801 may include a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits that can be configured to implement the embodiments of this application.

[0165] Memory 802 may include mass storage for data or instructions. By way of example and not limitation, memory 802 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 802 may include removable or non-removable (or fixed) media. Where appropriate, memory 802 may be internal or external to the electronic device. In a particular embodiment, memory 802 is non-volatile solid-state memory. In a particular embodiment, memory 802 includes read-only memory (ROM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.

[0166] The processor 801 reads and executes computer program instructions stored in the memory 802 to implement any of the multimodal identification methods for hazardous liquids in the above embodiments.

[0167] In one example, the electronic device may also include a communication interface 803 and a bus 810. As shown in Figure 14, the processor 801, memory 802, and communication interface 803 are connected via bus 810 and communicate with each other.

[0168] The communication interface 803 is mainly used to realize communication between the various modules, apparatuses, units, and/or devices in the embodiments of this application.

[0169] Bus 810 includes hardware, software, or both, that couples components of an electronic device together. By way of example and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 810 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, this application contemplates any suitable bus or interconnect.

[0170] The electronic device can execute the multimodal identification method for hazardous liquids of this application, thereby realizing the multimodal identification device for hazardous liquids of this application.

[0171] In addition, in conjunction with the above-described multimodal identification method for hazardous liquids, this application can also provide a readable storage medium for implementation. This readable storage medium stores program instructions; when executed by a processor, these program instructions implement any of the multimodal identification methods for hazardous liquids described in the above embodiments.

[0172] In conjunction with the above-described multimodal identification method for hazardous liquids, this application can also provide a computer program product for implementation. This computer program product includes a computer program whose instructions, when executed by a processor, implement any of the multimodal identification methods for hazardous liquids described in the above embodiments.

[0173] It should be clarified that this application is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of this application is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of this application.

[0174] The functional blocks shown in the above-described structural diagram can be implemented as hardware, software, firmware, or a combination thereof. When implemented in hardware, they can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this application are programs or code segments used to perform the required tasks. Programs or code segments can be stored on a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried on a carrier wave. "Machine-readable medium" can include any medium capable of storing or transmitting information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, etc. Code segments can be downloaded via computer networks such as the Internet, intranets, etc.

[0175] It should also be noted that the exemplary embodiments mentioned in this application describe methods or apparatuses based on a series of steps or devices. However, this application is not limited to the order of the above steps; that is, the steps can be performed in the order mentioned in the embodiments, or in a different order, or several steps can be performed simultaneously.

[0176] The above description is merely a specific implementation of this application. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, for the specific working processes of the devices, modules, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

[0177] It should be understood that the scope of protection of this application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the scope of the technology disclosed in this application, and such modifications or substitutions should be covered within the scope of protection of this application.

Claims

1. A multimodal identification method for hazardous liquids, comprising: generating, for image data from CT imaging, text data describing attributes of a hazardous liquid, the attributes including at least a liquid composition and a container material; encoding the image data using an image encoder to obtain a visual vector; encoding each attribute in the text data using a text encoder to obtain a text vector, and storing the text vector in a text vector library; and, for image data used for training, training a multimodal model for recognizing hazardous liquids using the visual vector and the text vectors.

2. The multimodal recognition method according to claim 1, further comprising: for image data to be tested, calculating the similarity between the visual vector and the text vector based on the trained multimodal model, and generating, based on a similarity judgment, a hazardous liquid identification result containing descriptive text.

3. The multimodal recognition method according to claim 2, wherein calculating the similarity between the visual vector and the text vector based on the trained multimodal model and generating, based on the similarity judgment, a hazardous liquid identification result containing descriptive text comprises: calculating, using the multimodal model, the similarity between the visual vector and the text vectors of each attribute in the text vector library; performing a similarity judgment between the visual vector and the text vector; and outputting, based on the similarity judgment result, the recognition result containing the descriptive text.

4. The multimodal recognition method according to claim 3, wherein performing the similarity judgment between the visual vector and the text vector comprises: setting a preset similarity threshold; and, if the similarity is greater than the preset similarity threshold, recording the text vector of that attribute as a valid match, wherein the descriptive text contains keywords that correspond to the validly matched text vectors.

5. The multimodal recognition method according to claim 1 or 2, wherein, Encoding the image data using an image encoder to obtain visual vectors includes: Obtain the container region in each image of the image data, and generate the visual vector for each container region.

6. The multimodal recognition method according to claim 1 or 2, wherein, Using a text encoder, each attribute in the text data is encoded to obtain a text vector, including: The text data is segmented to obtain keywords with different attributes; and The keywords are encoded using the text encoder, generating a text vector corresponding to each keyword.

7. The multimodal recognition method according to claim 6, wherein, In the text vector library, keywords are grouped according to different attributes, and text vectors belonging to the same attribute are stored as a set.

8. The multimodal recognition method according to claim 1, wherein, Training a multimodal model for identifying hazardous liquids using the visual vectors and the text vectors includes: Design a multimodal fusion alignment network to map the visual vectors and the text vectors to a common multidimensional feature space; Construct an image-text contrast loss function to optimize the cross-modal alignment of the visual vector and the text vector; the dimension of the loss function is related to the number of attributes. The multimodal fusion alignment network is trained using a loss function to obtain a multimodal model.

9. The multimodal recognition method according to claim 1 or 8, wherein, Training a multimodal model for identifying hazardous liquids using the visual vectors and the text vectors includes: For each attribute of the text vector, the association between the text vector and the visual vector is learned independently.

10. The multimodal recognition method according to claim 1, wherein, The properties also include at least one of the following: liquid volume and container wall thickness.

11. A multimodal identification device for hazardous liquids, comprising: The data generation unit generates text data describing the properties of hazardous liquids from CT imaging image data, the properties including at least the liquid composition and container material; The visual vector encoding unit uses an image encoder to encode the image data to obtain visual vectors; A text vector encoding unit uses a text encoder to encode each attribute in the text data to obtain a text vector, and stores the text vector in a text vector library; and The training unit uses the visual vectors and text vectors to train a multimodal model for recognizing hazardous liquids based on the image data used for learning.

12. The multimodal recognition device according to claim 11, further comprising: a recognition unit that, for the image data to be tested, calculates the similarity between the visual vector and the text vector based on the trained multimodal model, and generates a hazardous liquid recognition result containing descriptive text based on the similarity judgment.

13. The multimodal recognition device according to claim 12, wherein, The identification unit includes: The similarity calculation module uses the multimodal model to calculate the similarity between the visual vector and the text vectors of each attribute in the text vector library. The similarity determination module performs a similarity determination between the visual vector and the text vector; and The recognition result output module outputs the recognition result, which includes descriptive text, based on the similarity judgment result.

14. The multimodal recognition device according to claim 13, wherein, The similarity determination module includes: The threshold setting unit allows setting a preset similarity threshold; and The effective matching determination unit determines that if the similarity is greater than a preset similarity threshold, the text vector of that attribute is recorded as a valid match. The descriptive text contains keywords that correspond to the validly matched text vectors.

15. The multimodal recognition device according to claim 11 or 12, wherein, The visual vector encoding unit acquires the container region in each image of the image data and generates the visual vector for each container region.

16. The multimodal recognition device according to claim 11 or 12, wherein, The text vector encoding unit includes: The keyword acquisition module performs word segmentation on the text data to obtain keywords with different attributes; and The text vector generation module uses the text encoder to encode the keywords, generating a text vector corresponding to each keyword.

17. The multimodal recognition device according to claim 16, wherein, The text vector encoding unit also includes a text vector library storage module. The text vector library storage module groups keywords according to different attributes in the text vector library and stores text vectors belonging to the same attribute as a set.

18. The multimodal recognition device according to claim 11, wherein, The training unit includes: The multimodal fusion alignment network design module designs a multimodal fusion alignment network that maps the visual vectors and the text vectors to a common multidimensional feature space. The loss function construction module constructs an image-text contrast loss function to optimize the cross-modal alignment of the visual vector and the text vector. The dimension of the loss function is related to the number of attributes. The model training module trains the multimodal fusion alignment network using a loss function to obtain a multimodal model.

19. The multimodal recognition device according to claim 11 or 18, wherein, The training unit independently learns the association between the text vector and the visual vector for each attribute.

20. The multimodal recognition device according to claim 11, wherein, The properties also include at least one of the following: liquid volume and container wall thickness.

21. A computer program product comprising a computer program that causes a computer to perform the steps of the multimodal identification method for hazardous liquids as described in any one of claims 1 to 10.